Hi, I'm Dave shade. Today we're going to talk about data management in the context of clinical trials. We're going to cover a fair number of definitions and core concepts. Then we'll talk briefly about different data management systems that are often or sometimes used in the field of clinical trials and other clinical research projects. We're going to talk at some length about the use of Excel or other spreadsheet programs as a specific example of data management techniques. Partly because there are very widely used, especially for smaller projects, and also because they help illustrate principles even as applied to much larger data management systems. We'll talk briefly about the incorporation of external data into a trial management system. We'll spend some time talking about data integrity, which will incorporate features of both data security and also data redundancy or data preservation. Starting with a few core definitions and general topics, we're going to briefly describe terms such as data collection instrument, records, variables. What is a database itself, a subset of database structural technologies referred to often as a relational database, will briefly discuss something called SQL. We'll discuss a database approach called a transaction and how it might apply to the use of a database in a trial management context. I will use the term data collection instrument, or sometimes I'll abbreviate it as a DCI that refers to perhaps the smallest unit of data collection. The term I'll use it interchangeably with the terms form or case report forms, sometimes abbreviated CRF or even ECRF or an electronic case report form. These are examples of terms to describe a data collection instrument. They could be referring to a paper-based hardcopy DCI or an electronic DCI. Certainly, projects might incorporate both modes of data collection into their trial data management system. We have a separate module in this course describing and discussing the design of data collection instruments. That's a separate module within the course. Many people in this field will use the terms record or observation or row interchangeably, each of them could refer to a single collection of data storage, for units from a single data collection instrument or a single entity. In a clinical trial, they might represent a collection of data elements from one DCI as administered at a single encounter. For example, an instrument that asks people about symptoms and blood pressure at each of a weekly visits might all be stored within independent rows within a data management structure. There would be one row for that collection of data at the first-week visit and another row for the same collection of data at a second-week visit. This is not the only way that data are stored, especially at the data storage phase. But this is most often the way that data will be extracted and presented to an analyst for analysis towards the end of a study or during a study for some interim reporting or some interim analysis. We'll talk briefly about the difference between data storage and data preparation. In many storage models, all the data for a single observation, again, sometimes referred to as a record or row, are stored actually in a structure that looks like a row. A great example of this would be the way data are stored in Excel. If you are using Excel or another spreadsheet for data storage, you would have all of the data regarding symptoms and blood pressure in a single row within that file. There'll be additional rows for either other patients or other visits or encounters for that same patient. Another corresponding trio of terms that are used interchangeably is variable or field or column. Each of these refers to a single data element within some collection of data. For example, a patient's ID or a patient's blood pressure as measured at one encounter. Some variables are measured more than once, often on different patients or participants in a study and sometimes the same participant at different time points. Even occasionally for duplicate measurements at the same time point. Blood pressure is an excellent example of that. Studies for which blood pressure is a very important measurement. Sometimes you measure the blood pressure twice, five or 10 minutes apart, and then figure out an analytic plan for how to make use of those duplicate measurements. Each time you measure the blood pressure, that might be a single variable. Or some studies might break the blood pressure into two variables. One for the systolic component and one for the diastolic component. In many database products, this single data element will be referred to as a field. That's the term most often used in the database world. Analysts and statisticians are probably more commonly going to refer to that as a variable, meaning the exact same thing. Depending on how the data are stored, often both of those terms might be referred to instead as a column because again, in a typical spreadsheet model where rows represent observations, each of the columns represents a single data element or a variable or a field. All those terms mean essentially the same thing usually when referring to data management in the trials contexts. A database itself is actually a fairly non-specific term and is, in my experience, often misused by clinicians that are working on trials. For the most part and generally it's used to refer to the collection of all the data that have been collected or incorporated from a single trial or project along with the software programs, the interfaces, the other tools that are used to characterize, manipulate, and report the data. Although sometimes people use the term database to refer just to the data collection and not any of the other tools or interfaces used to manipulate those data. Purposes of this lecture, it's not going to really matter very much. A particular approach to database structure is the term relational database. This is often used to describe a data organization approach in which collections of observations are organized together in something called a table, table is the term that's most often used to describe a collection of observations, and different tables are organized within the same database and are related to one another through something called linking variables. The most common linking variable might be a participant ID, and they're called relational databases because the tables are related to each other through these linkage variables. Because of structures built into the relational database model, it's easy to link the data, perform analysis or reporting of data based on these linked variables. An example might help make things a little clearer. Imagine a very simple trial measuring blood pressure at a beginning point prior to any intervention, and then an additional period, maybe three weeks after treatment with some new medication. At that first encounter, you might record things about the participant, their gender, their age, socioeconomic status, educational level, et cetera. You might record some symptoms. You might ask them what other medications they're taking, and you might measure a starting blood pressure. Then they come back in three weeks after taking the medication, and you might again record symptoms. You might ask them about any medical events that have occurred during the previous three weeks, and then you might again measure their blood pressure. If I were to organize these data, I'd create a table that contained that demographic information for every participant in the study and I'd create another table, maybe I would call it symptoms that contained all that symptom collecting data at both the baseline and at that follow up three-week visit. Create a third table to collect the medication data that were collected only at the baseline visit. Some people might think to put the medication data in the same table as the demographics table, since they're both collected only at baseline, and that would be fine. Some people prefer to keep tables smaller if possible, other people don't follow that same approach. I might create a table for the blood pressure measurements, which again would combine data from the baseline and that follow-up encounter, and I might create a separate table for the events that are reported to me, the medical events. I would have to make sure that every one of those tables contained at least one, if not more than one linkage variable, in this case, the participant ID. If I wanted to be able to report the change in blood pressure from the beginning to the end, I'd have to make sure that the data were properly identified in both of those encounters. If I wanted to combine that with demographic information which would be common, such as gender, then I'd have to make sure that that demographics table also contained that linkage variable, a participant ID. I probably would also want to make sure that the data set contains some variable for the visit ID so that in the blood pressure table or the symptoms table, I could properly distinguish which records or rows or observations contain the data pre-treatment, and which ones contain the data post treatment. If I were to make a diagram that described how these data were, I might have two tables as an example, on the left here I'm showing the demographics table, then there's a column marked ID, that might be the participant ID, and each of the other columns in a single row contains some of the demographic information for that participant. To the right, I might show an example of a blood pressure table. In this particular example, I'm imagining a study that has more than just one follow-up visit and the visits are in the second column that says visit, and maybe V01 represents that first baseline encounter. Then the other visits, like V02 and V03, maybe there visits that are a week apart or a month apart or maybe they're not equally spaced. I'd have to look at a study protocol to know what those visit ID is represented. But you can see there's a column that's marked ID, and it's the linkage back to the demographic table. That linkage is what makes this a relational structure. I'm relating two tables, one to the other. You can see here there's a column that's headed BP in the first row and that would be the blood pressure measurement at each of those different encounters and the visit column and marked by the ID for the participant ID. This relational structure would allow me to separate different collections of data into separate tables, but maintain the relationships between the tables. Hence, a relational database. Relational database products, actual pieces of software, are designed to support this relational structure. They're able to define those relationships and make it easier to combine the data according to those linkage variables. This became known as a relational database and is often thought of as being an important element of proper data management. I'm going to make the claim however, that structured relational databases are not really as critical as they once were. This is mostly due to the fact that relational database advantages were often to make smaller databases that could be accessed more quickly. Today, the concerns about space and speed are not as important as they used to be thanks to advances in computing technology, and all of us are frequently using very large databases that are not especially structured in the relational model, but obviously work just fine. The reason I bring this up, however, is that there are plenty in the trials community who think that using a relational database is necessary and that anyone not using a relational database and not in describing their study database as relational, is perhaps in competent or not knowing how to do things. I personally have received comments in grant application review that have referred to our study data management system as lacking something important since we didn't call it a relational database. Sometimes, we'll just describe our database as having relational components in order to satisfy that concern, and sometimes we'll try to engage in a little discussion. In the end, all data that are stored in any form of a separate structure or even in a similar structure, but with separate identifiers, have to be able to be related. Therefore, I think it's fair to say that those might be relational in nature, but I think database professionals might not describe all modern data structures as a classic relational database. These are probably arguments that are not worth very much unless the person you're discussing it with, really he thinks it's important in which case you might have to choose to agree. But it is important to understand the necessary linkages between different data elements that are stored separately or maybe not stored separately, but are still not in the same record or observation. That relational nature is very important. The term SQL, widely used in the data world refers to Structured Query Language. It's a language that has been used in the management and shaping of data for decades now, it's often used within products as you're interacting with them, even if you never knew that SQL was the language in the background. I'm not positive, but I would be willing to bet that there are some SQL at work, for example, when you search something in Google. I've given at the bottom of the screen here just a few very simple examples of SQL statements. The first example says select star from demographics. That's amongst a simple list of statements that is asking a database to return to me all of the fields, which means all of the variables, in a table that's named demographics. Just return all of those, every record or every observation. The second example, select ID, visited, BP, from blood pressure, is going to return just three identified fields, the fields being ID, visited, and BP from a different table called blood pressure. SQL statements can get very long and complex. They can be many lines, many pages, they can have lots of parentheses and complex looking structures. It's an extremely powerful language, well known and well used by database professionals. This is one of the primary tools used to manage databases themselves and perhaps a raw as to possible form. When you're interacting with a database and you click a button to generate a report, often, the clicking of the button is executing SQL statements in the background. There are lots of courses and materials available to learn more about SQL. At this level, I think it's important that people in the trials world that have something to do with data management, at least, be aware that there's this term SQL, that it has something to do with a computer programming language that specifically targets data. Beyond that, might not be necessary for everyone. It's a somewhat standard language, but different databases do implement SQL slightly differently, and therefore sometimes, the exact words that work for one database product have to be changed slightly in order to work with another database product. For the most part, the concepts of SQL transfer across different database products. It's just sometimes small differences in the terms used or how the statements are put together. An analogy for how SQL differences might be perceived would be similar to how loading a website in one browser versus a different browser might show the same website in slightly different ways, but not meaningfully different ways for the most part. Another term that's often an important in the database world is the term transaction. A transaction is a sequence of steps, all of which need to be completed in order for a desired change to work successfully. When working in the world of a transaction, it's important to remember that if along the way, during the sequence, one of the steps fails, for some reason, all the steps that have already been completed have to be reversed or else the database is left in an uncertain state. A classic example of the importance of a transaction is in the world of banking. If you imagined a person trying to transfer money from one account to another account, a sequence of steps might be forced to check the balance of the first account, then deduct the proper amount from that first account, then access the second account, then add the balance to that second account. Well, if the deduction successfully completed from the first account but for some reason it couldn't be added to the second account, there would be many unhappy people if the deduction was not reversed, since the money would have just disappeared. Very similar things can happen in the world of trials. A perfect example might be a trial data management system that incorporates functions to randomly assign participants to one of several treatment groups. Perhaps, the process for that involves first obtaining the next treatment to be assigned, then marking that treatment as having been assigned, then going to a separate table and marking the proper treatment for the participant who is to have just been assigned. If the treatment assignment table is marked as used, but something goes wrong with marking the participants table as having been assigned, it would be a problem if that could not be on done or at least wasn't accounted for. The problems that can make one of these steps fails can be as simple as incorrect permission assignments. Maybe the program is not set up correctly, it has permission to obtain the next assignment, but it somehow doesn't have permission to update the participant table, or maybe there's a program error, or maybe there's a hardware error and the hard disk that stores the data for the participant table has a problem at just the wrong time, or even sometimes there can be collision errors. These are things that happen when multiple users try to access the same piece of data at the same time. These are all just examples of transactions. Some sophisticated database products understand what a transaction is when you ask it to complete a sequence of steps that are marked as a transaction, they keep track of each one of those steps. They keep track of the state of the database before the step was attempted and after the step was attempted, and if a later step is not completed successfully, the product actually knows how to reverse each of the intervening steps. Some trials will use these sophisticated products and therefore those transactions if packaged properly, for example, for a treatment assignment in a randomized trial, then the database product can help manage that part of it. My experience has been that most trials are not using that database product, and therefore, the users, and the programmers, and the coordinators have to remember that there's sometimes a sequence of steps and that if one of the steps fails, there may have been a problem in reversing some of the previous steps, and this can often present some problems if not remembered and corrected manually. A little bit more of an overview about trial data management. One thing important to remember, this is an extremely complex subjects. In the School of Public Health at Johns Hopkins, we teach a completely separate course just devoted to trial data management. A single lecture is going to provide only an introduction and some background, but probably not the full breadth of the topic. The goals of having an introductory talk, however, are quite important and we want to help people working in the world of trials make good design decisions and to avoid bad design decisions, and most importantly, to avoid major mistakes, and also to help people working on trials communicate effectively with computer programmers and database administrators and other data management professionals to perhaps don't have a full appreciation of clinical research, but have a great handle on computer programming and database design. You'll understand some of their terminology now and you'll be able to help guide them into designing a study database in an effective way.