Hello again. Once a team has refined the requirements for the problems they want to address and created an analytical plan, the next step is to carefully document the concepts that will be required to complete the project. Once domain experts document and explain which analytical concepts are required, database experts and informatics specialists can help map those concepts to specific fields within the databases. Most health care databases have a huge number of tables, and many of those tables have many rows per patient. As a result, specific rows have to be extracted based on coded values that categorize why the rows were created. This complexity can lead to time-consuming work for the database programmers tasked with extracting the data. This work includes managing duplicate rows, creating analytical files with fewer rows, or a single row, per patient, and performing complex transformations required by data inconsistencies over time.

At the end of this lesson you will be able to articulate the processes required to map an analytical plan into sets of concepts, and then map those concepts to specific database fields. With more knowledge about clinical database issues, you will also be able to communicate to an analytical team some of the problems that might occur when extracting the data.

Let's begin with the first step in the process. The first step in data extraction is to define what concepts need to be extracted from the database. I say concepts here, rather than specific data fields, because subject matter or domain experts will often be driving the initial project goals, and these people may or may not know exactly what data is available. Moreover, getting too deep into the data too early can cause people to lose sight of the overall goals of the analytical work, and they can start to focus on the solution before the problem is defined. The key idea is to carefully think about what processes or workflows need to be improved, and then consider what measurements could help solve the problem.

One toolkit that can help is concept mapping. This involves an effort to document all of the ideas involved in the problems and solutions, along with all of their relationships. For example, if a team is trying to figure out how to reduce hospital readmissions, they might start with a concept map that lists out concepts such as patients, admissions, transfers, and provider types, and shows how all of these interact to impact readmissions.

With the resulting list of concepts and concept relationships from the concept mapping process, the next step is to approach the database and analytical teams to start thinking about the best data that can be used to represent specific concepts. Remember that health care data is a complex domain, and some concepts can map to dozens or more fields. It can take time and thought to figure out which fields really measure the intended concept. For example, I've worked in electronic health record databases with over ten thousand fields. We had hundreds of fields related to diagnoses captured in various workflows, and dozens of date and time fields related to hospital discharges. In another example, clinical databases have huge numbers of non-standard fields captured in non-standard formats. Often, people create a new form for the same concept, and in the end there are dozens of different forms and associated fields for the exact same concept. A metadata search, like the one sketched below, can be a practical way to find candidate fields.
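To make that field-hunting step concrete, here is a minimal sketch of a metadata search. It assumes the database exposes the INFORMATION_SCHEMA views available in many relational databases, and the search term is a hypothetical stand-in for whatever your concept map suggests.

    -- List candidate "hospital discharge date" fields by searching column metadata.
    SELECT table_name,
           column_name,
           data_type
    FROM information_schema.columns
    WHERE LOWER(column_name) LIKE '%discharge%'   -- hypothetical search term
    ORDER BY table_name, column_name;

The output is a list of candidate tables and fields that the domain experts can review before anyone writes extraction queries against real patient data.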
It often takes careful work with domain experts to figure out which specific fields are the same measure and which are different. These are clinical examples, but the same idea applies to claims data. I've spent hundreds of hours painfully trying to figure out which of a dozen or more fields that seemingly relate to a paid claim amount really is the amount that was paid. The other fields often relate to subtly different aspects of the payment. Yes, a data dictionary would help solve the problem, but often these do not exist, are error prone, or are held by another team that may or may not want to share this critical metadata with you. Internal documentation helps a lot, but programmers must develop a culture of sharing to best achieve shared aims.

Most claims and clinical electronic health record databases are transactional systems where new records are added for new events and existing rows are kept. For example, as claims go through the adjudication process, sometimes dozens of claims will be submitted and resubmitted, each new submission resulting in a new row. In a clinical example, as medications are reconciled or reviewed by various providers, many rows are created. In addition, the medication dispensing process may result in many rows of data as a drug is ordered and processed by the pharmacy. In other words, each click in the system is an action that is documented. In these cases, new rows are added to show the updated transactions while leaving the original rows as a record of what happened in the past. These database structures are not problematic in themselves, given that each row has a key identifying it as a separate row. However, problems can emerge when analysts query the data and perform joins that keep all of the copies of the data in the final dataset. The problem is magnified when tables with many records per event are joined with other tables with many records per event; the result can be files with hundreds or even thousands of duplicated rows. This can be fixed by selecting just one row, so that the data are de-duplicated before merging them into an analytical file. So you can see the importance of a health care data analyst being aware of such potential problems and taking the required actions to remedy any confusion.

Sometimes dealing with multiple rows per patient is as simple as using a DISTINCT statement within a SQL query to remove duplicate rows, but there will often be times when analysts need to work closely with domain experts to determine which specific rows should be included in the final output file. For example, data scientists and statisticians often want an analytical file that is one row per patient or encounter. Thus, the data retrieval programmer has to create a query that moves from multiple rows per patient to a single row. For a patient with 20 labs for hemoglobin A1c, the programmer could take the average of these and get one row per patient. Other times it is more complex, and there needs to be a discussion about which events the domain experts really want to keep. Is it the first, the last, the one closest to time x, or the one that is not missing, for example? In addition, it can take a lot of thought and many decisions to aggregate categorical data. If a patient has five different race and ethnicity values in the database, which is not uncommon, it can be tricky to decide which is the best value to pick. The two sketches that follow illustrate the de-duplication and one-row-per-patient patterns.
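First, to make the duplication problem concrete, here is a minimal sketch. The table and column names (encounters, claim_lines, the PAID status) are hypothetical; the point is the pattern of a join fanning out rows and DISTINCT collapsing them.

    -- Each encounter can have many claim_line rows, so a plain join repeats
    -- the encounter data once per claim line. DISTINCT collapses the copies.
    SELECT DISTINCT
           e.patient_id,
           e.encounter_id,
           e.admit_date
    FROM encounters e
    JOIN claim_lines c
      ON c.encounter_id = e.encounter_id
    WHERE c.claim_status = 'PAID';   -- hypothetical status code

Without DISTINCT, an encounter with 30 claim lines would appear 30 times in the output, and joining a second many-rows-per-event table would multiply the copies again.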
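Second, here is a sketch of moving from many rows per patient to one. The first query averages hemoglobin A1c results; the second keeps only the most recent result using a window function. The lab_results table and the HBA1C local code are hypothetical.

    -- One row per patient: the average of all A1c results.
    SELECT patient_id,
           AVG(result_value) AS mean_a1c
    FROM lab_results
    WHERE lab_code = 'HBA1C'          -- hypothetical local code
    GROUP BY patient_id;

    -- One row per patient: the most recent A1c result instead of the average.
    SELECT patient_id, result_value, result_date
    FROM (
        SELECT patient_id, result_value, result_date,
               ROW_NUMBER() OVER (PARTITION BY patient_id
                                  ORDER BY result_date DESC) AS rn
        FROM lab_results
        WHERE lab_code = 'HBA1C'
    ) ranked
    WHERE rn = 1;

Which version is right depends entirely on what the domain experts want the final variable to mean, and categorical fields like race and ethnicity need a similar but explicit rule, such as the most frequent or the most recent non-missing value.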
I've done a lot of data retrieval work over the years, and investigators often request specific types of labs, medications, or procedures based on standard codes such as CPT or RxNorm. But they do not know the local or non-standard codes that reference the data in the electronic health record system. For example, the clinical database could have a code called LAB-SC-026. The only way to figure out what this means is to reference the descriptive label associated with the code. However, these labels are sometimes missing or incomplete, and as a result it can take a lot of time to find the standard concept that the researcher or analytical team is really interested in. There can be hundreds of possible matches, and sometimes you need a clinical team to select the rows that they think match their clinical concepts; a label search like the one sketched below is often the starting point. Overall, this is tedious and frustrating work that can slow down analytical projects, but it is important nonetheless.

Okay. We are ready to consider some additional data extraction issues. There are a lot of issues that come up when you try to extract concepts from complex clinical or administrative health care datasets. I cannot stress enough how common it is for analytical teams to discount the challenges associated with this step. As a result, projects can get under budgeted and people can get frustrated with how much time the data extraction process takes. Here is a short list of examples of how data extraction can be tricky.

First, consider data differences through time. Data within EHR systems often change through time, usually for the better, but we should be careful to account for significant changes in our analyses. For example, on October 1st, 2015, there was a transition from ICD-9 to ICD-10. Therefore, when queries span this time, it is necessary to get both types of codes, as in the diagnosis sketch below. But investigators and researchers often forget about this transition. As a result, it takes more time to summarize data from these different sources.

Second, analyses often require conditional logic to fix bad or missing data. EHR data are collected for similar medical concepts in many locations, and thus when data are missing or problematic in one field, it can make sense to supplement or augment the missing or bad data from another field that might be more complete. This is usually done with programming constructs such as SQL CASE/WHEN statements; the smoking-status sketch below shows the pattern. As an example, smoking data was captured in a half dozen places where I once worked. However, we were able to create a complete field using data from all the sources. Overall, if high quality data are required, it is best to combine data from all sources: use conditional logic to start with the most valid source and, if it is missing, provide values from the next most valid source.

Finally, I've often struggled to extract complex inpatient data. Clinical data from the inpatient setting is often extremely granular and complex. As a result, numerous clinical details are stored in a variety of places such as flowsheets or smart forms, as they were called by at least one vendor. It can be very challenging to find all of the clinical events stored in the various tables, and it is sometimes necessary to search flowsheet IDs. It is then necessary for the analysts to learn how these various concepts might be aggregated so that actionable information can be extracted.
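As a concrete illustration of the label hunt, here is a minimal sketch. The lab_dictionary table and its columns are hypothetical stand-ins for whatever metadata your EHR vendor provides.

    -- Search descriptive labels for anything that might be a hemoglobin A1c
    -- test, then let the clinical team review the candidate local codes.
    SELECT lab_code,
           lab_description
    FROM lab_dictionary               -- hypothetical dictionary table
    WHERE LOWER(lab_description) LIKE '%a1c%'
       OR LOWER(lab_description) LIKE '%glycated hemoglobin%'
    ORDER BY lab_code;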
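Here is a sketch of a diagnosis query that spans the October 1st, 2015 transition by including both code families. The diagnoses table is hypothetical, and the diabetes code patterns (ICD-9 250.xx, ICD-10 E11.x) are real code families used only as an illustration; a real study would refine the code lists with clinical input.

    -- Pull diabetes diagnoses across the ICD-9 to ICD-10 transition.
    SELECT patient_id,
           diagnosis_code,
           code_system,
           diagnosis_date
    FROM diagnoses                    -- hypothetical table
    WHERE (code_system = 'ICD-9'  AND diagnosis_code LIKE '250%')
       OR (code_system = 'ICD-10' AND diagnosis_code LIKE 'E11%');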
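And here is a sketch of that conditional-logic pattern for smoking status, working from the most valid source down to the least. The source columns and the combined view are hypothetical.

    -- Prefer the social-history field; if it is missing, fall back to
    -- registration data, then to a problem-list-derived flag.
    SELECT patient_id,
           CASE
               WHEN social_hx_smoking    IS NOT NULL THEN social_hx_smoking
               WHEN registration_smoking IS NOT NULL THEN registration_smoking
               ELSE problem_list_smoking
           END AS smoking_status
    FROM patient_smoking_sources;     -- hypothetical combined view

When the rule is simply "take the first non-missing value", COALESCE(social_hx_smoking, registration_smoking, problem_list_smoking) is an equivalent shorthand.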
Excellent. We now know about some of the important issues associated with extracting data from complex health care data sets. In the next lesson we will move to the equally labor intensive task of transforming data into analytical structures. See you soon.