Keyword windows are a text processing technique that extracts the text surrounding a keyword to provide context about how that keyword is used. Let's look at two examples. First, an identification task. Our goal is to identify patients who experienced muscle pain while taking a statin. To find these patients, we can search the clinical notes that would likely record this information, things like history and physical notes or even clinical communications. Remember that clinical communications are the messages within the care team and between patients and providers. In each of these notes, we look for one or more keywords. Since we are trying to identify side effects of a medication, our keywords are the names of those medications. In this case, we can search for each drug name, including both the generic and the brand name. For example, we would search for atorvastatin and Lipitor, simvastatin and Zocor. We can also use the class name, statin. At every place in the note where we find the keyword, we then extract the text around the keyword. This is called the window. Window sizes can vary. Sometimes the window will be smaller, say five words before and five words after the keyword, or it may be longer, maybe two sentences before and after the keyword. We'll talk more about how to pick a window size a bit later. Once you have extracted the text window, you start searching for other words and phrases that indicate the condition you are looking for. In this case, we want to look for words like pain or the clinical term, myopathy. Think of this like an inclusion criterion. If they have one of these terms, you include them in the group you consider cases, those who had the reaction. Develop the list of keywords and phrases based on your clinical knowledge and on manual review of the text windows. You'll likely notice as you build the inclusion word and phrase list that sometimes those terms do not actually indicate the condition of interest.
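To make the steps so far concrete, here is a minimal sketch of pulling a word-based window around each keyword and checking it against an inclusion list. The function names, the term lists, and the two example notes are all made up for illustration, and a real project would need more careful tokenization than splitting on whitespace.

```python
# Hypothetical keyword list: generic names, brand names, and the class name.
KEYWORDS = ["atorvastatin", "lipitor", "simvastatin", "zocor", "statin"]
# Hypothetical inclusion terms; a real list comes from clinical knowledge
# plus manual review of the windows.
INCLUSION_TERMS = ["muscle pain", "myopathy"]


def keyword_windows(note, keywords, n_words=5):
    """Return (keyword, window_text) for every keyword hit in the note."""
    tokens = note.split()
    keyword_set = {k.lower() for k in keywords}
    windows = []
    for i, token in enumerate(tokens):
        # Strip surrounding punctuation so "Lipitor." still matches.
        word = token.strip(".,;:()[]\"'").lower()
        if word in keyword_set:
            start = max(0, i - n_words)
            # Slicing past the end of the list is safe in Python.
            windows.append((word, " ".join(tokens[start:i + n_words + 1])))
    return windows


def matched_inclusion_terms(window, terms):
    """Return the inclusion terms that appear in the window text."""
    text = window.lower()
    return [t for t in terms if t in text]


# Made-up example notes, not real patient data.
notes = {
    "note_1": "Patient on atorvastatin now reports muscle pain in both legs.",
    "note_2": "Continue Lipitor. Patient denies myopathy or other side effects today.",
}

for note_id, note in notes.items():
    for keyword, window in keyword_windows(note, KEYWORDS):
        hits = matched_inclusion_terms(window, INCLUSION_TERMS)
        print(note_id, keyword, repr(window), hits)
```

Both windows match an inclusion term here, even though the second note says the patient denies myopathy. Inclusion matching alone cannot tell those two apart.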
For example, in this case, we can see that the provider noted that the patient does not have myopathy. In another note, the provider is warning the patient to contact them if they develop muscle pain. These phrases will become part of your exclusion list. This is a list of words and phrases that you apply to your case windows to exclude them from the case list. This list is created primarily through review of the text windows. At the end of this process, you'll have your text windows sorted into those that do and those that do not indicate that the patient had muscle pain while on a statin. Any patient with at least one case window is labeled a case, while patients who have no case windows are considered controls. Extraction tasks have a similar overall process. Our extraction task is to extract the location of an intracranial aneurysm. We will search radiology reports of the brain for the keyword aneurysm, extracting the 10 words before and after our keyword. Now, instead of an inclusion list, we build an extraction list, the words and phrases indicating the location of an aneurysm that we want to extract. For example, terms like internal carotid artery or ICA. We will still build an exclusion list that will remove windows from consideration if they have phrases like, no evidence of aneurysm in the internal carotid artery. As with the identification task, the extraction list is created through a combination of expert knowledge and window review, while the exclusion list is created primarily through window review. The final question is how to select your window size. In traditional text mining, you would try a number of window sizes and compare their performance using an ROC curve, choosing the window size with the best performance. In our case, since we are developing the inclusion, extraction, and exclusion lists partially based on the text in the window, this data-driven approach does not work well.
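The extraction task described above can be sketched in the same way. The version below assumes a 10-word window, a small hypothetical extraction list that maps phrases to a location label, and a hypothetical exclusion list. The reports are invented examples, and the function names are illustrative only.

```python
import re

KEYWORD = "aneurysm"
# Hypothetical extraction list: phrase -> normalized location label.
EXTRACTION_TERMS = {
    "internal carotid artery": "ICA",
    "ica": "ICA",
    "middle cerebral artery": "MCA",
    "mca": "MCA",
}
# Hypothetical exclusion list; in practice it is built from window review.
EXCLUSION_PHRASES = ["no evidence of aneurysm", "no aneurysm"]


def keyword_windows(text, keyword, n_words=10):
    """Return the n_words before and after each occurrence of the keyword."""
    tokens = text.split()
    windows = []
    for i, token in enumerate(tokens):
        if token.strip(".,;:()").lower() == keyword:
            windows.append(" ".join(tokens[max(0, i - n_words):i + n_words + 1]))
    return windows


def extract_locations(report):
    """Return sorted location labels found in non-excluded windows."""
    locations = set()
    for window in keyword_windows(report, KEYWORD):
        text = window.lower()
        # Drop the whole window if any exclusion phrase is present.
        if any(phrase in text for phrase in EXCLUSION_PHRASES):
            continue
        for phrase, label in EXTRACTION_TERMS.items():
            # Word boundaries keep "ica" from matching inside longer words.
            if re.search(r"\b" + re.escape(phrase) + r"\b", text):
                locations.add(label)
    return sorted(locations)


# Made-up example reports.
report_a = "There is a 4 mm saccular aneurysm arising from the left internal carotid artery."
report_b = "No evidence of aneurysm in the internal carotid artery."

print(extract_locations(report_a))
print(extract_locations(report_b))
```

The identification task would follow the same shape, with an inclusion list in place of the extraction list and a patient labeled a case when at least one window survives the exclusion step.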
In practice, I typically go through an iterative process. As I'm starting to develop the inclusion or extraction list, I'll keep the full note open with the text window in a second column. This lets me get a sense of whether I'm missing things outside the window. If there seems to consistently be important data outside the window, I make the window bigger and review again. If instead I'm noticing a lot of off-target matches in the window that are further away from the keyword, then I shrink the window size. I keep this process going until it seems like the majority of my windows have the information that I need. Now that you know the basics of keyword windows, let's get started applying this technique.
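As one possible starting point, the window-size review can be supported with a small helper that shows, for each candidate size, which inclusion terms land inside the window and which appear only elsewhere in the note. This is a sketch of a review aid under assumed names and an invented note. It does not replace reading the full note next to the window.

```python
# Hypothetical inclusion terms for the statin example.
INCLUSION_TERMS = ["muscle pain", "myopathy"]


def keyword_windows(note, keyword, n_words):
    """Return the n_words before and after each occurrence of the keyword."""
    tokens = note.split()
    windows = []
    for i, token in enumerate(tokens):
        if token.strip(".,;:()").lower() == keyword:
            windows.append(" ".join(tokens[max(0, i - n_words):i + n_words + 1]))
    return windows


def review_window_sizes(note, keyword, sizes):
    """For each size, list terms inside the window and terms only outside it."""
    note_terms = {t for t in INCLUSION_TERMS if t in note.lower()}
    rows = []
    for n in sizes:
        in_window = set()
        for window in keyword_windows(note, keyword, n):
            in_window |= {t for t in INCLUSION_TERMS if t in window.lower()}
        rows.append({
            "size": n,
            "in_window": sorted(in_window),
            # Terms present in the note but missed by every window of this size.
            "outside_only": sorted(note_terms - in_window),
        })
    return rows


# Made-up example note.
note = ("Patient has been taking simvastatin daily for the past six months "
        "and recently developed muscle pain.")

rows = review_window_sizes(note, "simvastatin", [5, 12])
for row in rows:
    print(row)
```

If many notes show terms under `outside_only` at the current size, that suggests trying a bigger window. Off-target matches still have to be judged by reading the windows themselves.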