Welcome to Module 3 of this MOOC on information extraction on health data. We'll start with what medical named entity recognition is. The outline for this module is what named entity recognition is, how you recognize these fields in text, and how we can build on what we've seen in Modules 1 and 2.

Named entity recognition is the task of labeling sequences of words in text that are names of things from a predefined set of categories. For example, we want to identify certain components like names of people. That would be a sequence of words in text, and one of the categories of interest would be the name of a person. Named entity recognition has two parts: one is to recognize the string of text, and the other is to classify that string, for example as being the name of a person. You not only identify where it is, but also what you are identifying.

For example, take a sentence like this: Dr. Shaun Murphy works at San Jose St. Bonaventure Hospital. This sentence has two named entities. One is Dr. Shaun Murphy, the name of a doctor, and the second is San Jose St. Bonaventure Hospital. The two strings are separated by words that are not relevant to any of the classes we have identified, because in this case we are looking for the name of a doctor and the name of a hospital. Another way to say the same thing is with tags. Dr. Shaun Murphy is a doctor, so the doctor tag starts before "Dr." and ends after "Murphy", and the hospital tag starts before "San" and ends after the word "Hospital".

This tool can be used for question answering. For example, if the question was, where does Dr. Shaun Murphy work? Once you've identified and classified the elements in the sentence, you can answer that question. Another use would be to summarize or to de-duplicate. If one sentence says Dr. Shaun Murphy and another sentence says Dr. S. Murphy, it's likely that they are the same person, so that would be the de-duplication component. It is also useful for information retrieval and for indexing documents. For example, if you want to identify hospital names, instead of indexing just the word "hospital", maybe it makes sense to index San Jose St. Bonaventure Hospital as one unit, so that whenever you search for that hospital, you can find all documents or all sentences that mention it.

It is also important in another task that is very specific to medical data, and that is the protection of identifiable information. There are a lot of country-specific laws that help prevent misuse of patient information. In the US, there is HIPAA, the Health Insurance Portability and Accountability Act of 1996, and more recently the Health Information Technology for Economic and Clinical Health Act, the HITECH Act of 2009. Both of these are about how medical or clinical data in electronic form should be protected before it's shared with others who did not participate in collecting it. How do you protect the patient's information? The European Union has something similar in the General Data Protection Regulation. Canada has something similar in the Personal Information Protection and Electronic Documents Act, and India has something similar in the DISHA Act, the Digital Information Security in Healthcare Act. I'm sure that many countries around the world have some act or some laws to protect the privacy of patient information.

What does that mean? What is that identifiable information in clinical data? In HIPAA, we have 18 categories of personal information that are identified. For example, you have demographic information: patient name, age, profession, the organization the patient belongs to, and so on. You have clinical information. Who is the doctor? That's the doctor name.
Where did the patient go to get care? That's the hospital name. When did the patient go? That's the date of visit. Then you have location information like the address: street, city, state, country, and so on. You also have unique numbers and handles. For example, a medical record number, a social security number, a tax identification number, or a passport number. It could also be a license number, or phone, fax, and e-mail, or URLs for a personal homepage, for example. If these are collected as part of clinical care, you would want to consider them identifiable information.

With this variety of fields of identifiable information, we need different approaches to identify them. We could use handcrafted rules like patterns, or use a list of items, some of the things that we have seen in Module 1. Names of different, let's say, professions might be in a standardized list like those we saw in Module 2. You could use automated approaches as well. For example, statistical approaches or machine learning-based approaches could be used to identify names of patients, names of doctors, and so on. That will be the focus of this week. You can also have hybrid approaches: something that starts with handcrafted rules and then builds an automated approach on top, or something that starts with an automated machine learning approach and then does some sort of cleaning or post-processing using handcrafted rules. That is also a possible way to identify these.

For formatted text, we have seen in Module 1 that if fields follow a particular format, for example, things like age, phone number or fax number, emails, URLs, medical record number, social security number, tax number, passport number, and so on, we could use something like regular expressions. For closed lists, meaning you have a set of names and only those can appear, an example would be states.
For example, all the states in the United States, or all the states in India, or all the provinces in China would be a closed list, and you can just work out of that list. You could have all the names of countries, for example, a list of the 193 member states of the United Nations. City names or street names could also be closed lists, especially if you're working with a focused area. If your clinic is in a particular city, only the city names and street names in that vicinity might be relevant, because your patients would come from there. Profession names are also closed lists, so you could get a list of all the different professions somebody could have. Demonyms are the names given to people from particular countries, so you could have a list of names that way.

How would you identify these? You could start with gazettes and gazetteers, which are lists of known and verified names. For example, all the different states in a country would come from a gazette. You could also use patterns to build on these lists and create hybrid approaches. For example, street names themselves might be a long list, but you know that a street name always ends with something like Street or Road or Lane or Drive or Boulevard. All of these could be good endings for street names, and then you have a closed list of endings to use.

Good. Now that we know there are these different approaches, let's look at these 18 categories again. Age could be identified using regular expressions. Profession could be identified using lists. In the same way, the ones that are highlighted here are identifiable using either regular expressions or lists. But what about person names? That is a category we have not yet looked into. Or organization names, or hospital names. What do you do with those? How do we identify those?
The way to do that would be either to use the hybrid approaches that we have seen, with lists, patterns, and rules, or to use automated approaches, that is, supervised machine learning. This week, we are going to focus on supervised machine learning as an approach to identify these components.
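
As a recap of the rule-based side before moving on, the regular-expression, closed-list, and hybrid street-name ideas described in this module could be sketched roughly as below. This is only a minimal illustration under stated assumptions: the US-style phone and social security number formats, the three-state gazetteer, the label names, and the function name `find_entities` are all made up for this example and are not from the course materials. Each result carries both parts of named entity recognition: where the string is (start and end) and what class it belongs to (the label).

```python
import re

# Hypothetical patterns for formatted fields (US-style formats assumed).
FIELD_PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}

# A tiny illustrative gazetteer; a real one would hold the full closed list.
STATE_GAZETTEER = ["California", "Texas", "New York"]

# Street-name endings mentioned in the lecture, used as a pattern (hybrid rule).
STREET_SUFFIXES = ["Street", "Road", "Lane", "Drive", "Boulevard"]


def _alternation(items):
    # Longest entries first so a longer name wins over a shorter overlapping one.
    ordered = sorted(items, key=len, reverse=True)
    return "|".join(re.escape(item) for item in ordered)


# Closed list: match any gazetteer entry as a whole unit.
STATE_PATTERN = re.compile(r"\b(?:" + _alternation(STATE_GAZETTEER) + r")\b")

# Hybrid: one or more capitalized words followed by a known street ending.
STREET_PATTERN = re.compile(
    r"\b(?:[A-Z][a-z]+\s)+(?:" + _alternation(STREET_SUFFIXES) + r")\b"
)


def find_entities(text):
    """Return (start, end, label, matched_text) tuples sorted by position."""
    patterns = dict(FIELD_PATTERNS, STATE=STATE_PATTERN, STREET=STREET_PATTERN)
    found = []
    for label, pattern in patterns.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.end(), label, match.group()))
    return sorted(found)


if __name__ == "__main__":
    sample = ("Call 408-555-0199 or email jdoe@example.org. "
              "The clinic is on Maple Grove Road in California.")
    for start, end, label, value in find_entities(sample):
        print(f"{label:7} [{start}:{end}] {value}")
```

Rules like these only work for fields with a fixed format or a known list. Nothing in this sketch would find a person name or a hospital name that is not already on a list, which is the gap the supervised machine learning approaches are meant to fill.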