Welcome back, everyone. It is sometimes tempting to get excited by fancy and cool modeling techniques. And indeed, these tools are fun and often exciting. As I have discussed, it is important to first think about the business question that needs to be addressed. It is also helpful to understand what data are available and how they are structured before getting too far ahead with thoughts about models. In addition, it is very important for analytical teams to speak the same language with regard to the technical terms associated with data science. Thus, in this lesson, I want to talk briefly about data mining algorithms and definitions. I will be using the book titled Data Science for Business to review data mining algorithms that will be used for specific application areas in the course. As this is a slightly longer conversation, we have divided this lesson into two parts. At the end of this lesson, you will be able to use concise definitions and examples to identify categories of data mining algorithms. These terms and categories will allow you to work in diverse teams that involve healthcare domain experts, clinicians, database experts, informaticists, and data scientists. Let's dive in. So how do we describe data sets? Before I start discussing types of analytical models, let's review some important synonyms for describing data. We'll also look at how data are conceptualized for various types of analyses. In this image, we have a simple table with columns and rows. You might also call this table a dataset or a worksheet. The top row of the table has the column labels. These are the variables, features, fields, or attributes. The rows of the table, other than the top label row, might be called records, instances, cases, or individuals. Starting in the top left, what is an individual? This is often a term for the rows that form the unit of analysis within data mining.
In this example, the individuals are the people or patients identified by the first column, which is labelled individual ID. In data science, analysts use various terms to describe the same thing. Let's review these synonyms. First, it is very important to think about what people mean by the term data set. This is usually the same thing as a table or worksheet in which there are rows and columns with specific data elements in each. This is by no means clear-cut. A dataset could be an analytic file created by a programmer who uses SQL to query multiple tables from a relational database. Sometimes people also have an Excel worksheet that may or may not have come from a more complex relational database. Next, look at the list at the left. People use different terms for rows or records within a data set or table. Additional terms include instances, examples, or cases. These tend to be popular among computer scientists. Regardless of the terms that are used, analysts will have to decide what unit of analysis will be used to produce the data. Many data scientists use the term individuals to refer to the entities within the data that are likely to be the focus of the algorithms. Individuals could be people, or they could be other entities, such as hospitals or even states. Next, there are many synonyms for columns within a data set. These include features, fields, attributes, and variables. If that is not already confusing enough with all these names, we need to be clear about how analysts are using columns in their analyses. Within a method such as cluster analysis, a dozen or more columns might be used together to group these so-called individuals. Once we move into the area of predictive modelling, various columns can be put into specific categories that also have different names. For example, in predictive modelling, we need to have a target column, that is, the column we use to estimate the likelihood of outcomes or events as defined by that target.
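To make the vocabulary above concrete, here is a minimal Python sketch of a dataset as a table. The column names (`individual_id`, `age`, `prior_admissions`, `readmitted`) and all values are hypothetical and invented for illustration. Leaving the ID column out of the features is an assumption of this sketch, not something stated in the lesson.

```python
# Hypothetical toy dataset: every column name and value is invented for illustration.
# Each dictionary is one row (also called a record, instance, case, or individual).
dataset = [
    {"individual_id": 1, "age": 67, "prior_admissions": 2, "readmitted": True},
    {"individual_id": 2, "age": 45, "prior_admissions": 0, "readmitted": False},
    {"individual_id": 3, "age": 72, "prior_admissions": 3, "readmitted": True},
]

# The keys are the columns (also called variables, features, fields, or attributes).
columns = list(dataset[0].keys())

# The unit of analysis here is the patient, so each row is one individual.
n_individuals = len(dataset)

# In predictive modeling, one column is singled out as the target.
target_column = "readmitted"
# Assumption of this sketch: the ID only identifies the row, so it is not a feature.
id_column = "individual_id"
feature_columns = [c for c in columns if c not in (target_column, id_column)]

# Split each row into its feature values and its target value.
features = [[row[c] for c in feature_columns] for row in dataset]
targets = [row[target_column] for row in dataset]

print("columns:", columns)
print("feature columns:", feature_columns)
print("targets:", targets)
```

The same table could just as well use hospitals or states as its rows; only the choice of unit of analysis would change.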
Thus, the target is what we're trying to predict. Examples include readmissions, adverse events, or length of stay. Computer scientists and data miners are more likely to use the term target, but statisticians often refer to dependent variables or outcome variables. In predictive modeling, we also need columns that are correlated with or predict the target. Synonyms here include independent variables, predictors, explanatory variables, or inputs. In the field of data science, the term prediction means estimating unknown values. So this is a general term for a wide variety of approaches. In everyday language, prediction refers to a future event, and thus the term can confuse people. For example, I was once working with a researcher who was not very familiar with data mining or statistics. He was confused as to why I was using the term predictive modeling when our goal was to create a model to explain variation associated with healthcare costs in the current or past years. The research had nothing to do with predicting future costs. Thus, it is important to remember that predictive modeling does not have to be future oriented, although often it is. Next, predictive modeling can be contrasted with descriptive modeling. In descriptive modeling, there is no estimation of values. For example, an analyst does not attempt to quantify the value for a particular outcome or assign a particular classification label. The purpose of descriptive modeling is simply to describe the underlying phenomenon or processes. Supervised and unsupervised are terms that came from the subfield of machine learning within the broader field of computer science. These terms are now common in a wide variety of discussions related to data science, data mining, and statistics.
The analogy is to education: in supervised learning, a teacher carefully supervises the learner by providing the target information along with the examples. Supervised learning has some of the following traits. First, classification and regression modeling are forms of supervised learning. Second, the dataset has a clear target label or outcome variable. As an example, each row or individual in the dataset has a target, such as cancer or no cancer. In contrast, with unsupervised learning, there are examples to learn from, but there is no clear target information. Thus, there is a need to find the labels, groups, and classes, especially if one later wants to perform supervised learning methods. Clustering and association mining are examples of approaches that do not require a target. So that's a lot to take in. Let's take a break and take up the rest of the related definitions that are important to data mining and predictive modeling in the next lesson, which is part two of this topic.
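As a companion to the supervised and unsupervised definitions above, here is a minimal Python sketch of the contrast. The measurement values are hypothetical. The two techniques (a nearest-neighbor classifier and a one-dimensional two-group k-means) are simple stand-ins chosen for this illustration, not methods named in the lesson.

```python
# Hypothetical data: one made-up measurement per individual, invented for illustration.
labeled = [(1.0, "no cancer"), (1.5, "no cancer"), (8.0, "cancer"), (9.0, "cancer")]


def nearest_neighbor_label(x, examples):
    """Supervised sketch: return the target label of the closest labeled example."""
    closest = min(examples, key=lambda ex: abs(ex[0] - x))
    return closest[1]


def two_means(values, iterations=10):
    """Unsupervised sketch: split values into two groups (1-D k-means with k=2)."""
    centers = [min(values), max(values)]
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            # Assign each value to whichever center is nearer.
            idx = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[idx].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return groups


# Supervised: the target labels are available, so a new case can be classified.
predicted = nearest_neighbor_label(7.2, labeled)

# Unsupervised: the labels are dropped, so the algorithm can only find groups.
unlabeled = [value for value, _ in labeled]
clusters = two_means(unlabeled)

print("predicted label:", predicted)
print("clusters found without labels:", clusters)
```

Note that the unsupervised step returns groups of values with no names attached; deciding what each group means is left to the analyst, which is the sense in which labels have to be found before supervised methods can be applied.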