Hello, everyone. Today we're going to continue our discussion of the data mining pipeline. We're going to start with the first core component, which is data understanding. For this particular topic, what we're trying to understand is how to describe the key characteristics of data, and also how to apply different techniques to characterize different types of datasets. As we introduced earlier, we are looking at the whole data mining pipeline. We take the raw data, but before we do any other processing or modeling, we really want to spend our time understanding what kind of data we're working with. That's what we mean by data understanding. We want to look at what kind of objects we have in the dataset, and what kind of attributes are used to describe those data objects. We're also looking at various ways to calculate statistics for your dataset, and at visualizations, which are very powerful and effective at conveying certain characteristics of your data. Then we'll also look at how we measure similarity. As a starting point, you are provided with a dataset. What is in a dataset? We usually say a dataset is a collection of data objects. Objects are the main entities. The objects could be employee records, where each individual object is one employee; or a particular set of products you are interested in; or online posts, where each post is one object you're interested in. You have many of them. But you don't just have the objects; obviously you need a way of describing them. That gets to the notion of attributes. To describe each object, you use a number of attributes. Attributes are also referred to as features, variables, or dimensions. Those terms are typically used when people talk about how many attributes you're using to describe your objects.
Let's look at some concrete examples. In a dataset that contains employee records, what information do you want to use to describe each employee? You usually need their first name, last name, maybe middle name as well, date of birth, the position they're in, and other related information, for example salary. This could be a monthly salary or an hourly pay rate. Or let's look at another example: online posts. Think about online social media. Each post would be described by who posted it, so you have the user account; then you usually have a timestamp for when it was posted. If it was posted in a particular subthread or subtopic, that's also information you can leverage. Of course you have the raw content of the post. Then you have the responses to the post: if it's a tweet, how many times it has been retweeted, who has retweeted or commented on it, how many people liked it. All of those are pieces of information you can use to describe your objects. Now, when you look at the different attributes, they can be very different. Think about our earlier example: if you're trying to describe an employee, you have their name information, but you also have their salary information. Apparently they have different types. Typically, when we talk about attribute types, we first consider categorical attributes. Those carry nominal information, meaning there is a finite number of potential values for that attribute. A general nominal example: if you're talking about your major, undergraduate or graduate, you may have a few possible values, like computer science being one and data science being another. Then you also have the binary case. That means there are only two possible values; in the simplest case, yes or no. Are you a CS major or not?
That's a yes-or-no question. Or do you have a particular symptom or not? Yes or no. That's binary information. But there are also ordinal attributes: there are still a few finite potential values, but they are ordered in some way. Think about academic ranks: you go from assistant professor to associate professor to full professor. Or think about degrees: high school, college, graduate school. Those are different nominal values, but they are ordinal in the sense that they have a certain order associated with them. Then, in contrast to the categorical values, you typically also see a lot of numerical values. These could be integer or real-valued attributes. Think again about your employees, like salary. That's usually a real value, because you may have very specific values for what a particular employee is getting. Or you could have time or date information; those are usually discrete values. You have, say, the year 2020, and then a particular date within it. Those are all numbers, they're numeric, but they again differ in terms of whether they are continuous, real-valued numbers versus discrete values. Another important distinction is whether a particular attribute is interval-scaled or ratio-scaled. What we're referring to is whether the attribute has a true zero point. What does that mean? Think about salary. If your salary is zero, that means you're not getting anything. That's a true zero. If it is 1,000 or 10,000, that's the difference relative to that zero. But on the other hand, look at time, say, which year it is. Year zero is not exactly a true zero point, because it's just the way we define that particular year. We say year 2000 versus year 2020; you see the difference between them, but you don't have a true zero point. That means you cannot use a ratio scale.
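To make these distinctions concrete, here is a minimal sketch with a hypothetical employee schema, invented purely for illustration, tagging each attribute with one of the type categories discussed above:

```python
# Hypothetical schema: each attribute mapped to its type, per the
# categories above (nominal, binary, ordinal, interval, ratio).
employee_schema = {
    "name": "nominal",         # categorical, no inherent order
    "is_cs_major": "binary",   # only two possible values: yes/no
    "rank": "ordinal",         # assistant < associate < full professor
    "year_hired": "interval",  # numeric, but year 0 is not a true zero
    "salary": "ratio",         # numeric with a true zero point
}

# Ratio-scaled attributes are the only ones where "two times" makes sense.
ratio_attrs = [a for a, t in employee_schema.items() if t == "ratio"]
print(ratio_attrs)  # ['salary']
```

A schema like this is worth writing down early, because the attribute type determines which statistics and distance measures are meaningful later in the pipeline.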
For example: is my salary two times somebody else's? That makes sense, because salary has a true zero point. But if you say the year 2000 is two times the year 1000, that doesn't really make sense. So when you're looking at your attribute types, you first really want to ask: what types of attributes do I have? What kind of values am I dealing with? With that, you have more information for generally understanding the statistics. The idea is that you have a lot of data points, and these data points or objects are described by many attributes. If it's only a small number, you can probably manually examine them. But many times nowadays you have millions or billions of data points, so of course you want to use some statistical measures to gain a general understanding of your dataset. You can start by asking: how big is my dataset? That's just the number of objects you have. Your dataset might be, say, 1 million data points, or maybe only 1,000. Also: how many attributes, dimensions, or features do I have? Then you look at a specific attribute. You can look at the age information, the major information, or the salary information. For each individual attribute, all of your objects have a corresponding value, and you try to get some notion of what kinds of values you see for that attribute. If you're looking at a categorical attribute, you will see that there are a few possible values, and then you look at how many times those values occur in your dataset. For example, if you look at major: how many of my employees are CS majors versus other majors? You can look at the exact count, or you can look at the percentage.
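For a categorical attribute, those counts and percentages take only a few lines. Here is a minimal sketch with a made-up list of majors, using just the Python standard library:

```python
from collections import Counter

# Hypothetical categorical attribute: each employee's major.
majors = ["CS", "CS", "Data Science", "CS", "Math", "Data Science"]

counts = Counter(majors)          # absolute frequency of each value
total = len(majors)
percentages = {m: 100 * c / total for m, c in counts.items()}

print(counts["CS"], percentages["CS"])  # 3 50.0
```

The same two views, raw counts and percentages, are usually all you need to summarize a categorical attribute.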
But if you're dealing with numerical values, there's a bit more analysis you can do. Generally you're looking at some kind of statistical distribution. We will get to the terms, but there is central tendency and also dispersion. Many times it's very useful to use this statistical analysis to compare across datasets. For example, you may have two different populations, and you're trying to compare whether they are similar or dissimilar along certain dimensions. If you have the statistical information for the corresponding attributes, then it's easier to compare. All right. First, let us look at central tendency. These measures are all for numerical values, because you need some way to calculate arithmetic quantities. Central tendency generally describes the norm: what is the typical value you would see for a particular attribute? If you look at salary information, what is a typical salary value you would see among your employees, or in a certain job market? The first measure you could use is the mean. That's really just the average: you take all the values, sum them up, and divide by the total number you have. A related one is called the median. In this case, you're not summing up all the values and dividing by the count; rather, you're looking at the sorted data. Say I have 1,000 employees in this dataset; I sort their salary values and pick the middle point. That's the median. The mode refers to the most frequent value. For example, looking at age information, you might find that age 25 or 30 is the most frequent in your dataset. Again, you're talking about the central value of your dataset. Another related term is the midrange, and that one is actually easy: you take your max value and your min value, add them together, and divide by two. That gives you a sense of the middle of your dataset. To illustrate the different measures: suppose you have some distribution; it can have a particular shape, but once you're given that distribution, you can locate the mode. The mode, as we have said, is the most frequent value, so you just look for the highest point; here, the red line points to the mode. Then the median is the middle point: you go to the point that splits the whole distribution in half, and that is where your median value is. The mean sums up everything across the overall distribution and averages it. Depending on what your distribution looks like, those three values may be positioned differently. Here's another example. If you look at this particular distribution, your mode will be here, because that's the highest point. The green line is somewhere here; that's your median value. And you will see your mean being pulled upward, because you have a pretty long tail covering the higher values. In the real world, depending on what your distribution looks like, in one scenario you have a perfectly symmetric distribution, and then the mean, median, and mode all line up at the same point. But if your distribution is skewed in any way, on the positive side or the negative side, then those three measures fall at different points. That's how central tendency gives you a quick understanding of the most likely or most common values in your dataset. Now, besides central tendency, which captures the common or more likely values, dispersion is also very important. That basically shows how your dataset, or the values, are spread out or squeezed together.
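The four central-tendency measures above can be computed directly with Python's standard `statistics` module. The salary values here are made up for illustration; notice how the single large salary pulls the mean above the median, which is exactly the skew effect just described:

```python
import statistics

# Made-up salary values; the single large salary (95k) skews the mean.
salaries = [40_000, 42_000, 45_000, 45_000, 48_000, 52_000, 95_000]

mean = statistics.mean(salaries)               # sum / count
median = statistics.median(salaries)           # middle of the sorted values
mode = statistics.mode(salaries)               # most frequent value
midrange = (max(salaries) + min(salaries)) / 2

print(median, mode, midrange)  # 45000 45000 67500.0
print(mean > median)           # True: the long tail pulls the mean up
```

On a perfectly symmetric sample, `mean`, `median`, and `mode` would coincide; the gap between them here is a quick numeric signal of skew.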
As illustrated in this particular example, you have two different distributions, the red one and the blue one. They actually have the same average or mean value; their means are both at 100. That means if you sum up all the values and divide by the total number of data points, they give the same result. But if you look at the distributions, they're quite different. The red distribution is much narrower; most data points are close together in terms of their range, while the blue one spreads out a lot more. Your average may be the same, but the spread is very different. This is particularly important. Again, using the salary example: are you looking at a dataset, a population, where most of your employees have similar, close salary values, or do you see a pretty big difference in how much people earn? To measure dispersion you can use the range. Again, that's an easy metric: you take the max value and the min value and just look at the difference. The bigger the range, the bigger the spread, of course, but it doesn't tell you exactly how things look within that range. If you want to go to a finer granularity, you can compute the so-called quartiles. The idea is that you sort your values, then look at the 25th percentile, 50th percentile, and 75th percentile. Those are the Q1, Q2, and Q3 values. Instead of a single range value, now you're looking at a few data points and checking where the corresponding values lie along those intervals. With that you can also compute the IQR, the interquartile range. You take Q3, which is the 75th percentile, and subtract Q1, which is the 25th percentile. That gives you the 25th-to-75th span, the middle half of your values. The other statistical metrics are the variance and the standard deviation. Those are also broadly used.
They give you a statistical measure of how spread out or how squeezed your distribution looks. Next, let's look at visualization. As I said, to understand what kind of data you have and what it looks like, visualization can be really powerful. There are many different mechanisms, so we're going to look at some examples. The first one is referred to as a boxplot. As the name says, it really uses boxes: you use a box to show the distribution. What is contained in the box? You have Q1, Q2, and Q3, the three quartiles, and you have the IQR. The box spans Q1 to Q3, so it covers the middle half of your data points. If you look at those boxes, you have a very good way of quickly visualizing where your attribute values lie. If you look at this as, say, feedback rating by project or quality assessment, you can quickly see the distribution but also compare across different categories. Also notice the whiskers. You don't only show the box, which is the middle 50 percent; you also have whiskers extending outward from both ends of the box, showing the min and max values. In the first one, the red one, you can see the max is five and the min is one. That means the values go from 1 to 5, and the box is the middle portion. Boxplots can also be very useful for showing outliers, meaning values that lie far outside the middle ones. Usually we use 1.5 times the IQR as the threshold: if you have data points beyond 1.5 IQR below Q1 or above Q3, those data points are shown individually as outliers. In this case you can quickly see that the other boxes actually have quite a few outliers, which are much lower than the typical values.
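The 1.5-times-IQR outlier rule that a boxplot uses can be sketched directly. The ratings below are invented to mimic the example: mostly high values plus a couple of low outliers:

```python
import statistics

# Invented ratings: mostly high (around 4-5) plus two low outliers.
ratings = [1.0, 1.2, 3.8, 4.0, 4.1, 4.2, 4.3, 4.5, 4.6, 4.8, 5.0]

q1, _, q3 = statistics.quantiles(ratings, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr   # anything below is flagged as an outlier
upper_fence = q3 + 1.5 * iqr   # anything above is flagged as an outlier

outliers = [r for r in ratings if r < lower_fence or r > upper_fence]
print(outliers)  # [1.0, 1.2]
```

Those two flagged points are exactly what a boxplot would draw as individual dots below the lower whisker.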
That gives you a very quick visualization of the general distribution, but also of the potential extreme values in your dataset. You can then look further to compare across different categories, but also identify those anomalies, and decide which of them you should remove or look at further. That is the boxplot. The other one that's very commonly used is the histogram. You have probably seen histograms of various kinds. The general idea of a histogram is that you use bars, or buckets: you divide up your values into certain intervals, and then you just count the frequencies within each interval. Here, this is the number of arrivals per minute; you can look at the intervals 1, 2, 3, and so on, and show how the counts differ across the different interval values. Another example could be an age-group distribution. You look at your user population: how many of them are within a certain age group? This could be 20-29, 30-39, 40-49, so you can quickly visualize and compare across the different intervals you're interested in. Generally, the main part of a histogram is, first, defining your X ranges, the sub-ranges or intervals you want to consider, and then counting the frequencies, which are illustrated as the heights of the bars. Another example is referred to as a quantile plot. We talked about quartiles before; quantiles generalize that to arbitrary percentiles. If you have a particular dataset or attribute, you sort the values and just compute the quantiles. This could be the 1st percentile, 2nd percentile, 5th, 10th, whatever, depending on what you want to illustrate. In my example here, I'm showing the 25th, 50th, and 75th percentiles, but you can show the individual percentiles for your particular dataset.
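The binning step behind a histogram is just grouping values into intervals and counting. A sketch with hypothetical ages and decade-wide bins:

```python
from collections import Counter

# Hypothetical user ages.
ages = [21, 23, 25, 28, 31, 33, 35, 38, 41, 44, 47, 52, 58, 63]

# Assign each age to a decade-wide bin: 20-29, 30-39, 40-49, ...
bins = Counter((age // 10) * 10 for age in ages)

# Text rendering: bar length = frequency within the interval.
for start in sorted(bins):
    print(f"{start}-{start + 9}: {'#' * bins[start]}")
```

Choosing the bin width is the one real design decision here: bins that are too wide hide structure, bins that are too narrow turn the plot into noise.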
This also makes it very easy to compare across different categories. For example, here you could have three different populations, or three different products. By showing this quantile plot, you can quickly see what they look like; maybe one is much higher than the other, or things differ across the different quantile points. Taking those quantile plots, you can also combine them into a Q-Q plot, a quantile-quantile plot. What it shows is a comparison of two dimensions, but instead of comparing raw values, you are comparing the percentile values, trying to see whether they have similar distributions. The 45-degree line is a reference line: that is where the two attributes or dimensions have the same spread across their percentile points. If you see deviation from that reference line, you can see regions where one distribution or attribute tends to have lower or higher values; that means it has more concentration at certain percentile values, or higher or lower values along specific percentile ranges. Again, that gives you a very quick visualization, especially for comparison across two different attributes. Another plotting mechanism that's widely used and fairly easy to do is the scatter plot. What you do is take two attributes, X and Y, and just plot them; each data point corresponds to its X and Y values. By showing those data points, you can quickly visualize any potential relationship between the two attributes. If you look at my example on the left, those green points, you can already see a bit of correlation. You have a downward trend; that's a negative correlation: as the X value gets higher, the Y value gets lower.
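A Q-Q plot's points are just matched quantiles taken from two samples. In this sketch both samples are made up, and the second is deliberately constructed as twice the first, so the points fall on the line y = 2x rather than on the 45-degree reference line:

```python
import statistics

# Two made-up samples to compare; sample_b is deliberately twice sample_a.
sample_a = [52, 55, 58, 60, 61, 63, 65, 68, 70, 74, 78]
sample_b = [x * 2 for x in sample_a]

# Matched percentiles (here deciles) from each sample form the Q-Q points.
deciles_a = statistics.quantiles(sample_a, n=10)
deciles_b = statistics.quantiles(sample_b, n=10)
qq_points = list(zip(deciles_a, deciles_b))

# Identical distributions would put every point on the 45-degree line
# y = x; here every point sits on y = 2x instead.
on_double_line = all(abs(y - 2 * x) < 1e-9 for x, y in qq_points)
print(on_double_line)  # True
```

Plotting `qq_points` with any charting tool reproduces the Q-Q plot described above; the systematic departure from the reference line is what reveals the scale difference between the two samples.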
That's a nice pattern you can quickly spot using a scatter plot. Or, in the other example on the right, you have those blue points, but you probably don't see a clear pattern. That suggests these two attributes are not really related, or are rather random. You can take this further. A single scatter plot usually shows two attributes, X versus Y, but you can easily combine these into an array of scatter plots. Say you have four different attributes; then you can use a 4-by-4 scatter plot array, where each sub-figure is a scatter plot comparing two specific attributes. By looking at that, you can quickly see that along some dimensions there's a bit more correlation, while others look more random. Or, if you have additional category information, you can use different colors to indicate different types of data. That gives you a quick way of visualizing potential relationships, like any clusters or correlations that you may be able to see across multiple dimensions. Generally, data visualization is very powerful and an actively explored area for data mining, because in a lot of data mining scenarios you use automated algorithms or modeling to find patterns, but visualization is really powerful for quickly conveying a pattern. Think about how you can use the different types of visualization capabilities to compare, to contrast, or to identify general patterns versus extreme values; you can also use different colors, shapes, or sizes. All of those are very useful, and there's very active research in the field of data visualization and visual analytics in general on how to effectively use various kinds of visualization capabilities to facilitate the whole data mining process.
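The downward trend a scatter plot reveals can also be quantified, for instance with the Pearson correlation coefficient. Here is a sketch on made-up paired values with a clear negative trend, computed from the definition using only the standard library:

```python
import math

# Made-up paired observations with a clear downward trend.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [9.8, 8.1, 6.2, 4.9, 3.2, 1.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Pearson r = covariance / (std_x * std_y); always lies in [-1, 1].
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
r = cov / (sx * sy)

print(r < -0.98)  # True: strong negative correlation, like the left plot
```

A value near -1 confirms numerically what the green points showed visually, while the random blue points on the right would give an `r` close to 0.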
To get into the specifics, you can look at what kinds of plots you can use. You can do charts; these could be line plots, pie charts, stacked bars, or bubble plots. Heatmaps are very useful for showing a gradient over a certain area. A word cloud is very useful if you're looking at text data, or if you want to compare quickly across different sets of text data. Networks give you a good sense of how things are connected. Or think about mapping: if you visualize data on top of a national map or a world map, that can be very helpful for quickly seeing differences or concentrations across areas. As we have said, consider using color and size, how you lay out your plots, and whether there is any hierarchical organization of your plots. All of that can be very useful. Always think about how you can leverage the different types of visualization capabilities for your particular types of data and analysis. Another thing to keep in mind is that visualization can be used differently in different settings. Generally, think about exploration versus explanation. If you're exploring your dataset, you're looking for patterns; you don't know what is in your dataset, and you're using different visualization capabilities to project or present the data in certain ways so that you can identify potential patterns. Explanation, in contrast, is the scenario where you already know what the patterns are, but you're trying to convey them effectively so that others can quickly grasp the main characteristics you're trying to describe. You visualize your data so that the pattern or information you're trying to convey is clearly communicated to your audience. Those two are of course related, but the focus can be very different, and the techniques may differ depending on what you're trying to do with your data visualization.
Also, when you talk about visualization, on one side there's the question of how to automate the process. You don't want to manually create all those plots; that is very time-consuming and may not scale well for larger-scale data mining tasks. So automation is important, and there's a lot of investigation on that front. But there are also good usage scenarios for interactive data visualization. That's the usual scenario when you're trying to convey your knowledge, or work jointly with a domain expert to explore a dataset. Again, in this process there's always a trade-off. Think about the efficiency of your approach, but also about how effective your visualization can be for what you're trying to do. Sometimes you can generate very simple plots; that's very efficient, and maybe good enough. In other cases you may need more complex visualization capabilities, which take a bit more time but can be really useful for your particular setting. All of this is very exciting. As I said, there's a lot of active research and work on visualization and visual analytics. Definitely continue exploring that as you're playing with your data.