Welcome back. You've covered many topics and quizzes and learned a lot along the way. In this learning segment, we'll discuss data analytics and data science. You may know that data analytics and data science are connected, and you likely know their two foundational components: statistics and probability. Let's touch on the basics.

Data analytics is the use of data and analytical methods to derive insights that help inform decision making. Data science is an interdisciplinary field: computer science, software engineering, statistics, and product design are all part of what makes a data scientist successful. This lesson will not go into great technical detail about data science, but I will highlight key concepts and demonstrate them through examples.

There are two key reasons to understand the linkage between data analytics and data science. First, in organizations, data analytics is often part of data science. With increasing digital transformation in and around organizations, management wants data-driven answers that are strategic, actionable, and scalable, and efforts to provide holistic solutions to data problems are increasingly driven by data analytics. Suppose, for example, that a data analytics project yields useful metrics to track across the organization. In that case, the organization may want the metrics delivered digitally through a web application, and the data analytics project could also be considered a data science project. Second, data analytics and data science are sometimes used interchangeably, depending on how data driven an organization is. In organizations that are extremely data driven and technology forward, data analytics may be treated as the same thing as data science.

Statistics and probability are foundational in data science and are applied in data analytics projects. If you work with a data scientist on a complex data analytics project, you can benefit from learning these terms, because data scientists use these concepts in their everyday work.

Statistics is a field of study involving the collection, analysis, interpretation, and presentation of data. It is used in every industry and deals with all aspects of data, from understanding raw data, to designing surveys, to building statistical models that characterize natural phenomena, to presenting results. All data analytics projects involve some form of statistics.

Statistical methods include descriptive statistics, which are used to explore and communicate data. Descriptive statistics describe a population, a subset of a population in a study, or a data sample, and they capture measures of variability, that is, how close together or far apart the observations or data points are. Common descriptive measures are the maximum (the observation with the largest value), the minimum (the observation with the smallest value), the median (the middle observation when observations are ranked from lowest to highest), the mean (the average), and the mode (the value that appears most frequently in a data set).
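As a minimal sketch, the snippet below computes each of these measures with Python's built-in statistics module; the observations are hypothetical yearly bear counts invented for illustration.

```python
# Descriptive statistics with the standard library; the data is made up.
import statistics

observations = [12, 15, 15, 18, 22, 27, 31]  # hypothetical yearly bear counts

print("maximum:", max(observations))               # largest value   -> 31
print("minimum:", min(observations))               # smallest value  -> 12
print("median:", statistics.median(observations))  # middle value    -> 18
print("mean:", statistics.mean(observations))      # average         -> 20
print("mode:", statistics.mode(observations))      # most frequent   -> 15
```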
Graphs are commonly used in statistics to present and explore data, and different types of graphs have different strengths in terms of the data and insights they best represent.

A line chart can be used to see key trends in data over a period of time. In this line graph, you can see that the bear population has been steadily growing and is expected to keep growing. You can also gather that the dolphin population seems to be steadily declining over time, while the whale population shows greater fluctuation. The graph gives an initial sense of how the data looks and how it can be described.

A bar chart can be used when you want to understand the quantity or volume associated with discrete categories. This graph suggests that note pads are the most sold item across major cities, followed by pens and envelopes. You can also see that pens and envelopes seem to have similar sales volumes. You may or may not intuitively expect this when you receive the data set; plotting an initial graph gives you a sense of what might be going on with the data.

A scatter plot can be used to explore the relationship between two sets of data. This graph gives you a sense that an apartment's living area is positively correlated with its sale price: the more living space an apartment has, the more likely it is to sell at a higher price. But as you can see from the scatter plot, the data set also includes outliers, observations that do not conform to the general trend. Two apartments with ground living space over 4,000 square feet sold at prices comparable to apartments in the 1,000 to 2,000 square foot range. With this information, you might want to learn more about the outliers: understand how the data was gathered and how ground space was defined, measure these outlying observations, and determine whether the sale prices of those apartments were affected by other variables.
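The sketch below shows how these three chart types might be produced with matplotlib; every data point is an invented stand-in for the examples above (animal populations, office supply sales, and apartment prices), not real data.

```python
# Three common exploratory chart types; all values are hypothetical.
import matplotlib.pyplot as plt

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 4))

# Line chart: population trends over time.
years = [2015, 2016, 2017, 2018, 2019, 2020]
ax1.plot(years, [40, 44, 49, 55, 60, 67], label="bears")
ax1.plot(years, [90, 84, 79, 72, 66, 61], label="dolphins")
ax1.plot(years, [30, 38, 27, 41, 29, 36], label="whales")
ax1.set_title("Line: trends over time")
ax1.legend()

# Bar chart: sales volume by discrete category.
items = ["note pads", "pens", "envelopes"]
ax2.bar(items, [520, 310, 295])
ax2.set_title("Bar: volume by category")

# Scatter plot: living area vs. sale price, with two outliers
# (large apartments sold at low prices) breaking the general trend.
areas = [850, 1100, 1400, 1700, 2100, 2600, 3100, 4100, 4400]
prices = [120, 150, 185, 210, 260, 320, 380, 160, 170]  # $ thousands
ax3.scatter(areas, prices)
ax3.set_title("Scatter: area vs. price")
ax3.set_xlabel("living area (sq ft)")
ax3.set_ylabel("price ($K)")

plt.tight_layout()
plt.show()
```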
Are you familiar with the terms outcome (or dependent) variable and independent variable? An outcome or dependent variable is a measurement or result that depends on other variables or attributes in an experiment or statistical test. For example, suppose Lenny decides to stop using a food delivery service because they are displeased with the speed of delivery, the quality of the food, the friendliness of the staff, and the price of the food. Lenny's decision to no longer be a customer depends on those variables. The variables themselves, speed of delivery, quality of food, friendliness of staff, and price of food, are independent, because none of them is affected by the values or attributes of the others. There are many tests in statistics to ensure that outcome or dependent variables are properly defined and that independent variables are not connected to one another.

Before we move to probability, I want to briefly mention a few other concepts in statistics: the population (all the values you are interested in in a data set), a sample (one or more observations drawn from the population), and the range (the difference between the smallest and largest values in a data set).

Probability is another foundational component of statistics and, therefore, a foundation of data science. In the simplest terms, probability is the branch of mathematics that studies the chance of an event occurring. The event could be anything: a human activity, a raindrop hitting the roof, a company achieving a successful quarter, or a patient recovering from a bad prognosis. The study of probability ranges from probability theory to the study of outcomes associated with natural phenomena. When you read a report about the likelihood of something occurring or not occurring, you're reading about the probability of the occurrence of an outcome.

Why is this important to accounting and finance? Probability is a key element of predictive and advanced analytics, which can offer significant value to these fields. Predicting the number of customers who may default in the future, which could affect accounts receivable, is a probability calculation: you want to know the likelihood of default by a customer. Another probability problem could be: given various conditions, from company productivity to macroeconomic conditions, what is the likelihood of revenue growth next year being higher than this year's? This is an example of conditional probability, because one event, revenue growth next year, is dependent on events that happen this year, such as company productivity and macroeconomic conditions.

As you'll recall, probability is the likelihood of an event occurring, and probability has a key concept: random variables. A random variable is a variable whose possible values correspond to potential outcomes that occur with different probabilities. In other words, the value of a random variable depends on a random event, and this random event could be anything. As an example, a random variable could represent the outcome of a coin flip: if the coin comes up heads, the random variable could be assigned the value one; if it comes up tails, the value zero.

The random variable in the revenue growth problem is similar. You want to understand whether revenue growth next year will be above this year's revenue growth or not. The value of the random variable in this case would be discrete: you can assign one for above this year's revenue growth, or zero for below it. However, what if the question is tweaked a bit and instead asks, what is the probability of the revenue figure being above $100 million next year, given that this year's revenue is $100 million? The random variable here, where the revenue figure will land, has a continuous distribution. The actual revenue figure can land anywhere, and there is a probability associated with every possible outcome. This is called a continuous random variable.

A probability distribution describes the probability of an outcome occurring given a set of observations or data from an experiment, a situation, or a random phenomenon. There are different types of probability distributions describing different situations. Let's go over a well known probability distribution to help you understand what a distribution represents. You may have heard of the normal distribution, or bell curve. A normal distribution is a type of continuous probability distribution that describes a particular pattern of chance events in a random phenomenon. The bell curve's peak is the mean, and the standard deviation describes the variation or dispersion of the values or observations. If the standard deviation is low, there is little variation, and most values are close to the mean. If the standard deviation is high, the values are more spread out and tend to deviate more from the mean. For the normal distribution, most of the values fall in a cluster near the peak, creating the distinctive bell shaped curve, with very few observations far from the mean, so you see two thin tails. An example is the time people fall asleep each night: most observations will likely be around 10 PM to midnight, with very few before 10 PM or after midnight.

The next learning segment will cover modeling in data analytics and data science. Before you go, the two short sketches below recap these probability ideas in code.
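First, a minimal sketch of the coin flip random variable and the conditional revenue question, estimated with a simple Monte Carlo simulation. The growth model (year over year growth drawn from a normal distribution with a 2% mean and a 5% standard deviation) and all figures are assumptions invented for illustration, not a real forecasting method.

```python
# Simulating a discrete random variable and a conditional probability;
# all figures below are hypothetical.
import random

random.seed(42)  # fixed seed so the run is reproducible

# Discrete random variable: assign 1 to heads and 0 to tails.
flip = random.choice([1, 0])
print("coin flip:", "heads" if flip == 1 else "tails")

# Given this year's revenue of $100M, estimate the probability that next
# year's revenue exceeds $100M, assuming (hypothetically) that growth is
# normally distributed around +2% with a 5% standard deviation.
trials = 100_000
above = 0
for _ in range(trials):
    growth = random.gauss(0.02, 0.05)  # hypothetical growth rate
    next_year = 100.0 * (1 + growth)   # next year's revenue in $ millions
    if next_year > 100.0:
        above += 1

print("estimated P(revenue > $100M next year):", above / trials)  # ~0.66 here
```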
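Second, a sketch of the bell curve idea: draw many values from a normal distribution and check how tightly they cluster around the mean. The fall asleep times (a mean of 11 PM and a standard deviation of one hour) are invented to mirror the example above.

```python
# Sampling a normal distribution and checking its spread; the sleep-time
# parameters are made up for illustration.
import random
import statistics

random.seed(7)

# Hypothetical fall-asleep times, in hours past noon (11.0 means 11 PM).
times = [random.gauss(11.0, 1.0) for _ in range(10_000)]

mean = statistics.mean(times)
std = statistics.stdev(times)
share = sum(1 for t in times if abs(t - mean) <= std) / len(times)

print(f"mean: {mean:.2f} hours past noon")
print(f"standard deviation: {std:.2f} hours")
print(f"share within one standard deviation: {share:.1%}")  # ~68% for a bell curve
```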