Hi, nice to see you again. We've covered a lot of material so far, so let's keep going. In this learning segment we'll discuss how data is described and interpreted. Along with learning the common ways of describing data, you'll learn about considerations associated with different data sets. When you understand data as defined by practitioners in data analytics and data science, you can tackle various problems and successfully implement data analytics projects.

Let's begin with key terms and taxonomy. Data is factual information that can be represented as a group of values, either quantitative or qualitative. Datum is the singular form of the word data and is sometimes used to refer to a single data point. Data is sometimes used synonymously with a data set: a collection of data points with a specific structure reflecting the characteristics of an object. Data can be numbers, words or text, images, videos or clips of videos, observations, measurements or descriptions. A datum conveys a single meaning. For example, the measurement of your height is a datum.

Data taxonomy is a structure or classification of data that separates data according to common characteristics. For example, in a data set describing an inventory of books in a library, a data taxonomy can be established as presented in this chart. Data taxonomy is important because it helps individuals and teams inside an organization align on the relevant hierarchy of data. There's also the term data ontology. A data ontology is a conceptual representation of knowledge about data: it establishes how categories, concepts and entities relate to each other. There are formalized data ontologies in different domains establishing specifically known relationships among the categories, concepts and entities used in a subject domain.
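To make the taxonomy idea concrete, here is a minimal sketch of the library inventory example as a simple hierarchy. The category names below are hypothetical illustrations, not the ones shown in the chart.

```python
# A hypothetical data taxonomy for a library book inventory,
# represented as a nested structure: broad categories at the top,
# more specific categories underneath.
library_taxonomy = {
    "Books": {
        "Fiction": ["Mystery", "Science Fiction", "Romance"],
        "Nonfiction": ["Biography", "History", "Reference"],
    }
}

def list_subcategories(taxonomy, parent):
    """Return the immediate children of a category in the hierarchy,
    an empty list for a leaf category, or None if not found."""
    for node, children in taxonomy.items():
        if node == parent:
            return list(children)
        if isinstance(children, dict):
            found = list_subcategories(children, parent)
            if found is not None:
                return found
        elif parent in children:
            return []  # leaf category: no further subdivisions
    return None

print(list_subcategories(library_taxonomy, "Fiction"))
# ['Mystery', 'Science Fiction', 'Romance']
```

A taxonomy like this only captures the hierarchy; an ontology would additionally record how the categories relate to one another.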
For example, there is a data ontology for prescription drug compounds in health care demonstrating how those drugs and their compounds relate to one another. Here's an accounting example: an ontology can demonstrate how various elements on a balance sheet relate to one another. The difference between the two is that a data taxonomy establishes a hierarchy of data, while a data ontology demonstrates the relationships between a set of concepts.

The term big data is something you may hear often in your day-to-day activity. The concept of big data is rooted in the volume or size of the data. The data points that make data big can come from all kinds of sources: a click or tap on social media, a transaction, the temperature of a specific location at a specific time, or traffic data at a given intersection in a particular city at a specific time. With all these data points being generated and accumulated, organizations and individuals are surrounded by big data, and it is growing every second as individuals and organizations generate more data everywhere globally.

Small data is equally important, but it's not discussed as much as big data. Small data, in the simplest terms, means data that is smaller in volume and can be easily stored and managed; the data is small enough that it can be stored on our phones or computers. Although definitions of small data may vary, its value is widely recognized. Small data can be the exercise data on your fitness app: how many steps you take, how many hours of sleep you get or how far you travel for work every day. When collected over time, these examples of small data inform us about our personal well-being. Another example of small data is primary data gathered from focus groups, customer surveys and polls. Such data may not seem influential on its own, but with the right problem framing and scope, these sets of small data could help answer important questions.
The rise of big data and small data gives accountants and finance professionals access to data that was not available before. Besides traditional financial data, there may be additional, non-traditional data sources for you to understand and potentially use as part of data analytics projects.

Let's take a moment to familiarize ourselves with the six V's of data: volume, variety, velocity, veracity, value and variability.

How much data exists? Volume of data is the amount of data stored, managed and used for analysis. The volume of data has implications for deciding how to store the data, the cost of storage, how to manage the data securely and how to provision the data. Volume is often described in terms of the number of records. Examples of data volume might include the number of customer records, the number of images and the time horizon of the data. Depending on the data analytics project's problem and scope, you might ask the following about the volume of data. Is it enough data for the analysis? How is the data optimally stored and processed? With the volume of data available and the stakeholders that might need access to it, how do you set permissions and safely share data? The answers to these questions will determine whether you can proceed with the data analytics project or whether more data needs to be collected to increase the volume for data analysis.

Variety of data means, you guessed it, different types of data. For instance, let's look again at the example of a rapidly growing athletic clothing company from previous learning segments. The CEO, Sharon, asked the company's CFO, Octavia, to prepare for an annual investors briefing on the financial plan and growth forecast. Sharon would like to show the investors data-driven insights and share qualitative perspectives on how the business is doing. In this case, Octavia has financial data that is reported monthly. This financial data is numerical and structured.
There may also be data from customer feedback about what customers like or don't like about the products. The transcripts of customer calls are in text form and are an example of unstructured data, a different type of data from the numerical, structured data. In some instances, having different types and sources of data can help enrich the overall analysis. It's worth noting that when using various data sources in different formats and with different attributes, it may be necessary to take time to cleanse, integrate and generally make sense of the data before attempting to glean any insights.

Velocity of data refers to the speed at which a data point is generated. High velocity means data points are generated rapidly, which requires a specific way to ingest, process and analyze the data, because high velocity data almost always includes new information. An example of high velocity data is stock prices: every second, a new stock price is a new data point. A Facebook post is generally considered high velocity data because there is always a new post. Streaming services generate high velocity data because while you watch a streamed show, each second of the content streamed on your TV or your laptop is a high velocity data point. Also consider data from sensors on a farm, at a weather station or in your car; those sensors could be collecting high velocity data. Low velocity data could be revenue data for reporting purposes. There's no need for revenue data to be reported or tracked by the second. In fact, most businesses track revenue for each month, each quarter and each year. Those are meaningful data points for comparison, measurement and benchmarking.

Veracity is the trustworthiness of the data. When you have a data set, can you trust that it is accurate, truthful and complete? Veracity of data relies on strong systems of record and data governance, plus auditing models and practices that ensure the data can be trusted.
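A veracity check can be as simple as scanning a data set for incomplete or implausible records before trusting it. Here is a minimal sketch; the field names, sample records and validation rules are hypothetical.

```python
# Hypothetical invoice records with deliberate quality problems,
# used to illustrate a basic veracity (completeness/accuracy) check.
records = [
    {"invoice_id": "A-100", "amount": 250.00, "customer": "Acme"},
    {"invoice_id": "A-101", "amount": -75.00, "customer": "Acme"},    # suspicious negative
    {"invoice_id": None,    "amount": 90.00,  "customer": "Blythe"},  # missing ID
]

def veracity_issues(rows):
    """Flag records that are incomplete or fail a plausibility rule."""
    issues = []
    for i, row in enumerate(rows):
        if row["invoice_id"] is None:
            issues.append((i, "missing invoice_id"))
        if row["amount"] < 0:
            issues.append((i, "negative amount"))
    return issues

print(veracity_issues(records))
# [(1, 'negative amount'), (2, 'missing invoice_id')]
```

In practice, checks like these would be part of the systems of record, data governance and auditing practices mentioned above, rather than ad hoc scripts.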
The value of the data depends on how the data can be used ethically. The value of the data also depends on the problem that a data analytics project needs to solve. Drawing on the example of the athletic clothing company, data on competitors' growth from publicly available information would be valuable and informative because those metrics would benchmark the company's performance. An example you've probably encountered is advertising companies' use of aggregated demographic information to help them serve ads for products they believe customers want to see. For an advertiser, demographic information is very valuable. Personal health information is highly valuable to patients, doctors and insurers. Because of the value of this data, regulations are put in place to protect our health information so that its value is not wrongly exploited or monetized.

The sixth V is variability. Variability of data can be interpreted in two main ways. First, variability could mean the versatility of the data. For example, weather data can be used to learn how to dress for the day, and weather data could also be important for insurance companies as they underwrite insurance for farmers, homeowners and businesses. Finance departments can use customer purchasing data to analyze customer spend and top-line growth. Accountants and finance professionals can use customer purchasing data to audit revenue recognition. Similarly, marketing departments can use customer purchasing data to understand customer decisions and use those insights to improve marketing messages and operations.

The second way variability of data can be interpreted is from a statistics perspective. Variability of data refers to the dispersion of data. Are there many data points far away from the average, or mean, of the data set? How are these data points different from each other? There are statistical methods that help describe the variability of data sets in a technical or mathematical sense.
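The statistical sense of variability, dispersion around the mean, can be sketched with two made-up data sets that share the same mean but differ in how far their points spread.

```python
import statistics

# Two hypothetical data sets with the same mean (100) but very
# different dispersion, illustrating variability in the statistical sense.
steady = [100, 101, 99, 100, 100]    # points cluster near the mean
dispersed = [40, 160, 75, 130, 95]   # points spread far from the mean

for name, data in (("steady", steady), ("dispersed", dispersed)):
    mean = statistics.mean(data)
    stdev = statistics.stdev(data)   # sample standard deviation
    print(f"{name}: mean={mean}, stdev={stdev:.1f}")
```

The two sets average the same, yet the standard deviation, one common measure of dispersion, is far larger for the second, which is exactly the distinction the questions above are probing.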
It is helpful to know that variability of data could convey these two different concepts in different contexts. The six V's help with understanding the nature of the data, and you can use the six V's framework in a variety of situations such as data analytics projects, auditing data used in projects and setting processes in areas such as data governance. Next, we'll discuss dimensions of data, data quality, data reliability and data cleansing.