Welcome to the self-service lesson. This is all about being able to find the right data, validate it, and transform it for use. Data preparation is the act of manipulating or pre-processing raw data, which may come from disparate data sources, into a form that can readily and accurately be analyzed, for example, for business purposes. It is the first step in data analytics projects and can include many discrete tasks, such as loading data or data ingestion, data fusion, data cleaning, data augmentation, and data delivery. In this lesson, we will cover finding the right data using a variety of search criteria and filtering the results, and reviewing candidate data assets by profiling their content and establishing trust by viewing data lineage. We will also cover how the data consumer can wrangle the data using data visualization techniques, joining and transforming multiple data assets into one, and thus developing the recipe that data engineers can use for regular data movement where required.

Let's think about how we would find just the book we're looking for, say from a library or an online store. We want to ensure that the book we have chosen is relevant, so we want to be able to search using common language. It is likely that a number of titles will meet the criteria, so we want to be able to filter the results. Once we've found a number of titles that we think are of interest, we want to find out more: read details about the author, the category, the content, availability, and related books, and perhaps even preview the book and read reviews from other readers and purchasers. Once we've decided which book is best, we want to take action: borrow it from the library, buy it online, or download it to an electronic reading device. That is a targeted and satisfying book search experience, and we want the same for data.

Similarly, for self-service of data, we want to remove the bottlenecks that hamper data consumers in getting the data they need. The work we've been doing in establishing the data catalog, both as part of the establish phase of the methodology and in the earlier tasks of iterate, will really help our data consumers find what they're looking for. Having an intelligent data catalog allows the data consumer either to choose a particular catalog, such as the marketing catalog, or to search across all catalogs they have access to. We can help them refine their choice by recommending data assets that fit their particular pattern of data access, or by letting them review recently added data assets. They can take advantage of the opinions of other data consumers by looking at reviews of data assets, or they can filter the search results using a range of criteria, such as business terms, type of asset, tags, and data source.

Now the consumer can examine the potential data assets that have been recommended for them. For this data sprint, we can provide samples and profiling information that the subject matter expert can examine to ensure this is what they need. We may not have done any data integration at this early stage, but this ability to make the data, and the metadata that describes it, available to the consumer quickly ensures that we don't waste time preparing data that is not what the consumer wants. We can also take advantage of the policies and rules that we created in previous steps to protect any sensitive data.
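To make the profiling idea above concrete, here is a minimal sketch of what a data consumer might see when profiling a sample of a candidate asset. The lesson does not prescribe a specific tool, so this is only an illustration using pandas; the asset, column names, and values are hypothetical.

```python
import pandas as pd

def profile_asset(df: pd.DataFrame) -> pd.DataFrame:
    """Build a simple column-level profile of a candidate data asset."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),       # inferred data type per column
        "non_null": df.notna().sum(),          # how many usable values exist
        "null_pct": (df.isna().mean() * 100).round(1),  # completeness check
        "distinct": df.nunique(),              # cardinality, useful for keys
    })

if __name__ == "__main__":
    # Hypothetical sample of a marketing data asset pulled from the catalog.
    sample = pd.DataFrame({
        "customer_id": [101, 102, 103, 104],
        "region": ["EU", "NA", None, "APAC"],
        "lifetime_value": [1250.0, 430.5, 980.0, None],
    })
    print(profile_asset(sample))
```

A quick profile like this, shown alongside a data sample, is often enough for the subject matter expert to decide whether the asset is worth pursuing before any integration work is done.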
Part of understanding whether a data asset is suitable is understanding where it comes from, or its provenance. Data lineage is an important aspect of understanding the journey a data asset has taken: what sources it came from, what transformations have been applied to it, and how it has been used. Data lineage that provides a high-level overview, and that can be drilled into in more detail where required, helps the data consumer explore how the data flows through the data pipeline, and is a key aspect of establishing trust in the data.

Now the consumer can choose the data they want to work with and use the available visualizations to understand it fully, really playing around with the data to ensure it fits their purposes. Giving the data consumer the ability to work directly with the data, immediately seeing the results of the operations they apply to the data assets they're working with, ensures that they can confirm the data requirements being delivered in this sprint are appropriate. The data consumers are truly the experts in what is required, and ensuring that a full self-service experience is available to them, building on the artifacts that have been provided, gives both fast validation and a short feedback loop when improvements are needed.

Allowing the data consumer to work directly with the data and shape it as required provides a rapid prototyping approach that ensures data requirements are met without spending time on where the data will be hosted or how quickly it needs to be delivered on a repeating schedule. It also allows the data consumer to experiment with the data, ensuring that the transformations and the combining of multiple sources can meet their needs, learning from real data values, and creating a recipe that can be used for data integration and movement. Once the data meets the needs of the data consumer, the data engineers can plan to make it available on a regular basis and deliver the data pipeline on a schedule.
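As a sketch of the "recipe" idea, the cleaning, join, and aggregation steps the consumer settles on through experimentation can be captured in one place that a data engineer could later operationalize on a schedule. The example below assumes pandas and two made-up assets, customers and orders; the column names and the particular transformations are illustrative only.

```python
import pandas as pd

def build_recipe(customers: pd.DataFrame, orders: pd.DataFrame) -> pd.DataFrame:
    """Recipe sketch: clean, join, and aggregate two assets into one view."""
    # Clean: drop rows without a usable join key.
    customers = customers.dropna(subset=["customer_id"]).copy()
    orders = orders.dropna(subset=["customer_id"]).copy()

    # Join: combine the two data assets on the shared key.
    combined = orders.merge(customers, on="customer_id", how="left")

    # Transform: aggregate to one row per customer and region.
    return (
        combined.groupby(["customer_id", "region"], as_index=False)
                .agg(order_count=("order_id", "count"),
                     total_spend=("amount", "sum"))
    )

if __name__ == "__main__":
    customers = pd.DataFrame({
        "customer_id": [101, 102, 103],
        "region": ["EU", "NA", "APAC"],
    })
    orders = pd.DataFrame({
        "order_id": [1, 2, 3, 4],
        "customer_id": [101, 101, 103, 102],
        "amount": [25.0, 40.0, 15.5, 60.0],
    })
    print(build_recipe(customers, orders))
```

Expressing the prototype as a single, repeatable function is what lets the data engineers pick it up later and schedule it as a regular data movement job without reworking the consumer's logic.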