A data engineer’s ecosystem includes the infrastructure, tools, frameworks, and processes for extracting data from disparate sources; architecting and managing data pipelines for the transformation, integration, and storage of data; architecting and managing data repositories; automating and optimizing workflows and the flow of data between systems; and developing the applications needed throughout the data engineering workflow.

Let’s look at data first. Based on how well-defined its structure is, data can be categorized as structured, semi-structured, or unstructured. Structured data follows a rigid format and can be organized neatly into rows and columns. This is the data you typically see in databases and spreadsheets. Semi-structured data is a mix of data that has consistent characteristics and data that doesn’t conform to a rigid structure. Emails are an example: an email has structured data, such as the names of the sender and recipient, but it also has the contents of the email, which is unstructured data. And then there is unstructured data: data that is complex and mostly qualitative, and that cannot be reduced to rows and columns. Examples include photos, videos, text files, PDFs, and social media content.

The type of data drives the kind of data repositories in which the data can be collected and stored, and also the tools that can be used to query or process it. Data also comes in a wide variety of file formats and is collected from a variety of data sources, ranging from relational and non-relational databases to APIs, web services, data streams, social platforms, and sensor devices.

A data engineer’s ecosystem also includes data repositories. There are two main types of data repositories: transactional and analytical. Transactional systems, also known as Online Transaction Processing (OLTP) systems, are designed to store high-volume day-to-day operational data.
Examples of such data include online banking transactions, ATM transactions, and airline bookings. While OLTP databases are typically relational, they can also be non-relational. Analytical systems, also known as Online Analytical Processing (OLAP) systems, are optimized for conducting complex data analytics. These include relational and non-relational databases, data warehouses, data marts, data lakes, and big data stores. The type, format, and sources of data, along with the context of use, influence which data repository is ideal.

Once data from disparate sources has been collated, it needs to be processed, cleansed, and integrated so that users can access it through a single interface. Data integration tools combine data from disparate sources into a unified view, which users can access to query and manipulate the data. This brings us to data pipelines, a set of tools and processes that cover the entire journey of data from source to destination systems. Data is integrated within a data pipeline using processes such as the Extract, Transform, and Load (ETL) process or the Extract, Load, and Transform (ELT) process.

The ecosystem also includes languages, which can be classified as query languages, programming languages, and shell and scripting languages. From querying and manipulating data with SQL, to developing data applications with Python, to writing shell scripts for repetitive operational tasks, these are important components of a data engineer’s workbench.

BI and reporting tools are used to collect data from multiple data sources and present it in a visual format, such as interactive dashboards. Using these tools, you can connect to and visualize your data in real time or on a predefined schedule. These are drag-and-drop products that do not require users to know any programming. While these tools are typically used by data and BI analysts, they are enabled and managed by data engineers.
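
To make the Extract, Transform, and Load process mentioned above a little more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the sample records, the `transactions` table, and the function names are all hypothetical, and the sketch uses nothing beyond the standard `json` and `sqlite3` modules. Semi-structured JSON records are extracted, cleansed into rows and columns, loaded into a relational table, and then queried with SQL.

```python
import json
import sqlite3

# Hypothetical semi-structured source records; note that the fields vary.
RAW_RECORDS = [
    '{"id": 1, "customer": " Alice ", "amount": "120.50", "channel": "ATM"}',
    '{"id": 2, "customer": "Bob", "amount": "80", "channel": "online", "note": "gift"}',
    '{"id": 3, "customer": "alice", "amount": null, "channel": "online"}',
]


def extract(lines):
    """Extract: parse each raw JSON line into a Python dict."""
    return [json.loads(line) for line in lines]


def transform(records):
    """Transform: drop records with no amount, then normalize names and types."""
    rows = []
    for rec in records:
        if rec.get("amount") is None:
            continue  # cleansing step: skip incomplete records
        rows.append(
            (
                rec["id"],
                rec["customer"].strip().title(),
                float(rec["amount"]),
                rec["channel"].lower(),
            )
        )
    return rows


def load(conn, rows):
    """Load: write the structured rows into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS transactions "
        "(id INTEGER PRIMARY KEY, customer TEXT, amount REAL, channel TEXT)"
    )
    conn.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def total_by_channel(conn):
    """Query the loaded data with SQL, as an analyst or BI tool might."""
    cur = conn.execute(
        "SELECT channel, SUM(amount) FROM transactions "
        "GROUP BY channel ORDER BY channel"
    )
    return cur.fetchall()


def run_pipeline():
    """Run extract, transform, and load against an in-memory database."""
    conn = sqlite3.connect(":memory:")
    load(conn, transform(extract(RAW_RECORDS)))
    return total_by_channel(conn)


if __name__ == "__main__":
    print(run_pipeline())
```

In an Extract, Load, and Transform variant, the raw records would be loaded into the repository first and the cleansing would happen afterwards, inside the destination system.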
Automated tools, frameworks, and processes for all stages of the data analytics process are part of the data engineer’s ecosystem. It’s a diverse, rich, and challenging ecosystem. Further on in the course, we will explore the different parts of this ecosystem in greater detail.