A key concept in data engineering is the data pipeline. A data pipeline moves data from one place to another, transforming it along the way to make it more usable, much as a pipe carries wastewater through reclamation until it emerges as purified tap water.
More precisely, a data pipeline is a series of connected data processing stages, where each stage's output becomes the input to the next. The pipeline is broadly divided into five main stages, beginning with the source data furthest upstream:
1. Data Sourcing — Raw data can come from a variety of sources, including batch data and streaming data from real-time operations. Other sources include data marketplaces and third-party vendors, as well as internal systems such as transactional and customer databases.
2. Data Ingestion — Ingestion is the process of collecting raw data from its sources. The step sounds simple, but the volume, variety, and velocity of ingested data keep growing for most enterprises, and maintaining data integrity while drinking from that firehose presents resource, security, and bandwidth challenges.
3. Data Processing — Following ingestion, data is transformed from raw formats into something usable for storage. Two data integration patterns are common: Extract, Transform, Load (ETL) and Extract, Load, Transform (ELT). In ETL, data is transformed on the way to storage; in ELT, data is transformed after it has been stored. The two are architectural choices, so the question is not which is better, but which is appropriate for the intended usage (a minimal sketch of the ETL flow follows this list).
4. Data Storage — The storage stage focuses on keeping processed data in a fast, efficient store, ready for analysis. ETL is a popular processing-and-storage pattern because the data lands already transformed and cleansed: data scientists can analyze it directly from the data warehouse, produce reports, and discover insights.
5. Data Access — Data access, or consumption, is the stage of making stored data usable to end users. At one end of the usability spectrum, warehouse data is exposed to data scientists, who can slice it, reshape it, and combine it into new views; at the other, those views are standardized and published to downstream consumers through reporting tools or data marts.
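To make the ingestion, processing, and storage stages concrete, here is a minimal ETL sketch in Python. It assumes a hypothetical `orders.csv` batch file as the source and uses SQLite as a stand-in for a data warehouse; the file name, table name, and transformation rules are illustrative only, not tied to any particular product.

```python
import csv
import sqlite3

def extract(path):
    """Extract: read raw rows from a batch source (here, a CSV file)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: cleanse and reshape raw rows into an analysis-ready form."""
    cleaned = []
    for row in rows:
        # Drop records missing a customer id; normalize types and casing.
        if not row.get("customer_id"):
            continue
        cleaned.append({
            "customer_id": row["customer_id"].strip(),
            "amount": round(float(row["amount"]), 2),
            "country": row["country"].strip().upper(),
        })
    return cleaned

def load(rows, db_path="warehouse.db"):
    """Load: store transformed rows in the warehouse (SQLite as a stand-in)."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS orders (customer_id TEXT, amount REAL, country TEXT)"
        )
        conn.executemany(
            "INSERT INTO orders VALUES (:customer_id, :amount, :country)", rows
        )

if __name__ == "__main__":
    # ETL ordering: the data is transformed before it is stored.
    load(transform(extract("orders.csv")))
```

In an ELT variant, the raw rows would be loaded into the warehouse first and the transformation would run afterward inside the warehouse, typically as SQL.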
A sixth component is often added to the data pipeline: governance and monitoring, which surrounds every stage.
6. Data Governance — Data governance is not a stage in the pipeline per se, but an overarching framework that defines data safeguards such as access control, encryption, network security, usage monitoring, and auditing. These mechanisms are integrated natively into the security and governance layers of the entire pipeline. Governance is about more than enforcing privacy requirements; it also builds data trust, stewardship, and shared organizational meaning.
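As a rough illustration of how a couple of these safeguards can surface in code, the sketch below wraps warehouse queries with a role check and an audit-log entry. The roles, user names, and log format are hypothetical; in practice, access control, encryption, and auditing usually come from the warehouse's own identity and security features rather than application code.

```python
import logging
import sqlite3
from datetime import datetime, timezone

# Hypothetical role assignments; real systems pull these from an identity provider.
ROLES = {"ana": "analyst", "raj": "engineer", "guest": "viewer"}
ALLOWED_TO_QUERY = {"analyst", "engineer"}

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("audit")

def governed_query(user, sql, db_path="warehouse.db"):
    """Run a query only if the user's role permits it, and audit the access."""
    role = ROLES.get(user)
    if role not in ALLOWED_TO_QUERY:
        audit_log.warning("DENIED user=%s role=%s sql=%r", user, role, sql)
        raise PermissionError(f"{user} is not allowed to query the warehouse")
    audit_log.info("ALLOWED user=%s role=%s at=%s sql=%r",
                   user, role, datetime.now(timezone.utc).isoformat(), sql)
    with sqlite3.connect(db_path) as conn:
        return conn.execute(sql).fetchall()

# Example: an analyst's query succeeds and is logged; a viewer's raises PermissionError.
# governed_query("ana", "SELECT country, SUM(amount) FROM orders GROUP BY country")
```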