Understanding Data Engineering
Data has been called the new resource on which gigantic enterprises are built, so it is no surprise that dedicated data roles have emerged, such as Data Engineer, Data Scientist, and Data Architect. While they all work with data, each discipline takes a distinct view of, and responsibility for, data that both differs from and complements the others.
Defining Data Engineering
Data Engineering is the set of processes used to turn raw data into usable data for data scientists and data consumers. Sometimes it's referred to as Information Engineering to emphasize the importance of drawing meaning and significance from source data.
The data pipeline
A key concept in data engineering is the data pipeline. The data pipeline moves data from one place to another while applying transformations that make the data more usable, much as a pipe carries wastewater through reclamation until it ends up as purified tap water.
Specifically, a data pipeline is a set of data processing steps connected in series, where the output of each stage becomes the input of the next. Five main stages broadly divide the data pipeline, beginning with source data furthest upstream. Those stages are:
1. Data Sourcing — Raw data can come from a variety of sources, including batch data and streaming data from real-time operations. Other sources include data marketplaces and third-party vendors, as well as internal systems that hold transactional and customer data.
2. Data Ingestion — Ingestion is the process of collecting raw data from sources. The step seems simple, but the volume, variety, and velocity of ingested data are, on average, growing for individual enterprises, so maintaining data integrity while drinking from the firehose of big data presents resource, security, and bandwidth challenges.
3. Data Processing — Following ingestion, data is transformed from raw formats into something usable for storage. Two data integration methods are common: Extract, Transform, and Load (ETL) and Extract, Load, and Transform (ELT). In ETL, data is transformed on the way to storage; in ELT, data is transformed after being stored. The two are architectural choices, so the question is not which is better but which is appropriate for the intended usage.
4. Data Storage — The data storing phase is focused on storing processed data in a fast and efficient way, ready for analysis. ETL is a popular processing and storage pattern because it stores already transformed and cleansed data. Data scientists can immediately analyze this data from a data warehouse, produce reports, and discover insights.
5. Data Access — Data access, or consumption, is the stage that makes stored data usable to end users. At one end of the usability spectrum, data is made available from a warehouse to data scientists, who can slice it, change it, and combine it into new views. At the other end, those views are standardized and made available to downstream data consumers, who access them in reporting tools or data marts.
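The idea of stages connected in series, and the difference between the ETL and ELT orderings, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: the record layout, the `clean` rule, and the in-memory lists standing in for storage are all hypothetical.

```python
from typing import Callable, Iterable

Record = dict
Stage = Callable[[list[Record]], list[Record]]


def ingest(sources: Iterable[list[Record]]) -> list[Record]:
    """Ingestion: collect raw records from every source into one batch."""
    return [record for source in sources for record in source]


def clean(records: list[Record]) -> list[Record]:
    """Processing: drop records without an id and convert amounts to float."""
    return [
        {**r, "amount": float(r["amount"])}
        for r in records
        if r.get("id") is not None
    ]


def make_store(target: list[Record]) -> Stage:
    """Storage: build a stage that appends records to `target` and passes them on."""
    def store(records: list[Record]) -> list[Record]:
        target.extend(records)
        return records
    return store


def run_pipeline(records: list[Record], stages: list[Stage]) -> list[Record]:
    """Run stages in series: each stage's output is the next stage's input."""
    for stage in stages:
        records = stage(records)
    return records


# Two hypothetical sources: a batch export and a buffer of streamed events.
batch_source = [{"id": 1, "amount": "10.50"}, {"id": None, "amount": "3"}]
stream_source = [{"id": 2, "amount": "7"}]
raw = ingest([batch_source, stream_source])

etl_warehouse: list[Record] = []  # stand-in for a warehouse of cleaned data
elt_lake: list[Record] = []       # stand-in for storage that receives raw data

# ETL: transform on the way to storage.
etl_out = run_pipeline(raw, [clean, make_store(etl_warehouse)])

# ELT: load the raw records first, transform afterwards.
elt_out = run_pipeline(raw, [make_store(elt_lake), clean])
```

In this sketch the ETL store only ever sees cleaned records, while the ELT store keeps the raw batch (including the record with no id) and the transformation runs later. Which ordering fits depends on how the stored data is meant to be used.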
A sixth data pipeline component is often added, namely the Governance and Monitoring component that surrounds each stage.
6. Data Governance — Data governance is not a stage in the data pipeline per se, but an overarching framework that defines data safeguards such as access control, encryption, network security, usage monitoring, and auditing mechanisms. These mechanisms are integrated into the security and governance layers of the entire pipeline. Data governance does more than enforce privacy requirements; it also creates data trust, stewardship, and organizational meaning.
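One way to picture governance surrounding a stage rather than sitting in the sequence is a wrapper that checks access and records an audit entry before the stage runs. The sketch below assumes a hypothetical role-to-permission table and an in-memory audit log; a real deployment would rely on its platform's policy and logging services.

```python
import functools
from datetime import datetime, timezone

# Hypothetical role-to-permission mapping, for illustration only.
PERMISSIONS = {"engineer": {"ingest", "process"}, "analyst": {"access"}}

audit_log: list[dict] = []  # stand-in for a durable audit trail


def governed(action: str):
    """Wrap a pipeline stage with an access check and an audit entry."""
    def decorator(stage):
        @functools.wraps(stage)
        def wrapper(role: str, *args, **kwargs):
            allowed = action in PERMISSIONS.get(role, set())
            # Every attempt is audited, whether or not it is allowed.
            audit_log.append({
                "time": datetime.now(timezone.utc).isoformat(),
                "role": role,
                "action": action,
                "allowed": allowed,
            })
            if not allowed:
                raise PermissionError(f"role {role!r} may not {action}")
            return stage(*args, **kwargs)
        return wrapper
    return decorator


@governed("process")
def process(records: list[dict]) -> list[dict]:
    """A processing stage: keep only records that carry an id."""
    return [r for r in records if r.get("id") is not None]


cleaned = process("engineer", [{"id": 1}, {"id": None}])

try:
    process("analyst", [{"id": 2}])
except PermissionError as exc:
    denied_message = str(exc)
```

The stage itself contains no governance logic. The same wrapper could be applied to ingestion, storage, and access stages, which is the sense in which governance surrounds each stage.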
Data engineering vs data science
Data engineering and data science developed together as disciplines, having diverged from a single school of thought described in the information engineering methodology of the 1970s and 1980s. As data grew and the information engineering discipline was strained by heavier data management and analysis loads, new roles formed in two directions: those handling the data infrastructure, and those analyzing data content and producing reports and insights from it. Today, those two roles are Data Engineer and Data Scientist, respectively, and big data has all but ensured the need for both.
While the two roles have widely overlapping knowledge bases, data engineers are software engineers who focus on data flows, data pipelines, and the infrastructure that supports them throughout an organization. In a way, data scientists pick up where the engineers leave off and take over analyzing data content. Data scientists are subject-specific analysts who bring analytical, statistical, and programming skills to the data itself to derive use from it.
Data engineering vs data architecture
Data engineering and data architecture are two more terms that are often confused, but they are fundamentally different disciplines. Data engineering concerns the data pipeline and infrastructure, whereas data architecture concerns the framework used to govern data: specifically, the data models, policies, rules, and standards that make up the overall data vision. Data scientists and engineers reference these standards and models when data must be integrated between systems.
Notably, data architects define the target state of the data architecture, which is conceptualized in three data architecture layers: conceptual layer, logical layer, and physical layer.
- Conceptual layer — The conceptual model starts from the overall content that the data must represent, which informs the definition of the data structures and the entities that form its foundational elements. The conceptual data model focuses on entities, their characteristics, and the relationships between them.
- Logical layer — The logical model represents the elements described in the conceptual model in greater technical detail, defining data structures along with keys, data types, and attributes. These details do not include technical specifications for any particular database, so the logical model can serve as a blueprint for building the data structures in any database product.
- Physical layer — The logical model is then translated into a physical model of the database application. The physical model specifies a blueprint fit for the implementing database.
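The step from a logical model to a physical one can be illustrated with a short Python sketch. It assumes a hypothetical two-entity model (customers and their orders) and uses SQLite as the implementing database only because it ships with Python. The `Attribute` and `Entity` classes and the type mapping are illustrative, not a standard modeling API.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Attribute:
    name: str
    logical_type: str                 # database-neutral: "integer", "text", "decimal"
    primary_key: bool = False
    references: Optional[str] = None  # target of a foreign key, if any


@dataclass(frozen=True)
class Entity:
    name: str
    attributes: tuple


# Conceptual layer (in words): a customer places many orders.
# Logical layer: entities with keys, data types, and attributes, but no
# database-specific details.
customer = Entity("customer", (
    Attribute("customer_id", "integer", primary_key=True),
    Attribute("name", "text"),
))
customer_order = Entity("customer_order", (
    Attribute("order_id", "integer", primary_key=True),
    Attribute("customer_id", "integer", references="customer(customer_id)"),
    Attribute("total", "decimal"),
))

# Physical layer: map the logical types onto one database product (SQLite here).
SQLITE_TYPES = {"integer": "INTEGER", "text": "TEXT", "decimal": "NUMERIC"}


def to_sqlite_ddl(entity: Entity) -> str:
    """Translate a logical entity into a SQLite CREATE TABLE statement."""
    columns = []
    for attr in entity.attributes:
        column = f"{attr.name} {SQLITE_TYPES[attr.logical_type]}"
        if attr.primary_key:
            column += " PRIMARY KEY"
        if attr.references:
            column += f" REFERENCES {attr.references}"
        columns.append(column)
    return f"CREATE TABLE {entity.name} ({', '.join(columns)})"


conn = sqlite3.connect(":memory:")
for entity in (customer, customer_order):
    conn.execute(to_sqlite_ddl(entity))

tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
```

Targeting a different database product would mean swapping the type mapping and the DDL generator while leaving the logical entities untouched, which is the separation the three layers are meant to provide.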
Value of Data Engineering
When working with big data, the mere fact that data pipelines must be built implies the need for data engineers. Data engineering is a requirement, and it is concerned with optimizing the collection, storage, and preparation of data to maximize business usability.
But not all data engineers are alike. Pulling together a data engineering team requires understanding the value of three distinct data engineering roles that have evolved in the industry: generalist, pipeline focused, and database focused.
- Generalist — Generalists have a wide breadth of responsibilities and work on small teams. They may be entry-level data engineers or data scientists who have switched to an engineering track. Many tasks, such as building dashboards and forecasts, can be handled by generalists who lack the systems architecture knowledge needed to focus on the pipeline or database.
- Pipeline focused — Data engineers with a pipeline focus create the initial data pipelines and update them as the system grows with demand. They are typically responsible for complex projects, such as predictive analytics across distributed systems, and are found in medium-sized to large companies that are data-centric and integrated.
- Database focused — Complementing the pipeline-focused data engineers are database-focused data engineers, who are responsible for implementing, maintaining, and growing an enterprise's analytics database. Both specialized engineers work with the pipeline. The database-focused role aims at optimizing the database and the extract, transform, and load process of the data pipeline.
The right mix of data engineers that create optimized and useful data infrastructures benefits companies in the following ways.
- Create Greater Visibility and Control over Company Data — Data locked away or ignored in databases cripples an enterprise's ability to self-monitor. Creating a single source of truth across a company’s data exposes threats and opportunities.
- Establish Data-informed Decision Making Capabilities — Data-driven decision making stems from full data visibility and control. Leaders who can not only access data but also have analytics strip away the noise have the best chance of making effective decisions.
- Maintain an Accurate and Reliable Historical Business Context — Remembering business metrics from last year, or even five years ago, is not as good as knowing them. A well-maintained single source of truth chronicles the performance and history of an enterprise. Further analysis can then uncover hidden trends and patterns that point to market opportunities that would otherwise never have been found.
An Example of Data Engineering Solutions
Dodge Construction Network worked with Reltio to deliver a 360-degree view of all the master data. The result was the implementation of a cloud-based master data management system that turned Dodge's data ecosystem into a valuable customer engagement tool.
Read the full case study, or the challenge and solution summary below.
Dodge Construction Network leverages Reltio and AWS to differentiate with data
Dodge Construction Network (Dodge) has a rich 100+ year heritage with deep, long-standing relationships that have been built and maintained over decades. The company powers four trusted industry solutions (Dodge Data & Analytics, The Blue Book Network, Sweets, and IMS) to connect the dots across the entire commercial construction ecosystem. Dodge provides premier and highly differentiated access to private and early-stage construction information. With the Reltio partnership, Dodge is able to deliver a 360-degree view of customers, contacts, projects, and products, providing unparalleled data quality and the ability to bring new and enhanced services to market faster.
Challenge: Bidding goodbye to data silos and duplication
In April 2021, Dodge Data & Analytics acquired the industry’s most comprehensive directory of suppliers, general contractors, and specialty trade contractors through the purchase of The Blue Book Building and Construction Network. The merger of these two companies created data silos, content duplication, and a much larger data set. Dodge realized that if it wanted to understand customer needs, create new services quickly, and provide reliable data, automating its data management in the cloud was a must.
Solution: Constructing a data platform in the cloud
Dodge knew what it wanted: a cloud-native master data management solution that was easy to use and could be up and running quickly on Amazon Web Services (AWS), where Dodge’s data resides. Dodge uses various AWS services, including Amazon Simple Storage Service (Amazon S3), Amazon RDS, Amazon ElastiCache, Amazon Elasticsearch, Amazon DocumentDB, Amazon SageMaker, and others.
High on Dodge’s list of requirements was the ability to offer its data stewards an interface that would allow manual matching and leverage machine learning so that matching could be automated in the future. Three products were evaluated for data matching, merging, and more. After a positive engagement experience, Dodge selected the Reltio Connected Data Platform to anchor its data ecosystem.
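To make the matching requirement concrete, here is a minimal sketch of how duplicate company records might be split into automatic matches and pairs routed to a data steward for manual review. It is not Reltio's algorithm and does not use machine learning: it relies on plain string similarity from Python's standard `difflib`, and the thresholds and sample names are assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations

# Assumed thresholds, for illustration only.
AUTO_MATCH = 0.9  # at or above: treat the pair as the same entity
REVIEW = 0.7      # between REVIEW and AUTO_MATCH: send to a data steward


def normalize(name: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]", "", name.lower()).split())


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def classify_pairs(names: list[str]):
    """Return (auto_matches, needs_review) as lists of index pairs."""
    auto, review = [], []
    for (i, a), (j, b) in combinations(enumerate(names), 2):
        score = similarity(a, b)
        if score >= AUTO_MATCH:
            auto.append((i, j))
        elif score >= REVIEW:
            review.append((i, j))
    return auto, review


# Hypothetical records, such as might come from two merged directories.
names = [
    "ACME Construction, Inc.",
    "Acme Construction Inc",
    "Smith Plumbing",
    "Smith Plumbing LLC",
]
auto_matches, needs_review = classify_pairs(names)
```

With these sample names, the two ACME records normalize to the same string and are matched automatically, while the two Smith Plumbing records fall into the review band. Comparing every pair scales quadratically, so a real system would narrow the candidates first.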