Designing Data-Intensive Applications
Chapter 1. Trade-Offs in Data Systems Architecture
There are no solutions; there are only trade-offs. […] But you try to get the best trade-off you can get, and that’s all you can hope for.
A system is data-intensive if data management is one of its primary challenges; in compute-intensive systems, by contrast, the main challenge is parallelizing a very large computation.
Handle events and data changes as soon as they occur (stream processing)
Periodically crunch a large amount of accumulated data (batch processing)
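The two modes can be contrasted with a toy sketch: the event data and function names below are illustrative, not from any particular framework.

```python
from collections import Counter

# Hypothetical accumulated events from an operational system.
events = [
    {"user": "alice", "action": "click"},
    {"user": "bob", "action": "click"},
    {"user": "alice", "action": "purchase"},
]

# Batch processing: periodically crunch all accumulated events in one pass.
def batch_count(accumulated_events):
    return Counter(e["action"] for e in accumulated_events)

# Stream processing: update state incrementally as each event arrives.
def stream_count(event_iter):
    counts = Counter()
    for event in event_iter:
        counts[event["action"]] += 1   # handle the event as soon as it occurs
        yield dict(counts)             # emit the updated state immediately

print(batch_count(events))                    # one result, computed at the end
print(list(stream_count(iter(events)))[-1])   # same counts, built incrementally
```

The final streaming state equals the batch result over the same events; the difference is latency, not the answer.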
Operational systems consist of the backend services and data infrastructure where data is created, while analytical systems serve the needs of business analysts and data scientists.
operational and analytical systems are often kept separate
Data engineers are the people who know how to integrate the operational and analytical systems
Analytics engineers model and transform data to make it more useful for the business analysts and data scientists
An operational system typically looks up a small number of records by a key (this is called a point query)
this access pattern became known as online transaction processing (OLTP)
Analytical databases built for product analytics or real-time analytics include Pinot, Druid, and ClickHouse.
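The difference between a point query and an analytical query can be shown with SQLite; the table and rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "alice", 20.0), (2, "bob", 35.0), (3, "alice", 15.0)])

# OLTP-style point query: look up a small number of records by key.
row = conn.execute("SELECT customer, amount FROM orders WHERE id = ?", (2,)).fetchone()

# Analytical query: scan many records and aggregate.
total_by_customer = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()

print(row)                # ('bob', 35.0)
print(total_by_customer)  # [('alice', 35.0), ('bob', 35.0)]
```

The point query touches one row via the primary key index; the aggregation scans the whole table, which is the access pattern analytical databases are built to make fast.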
This process of getting data into the data warehouse is known as extract–transform–load (ETL)
Sometimes the order of the transform and load steps is swapped (i.e., the transformation is done in the data warehouse, after loading), resulting in ELT
ETL for SaaS APIs is often implemented by specialist data connector services such as Fivetran, Singer, or Airbyte.
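A minimal ETL pass can be sketched as follows; the source records and warehouse schema are hypothetical, with SQLite standing in for the warehouse. In ELT, the raw rows would be loaded first and the cleanup would run inside the warehouse afterwards.

```python
import sqlite3

# Extract: pull raw records from a source system (hardcoded here for illustration).
raw = [
    {"id": "1", "email": "Alice@Example.COM", "signup": "2024-01-05"},
    {"id": "2", "email": "bob@example.com ",  "signup": "2024-02-11"},
]

# Transform: clean and normalize before loading (the T before the L in ETL).
cleaned = [(int(r["id"]), r["email"].strip().lower(), r["signup"]) for r in raw]

# Load: write the transformed rows into the warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, signup TEXT)")
warehouse.executemany("INSERT INTO users VALUES (?, ?, ?)", cleaned)

print(warehouse.execute("SELECT email FROM users ORDER BY id").fetchall())
```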
Hybrid transactional/analytical processing (HTAP) aims to serve both transactional and analytical workloads in a single system.
distributed analytics frameworks such as Spark
a data lake: a centralized data repository that holds a copy of any data that might be useful for analysis
The difference from a data warehouse is that a data lake simply contains files, without imposing any particular file format, data model, or schema
The data lake contains data in the “raw” form produced by the operational systems, without the transformation into a relational data warehouse schema.
sushi principle: “raw data is better”
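The "raw first, transform later" idea can be sketched as writing untyped records to files and deriving a structured view only on demand; the directory layout and record shapes below are invented for illustration.

```python
import json, os, tempfile

# A data lake holds raw files without imposing a schema.
lake_dir = tempfile.mkdtemp()

# The operational system dumps events in whatever "raw" form it produces.
raw_events = [{"ts": 1, "payload": {"user": "alice"}},
              {"ts": 2, "payload": {"user": "bob"}}]
with open(os.path.join(lake_dir, "events.jsonl"), "w") as f:
    for e in raw_events:
        f.write(json.dumps(e) + "\n")

# Transformation into a tabular shape happens only when an analyst needs it.
with open(os.path.join(lake_dir, "events.jsonl")) as f:
    table = [(rec["ts"], rec["payload"]["user"])
             for rec in (json.loads(line) for line in f)]

print(table)  # [(1, 'alice'), (2, 'bob')]
```

Because the raw files are kept, a different transformation can always be run later, which is the flexibility the sushi principle is after.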
stream processing allows analytical systems to respond to events much faster, on the order of seconds.
Machine learning models can be deployed to operational systems by using specialized tools such as TFX, Kubeflow, or MLflow.
Data systems can be divided into systems of record and derived data systems.
A system of record, also known as a source of truth, holds the authoritative or canonical version of data.
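The relationship can be sketched in a few lines: derived data, such as a cache or index, is computed from the system of record and can always be rebuilt from it. The records and helper function here are illustrative.

```python
# System of record: the authoritative copy of the data.
system_of_record = [
    {"id": 1, "name": "Alice", "city": "Berlin"},
    {"id": 2, "name": "Bob", "city": "Oslo"},
    {"id": 3, "name": "Carol", "city": "Berlin"},
]

# Derived data: an index by city, entirely recomputable from the source of truth.
def build_city_index(records):
    index = {}
    for r in records:
        index.setdefault(r["city"], []).append(r["name"])
    return index

index = build_city_index(system_of_record)
print(index["Berlin"])  # ['Alice', 'Carol']

# If the derived index is lost or corrupted, it can simply be rebuilt.
assert build_city_index(system_of_record) == index
```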
cloud native is used to describe an architecture that is designed to take advantage of cloud services.
Snowflake is a cloud-based analytical database (data warehouse) that relies on S3 for data storage
Responsibility for backend services and data infrastructure is increasingly shared across teams; the DevOps philosophy has guided this trend. Site reliability engineers (SREs) are Google's implementation of this idea.
Tracing tools such as OpenTelemetry, Zipkin, and Jaeger
Combined with single-node databases such as DuckDB, SQLite, and KùzuDB, modern hardware allows many workloads to run on a single node.
Serverless, or function as a service (FaaS), is another approach to deploying services, in which the management of the infrastructure is outsourced to a cloud vendor
BigQuery and various Kafka offerings have adopted “serverless” terminology to signal that their services autoscale