The previous six chapters built the pipes: collection, storage, pipelines, streaming, distributed compute, cloud. This chapter is about the other half of the problem — how an organisation knows what data it has, what it means, whether it is correct, who can see it, and who is responsible when it goes wrong. The answers come from a loose but coherent discipline known as data governance, and from three bodies of practice that have matured around it: data quality (dimensions, tests, contracts, observability), metadata (catalogues, lineage, active-metadata platforms), and policy (classification, access control, privacy techniques, and the legal regimes they answer to). The goal is not paperwork; the goal is a platform where a new analyst can find a trusted dataset in ten minutes, a broken upstream table is caught before the dashboard lies, and a regulator's question about a specific record can be answered in hours rather than weeks.
The first section motivates the whole effort: why governance is not optional once more than a handful of people depend on the same data. Section two breaks quality into its canonical dimensions — accuracy, completeness, consistency, timeliness, uniqueness, validity — because disagreements about "bad data" are usually disagreements about which of these is failing. Sections three through six are the quality toolchain in order of escalation: data contracts between producers and consumers, data tests that run in the pipeline, data observability that watches production freshness and volume, and data lineage that traces an error to its source. Sections seven through nine are the metadata layer: the catalogue where humans find datasets, the active-metadata platforms (DataHub, OpenMetadata, Amundsen, Atlas) that have replaced last decade's passive ones, and schema management as the contract that actually reaches the wire. Section ten is master data management — the flavour of governance that tries to resolve "which customer record is the real one" across systems. Sections eleven and twelve are policy: classification (what kind of data is this) and access control (who can see what). Sections thirteen and fourteen are the protections that sit on top: privacy techniques (masking, tokenisation, differential privacy) and the compliance regimes (GDPR, CCPA, HIPAA, SOC 2) that the whole stack ultimately answers to. Section fifteen is ethics — the practices that exist because "legal" is not the same as "acceptable". Section sixteen is the human layer: roles, ownership, the data-governance operating model that ties the technology to a functioning organisation. Section seventeen closes with where governance compounds specifically for ML: training-data provenance, feature lineage, model cards, and the regulatory landscape that now applies to models as well as to data.
A note on tone. Governance has a long and deserved reputation for being the part of a data stack that is discussed mostly in the past tense, after something broke. The modern practice is deliberately the opposite — it treats quality, metadata, and policy as engineering concerns with their own tools, their own tests, their own on-call rotations, and their own roadmaps. This chapter follows that practice. Where possible it names the specific open-source or commercial tools that embody each idea (Great Expectations, Soda, dbt tests, Monte Carlo, Bigeye, DataHub, OpenLineage, Marquez, Collibra, Alation, Atlas) not as endorsements but as anchor points for the concept. This is also the final chapter of Part III, so Section seventeen is written as a bridge into Part IV: classical machine learning inherits every quality, lineage, and access problem described here, with the added complication that the errors now propagate into trained parameters that are much harder to audit than a row in a warehouse.
A five-person startup can get by without governance; everyone knows what every table means and who owns it. A fifty-person company cannot, and a five-thousand-person company cannot even come close. Governance is the discipline by which an organisation keeps its data assets understandable, correct, and defensible as the number of producers, consumers, and systems grows. It is unglamorous work, but it is almost always the gap between a data platform that accelerates the business and one that quietly poisons it.
Three forces make governance unavoidable once a platform crosses a certain size. The first is scale of use: once the number of analysts, pipelines, and dashboards depending on a table is large, a silent change in definition or a quiet correctness regression propagates widely and invisibly. The second is regulatory surface: GDPR, CCPA, HIPAA, SOC 2, and an expanding set of sector-specific regimes impose concrete obligations on how personal and sensitive data is stored, transferred, accessed, and deleted — obligations that cannot be satisfied without a catalogue, classifications, access records, and audit trails. The third is trust: executive dashboards, external reports, and ML models trained on production data all require a claim of correctness, and that claim cannot be made without machinery to back it up.
The platforms in trouble tend to look alike. Multiple tables claim to be the source of truth for the same concept and disagree by a few percent. Pipelines silently double-count on the day the clocks change. A column quietly changes units from dollars to cents and nobody notices for a month. A departed engineer was the only person who understood a critical transformation. An auditor asks who has access to a specific customer's records and the honest answer is "we don't really know". None of these failures require the organisation to be careless — they require only that it has grown faster than its ability to keep track of what it built.
It helps to define the term, because it is often used loosely. Data governance is the set of practices by which an organisation exercises authority and control over its data assets — deciding who owns what, what quality standard applies, how access is granted, how lineage is tracked, and who is accountable when something breaks. The practices fall into three bundles that the rest of this chapter unpacks: quality (is this data correct and fit for purpose), metadata (what is this data, where did it come from, who owns it), and policy (who is allowed to see it, and under what constraints). A mature programme addresses all three; an immature one usually addresses one and hopes the others take care of themselves.
The governance practices that succeed are the ones that live in the pipeline alongside the code — automated tests, lineage emitted by the orchestrator, classifications enforced by the access layer, catalogue entries that update when tables do. The practices that fail are the ones that live in spreadsheets and slide decks maintained by a separate team. This chapter is written on the assumption that governance is an engineering discipline; every section names the tools and mechanisms that make the practice real.
"The data is bad" is too coarse to act on. The professional discipline of data quality breaks the concept into a small number of orthogonal dimensions, each with its own tests and its own remedies. The canonical list varies slightly between authorities — DAMA-DMBOK, ISO 8000, academic treatments — but the working set used in practice is six: accuracy, completeness, consistency, timeliness, uniqueness, and validity.
Accuracy asks whether values match reality: is the customer address in the record actually where the customer lives. Completeness asks whether the values that should be present are present: are any order rows missing a total. Consistency asks whether related values agree across a row, across a table, or across systems: does the sum of line items equal the invoice total; does the customer in the CRM have the same ID as in the warehouse. Timeliness asks whether data is fresh enough for the purpose: a one-day-old feed can be fine for weekly reporting and unusable for real-time fraud detection. Uniqueness asks whether the same real-world entity is represented exactly once: one customer, one row, or else duplicates resolved before use. Validity asks whether values conform to their declared domain: email addresses that parse, postcodes that exist, enum fields that hold only enumerated values.
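The dimensions become concrete once they are computed. A minimal sketch over a handful of invented order rows (the field names and values are illustrative, not from any real system); accuracy and consistency are deliberately absent, because the first needs an external reference and the second needs a second system to compare against:

```python
# Four of the six dimensions, measured over invented order rows.
# Accuracy needs ground truth and consistency needs a second system,
# so neither can be computed from this table alone.
orders = [
    {"id": 1, "email": "a@example.com", "total": 100, "updated": "2024-01-02"},
    {"id": 2, "email": None,            "total": 250, "updated": "2024-01-02"},
    {"id": 2, "email": "c@example",     "total": None, "updated": "2023-12-01"},
]
n = len(orders)

# Completeness: required values actually present.
completeness = sum(r["total"] is not None for r in orders) / n          # 2/3

# Uniqueness: the primary key appears exactly once.
ids = [r["id"] for r in orders]
uniqueness = sum(ids.count(i) == 1 for i in ids) / n                    # 1/3

# Validity: non-null emails that conform to a (crude) domain rule.
emails = [r["email"] for r in orders if r["email"] is not None]
validity = sum("@" in e and "." in e.split("@")[-1] for e in emails) / len(emails)  # 1/2

# Timeliness: rows updated within the expected window (ISO dates sort lexically).
timeliness = sum(r["updated"] >= "2024-01-01" for r in orders) / n      # 2/3
```

A real implementation runs these as SQL over the warehouse rather than Python over rows, but the arithmetic is the same, which is why the dimensions lend themselves so readily to automation.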
The taxonomy matters because different dimensions have different fixes. A completeness problem is usually a pipeline-coverage problem: a source is silently dropping records, or a join is losing rows. A validity problem is usually a schema-and-contract problem: the upstream system is emitting values the downstream system does not expect. A uniqueness problem is often an entity-resolution problem (Section 10). A consistency problem frequently indicates that two systems disagree about a shared identifier or a shared definition, which is often the deepest and most organisational failure. Naming the dimension is the first step toward addressing the right cause.
It is tempting, and usually wrong, to treat quality as an absolute. The mature framing is fitness for purpose: data is "good enough" when it is suitable for the decisions and models that will consume it. A marketing segmentation can tolerate a few percent duplicate customers; a regulatory filing cannot. The quality bar should be set per use, and the same asset can be "clean enough" for one pipeline and "not clean enough" for another. Contracts and catalogues (Sections 3 and 7) make this explicit; they publish the quality claim the producer is willing to stand behind, and consumers decide whether it meets their bar.
Each dimension gets more expensive as it approaches the asymptote. Raising completeness from 95 percent to 99 percent is usually cheap; raising it from 99 percent to 99.99 percent often requires a new data source, additional joins, or a human review step. A governance programme that pretends every dimension should be at 100 percent for every asset will simply be ignored; one that sets a defensible target per dimension per asset — and measures against it — earns the attention of the engineers who have to meet the targets.
The six-dimension framework is a diagnostic checklist, not a strategy. Its job is to make quality conversations precise — "we have a completeness problem on orders.customer_id" — so that the right test, the right alert, and the right owner can be put in place. Sections 3 through 6 are the engineering practices that convert those diagnoses into automated checks.
For most of the last decade, data teams caught problems downstream — in the warehouse, after the pipeline had already loaded questionable data. Data contracts invert that stance. A contract is an explicit, machine-checked agreement between a producer (usually an application service) and its consumers (usually analytics pipelines and ML features) about the schema, semantics, quality, and change-management rules of a dataset. The idea rose to prominence in 2021–2023 through a series of widely read essays by Chad Sanderson and others, and through the data-contract tooling that appeared in its wake.
A useful contract covers at least four things. The schema: the exact set of fields, their types, their nullability, and their semantics. The quality guarantees: what completeness, uniqueness, validity, and freshness the producer will maintain. The change policy: how breaking changes are announced, how long the deprecation window is, and which changes require a new version of the contract. The ownership: who is accountable on the producer side, who is accountable on the consumer side, and what the escalation path is when a violation is detected. The contract is committed to source control, reviewed like code, and enforced by automated checks at the producer boundary.
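As a sketch of what such an artefact can look like in source control (the YAML below is illustrative; the field names, team names, and thresholds are invented, and real contract tools each have their own format):

```yaml
# Hypothetical contract for an orders event stream. All names are illustrative.
contract: orders_v2
owner:
  producer: team-checkout
  consumer_escalation: data-platform-oncall
schema:
  fields:
    - {name: order_id,     type: string, nullable: false, semantics: "UUID, unique per order"}
    - {name: amount_cents, type: int64,  nullable: false, semantics: "USD cents, never negative"}
quality:
  completeness: "100% of rows carry order_id and amount_cents"
  freshness:    "events visible downstream within 15 minutes"
  uniqueness:   "order_id unique within a 24-hour window"
change_policy:
  breaking_changes: "new major version, two-week deprecation window"
  announcement:     "contracts channel plus catalogue changelog"
```

The point of the format is less the syntax than the review: a pull request against this file is the moment the producer and consumer negotiate.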
The distinguishing feature of contracts, compared with older schema-registry and data-quality practices, is that they are enforced at the point where the data is produced — before it ever lands in the warehouse. A production service that emits a Kafka event, writes to an operational database, or publishes a CDC stream validates its output against the contract at emit-time; violations are surfaced to the producing team, not to the downstream analytics team. This matters because the cheapest time to fix a data bug is when the engineer who wrote the emitting code is still looking at their own pull request.
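A minimal sketch of emit-time enforcement, assuming a contract reduced to per-field rules (the field names, rules, and functions are invented for illustration):

```python
# Emit-time contract enforcement, sketched: validate before publishing,
# and surface violations to the producer rather than downstream.
CONTRACT = {
    "order_id":     {"type": str, "required": True},
    "amount_cents": {"type": int, "required": True, "min": 0},
}

def violations(event: dict) -> list[str]:
    """Return contract violations for one event; an empty list means it may be emitted."""
    problems = []
    for field, rule in CONTRACT.items():
        value = event.get(field)
        if value is None:
            if rule.get("required"):
                problems.append(f"{field}: required field missing")
            continue
        if not isinstance(value, rule["type"]):
            problems.append(f"{field}: expected {rule['type'].__name__}")
        elif "min" in rule and value < rule["min"]:
            problems.append(f"{field}: below minimum {rule['min']}")
    return problems

def emit(event: dict, publish) -> bool:
    """Publish only if the event satisfies the contract."""
    if violations(event):
        # In a real service this fails the producer's request or CI check,
        # rather than silently dropping the event for downstream teams to find.
        return False
    publish(event)
    return True
```

Real enforcement points vary — a Kafka serializer hook, a CI check against sample payloads, a sidecar validator — but the shape is the same: the check runs where the producing engineer sees the failure.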
Protobuf and Avro schemas, and Kafka Schema Registry, predate the contract movement by years, and they are often cited as if they solve the same problem. They do not. A schema ensures the payload parses; a contract additionally ensures the payload means what the consumer expects, at the quality level the consumer requires, with a change-management process the consumer can rely on. A schema says "this field is an int64"; a contract says "this field is the monetary amount in USD cents, never negative, present on 100% of rows, and any change to its definition requires a two-week deprecation and a new major version".
The technology for contracts is straightforward (a YAML schema, a validator, a CI check). The cultural work is the hard part: product teams do not automatically see themselves as responsible for the downstream analytics and ML use of the events they emit, and getting them to take on that responsibility requires executive support, incentives, and usually a catalyst event — a high-profile bad-data incident that made the CEO unhappy. Contracts work in organisations that have decided to take data seriously; they do not by themselves make an organisation take data seriously.
The productive analogy is that a data contract is to a dataset what a service API is to a microservice. Nobody would seriously propose running a microservice architecture without explicit, versioned, reviewed APIs; a modern data platform treats its internal datasets with the same seriousness. The terminology is recent, the underlying idea — explicit producer-consumer contracts, enforced at the boundary — is not.
Contracts describe what should be true; data tests verify it automatically, every run. The genre is directly analogous to unit tests in software, and like unit tests it is cheap to write, cheap to run, and enormous in the amount of silent failure it prevents. The three tools that have become standard — dbt tests, Great Expectations, and Soda — cover the space with different accents, but the idea is the same: attach named, executable assertions to specific datasets and fail the pipeline if they do not hold.
Most useful data tests fall into a small number of families. Column-level tests assert properties of a single column: not-null, uniqueness, value-in-set, value-in-range, regex match, referential integrity to another table. Row-level tests assert cross-column properties of a single row: end_date >= start_date, line_total = unit_price * quantity. Distribution tests assert statistical properties of the column as a whole: mean within a range, standard deviation stable, null rate below a threshold, cardinality within expected bounds. Relationship tests assert properties across tables: every order.customer_id has a matching customers.id; the sum of line-item totals in orders equals the aggregate total in the invoice table. Together these four kinds catch a large majority of the preventable correctness regressions.
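The four families reduce to plain assertions. The sketch below runs them over a few in-memory rows with invented names, standing in for what dbt, Great Expectations, or Soda would execute as SQL against the warehouse:

```python
# The four test families as plain assertions; table and column names are invented.
customers = [{"id": 10}, {"id": 11}]
orders = [
    {"id": 1, "customer_id": 10, "unit_price": 5, "quantity": 3, "line_total": 15},
    {"id": 2, "customer_id": 11, "unit_price": 2, "quantity": 4, "line_total": 8},
]

# Column-level: not-null and unique on the primary key.
ids = [r["id"] for r in orders]
column_ok = all(i is not None for i in ids) and len(ids) == len(set(ids))

# Row-level: a cross-column invariant checked on every row.
row_ok = all(r["line_total"] == r["unit_price"] * r["quantity"] for r in orders)

# Distribution: null rate on a column below a threshold.
null_rate = sum(r["customer_id"] is None for r in orders) / len(orders)
distribution_ok = null_rate <= 0.01

# Relationship: referential integrity across tables.
customer_ids = {c["id"] for c in customers}
relationship_ok = all(r["customer_id"] in customer_ids for r in orders)
```

In dbt these correspond roughly to the built-in generic tests (`not_null`, `unique`, `relationships`) plus custom SQL tests for the row-level and distribution cases.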
dbt tests are the dominant entry point for analytics-engineering teams. They live alongside the SQL models, are written in YAML, and run as part of the same DAG that builds the models; the result is that testing and transformation are inseparable in a way that earlier generations of tooling never achieved. Great Expectations is the older, more general library — a Python-first framework with a much larger vocabulary of built-in expectations, integrations to most pipeline orchestrators, and a first-class notion of a data-docs site that publishes test results. Soda is the newest of the three, designed from the start for an observability use case: write checks in a compact DSL (SodaCL), run them against warehouses and lakes, ship failures to alerts and dashboards. Many mature platforms use all three in different layers.
A test written and never run is decoration. The practical question is where in the pipeline a test executes. Good practice is to test at three points: at ingest, immediately after raw data lands, so that upstream regressions are caught before they contaminate the warehouse; at transformation, inside dbt or the equivalent, so that derived models are validated against the assumptions they depend on; and at the consumer boundary, on the specific tables that feed dashboards, ML features, and external reports. Each layer catches a different class of failure.
What should happen when a test fails is a design decision, not a default. Some failures should stop the pipeline and page an engineer: duplicate primary keys in a finance table, null rates above a dangerous threshold on an ML feature. Others should alert but proceed: a small distribution drift that is plausibly a real-world signal. Others should simply record the fact and continue: a tolerable anomaly with no corrective action available. Treating every failure as fatal trains the team to disable tests; treating every failure as a warning trains the team to ignore them. The policy should match the cost of the downstream failure, as Section 2's fitness-for-purpose framing requires.
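One way to make that policy explicit is a per-test severity map rather than a global default; the test names, tiers, and routing below are invented for illustration:

```python
# A per-test severity policy, sketched. Names and routing are illustrative.
BLOCK, ALERT, RECORD = "block", "alert", "record"

POLICY = {
    "finance.payments:unique_pk":  BLOCK,   # stop the pipeline and page on-call
    "features.user:null_rate":     BLOCK,
    "orders.daily:volume_drift":   ALERT,   # notify, but let the run proceed
    "marketing.leads:minor_drift": RECORD,  # log for trend analysis only
}

def handle_failure(test_name: str) -> str:
    """Return the action taken; raise when the policy says the run must stop."""
    severity = POLICY.get(test_name, ALERT)  # unknown tests default to alerting
    if severity == BLOCK:
        raise RuntimeError(f"{test_name} failed: halting pipeline and paging owner")
    return severity
```

The map itself belongs in source control next to the tests, so that the decision "this failure is fatal" is reviewed with the same care as the test it governs.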
Data tests are usually the single highest-leverage investment a governance programme can make in its first year. They are cheap to author, they catch real regressions, their output is legible to both engineers and auditors, and they compound: every test written today will catch a regression that would otherwise have been debugged painfully in production next year.
Job-level monitoring — was the pipeline run green, how long did it take, did the task retry — is necessary and insufficient. A pipeline can finish on time, with no errors, and still deliver a table whose row count has silently halved, whose freshness has slipped by hours, whose schema has quietly drifted, or whose primary business metric has moved outside its historical band. Data observability is the practice of monitoring the data itself, continuously, against profiles of what "normal" looks like.
The canonical taxonomy, popularised by Monte Carlo and widely adopted, names five pillars. Freshness: when was this table last updated, and is it within the expected cadence? Volume: how many rows does each run produce, and is that within the expected band? Schema: has the set of columns and their types changed, and was the change announced? Distribution: are the column-level statistics (mean, null rate, cardinality, histogram) within historical tolerances? Lineage: when a metric changes, which upstream table is the likely cause? The five cover most of the silent-failure surface that job-level monitoring misses.
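The freshness and volume pillars are small enough to sketch directly. The table name, cadence, and row bands below are invented; a real platform learns the profile from history rather than hard-coding it:

```python
from datetime import datetime, timedelta

# Freshness and volume, checked against an expected per-table profile.
# The profile values are invented for the sketch.
PROFILE = {
    "orders": {"cadence": timedelta(hours=1), "rows_low": 9_000, "rows_high": 15_000},
}

def check(table: str, last_updated: datetime, row_count: int, now: datetime) -> list[str]:
    """Return the pillar violations for one table snapshot."""
    p = PROFILE[table]
    findings = []
    if now - last_updated > p["cadence"]:
        findings.append("freshness: update overdue")
    if not p["rows_low"] <= row_count <= p["rows_high"]:
        findings.append("volume: row count outside historical band")
    return findings
```

Schema and distribution monitoring follow the same pattern with richer profiles; lineage (Section 5) is what turns a finding into a root cause.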
The commercial category is led by Monte Carlo, Bigeye, Acceldata, and Metaplane; their pitch is turnkey profiling of the warehouse plus automatic anomaly detection. Open-source alternatives include Elementary (built on dbt tests), Soda Core, and instrumentation via OpenLineage with lineage-aware dashboards. The commercial tools are typically easier to adopt and faster to show value; the open-source tools are cheaper over time and integrate more flexibly with homegrown platforms.
Observability and testing (Section 4) are complementary but different. A test is an assertion written by a human who knew what to check; observability is a profile learned from historical behaviour that detects deviations no one thought to write an assertion for. A test says "customer_id must never be null"; observability says "the null rate on customer_id has been 0.1% for six months and jumped to 3% yesterday — someone should look". Teams need both: tests for known-dangerous conditions, observability for the unknown unknowns.
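The "learned profile" can be as simple as a historical mean and standard deviation. A sketch of the null-rate example from the paragraph above, with invented numbers:

```python
import statistics

# A learned profile: daily null rates observed over history (numbers invented).
history = [0.0010, 0.0012, 0.0009, 0.0011, 0.0010, 0.0013, 0.0010]

def is_anomalous(today: float, history: list[float], sigmas: float = 3.0) -> bool:
    """Flag a value deviating from the historical mean by more than `sigmas` stddevs."""
    mu = statistics.mean(history)
    sd = statistics.stdev(history)
    return abs(today - mu) > sigmas * sd
```

Production systems use more robust statistics (seasonality, rolling windows, median absolute deviation), but the contrast with Section 4 holds: no human wrote an assertion about 3%, yet the jump is detected.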
Data observability's hardest engineering problem is not detection; it is alerting. A naive system that fires on every two-sigma deviation will drown the team in noise within a week. A mature system uses layered thresholds, dependency-aware suppression (don't alert on 20 downstream tables when the upstream source is the cause), per-asset criticality tiers, and business-hours sensitivity. Anomaly detection with no alerting hygiene is a research project; anomaly detection with alerting hygiene is an on-call rotation the team can actually sustain.
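Dependency-aware suppression can be sketched as a filter over the lineage graph: alert only on anomalous tables whose direct upstreams are healthy. The edge data below is invented:

```python
# Dependency-aware suppression: alert on root causes, not on every downstream
# table that inherited the anomaly. Edges map a table to its direct upstreams.
upstreams = {
    "raw_orders":   [],
    "stg_orders":   ["raw_orders"],
    "fct_revenue":  ["stg_orders"],
    "dash_revenue": ["fct_revenue"],
}

def root_causes(anomalous: set[str]) -> set[str]:
    """Anomalous tables none of whose direct upstreams are also anomalous."""
    return {t for t in anomalous
            if not any(u in anomalous for u in upstreams.get(t, []))}
```

If the whole chain fires, only `raw_orders` pages anyone; the remaining three findings are attached to the incident rather than raised as separate alerts.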
The slogan inside the data-observability community captures the gap: green pipelines do not imply correct data. Until a platform is watching the shape of its outputs as carefully as the health of its jobs, it has a persistent blind spot. Observability closes that gap, and for many teams it is the second-highest-leverage governance investment after the test suite of Section 4.
When a dashboard reports the wrong revenue number, the real question is not what is wrong but where. Data lineage is the graph that answers that question: for any column in any table in any dashboard, lineage tells you the chain of upstream tables, transformations, and source systems that produced it. A platform with accurate lineage can debug a number in minutes; a platform without it can spend days.
Lineage has two resolutions. Table-level lineage records that table B was produced from tables A1 and A2; it is straightforward to emit from any orchestrator and covers most pipeline-operations use cases. Column-level lineage additionally records which specific columns of A1 and A2 flowed into which specific columns of B; it is much more useful for debugging, change-impact analysis, and regulatory work, and it is much harder to compute accurately because it requires parsing the SQL of the transformation. The modern catalogue and observability platforms (Section 8) almost all support column-level lineage; the question is how accurately they extract it from the transformation code.
There are two ways to build a lineage graph. Infer-time approaches parse SQL, Spark DAGs, and warehouse query logs after the fact and reconstruct lineage from what actually ran; they work on any platform without modifications but can miss edge cases in dynamic SQL or external transformations. Emit-time approaches instrument the orchestrator and execution engines to emit lineage events as they run; they are more accurate but require adoption of an emission standard. OpenLineage is the dominant open standard for emit-time lineage, with producers for Airflow, Spark, dbt, and Flink, and consumers including Marquez, DataHub, and the commercial observability vendors.
Lineage serves at least four distinct use cases, and the value of each one is what justifies the investment. Debugging: trace a wrong number back to a specific upstream change. Change impact: before dropping a column or renaming a table, identify every downstream consumer. Regulatory response: when asked where a given personal datum lives, produce the full map of derived assets and who has access to each. Quality attribution: when a downstream test fails, narrow the search to the specific upstream asset whose distribution drifted. A platform that cannot do these four things is one that is slow and expensive to change safely.
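Change impact, the second of those use cases, is a reachability query over the lineage graph. A sketch with an invented edge list:

```python
from collections import deque

# Change-impact analysis as reachability over table-level lineage.
# Edges map a table to its direct downstream consumers (names invented).
downstreams = {
    "raw_customers": ["stg_customers"],
    "stg_customers": ["fct_orders", "dim_customers"],
    "dim_customers": ["dash_retention"],
}

def impacted(asset: str) -> set[str]:
    """Every transitive downstream consumer of `asset`."""
    seen, queue = set(), deque(downstreams.get(asset, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(downstreams.get(node, []))
    return seen
```

The other three use cases are variations on the same traversal: debugging walks the edges upstream, regulatory response filters the closure by classification, and quality attribution intersects the upstream set with the assets whose profiles drifted.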
The failure mode of a lineage programme is partial coverage. If lineage is accurate for the 70% of pipelines that run in dbt, and absent for the 30% that run in hand-written Spark or SaaS integrations, the lineage graph is not merely incomplete — it is misleading, because it suggests the missing 30% does not exist. Getting to near-complete lineage usually requires work in two directions: expanding the set of systems that emit lineage, and parsing query logs for everything else. It is not a feature; it is a capability that accretes over time.
Lineage and the catalogue (Section 7) are adjacent but different. Lineage is the graph — how data moves. The catalogue is the index — what each dataset is, who owns it, how it is used. A modern active-metadata platform (Section 8) publishes both against the same underlying metadata store; that is what makes the combination valuable in a way that either piece alone is not.
For most of the people who use a data platform, the first question is not how to query a table but how to find one. A data catalogue is the searchable inventory of an organisation's data assets — tables, views, dashboards, feature stores, ML models — with enough metadata attached to each entry that a consumer can decide whether it is the right asset for their purpose without reading the pipeline code. A platform without a catalogue forces that discovery to happen by word of mouth; a platform with a working one makes it self-service.
A minimally useful entry has a name, a plain-English description, a schema, an owner, a freshness indication, a usage signal (who queries this, how often), tags that place it in the business taxonomy, and a link to the upstream pipeline that produces it. A richer entry adds sample values, documented business-logic notes, quality-test results, known caveats, and the lineage graph that connects it to its sources. The richer the entry, the less often a consumer needs to ask the owner a question before using the asset.
Cataloguing tools have gone through three roughly distinguishable generations. The first was enterprise metadata repositories like Informatica and Collibra, focused on top-down compliance use cases. The second was the discovery-first tools that emerged from large consumer-internet teams in the late 2010s: Airbnb's Dataportal, Lyft's Amundsen, LinkedIn's DataHub, Netflix's Metacat — all designed to help thousands of analysts find the right table. The third, which this chapter's Section 8 treats separately, is the active-metadata platforms that extended the second-generation catalogues into control-plane tools rather than passive indexes. Modern practice runs on the second- and third-generation descendants.
A long-running design tension is between catalogues that rely on producers to document everything formally (curated, high-quality entries, low coverage) and catalogues that let consumers edit, tag, and comment (crowdsourced, uneven quality, high coverage). The working answer in practice is both: a baseline of machine-extracted metadata for every asset, a curated layer for the business-critical ones, and a crowdsourced layer of tags, comments, and usage examples on top. A catalogue that covers only the curated set will never find most of the platform; one that covers everything with no curation will be full of abandoned and misleading entries.
A catalogue is successful when people use it instinctively. The organisational signal is that internal chat questions like "does anyone know where the X table is?" trail off because the answer is "search the catalogue". Getting there requires the catalogue to be genuinely faster than Slack, which in turn requires good search, good freshness, and enough baseline metadata that most entries return a useful result. The failure mode is the catalogue that has an entry for every table but nothing human-useful in any of them; consumers learn quickly that the catalogue is not where answers come from and go back to asking people.
An important reframing is that a catalogue's primary user is not the governance team but the analyst or engineer who is trying to answer a question. Organisations whose catalogue teams treat themselves as an internal product team — with user research, usage analytics, and iteration — tend to build catalogues that get used; organisations that treat cataloguing as a compliance obligation tend to build ones that do not.
Traditional catalogues were passive indexes: they scraped metadata out of the warehouse, stored it, and displayed it. Active-metadata platforms (the term was coined by Gartner in 2021 and quickly adopted by the vendor community) are the generation of tools that additionally emit metadata back into the platform — driving access policies, triggering pipeline runs, annotating dashboards, alerting on lineage changes, and integrating with the orchestrator, BI layer, and ML platform in real time. The distinction matters because it changes what a metadata platform does, not just what it stores.
Four open-source projects, and a layer of commercial vendors that sit around them, set the standard. DataHub (LinkedIn, then Acryl) emphasises metadata-as-a-graph with strong lineage, ownership, and search. OpenMetadata (Collate) emphasises a unified ontology across data, dashboards, pipelines, and ML models, with tight integrations to dbt, Airflow, and Superset. Apache Atlas is the older Hadoop-era incumbent, still present in many enterprise estates, particularly those built on Hortonworks or Cloudera foundations. Amundsen (Lyft) remains the clearest embodiment of the search-first catalogue pattern, though newer teams tend to pick DataHub or OpenMetadata for the broader feature set. The commercial layer — Collibra, Alation, Atlan, Informatica Cloud Data Governance — targets enterprises that want turnkey deployments with strong compliance features and extensive connector catalogues.
A metadata platform's value is roughly proportional to the number of systems it integrates with and the depth of the integration. A platform that ingests metadata from the warehouse is table stakes; a platform that additionally pulls from dbt, Airflow, Spark, Kafka, Looker, Tableau, SageMaker, and Snowflake RBAC — and emits to the same set — is genuinely a control plane. The practical sizing question for a buyer or adopter is which of their existing systems the platform covers out of the box and which will require custom connectors.
The right mental model for a modern metadata platform is a graph database. Nodes are assets (tables, columns, dashboards, models, pipelines, users, teams); edges are lineage, ownership, usage, and policy relationships. Queries over that graph answer questions that cannot be answered by any single system in isolation: "which dashboards depend on this column", "which production tables are accessed by the recommendations model", "which users own assets that include PII but lack an approved DPA". The shift from "metadata as rows in a governance tool" to "metadata as a queryable graph" is the concrete technical change that active-metadata really names.
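A sketch of the kind of query the graph framing enables, over a toy in-memory store with invented asset names and tags (real platforms expose this through GraphQL or search APIs rather than dict comprehensions):

```python
# Metadata as a queryable graph, reduced to its essentials: nodes carry
# tags and an ownership edge. Asset names and tags are invented.
assets = {
    "dwh.users":        {"tags": {"pii"}, "owner": "team-identity"},
    "dwh.events":       {"tags": set(),   "owner": "team-platform"},
    "dwh.users_export": {"tags": {"pii"}, "owner": None},
}

def unowned_pii() -> list[str]:
    """The governance query: PII-tagged assets with no accountable owner."""
    return sorted(name for name, meta in assets.items()
                  if "pii" in meta["tags"] and meta["owner"] is None)
```

The interesting property is that the question spans classification (a policy concern) and ownership (an organisational one); no single source system holds both halves, which is exactly why the graph is worth building.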
Most teams first find value in three places. Change-impact analysis — before dropping a column, see every downstream consumer. Policy enforcement — tag a column as PII in the metadata layer, and have the access layer, pipeline, and BI tool all respect it automatically. Stewardship — assign ownership in the metadata layer and route incidents, test failures, and review requests to the right person. These are the capabilities that separate an active-metadata platform from a classic catalogue, and they are the ones that repay the substantial integration effort.
The short version of the generational shift: a catalogue tells you what exists; an active-metadata platform tells you what exists, what it depends on, who owns it, who uses it, what policies apply to it, and then makes those facts actionable across the rest of the stack. The jump in leverage justifies the jump in deployment complexity for most organisations past a certain size.
A contract (Section 3) is the agreement; a schema is the machine-readable form of part of that agreement — the structural part that a serializer and a parser can verify on every message. Schemas are a prerequisite for most of the governance work in this chapter: you cannot enforce classifications on unnamed columns, you cannot emit lineage through untyped payloads, and you cannot evolve a format safely without a change-management process around its schema.
Apache Avro is the standard in Kafka and Hadoop-adjacent ecosystems. It is compact, supports rich schema evolution (defaults, aliases, union types), and is the native format used with Confluent Schema Registry. Protocol Buffers (Protobuf) is the Google lineage, widely used for service-to-service RPC and increasingly for event payloads; its proto3 dialect is the one most teams encounter. JSON Schema is the least compact of the three and the most human-readable; it dominates where JSON is already the on-wire format, which includes most web APIs and many streaming systems that chose readability over efficiency. Choosing between them is a real design decision, usually made by the platform team once, and lived with for years.
A schema registry is the central store that producers and consumers consult to resolve the schema associated with a stream, topic, or dataset. Confluent Schema Registry is the reference implementation in the Kafka world, with an open-source core and a commercial offering; AWS Glue Schema Registry and similar managed equivalents fill the role on the cloud providers. The registry gives the platform three things: a single authoritative location for every schema, a versioning history, and a compatibility check that validates proposed new versions against the rules for safe evolution before they are accepted.
The most consequential feature of a schema registry is its compatibility enforcement. The three common modes — backward, forward, and full compatibility — formalise the rules for what a schema change can and cannot do. Backward compatible: consumers running the new schema can read data written with the old. Forward compatible: consumers running the old schema can read data written with the new. Full: both. The right mode depends on how producers and consumers roll out relative to each other; the wrong mode lets a producer deploy a breaking change and discover the consequences only when the downstream systems start failing.
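The backward-compatibility rule can be made concrete with a simplified check over Avro-style record schemas. This is a sketch of the idea, not the registry's actual algorithm: it ignores type promotion, aliases, and nested records, and checks only the core rule that a field added in the new schema must carry a default for new-schema consumers to read old data:

```python
def backward_compatible(old: dict, new: dict) -> list:
    """Return reasons the new schema cannot read data written with the
    old one; an empty list means the change is backward compatible.
    Simplified: no type promotion, aliases, or nested records."""
    old_fields = {f["name"]: f for f in old["fields"]}
    problems = []
    for f in new["fields"]:
        if f["name"] not in old_fields:
            # A field absent from old data must have a default to fill in.
            if "default" not in f:
                problems.append(f"new field '{f['name']}' has no default")
        elif f["type"] != old_fields[f["name"]]["type"]:
            problems.append(f"field '{f['name']}' changed type")
    return problems

old = {"fields": [{"name": "id", "type": "long"},
                  {"name": "email", "type": "string"}]}
ok  = {"fields": old["fields"] + [{"name": "plan", "type": "string",
                                   "default": "free"}]}
bad = {"fields": old["fields"] + [{"name": "plan", "type": "string"}]}

print(backward_compatible(old, ok))   # []
print(backward_compatible(old, bad))  # ["new field 'plan' has no default"]
```

A registry running in backward mode rejects the `bad` version at registration time, which is precisely the point: the producer discovers the breaking change before deploying, not after the downstream systems fail.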
Schemas live in the lake and the warehouse too, though the machinery is less explicit. Parquet files carry their own schema in the footer; Delta, Iceberg, and Hudi (Chapter 02) layer schema evolution, enforcement, and time-travel on top of the raw files. Cloud warehouses enforce a declared schema on every write. The modern practice is to treat every write path — streams, lakes, warehouses — as having an explicit schema, an explicit registry of some kind (even if the registry is the table-format metadata), and an explicit change-management process; ad-hoc evolution without any of those is how the classic "the schema quietly changed" incident happens.
For teams just beginning a governance programme, a clean schema and a working registry are often the single most concrete starting point. They are an engineering artefact, easy to adopt incrementally, and they immediately make the rest of the programme — contracts, tests, classifications, lineage — tractable in ways that unbounded semi-structured payloads cannot support.
Most organisations of any size have the same customer, the same product, or the same employee represented in several systems at once — CRM, billing, support, marketing, HR. Master data management (MDM) is the practice of reconciling those parallel representations into a single authoritative record per entity, known in the older literature as the golden record. It is one of the oldest disciplines in data management, one of the least fashionable, and in large enterprises one of the highest-leverage.
At the heart of MDM is entity resolution: the problem of deciding which records refer to the same real-world entity and which do not. Deterministic matching uses exact or rule-based matches on fields like email, phone, or external ID; it is precise but misses anything with typos or formatting differences. Probabilistic matching uses similarity scores across multiple fields and a threshold to declare a match; it catches more real duplicates at the cost of some false positives. ML-based matching trains a classifier on hand-labelled pairs and generalises beyond hand-tuned rules; it works well on large corpora with rich signals. Real programmes typically use all three in layers.
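The layering can be sketched in a few lines. The field weights and threshold below are illustrative, not tuned, and the similarity function is the standard-library sequence matcher rather than a production string-distance library:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match(rec_a: dict, rec_b: dict, threshold: float = 0.85):
    """Layered matcher: deterministic first, probabilistic fallback."""
    # Layer 1: deterministic -- exact match on a strong identifier.
    if rec_a.get("email") and rec_a["email"] == rec_b.get("email"):
        return True, "deterministic:email"
    # Layer 2: probabilistic -- weighted similarity over weaker fields,
    # declared a match above a threshold.
    score = (0.6 * similarity(rec_a["name"], rec_b["name"]) +
             0.4 * similarity(rec_a["city"], rec_b["city"]))
    return score >= threshold, f"probabilistic:{score:.2f}"

a = {"name": "Jane Smith",  "city": "London", "email": "js@x.com"}
b = {"name": "Jane  Smyth", "city": "london", "email": None}
print(match(a, b))  # matches on the probabilistic layer despite the typo
```

The ML-based third layer would replace the hand-set weights with a classifier trained on steward-labelled pairs; the structure of the decision stays the same.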
Entity resolution will always have ambiguous cases — two records that might or might not be the same person. A credible MDM programme includes human stewards who review the ambiguous matches, decide them, and feed the decisions back into the matching model. Without stewardship the programme will either be conservative (high precision, many uncaught duplicates) or aggressive (high recall, occasional wrong merges that are enormously embarrassing when the two people are different); with stewardship, it can converge over time to a stable accuracy appropriate for the use.
There are three broad architectural patterns. Registry MDM keeps the source systems authoritative, adds a cross-reference table, and exposes the golden record as a view that joins them. Consolidation MDM copies into a central store and cleans in-place, with the central store as the published golden record but not authoritative for writes. Coexistence and centralised patterns make the MDM store the system of record, pushing changes back into the source systems. Registry is the cheapest and most common starting point; full centralisation is rare because it requires reworking every source system.
In modern stacks, MDM overlaps substantially with warehouse modelling (Chapter 02 Sections 13–14) and with feature stores (Chapter 03). Many teams end up doing lightweight MDM inside dbt models: a central dim_customer that resolves the CRM, billing, and marketing records into a single row per real-world customer. A full enterprise MDM product (Informatica, Reltio, TIBCO EBX) is heavier; it is the right choice when the entity being mastered has regulatory weight, lives in dozens of systems, or drives high-consequence operational decisions.
MDM has never been the exciting part of data. It is nonetheless the difference between an organisation that can answer "how many customers do we have" with confidence and one that genuinely does not know. Mature data teams treat MDM as a first-class discipline alongside pipelines and governance; immature ones treat it as a workaround that never gets finished.
Before access controls, privacy techniques, or compliance processes can do useful work, each data asset needs a label that says what kind of thing it is — whether it contains personal data, whether it is sensitive under a specific regulation, what internal confidentiality tier it sits in. Data classification is the practice of applying and maintaining those labels consistently, and it is the hinge that connects the metadata layer to the policy layer.
Most classification schemes combine two axes. The first is sensitivity or confidentiality: typical tiers are public, internal, confidential, and restricted (or the equivalent government scale). The second is data type, which flags regulatory significance: PII (personally identifiable information), PHI (protected health information under HIPAA), PCI (payment card information), financial data subject to specific disclosure rules, and employee data subject to labour regulations. A column can carry labels on both axes — for example, "confidential" on sensitivity and "PII" on type — and each label pulls in its own controls.
A scheme that relies on every engineer to apply the right labels to every new column will not stay current. The mature practice combines three techniques. Pattern scanners identify likely PII by regex (email addresses, phone numbers, national ID numbers) or by ML classifiers trained on examples. Lineage-based propagation automatically inherits labels from upstream to downstream tables so that derived assets are classified the moment they are created. Manual override lets data stewards correct misclassifications and apply labels the automation missed. All three are native features of modern active-metadata platforms (Section 8).
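A pattern scanner at its simplest is a set of regexes run over a sample of column values, with a label suggested when most of the sample matches. The patterns below are deliberately crude illustrations; production scanners use far richer rules, checksum validation, and ML classifiers alongside them:

```python
import re

# Illustrative patterns only; real scanners are much more thorough.
PATTERNS = {
    "PII:email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "PII:phone": re.compile(r"^\+?[\d\s()-]{7,15}$"),
}

def suggest_labels(column_sample: list, min_hit_rate: float = 0.8) -> list:
    """Suggest classification labels for a column from a value sample:
    a label applies if most sampled values match its pattern."""
    labels = []
    for label, pattern in PATTERNS.items():
        hits = sum(bool(pattern.match(v)) for v in column_sample)
        if column_sample and hits / len(column_sample) >= min_hit_rate:
            labels.append(label)
    return labels

print(suggest_labels(["a@x.com", "b@y.org", "c@z.io"]))  # ['PII:email']
print(suggest_labels(["hello", "world"]))                # []
```

The suggested labels then go through the steward-override step the text describes; automation proposes, humans confirm.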
The value of a classification label is entirely in what it triggers. A PII label should drive masking in non-production environments, row-level access control in queries, retention-policy enforcement, and inclusion in data-subject-access-request workflows. A "confidential" label should restrict which users and groups can query the asset. A "restricted" label should additionally route any new access request through a named approver. If labels are stored but not acted on, classification has failed; if every label is wired to concrete controls, classification is the mechanism through which policy actually reaches the query.
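The label-to-control wiring is, at its core, a lookup table that the enforcement points consult. A minimal sketch, with hypothetical control names standing in for the real masking, access, and retention mechanisms:

```python
# Hypothetical policy table: each classification label pulls in the
# controls described in the text.
POLICY = {
    "PII":          {"mask_in_nonprod", "row_level_access",
                     "retention_policy", "dsar_workflow"},
    "confidential": {"restricted_groups"},
    "restricted":   {"restricted_groups", "approval_required"},
}

def required_controls(labels: set) -> set:
    """Union of the controls triggered by an asset's labels."""
    return set().union(*(POLICY.get(label, set()) for label in labels))

print(sorted(required_controls({"PII", "confidential"})))
```

The point of writing policy per-label rather than per-column is visible in the function signature: a new table inherits its controls the moment it is classified, with no per-asset policy authoring at all.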
Certain categories are legally distinguished and carry particularly strict handling rules. Under GDPR, special category data includes race, ethnicity, political opinions, religious beliefs, trade-union membership, genetic and biometric data, health data, sexual orientation, and criminal convictions; processing any of these requires an explicit lawful basis. HIPAA's PHI designation is similarly strict in scope. Any classification scheme that fails to distinguish these categories from ordinary PII will fail its compliance obligations; serious programmes have explicit tags for each of the special categories relevant to their jurisdictions.
Classification is the interface between metadata and policy. It is the layer at which the metadata team says "this column contains this kind of data" and the access team says "that kind of data gets this kind of control". Without classification, every policy must be written per-column by hand; with classification, policies can be written per-label and enforced across the entire estate.
Once data is classified, the next job is to enforce who can read or write it. Modern warehouses and lakes support an increasingly rich vocabulary of access controls: at the schema, table, column, and row level, with role-based and attribute-based variants, and with delegation, temporary grants, and audit logs. Getting this layer right is where a governance programme is, in a concrete sense, doing compliance work; everything upstream is in service of controls at this layer landing on the right columns and rows.
Role-based access control (RBAC) is the default: users belong to groups, groups hold roles, roles grant privileges on specific objects. It is simple, auditable, and works well until the number of fine-grained distinctions outgrows the number of roles anyone can keep track of. Attribute-based access control (ABAC) grants access based on runtime attributes of the user (department, clearance, location) and the object (classification label, region, sensitivity); it scales to finer-grained policies at the cost of harder reasoning about "why does this user have access to this row". Row-level and column-level policies are the modern warehouse primitives that let a single view expose different rows or mask different columns to different users based on the policy function, without multiplying physical tables or views.
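An ABAC decision combines attributes of the user and the object at evaluation time. The toy policy below (all attribute names and rules are illustrative) shows the shape, including why "why does this user have access" gets harder to answer than under RBAC:

```python
TIERS = ["public", "internal", "confidential", "restricted"]

def abac_allow(user: dict, obj: dict) -> bool:
    """Toy ABAC decision: sufficient clearance for the object's
    sensitivity tier, matching region, and for PII-labelled objects,
    membership of a PII-approved department."""
    if TIERS.index(user["clearance"]) < TIERS.index(obj["sensitivity"]):
        return False
    if obj.get("region") and user["region"] != obj["region"]:
        return False
    if "PII" in obj.get("labels", set()) and \
            user["department"] not in {"legal", "privacy"}:
        return False
    return True

analyst = {"clearance": "confidential", "region": "eu",
           "department": "analytics"}
table   = {"sensitivity": "internal", "region": "eu", "labels": set()}
print(abac_allow(analyst, table))  # True
```

Note that the answer depends on three runtime attributes at once; auditing it requires recording the attribute values at decision time, not just the grant.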
A decade ago, fine-grained access control required a layer above the warehouse (tools like Immuta, Privacera, Apache Ranger). The cloud warehouses have since absorbed much of this functionality directly. Snowflake has row-access policies, masking policies, and dynamic data masking. BigQuery has column-level access control, row-level security, and data masking tied to policy tags. Databricks Unity Catalog provides attribute-based access and dynamic views. For most teams, the warehouse-native controls are the right starting point; the dedicated tools are still valuable for cross-warehouse enforcement and for richer policy languages.
The governing principle is least privilege: grant the minimum access required to do the job, and nothing more. In practice this requires two practices that organisations rarely get right on their own. Provisioning through groups, not users: access is granted to roles or groups, and people get access by being added to a group, so that access can be revoked by removing group membership when someone changes teams. Periodic access review: at a fixed cadence (often quarterly), asset owners review who has access to their data and confirm that it is still appropriate. Both practices become tractable when the access system and the active-metadata platform are integrated; without that integration, they drift quickly into shelfware.
Access without audit is a liability: a regulator asks who accessed a specific record on a specific date, and the organisation must be able to answer with confidence. Modern warehouses emit access logs at the query level; the task is to collect, retain, and query those logs. SOC 2 and HIPAA both require the capability directly. A working setup routes warehouse access logs into a long-term store (often the lake itself), indexes them for query, and surfaces a user-oriented view in the active-metadata platform so that incident response does not require running ad-hoc SQL at 2am against the audit log.
An access-control system that restricts PII columns is useful only if the PII columns are reliably labelled. This is why Sections 11 and 12 are a pair: classification without enforcement is paperwork, and enforcement without classification is either too broad (everything locked down) or too narrow (sensitive data hidden in plain sight). The two layers carry the programme together.
Access control (Section 12) restricts who can see which rows and columns in their original form. Privacy techniques go further: they modify the data itself so that analysis, testing, and ML training can proceed without revealing the individuals whose records they contain. The toolkit runs from simple masking all the way through formal differential privacy, and each technique trades some combination of utility, performance, and mathematical rigour.
Masking replaces sensitive values with synthetic or partial values — a credit-card number becomes ****-****-****-1234, an email address becomes j***@company.com. Redaction removes the field entirely. Pseudonymisation replaces direct identifiers with consistent synthetic tokens, so that records can still be joined but the real identity is not present in the dataset. All three are cheap, effective for operational use (test environments, internal analytics), and insufficient as a legal matter for anything labelled as fully de-identified: pseudonyms can often be re-identified by combining with other attributes, as the famous Netflix Prize and AOL search-log incidents demonstrated.
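The three techniques are short enough to sketch directly. The pseudonymisation function uses a keyed hash so that the same input always yields the same token (preserving joins) while the identity itself never appears in the dataset; the key shown is illustrative and would be a managed secret in practice:

```python
import hashlib
import hmac

def mask_card(pan: str) -> str:
    """Partial masking: keep only the last four digits."""
    return "****-****-****-" + pan[-4:]

def mask_email(addr: str) -> str:
    local, _, domain = addr.partition("@")
    return local[0] + "***@" + domain

def pseudonymise(value: str, key: bytes) -> str:
    """Consistent pseudonym via keyed hash (HMAC-SHA256): same input,
    same token, so records still join; the key is a managed secret."""
    return hmac.new(key, value.encode(), hashlib.sha256).hexdigest()[:16]

print(mask_card("4111111111111234"))   # ****-****-****-1234
print(mask_email("jane@company.com"))  # j***@company.com
key = b"demo-key-not-for-production"
token = pseudonymise("jane@company.com", key)
```

The caveat in the text applies directly to this sketch: the token is stable, so anyone who can correlate it with other attributes may re-identify the person. That is exactly why pseudonymisation does not count as anonymisation under GDPR.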
Tokenisation maps sensitive values to random tokens through a lookup vault; systems that need the real value go through the vault, systems that do not store only the token. It is the standard approach for payment-card data and is well-supported by commercial vendors (Protegrity, Thales). Format-preserving encryption produces ciphertext that looks like valid plaintext (a 16-digit credit card encrypts to a different 16-digit number) so that downstream systems with strict format expectations do not break. Both techniques are reversible given the key or the vault, which is why they are compliance-appropriate for storage but not a substitute for real de-identification.
k-anonymity, introduced by Latanya Sweeney in 2002, requires that every record in a released dataset be indistinguishable from at least k−1 others on its identifying attributes. Later refinements — l-diversity, t-closeness — patched specific weaknesses where k-anonymous data still leaked information about sensitive attributes. The framework is useful and widely taught, but it is now recognised as fundamentally vulnerable to composition attacks: combining multiple k-anonymous releases, or combining one with an external dataset, can often re-identify individuals. It remains a legitimate tool for specific release scenarios, not a general-purpose solution.
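Checking k-anonymity is mechanically simple: group the records by their quasi-identifying attributes and find the smallest group. A minimal sketch with invented records:

```python
from collections import Counter

def k_anonymity(records: list, quasi_identifiers: list) -> int:
    """Smallest equivalence-class size over the quasi-identifiers:
    the dataset is k-anonymous for any k up to this value."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers)
                     for r in records)
    return min(groups.values()) if groups else 0

rows = [
    {"age": "30-39", "zip": "SW1", "diagnosis": "flu"},
    {"age": "30-39", "zip": "SW1", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "N1",  "diagnosis": "flu"},
]
print(k_anonymity(rows, ["age", "zip"]))  # 1 -- the N1 record is unique
```

The usual remedies when the check fails are generalisation (coarser age bands, shorter postcode prefixes) and suppression (dropping unique records), both of which trade utility for anonymity; the composition weakness described above remains either way.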
Differential privacy (Dwork and colleagues, 2006) is the modern mathematical framework for quantifying privacy loss. A mechanism is ε-differentially private if its output distribution changes by at most a factor of e^ε when any single record is added to or removed from the dataset; the parameter ε is a formal privacy budget that composes across queries in a well-understood way. Differential privacy has moved from theory to practice over the last decade: the US Census Bureau used it for the 2020 decennial release; Apple, Google, and LinkedIn run differentially private telemetry; Microsoft's SmartNoise and Google's dp-accounting library make the machinery accessible. It is the strongest formal guarantee available, and it trades utility for privacy in ways teams need to measure carefully before adopting.
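The simplest differentially private mechanism is the Laplace mechanism for counting queries: a count has sensitivity 1 (one record changes it by at most 1), so adding noise drawn from Laplace(0, 1/ε) satisfies ε-differential privacy. A minimal sketch using the inverse-CDF method to sample the noise:

```python
import math
import random

def dp_count(true_count: int, epsilon: float) -> float:
    """epsilon-DP count via the Laplace mechanism. Sensitivity of a
    count is 1, so the noise scale is 1/epsilon."""
    # Sample Laplace(0, 1/epsilon) by inverse CDF: u in [-0.5, 0.5).
    u = random.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    noise = -(1 / epsilon) * sign * math.log(1 - 2 * abs(u))
    return true_count + noise

random.seed(0)
# Smaller epsilon = stronger privacy = noisier answers.
print(round(dp_count(1000, epsilon=1.0), 1))
print(round(dp_count(1000, epsilon=0.1), 1))
```

The utility trade is visible in the scale parameter: halving ε doubles the expected noise, and the budget spent here is no longer available for other queries against the same data, which is what the composition property formalises.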
In practice the techniques layer. Pseudonymisation and masking handle the vast majority of internal use: dev and test environments, internal dashboards, analyst queries. Tokenisation handles regulated-field storage: payment cards, social security numbers. Differential privacy handles the most sensitive external releases, the published aggregate statistics, and training regimes where the trained model might be queried to infer whether specific records were in the training set. No single technique is sufficient across all four use cases; the design job is matching technique to risk and to utility requirement.
A recurring lesson in the privacy literature is that "de-identified" data is often not: combining age, gender, and postcode is enough to uniquely identify a majority of individuals in many populations. Teams that need genuinely defensible privacy claims — rather than engineering conveniences — should assume ad-hoc masking is not enough and reach for tokenisation, aggregation thresholds, or differential privacy depending on the threat model. Chapter 22 (Privacy in ML in Part X) revisits these questions in the ML-specific setting.
A working governance programme is ultimately accountable to specific legal and contractual regimes. An engineer does not need to be a lawyer, but they do need enough literacy to understand what each regime actually requires of the data platform, and where the high-leverage engineering choices sit. The five regimes that cover most of the ground for an English-speaking company operating internationally are GDPR, CCPA, HIPAA, SOC 2, and ISO 27001.
The General Data Protection Regulation (in force since 2018) applies to any organisation processing personal data of people in the EU or EEA, regardless of where the organisation itself is located. Its core obligations most relevant to a data platform are: a legal basis for every processing activity (usually consent, contract, or legitimate interest); data-subject rights including access, correction, portability, and erasure (the right to be forgotten); purpose limitation (data collected for one purpose cannot be used for another without new legal basis); storage minimisation (retain only as long as needed); and breach notification within 72 hours. The erasure right is the most architecturally consequential: it requires the platform to be able to find and delete all traces of a specific person on request, which is non-trivial once data has been propagated to warehouses, lakes, and backups.
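The erasure workflow is where classification pays off architecturally: if the metadata layer records which tables hold personal data and which column identifies the subject, the deletes can be generated rather than hand-written. A sketch with hypothetical table names (a real system would use parameterised queries rather than string interpolation, cover backups and lineage-derived tables, and write an audit record of the erasure itself):

```python
# Hypothetical classification metadata: PII-bearing table -> the
# column that identifies the data subject.
PII_TABLES = {
    "crm.contacts":         "email",
    "billing.invoices":     "customer_email",
    "analytics.web_events": "user_email",
}

def erasure_statements(subject_email: str) -> list:
    """Generate per-table deletes for a data-subject erasure request.
    Illustration only: production code must use bind parameters, not
    string interpolation, and must also handle derived and backup data."""
    return [
        f"DELETE FROM {table} WHERE {column} = '{subject_email}'"
        for table, column in PII_TABLES.items()
    ]

for stmt in erasure_statements("jane@company.com"):
    print(stmt)
```

The hard part in practice is not the deletes but the coverage: every propagated copy, every derived table, and every backup has to be found, which is why the erasure right depends so heavily on the lineage and classification machinery of Sections 6 and 11.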
The California Consumer Privacy Act (effective 2020, amended by CPRA in 2023) gives California residents rights broadly similar to GDPR's — access, deletion, opt-out of sale, non-discrimination — with some structural differences around what counts as "sale" and what constitutes a service-provider exemption. CCPA has been joined by Virginia (VCDPA), Colorado (CPA), Connecticut (CTDPA), Utah (UCPA), and a growing list of other state laws, with overlapping but non-identical obligations. For a company operating across US states, the practical approach is usually to build to GDPR-like capabilities and apply them across the board, since retrofitting per-state compliance after the fact is more expensive than building once to the strictest regime.
HIPAA governs protected health information (PHI) in the United States; any platform handling PHI for a covered entity or business associate must meet specific safeguards around access, transmission, and disclosure, and must sign formal business-associate agreements with its vendors. Other sector regimes carry similar weight: PCI DSS for payment-card data, FERPA for education records, GLBA for consumer financial data, sector-specific regimes for children's data (COPPA) and in specific regulated industries. Each regime dictates specific controls; a platform serving multiple sectors typically ends up with the union of their requirements, often routed through its classification (Section 11) and access-control (Section 12) systems.
SOC 2 (specifically Type II) is the de facto standard for B2B SaaS trust in North America; it requires a set of controls across security, availability, processing integrity, confidentiality, and privacy, audited annually. ISO 27001 is the equivalent international standard, certifying an information security management system. Neither is a law; both are contractual expectations that customers increasingly require before they will sign a commercial agreement. From a data-platform perspective, SOC 2 and ISO 27001 compliance is mostly a matter of documenting controls that a mature governance programme already has in place: classification, access review, audit logs, incident response, change management, backup and recovery.
A newer regulatory layer specifically targets ML systems rather than data alone. The EU AI Act (adopted 2024) imposes risk-tiered obligations on AI systems, including transparency, human oversight, documentation of training data, and post-market monitoring for "high-risk" systems. The US NIST AI Risk Management Framework, executive orders, and state-level proposals add further obligations. Section 17 revisits these from the ML angle; from the data-platform angle, they largely reinforce existing lineage, provenance, and documentation requirements, extended to cover training datasets and model outputs as first-class governed assets.
Treating compliance as a checklist of annual paperwork is how organisations end up with brittle, expensive audits and unpleasant surprises when a new regime lands. Treating compliance as a set of platform capabilities — discoverable data, classified data, access-controlled data, audited data, deletable data — makes each new regime a small incremental obligation on top of a working base, instead of a fresh project from zero.
Compliance asks "is this allowed". Ethics asks a different question: "is this appropriate, given what we know and what we can foresee". A platform can be fully compliant with GDPR, HIPAA, and SOC 2 and still be used for surveillance, discrimination, or manipulation in ways the organisation would be embarrassed to defend publicly. Over the last decade, data ethics has matured from an academic preoccupation into an operational practice, and most serious data platforms now include ethical review as an explicit step for the highest-stakes uses.
The discipline has a small canon that repays reading. Cathy O'Neil's Weapons of Math Destruction (2016) catalogued the ways automated decision systems compound inequality. Virginia Eubanks's Automating Inequality (2018) traced the same dynamics through US social services. Ruha Benjamin's Race After Technology (2019) extended the argument to a structural analysis of how engineering decisions encode historical bias. Safiya Noble's Algorithms of Oppression (2018) focused on search and recommendation. The literature consistently identifies a small number of recurrent failure modes: biased training data leading to biased predictions, proxy variables that stand in for protected attributes, feedback loops that amplify inequality, and performance metrics that hide disparate impact.
The practical operational form is a review gate: before a high-stakes data use goes live, a designated group reviews the proposed use for ethical considerations beyond legal compliance. Useful review frameworks (Fairness, Accountability, Transparency principles; Data & Society's guidelines) structure the review around questions like: who is this system making decisions about, and do they have recourse if it is wrong; what are the downstream consequences of systematic error; does the training data represent the affected population; are there proxies for protected attributes that will produce disparate impact. The review is not a publishing step; it is an engineering step with concrete deliverables (impact assessments, mitigations, monitoring plans).
Beyond the downstream ethics of how models are used, there is the upstream ethics of how training data is obtained. The legitimacy of large-scale web-scraped training corpora has been contested in courts and in public (the ongoing copyright and privacy lawsuits against LLM developers are one such arena). Serious programmes track training-data provenance explicitly: where did each dataset come from, under what license, with what consent, with what compensation to creators. The Chapter 01 treatment of data provenance is the starting point; ethical programmes require it to extend to every model on the platform, not just to the data underneath.
It is worth being honest about the limits of ethics programmes inside companies. They are constrained by the commercial interests of the companies that run them, they can be captured by executives who want a rubber stamp, and they tend to produce process rather than substance when the underlying product strategy demands a use the ethics programme would otherwise reject. The literature on corporate ethics (including the AI Now Institute's work and the documented disbandings of ethics teams at several major tech companies) argues that internal review must be paired with external accountability — regulators, auditors, and published commitments — to have real force. An honest programme says so explicitly rather than overstating its independence.
Ethics at the data-platform level is necessary but not sufficient. The full ethical frame for machine learning is covered in Part X of the compendium (AI Safety & Ethics), where the questions about alignment, bias, fairness, and societal impact are treated as first-class subjects. This section establishes that by the time a dataset is used to train a model, ethical review should already have happened; it has not in many high-profile cases, and that is a failure mode of the discipline rather than an inevitability.
Tools matter; so do roles. Even the best catalogue, observability platform, and access system fail if no one is responsible for the quality of a specific asset, no one owns the decision to change a specific contract, and no one is on call when a specific pipeline misbehaves. The organisational layer that makes tools stick is the governance operating model: a small, consistent vocabulary for who decides what.
Three roles recur in every governance framework. A data owner is accountable for a dataset at the business level: the VP of Finance owns the general ledger tables, the Head of Marketing owns the campaign-attribution data. A data steward is the operational counterpart: the analyst or engineer who maintains the definition, quality, and documentation of the asset day-to-day. A data custodian is the technical role responsible for the infrastructure: storage, backup, access, platform health. The distinction is that the owner decides policy, the steward executes it, and the custodian operates the platform underneath. A platform in which the same person plays all three roles for everything is either very small or very confused; a platform in which the three are distinct and named makes governance decisions traceable.
The structural question is how the governance programme relates to the rest of the organisation. A centralised model puts all governance into a single team; it is easy to coordinate but easy to bottleneck. A federated model keeps a small central team for standards and tooling and pushes ownership and stewardship into the domain teams that produce the data; it is the dominant pattern in mid-sized companies. The data mesh model (Zhamak Dehghani, 2019–2020) takes the federation further, treating each domain team as a "data product" team that owns its datasets end-to-end with a light-touch central platform. None of these is universally right; the choice depends on the organisation's size, the heterogeneity of its domains, and its appetite for the cultural work that federation requires.
Beyond individual roles, most programmes rely on a couple of standing bodies. A data governance council (sometimes data governance committee) is a cross-functional group — finance, legal, security, engineering, product — that sets policy, approves classifications for novel data, resolves cross-domain disputes, and acts as the escalation point. A lightweight RACI chart (Responsible, Accountable, Consulted, Informed) per major decision type makes clear who does what when a new dataset is onboarded, a new access request is reviewed, or a policy is revised. Much organisational pain in governance traces to unstated decision rights; the RACI is a small investment that prevents large amounts of confusion.
The organisational failure mode almost every governance programme eventually faces is that the people expected to do governance work are rewarded for other things. An engineer hired to ship product features is not naturally going to document a dataset, respond to catalogue questions, or review access quarterly. A governance programme that cannot write those activities into job descriptions, performance reviews, and team OKRs will devolve into a small group of enthusiasts doing heroic work that does not scale. The mature pattern is to make the governance work visible in the same places that other work is visible: as explicit objectives, with explicit owners, tracked alongside everything else.
Tooling is necessary; the organisation is the thing that actually holds the programme together. A successful governance effort ends up as much a set of role definitions, decision rules, and reporting lines as it is a stack of open-source and SaaS products. Treating it as an engineering-only initiative, without investing in the organisational structure, is the single most common way a governance programme fails.
Every quality, lineage, classification, and access-control problem described in this chapter applies to machine learning as much as to analytics — with the added difficulty that errors in the training data propagate into trained parameters that are much harder to audit than a SQL query. ML also introduces governance artefacts the previous sections have not yet named: training-data provenance, feature lineage, model cards, and the dawning regulatory framework around model governance itself.
A trained model is a complicated summary of its training data. If the provenance of that data is unclear — which rows, which sources, which timestamps, which licences, which transformations — downstream debugging becomes impossible and downstream compliance becomes speculative. The discipline that has emerged (Gebru et al.'s Datasheets for Datasets, 2018; Hugging Face and Google's dataset cards) is to treat the training corpus as a governed asset in its own right, with a descriptor, lineage, quality metrics, and retention policy attached. The shift is from "we trained on some crawl of the web" to "we trained on specific versions of specific sources, documented and retained".
ML features — the numerical inputs presented to a model at training and inference time — are themselves data assets. They are produced by pipelines, consumed by models, served to production, and evaluated for drift. A platform without feature lineage cannot answer "which models depend on feature X" when X needs to change, and cannot trace a model degradation to its upstream cause. Feature stores (Feast, Tecton, SageMaker Feature Store, Vertex AI Feature Store) exist partly to solve this: they are the catalogue, lineage, and access layer for features, inheriting the same governance practices the rest of this chapter describes.
Mitchell et al.'s Model Cards (2019), together with Datasheets for Datasets and the IBM FactSheets initiative, define a lightweight documentation discipline for trained models: intended use, evaluation metrics across subgroups, known limitations, ethical considerations, training-data summary, and maintenance contacts. The practice has been adopted by Hugging Face (model cards for every model on the Hub), Google, and most major labs, and it is increasingly required by regulators. Model cards turn a trained artefact into something that can be documented, reviewed, and held to account in the way any other governed asset can; without them, a model is a black box that the organisation will struggle to reason about.
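The minimum viable form of a model card is a small structured record attached to the trained artefact. The sketch below uses illustrative field names loosely following Mitchell et al., not any formal schema; all the specific values are invented:

```python
from dataclasses import asdict, dataclass, field

@dataclass
class ModelCard:
    """Minimal model card in the spirit of Mitchell et al. (2019).
    Field names are illustrative, not a standardised schema."""
    name: str
    intended_use: str
    training_data: str        # provenance pointer, per Section 17
    metrics: dict             # metric -> value, ideally per subgroup
    limitations: list = field(default_factory=list)
    owner: str = ""

card = ModelCard(
    name="churn-classifier-v3",
    intended_use="Rank accounts for retention outreach; not for pricing.",
    training_data="warehouse.ml.churn_training, snapshot 2024-06-01",
    metrics={"auc": 0.87, "auc_smb_segment": 0.81},
    limitations=["Undertrained on accounts less than 90 days old"],
    owner="ml-platform team",
)
print(asdict(card)["name"])  # churn-classifier-v3
```

Even this much, kept in the metadata platform next to the feature lineage, turns "which model is this and what is it for" from an archaeology project into a lookup — the same move the catalogue made for datasets.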
Regulation is beginning to treat models themselves as governed objects, not just the data beneath them. The EU AI Act's documentation and post-market-monitoring obligations for high-risk AI systems are the most concrete example, but the trajectory in the UK, US, and other jurisdictions is in the same direction. Teams operating at the frontier of capability are already being asked to evaluate models for specific risks (biosecurity, CSAM, election interference, autonomous-agent capabilities) before release. The infrastructure this requires — red-team results, eval suites, capability documentation, release decisions — is a new branch of governance that builds directly on the data-governance machinery of the rest of this chapter.
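The release-decision machinery can be as simple as a gate that refuses to ship until the required governance artefacts exist. A minimal sketch, assuming an invented artefact checklist (the three names below are illustrative, not drawn from any specific regulation):

```python
# Hypothetical pre-release gate: block a model release unless the required
# governance artefacts (eval results, red-team sign-off, model card) exist.
REQUIRED_ARTEFACTS = {"eval_suite_results", "red_team_report", "model_card"}

def release_gate(artefacts: dict) -> tuple:
    """Return (approved, missing) given a map of artefact name -> completed."""
    missing = sorted(a for a in REQUIRED_ARTEFACTS
                     if not artefacts.get(a, False))
    return (len(missing) == 0, missing)

ok, missing = release_gate({"eval_suite_results": True, "model_card": True})
print(ok, missing)
# → False ['red_team_report']
```

In practice the gate runs in CI against the model registry, and the record of what was checked and when becomes part of the post-market-monitoring evidence the regulation asks for.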
This is the final chapter of Part III. Over seven chapters, the compendium has traced data from the moment it is captured through storage, pipelines, streaming, distributed compute, cloud infrastructure, and now the quality, metadata, and policy layer that makes the whole thing trustworthy. Part IV opens with classical machine learning — supervised regression, classification, the bias-variance tradeoff, and the model families that preceded deep learning. Those chapters inherit everything Part III built: every model runs on governed data, queries a catalogued feature store, consults a monitored pipeline, and eventually answers to the same regulatory and ethical obligations this chapter described. The substrate is finished; the models are what comes next.
The data governance literature is unusually uneven — a handful of foundational books, a much larger body of tool-specific documentation, and a scattering of essays and standards that have become canonical mostly by repetition. The list below picks the references that repay re-reading: the canonical textbook (DAMA-DMBOK), the data-contracts and data-mesh literature that reshaped the field recently, the main open-source platform docs, the privacy and compliance texts, and the ML-specific governance documents that bridge into Part IV.