The previous six chapters built the pipes: collection, storage, pipelines, streaming, distributed compute, cloud. This chapter is about the other half of the problem — how an organisation knows what data it has, what it means, whether it is correct, who can see it, and who is responsible when it goes wrong. The answers come from a loose but coherent discipline known as data governance, and from three bodies of practice that have matured around it: data quality (dimensions, tests, contracts, observability), metadata (catalogues, lineage, active-metadata platforms), and policy (classification, access control, privacy techniques, and the legal regimes they answer to). The goal is not paperwork; the goal is a platform where a new analyst can find a trusted dataset in ten minutes, a broken upstream table is caught before the dashboard lies, and a regulator's question about a specific record can be answered in hours rather than weeks.
The first section motivates the whole effort: why governance is not optional once more than a handful of people depend on the same data. Section two breaks quality into its canonical dimensions — accuracy, completeness, consistency, timeliness, uniqueness, validity — because disagreements about "bad data" are usually disagreements about which of these is failing. Sections three through six are the quality toolchain in order of escalation: data contracts between producers and consumers, data tests that run in the pipeline, data observability that watches production freshness and volume, and data lineage that traces an error to its source. Sections seven through nine are the metadata layer: the catalogue where humans find datasets, the active-metadata platforms (DataHub, OpenMetadata, Amundsen, Atlas) that have replaced last decade's passive ones, and schema management as the contract that actually reaches the wire. Section ten is master data management — the flavour of governance that tries to resolve "which customer record is the real one" across systems. Sections eleven and twelve are policy: classification (what kind of data is this) and access control (who can see what). Sections thirteen and fourteen are the protections that sit on top: privacy techniques (masking, tokenisation, differential privacy) and the compliance regimes (GDPR, CCPA, HIPAA, SOC 2) that the whole stack ultimately answers to. Section fifteen is ethics — the practices that exist because "legal" is not the same as "acceptable". Section sixteen is the human layer: roles, ownership, the data-governance operating model that ties the technology to a functioning organisation. Section seventeen closes with where governance compounds specifically for ML: training-data provenance, feature lineage, model cards, and the regulatory landscape that now applies to models as well as to data.
A note on tone. Governance has a long and deserved reputation for being the part of a data stack that is discussed mostly in the past tense, after something broke. The modern practice is deliberately the opposite — it treats quality, metadata, and policy as engineering concerns with their own tools, their own tests, their own on-call rotations, and their own roadmaps. This chapter follows that practice. Where possible it names the specific open-source or commercial tools that embody each idea (Great Expectations, Soda, dbt tests, Monte Carlo, Bigeye, DataHub, OpenLineage, Marquez, Collibra, Alation, Atlas) not as endorsements but as anchor points for the concept. This is also the final chapter of Part III, so Section seventeen is written as a bridge into Part IV: classical machine learning inherits every quality, lineage, and access problem described here, with the added complication that the errors now propagate into trained parameters that are much harder to audit than a row in a warehouse.
A five-person startup can get by without governance; everyone knows what every table means and who owns it. A fifty-person company cannot, and a five-thousand-person company cannot even come close. Governance is the discipline by which an organisation keeps its data assets understandable, correct, and defensible as the number of producers, consumers, and systems grows. It is unglamorous work, but it is almost always the gap between a data platform that accelerates the business and one that quietly poisons it.
Three forces make governance unavoidable once a platform crosses a certain size. The first is scale of use: once the number of analysts, pipelines, and dashboards depending on a table is large, a silent change in definition or a quiet correctness regression propagates widely and invisibly. The second is regulatory surface: GDPR, CCPA, HIPAA, SOC 2, and an expanding set of sector-specific regimes impose concrete obligations on how personal and sensitive data is stored, transferred, accessed, and deleted — obligations that cannot be satisfied without a catalogue, classifications, access records, and audit trails. The third is trust: executive dashboards, external reports, and ML models trained on production data all require a claim of correctness, and that claim cannot be made without machinery to back it up.
The platforms in trouble tend to look alike. Multiple tables claim to be the source of truth for the same concept and disagree by a few percent. Pipelines silently double-count on the day the clocks change. A column quietly changes units from dollars to cents and nobody notices for a month. A departed engineer was the only person who understood a critical transformation. An auditor asks who has access to a specific customer's records and the honest answer is "we don't really know". None of these failures require the organisation to be careless — they require only that it has grown faster than its ability to keep track of what it built.
It helps to define the term, because it is often used loosely. Data governance is the set of practices by which an organisation exercises authority and control over its data assets — deciding who owns what, what quality standard applies, how access is granted, how lineage is tracked, and who is accountable when something breaks. The practices fall into three bundles that the rest of this chapter unpacks: quality (is this data correct and fit for purpose), metadata (what is this data, where did it come from, who owns it), and policy (who is allowed to see it, and under what constraints). A mature programme addresses all three; an immature one usually addresses one and hopes the others take care of themselves.
The governance practices that succeed are the ones that live in the pipeline alongside the code — automated tests, lineage emitted by the orchestrator, classifications enforced by the access layer, catalogue entries that update when tables do. The practices that fail are the ones that live in spreadsheets and slide decks maintained by a separate team. This chapter is written on the assumption that governance is an engineering discipline; every section names the tools and mechanisms that make the practice real.
"The data is bad" is too coarse to act on. The professional discipline of data quality breaks the concept into a small number of orthogonal dimensions, each with its own tests and its own remedies. The canonical list varies slightly between authorities — DAMA-DMBOK, ISO 8000, academic treatments — but the working set used in practice is six: accuracy, completeness, consistency, timeliness, uniqueness, and validity.
Accuracy asks whether values match reality: is the customer address in the record actually where the customer lives. Completeness asks whether the values that should be present are present: are any order rows missing a total. Consistency asks whether related values agree across a row, across a table, or across systems: does the sum of line items equal the invoice total; does the customer in the CRM have the same ID as in the warehouse. Timeliness asks whether data is fresh enough for the purpose: a one-day-old feed can be fine for weekly reporting and unusable for real-time fraud detection. Uniqueness asks whether the same real-world entity is represented exactly once: one customer, one row, or else duplicates resolved before use. Validity asks whether values conform to their declared domain: email addresses that parse, postcodes that exist, enum fields that hold only enumerated values.
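The dimensions become concrete once they are computed. A minimal sketch over a handful of invented order rows (the field names and values are illustrative, not from any real system); accuracy and consistency are deliberately absent, because the first needs an external reference and the second needs a second system to compare against:

```python
# Four of the six dimensions, measured over invented order rows.
# Accuracy needs ground truth and consistency needs a second system,
# so neither can be computed from this table alone.
orders = [
    {"id": 1, "email": "a@example.com", "total": 100, "updated": "2024-01-02"},
    {"id": 2, "email": None,            "total": 250, "updated": "2024-01-02"},
    {"id": 2, "email": "c@example",     "total": None, "updated": "2023-12-01"},
]
n = len(orders)

# Completeness: required values actually present.
completeness = sum(r["total"] is not None for r in orders) / n          # 2/3

# Uniqueness: the primary key appears exactly once.
ids = [r["id"] for r in orders]
uniqueness = sum(ids.count(i) == 1 for i in ids) / n                    # 1/3

# Validity: non-null emails that conform to a (crude) domain rule.
emails = [r["email"] for r in orders if r["email"] is not None]
validity = sum("@" in e and "." in e.split("@")[-1] for e in emails) / len(emails)  # 1/2

# Timeliness: rows updated within the expected window (ISO dates sort lexically).
timeliness = sum(r["updated"] >= "2024-01-01" for r in orders) / n      # 2/3
```

A real implementation runs these as SQL over the warehouse rather than Python over rows, but the arithmetic is the same, which is why the dimensions lend themselves so readily to automation.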
The taxonomy matters because different dimensions have different fixes. A completeness problem is usually a pipeline-coverage problem: a source is silently dropping records, or a join is losing rows. A validity problem is usually a schema-and-contract problem: the upstream system is emitting values the downstream system does not expect. A uniqueness problem is often an entity-resolution problem (Section 10). A consistency problem frequently indicates that two systems disagree about a shared identifier or a shared definition, which is often the deepest and most organisational failure. Naming the dimension is the first step toward addressing the right cause.
It is tempting, and usually wrong, to treat quality as an absolute. The mature framing is fitness for purpose: data is "good enough" when it is suitable for the decisions and models that will consume it. A marketing segmentation can tolerate a few percent duplicate customers; a regulatory filing cannot. The quality bar should be set per use, and the same asset can be "clean enough" for one pipeline and "not clean enough" for another. Contracts and catalogues (Sections 3 and 7) make this explicit; they publish the quality claim the producer is willing to stand behind, and consumers decide whether it meets their bar.
Each dimension gets more expensive as it approaches the asymptote. Raising completeness from 95 percent to 99 percent is usually cheap; raising it from 99 percent to 99.99 percent often requires a new data source, additional joins, or a human review step. A governance programme that pretends every dimension should be at 100 percent for every asset will simply be ignored; one that sets a defensible target per dimension per asset — and measures against it — earns the attention of the engineers who have to meet the targets.
The six-dimension framework is a diagnostic checklist, not a strategy. Its job is to make quality conversations precise — "we have a completeness problem on orders.customer_id" — so that the right test, the right alert, and the right owner can be put in place. Sections 3 through 6 are the engineering practices that convert those diagnoses into automated checks.
For most of the last decade, data teams caught problems downstream — in the warehouse, after the pipeline had already loaded questionable data. Data contracts invert that stance. A contract is an explicit, machine-checked agreement between a producer (usually an application service) and its consumers (usually analytics pipelines and ML features) about the schema, semantics, quality, and change-management rules of a dataset. The idea rose to prominence in 2021–2023 through a series of widely read essays by Chad Sanderson and others, and through the data-contract tooling that appeared in its wake.
A useful contract covers at least four things. The schema: the exact set of fields, their types, their nullability, and their semantics. The quality guarantees: what completeness, uniqueness, validity, and freshness the producer will maintain. The change policy: how breaking changes are announced, how long the deprecation window is, and which changes require a new version of the contract. The ownership: who is accountable on the producer side, who is accountable on the consumer side, and what the escalation path is when a violation is detected. The contract is committed to source control, reviewed like code, and enforced by automated checks at the producer boundary.
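As a sketch of what such an artefact can look like in source control (the YAML below is illustrative; the field names, team names, and thresholds are invented, and real contract tools each have their own format):

```yaml
# Hypothetical contract for an orders event stream. All names are illustrative.
contract: orders_v2
owner:
  producer: team-checkout
  consumer_escalation: data-platform-oncall
schema:
  fields:
    - {name: order_id,     type: string, nullable: false, semantics: "UUID, unique per order"}
    - {name: amount_cents, type: int64,  nullable: false, semantics: "USD cents, never negative"}
quality:
  completeness: "100% of rows carry order_id and amount_cents"
  freshness:    "events visible downstream within 15 minutes"
  uniqueness:   "order_id unique within a 24-hour window"
change_policy:
  breaking_changes: "new major version, two-week deprecation window"
  announcement:     "contracts channel plus catalogue changelog"
```

The point of the format is less the syntax than the review: a pull request against this file is the moment the producer and consumer negotiate.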
The distinguishing feature of contracts, compared with older schema-registry and data-quality practices, is that they are enforced at the point where the data is produced — before it ever lands in the warehouse. A production service that emits a Kafka event, writes to an operational database, or publishes a CDC stream validates its output against the contract at emit-time; violations are surfaced to the producing team, not to the downstream analytics team. This matters because the cheapest time to fix a data bug is when the engineer who wrote the emitting code is still looking at their own pull request.
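A minimal sketch of emit-time enforcement, assuming a contract reduced to per-field rules (the field names, rules, and functions are invented for illustration):

```python
# Emit-time contract enforcement, sketched: validate before publishing,
# and surface violations to the producer rather than downstream.
CONTRACT = {
    "order_id":     {"type": str, "required": True},
    "amount_cents": {"type": int, "required": True, "min": 0},
}

def violations(event: dict) -> list[str]:
    """Return contract violations for one event; an empty list means it may be emitted."""
    problems = []
    for field, rule in CONTRACT.items():
        value = event.get(field)
        if value is None:
            if rule.get("required"):
                problems.append(f"{field}: required field missing")
            continue
        if not isinstance(value, rule["type"]):
            problems.append(f"{field}: expected {rule['type'].__name__}")
        elif "min" in rule and value < rule["min"]:
            problems.append(f"{field}: below minimum {rule['min']}")
    return problems

def emit(event: dict, publish) -> bool:
    """Publish only if the event satisfies the contract."""
    if violations(event):
        # In a real service this fails the producer's request or CI check,
        # rather than silently dropping the event for downstream teams to find.
        return False
    publish(event)
    return True
```

Real enforcement points vary — a Kafka serializer hook, a CI check against sample payloads, a sidecar validator — but the shape is the same: the check runs where the producing engineer sees the failure.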
Protobuf and Avro schemas, and Kafka Schema Registry, predate the contract movement by years, and they are often cited as if they solve the same problem. They do not. A schema ensures the payload parses; a contract additionally ensures the payload means what the consumer expects, at the quality level the consumer requires, with a change-management process the consumer can rely on. A schema says "this field is an int64"; a contract says "this field is the monetary amount in USD cents, never negative, present on 100% of rows, and any change to its definition requires a two-week deprecation and a new major version".
The technology for contracts is straightforward (a YAML schema, a validator, a CI check). The cultural work is the hard part: product teams do not automatically see themselves as responsible for the downstream analytics and ML use of the events they emit, and getting them to take on that responsibility requires executive support, incentives, and usually a catalyst event — a high-profile bad-data incident that made the CEO unhappy. Contracts work in organisations that have decided to take data seriously; they do not by themselves make an organisation take data seriously.
The productive analogy is that a data contract is to a dataset what a service API is to a microservice. Nobody would seriously propose running a microservice architecture without explicit, versioned, reviewed APIs; a modern data platform treats its internal datasets with the same seriousness. The terminology is recent, the underlying idea — explicit producer-consumer contracts, enforced at the boundary — is not.
Contracts describe what should be true; data tests verify it automatically, every run. The genre is directly analogous to unit tests in software, and like unit tests it is cheap to write, cheap to run, and enormous in the amount of silent failure it prevents. The three tools that have become standard — dbt tests, Great Expectations, and Soda — cover the space with different accents, but the idea is the same: attach named, executable assertions to specific datasets and fail the pipeline if they do not hold.
Most useful data tests fall into a small number of families. Column-level tests assert properties of a single column: not-null, uniqueness, value-in-set, value-in-range, regex match, referential integrity to another table. Row-level tests assert cross-column properties of a single row: end_date >= start_date, line_total = unit_price * quantity. Distribution tests assert statistical properties of the column as a whole: mean within a range, standard deviation stable, null rate below a threshold, cardinality within expected bounds. Relationship tests assert properties across tables: every order.customer_id has a matching customers.id; the sum of line-item totals in orders equals the aggregate total in the invoice table. Together these four kinds catch a large majority of the preventable correctness regressions.
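The four families reduce to plain assertions. The sketch below runs them over a few in-memory rows with invented names, standing in for what dbt, Great Expectations, or Soda would execute as SQL against the warehouse:

```python
# The four test families as plain assertions; table and column names are invented.
customers = [{"id": 10}, {"id": 11}]
orders = [
    {"id": 1, "customer_id": 10, "unit_price": 5, "quantity": 3, "line_total": 15},
    {"id": 2, "customer_id": 11, "unit_price": 2, "quantity": 4, "line_total": 8},
]

# Column-level: not-null and unique on the primary key.
ids = [r["id"] for r in orders]
column_ok = all(i is not None for i in ids) and len(ids) == len(set(ids))

# Row-level: a cross-column invariant checked on every row.
row_ok = all(r["line_total"] == r["unit_price"] * r["quantity"] for r in orders)

# Distribution: null rate on a column below a threshold.
null_rate = sum(r["customer_id"] is None for r in orders) / len(orders)
distribution_ok = null_rate <= 0.01

# Relationship: referential integrity across tables.
customer_ids = {c["id"] for c in customers}
relationship_ok = all(r["customer_id"] in customer_ids for r in orders)
```

In dbt these correspond roughly to the built-in generic tests (`not_null`, `unique`, `relationships`) plus custom SQL tests for the row-level and distribution cases.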
dbt tests are the dominant entry point for analytics-engineering teams. They live alongside the SQL models, are written in YAML, and run as part of the same DAG that builds the models; the result is that testing and transformation are inseparable in a way that earlier generations of tooling never achieved. Great Expectations is the older, more general library — a Python-first framework with a much larger vocabulary of built-in expectations, integrations to most pipeline orchestrators, and a first-class notion of a data-docs site that publishes test results. Soda is the newest of the three, designed from the start for an observability use case: write checks in a compact DSL (SodaCL), run them against warehouses and lakes, ship failures to alerts and dashboards. Many mature platforms use all three in different layers.
A test written and never run is decoration. The practical question is where in the pipeline a test executes. Good practice is to test at three points: at ingest, immediately after raw data lands, so that upstream regressions are caught before they contaminate the warehouse; at transformation, inside dbt or the equivalent, so that derived models are validated against the assumptions they depend on; and at the consumer boundary, on the specific tables that feed dashboards, ML features, and external reports. Each layer catches a different class of failure.
What should happen when a test fails is a design decision, not a default. Some failures should stop the pipeline and page an engineer: duplicate primary keys in a finance table, null rates above a dangerous threshold on an ML feature. Others should alert but proceed: a small distribution drift that is plausibly a real-world signal. Others should simply record the fact and continue: a tolerable anomaly with no corrective action available. Treating every failure as fatal trains the team to disable tests; treating every failure as a warning trains the team to ignore them. The policy should match the cost of the downstream failure, as Section 2's fitness-for-purpose framing requires.
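One way to make that policy explicit is a per-test severity map rather than a global default; the test names, tiers, and routing below are invented for illustration:

```python
# A per-test severity policy, sketched. Names and routing are illustrative.
BLOCK, ALERT, RECORD = "block", "alert", "record"

POLICY = {
    "finance.payments:unique_pk":  BLOCK,   # stop the pipeline and page on-call
    "features.user:null_rate":     BLOCK,
    "orders.daily:volume_drift":   ALERT,   # notify, but let the run proceed
    "marketing.leads:minor_drift": RECORD,  # log for trend analysis only
}

def handle_failure(test_name: str) -> str:
    """Return the action taken; raise when the policy says the run must stop."""
    severity = POLICY.get(test_name, ALERT)  # unknown tests default to alerting
    if severity == BLOCK:
        raise RuntimeError(f"{test_name} failed: halting pipeline and paging owner")
    return severity
```

The map itself belongs in source control next to the tests, so that the decision "this failure is fatal" is reviewed with the same care as the test it governs.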
Data tests are usually the single highest-leverage investment a governance programme can make in its first year. They are cheap to author, they catch real regressions, their output is legible to both engineers and auditors, and they compound: every test written today will catch a regression that would otherwise have been debugged painfully in production next year.
Job-level monitoring — was the pipeline run green, how long did it take, did the task retry — is necessary and insufficient. A pipeline can finish on time, with no errors, and still deliver a table whose row count has silently halved, whose freshness has slipped by hours, whose schema has quietly drifted, or whose primary business metric has moved outside its historical band. Data observability is the practice of monitoring the data itself, continuously, against profiles of what "normal" looks like.
The canonical taxonomy, popularised by Monte Carlo and widely adopted, names five pillars. Freshness: when was this table last updated, and is it within the expected cadence? Volume: how many rows does each run produce, and is that within the expected band? Schema: has the set of columns and their types changed, and was the change announced? Distribution: are the column-level statistics (mean, null rate, cardinality, histogram) within historical tolerances? Lineage: when a metric changes, which upstream table is the likely cause? The five cover most of the silent-failure surface that job-level monitoring misses.
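The freshness and volume pillars are small enough to sketch directly. The table name, cadence, and row bands below are invented; a real platform learns the profile from history rather than hard-coding it:

```python
from datetime import datetime, timedelta

# Freshness and volume, checked against an expected per-table profile.
# The profile values are invented for the sketch.
PROFILE = {
    "orders": {"cadence": timedelta(hours=1), "rows_low": 9_000, "rows_high": 15_000},
}

def check(table: str, last_updated: datetime, row_count: int, now: datetime) -> list[str]:
    """Return the pillar violations for one table snapshot."""
    p = PROFILE[table]
    findings = []
    if now - last_updated > p["cadence"]:
        findings.append("freshness: update overdue")
    if not p["rows_low"] <= row_count <= p["rows_high"]:
        findings.append("volume: row count outside historical band")
    return findings
```

Schema and distribution monitoring follow the same pattern with richer profiles; lineage (Section 5) is what turns a finding into a root cause.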
The commercial category is led by Monte Carlo, Bigeye, Acceldata, and Metaplane; their pitch is turnkey profiling of the warehouse plus automatic anomaly detection. Open-source alternatives include Elementary (built on dbt tests), Soda Core, and instrumentation via OpenLineage with lineage-aware dashboards. The commercial tools are typically easier to adopt and faster to show value; the open-source tools are cheaper over time and integrate more flexibly with homegrown platforms.
Observability and testing (Section 4) are complementary but different. A test is an assertion written by a human who knew what to check; observability is a profile learned from historical behaviour that detects deviations no one thought to write an assertion for. A test says "customer_id must never be null"; observability says "the null rate on customer_id has been 0.1% for six months and jumped to 3% yesterday — someone should look". Teams need both: tests for known-dangerous conditions, observability for the unknown unknowns.
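The "learned profile" can be as simple as a historical mean and standard deviation. A sketch of the null-rate example from the paragraph above, with invented numbers:

```python
import statistics

# A learned profile: daily null rates observed over history (numbers invented).
history = [0.0010, 0.0012, 0.0009, 0.0011, 0.0010, 0.0013, 0.0010]

def is_anomalous(today: float, history: list[float], sigmas: float = 3.0) -> bool:
    """Flag a value deviating from the historical mean by more than `sigmas` stddevs."""
    mu = statistics.mean(history)
    sd = statistics.stdev(history)
    return abs(today - mu) > sigmas * sd
```

Production systems use more robust statistics (seasonality, rolling windows, median absolute deviation), but the contrast with Section 4 holds: no human wrote an assertion about 3%, yet the jump is detected.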
Data observability's hardest engineering problem is not detection; it is alerting. A naive system that fires on every two-sigma deviation will drown the team in noise within a week. A mature system uses layered thresholds, dependency-aware suppression (don't alert on 20 downstream tables when the upstream source is the cause), per-asset criticality tiers, and business-hours sensitivity. Anomaly detection with no alerting hygiene is a research project; anomaly detection with alerting hygiene is an on-call rotation the team can actually sustain.
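Dependency-aware suppression can be sketched as a filter over the lineage graph: alert only on anomalous tables whose direct upstreams are healthy. The edge data below is invented:

```python
# Dependency-aware suppression: alert on root causes, not on every downstream
# table that inherited the anomaly. Edges map a table to its direct upstreams.
upstreams = {
    "raw_orders":   [],
    "stg_orders":   ["raw_orders"],
    "fct_revenue":  ["stg_orders"],
    "dash_revenue": ["fct_revenue"],
}

def root_causes(anomalous: set[str]) -> set[str]:
    """Anomalous tables none of whose direct upstreams are also anomalous."""
    return {t for t in anomalous
            if not any(u in anomalous for u in upstreams.get(t, []))}
```

If the whole chain fires, only `raw_orders` pages anyone; the remaining three findings are attached to the incident rather than raised as separate alerts.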
The slogan inside the data-observability community captures the gap: green pipelines do not imply correct data. Until a platform is watching the shape of its outputs as carefully as the health of its jobs, it has a persistent blind spot. Observability closes that gap, and for many teams it is the second-highest-leverage governance investment after the test suite of Section 4.
When a dashboard reports the wrong revenue number, the real question is not what is wrong but where. Data lineage is the graph that answers that question: for any column in any table in any dashboard, lineage tells you the chain of upstream tables, transformations, and source systems that produced it. A platform with accurate lineage can debug a number in minutes; a platform without it can spend days.
Lineage has two resolutions. Table-level lineage records that table B was produced from tables A1 and A2; it is straightforward to emit from any orchestrator and covers most pipeline-operations use cases. Column-level lineage additionally records which specific columns of A1 and A2 flowed into which specific columns of B; it is much more useful for debugging, change-impact analysis, and regulatory work, and it is much harder to compute accurately because it requires parsing the SQL of the transformation. The modern catalogue and observability platforms (Section 8) almost all support column-level lineage; the question is how accurately they extract it from the transformation code.
There are two ways to build a lineage graph. Infer-time approaches parse SQL, Spark DAGs, and warehouse query logs after the fact and reconstruct lineage from what actually ran; they work on any platform without modifications but can miss edge cases in dynamic SQL or external transformations. Emit-time approaches instrument the orchestrator and execution engines to emit lineage events as they run; they are more accurate but require adoption of an emission standard. OpenLineage is the dominant open standard for emit-time lineage, with producers for Airflow, Spark, dbt, and Flink, and consumers including Marquez, DataHub, and the commercial observability vendors.
Lineage serves at least four distinct use cases, and the value of each one is what justifies the investment. Debugging: trace a wrong number back to a specific upstream change. Change impact: before dropping a column or renaming a table, identify every downstream consumer. Regulatory response: when asked where a given personal datum lives, produce the full map of derived assets and who has access to each. Quality attribution: when a downstream test fails, narrow the search to the specific upstream asset whose distribution drifted. A platform that cannot do these four things is one that is slow and expensive to change safely.
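Change impact, the second of those use cases, is a reachability query over the lineage graph. A sketch with an invented edge list:

```python
from collections import deque

# Change-impact analysis as reachability over table-level lineage.
# Edges map a table to its direct downstream consumers (names invented).
downstreams = {
    "raw_customers": ["stg_customers"],
    "stg_customers": ["fct_orders", "dim_customers"],
    "dim_customers": ["dash_retention"],
}

def impacted(asset: str) -> set[str]:
    """Every transitive downstream consumer of `asset`."""
    seen, queue = set(), deque(downstreams.get(asset, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(downstreams.get(node, []))
    return seen
```

The other three use cases are variations on the same traversal: debugging walks the edges upstream, regulatory response filters the closure by classification, and quality attribution intersects the upstream set with the assets whose profiles drifted.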
The failure mode of a lineage programme is partial coverage. If lineage is accurate for the 70% of pipelines that run in dbt, and absent for the 30% that run in hand-written Spark or SaaS integrations, the lineage graph is not merely incomplete — it is misleading, because it suggests the missing 30% does not exist. Getting to near-complete lineage usually requires work in two directions: expanding the set of systems that emit lineage, and parsing query logs for everything else. It is not a feature; it is a capability that accretes over time.
Lineage and the catalogue (Section 7) are adjacent but different. Lineage is the graph — how data moves. The catalogue is the index — what each dataset is, who owns it, how it is used. A modern active-metadata platform (Section 8) publishes both against the same underlying metadata store; that is what makes the combination valuable in a way that either piece alone is not.
For most of the people who use a data platform, the first question is not how to query a table but how to find one. A data catalogue is the searchable inventory of an organisation's data assets — tables, views, dashboards, feature stores, ML models — with enough metadata attached to each entry that a consumer can decide whether it is the right asset for their purpose without reading the pipeline code. A platform without a catalogue forces that discovery to happen by word of mouth; a platform with a working one makes it self-service.
A minimally useful entry has a name, a plain-English description, a schema, an owner, a freshness indication, a usage signal (who queries this, how often), tags that place it in the business taxonomy, and a link to the upstream pipeline that produces it. A richer entry adds sample values, documented business-logic notes, quality-test results, known caveats, and the lineage graph that connects it to its sources. The richer the entry, the less often a consumer needs to ask the owner a question before using the asset.
Cataloguing tools have gone through three roughly distinguishable generations. The first was enterprise metadata repositories like Informatica and Collibra, focused on top-down compliance use cases. The second was the discovery-first tools that emerged from large consumer-internet teams in the late 2010s: Airbnb's Dataportal, Lyft's Amundsen, LinkedIn's DataHub, Netflix's Metacat — all designed to help thousands of analysts find the right table. The third, which this chapter's Section 8 treats separately, is the active-metadata platforms that extended the second-generation catalogues into control-plane tools rather than passive indexes. Modern practice runs on the second- and third-generation descendants.
A long-running design tension is between catalogues that rely on producers to document everything formally (curated, high-quality entries, low coverage) and catalogues that let consumers edit, tag, and comment (crowdsourced, uneven quality, high coverage). The working answer in practice is both: a baseline of machine-extracted metadata for every asset, a curated layer for the business-critical ones, and a crowdsourced layer of tags, comments, and usage examples on top. A catalogue that covers only the curated set will never find most of the platform; one that covers everything with no curation will be full of abandoned and misleading entries.
A catalogue is successful when people use it instinctively. The organisational signal is that internal chat questions like "does anyone know where the X table is?" trail off because the answer is "search the catalogue". Getting there requires the catalogue to be genuinely faster than Slack, which in turn requires good search, good freshness, and enough baseline metadata that most entries return a useful result. The failure mode is the catalogue that has an entry for every table but nothing human-useful in any of them; consumers learn quickly that the catalogue is not where answers come from and go back to asking people.
An important reframing is that a catalogue's primary user is not the governance team but the analyst or engineer who is trying to answer a question. Organisations whose catalogue teams treat themselves as an internal product team — with user research, usage analytics, and iteration — tend to build catalogues that get used; organisations that treat cataloguing as a compliance obligation tend to build ones that do not.
Traditional catalogues were passive indexes: they scraped metadata out of the warehouse, stored it, and displayed it. Active-metadata platforms (the term was coined by Gartner in 2021 and quickly adopted by the vendor community) are the generation of tools that additionally emit metadata back into the platform — driving access policies, triggering pipeline runs, annotating dashboards, alerting on lineage changes, and integrating with the orchestrator, BI layer, and ML platform in real time. The distinction matters because it changes what a metadata platform does, not just what it stores.
Four open-source projects, and a layer of commercial vendors that sit around them, set the standard. DataHub (LinkedIn, then Acryl) emphasises metadata-as-a-graph with strong lineage, ownership, and search. OpenMetadata (Collate) emphasises a unified ontology across data, dashboards, pipelines, and ML models, with tight integrations to dbt, Airflow, and Superset. Apache Atlas is the older Hadoop-era incumbent, still present in many enterprise estates, particularly those built on Hortonworks or Cloudera foundations. Amundsen (Lyft) remains the clearest embodiment of the search-first catalogue pattern, though newer teams tend to pick DataHub or OpenMetadata for the broader feature set. The commercial layer — Collibra, Alation, Atlan, Informatica Cloud Data Governance — targets enterprises that want turnkey deployments with strong compliance features and extensive connector catalogues.
A metadata platform's value is roughly proportional to the number of systems it integrates with and the depth of the integration. A platform that ingests metadata from the warehouse is table stakes; a platform that additionally pulls from dbt, Airflow, Spark, Kafka, Looker, Tableau, SageMaker, and Snowflake RBAC — and emits to the same set — is genuinely a control plane. The practical sizing question for a buyer or adopter is which of their existing systems the platform covers out of the box and which will require custom connectors.
The right mental model for a modern metadata platform is a graph database. Nodes are assets (tables, columns, dashboards, models, pipelines, users, teams); edges are lineage, ownership, usage, and policy relationships. Queries over that graph answer questions that cannot be answered by any single system in isolation: "which dashboards depend on this column", "which production tables are accessed by the recommendations model", "which users own assets that include PII but lack an approved DPA". The shift from "metadata as rows in a governance tool" to "metadata as a queryable graph" is the concrete technical change that active-metadata really names.
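A sketch of the kind of query the graph framing enables, over a toy in-memory store with invented asset names and tags (real platforms expose this through GraphQL or search APIs rather than dict comprehensions):

```python
# Metadata as a queryable graph, reduced to its essentials: nodes carry
# tags and an ownership edge. Asset names and tags are invented.
assets = {
    "dwh.users":        {"tags": {"pii"}, "owner": "team-identity"},
    "dwh.events":       {"tags": set(),   "owner": "team-platform"},
    "dwh.users_export": {"tags": {"pii"}, "owner": None},
}

def unowned_pii() -> list[str]:
    """The governance query: PII-tagged assets with no accountable owner."""
    return sorted(name for name, meta in assets.items()
                  if "pii" in meta["tags"] and meta["owner"] is None)
```

The interesting property is that the question spans classification (a policy concern) and ownership (an organisational one); no single source system holds both halves, which is exactly why the graph is worth building.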
Most teams first find value in three places. Change-impact analysis — before dropping a column, see every downstream consumer. Policy enforcement — tag a column as PII in the metadata layer, and have the access layer, pipeline, and BI tool all respect it automatically. Stewardship — assign ownership in the metadata layer and route incidents, test failures, and review requests to the right person. These are the capabilities that separate an active-metadata platform from a classic catalogue, and they are the ones that repay the substantial integration effort.
The short version of the generational shift: a catalogue tells you what exists; an active-metadata platform tells you what exists, what it depends on, who owns it, who uses it, what policies apply to it, and then makes those facts actionable across the rest of the stack. The jump in leverage justifies the jump in deployment complexity for most organisations past a certain size.
A contract (Section 3) is the agreement; a schema is the machine-readable form of part of that agreement — the structural part that a serializer and a parser can verify on every message. Schemas are a prerequisite for most of the governance work in this chapter: you cannot enforce classifications on unnamed columns, you cannot emit lineage through untyped payloads, and you cannot evolve a format safely without a change-management process around its schema.
Apache Avro is the standard in Kafka and Hadoop-adjacent ecosystems. It is compact, supports rich schema evolution (defaults, aliases, union types), and is the native format used with Confluent Schema Registry. Protocol Buffers (Protobuf) is the Google lineage, widely used for service-to-service RPC and increasingly for event payloads; its proto3 dialect is the one most teams encounter. JSON Schema is the least compact of the three and the most human-readable; it dominates where JSON is already the on-wire format, which includes most web APIs and many streaming systems that chose readability over efficiency. Choosing between them is a real design decision, usually made by the platform team once, and lived with for years.
A schema registry is the central store that producers and consumers consult to resolve the schema associated with a stream, topic, or dataset. Confluent Schema Registry is the reference implementation in the Kafka world, with an open-source core and a commercial offering; AWS Glue Schema Registry and similar managed equivalents fill the role on the cloud providers. The registry gives the platform three things: a single authoritative location for every schema, a versioning history, and a compatibility check that validates proposed new versions against the rules for safe evolution before they are accepted.
The most consequential feature of a schema registry is its compatibility enforcement. The three common modes — backward, forward, and full compatibility — formalise the rules for what a schema change can and cannot do. Backward compatible: consumers running the new schema can read data written with the old. Forward compatible: consumers running the old schema can read data written with the new. Full: both. The right mode depends on how producers and consumers roll out relative to each other; the wrong mode lets a producer deploy a breaking change and discover the consequences only when the downstream systems start failing.
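The backward-compatibility rule can be made concrete with a simplified check over Avro-style record schemas. This is a sketch of the idea, not the registry's actual algorithm: it ignores type promotion, aliases, and nested records, and checks only the core rule that a field added in the new schema must carry a default for new-schema consumers to read old data:

```python
def backward_compatible(old: dict, new: dict) -> list:
    """Return reasons the new schema cannot read data written with the
    old one; an empty list means the change is backward compatible.
    Simplified: no type promotion, aliases, or nested records."""
    old_fields = {f["name"]: f for f in old["fields"]}
    problems = []
    for f in new["fields"]:
        if f["name"] not in old_fields:
            # A field absent from old data must have a default to fill in.
            if "default" not in f:
                problems.append(f"new field '{f['name']}' has no default")
        elif f["type"] != old_fields[f["name"]]["type"]:
            problems.append(f"field '{f['name']}' changed type")
    return problems

old = {"fields": [{"name": "id", "type": "long"},
                  {"name": "email", "type": "string"}]}
ok  = {"fields": old["fields"] + [{"name": "plan", "type": "string",
                                   "default": "free"}]}
bad = {"fields": old["fields"] + [{"name": "plan", "type": "string"}]}

print(backward_compatible(old, ok))   # []
print(backward_compatible(old, bad))  # ["new field 'plan' has no default"]
```

A registry running in backward mode rejects the `bad` version at registration time, which is precisely the point: the producer discovers the breaking change before deploying, not after the downstream systems fail.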
Schemas live in the lake and the warehouse too, though the machinery is less explicit. Parquet files carry their own schema in the footer; Delta, Iceberg, and Hudi (Chapter 02) layer schema evolution, enforcement, and time-travel on top of the raw files. Cloud warehouses enforce a declared schema on every write. The modern practice is to treat every write path — streams, lakes, warehouses — as having an explicit schema, an explicit registry of some kind (even if the registry is the table-format metadata), and an explicit change-management process; ad-hoc evolution without any of those is how the classic "the schema quietly changed" incident happens.
For teams just beginning a governance programme, a clean schema and a working registry are often the single most concrete starting point. They are an engineering artefact, easy to adopt incrementally, and they immediately make the rest of the programme — contracts, tests, classifications, lineage — tractable in ways that unbounded semi-structured payloads cannot support.
Most organisations of any size have the same customer, the same product, or the same employee represented in several systems at once — CRM, billing, support, marketing, HR. Master data management (MDM) is the practice of reconciling those parallel representations into a single authoritative record per entity, known in the older literature as the golden record. It is one of the oldest disciplines in data management, one of the least fashionable, and in large enterprises one of the highest-leverage.
At the heart of MDM is entity resolution: the problem of deciding which records refer to the same real-world entity and which do not. Deterministic matching uses exact or rule-based matches on fields like email, phone, or external ID; it is precise but misses anything with typos or formatting differences. Probabilistic matching uses similarity scores across multiple fields and a threshold to declare a match; it catches more real duplicates at the cost of some false positives. ML-based matching trains a classifier on hand-labelled pairs and generalises beyond hand-tuned rules; it works well on large corpora with rich signals. Real programmes typically use all three in layers.
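The layering can be sketched in a few lines. The field weights and threshold below are illustrative, not tuned, and the similarity function is the standard-library sequence matcher rather than a production string-distance library:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match(rec_a: dict, rec_b: dict, threshold: float = 0.85):
    """Layered matcher: deterministic first, probabilistic fallback."""
    # Layer 1: deterministic -- exact match on a strong identifier.
    if rec_a.get("email") and rec_a["email"] == rec_b.get("email"):
        return True, "deterministic:email"
    # Layer 2: probabilistic -- weighted similarity over weaker fields,
    # declared a match above a threshold.
    score = (0.6 * similarity(rec_a["name"], rec_b["name"]) +
             0.4 * similarity(rec_a["city"], rec_b["city"]))
    return score >= threshold, f"probabilistic:{score:.2f}"

a = {"name": "Jane Smith",  "city": "London", "email": "js@x.com"}
b = {"name": "Jane  Smyth", "city": "london", "email": None}
print(match(a, b))  # matches on the probabilistic layer despite the typo
```

The ML-based third layer would replace the hand-set weights with a classifier trained on steward-labelled pairs; the structure of the decision stays the same.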
Entity resolution will always have ambiguous cases — two records that might or might not be the same person. A credible MDM programme includes human stewards who review the ambiguous matches, decide them, and feed the decisions back into the matching model. Without stewardship the programme will either be conservative (high precision, many uncaught duplicates) or aggressive (high recall, occasional wrong merges that are enormously embarrassing when the two people are different); with stewardship, it can converge over time to a stable accuracy appropriate for the use.
There are three broad architectural patterns. Registry MDM keeps the source systems authoritative, adds a cross-reference table, and exposes the golden record as a view that joins them. Consolidation MDM copies into a central store and cleans in-place, with the central store as the published golden record but not authoritative for writes. Coexistence and centralised patterns make the MDM store the system of record, pushing changes back into the source systems. Registry is the cheapest and most common starting point; full centralisation is rare because it requires reworking every source system.
In modern stacks, MDM overlaps substantially with warehouse modelling (Chapter 02 Sections 13–14) and with feature stores (Chapter 03). Many teams end up doing lightweight MDM inside dbt models: a central dim_customer that resolves the CRM, billing, and marketing records into a single row per real-world customer. A full enterprise MDM product (Informatica, Reltio, TIBCO EBX) is heavier; it is the right choice when the entity being mastered has regulatory weight, lives in dozens of systems, or drives high-consequence operational decisions.
MDM has never been the exciting part of data. It is nonetheless the difference between an organisation that can answer "how many customers do we have" with confidence and one that genuinely does not know. Mature data teams treat MDM as a first-class discipline alongside pipelines and governance; immature ones treat it as a workaround that never gets finished.
Before access controls, privacy techniques, or compliance processes can do useful work, each data asset needs a label that says what kind of thing it is — whether it contains personal data, whether it is sensitive under a specific regulation, what internal confidentiality tier it sits in. Data classification is the practice of applying and maintaining those labels consistently, and it is the hinge that connects the metadata layer to the policy layer.
Most classification schemes combine two axes. The first is sensitivity or confidentiality: typical tiers are public, internal, confidential, and restricted (or the equivalent government scale). The second is data type, which flags regulatory significance: PII (personally identifiable information), PHI (protected health information under HIPAA), PCI (payment card information), financial data subject to specific disclosure rules, and employee data subject to labour regulations. A column can carry labels on both axes — for example, "confidential" on sensitivity and "PII" on type — and each label pulls in its own controls.
A scheme that relies on every engineer to apply the right labels to every new column will not stay current. The mature practice combines three techniques. Pattern scanners identify likely PII by regex (email addresses, phone numbers, national ID numbers) or by ML classifiers trained on examples. Lineage-based propagation automatically inherits labels from upstream to downstream tables so that derived assets are classified the moment they are created. Manual override lets data stewards correct misclassifications and apply labels the automation missed. All three are native features of modern active-metadata platforms (Section 8).
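A pattern scanner at its simplest is a set of regexes run over a sample of column values, with a label suggested when most of the sample matches. The patterns below are deliberately crude illustrations; production scanners use far richer rules, checksum validation, and ML classifiers alongside them:

```python
import re

# Illustrative patterns only; real scanners are much more thorough.
PATTERNS = {
    "PII:email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "PII:phone": re.compile(r"^\+?[\d\s()-]{7,15}$"),
}

def suggest_labels(column_sample: list, min_hit_rate: float = 0.8) -> list:
    """Suggest classification labels for a column from a value sample:
    a label applies if most sampled values match its pattern."""
    labels = []
    for label, pattern in PATTERNS.items():
        hits = sum(bool(pattern.match(v)) for v in column_sample)
        if column_sample and hits / len(column_sample) >= min_hit_rate:
            labels.append(label)
    return labels

print(suggest_labels(["a@x.com", "b@y.org", "c@z.io"]))  # ['PII:email']
print(suggest_labels(["hello", "world"]))                # []
```

The suggested labels then go through the steward-override step the text describes; automation proposes, humans confirm.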
The value of a classification label is entirely in what it triggers. A PII label should drive masking in non-production environments, row-level access control in queries, retention-policy enforcement, and inclusion in data-subject-access-request workflows. A "confidential" label should restrict which users and groups can query the asset. A "restricted" label should additionally route any new access request through a named approver. If labels are stored but not acted on, classification has failed; if every label is wired to concrete controls, classification is the mechanism through which policy actually reaches the query.
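The label-to-control wiring is, at its core, a lookup table that the enforcement points consult. A minimal sketch, with hypothetical control names standing in for the real masking, access, and retention mechanisms:

```python
# Hypothetical policy table: each classification label pulls in the
# controls described in the text.
POLICY = {
    "PII":          {"mask_in_nonprod", "row_level_access",
                     "retention_policy", "dsar_workflow"},
    "confidential": {"restricted_groups"},
    "restricted":   {"restricted_groups", "approval_required"},
}

def required_controls(labels: set) -> set:
    """Union of the controls triggered by an asset's labels."""
    return set().union(*(POLICY.get(label, set()) for label in labels))

print(sorted(required_controls({"PII", "confidential"})))
```

The point of writing policy per-label rather than per-column is visible in the function signature: a new table inherits its controls the moment it is classified, with no per-asset policy authoring at all.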
Certain categories are legally distinguished and carry particularly strict handling rules. Under GDPR, special category data includes race, ethnicity, political opinions, religious beliefs, trade-union membership, genetic and biometric data, health data, sexual orientation, and criminal convictions; processing any of these requires an explicit lawful basis. HIPAA's PHI designation is similarly strict in scope. Any classification scheme that fails to distinguish these categories from ordinary PII will fail its compliance obligations; serious programmes have explicit tags for each of the special categories relevant to their jurisdictions.
Classification is the interface between metadata and policy. It is the layer at which the metadata team says "this column contains this kind of data" and the access team says "that kind of data gets this kind of control". Without classification, every policy must be written per-column by hand; with classification, policies can be written per-label and enforced across the entire estate.
Once data is classified, the next job is to enforce who can read or write it. Modern warehouses and lakes support an increasingly rich vocabulary of access controls: at the schema, table, column, and row level, with role-based and attribute-based variants, and with delegation, temporary grants, and audit logs. Getting this layer right is where a governance programme is, in a concrete sense, doing compliance work; everything upstream is in service of controls at this layer landing on the right columns and rows.
Role-based access control (RBAC) is the default: users belong to groups, groups hold roles, roles grant privileges on specific objects. It is simple, auditable, and works well until the number of fine-grained distinctions outgrows the number of roles anyone can keep track of. Attribute-based access control (ABAC) grants access based on runtime attributes of the user (department, clearance, location) and the object (classification label, region, sensitivity); it scales to finer-grained policies at the cost of harder reasoning about "why does this user have access to this row". Row-level and column-level policies are the modern warehouse primitives that let a single view expose different rows or mask different columns to different users based on the policy function, without multiplying physical tables or views.
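An ABAC decision combines attributes of the user and the object at evaluation time. The toy policy below (all attribute names and rules are illustrative) shows the shape, including why "why does this user have access" gets harder to answer than under RBAC:

```python
TIERS = ["public", "internal", "confidential", "restricted"]

def abac_allow(user: dict, obj: dict) -> bool:
    """Toy ABAC decision: sufficient clearance for the object's
    sensitivity tier, matching region, and for PII-labelled objects,
    membership of a PII-approved department."""
    if TIERS.index(user["clearance"]) < TIERS.index(obj["sensitivity"]):
        return False
    if obj.get("region") and user["region"] != obj["region"]:
        return False
    if "PII" in obj.get("labels", set()) and \
            user["department"] not in {"legal", "privacy"}:
        return False
    return True

analyst = {"clearance": "confidential", "region": "eu",
           "department": "analytics"}
table   = {"sensitivity": "internal", "region": "eu", "labels": set()}
print(abac_allow(analyst, table))  # True
```

Note that the answer depends on three runtime attributes at once; auditing it requires recording the attribute values at decision time, not just the grant.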
A decade ago, fine-grained access control required a layer above the warehouse (tools like Immuta, Privacera, Apache Ranger). The cloud warehouses have since absorbed much of this functionality directly. Snowflake has row-access policies, masking policies, and dynamic data masking. BigQuery has column-level access control, row-level security, and data masking tied to policy tags. Databricks Unity Catalog provides attribute-based access and dynamic views. For most teams, the warehouse-native controls are the right starting point; the dedicated tools are still valuable for cross-warehouse enforcement and for richer policy languages.
The governing principle is least privilege: grant the minimum access required to do the job, and nothing more. In practice this requires two practices that organisations rarely get right on their own. Provisioning through groups, not users: access is granted to roles or groups, and people get access by being added to a group, so that access can be revoked by removing group membership when someone changes teams. Periodic access review: at a fixed cadence (often quarterly), asset owners review who has access to their data and confirm that it is still appropriate. Both practices become tractable when the access system and the active-metadata platform are integrated; without that integration, they drift quickly into shelfware.
Access without audit is a liability: a regulator asks who accessed a specific record on a specific date, and the organisation must be able to answer with confidence. Modern warehouses emit access logs at the query level; the task is to collect, retain, and query those logs. SOC 2 and HIPAA both require the capability directly. A working setup routes warehouse access logs into a long-term store (often the lake itself), indexes them for query, and surfaces a user-oriented view in the active-metadata platform so that incident response does not require running ad-hoc SQL at 2am against the audit log.
An access-control system that restricts PII columns is useful only if the PII columns are reliably labelled. This is why Sections 11 and 12 are a pair: classification without enforcement is paperwork, and enforcement without classification is either too broad (everything locked down) or too narrow (sensitive data hidden in plain sight). The two layers carry the programme together.
Access control (Section 12) restricts who can see which rows and columns in their original form. Privacy techniques go further: they modify the data itself so that analysis, testing, and ML training can proceed without revealing the individuals whose records they contain. The toolkit runs from simple masking all the way through formal differential privacy, and each technique trades some combination of utility, performance, and mathematical rigour.
Masking replaces sensitive values with synthetic or partial values — a credit-card number becomes ****-****-****-1234, an email address becomes j***@company.com. Redaction removes the field entirely. Pseudonymisation replaces direct identifiers with consistent synthetic tokens, so that records can still be joined but the real identity is not present in the dataset. All three are cheap, effective for operational use (test environments, internal analytics), and insufficient as a legal matter for anything labelled as fully de-identified: pseudonyms can often be re-identified by combining with other attributes, as the famous Netflix Prize and AOL search-log incidents demonstrated.
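The three techniques are short enough to sketch directly. The pseudonymisation function uses a keyed hash so that the same input always yields the same token (preserving joins) while the identity itself never appears in the dataset; the key shown is illustrative and would be a managed secret in practice:

```python
import hashlib
import hmac

def mask_card(pan: str) -> str:
    """Partial masking: keep only the last four digits."""
    return "****-****-****-" + pan[-4:]

def mask_email(addr: str) -> str:
    local, _, domain = addr.partition("@")
    return local[0] + "***@" + domain

def pseudonymise(value: str, key: bytes) -> str:
    """Consistent pseudonym via keyed hash (HMAC-SHA256): same input,
    same token, so records still join; the key is a managed secret."""
    return hmac.new(key, value.encode(), hashlib.sha256).hexdigest()[:16]

print(mask_card("4111111111111234"))   # ****-****-****-1234
print(mask_email("jane@company.com"))  # j***@company.com
key = b"demo-key-not-for-production"
token = pseudonymise("jane@company.com", key)
```

The caveat in the text applies directly to this sketch: the token is stable, so anyone who can correlate it with other attributes may re-identify the person. That is exactly why pseudonymisation does not count as anonymisation under GDPR.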
Tokenisation maps sensitive values to random tokens through a lookup vault; systems that need the real value go through the vault, systems that do not store only the token. It is the standard approach for payment-card data and is well-supported by commercial vendors (Protegrity, Thales). Format-preserving encryption produces ciphertext that looks like valid plaintext (a 16-digit credit card encrypts to a different 16-digit number) so that downstream systems with strict format expectations do not break. Both techniques are reversible given the key or the vault, which is why they are compliance-appropriate for storage but not a substitute for real de-identification.
k-anonymity, introduced by Latanya Sweeney in 2002, requires that every record in a released dataset be indistinguishable from at least k−1 others on its identifying attributes. Later refinements — l-diversity, t-closeness — patched specific weaknesses where k-anonymous data still leaked information about sensitive attributes. The framework is useful and widely taught, but it is now recognised as fundamentally vulnerable to composition attacks: combining multiple k-anonymous releases, or combining one with an external dataset, can often re-identify individuals. It remains a legitimate tool for specific release scenarios, not a general-purpose solution.
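Checking k-anonymity is mechanically simple: group the records by their quasi-identifying attributes and find the smallest group. A minimal sketch with invented records:

```python
from collections import Counter

def k_anonymity(records: list, quasi_identifiers: list) -> int:
    """Smallest equivalence-class size over the quasi-identifiers:
    the dataset is k-anonymous for any k up to this value."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers)
                     for r in records)
    return min(groups.values()) if groups else 0

rows = [
    {"age": "30-39", "zip": "SW1", "diagnosis": "flu"},
    {"age": "30-39", "zip": "SW1", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "N1",  "diagnosis": "flu"},
]
print(k_anonymity(rows, ["age", "zip"]))  # 1 -- the N1 record is unique
```

The usual remedies when the check fails are generalisation (coarser age bands, shorter postcode prefixes) and suppression (dropping unique records), both of which trade utility for anonymity; the composition weakness described above remains either way.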
Differential privacy (Dwork and colleagues, 2006) is the modern mathematical framework for quantifying privacy loss. A mechanism is ε-differentially private if its output distribution changes by at most a factor of e^ε when any single record is added to or removed from the dataset; the parameter ε is a formal privacy budget that composes across queries in a well-understood way. Differential privacy has moved from theory to practice over the last decade: the US Census Bureau used it for the 2020 decennial release; Apple, Google, and LinkedIn run differentially private telemetry; Microsoft's SmartNoise and Google's dp-accounting library make the machinery accessible. It is the strongest formal guarantee available, and it trades utility for privacy in ways teams need to measure carefully before adopting.
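The simplest differentially private mechanism is the Laplace mechanism for counting queries: a count has sensitivity 1 (one record changes it by at most 1), so adding noise drawn from Laplace(0, 1/ε) satisfies ε-differential privacy. A minimal sketch using the inverse-CDF method to sample the noise:

```python
import math
import random

def dp_count(true_count: int, epsilon: float) -> float:
    """epsilon-DP count via the Laplace mechanism. Sensitivity of a
    count is 1, so the noise scale is 1/epsilon."""
    # Sample Laplace(0, 1/epsilon) by inverse CDF: u in [-0.5, 0.5).
    u = random.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    noise = -(1 / epsilon) * sign * math.log(1 - 2 * abs(u))
    return true_count + noise

random.seed(0)
# Smaller epsilon = stronger privacy = noisier answers.
print(round(dp_count(1000, epsilon=1.0), 1))
print(round(dp_count(1000, epsilon=0.1), 1))
```

The utility trade is visible in the scale parameter: halving ε doubles the expected noise, and the budget spent here is no longer available for other queries against the same data, which is what the composition property formalises.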
In practice the techniques layer. Pseudonymisation and masking handle the vast majority of internal use: dev and test environments, internal dashboards, analyst queries. Tokenisation handles regulated-field storage: payment cards, social security numbers. Differential privacy handles the most sensitive external releases, the published aggregate statistics, and training regimes where the trained model might be queried to infer whether specific records were in the training set. No single technique is sufficient across all four use cases; the design job is matching technique to risk and to utility requirement.
A recurring lesson in the privacy literature is that "de-identified" data is often not: combining age, gender, and postcode is enough to uniquely identify a majority of individuals in many populations. Teams that need genuinely defensible privacy claims — rather than engineering conveniences — should assume ad-hoc masking is not enough and reach for tokenisation, aggregation thresholds, or differential privacy depending on the threat model. Chapter 22 (Privacy in ML in Part X) revisits these questions in the ML-specific setting.
A working governance programme is ultimately accountable to specific legal and contractual regimes. An engineer does not need to be a lawyer, but they do need enough literacy to understand what each regime actually requires of the data platform, and where the high-leverage engineering choices sit. The five regimes that cover most of the ground for an English-speaking company operating internationally are GDPR, CCPA, HIPAA, SOC 2, and ISO 27001.
The General Data Protection Regulation (in force since 2018) applies to any organisation processing personal data of people in the EU or EEA, regardless of where the organisation itself is located. Its core obligations most relevant to a data platform are: a legal basis for every processing activity (usually consent, contract, or legitimate interest); data-subject rights including access, correction, portability, and erasure (the right to be forgotten); purpose limitation (data collected for one purpose cannot be used for another without new legal basis); storage minimisation (retain only as long as needed); and breach notification within 72 hours. The erasure right is the most architecturally consequential: it requires the platform to be able to find and delete all traces of a specific person on request, which is non-trivial once data has been propagated to warehouses, lakes, and backups.
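The erasure workflow is where classification pays off architecturally: if the metadata layer records which tables hold personal data and which column identifies the subject, the deletes can be generated rather than hand-written. A sketch with hypothetical table names (a real system would use parameterised queries rather than string interpolation, cover backups and lineage-derived tables, and write an audit record of the erasure itself):

```python
# Hypothetical classification metadata: PII-bearing table -> the
# column that identifies the data subject.
PII_TABLES = {
    "crm.contacts":         "email",
    "billing.invoices":     "customer_email",
    "analytics.web_events": "user_email",
}

def erasure_statements(subject_email: str) -> list:
    """Generate per-table deletes for a data-subject erasure request.
    Illustration only: production code must use bind parameters, not
    string interpolation, and must also handle derived and backup data."""
    return [
        f"DELETE FROM {table} WHERE {column} = '{subject_email}'"
        for table, column in PII_TABLES.items()
    ]

for stmt in erasure_statements("jane@company.com"):
    print(stmt)
```

The hard part in practice is not the deletes but the coverage: every propagated copy, every derived table, and every backup has to be found, which is why the erasure right depends so heavily on the lineage and classification machinery of Sections 6 and 11.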
The California Consumer Privacy Act (effective 2020, amended by CPRA in 2023) gives California residents rights broadly similar to GDPR's — access, deletion, opt-out of sale, non-discrimination — with some structural differences around what counts as "sale" and what constitutes a service-provider exemption. CCPA has been joined by Virginia (VCDPA), Colorado (CPA), Connecticut (CTDPA), Utah (UCPA), and a growing list of other state laws, with overlapping but non-identical obligations. For a company operating across US states, the practical approach is usually to build to GDPR-like capabilities and apply them across the board, since retrofitting per-state compliance after the fact is more expensive than building once to the strictest regime.
HIPAA governs protected health information (PHI) in the United States; any platform handling PHI for a covered entity or business associate must meet specific safeguards around access, transmission, and disclosure, and must sign formal business-associate agreements with its vendors. Other sector regimes carry similar weight: PCI DSS for payment-card data, FERPA for education records, GLBA for consumer financial data, sector-specific regimes for children's data (COPPA) and in specific regulated industries. Each regime dictates specific controls; a platform serving multiple sectors typically ends up with the union of their requirements, often routed through its classification (Section 11) and access-control (Section 12) systems.
SOC 2 (specifically Type II) is the de facto standard for B2B SaaS trust in North America; it requires a set of controls across security, availability, processing integrity, confidentiality, and privacy, audited annually. ISO 27001 is the equivalent international standard, certifying an information security management system. Neither is a law; both are contractual expectations that customers increasingly require before they will sign a commercial agreement. From a data-platform perspective, SOC 2 and ISO 27001 compliance is mostly a matter of documenting controls that a mature governance programme already has in place: classification, access review, audit logs, incident response, change management, backup and recovery.
A newer regulatory layer specifically targets ML systems rather than data alone. The EU AI Act (adopted 2024) imposes risk-tiered obligations on AI systems, including transparency, human oversight, documentation of training data, and post-market monitoring for "high-risk" systems. The US NIST AI Risk Management Framework, executive orders, and state-level proposals add further obligations. Section 17 revisits these from the ML angle; from the data-platform angle, they largely reinforce existing lineage, provenance, and documentation requirements, extended to cover training datasets and model outputs as first-class governed assets.
Treating compliance as a checklist of annual paperwork is how organisations end up with brittle, expensive audits and unpleasant surprises when a new regime lands. Treating compliance as a set of platform capabilities — discoverable data, classified data, access-controlled data, audited data, deletable data — makes each new regime a small incremental obligation on top of a working base, instead of a fresh project from zero.
Compliance asks "is this allowed". Ethics asks a different question: "is this appropriate, given what we know and what we can foresee". A platform can be fully compliant with GDPR, HIPAA, and SOC 2 and still be used for surveillance, discrimination, or manipulation in ways the organisation would be embarrassed to defend publicly. Over the last decade, data ethics has matured from an academic preoccupation into an operational practice, and most serious data platforms now include ethical review as an explicit step for the highest-stakes uses.
The discipline has a small canon that repays reading. Cathy O'Neil's Weapons of Math Destruction (2016) catalogued the ways automated decision systems compound inequality. Virginia Eubanks's Automating Inequality (2018) traced the same dynamics through US social services. Ruha Benjamin's Race After Technology (2019) extended the argument to a structural analysis of how engineering decisions encode historical bias. Safiya Noble's Algorithms of Oppression (2018) focused on search and recommendation. The literature consistently identifies a small number of recurrent failure modes: biased training data leading to biased predictions, proxy variables that stand in for protected attributes, feedback loops that amplify inequality, and performance metrics that hide disparate impact.
The practical operational form is a review gate: before a high-stakes data use goes live, a designated group reviews the proposed use for ethical considerations beyond legal compliance. Useful review frameworks (Fairness, Accountability, Transparency principles; Data & Society's guidelines) structure the review around questions like: who is this system making decisions about, and do they have recourse if it is wrong; what are the downstream consequences of systematic error; does the training data represent the affected population; are there proxies for protected attributes that will produce disparate impact. The review is not a publishing step; it is an engineering step with concrete deliverables (impact assessments, mitigations, monitoring plans).
Beyond the downstream ethics of how models are used, there is the upstream ethics of how training data is obtained. The legitimacy of large-scale web-scraped training corpora has been contested in courts and in public (the ongoing copyright and privacy lawsuits against LLM developers are one such arena). Serious programmes track training-data provenance explicitly: where did each dataset come from, under what license, with what consent, with what compensation to creators. The Chapter 01 treatment of data provenance is the starting point; ethical programmes require it to extend to every model on the platform, not just to the data underneath.
It is worth being honest about the limits of ethics programmes inside companies. They are constrained by the commercial interests of the companies that run them, they can be captured by executives who want a rubber stamp, and they tend to produce process rather than substance when the underlying product strategy demands a use the ethics programme would otherwise reject. The literature on corporate ethics (including the AI Now Institute's work and the documented disbandings of ethics teams at several major tech companies) argues that internal review must be paired with external accountability — regulators, auditors, and published commitments — to have real force. An honest programme says so explicitly rather than overstating its independence.
Ethics at the data-platform level is necessary but not sufficient. The full ethical frame for machine learning is covered in Part X of the compendium (AI Safety & Ethics), where the questions about alignment, bias, fairness, and societal impact are treated as first-class subjects. This section establishes that by the time a dataset is used to train a model, ethical review should already have happened; it has not in many high-profile cases, and that is a failure mode of the discipline rather than an inevitability.
Tools matter; so do roles. Even the best catalogue, observability platform, and access system fail if no one is responsible for the quality of a specific asset, no one owns the decision to change a specific contract, and no one is on call when a specific pipeline misbehaves. The organisational layer that makes tools stick is the governance operating model: a small, consistent vocabulary for who decides what.
Three roles recur in every governance framework. A data owner is accountable for a dataset at the business level: the VP of Finance owns the general ledger tables, the Head of Marketing owns the campaign-attribution data. A data steward is the operational counterpart: the analyst or engineer who maintains the definition, quality, and documentation of the asset day-to-day. A data custodian is the technical role responsible for the infrastructure: storage, backup, access, platform health. The distinction is that the owner decides policy, the steward executes it, and the custodian operates the platform underneath. A platform in which the same person plays all three roles for everything is either very small or very confused; a platform in which the three are distinct and named makes governance decisions traceable.
The structural question is how the governance programme relates to the rest of the organisation. A centralised model puts all governance into a single team; it is easy to coordinate but easy to bottleneck. A federated model keeps a small central team for standards and tooling and pushes ownership and stewardship into the domain teams that produce the data; it is the dominant pattern in mid-sized companies. The data mesh model (Zhamak Dehghani, 2019–2020) takes the federation further, treating each domain team as a "data product" team that owns its datasets end-to-end with a light-touch central platform. None of these is universally right; the choice depends on the organisation's size, the heterogeneity of its domains, and its appetite for the cultural work that federation requires.
Beyond individual roles, most programmes rely on a couple of standing bodies. A data governance council (sometimes data governance committee) is a cross-functional group — finance, legal, security, engineering, product — that sets policy, approves classifications for novel data, resolves cross-domain disputes, and acts as the escalation point. A lightweight RACI chart (Responsible, Accountable, Consulted, Informed) per major decision type makes clear who does what when a new dataset is onboarded, a new access request is reviewed, or a policy is revised. Much organisational pain in governance traces to unstated decision rights; the RACI is a small investment that prevents large amounts of confusion.
The organisational failure mode almost every governance programme eventually faces is that the people expected to do governance work are rewarded for other things. An engineer hired to ship product features is not naturally going to document a dataset, respond to catalogue questions, or review access quarterly. A governance programme that cannot write those activities into job descriptions, performance reviews, and team OKRs will devolve into a small group of enthusiasts doing heroic work that does not scale. The mature pattern is to make the governance work visible in the same places that other work is visible: as explicit objectives, with explicit owners, tracked alongside everything else.
Tooling is necessary; the organisation is the thing that actually holds the programme together. A successful governance effort ends up as much a set of role definitions, decision rules, and reporting lines as it is a stack of open-source and SaaS products. Treating it as an engineering-only initiative, without investing in the organisational structure, is the single most common way a governance programme fails.
Every quality, lineage, classification, and access-control problem described in this chapter applies to machine learning as much as to analytics — with the added difficulty that errors in the training data propagate into trained parameters that are much harder to audit than a SQL query. ML also introduces governance artefacts the previous sections have not yet named: training-data provenance, feature lineage, model cards, and the dawning regulatory framework around model governance itself.
A trained model is a complicated summary of its training data. If the provenance of that data is unclear — which rows, which sources, which timestamps, which licences, which transformations — downstream debugging becomes impossible and downstream compliance becomes speculative. The discipline that has emerged (Gebru et al.'s Datasheets for Datasets, 2018; Hugging Face and Google's dataset cards) is to treat the training corpus as a governed asset in its own right, with a descriptor, lineage, quality metrics, and retention policy attached. The shift is from "we trained on some crawl of the web" to "we trained on specific versions of specific sources, documented and retained".
ML features — the numerical inputs presented to a model at training and inference time — are themselves data assets. They are produced by pipelines, consumed by models, served to production, and evaluated for drift. A platform without feature lineage cannot answer "which models depend on feature X" when X needs to change, and cannot trace a model degradation to its upstream cause. Feature stores (Feast, Tecton, SageMaker Feature Store, Vertex AI Feature Store) exist partly to solve this: they are the catalogue, lineage, and access layer for features, inheriting the same governance practices the rest of this chapter describes.
Mitchell et al.'s Model Cards (2019), together with Datasheets for Datasets and the IBM FactSheets initiative, define a lightweight documentation discipline for trained models: intended use, evaluation metrics across subgroups, known limitations, ethical considerations, training-data summary, and maintenance contacts. The practice has been adopted by Hugging Face (model cards for every model on the Hub), Google, and most major labs, and it is increasingly required by regulators. Model cards turn a trained artefact into something that can be documented, reviewed, and held to account in the way any other governed asset can; without them, a model is a black box that the organisation will struggle to reason about.
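The minimum viable form of a model card is a small structured record attached to the trained artefact. The sketch below uses illustrative field names loosely following Mitchell et al., not any formal schema; all the specific values are invented:

```python
from dataclasses import asdict, dataclass, field

@dataclass
class ModelCard:
    """Minimal model card in the spirit of Mitchell et al. (2019).
    Field names are illustrative, not a standardised schema."""
    name: str
    intended_use: str
    training_data: str        # provenance pointer, per Section 17
    metrics: dict             # metric -> value, ideally per subgroup
    limitations: list = field(default_factory=list)
    owner: str = ""

card = ModelCard(
    name="churn-classifier-v3",
    intended_use="Rank accounts for retention outreach; not for pricing.",
    training_data="warehouse.ml.churn_training, snapshot 2024-06-01",
    metrics={"auc": 0.87, "auc_smb_segment": 0.81},
    limitations=["Undertrained on accounts less than 90 days old"],
    owner="ml-platform team",
)
print(asdict(card)["name"])  # churn-classifier-v3
```

Even this much, kept in the metadata platform next to the feature lineage, turns "which model is this and what is it for" from an archaeology project into a lookup — the same move the catalogue made for datasets.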
Regulation is beginning to treat models themselves as governed objects, not just the data beneath them. The EU AI Act's documentation and post-market-monitoring obligations for high-risk AI systems are the most concrete example, but the trajectory in the UK, US, and other jurisdictions is in the same direction. Teams operating at the frontier of capability are already being asked to evaluate models for specific risks (biosecurity, CSAM, election interference, autonomous-agent capabilities) before release. The infrastructure this requires — red-team results, eval suites, capability documentation, release decisions — is a new branch of governance that builds directly on the data-governance machinery of the rest of this chapter.
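The release-decision machinery can be as simple as a gate that refuses to ship until the required governance artefacts exist. A minimal sketch, assuming an invented artefact checklist (the three names below are illustrative, not drawn from any specific regulation):

```python
# Hypothetical pre-release gate: block a model release unless the required
# governance artefacts (eval results, red-team sign-off, model card) exist.
REQUIRED_ARTEFACTS = {"eval_suite_results", "red_team_report", "model_card"}

def release_gate(artefacts: dict) -> tuple:
    """Return (approved, missing) given a map of artefact name -> completed."""
    missing = sorted(a for a in REQUIRED_ARTEFACTS
                     if not artefacts.get(a, False))
    return (len(missing) == 0, missing)

ok, missing = release_gate({"eval_suite_results": True, "model_card": True})
print(ok, missing)
# → False ['red_team_report']
```

In practice the gate runs in CI against the model registry, and the record of what was checked and when becomes part of the post-market-monitoring evidence the regulation asks for.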
This is the final chapter of Part III. Over seven chapters, the compendium has traced data from the moment it is captured through storage, pipelines, streaming, distributed compute, cloud infrastructure, and now the quality, metadata, and policy layer that makes the whole thing trustworthy. Part IV opens with classical machine learning — supervised regression, classification, the bias-variance tradeoff, and the model families that preceded deep learning. Those chapters inherit everything Part III built: every model runs on governed data, queries a catalogued feature store, consults a monitored pipeline, and eventually answers to the same regulatory and ethical obligations this chapter described. The substrate is finished; the models are what comes next.
The data governance literature is unusually uneven — a handful of foundational books, a much larger body of tool-specific documentation, and a scattering of essays and standards that have become canonical mostly by repetition. The list below picks the references that repay re-reading: the canonical textbook (DAMA-DMBOK), the data-contracts and data-mesh literature that reshaped the field recently, the main open-source platform docs, the privacy and compliance texts, and the ML-specific governance documents that bridge into Part IV.