Part III · Data Engineering & Systems · Chapter 01

Data collection, the hidden eighty percent.

A model is a hypothesis about its training data. Change the data and the model changes; improve the data and the model improves, at almost no additional cost. And yet — in almost every honest account from industry, academia, or the long tail of machine-learning practice — data collection is still described as the unglamorous eighty percent of the job. This chapter takes that share seriously. It works through the four main ways data actually enters a machine-learning system — public sources, commercial procurement, scraped sources, and synthetic generation — and then through the plumbing every team eventually rebuilds: APIs, rate limits, provenance, annotation, legal terrain, and the quiet discipline of version control applied to rows instead of lines of code.

How to read this chapter

The first five sections set the ground: why data acquisition is the hidden majority of ML work, a taxonomy of sources, the reasons provenance matters more than almost any other property of a dataset, and the two oldest answers — public corpora and commercial procurement — to the question "where do I get this?". Sections six through eight are the modern API story: REST, pagination, authentication, OAuth, and the quiet tax of rate limits. Sections nine through eleven cover web scraping end to end: when it is the right tool, the toolkit of HTTP clients, parsers, and headless browsers, and the engineering of polite, resumable crawling at scale. Section twelve is the legal and ethical terrain — robots.txt, terms of service, GDPR, CCPA — that a practitioner must at least be aware of. Sections thirteen through fifteen cover annotation and synthetic data, the two ways to make labelled data when the world did not helpfully pre-label it. Section sixteen brings DVC and lineage into the ingest pipeline itself. The final section connects every preceding idea to what it changes about a trained model.

Conventions: code snippets are Python 3.11+ and use requests, httpx, or playwright where illustrative. HTTP status codes follow RFC 9110. Legal references are indicative rather than jurisdictional; nothing in this chapter substitutes for advice from counsel in the country a dataset is collected or shipped from. The goal is to leave the reader able to bring data into a project under conditions — technical, organisational, legal — that the future will not regret.

Contents

  1. Why data acquisition is the hidden majority · The eighty-percent problem
  2. A taxonomy of data sources · Observational, transactional, elicited, synthetic
  3. Provenance and the paper trail · Where did this come from?
  4. Public corpora and open data · Government, research, Hugging Face
  5. Commercial data and marketplaces · Vendors, panels, licensing
  6. APIs as the modern data channel · The contract-first approach
  7. REST, pagination, and idiomatic consumption · GET, cursors, offsets
  8. Authentication and rate limits · API keys, OAuth2, backoff
  9. When scraping is the right tool · Last resort, legitimate use
  10. The scraping toolkit · HTTP, parsers, headless browsers
  11. Crawling at scale · Queues, politeness, resumability
  12. Legal and ethical terrain · robots.txt, ToS, GDPR, CCPA
  13. Annotation and labelling · Vendors, guidelines, inter-annotator agreement
  14. Synthetic data generation · Simulation, augmentation, generative models
  15. Synthetic data, validated · Fidelity, utility, privacy
  16. Versioning data at ingest · DVC, lineage, snapshots
  17. Where it compounds in ML · Dataset-centric practice

Section 01

Data acquisition is the hidden eighty percent

Every industry survey of working machine-learning practitioners for the last decade has reached some version of the same conclusion: the largest slice of the calendar is not spent modelling. It is spent finding data, obtaining data, cleaning data, and arguing with the people who own the data. Calling this the "hidden" eighty percent is only fair because, in most presentations of the field, it is also the invisible eighty percent.

Why modelling is the smaller half

A model is a compressed description of its data. Once the data is settled — the right rows, the right columns, at the right resolution, with the right labels — the choice of model family tends to become a local optimisation, often a small one. Two well-resourced teams, handed the same dataset, usually converge on comparable predictive performance using two different architectures. Handed two different datasets, they diverge dramatically. This is what practitioners mean when they say, almost ritualistically, that data beats algorithms.

Where the time actually goes

The steps inside "data work" are uncelebrated one by one but cumulative in effect: understanding what the organisation already has, locating sources for what it does not, negotiating access, writing ingest code, reconciling schemas, labelling or validating labels, spot-checking outliers, computing coverage against the population of interest, arguing about edge cases, handing off to a downstream team, fielding the bug reports that return three weeks later. A realistic project plan budgets a minority of time for anything that could honestly be called modelling.

The dataset-centric view

A useful mental shift — associated in recent years with Andrew Ng and the data-centric AI community — is to treat the dataset, not the model, as the unit of iteration. Change the data; measure the model; repeat. On most tabular problems and a surprising number of perceptual ones, this loop dominates every other lever the team has available.

Why this chapter is long

The share of effort data takes is matched by the sprawl of skills required. Reading an API spec, parsing paginated JSON, reasoning about OAuth scopes, writing a polite crawler, understanding what a "legitimate interest" means under GDPR, supervising a labelling vendor, and validating synthetic samples all look like different jobs — and in a large company they are. But any one ML engineer working alone, or on a small team, will do all of them in the same week. The remainder of this chapter walks each in turn.

Section 02

A taxonomy of where data comes from

Every dataset, no matter how it arrives, is a compressed answer to three questions: who measured it, under what process, and for whose benefit. A simple taxonomy of sources — organised by the answer to those questions — keeps later confusion at bay.

Observational data

Observational data is collected by watching a system that would have run whether or not the observation was happening: server logs, click streams, satellite imagery, sensor readings, social-media posts, health records. It is abundant and cheap; it is also systematically biased toward whatever the measurement apparatus happens to capture well, and away from what it captures badly or not at all.

Transactional data

Transactional data is the business-system exhaust of an organisation — orders, invoices, user accounts, support tickets, financial ledgers. It has the enormous advantage of being definitional: a payment either happened or it did not. It has the equally large disadvantage of describing only the people and events the business already touches, making it a poor base for questions about people or events it does not.

Elicited data

Elicited data is collected by deliberately asking: surveys, crowd-sourced annotation, clinical questionnaires, user-research interviews, RLHF-style human feedback on model outputs. It is expensive and slow, and it is the only practical way to obtain many judgements that are otherwise locked inside human heads.

Synthetic data

Synthetic data is generated — by a simulator, an augmentation pipeline, or a generative model — rather than observed. It is the only realistic answer when the target distribution is rare (crash scenarios for autonomous vehicles), regulated (medical records), or not-yet-real (a product that has not launched). Its failure modes are covered in Sections 14 and 15.

The hybrid reality

Real datasets almost always combine categories. A training corpus for a recommendation system is transactional (orders) joined with observational (clickstream) and sometimes elicited (explicit ratings). An LLM pre-training corpus is observational (scraped web) with elicited fine-tuning (instructions, preferences) and synthetic continuation (model-generated data scored by other models). Knowing which slice is which is what lets a team reason about where each kind of bias enters.

The first useful question

Confronted with a new dataset, the most economical first question is not "what columns does it have" but "which of these four buckets does it come from, and what does that imply about what it systematically misses?" The answer shapes every downstream choice.

Section 03

Provenance, and the paper trail that ought to follow every row

Provenance is the record of where a piece of data came from, who produced it, under what licence it was released, and what happened to it before it landed in the current pipeline. It is the most neglected property of a dataset and, in practice, the single property that determines whether the dataset can still be used in six months.

What provenance covers

A full provenance record for a dataset names at least: the original source (URL, vendor, or internal system), the date and time of collection, the collection method (scraped, queried, received, generated), any intermediate transformations (deduplication, joins, filters), the licence or terms under which the data may be used, and a contact point for questions. Individual rows, too, carry their own smaller provenance: a row in a training table should, in principle, be traceable back to a record the organisation can justify having.
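
The fields above can be captured in a few lines the moment a dataset is created. A minimal sketch, assuming a small in-house schema — every field name here is illustrative, not a standard:

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class ProvenanceRecord:
    source: str                    # URL, vendor, or internal system
    collected_at: str              # ISO-8601 timestamp of collection
    method: str                    # "scraped" | "queried" | "received" | "generated"
    licence: str                   # licence identifier or contract reference
    contact: str                   # who to ask about this dataset
    transformations: list[str] = field(default_factory=list)

record = ProvenanceRecord(
    source="https://data.example.gov/air-quality",
    collected_at=datetime.now(timezone.utc).isoformat(),
    method="queried",
    licence="CC-BY-4.0",
    contact="data-team@example.com",
    transformations=["deduplicated on station_id + timestamp"],
)
```

Serialising the record (asdict, then JSON or YAML) and storing it next to the data itself is the cheapest way to keep lineage and rows from drifting apart.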

Why it gets lost

Provenance rots because nobody owns it. A download lands on someone's laptop; a colleague copies it to a shared drive; six months later the filename is all that remains of its origin. The cure is almost entirely social: store datasets inside a versioned, attributed system (see Section 16), write a short "datasheet" describing the dataset the moment it is created, and refuse to build pipelines on top of data whose lineage cannot be traced.

Datasheets and data cards

Datasheets for Datasets (Gebru et al., 2018) and Data Cards (Google, 2022) are the two most influential templates for documenting a dataset's motivation, composition, collection process, preprocessing, intended uses, and known limitations. They are the data-world analogue of a model card; shipping one with every significant dataset costs an afternoon and saves many future months.

Provenance is a regulatory requirement, not just a best practice

The EU AI Act, California's SB 942, and a growing number of US state laws treat lineage and documentation as compliance artefacts. A team that can answer "what is in this training set, and who produced each row?" is already most of the way to meeting them. A team that cannot is, increasingly, not allowed to ship.

The one-line acceptance test

A useful rule of thumb when evaluating any candidate dataset: can a colleague reconstruct how it was collected, from its documentation alone, in under thirty minutes? If not, the dataset is one layoff, one laptop failure, or one vendor change away from being unusable — and any model trained on it inherits that fragility.

Section 04

Public corpora and open data

A surprising amount of what an ML project needs is already on the internet, already indexed, and already licensed for permissive use. The first move on any new problem is to search the public catalogue — not because the public version will be the final dataset, but because it will almost always frame the problem better than starting from scratch.

The three main tiers

Public data falls roughly into three tiers. Government open data — data.gov, data.gov.uk, Eurostat, the World Bank, national statistical agencies — is the largest and most authoritative, and is generally released under permissive terms. Research corpora — ImageNet, COCO, GLUE, SQuAD, LibriSpeech, OpenStreetMap, the UCI ML repository — are the standard benchmarks around which whole sub-fields have been built. Community catalogues — Hugging Face Datasets, Kaggle, Academic Torrents, Zenodo — aggregate the long tail of user-contributed datasets and make them searchable.

Common Crawl and its descendants

A category of its own: Common Crawl, a petabyte-scale snapshot of the public web refreshed roughly monthly, is the raw substrate from which most large language-model training corpora are derived. C4, RedPajama, Dolma, and FineWeb are downstream curations — filtered, deduplicated, language-tagged, quality-scored — and The Pile draws on it alongside many other sources. Knowing they exist spares a team from trying to crawl the web itself.

Licences, and why they matter for ML

A "free" dataset is not the same as a permissively licensed one. Creative Commons variants (CC0, CC BY, CC BY-SA, CC BY-NC) differ in whether commercial use is allowed, whether attribution is required, and whether derivative works must carry the same licence. Government releases (CC0-equivalent, OGL, public domain) tend to be the most permissive. Research-only licences, familiar from ImageNet or many medical datasets, do not permit training commercial models.

The "training is fair use" question is unresolved

As of early 2026, US and EU courts have produced conflicting rulings on whether training a commercial model on copyrighted material — even material that is technically accessible — falls inside fair use or text-and-data-mining exemptions. A prudent team treats the licence on the dataset, not assumptions about fair use, as the binding constraint.

The acceptance checklist

Before adopting a public dataset, a minimum checklist: confirm the licence and compatibility with the intended product; verify the dataset is still hosted at a stable URL; check for known errata, bias audits, or retractions; hash the downloaded archive against the publisher's checksum; and store the copy inside the organisation's versioned data store rather than trusting the upstream host.
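
The checksum step in particular is cheap to automate. A sketch — the published checksum would come from the dataset's release page, and the function names are illustrative:

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so multi-gigabyte archives never load into RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def verify_download(path: str, published_checksum: str) -> None:
    """Raise if the local copy does not match the publisher's checksum."""
    actual = sha256_of(path)
    if actual != published_checksum.lower():
        raise ValueError(f"checksum mismatch for {path}: got {actual}")
```

Running this at ingest time, before the archive is unpacked, turns a silently corrupt download into a loud failure.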

Section 05

Commercial data, and the marketplaces that sell it

Public data is the first option; buying is the second. Commercial data vendors exist because there are categories — panel data, financial time series, consumer transactions, geospatial imagery, curated training sets — where the quality, coverage, or freshness a project needs is simply not available for free.

What the market actually contains

Roughly four categories dominate. Panel and survey data (Nielsen, Kantar, YouGov) describe consumer behaviour or opinion at statistically representative scale. Financial and alternative data (Bloomberg, Refinitiv, Quandl, satellite-derived indicators) feed trading and macro research. Geospatial and imagery data (Planet, Maxar, Esri) describe the physical world at a resolution and cadence individual buyers cannot produce. Curated training sets — increasingly sold by annotation companies — bundle labelled data with a quality guarantee.

Data marketplaces

Snowflake Marketplace, AWS Data Exchange, Databricks Marketplace, and Google Cloud's Analytics Hub are the main infrastructure-coupled marketplaces: a dataset listed there arrives as a share into the buyer's existing warehouse or lake, which collapses what was once a multi-week onboarding project into a click. Independent marketplaces (Datarade, Narrative, Lotame) broker across vendors but still hand the buyer a file or an API key.

What a buyer actually contracts for

A commercial data contract defines: scope (which tables, which columns, which geographies, which date ranges), refresh (daily, weekly, point-in-time), permitted use (internal analytics only vs. model training vs. product features), redistribution (almost always prohibited), term and pricing (often per-seat or per-record), and audit rights. The permitted-use clause is the one most often under-read and most often the one that bites later.

The "can this train a model?" clause

Many commercial licences written before 2023 are silent on machine-learning training; those written since are usually explicit and usually restrictive. An ML team evaluating a vendor should assume that rights to train, to embed training-derived weights in a product, and to distribute those weights are three separate permissions — and should confirm each in writing.

When to build versus buy

A rough decision rule: buy when the data is the same for every competitor (market prices, satellite imagery, consumer panels), and build when it is proprietary to the organisation's own operations. Paying a vendor for data that the organisation could, in principle, collect itself is sometimes the right call — but the case usually rests on time-to-first-result, not on cost over the lifetime of the project.

Section 06

APIs, the contract-first way to collect data

When an organisation is willing to be queried, it offers an API — a machine-readable contract that specifies what can be asked, what will be returned, at what rate, and under what authentication. Every other collection mechanism in this chapter is a workaround for the absence of one.

The API as a contract

A well-designed API commits its publisher to three things: a stable set of endpoints that continue to behave as documented, a versioned evolution path so that breaking changes are visible and gradual, and a predictable rate at which a client may consume the service. In return, the consumer agrees to identify itself, obey the rate limits, and respect the terms of service. This exchange is why consuming an API is almost always cheaper and more durable than any scraping alternative.

The API zoo

Four styles dominate production. REST (Section 7) is the default for most public and internal APIs: resources addressed by URL, operations expressed as HTTP methods, payloads usually JSON. GraphQL lets the client specify exactly which fields it wants in a single query, at the cost of more complex caching and rate-limiting. gRPC is the high-throughput binary cousin, common inside service meshes. Webhooks invert the flow entirely: the server calls the client whenever an event occurs, which is the right shape for event-driven ingestion.

Documentation is a feature

The quality of an API's documentation is a near-perfect proxy for the quality of the data behind it. OpenAPI (formerly Swagger) specifications turn documentation into a machine-readable artefact, from which client libraries can be generated in a dozen languages. A team evaluating a new data source should read the OpenAPI spec before anything else — it reveals, in a few minutes, what the API actually supports, and how much hand-rolling the caller will have to do.

A single rule: prefer the API

If the target site or service publishes an API, use it — even if the API covers only 80% of what the scraper could reach. The stability, the legal clarity, and the vendor relationship are almost always worth the trade. Scrape only what the API cannot give, and only where scraping is permitted.

What follows from a contract

The rest of this cluster of sections — REST idioms, pagination, authentication, rate limiting — is what it takes to actually consume an API well. The next section begins with the most common pattern in the wild.

Section 07

REST, pagination, and the shape of idiomatic API consumption

REST is less a specification than a set of conventions; most APIs called "RESTful" are really HTTP-plus-JSON with a roughly resource-oriented URL layout. That is enough. What matters is consuming it the way it expects to be consumed.

The canonical shape

A REST API exposes resources at URLs — /users/42, /orders?status=open, /teams/engineering/members. HTTP methods express intent: GET reads, POST creates, PUT/PATCH update, DELETE removes. Responses carry status codes (200 OK, 201 Created, 400 Bad Request, 401 Unauthorized, 404 Not Found, 429 Too Many Requests, 500 Internal Server Error) and a JSON body. A client that handles these correctly handles most APIs correctly.

A minimal idiomatic call

import httpx

client = httpx.Client(
    base_url="https://api.example.com/v1",
    headers={"Authorization": f"Bearer {TOKEN}",
             "User-Agent": "research-ingest/1.0 (contact@example.com)"},
    timeout=30.0,
)

r = client.get("/orders", params={"status": "open", "limit": 100})
r.raise_for_status()      # turn 4xx/5xx into exceptions
orders = r.json()["data"]

Four things to notice: a reusable client with shared headers, an explicit user-agent that identifies the caller, a generous timeout, and raise_for_status() so that errors become errors rather than silently corrupt JSON.

Pagination, three ways

An endpoint that returns a collection almost always paginates, and it does so in one of three styles. Offset pagination (?page=3&per_page=50) is simple and wrong: rows added or removed mid-crawl cause pages to shift, so records are duplicated or missed. Cursor pagination (?cursor=abc123, with the next cursor returned in the body) is stable and is what most well-designed APIs use. Link-header pagination (the GitHub style: Link: <...>; rel="next") is a specialisation of cursor pagination wrapped in the HTTP standard.

def crawl(endpoint, params):
    while True:
        r = client.get(endpoint, params=params)
        r.raise_for_status()
        body = r.json()
        yield from body["data"]
        cursor = body.get("next_cursor")
        if not cursor:
            break
        params = {**params, "cursor": cursor}

Idempotence, retries, and safety

GET is idempotent by definition — issuing it twice has the same effect as issuing it once — which makes retry on transient failure safe. POST is not, and well-designed APIs accept an Idempotency-Key header so the client can make it so. When writing an ingestion pipeline, a client should retry idempotent calls on network errors and on 5xx responses, and retry with caution on 429 (see Section 8).

Three headers worth sending

User-Agent identifies the caller; Accept-Encoding: gzip usually halves the response size; If-None-Match (with a stored ETag) or If-Modified-Since lets the server answer 304 Not Modified instead of re-sending unchanged data. Together they are the difference between a pipeline that can run every hour and one that cannot.

Section 08

Authentication and the quiet tax of rate limits

An API has to know who is asking before it can answer. It also has to restrict how fast any one caller may ask — both to protect itself and to keep the shared service fair. How those two constraints are implemented determines most of the operational pain of running an ingestion pipeline.

The authentication ladder

In order of increasing safety: API keys (a static string sent in a header) are easy, and are the standard for server-to-server access to low-stakes data. OAuth2 client credentials add a short-lived bearer token, obtained from a token endpoint, that can be rotated without rotating the key. OAuth2 authorization code is the full three-legged flow used when the caller is acting on behalf of a user, with explicit consent: the user is redirected to the provider, grants specific scopes, and the client receives a token usable only within those scopes. mTLS and signed requests (AWS SigV4, Google's IAM) are stricter still, and are the norm inside cloud environments.

A minimal OAuth2 client-credentials exchange

r = httpx.post(
    "https://auth.example.com/oauth/token",
    data={"grant_type": "client_credentials",
          "client_id": CLIENT_ID,
          "client_secret": CLIENT_SECRET,
          "scope": "read:orders"},
)
r.raise_for_status()
token = r.json()["access_token"]        # short-lived
expires_in = r.json()["expires_in"]     # refresh before expiry

A production client caches the token and refreshes it a little before expiry; refreshing on every request is an easy and visible mistake.

Rate limits and the reason for them

A rate limit caps how many calls a caller may make in a time window — 5000 an hour, 100 a second. When the limit is hit, the API responds with 429 Too Many Requests and usually a Retry-After header. The most common response shapes are: the token bucket (a pool that refills at a fixed rate, allowing bursts) and the leaky bucket (a fixed rate with no bursts). Reading the provider's documentation for which one applies is the only way to size a client correctly.
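
The client side of a token bucket is small enough to inline. A sketch — capacity and refill rate would come from the provider's documented limits, and the class is illustrative rather than a library API:

```python
import time

class TokenBucket:
    """Client-side token bucket: `capacity` allows bursts, `refill_rate`
    is the sustained requests-per-second budget."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity
        self.last = time.monotonic()

    def acquire(self) -> None:
        """Block until one token is available, then consume it."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.refill_rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.refill_rate)
```

Calling acquire() before every request keeps the client inside the advertised budget instead of discovering the budget through 429s.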

Exponential backoff, with jitter

import random, time

def with_backoff(fn, max_tries=6):
    for i in range(max_tries):
        try:
            return fn()
        except httpx.HTTPStatusError as e:
            if e.response.status_code not in (429, 500, 502, 503, 504):
                raise
            retry_after = e.response.headers.get("Retry-After")
            delay = float(retry_after) if retry_after else 2 ** i
            time.sleep(delay + random.random())
    raise RuntimeError("exhausted retries")

Jitter is not cosmetic

If a thousand clients retry in lockstep at 2, 4, 8 seconds, they arrive together as a thousand-client thundering herd at every retry interval. Adding a small random component ("jitter") to each backoff spreads the retries, and is the single cheapest thing a client can do to be a well-behaved citizen of someone else's service.

Secrets management

An API credential is a secret. It does not belong in a repository (see Chapter II/06 on version control), does not belong in a container image, and does not belong in an environment variable logged by a framework. The minimum viable pattern is a secrets manager (AWS Secrets Manager, Vault, GCP Secret Manager) with short-lived issuance, and a client that fetches on process start.

Section 09

Scraping, when it is the right tool

A web scraper is a program that pretends to be a browser, downloads pages that were published for human readers, and extracts structured data from them. Scraping is the answer to "there is no API" — and it is never, by itself, the best answer. But there are contexts where it is the only available one.

When scraping is defensible

Four defensible cases recur in practice. Public research: a researcher gathering, say, a corpus of government court decisions that are published but not indexed. Competitive intelligence on publicly posted information: collecting advertised prices, specifications, or job postings. Archival: preserving pages likely to disappear (the Internet Archive is itself a principled scraper). Integration with a site that has no API: pulling the caller's own data out of a service that refuses to export it.

When it is not

Scraping personal data from behind a login, scraping data the site's terms of service explicitly forbid scraping, and scraping in volumes that degrade the target service are all, at minimum, commercially reckless and, in several jurisdictions, legally exposed. The 2022 hiQ Labs v. LinkedIn decision in the US held that scraping publicly accessible pages does not by itself violate the CFAA — but that is a narrow ruling, it does not touch contract or copyright claims, and EU law starts from a stricter baseline.

Scraping is brittle, always

A scraper depends on the HTML structure of pages it did not author. That structure changes without notice, and every change breaks the scraper. A team running a scraper therefore also runs a monitoring system — alerts when extraction rates drop, structural assertions that fail loudly — and budgets ongoing maintenance. A "set-and-forget" scraper is a mirage.
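
The "structural assertions that fail loudly" can be as simple as a check run on every fetched page. A sketch against a hypothetical product-listing layout — the selectors and invariants are illustrative and would mirror whatever the extractor actually relies on:

```python
from bs4 import BeautifulSoup

def check_page_structure(html: str) -> list[str]:
    """Return the list of violated invariants; empty means the page still
    matches the shape the extractor was written against."""
    soup = BeautifulSoup(html, "html.parser")
    problems = []
    cards = soup.select("div.product-card")
    if not cards:
        problems.append("no product cards found")
    for i, card in enumerate(cards):
        if card.select_one("h3") is None:
            problems.append(f"card {i}: missing name")
        if card.select_one(".price") is None:
            problems.append(f"card {i}: missing price")
    return problems
```

Wiring the returned list into an alert (rather than silently extracting partial records) is what turns a layout change from a week of corrupt data into a same-day fix.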

The last-resort test

Before writing a scraper, a team should explicitly rule out: using the site's API, using a data vendor that already covers the same domain, asking the site directly for bulk access, or using an existing public dataset. Scraping is the fallback when all four fail, and is cheapest to operate when confined to exactly what the fallback requires.

What follows

The next two sections describe, in order, the toolkit that makes scraping tractable and the engineering that makes it polite and resumable at scale.

Section 10

The scraping toolkit, from HTTP clients to headless browsers

A modern scraping stack has three layers: a transport that fetches bytes, a parser that turns bytes into a navigable tree, and — when needed — a headless browser that executes JavaScript so that the tree looks like what a human would see.

The HTTP layer

The transport is almost always requests (synchronous, familiar) or httpx (async-capable, HTTP/2-aware). Either is sufficient for static HTML pages. The interesting decisions at this layer are connection pooling, retry policy (Section 8), cookie handling, and making sure the User-Agent identifies the caller rather than masquerading as a browser it is not.

The parsing layer

Three parsers cover almost every case. BeautifulSoup is the forgiving choice — it will read bent HTML without complaint and exposes a pleasant navigation API. lxml is faster and supports XPath, which is the most precise way to specify a deeply nested selector. selectolax is the fastest of the three and a near-drop-in for BeautifulSoup when a pipeline becomes CPU-bound.

from bs4 import BeautifulSoup
import httpx

def parse_products(html: str):
    # yield must live inside a function; the original snippet used it at module level
    soup = BeautifulSoup(html, "lxml")
    for card in soup.select("div.product-card"):
        yield {
            "name":  card.select_one("h3").get_text(strip=True),
            "price": card.select_one(".price").get_text(strip=True),
            "url":   card.select_one("a")["href"],
        }

html = httpx.get("https://example.com/products").text
products = list(parse_products(html))
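
For comparison, the same extraction expressed with lxml and XPath — the precise-selector option mentioned above. The page structure is the same hypothetical product listing:

```python
from lxml import html as lxml_html

def parse_products_xpath(page: str):
    tree = lxml_html.fromstring(page)
    for card in tree.xpath('//div[contains(@class, "product-card")]'):
        yield {
            # string(...) collapses a node-set to its text content
            "name":  card.xpath('string(.//h3)').strip(),
            "price": card.xpath('string(.//*[contains(@class, "price")])').strip(),
            "url":   card.xpath('string(.//a/@href)'),
        }
```

XPath earns its keep when the target element has no stable class and must be addressed by position or ancestry; for flat, well-classed markup the CSS-selector version above reads more clearly.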

The headless-browser layer

Many sites render their content in the browser: the initial HTML is a shell, and the product data arrives later via JavaScript. The HTTP layer sees the shell; a headless browser sees the rendered page. Playwright is the current default (Python, JavaScript, and .NET bindings; Chromium, Firefox, and WebKit targets). Selenium is the older, still-serviceable choice, common in QA tooling. Both are an order of magnitude slower and heavier than pure HTTP; use them only when the page demands it.

Scrapy, when scraping is the whole project

Scrapy is a full framework — request scheduler, item pipeline, middleware, extensibility — for crawlers big enough to justify one. A one-page extraction does not need Scrapy; a thousand-page crawl against a well-structured site usually does. Scrapy also provides the primitives (delays, concurrency limits, retry middleware) that Section 11 treats as non-negotiable.

A conservative default stack

For one-off jobs: httpx + BeautifulSoup. For recurring jobs on static HTML: Scrapy. For jobs that require JavaScript: Playwright, called only on the pages that actually need it. Mixing the three — a Scrapy crawler that hands the hard pages to Playwright — is a common and justified production pattern.

Section 11

Crawling at scale

A scraper that fetches a few pages is a script. A scraper that fetches a few million is an operating problem: it must be polite to the target, tolerant of failure, resumable after crashes, and cheap enough to run on whatever machine it happens to run on.

The politeness budget

Every request a crawler makes costs the target server CPU, bandwidth, and attention. The minimum disciplines: obey robots.txt (Section 12); add a delay between requests (a second per host is a common default); cap concurrent connections to any one host (four to eight is plausible, sixteen is impolite); use Accept-Encoding: gzip to shrink responses; honour Retry-After; identify the crawler in User-Agent with a contact address. These are not optional if the crawler is expected to continue running next week.
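
The per-host delay discipline, as a sketch — one second is the default named above, and the class is illustrative, not a library API:

```python
import time
from collections import defaultdict
from urllib.parse import urlsplit

class PerHostDelay:
    """Enforce a minimum interval between requests to the same host;
    requests to different hosts do not wait on each other."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self.last_hit: dict[str, float] = defaultdict(float)

    def wait(self, url: str) -> None:
        host = urlsplit(url).netloc
        gap = time.monotonic() - self.last_hit[host]
        if gap < self.min_interval:
            time.sleep(self.min_interval - gap)
        self.last_hit[host] = time.monotonic()
```

A worker calls wait(url) immediately before each fetch; the bookkeeping cost is one dictionary entry per host.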

Queues, deduplication, state

A crawl that fits in RAM is not at scale. At scale, the frontier of URLs-to-visit lives in an external queue (Redis, SQS, Kafka), a visited-set lives in a persistent store (Redis set, RocksDB, DuckDB), and extracted records land in an object store (S3, GCS) partitioned by date. The crawler itself becomes a pool of workers that pull from the queue, visit a URL, enqueue newly discovered URLs, and record the result. The decoupling is what makes crashes recoverable.
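
The worker loop itself is small. A sketch with in-memory stand-ins for the three external pieces (frontier queue, visited-set, result store) — in production these would be the Redis/SQS queue, persistent set, and object store named above, but the structure is the point:

```python
import hashlib
import queue

frontier: "queue.Queue[str]" = queue.Queue()   # stand-in for Redis/SQS
visited: set[str] = set()                      # stand-in for a persistent set
results: dict[str, dict] = {}                  # stand-in for the object store

def record_key(url: str) -> str:
    return hashlib.sha256(url.encode()).hexdigest()

def worker(fetch, extract, discover) -> None:
    """Pull URLs until the frontier drains: fetch, extract, enqueue discoveries.
    fetch(url) -> page, extract(page) -> record, discover(page) -> [urls]."""
    while True:
        try:
            url = frontier.get_nowait()
        except queue.Empty:
            return
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)
        results[record_key(url)] = extract(page)
        for new_url in discover(page):
            if new_url not in visited:
                frontier.put(new_url)
```

Because every piece of state lives outside the worker, running ten copies of worker() in parallel — or restarting one after a crash — changes nothing about the result.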

Resumability

The single best defence against wasted crawling is idempotent extraction: visiting a URL twice produces the same record, and the record has a stable key (a content hash, or a normalised URL). With that property, a crash simply leaves some URLs unvisited; the worker wakes up, asks the queue, and continues. Without it, every crash forces a re-crawl and irritates the target site.
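One plausible implementation of such a key, combining a normalised URL with a content hash; the specific normalisation choices here (lowercased host, sorted query string, dropped fragment) are illustrative, not canonical:

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def stable_key(url, content):
    """A stable record key: normalised URL plus a content hash.

    The property that matters is that visiting the same page twice
    yields the same key, so re-crawls overwrite rather than duplicate.
    """
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query)))
    norm = urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path or "/", query, "")
    )
    digest = hashlib.sha256(content.encode()).hexdigest()[:16]
    return f"{norm}#{digest}"

k1 = stable_key("https://Example.com/page?b=2&a=1#frag", "body")
k2 = stable_key("https://example.com/page?a=1&b=2", "body")
print(k1 == k2)  # True: one record, however the URL was spelled
```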

Adaptive concurrency

A crawler that runs at fixed concurrency ignores the signal the target is sending. A better client watches response latency and error rate, scales concurrency down when either rises, and scales it up when both fall. Scrapy's AutoThrottle extension implements this idea; any mature in-house crawler implements something similar.
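A sketch of the additive-increase, multiplicative-decrease version of the idea; the latency and error thresholds are illustrative, not AutoThrottle's actual defaults:

```python
class AdaptiveThrottle:
    """AIMD concurrency control: widen slowly while responses are
    healthy, back off sharply when latency or errors rise."""

    def __init__(self, concurrency=2, floor=1, ceiling=16,
                 latency_limit=2.0, error_limit=0.05):
        self.concurrency = concurrency
        self.floor, self.ceiling = floor, ceiling
        self.latency_limit = latency_limit
        self.error_limit = error_limit

    def observe(self, avg_latency, error_rate):
        """Feed one window of observed latency/error; get new limit."""
        if avg_latency > self.latency_limit or error_rate > self.error_limit:
            # Multiplicative decrease: the target is signalling distress.
            self.concurrency = max(self.floor, self.concurrency // 2)
        else:
            # Additive increase: responses are healthy, probe upward.
            self.concurrency = min(self.ceiling, self.concurrency + 1)
        return self.concurrency

throttle = AdaptiveThrottle()
throttle.observe(0.2, 0.0)   # healthy window: concurrency creeps up
throttle.observe(6.0, 0.0)   # slow responses: concurrency is halved
```

The asymmetry is deliberate and is the same one TCP congestion control uses: probing upward is cheap, overstaying a struggling server's welcome is not.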

Proxies and the ethics of rotation

Rotating proxies, residential IP pools, and browser fingerprint randomisation are technologies sold to "avoid being blocked". They are rarely appropriate. A crawler that is being blocked is almost always being blocked for a reason — usually because it is ignoring robots.txt, exceeding advertised rate limits, or scraping content the target has asked not to be scraped. The right response is almost always to slow down or to stop, not to evade.

Section 13

Annotation and labelling, the craft of turning raw data into supervision

Most useful supervised-learning datasets are raw data that has been looked at, interpreted, and tagged by a human. The quality of the resulting labels — and the cost, speed, and ethics of producing them — determines the ceiling of every model trained on them.

The landscape of labelling

Three archetypes of provider exist. Crowd platforms — Amazon Mechanical Turk, Prolific — offer many workers at low marginal cost, with variable quality. Managed services — Scale AI, Labelbox, Appen, Surge — combine tooling, a workforce, and quality-control processes; they are more expensive and far more reliable. Specialist firms (Centaur Labs for medical, Anolytics for automotive, many others) pair domain-trained annotators with a narrower task. A team choosing among them is really choosing how much quality control to outsource.

Guidelines are the whole product

A labelling project's output is only as good as its annotation guideline. A good guideline defines the task, enumerates edge cases with worked examples, specifies how to handle ambiguity (skip, flag, best-guess), and evolves as the labellers encounter unforeseen cases. The single greatest predictor of a successful labelling project is the number of times the guideline was revised in its first week.

Inter-annotator agreement

When two annotators look at the same sample, do they agree? Measuring that agreement — with Cohen's kappa for two raters or Fleiss' kappa for many — gives a principled upper bound on achievable model performance. A label that humans cannot agree on is one a model has no business being asked to predict, and a kappa below around 0.6 typically indicates the task itself needs rethinking, not the annotators.
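Cohen's kappa is short enough to compute by hand; a self-contained sketch for two raters over the same samples:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """kappa = (p_o - p_e) / (1 - p_e): observed agreement corrected
    for the agreement expected if both raters labelled at random
    with their own marginal label frequencies.

    Undefined (division by zero) in the degenerate case where both
    raters always emit a single identical label.
    """
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    p_e = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

print(cohens_kappa(["x", "x", "y", "y"], ["x", "y", "x", "y"]))  # 0.0
```

The printed case is instructive: the raters agree on half the samples, but that is exactly what chance predicts given their marginals, so kappa is zero, not 0.5.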

Active learning and human-in-the-loop

When labels are expensive, a model can help decide what to label next. An active learning loop trains on a small labelled set, predicts on the unlabelled pool, and asks humans to label the examples the model is least certain about — maximising information gained per labelled sample. Human-in-the-loop pipelines run the same idea in production: a model labels automatically, surfaces low-confidence predictions to a human, and retrains on the corrected results.
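The selection step can be as simple as ranking the pool by the model's top predicted probability. Least-confidence is one of several uncertainty measures (margin and entropy are the usual alternatives); the pool probabilities below are a made-up fixture:

```python
def least_confident(probabilities, k):
    """Return the k pool indices whose top class probability is lowest.

    `probabilities` is a list of per-class probability vectors from
    the current model's predictions on the unlabelled pool.
    """
    confidence = [(max(p), i) for i, p in enumerate(probabilities)]
    return [i for _, i in sorted(confidence)[:k]]

# One loop iteration: score the pool, send the least certain
# examples to annotators, fold the new labels in, retrain, repeat.
pool_probs = [[0.9, 0.1], [0.55, 0.45], [0.5, 0.5], [0.8, 0.2]]
print(least_confident(pool_probs, 2))  # [2, 1]
```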

The ethical floor of labelling

Crowd-sourced labelling has a well-documented history of paying less than minimum wage and exposing workers to traumatic content (toxicity, CSAM detection, gore) without meaningful support. Ethical labelling programmes name and meet these obligations: fair pay, opt-out for sensitive content, mental-health support, and genuine consent. A team outsourcing labels inherits responsibility for these conditions whether it acknowledges them or not.

Section 14

Synthetic data, three ways to manufacture what the world did not provide

Synthetic data is generated rather than observed. It is the practical answer when the real distribution is rare, regulated, or not yet real, and in the last five years it has become a large fraction of the training signal for frontier models. The three main methods differ sharply in what they model and what they require.

Simulation

Simulation generates synthetic data from a first-principles model of the environment. Driving datasets from CARLA and AirSim; robotics datasets from MuJoCo and Isaac Sim; rendered scenes from Unreal or Unity. Simulation shines when the physics is well-understood and photorealism matters less than correctness — rare events, hazardous ones, or ones that have not yet happened. The gap between simulated and real data, the sim-to-real gap, is the dominant concern.

Augmentation

Augmentation creates new synthetic samples by perturbing existing real ones: cropping, rotating, and colour-jittering images; paraphrasing text; time-stretching and pitch-shifting audio; adding Gaussian noise to tabular features. It does not create information from nothing — a rotated cat is still the original cat — but it hardens a model against nuisance variation that is present in deployment but rare in training. It is the cheapest and safest form of synthetic data.
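For tabular features, the Gaussian-noise variant is a few lines; the sigma below is illustrative and should in practice be small relative to each feature's natural spread, so the perturbation stays label-preserving:

```python
import random

def jitter_rows(rows, sigma=0.01, n_copies=2, seed=0):
    """Augment numeric tabular rows with small Gaussian perturbations.

    Returns the originals followed by n_copies perturbed versions of
    each row. Categorical and label columns should be excluded before
    calling this; noise on a class label is corruption, not variation.
    """
    rng = random.Random(seed)
    out = list(rows)
    for _ in range(n_copies):
        for row in rows:
            out.append([x + rng.gauss(0.0, sigma) for x in row])
    return out

data = [[1.0, 2.0], [3.0, 4.0]]
augmented = jitter_rows(data, sigma=0.05)
print(len(augmented))  # 6: the 2 originals plus 2 perturbed copies of each
```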

Generative models

The newest source: a generative model, having been trained on real data, emits samples that look drawn from the same distribution. Large language models generate synthetic instructions, chat traces, and domain-specific text. Diffusion models generate synthetic images and video. GANs and VAEs generate synthetic tabular rows (SDV, CTGAN) for privacy-preserving releases. This is by far the most productive and most dangerous of the three methods — productive because it scales, dangerous because errors in the generator compound (Section 15).

Distillation from larger models

A specialisation worth naming: a smaller "student" model is trained not on raw labels but on outputs (probabilities, rationales, answers) produced by a larger "teacher" model. The LLM-synthesised instruction datasets that drove the open-source LLM boom of 2023–2025 — Alpaca, Vicuna, WizardLM, many others — are essentially distillation corpora. The technique is powerful, and it is also where most of the live legal questions about "training on model outputs" land.

The one-line use case

Synthetic data is cheap where real data is expensive, and available where the underlying phenomenon itself is rare. It is a leverage tool; it is not a substitute for real data about the actual population a model will serve.

Section 15

Synthetic data, validated — fidelity, utility, privacy

Synthetic data can be worse than useless: it can make a model look good on benchmarks and quietly fail on reality. The only protection is an explicit validation programme that measures three distinct properties.

Fidelity

Fidelity asks: does the synthetic sample distribution resemble the real one, statistically? Univariate checks (per-feature means, variances, histograms); multivariate checks (correlations, joint distributions); distributional distance metrics (KL divergence, Wasserstein distance, FID for images, MAUVE for text). A generator whose output passes none of these is not synthesising — it is hallucinating.
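A sketch of the univariate layer of such a report, using the fact that for two equal-sized samples the 1-D Wasserstein distance is the mean absolute difference between their sorted values. The thresholds a team would apply to these numbers are deliberately left out; the point is that the checks are explicit and recorded:

```python
def wasserstein_1d(xs, ys):
    """1-D Wasserstein distance for equal-sized samples: mean absolute
    difference between sorted values (the empirical quantile coupling)."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def univariate_report(real, synthetic):
    """Per-feature fidelity check over columns given as name -> values."""
    report = {}
    for name in real:
        r, s = real[name], synthetic[name]
        report[name] = {
            "mean_gap": abs(sum(r) / len(r) - sum(s) / len(s)),
            "wasserstein": wasserstein_1d(r, s),
        }
    return report

real = {"age": [20, 30, 40, 50]}       # illustrative reference column
synth = {"age": [22, 29, 41, 48]}      # illustrative synthetic column
report = univariate_report(real, synth)
print(report)
```

Note what this example exposes: the means match exactly while the Wasserstein distance does not, which is precisely why a report needs distributional distances and not just moment checks.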

Utility

A higher bar than fidelity: does a model trained on synthetic data perform comparably to one trained on real data? This is usually measured with the TSTR (train-synthetic, test-real) protocol. Utility is the only property that correlates with downstream usefulness; two synthetic datasets with identical fidelity can differ enormously in utility, because fidelity does not notice which moments of the distribution the downstream task actually depends on.
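The TSTR protocol in miniature, with a toy nearest-centroid model standing in for whatever the real downstream model is (TSTR is a protocol, not a model choice) and made-up one-dimensional data:

```python
def centroid_classifier(train_x, train_y):
    """Tiny nearest-centroid model: predict the class whose training
    centroid is closest in squared distance."""
    groups = {}
    for x, y in zip(train_x, train_y):
        groups.setdefault(y, []).append(x)
    cents = {y: [sum(c) / len(c) for c in zip(*xs)] for y, xs in groups.items()}
    def predict(x):
        return min(cents, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, cents[y])))
    return predict

def accuracy(model, xs, ys):
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(ys)

# TSTR: train on synthetic, test on REAL held-out data; compare with
# TRTR (train-real, test-real) to get a utility ratio.
real_x,  real_y  = [[0.0], [0.2], [1.0], [1.2]], [0, 0, 1, 1]
synth_x, synth_y = [[0.1], [0.3], [0.9], [1.1]], [0, 0, 1, 1]
test_x,  test_y  = [[0.1], [1.1]], [0, 1]

trtr = accuracy(centroid_classifier(real_x,  real_y),  test_x, test_y)
tstr = accuracy(centroid_classifier(synth_x, synth_y), test_x, test_y)
print(tstr, trtr)
```

A TSTR score close to TRTR means the synthetic data carried the moments the task depends on; a large gap means it did not, whatever the fidelity report said.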

Privacy

A common motivation for synthetic data is to release a shareable stand-in for sensitive real data. That only works if the synthetic sample does not leak the real rows. The relevant tests: membership inference attacks (can an attacker tell whether a particular real row was in the training set?); attribute inference (can an attacker reconstruct a missing column?). Differential privacy provides a guarantee against both classes of attack, with a mathematically quantified budget — at the cost of some fidelity and utility.
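A cheap first screen in this family is the distance-to-closest-record check: if synthetic rows sit systematically closer to the training rows than to a holdout drawn from the same population, the generator is likely memorising. It is weaker than a full membership-inference attack, but it costs almost nothing. A sketch with made-up two-dimensional rows:

```python
def nearest_distance(row, table):
    """Euclidean distance from `row` to its closest row in `table`."""
    return min(
        sum((a - b) ** 2 for a, b in zip(row, other)) ** 0.5
        for other in table
    )

def dcr_check(synthetic, train, holdout):
    """Mean distance-to-closest-record of the synthetic rows against
    the training set and against a same-population holdout."""
    d_train = sum(nearest_distance(s, train) for s in synthetic) / len(synthetic)
    d_holdout = sum(nearest_distance(s, holdout) for s in synthetic) / len(synthetic)
    return d_train, d_holdout

# A leaky generator that copies training rows is caught immediately.
train   = [[0.0, 0.0], [1.0, 1.0]]
holdout = [[0.1, 0.0], [0.9, 1.1]]
leaky   = [[0.0, 0.0], [1.0, 1.0]]   # verbatim copies of the train rows
d_t, d_h = dcr_check(leaky, train, holdout)
print(d_t, d_h)  # d_t is exactly 0.0; d_h is positive — suspicious
```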

Model collapse

A 2023 result from Shumailov et al., and several follow-ups since: a generative model trained recursively on its own outputs degrades, with rare modes disappearing first. Model collapse is the synthetic-data version of a photocopy-of-a-photocopy, and it is an increasing concern as synthetic content grows as a share of the web and of training corpora. Mixing synthetic data with a non-trivial fraction of real data, and keeping that fraction documented, is the known mitigation.
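The one-dimensional version of the effect fits in a few lines: fit a Gaussian to its own samples, regenerate, and repeat. Each generation's estimated spread is a multiplicative random walk with downward drift, so variance tends to bleed away, the photocopy-of-a-photocopy in miniature. This is a toy illustration of the mechanism, not a reproduction of the paper's setup:

```python
import random
import statistics

def recursive_fit(iterations=200, n=20, seed=0):
    """Fit (mean, std) to n samples drawn from the previous fit,
    iterations times; return the std estimated at each generation."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    stds = [sigma]
    for _ in range(iterations):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        mu, sigma = statistics.fmean(sample), statistics.pstdev(sample)
        stds.append(sigma)
    return stds

stds = recursive_fit()
```

Plotting `stds` for small `n` typically shows the spread collapsing over generations; mixing each generation's training set with real (generation-zero) samples arrests the decay, which is exactly the mitigation named above.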

The validation contract

A synthetic dataset shipped into training should carry, alongside its provenance record, a short validation report: fidelity scores against the real reference, TSTR utility on the downstream task, and a privacy-attack result if the data is sensitive. Without all three, a team is guessing.

Section 16

Versioning data the moment it arrives

The previous chapter described git as the unit of reversibility for code. The same argument applies, with small modifications, to data: the project whose datasets are versioned is the project that can reproduce a model in six months; the one whose datasets are not is the one debugging last July's training set in February.

Why git is not enough

Git stores line diffs of text files. A 10 GB parquet file committed to git bloats the repository, slows every clone, and hides most of its actual changes. The two well-known fixes — Git LFS (large-file pointers) and DVC (data version control) — keep the small pointer files in git and push the large payloads to an external store (S3, GCS, Azure Blob, on-prem), recording content hashes that make the pair reproducible.
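The pointer mechanism is small enough to sketch: hash the payload, store it content-addressed in a cache, and commit only the tiny pointer. This illustrates the idea; DVC's actual pointer file format and cache layout differ in the details:

```python
import hashlib
import json
import pathlib
import shutil
import tempfile

def push(data_path, cache_dir):
    """Store a payload content-addressed by its hash; return the small
    pointer (as JSON text) that is what actually gets committed to git."""
    payload = pathlib.Path(data_path).read_bytes()
    digest = hashlib.md5(payload).hexdigest()
    dest = pathlib.Path(cache_dir) / digest[:2] / digest[2:]
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(data_path, dest)
    return json.dumps({"md5": digest, "size": len(payload), "path": str(data_path)})

# Round trip in a temporary directory with a made-up dataset.
with tempfile.TemporaryDirectory() as tmp:
    data = pathlib.Path(tmp) / "train.csv"
    data.write_text("id,label\n1,cat\n")
    pointer = json.loads(push(data, pathlib.Path(tmp) / "cache"))
    print(pointer["size"])  # 15 — the pointer stays tiny; the payload lives in the cache
```

Because the cache path is derived from the content hash, two identical payloads are stored once, and a pointer committed in git is enough to retrieve exactly the bytes it was created from.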

Beyond DVC: table-format lakehouses

Delta Lake, Apache Iceberg, and Apache Hudi extend the idea to warehouse-scale tabular data. Each commit to a table is an atomic operation, each version is addressable by timestamp or ID ("time travel"), and schema evolution is a first-class operation rather than an emergency. A model trained last Tuesday can be re-trained on exactly last Tuesday's data — which is the reproducibility property that matters.

Lineage graphs

Lineage tools — OpenLineage, Marquez, Apache Atlas, and the lineage features inside Databricks, Snowflake, and dbt — record the graph of datasets and the transformations between them. The payoff is visible the first time a metric moves unexpectedly: lineage tells a team every upstream table that changed, and who changed it, in seconds rather than hours.

Great Expectations and contract testing

Great Expectations, Soda, and the tests-as-contracts pattern in dbt turn schema and distributional assertions into code: a column must be non-null, a primary key must be unique, a categorical must come from a known set, a daily row count must fall within a tolerance. The assertions run at ingest; a pipeline that fails its contract fails loudly, before the bad data reaches a model.
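Stripped of framework, the pattern is just assertions over rows run at ingest time. The column names, known category set, and row-count tolerance below are illustrative stand-ins for a real contract, not anyone's actual schema:

```python
def check_contract(rows):
    """Run the four contract checks named above over a list of dicts;
    return the list of failures (empty means the contract holds)."""
    failures = []
    ids = [r.get("id") for r in rows]
    if any(i is None for i in ids):
        failures.append("id must be non-null")
    if len(ids) != len(set(ids)):
        failures.append("id must be unique")
    if any(r.get("country") not in {"DE", "FR", "US"} for r in rows):
        failures.append("country outside known set")
    if not 2 <= len(rows) <= 10_000:       # daily row-count tolerance
        failures.append("row count out of tolerance")
    return failures

good = [{"id": 1, "country": "DE"}, {"id": 2, "country": "US"}]
bad  = [{"id": 1, "country": "DE"}, {"id": 1, "country": "XX"}]
print(check_contract(good))  # []
print(check_contract(bad))   # duplicate id + unknown country
```

What the frameworks add on top of this core is declarative configuration, reporting, and scheduling; the failing-loudly-at-ingest behaviour is the part that matters.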

The ingest-time discipline

The cheapest place to version, validate, and document a dataset is the moment it first lands in the organisation. Every hour the raw copy spends unversioned is an hour of provenance lost. A useful ingest pipeline writes to a versioned store, runs contract tests, records lineage, and updates a data card — all before any downstream consumer sees the rows.

Section 17

Where data work compounds in machine learning

Every habit described in this chapter — provenance, API discipline, scraping politeness, annotation quality, synthetic validation, ingest-time versioning — looks like overhead for the first project that adopts it, and pays back every project thereafter. Three effects explain most of the compounding.

Reproducibility as a default

A model trained on a versioned dataset, produced by a pipeline whose code and configuration are themselves versioned, can be retrained by anyone with access to the same systems. That is the baseline property of reproducibility. It is the difference between a model an organisation can debug and one it can only admire. In regulated settings, it is the difference between a model that can be shipped and one that cannot.

Dataset-centric iteration

Once datasets are versioned, improving them becomes tractable in the same way improving code is: a hypothesis, a change, a comparison, a merge. "Relabel the five hundred examples the current model gets wrong and retrain" is a coherent engineering task if the dataset is versioned and a vague aspiration if it is not. Andrew Ng's data-centric AI campaign is, in practice, the systematic exploitation of this fact.

Evaluation that actually matches deployment

A dataset with careful provenance and documented coverage makes it possible to construct evaluation sets that look like the population the model will actually serve — stratified by demography, by time, by source, by whatever dimension matters. Without that, "93% accuracy" is a number about the test set, and the test set is an artefact of the scraper that made it.

Regulatory and organisational durability

The 2024–2026 wave of AI regulation — the EU AI Act, China's interim measures, US state-level laws, sector-specific rules in finance and healthcare — is converging on a common shape: organisations must describe what data their models were trained on, demonstrate risk management around that training, and preserve that documentation for years. The teams that have, from the beginning, treated data acquisition as a first-class engineering practice are the teams for whom this imposes almost no new cost.

The through-line

A machine-learning system is its data. The earliest decisions about a dataset — where it came from, under what licence, how it was labelled, how it was versioned — propagate, unchanged and unacknowledged, through every downstream artefact. Taking those decisions seriously is not a tax on the project; it is the project, for a share of the schedule large enough that denying the share costs more than paying it.

Further reading

Where to go next

Data acquisition is a field sprawled across HTTP standards, legal frameworks, vendor documentation, annotation handbooks, and a growing research literature on synthetic data. The list below picks the canonical standards, the short books worth owning, the vendor pages worth bookmarking, and the research papers that have genuinely changed practice.

Canonical references on HTTP and APIs

Web scraping, practically

Open catalogues and public data

Annotation, labelling, and human-in-the-loop

Synthetic data — methods and validation

Versioning, lineage, and data quality

Legal and ethical references

This page opens Part III: Data Engineering & Systems — the half of the compendium that treats data itself as a first-class engineering concern. The chapters that follow extend the same attitude to storage and warehousing, to the pipelines and orchestration that move data between systems, to streaming and real-time ingestion, to distributed computing, to the cloud platforms on which all of it runs, and finally to the governance layer — quality, metadata, lineage, access — that makes the whole edifice auditable. If this chapter is about getting the data in the door, the next six are about what it takes to keep it useful.