Not every piece of data can be held for the nightly batch. Payments clear in seconds, fraud has to be caught in milliseconds, a user's next recommendation is wanted before they scroll away, a sensor emits a reading that matters only while it is fresh. Streaming is the architectural answer: an unbounded, ordered flow of events rather than a bounded file; a processor that updates continuously rather than once a day; a set of correctness guarantees — exactly-once, event-time, watermarks — that batch never had to think about. The tools that carry the events — Kafka, Pulsar, Kinesis — and the engines that compute over them — Flink, Spark Structured Streaming, Kafka Streams — are the core of this chapter. So is the harder part: the semantics that make streaming correct rather than just fast.
The first three sections frame the subject: why streaming exists at all given how well batch works, why the append-only log is the primitive the whole field is built on, and how stream processing actually differs from the batch/stream sketch in Chapter 03. Sections four through six are the transport layer — Apache Kafka as the dominant broker, its ecosystem (Connect, Streams, Schema Registry), and the alternatives (Pulsar, Kinesis, Redpanda, NATS) that matter in different niches. Sections seven through nine move from transport to computation: what stream processing actually is, Apache Flink as the reference engine, and Spark Structured Streaming as the micro-batch alternative. Sections ten through thirteen are the semantic core of the chapter — event time versus processing time, windowing, state and checkpoints, and the three delivery guarantees (at-most-once, at-least-once, exactly-once) — the ideas that separate a correct streaming system from one that is merely live. Sections fourteen through sixteen widen the frame: event-driven architectures beyond analytics, streaming ML (online features, real-time inference), and the operational reality of running streams at scale (lag, backpressure, ordering). The final section connects all of it to why streaming infrastructure is increasingly on the critical path for modern ML.
Conventions: vendor names are used where there is no good generic term — Kafka specifically, brokers generically; Flink specifically, stream engines generically. References to "the warehouse" still assume the storage layer from Chapter 02, and the pipelines from Chapter 03 are assumed as context for how batch and stream interact in a real platform. The treatment of semantics (event time, watermarks, exactly-once) is deliberately careful: these are the topics where intuition from batch tends to mislead, and where most production streaming incidents originate.
Batch is the default, and for most analytical work it is also the right answer. Streaming is the architecture you adopt when the default stops working — when the gap between an event happening in the world and a decision being made about it has to collapse from hours to seconds, and the whole shape of the system has to change to make that collapse possible.
A surprising amount of software is a latency argument. Fraud detection that fires after yesterday's batch has run is useless; a recommendation that reflects a user's previous session is worth a tenth of one that reflects their current one; an alert on a failing production line that arrives an hour late is a postmortem, not an alert. Each of these is a case where the value of the data decays faster than a batch pipeline can deliver it, and each of them pushes the underlying architecture toward streaming. The question is rarely "should this be real-time?" in the abstract; it is "what is it worth to compute this in one hundred milliseconds instead of one hour?" — and for a narrow but important set of use cases the answer is: a lot.
A streaming system has three properties that distinguish it from a batch pipeline. Data arrives as an unbounded sequence of events rather than as a finite file. Processing runs continuously rather than on a schedule. And the system's correctness is defined in terms of invariants over time — "the running total is accurate within five seconds of the latest event", "no event is lost", "each event is counted exactly once" — rather than in terms of a completed batch's output matching an expected result. All three together are what make streaming a different discipline rather than a faster batch pipeline.
Every concern that batch handles implicitly becomes explicit in a streaming system. Ordering: events arrive out of order, and the correct handling depends on the semantics the consumer expects. Time: is a count of "events in the last hour" based on when the event happened, when it arrived, or when it was processed? Failure: a stream runs forever, which means a worker crash cannot mean "restart the job" — it has to mean "resume from where the worker left off, without losing or duplicating events". State: most interesting stream computations are stateful (counts, joins, aggregates), which means there has to be a recovery story for the state itself. The rest of this chapter is the systematic answer to each of those problems.
The single most common streaming mistake is adopting it where batch would do. Streaming adds operational load, debugging difficulty, and cost; it earns its place only where the latency requirement genuinely demands it. A reasonable default is: batch first, streaming when batch no longer fits. Teams that invert this — stream first, batch only where stream is clearly overkill — spend a large fraction of their engineering budget on problems their use case never actually had.
The conceptual core of modern streaming is not a queue, a database, or a pub/sub bus — it is the log. An ordered, append-only, immutable sequence of records, partitioned for scale and replicated for durability. Most of what makes streaming systems behave the way they do is a direct consequence of this single data structure.
An event is a fact: something happened at a time, has a key (a user, an order, a sensor), carries a payload, and is immutable once written. "User 42 clicked product 7 at 2026-04-21T13:02:11Z" is an event; so is "Order 9001 was paid" and "Sensor A reported 72°F". The discipline is to represent what happened, not the resulting state — the state is a view of the event stream, not the stream itself. Jay Kreps's 2013 essay "The Log" is the canonical articulation of this inversion, and it is the conceptual step most batch engineers have to make to think clearly about streaming.
Once written, an event is never modified. Corrections are themselves new events ("order 9001 was refunded"), and the derived state adjusts accordingly. The immutability property is what makes a log a useful foundation: it can be replayed, re-processed by a different consumer, fed into a newly trained model, and audited. An in-place-mutated database table provides none of these naturally. The cost is that the log grows; retention policies and compacted topics (see below) are the answer.
A log at scale is not a single sequence; it is a set of partitions, each of which is its own ordered sequence. Events are routed to partitions by a key, so all events for the same user land on the same partition in order. Each event in a partition has a monotonically increasing offset; consumers track where they are by remembering the last offset they processed. Partitions are replicated across brokers so that the loss of any single broker does not lose events. These three properties — partitioned for throughput, offset-addressed for resumability, replicated for durability — together are the thing that makes a log usable as the backbone of a streaming platform.
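The three properties are small enough to sketch. A toy in-memory log (hypothetical names, no replication) that routes events to partitions by key hash, assigns per-partition offsets, and lets a consumer resume by sequential read from a remembered offset:

```python
import hashlib

class ToyLog:
    """A minimal partitioned, append-only log (in-memory, no replication)."""
    def __init__(self, num_partitions=3):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # Same key -> same partition, so per-key order is preserved.
        p = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1   # (partition, offset)

    def read(self, partition, from_offset):
        # Sequential read from an offset: how a consumer resumes after restart.
        return self.partitions[partition][from_offset:]

log = ToyLog()
p1, o1 = log.append("user-42", "clicked product 7")
p2, o2 = log.append("user-42", "added to cart")
assert p1 == p2                      # same key, same partition: ordered
assert log.read(p1, o1) == ["clicked product 7", "added to cart"]
```

Replication is the one property the sketch omits; in a real broker each partition would exist as a leader plus followers on other machines.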
The log abstraction predates Kafka, but Kafka is what made it operationally mainstream. "A distributed, replicated, partitioned, append-only commit log" is both a description of Kafka and a description of the abstraction the rest of this chapter will take as given. Alternatives (Pulsar, Kinesis, Redpanda) implement the same abstraction with different trade-offs; the vocabulary is substantially shared.
Chapter 03 opened the batch/stream distinction; this chapter has to close it. The useful view is not that stream is "batch but faster", but that several properties — the shape of the input, the definition of correctness, the cost profile, the failure handling — change together when you cross the line.
A batch job reads a bounded dataset — yesterday's Parquet files, this hour's partition, the Postgres table as of now — and produces a bounded output. A streaming job reads an unbounded input and produces an unbounded output. The practical consequence is that "done" is a state a batch job is in and a stream job never is. Every streaming system's vocabulary — windows, watermarks, checkpoints — exists so that a job whose input is infinite can still produce useful finite answers along the way.
When a batch job fails, the operational move is "re-run from the top". That is not available to a streaming job. The state has to survive the failure, the in-flight events have to be replayed without being processed twice, and the output has to remain correct after the restart. Every serious stream processor solves this with some variant of the same pattern: periodically snapshot the state, record the input offsets at the snapshot boundary, and on restart roll back to the most recent snapshot and replay from the recorded offsets. The difference in failure handling is one of the largest practical differences between the two worlds.
A batch pipeline's cost is proportional to the data it processes and easy to predict: a nightly job processes a day's worth of data, so the bill is the per-unit cost times one day of data. A streaming pipeline's cost is proportional to wall-clock time — it runs continuously, paying for compute whether or not events are arriving — and to the state it maintains. For workloads with high event rates the stream can be cheaper; for workloads with low rates it can be surprisingly expensive. The latency shape is the inverse: batch latency is its own run duration plus the schedule gap (typically hours), streaming latency is measured in seconds or milliseconds.
Real platforms are almost always hybrid. The warehouse is batch; the event bus is streaming; a CDC feed bridges between them. A reasonable design principle is to use batch for everything the consumer can tolerate it for, and streaming only for the specific data whose value decays faster than batch can deliver. Chapter 03's pipeline discipline applies to both; what this chapter adds is the semantics that only the stream side needs.
Kafka is, for streaming, what Parquet is for storage: the default. Written at LinkedIn, open-sourced in 2011, now the Apache project at the centre of most event-driven architectures — it is the single technology whose vocabulary the rest of streaming has adopted. An understanding of what Kafka actually is, and is not, is the load-bearing piece of this chapter.
A Kafka cluster is a set of brokers — servers that store and serve events. Events are written to topics, which are logical names; each topic is split into one or more partitions, each of which is an ordered, append-only log stored on disk and replicated across brokers. Producers write events to a topic; the partition is chosen by hashing a key (or round-robin if no key). Consumers read from one or more partitions, tracking their position (the offset) so that a restart resumes without loss. Every higher-level concept in Kafka — consumer groups, transactions, compacted topics — is built on this substrate.
A Kafka producer writes batches of records to a partition; its knobs (acks, retries, enable.idempotence) control what happens on network failure and how strong the delivery guarantee is. A Kafka consumer reads from partitions in order, processes events, and commits its offsets back to the broker so that a restart knows where to resume. Consumer groups let multiple consumers share the work of reading a topic: each partition is assigned to exactly one consumer in the group, and adding consumers up to the partition count is how a pipeline horizontally scales.
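The consumer-group contract — each partition assigned to exactly one consumer in the group — can be illustrated with a toy round-robin assignment (a sketch only; Kafka's real assignors are richer, with stickiness and cooperative rebalancing):

```python
def assign_partitions(partitions, consumers):
    """Round-robin: every partition goes to exactly one consumer in the group."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# Six partitions, two consumers: each consumer reads three partitions.
a = assign_partitions(list(range(6)), ["c1", "c2"])
assert a == {"c1": [0, 2, 4], "c2": [1, 3, 5]}

# Adding a consumer triggers a rebalance and shrinks each consumer's share;
# beyond six consumers, the extras would sit idle.
b = assign_partitions(list(range(6)), ["c1", "c2", "c3"])
assert a["c1"] != b["c1"]
```

The assertion at the end is the scaling rule in miniature: parallelism is bounded by the partition count chosen at topic creation.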
Kafka is a distributed log; it is not a database, and not a message queue in the classical RabbitMQ sense. It does not provide per-message acknowledgements (it provides per-offset commits), does not provide random access to events (it provides sequential reads from an offset), and does not provide complex routing (it provides topics and partitions, and the routing is the consumer's problem). The result is very high throughput, very durable storage, and a simple-but-not-easy programming model. Tools built around Kafka — Streams, ksqlDB, Flink connectors — provide the higher-level patterns; Kafka itself stays minimal.
Until recently, a Kafka cluster required a separate ZooKeeper ensemble for metadata coordination. Kafka 2.8 introduced KRaft, a Raft-based quorum protocol that moved metadata into the brokers themselves; Kafka 3.3 marked it production-ready, and Kafka 4.0 removed ZooKeeper entirely. The operational simplification is real; new deployments should start on KRaft without looking back.
Kafka itself is a log. The operational value comes from the ecosystem that has grown around it — the connector framework that moves data in and out, the stream-processing library that runs in-process, the schema registry that keeps producers and consumers honest, and the SQL-style query layer that sits over all of it.
Kafka Connect is a framework for moving data between Kafka and external systems — databases, object storage, warehouses, search indexes, SaaS applications — via declarative JSON configuration rather than bespoke code. It runs as its own process (a Connect cluster), hosts source connectors that emit data into Kafka and sink connectors that write data out, and handles offset tracking, at-least-once delivery, and failure recovery generically. Hundreds of community and vendor-supplied connectors exist; the Debezium CDC connectors (section 13 of Chapter 03) are among the most consequential.
Kafka Streams is a Java/Scala library — not a separate cluster — that embeds a stream processor inside an application. You write transformations using a DSL (map, filter, join, windowedBy), and the library handles partition assignment, state store management (RocksDB-backed), and checkpointing using Kafka itself. ksqlDB sits on top and lets you express the same patterns as SQL. Neither competes directly with Flink; both are the right choice when the processing logic belongs with the application that already uses Kafka, rather than in a separate cluster.
An event's payload is bytes; making those bytes self-describing and evolvable is what a schema registry does. Confluent's Schema Registry is the canonical implementation: producers register Avro, Protobuf, or JSON Schema definitions with the registry, publish events carrying a schema ID, and the registry enforces compatibility rules (backward, forward, full) on every evolution. Consumers look up schemas by ID and deserialise safely. A Kafka cluster without a schema registry is a Kafka cluster where producers silently break consumers; the registry is not optional for production platforms.
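The essence of one compatibility rule — backward compatibility, meaning a consumer on the new schema can still read old data — can be sketched as a toy check. This is far simpler than Avro's real schema-resolution rules, and the field representation here is hypothetical:

```python
def backward_compatible(old_fields, new_fields):
    """New schema can read old data iff every field it adds has a default.
    old_fields / new_fields: dicts of name -> {"default": ...} or {}."""
    added = set(new_fields) - set(old_fields)
    return all("default" in new_fields[f] for f in added)

old = {"user_id": {}, "url": {}}
ok  = {"user_id": {}, "url": {}, "referrer": {"default": None}}  # added with default
bad = {"user_id": {}, "url": {}, "referrer": {}}                 # added, no default
assert backward_compatible(old, ok) is True
assert backward_compatible(old, bad) is False
```

The registry's job is to run a check like this (plus the forward and full variants) on every schema registration, and reject the producer's deploy rather than let the consumer discover the break at runtime.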
Apache Kafka is the upstream project; most production deployments run either a managed service (Confluent Cloud, AWS MSK, Aiven, Redpanda Cloud) or an operator-based self-hosted cluster (Strimzi on Kubernetes). Running Kafka on bare VMs without either is possible and increasingly rare; the operational discipline involved — broker rolling, partition balancing, rack-aware replication, retention policy — is enough that most teams adopt the operator or the managed service rather than build it from scratch.
Kafka is the default but not the only choice. Four alternatives show up often enough to matter: Pulsar and Kinesis in the broker slot, Redpanda as a Kafka-protocol-compatible rewrite, and NATS in the lighter-weight messaging niche. Each is a reasonable answer in specific circumstances.
Pulsar, originally built at Yahoo, separates the serving layer (brokers) from the storage layer (Apache BookKeeper), which makes scaling storage independent of scaling throughput. It has first-class multi-tenancy, geo-replication built in rather than bolted on, and native support for both queue-like and log-like consumption. For organisations that need heavy multi-tenant isolation or geo-replication, Pulsar's architecture is a genuinely better fit than Kafka's; for most other cases the larger Kafka ecosystem wins.
Kinesis is Amazon's managed streaming service, with similar partitioned-log semantics to Kafka but a simpler (and narrower) API. Two products worth distinguishing: Kinesis Data Streams, the broker-equivalent, and Kinesis Data Firehose, a no-code pipeline that lands events into S3/Redshift/OpenSearch. On AWS-native stacks Kinesis is often the pragmatic choice; on anything multi-cloud or self-hosted, Kafka is usually preferred because the ecosystem is portable.
Redpanda is a C++ reimplementation of the Kafka protocol that runs as a single binary, without JVM and without ZooKeeper. It uses the same client libraries, the same Schema Registry, and the same Kafka Connect ecosystem; from the application's perspective it is Kafka. What it changes is operations: lower memory, tighter tail latency, simpler deployment. For teams that want Kafka-the-protocol without JVM-the-tax, it is the obvious alternative.
NATS is a different category: a lightweight pub/sub messaging system with a persistent-log add-on (JetStream). Its strength is latency and simplicity — microseconds rather than milliseconds, a binary that runs on a Raspberry Pi. It does not replace Kafka for heavy analytical streams, but for microservice messaging, request/reply, and IoT-scale deployments, it is often a better fit.
All four of these are converging on a log-first model, and the Kafka protocol itself has become a lingua franca: Redpanda speaks it natively and Pulsar offers a compatibility layer, while Kinesis and NATS keep their own APIs over similar semantics. The selection now is more about operations, cost, and cloud fit than about fundamentally different data models.
A broker moves events; a stream processor computes over them. The set of computations stream processors support is smaller than a batch SQL engine's but includes the operations that most real-time applications need: per-event transforms, aggregations over windows, joins between streams, and stateful pattern detection. Understanding the three shapes of computation is how the rest of the chapter becomes legible.
The simplest operations — map, filter, flatMap — are stateless: the output of any single event depends only on that event's contents. Enriching an event with a lookup from a static table, filtering events by type, parsing a raw payload into typed fields: all fall into this category. Stateless operators are easy to scale (each event is independent) and easy to reason about (a failure loses nothing except the in-flight event, which is replayed from the log). The majority of production stream code is stateless; the part that isn't is where the interesting work is.
An aggregation — a running count, a rolling average, a top-K — needs to remember something across events. So does a join between two streams (enrich a clickstream event with the user's current profile) or between a stream and a table. A stream processor with state stores (backed by RocksDB or a distributed key-value store) and a checkpointing mechanism (see section twelve) is what makes these tractable. The state is a first-class concern: it has a size, a memory footprint, a recovery story, and a cost.
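A running count per key is the canonical stateful operator, and its shape is worth seeing concretely. A minimal sketch, with a plain dict standing in for a RocksDB-backed state store:

```python
class RunningCount:
    """Stateful operator: remembers a count per key across events."""
    def __init__(self):
        self.state = {}   # stands in for a RocksDB-backed state store

    def process(self, event):
        key = event["user"]
        self.state[key] = self.state.get(key, 0) + 1
        return key, self.state[key]   # emit the updated aggregate downstream

op = RunningCount()
events = [{"user": "a"}, {"user": "b"}, {"user": "a"}]
out = [op.process(e) for e in events]
assert out == [("a", 1), ("b", 1), ("a", 2)]
assert op.state == {"a": 2, "b": 1}   # this dict is what a checkpoint must snapshot
```

The final assertion is the point: correctness after a crash depends entirely on recovering that dict, which is what section twelve is about.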
Complex event processing (CEP) is the oldest genre of streaming: detect patterns ("three failed logins then a successful one", "price rose more than 5% then reversed") across a stream. Flink's CEP library, hand-rolled state machines in Kafka Streams, and specialised engines (Esper, Drools) live here. It is a narrower use case than aggregation and join but the right tool when the question is truly "has this sequence of events occurred?".
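The "three failed logins then a success" pattern reduces to a small per-key state machine. A hypothetical sketch (Flink CEP would express the same pattern declaratively):

```python
def detect_suspicious(events, fails_needed=3):
    """Flag users whose successful login follows >= fails_needed failures."""
    fail_streak = {}   # per-key state: consecutive failures seen so far
    alerts = []
    for user, outcome in events:
        if outcome == "fail":
            fail_streak[user] = fail_streak.get(user, 0) + 1
        elif outcome == "success":
            if fail_streak.get(user, 0) >= fails_needed:
                alerts.append(user)
            fail_streak[user] = 0   # any success resets the streak
    return alerts

stream = [("eve", "fail"), ("eve", "fail"), ("bob", "success"),
          ("eve", "fail"), ("eve", "success")]
assert detect_suspicious(stream) == ["eve"]
```

Note that the state is per key and the streams interleave: bob's success does not reset eve's streak, which is exactly the keyed-state discipline the previous paragraph described for aggregations.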
Before choosing an engine, decide whether the processing is stateful. Stateless computations run happily on Kafka Streams, AWS Lambda, a plain consumer loop — the engine choice hardly matters. Stateful computations at scale are where the serious differences between Flink, Spark Structured Streaming, and Kafka Streams actually show up. The architectural question is not "which stream engine?" but "what state, with what size, with what recovery guarantees?".
Flink is the engine most of the concepts in this chapter are cleanest in. True event-at-a-time processing, rich state management, first-class event-time semantics, and exactly-once guarantees through distributed snapshots — they are all Flink's defaults, not retrofits. For serious stateful streaming at scale, it is the reference against which everything else is measured.
Flink's execution model is true streaming: each event flows through the operator graph as it arrives, with latency measured in single-digit milliseconds under load. Contrast with Spark Structured Streaming's default micro-batch execution, where events are buffered for a few hundred milliseconds and processed as small batches. For most use cases the difference is a factor of ten in tail latency — significant for real-time features, irrelevant for analytical dashboards. Spark's continuous processing mode and the micro-batch-style optimisations available in Flink SQL blur the line, but the defaults still define the usual choice.
Flink's state story is the reason it handles large aggregations well. The default state backend (RocksDB, optionally with incremental checkpoints) lets a single task manager hold hundreds of gigabytes of state on local disk with memory-speed access to the hot set. Checkpoints are written asynchronously to durable storage (HDFS, S3, GCS) so that a restart rolls back to a consistent snapshot across all operators. The combination of local state for speed and durable checkpoints for recovery is what makes large stateful streaming tractable.
Flink exposes three main APIs: the low-level DataStream API for event-at-a-time Java/Scala programs; the Table API and Flink SQL for declarative stream-and-batch queries (the two share a planner under the unified Table API); and Stateful Functions for an actor-like model on top of the same runtime. Flink SQL has matured to the point that many production deployments use it as the primary interface, with the DataStream API as a fallback for the 10% of logic that needs it. For teams whose streaming logic is naturally expressible in SQL, Flink SQL plus a Schema Registry is a complete stack.
Flink earns its complexity on stateful streams at scale. For a stateless enrichment pipeline, a few-thousand-events-per-second workload, or an application-embedded processor, Kafka Streams or even a plain consumer loop is usually the right call. Flink's operational footprint — JobManager, TaskManagers, checkpoint storage, Kubernetes operator — is its own project to run well. Managed Flink (Confluent's, Ververica's, AWS Managed Service for Apache Flink) flattens some of that; it does not eliminate it.
Spark Structured Streaming, introduced in Spark 2.x as the successor to the older DStream API, approaches streaming differently from Flink: as the limiting case of batch. A continuously-growing table is queried repeatedly in small batches, and the same DataFrame API works on both bounded and unbounded input. It is the right answer for teams whose platform is already Spark-centric and whose latency tolerance is in seconds rather than milliseconds.
The mental model is: imagine an input table that new rows keep being appended to. A streaming query is a query against that table whose result is continuously updated. Write the query once using the DataFrame or SQL API, tell Spark where to read from (Kafka, files, Delta Lake) and where to write to (Delta, Kafka, a warehouse), and Spark runs the query in micro-batches every few hundred milliseconds. The query author, in principle, does not have to think about batches at all; in practice, tuning the micro-batch interval and the trigger is most of the operational work.
The default execution mode processes events in micro-batches — typically 100 ms to a few seconds — which gives high throughput and moderate latency. A continuous processing mode exists for single-digit-millisecond latency but supports only a subset of operations. For the large majority of Structured Streaming deployments, the micro-batch default is what is actually used; for anything truly latency-sensitive, Flink is more often the right tool.
Where Spark Structured Streaming is unambiguously the right choice is inside a lakehouse. Writing micro-batches into Delta Lake (or Iceberg, or Hudi) gives transactional streaming writes to the same tables that batch jobs and dashboards read; the "unified batch-and-stream" promise of the Dataflow Model becomes concrete. The Databricks-centric "medallion architecture" — bronze/silver/gold Delta tables built by mixed batch and streaming jobs — is built on this pattern, and for Spark-centric platforms it is a clean, well-integrated model.
Flink and Spark Structured Streaming are the two serious choices for general-purpose stream processing, and for most teams the decision falls out of the existing platform. Databricks or a heavy Spark footprint → Spark Structured Streaming. Kubernetes-native streaming team with latency-sensitive work → Flink. New greenfield platform with neither as an incumbent → pick based on whether the workload is closer to "continuous ETL into a lakehouse" (Spark) or "low-latency event-at-a-time processing" (Flink).
The single idea that most batch engineers stumble on when they move to streaming is that there are two clocks. When the event happened is one thing; when the system saw it is another; the lag between them is the subject of most production streaming bugs. The vocabulary for handling it — event time, processing time, watermarks — is the semantic core of the field.
Event time is the timestamp at which the event occurred in the real world — the user clicked, the sensor read, the order was placed. Processing time is when the stream processor actually handled the event. In a well-connected system with no backpressure, the two are close; under network problems, batching at the source, or simple geographic distance, they can differ by seconds, minutes, or (for mobile apps that buffer events offline) days. A correct streaming computation almost always wants event time; a misleading one almost always accidentally uses processing time.
If events can arrive out of order, when is a window "complete"? A watermark is the answer: a monotonically advancing heuristic from the stream processor that says "I have seen all events with event-time up to t; any event arriving now with event-time less than t is late". The watermark trails the latest event time by some margin (configurable) to tolerate lateness; it advances as new events arrive. Window computations wait for the watermark to cross the end of the window, then emit the result. Every streaming system that claims event-time correctness implements some version of this pattern.
What if an event arrives after its window has been emitted? Three options. Drop it — simple, lossy. Update the window's result with the late event (requires either an output sink that supports upserts or a downstream consumer that handles retractions). Side-output it to a dead-letter stream for later reconciliation. The choice is architectural, not automatic; Flink's allowedLateness and Beam's trigger/accumulation model make the policy explicit, which is the right design.
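The whole mechanism fits in a few lines. A toy tumbling-window counter (hypothetical, integer event times) whose watermark trails the maximum event time seen by a fixed margin: on-time events accumulate, a window is emitted once the watermark passes its end, and events for already-closed windows go to a side output:

```python
def run(events, window=10, lateness=2):
    """events: (event_time, value) pairs. Returns (emitted windows, late events)."""
    windows, emitted, late = {}, [], []
    watermark = float("-inf")
    for t, v in events:
        watermark = max(watermark, t - lateness)   # trails max event time seen
        start = (t // window) * window
        if start + window <= watermark:
            late.append((t, v))                    # window already closed
            continue
        windows[start] = windows.get(start, 0) + 1
        for s in sorted(list(windows)):            # emit every window whose end
            if s + window <= watermark:            # the watermark has passed
                emitted.append((s, windows.pop(s)))
    return emitted, late

emitted, late = run([(1, "a"), (4, "b"), (13, "c"), (25, "d"), (3, "e")])
assert emitted == [(0, 2), (10, 1)]   # [0,10) closed with 2 events, [10,20) with 1
assert late == [(3, "e")]             # event-time 3 arrived after [0,10) closed
```

The `lateness` margin is the trade-off knob the paragraph above describes: a larger margin tolerates more disorder but delays every result; a smaller one emits sooner and sends more events to the side output.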
Every aggregation should be over event time, not processing time, unless the use case explicitly wants "what the system has seen so far", which is a much narrower question. The Dataflow Model's vocabulary (event time, watermarks, triggers, accumulation) is now the lingua franca of serious streaming; learning it once pays back across Flink, Spark, Beam, and every future engine in the same tradition.
A window is how an unbounded stream gets chopped into bounded pieces for aggregation. The three canonical shapes — tumbling, sliding, and session — cover most use cases; the choice between them is the difference between an aggregation that matches the business question and one that subtly doesn't.
Tumbling windows are fixed-size, non-overlapping, contiguous. "Hourly count of events", "daily active users", "five-minute throughput" — each is a tumbling window. They are the simplest to reason about and the cheapest to compute; each event belongs to exactly one window, and the state per key is bounded by the window duration. Use them whenever the question genuinely is "how many in this bucket?".
Sliding windows overlap: a five-minute window that advances every minute produces a new result every minute, each covering the last five. They are what you want when the question is "the current rolling average" or "how many events in the last hour, updated every minute". The state cost is higher — each event belongs to multiple windows — and the emission rate is higher, which matters when the downstream is a database or an alerting system.
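Window assignment for the two fixed-shape cases is pure arithmetic over event time. A sketch (integer times in seconds; not any engine's API) showing that a tumbling event belongs to one window while a sliding event belongs to several:

```python
def tumbling(t, size):
    """Event time t belongs to exactly one window [start, start + size)."""
    start = (t // size) * size
    return [(start, start + size)]

def sliding(t, size, slide):
    """Event time t belongs to every length-`size` window that covers it."""
    out = []
    start = (t // slide) * slide       # latest window start at or before t
    while start > t - size:
        out.append((start, start + size))
        start -= slide
    return out[::-1]

assert tumbling(7, size=5) == [(5, 10)]
# With size=5, slide=1 each event lands in size/slide = 5 overlapping windows.
assert sliding(7, size=5, slide=1) == [(3, 8), (4, 9), (5, 10), (6, 11), (7, 12)]
```

The `size/slide` factor is the state-cost multiplier mentioned above: a one-hour window sliding every minute writes each event into sixty windows.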
Session windows are event-driven rather than time-driven: a session opens at an event and extends as long as subsequent events arrive within a gap threshold, closing when the gap is exceeded. "User session on a website" is the canonical case — you do not know up front how long any given user's session will be, and a tumbling window would artificially split it. Sessions are more expensive and more complex, but for user-behaviour analytics they are the correct shape.
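Sessions have no fixed grid; assignment is a gap-based grouping of each key's event times. A minimal sketch over an already-sorted list (real engines do this incrementally, merging windows as events arrive, possibly out of order):

```python
def sessionize(times, gap):
    """Group sorted event times into sessions separated by more than `gap`."""
    sessions = []
    for t in sorted(times):
        if sessions and t - sessions[-1][-1] <= gap:
            sessions[-1].append(t)    # within the gap: extend current session
        else:
            sessions.append([t])      # gap exceeded: open a new session
    return sessions

# A user active at t=1..3, silent, then back at t=20..21 (gap threshold 5):
assert sessionize([1, 2, 3, 20, 21], gap=5) == [[1, 2, 3], [20, 21]]
```

The sketch shows why sessions are the expensive shape: a window's boundaries are only known after the fact, so a late event can force two previously separate sessions to merge.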
Beyond the three standard shapes there is a global window (one window covering the entire stream, used with custom triggers) and various specialised constructions (dynamic session gaps, custom merging windows). For most production work, the three above cover the territory; when the question does not fit any of them, the answer is usually a different question, not a bespoke window.
Aggregations need memory. Joins need memory. Pattern detection needs memory. The discipline of handling that memory — how it lives, how big it gets, how it survives a failure — is where most of the engineering weight of a serious streaming system lands.
In a stream processor, state is whatever the operator has to remember between events to produce correct output. Examples: the current count per key for a windowed aggregate; the most recent row per key for a stream-table join; the sequence of recent events for a CEP pattern; the offset position in the input stream. Each stream processor exposes state as first-class (Flink's ValueState / ListState / MapState; Kafka Streams' state stores; Spark's mapGroupsWithState and flatMapGroupsWithState operators), and each provides a local-plus-durable layout: a fast backend (in-memory or RocksDB) for reads and writes, a durable snapshot (S3, HDFS) for recovery.
A checkpoint is a coordinated snapshot of the state of every operator plus the input offsets at a shared barrier, written durably so that recovery can roll back to a globally consistent point. Flink's implementation is based on the Chandy–Lamport distributed-snapshot algorithm (injecting barriers into the data stream that flush state along with them); Spark's is simpler (micro-batch boundaries are natural checkpoint boundaries). Either way, on failure the runtime restores the state, rewinds the input to the recorded offsets, and resumes — a guarantee that no event was lost and (with care) none was processed twice.
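The roll-back-and-replay pattern can be sketched end to end: snapshot the state and the input offset together, crash, restore the snapshot, replay from the recorded offset. Toy code with hypothetical names, standing in for barriers and durable storage:

```python
import copy

def run_with_checkpoints(log, checkpoint_every=2):
    """Count events, checkpointing (state, offset) atomically every N events."""
    state, offset, checkpoint = {}, 0, ({}, 0)
    while offset < len(log):
        key = log[offset]
        state[key] = state.get(key, 0) + 1
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = (copy.deepcopy(state), offset)   # durable snapshot
    return state, checkpoint

log = ["a", "b", "a", "c", "a"]
state, (snap_state, snap_offset) = run_with_checkpoints(log)

# Simulate a crash after the last checkpoint: recovery restores the snapshot
# and replays the log from the recorded offset. Same final state -- nothing
# lost, nothing double-counted.
recovered = copy.deepcopy(snap_state)
for key in log[snap_offset:]:
    recovered[key] = recovered.get(key, 0) + 1
assert recovered == state == {"a": 3, "b": 1, "c": 1}
```

The atomicity of the (state, offset) pair is the whole trick: snapshotting either one without the other would double-count or drop the events between them.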
State is not free. A stateful operator's memory, disk, and checkpoint-storage costs scale with the number of keys and the depth of history kept per key. State that grows unboundedly is the most common cause of stream-job resource blowups; TTL (time-to-live) settings on state, state compaction, and explicit expiry are the operational answers. A useful rule is that every piece of state should have an explicit retention policy — written down, not implicit — so that a reader of the job can predict its memory footprint a year from now.
Checkpoints are automatic and meant for recovery; savepoints are manual and meant for upgrades — a developer triggers a savepoint, changes the job, and restarts from the savepoint without losing state. Every serious streaming deployment has a savepoint discipline; teams without one eventually find themselves unable to deploy a code change without dropping hours of aggregations on the floor.
No streaming conversation is complete without this triad. The three delivery guarantees are defined precisely enough that a production engineer should be able to name which one any given pipeline provides and why. Getting this wrong is the most common source of "the numbers are off by a little and we don't know why" incidents in streaming platforms.
At-most-once delivery means an event is delivered either zero times or one time, never more. It is the easy guarantee: the producer fires and forgets, the consumer commits its offset before processing, and any failure loses events. Use it only where loss is cheaper than duplication — telemetry in aggregate, UI pings where a missing event does not materially affect the dashboard, logging where sampling is already the point.
At-least-once delivery means an event is delivered one or more times. The producer retries on failure, the consumer commits its offset after processing, and duplicates can occur. This is the default for most production stream pipelines, because duplicate handling is usually tractable at the consumer (idempotent writes, deduplication by event ID) and loss is not. If a pipeline is "exactly-once" by reputation but at-least-once by architecture, the deduplication is happening somewhere downstream; know where.
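The two paragraphs above come down to one decision: commit the offset before or after processing. This toy simulation — not any broker's actual client loop — crashes once around the second event and restarts from the last committed offset, showing how the commit ordering alone produces loss in one case and a duplicate in the other:

```python
def consume(log, commit_before_processing):
    """Toy simulation: the consumer crashes once around offset 1,
    then restarts from the last committed offset."""
    committed, out, crashed = 0, [], False
    offset = 0
    while offset < len(log):
        if commit_before_processing:
            committed = offset + 1       # at-most-once: commit first
            if not crashed and offset == 1:
                crashed = True
                offset = committed       # crash before processing: event lost
                continue
        out.append(log[offset])          # "process" the event
        if not commit_before_processing:
            if not crashed and offset == 1:
                crashed = True
                offset = committed       # crash after processing, before commit:
                continue                 # the event will be processed again
            committed = offset + 1       # at-least-once: commit after
        offset += 1
    return out

log = ["e0", "e1", "e2"]
at_most_once  = consume(log, commit_before_processing=True)   # e1 is lost
at_least_once = consume(log, commit_before_processing=False)  # e1 is duplicated
```

Same log, same crash; only the commit ordering differs — which is why the guarantee is a property of the pipeline's architecture, not of the broker.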
Exactly-once is an end-to-end property and a much stronger claim. Within a closed Kafka/Flink or Kafka/Kafka-Streams system, exactly-once is achievable using transactional writes plus idempotent producers plus coordinated checkpoints: the broker records a produce-and-commit as a single atomic unit, and a failure either rolls back both or commits both. Across heterogeneous systems — Kafka to a database, Kafka to an HTTP API — exactly-once is harder and usually provided via idempotent writes at the sink (a transactional outbox, an idempotency key per event) rather than by the broker. Marketing claims of "exactly-once" almost always mean "exactly-once within the bounded set of systems we ship"; reading the fine print is a useful habit.
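The idempotency-key-per-event approach at a sink can be sketched as follows. This toy keeps the seen-set in memory; in a real system it would live in the same transactional store as the data it protects, so the write and the dedup record commit or roll back together:

```python
class IdempotentSink:
    """Toy sink achieving exactly-once *effects* on top of at-least-once
    delivery: apply an event only if its idempotency key is unseen."""

    def __init__(self):
        self.applied = set()   # idempotency keys already committed
        self.balance = 0

    def apply(self, event):
        if event["id"] in self.applied:
            return False                  # duplicate delivery: no-op
        self.balance += event["amount"]
        self.applied.add(event["id"])     # in practice: same transaction as the write
        return True

sink = IdempotentSink()
deliveries = [{"id": "tx-1", "amount": 100},
              {"id": "tx-1", "amount": 100},   # redelivered after a retry
              {"id": "tx-2", "amount": -30}]
for e in deliveries:
    sink.apply(e)
```

Note that the guarantee is about effects, not deliveries: the duplicate still arrives, it just changes nothing — which is all "exactly-once across a system boundary" usually means.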
Most production streams should be at-least-once with deduplication at the consumer. Exactly-once within Kafka is worth turning on when available and the cost is tolerable; exactly-once across system boundaries is worth engineering for when the downstream cannot handle duplicates (payments, billing, critical counters). Ignoring the question and hoping duplicates do not happen is the most common wrong answer.
Streaming is not only an analytical technique. Used at the application level, the same log primitive restructures how microservices communicate, how state is persisted, and how cross-service correctness is achieved. The pattern has a name — event-driven architecture — and its use of Kafka (or equivalents) as a durable integration bus is one of the most consequential architectural shifts of the last decade.
Three related but distinct ideas. An event-driven service reacts to events rather than being called directly — a downstream service subscribes to "order placed" events and reacts, rather than the upstream calling it. Event sourcing goes further: the service persists its state by writing events to a log, and the current state is a fold over the log; the log, not the database, is the system of record. Event-streamed architectures use a shared durable log (Kafka) as the integration substrate between otherwise independent services. Most production systems adopt one or two of the three; all three together is a stronger commitment.
The hardest problem event-driven systems face is the dual-write: a service commits a change to its database and must also publish an event to Kafka, and if either succeeds without the other the system diverges. The outbox pattern is the canonical answer: the service writes the event to an outbox table in the same transaction as the database change, and a separate CDC or polling process forwards outbox rows to Kafka. The event-publish becomes a consequence of the database commit rather than a separate step, and the consistency problem goes away.
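A minimal outbox sketch using SQLite as a stand-in for the service database (table names and the relay function are illustrative, not from any particular framework):

```python
import sqlite3, json

# Toy outbox pattern: the business write and the outbox row commit in ONE
# local transaction, so the event cannot exist without the state change,
# nor the state change without the event.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, total REAL);
    CREATE TABLE outbox (seq INTEGER PRIMARY KEY AUTOINCREMENT,
                         topic TEXT, payload TEXT, published INTEGER DEFAULT 0);
""")

def place_order(order_id, total):
    with db:  # one atomic transaction covering both writes
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))
        db.execute("INSERT INTO outbox (topic, payload) VALUES (?, ?)",
                   ("orders.placed", json.dumps({"id": order_id, "total": total})))

def relay_once(publish):
    # The forwarding process: read unpublished rows, hand them to the broker,
    # mark them published. At-least-once by design; consumers deduplicate.
    rows = db.execute(
        "SELECT seq, topic, payload FROM outbox WHERE published = 0").fetchall()
    for seq, topic, payload in rows:
        publish(topic, payload)
        db.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    db.commit()

sent = []
place_order("o-1", 42.0)
relay_once(lambda topic, payload: sent.append((topic, payload)))
```

A production relay would be Debezium or a similar CDC process tailing the database log rather than a polling loop, but the invariant is the same: the publish is a consequence of the commit.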
On top of an event-driven substrate, two patterns recur. CQRS (Command Query Responsibility Segregation) separates writes (commands that produce events) from reads (queries against derived views), so each can be optimised independently. Sagas are long-running distributed transactions composed of events: a series of local commits with compensating events on failure, replacing the classical two-phase commit that distributed systems cannot actually afford. Both patterns predate Kafka and are more consequential with it.
Event-driven architectures repay a team that has already mastered synchronous microservices; they are painful for a team that hasn't. Adopting Kafka because "microservices" without understanding eventual consistency, idempotency, and log-based integration usually produces a distributed monolith with extra hops. The order is: get the synchronous version right, then move the integrations that genuinely benefit from the log into the log.
Machine learning historically ran on batch-scale training and batch-scale scoring. Streaming adds three things: features that are fresh to the second, inference that runs as events arrive, and — more rarely — models that update themselves continuously. The first two are table stakes for modern ML systems; the third is still an active research area as much as a production technique.
Many of the best features are very recent: a user's actions in the last thirty seconds, a sensor's readings in the last minute, the running mean of a transaction amount over the last hour. Computing these from a daily batch pipeline throws away most of their value. Streaming feature pipelines — a Flink or Spark Streaming job that reads events from Kafka, aggregates per key, and writes to a feature store (Tecton, Feast, Redis, DynamoDB) — produce features that are fresh when the model needs them. The hard part is not the pipeline; it is keeping the training path and the serving path computing the same features from the same definitions, which is what a feature store exists to enforce.
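The "running mean over the last hour" feature can be sketched as a per-key sliding window. This is an in-memory toy of what a Flink job would maintain in keyed state before writing to the feature store, using event time rather than arrival time:

```python
from collections import deque

class SlidingMean:
    """Toy streaming feature: running mean of transaction amounts per user
    over the last window_seconds, keyed by user and evicted by event time."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = {}   # user -> deque of (event_time, amount)

    def update(self, user, amount, event_time):
        q = self.events.setdefault(user, deque())
        q.append((event_time, amount))
        while q and q[0][0] <= event_time - self.window:
            q.popleft()                        # evict amounts outside the window
        return sum(a for _, a in q) / len(q)   # the fresh feature value

feat = SlidingMean(window_seconds=3600)
feat.update("u1", 10.0, event_time=0)
feat.update("u1", 30.0, event_time=1800)
latest = feat.update("u1", 50.0, event_time=4000)   # the t=0 event has aged out
```

Computed nightly in batch, the same feature would be up to a day stale; computed this way it is fresh to the last event — which is the entire value proposition of the streaming feature pipeline.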
Scoring at event arrival is either a direct consumer of the event stream — a Flink or Kafka Streams job that joins events with a feature lookup and emits predictions — or a synchronous service called from a stream processor. The architectural choice depends on whether the prediction is consumed by another stream (stay in the stream, avoid synchronous calls) or by a synchronous caller (HTTP request-response pattern, model behind a gateway). The pattern to avoid is calling a synchronous HTTP service from inside a high-throughput stream without backpressure or isolation; it is the most common way a streaming ML system produces its own incident.
True online learning — a model that updates its weights as events arrive — is niche. It works for bandit-style personalisation, for anomaly detection where concept drift is the norm, and for specialised cases like fraud models with very high event rates. For most ML, the production reality is frequent batch retraining — the model is retrained every hour or every day on recent data and redeployed — which captures most of the benefit of online learning without the operational complexity. If someone proposes online learning, the right first question is "would hourly retraining be enough?".
The training-serving skew problem from Chapter 03 becomes sharper in streaming ML. If the batch training pipeline computes a feature from a window of historical events, and the streaming serving pipeline computes it from a window of live events, the two can silently diverge — watermark differences, event-time versus processing-time mismatches, late events handled differently. A rigorous feature store with a single definition shared between batch backfill and stream serving is the only reliable defence; ad-hoc reimplementation of features in two languages is the canonical production bug in this space.
A streaming system that works in development will eventually fail in production on issues a batch system never had: consumer lag creeping, a slow consumer pushing back on the whole graph, events arriving out of order in a way that breaks an assumption. The operational vocabulary — lag, backpressure, rebalance, skew, poison pills — is worth owning before running a stream at production scale.
Consumer lag is the gap, in events or seconds, between the latest offset published to a partition and the latest offset the consumer has committed. Lag that is zero, or small and stable, is healthy; lag that grows monotonically means the pipeline has fallen behind the producer and will not recover without intervention. Every Kafka observability stack (Burrow, Cruise Control, the broker's JMX metrics, Confluent Control Center) centres on this metric; an alert on "lag above threshold for N minutes" is the single most important alarm a streaming platform carries.
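The metric and the alert rule are both simple enough to state in code. A hypothetical sketch — the function names are illustrative, not Burrow's or any broker's API:

```python
def consumer_lag(latest_offsets, committed_offsets):
    """Per-partition lag: latest published offset minus latest committed.
    A partition with no commit yet counts its full backlog as lag."""
    return {p: latest_offsets[p] - committed_offsets.get(p, 0)
            for p in latest_offsets}

def should_alert(lag_history, threshold, sustained_n):
    """Alert only when total lag stays above threshold for N consecutive
    samples: catches 'falling behind', ignores transient spikes."""
    recent = lag_history[-sustained_n:]
    return len(recent) == sustained_n and all(l > threshold for l in recent)

lag = consumer_lag(latest_offsets={0: 1500, 1: 1490},
                   committed_offsets={0: 1500, 1: 900})
history = [100, 50, 1200, 1500, 1800]   # total-lag samples over time
```

The "for N minutes" qualifier in the alert is doing real work: a single spike during a rebalance is normal, while three sustained samples above threshold almost always means the consumer needs help.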
Backpressure is the mechanism by which a slow downstream operator signals an upstream one to slow down. Flink and Kafka Streams both implement it natively; problems start when backpressure is not respected (a synchronous HTTP call in a Flink map, a database sink without batching). Partition skew — one partition receiving far more data than its siblings because of a bad hash key — is a sibling problem: if one key is hot (one user, one product, one sensor), the partition that key maps to becomes the bottleneck for the whole job. The fix is almost always to change the partitioning key; occasionally a key salting trick is warranted.
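The key-salting trick can be shown in miniature. A toy hash stands in for the broker's real partitioner (Kafka uses murmur2) so the example is deterministic; the `#suffix` scheme is one common convention, not a library feature:

```python
def toy_hash(key):
    # Deterministic toy hash; real brokers use murmur2 or similar.
    return sum(ord(c) for c in key)

def partition_for(key, n_partitions):
    return toy_hash(key) % n_partitions

def salted_key(key, event_seq, n_salts):
    # Append a rotating suffix so one hot key spreads over n_salts sub-keys;
    # consumers must re-aggregate the sub-streams to recover per-key totals.
    return f"{key}#{event_seq % n_salts}"

hot = "user-42"
plain  = {partition_for(hot, 8) for _ in range(100)}                    # one partition, always
spread = {partition_for(salted_key(hot, i, 4), 8) for i in range(100)}  # spread over 4
```

The cost of the trick is visible in the comment: downstream aggregation now has a second step (merge the salted sub-streams), which is why changing the partitioning key is preferred when it is possible.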
When a consumer joins or leaves a group, Kafka rebalances the partitions across the remaining consumers. A rebalance storm — a consumer flapping in and out of the group — stalls the whole group while rebalances complete. Poison pills — events that crash the consumer on parse — halt an entire partition until handled, because the consumer cannot advance past a record it cannot process. Both have well-known operational mitigations (static consumer group membership, dead-letter queues for unparseable messages); both appear, in the same forms, across every stream processor in the field.
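The dead-letter mitigation for poison pills fits in a few lines — a toy consume loop, not any client library's actual API:

```python
import json

def consume_with_dlq(records, handle, dead_letters):
    """Toy poison-pill handling: a record that fails to parse is routed to
    a dead-letter queue instead of halting the partition, so the consumer
    keeps advancing past it."""
    processed = 0
    for record in records:
        try:
            handle(json.loads(record))
            processed += 1
        except (json.JSONDecodeError, KeyError) as exc:
            # Park the bad record with its error for offline inspection.
            dead_letters.append({"record": record, "error": str(exc)})
    return processed

seen, dlq = [], []
records = ['{"id": 1}', 'not json at all', '{"id": 2}']
ok = consume_with_dlq(records, lambda e: seen.append(e["id"]), dlq)
```

Without the except branch, the second record would crash the consumer on every restart and the partition would never advance — which is exactly the halt the paragraph above describes.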
Every streaming platform deserves a single dashboard with five numbers: consumer lag per topic, events-per-second in and out per topic, end-to-end latency from producer timestamp to sink, checkpoint duration, and state size. If those five are visible and alarmed, 90% of streaming incidents get caught before the consumer notices. If any of them is not instrumented, that is where the next incident will come from.
Streaming is moving steadily from "a capability some ML teams need" to "a substrate most production ML touches". The trajectory traces the familiar arc of storage and pipelines: initially a specialist concern, eventually a default assumption. Understanding why helps a team decide when to invest.
Five years ago, a recommendation system trained nightly and scored daily was state of the art. Today a recommendation system whose features are not fresh to the last event is visibly behind. The competitive pressure has pushed more of the feature pipeline, more of the inference path, and more of the training data preparation into streaming — not because streaming is intrinsically better, but because the product demands the latency. For teams whose ML products have a latency dimension (personalisation, ranking, fraud, ops-monitoring), streaming has stopped being optional.
The Kafka log that carries ML features also carries application events, CDC from databases, audit events, and sometimes the model predictions themselves. A team that has built the discipline for one — schema registry, contracts, retention, observability — has mostly built it for the others. The payback from investing in streaming infrastructure is highest when multiple teams share the same event substrate rather than each running its own.
Almost everything this chapter argued about streaming — event-time semantics, exactly-once guarantees, state and checkpoints, watermarks, operational vocabulary — is more intricate than its batch equivalent but is genuinely learnable. Teams that treat streaming as a harder, specialised branch of pipelines (rather than as a different field) tend to produce durable systems; teams that treat it as "a queue we subscribe to" tend to produce incidents. The discipline generalises, and the investment compounds.
Streaming extends the pipeline discipline of Chapter 03 with time semantics, state, and delivery guarantees that batch did not have to think about. The tools — Kafka, Flink, Spark Streaming, Kafka Streams — are less important than the primitives: the log, the watermark, the checkpoint, the exactly-once boundary. For modern ML, an increasing share of the platform sits on top of these primitives. The rest of Part III — distributed compute, cloud platforms, governance — builds further upward; the chapters to come assume that bytes can both sit still in storage and flow forever across a log, and that a correct platform knows what to do with both.
Streaming has a small set of canonical essays and papers that reward careful reading, a substantial set of official documentation from the load-bearing projects, and a growing literature on operational practice. The list below picks the references working streaming engineers actually return to, starting with the foundational ideas and ending with the vendor documentation and ops-focused writing.
This page is the fourth chapter of Part III: Data Engineering & Systems. The next — Distributed Computing — steps back from the streaming/batch distinction to the substrate underneath both: Spark and the shuffle model, MapReduce's surviving ideas, the partitioned parallel execution model that every large-data processor is built on. After that come the cloud platforms everything runs on and the governance layer that keeps it all auditable.