Python won not because it is the fastest language — it is not — but because, once you sit inside the scientific Python stack, nearly every useful idea in statistics, machine learning, and data visualization is one import away. This chapter teaches the Python that data people actually write: the idioms that make a script short, the NumPy that makes it fast, and the pandas that makes it shaped like your data.
Read the sections in order the first time through; each one depends on earlier ones. The prose explains why each idiom is preferred; the code blocks show the shape in pixels. Paste them into a notebook as you go — the fastest way to remember .loc vs .iloc is to have typed both wrong at least once.
Conventions: Python code appears in monospace; shell commands start with $. We assume Python 3.11+, NumPy 2.x, and pandas 2.x. Where behaviour has changed recently — mostly around PyArrow-backed dtypes and copy-on-write — the text flags it.
Python is slow, dynamically typed, and full of sharp edges. It is also the default language of data science, machine learning, and most scientific computing since about 2012. Understanding why is the first step to using it well.
The language itself is ordinary. Guido van Rossum designed Python in 1989 as a readable, batteries-included successor to ABC; it has always emphasised simple syntax, strong readability, and a small set of orthogonal building blocks. None of that alone would have put it on every data scientist's laptop.
What did the work was the scientific Python stack: a deliberate, decade-long project to build numerical primitives that feel like Python but run like C. NumPy (2006) gave Python a fast multidimensional array. matplotlib (2003) gave it plotting. pandas (2008, Wes McKinney at AQR) gave it a DataFrame. scikit-learn (2010) gave it a unified ML API. Jupyter (née IPython Notebook, 2011) gave it a literate, interactive interface. By the time TensorFlow and PyTorch arrived in the mid-2010s, a generation of researchers already thought in Python, and the new frameworks exposed Python APIs because that is where their users were.
Python is not one language — it is a glue language wrapped around fast C, C++, and Fortran libraries. You write readable Python that dispatches to vectorized routines running at native speed. Get that mental model right and the rest of this chapter makes sense.
Everything that follows assumes this division of labour. The Python code in your notebook should stay short, clear, and descriptive; the heavy lifting — arithmetic over millions of elements, group-by aggregations, joins, linear algebra — happens inside compiled kernels you call with one line. When a NumPy or pandas operation feels too slow, the answer is almost never "rewrite it in Cython." It is "find the vectorized call you missed."
Before idioms, two pieces of Python's semantics decide how every data script you will ever write behaves: everything is an object, and names are just labels bound to those objects.
Python is dynamically typed: variables do not have types, values do. You never declare int x; you assign x = 3 and Python remembers that the object 3 happens to be an int. But it is also strongly typed: "3" + 3 raises a TypeError rather than silently coercing. This combination — flexible binding, strict operations — is why Python code reads clearly and still fails loudly when you mix apples and oranges.
A Python variable is a name bound to an object. Assignment does not copy; it rebinds. This matters most with mutable objects — lists, dicts, sets, DataFrames, NumPy arrays:
a = [1, 2, 3]
b = a # b refers to the same list
b.append(4)
print(a) # [1, 2, 3, 4] — same list, two names
If you want an independent copy, ask for one: b = a.copy(), b = list(a), or, for nested structures, copy.deepcopy(a). The single most common pandas bug in a Kaggle notebook is mutating a slice of a DataFrame without realizing that the slice aliases the original.
Every object has a boolean interpretation. Empty containers ([], {}, "", None, 0, 0.0) are falsy; everything else is truthy. Write if items: instead of if len(items) > 0:. NumPy arrays and pandas Series break this rule — their truthiness is ambiguous by design, and you must use .any() or .all() explicitly.
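A minimal sketch of the distinction (the list and array here are illustrative):

```python
import numpy as np

items = []
assert not items             # empty list is falsy — `if items:` reads cleanly

a = np.array([0, 1, 2])
try:
    bool(a)                  # ambiguous: some elements are truthy, some are not
except ValueError:
    pass                     # NumPy refuses to guess
print(a.any(), a.all())      # True False — be explicit instead
```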
Python's defaults are chosen to be boring in the right way. Division is float division (7 / 2 == 3.5); integer division is //. Indexing is zero-based. Slice endpoints are exclusive (a[0:3] yields three elements). Strings are unicode. None of these choices are deep — they are just consistent, and internalizing them removes an entire category of off-by-one bugs.
When a Python script surprises you, the cause is almost always one of three things: a name pointing to an object you did not expect, a mutable default argument (def f(x=[]):), or integer vs float division. Check those three before anything else.
Most of the daylight between a Python beginner and a fluent Pythonista is spent on these three ideas. Once the iteration protocol clicks, a large class of loops collapses into single, readable expressions.
Python has only one kind of loop that matters — the for loop over an iterable. There is no for (i = 0; i < n; i++). If you need the index, use enumerate:
for i, word in enumerate(words):
    print(i, word)
If you need to walk several iterables in parallel, use zip. If the iterables have different lengths and that matters, use itertools.zip_longest. Writing manual index arithmetic — for i in range(len(xs)): x = xs[i] — is a red flag.
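A short illustration of the two behaviours, with made-up lists of unequal length:

```python
from itertools import zip_longest

names = ["ann", "bob", "cy"]
ages = [34, 29]

paired = list(zip(names, ages))          # stops at the shorter input
padded = list(zip_longest(names, ages))  # pads the shorter input with None
print(paired)   # [('ann', 34), ('bob', 29)]
print(padded)   # [('ann', 34), ('bob', 29), ('cy', None)]
```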
A list comprehension is a loop that returns a list. A dict or set comprehension is the same, in braces. Comprehensions read top-to-bottom as "the thing I want, for each item, under this condition":
squares = [x*x for x in range(10)]
lookup = {word: i for i, word in enumerate(words)}
uniq_words = {word.lower() for word in corpus}
Nest carefully — two levels is the practical maximum before a comprehension becomes a short story. If you are tempted to write three levels, switch to a for-loop or factor out a helper.
A generator produces values on demand. Replacing the square brackets of a list comprehension with parentheses gives you a generator expression: (x*x for x in big_iterable). A generator function uses yield instead of return, and the function's body runs one chunk at a time:
def chunks(seq, n):
    buf = []
    for item in seq:
        buf.append(item)
        if len(buf) == n:
            yield buf
            buf = []
    if buf:
        yield buf
Generators are how Python handles streams: log files that do not fit in memory, tokenizers that produce a billion tokens, HTTP responses paged a thousand records at a time. They compose — itertools.chain, itertools.islice, and friends all operate on generators, producing generators, all without materializing the intermediate sequences.
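A tiny composition sketch: two lazily chained ranges, sliced without materializing either.

```python
from itertools import chain, islice

stream = chain(range(3), range(10, 13))   # lazily concatenates two iterables
first_four = list(islice(stream, 4))      # materializes only what is asked for
print(first_four)   # [0, 1, 2, 10]
```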
Lists hold data; generators produce it. When you are not sure which you need, write the generator — a list is one list(...) call away, and the generator version will not blow up when the data gets bigger.
In Python, functions are ordinary objects: you can pass them as arguments, return them from other functions, and stuff them into data structures. That one fact underpins most of the library APIs you will meet.
A function signature can declare positional-or-keyword arguments, positional-only arguments (before a bare /), keyword-only arguments (after a bare *), and variadic *args / **kwargs:
def score(model, X, y, /, *, metric="accuracy", weights=None, **extra):
...
The one default-argument rule you must internalize: do not use mutable defaults. def f(x=[]) shares the same list across every call, which will eventually bite. Use None as a sentinel and materialize inside the function.
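The bug and its fix, side by side (function names are illustrative):

```python
def bad(x, bucket=[]):        # one shared list, created once at def time
    bucket.append(x)
    return bucket

def good(x, bucket=None):     # sentinel: materialize a fresh list per call
    if bucket is None:
        bucket = []
    bucket.append(x)
    return bucket

print(bad(1), bad(2))     # [1, 2] [1, 2] — both calls returned the same list
print(good(1), good(2))   # [1] [2]
```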
A lambda is a single-expression anonymous function, useful exactly where a short function is more readable inline than named. Use it for sort keys and pandas column expressions; avoid it where def would be clearer:
rows.sort(key=lambda r: (r["dept"], -r["salary"]))
df["log_price"] = df["price"].apply(lambda p: math.log(p + 1))
Functions defined inside other functions can reference the enclosing scope. The inner function, plus the variables it has "closed over", is a closure. Closures are how Python-style callbacks, config factories, and parameter sweeps work:
def learning_rate(initial, decay):
    def schedule(step):
        return initial / (1 + decay * step)
    return schedule

lr = learning_rate(initial=1e-3, decay=0.01)
lr(0), lr(100)  # 0.001, 0.0005
A decorator is a function that takes a function and returns a modified one. The @dec syntax is sugar for f = dec(f). You will see decorators all over the data stack — @functools.lru_cache for memoization, @staticmethod / @classmethod for class scoping, @pytest.fixture for test setup, @jit in Numba, @app.route in Flask.
from functools import lru_cache
@lru_cache(maxsize=None)
def fib(n):
    return n if n < 2 else fib(n-1) + fib(n-2)
Write your own only when the wrapping behaviour (timing, logging, retries, validation) is genuinely cross-cutting. Otherwise a plain helper function is clearer.
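For shape, here is a sketch of one of those cross-cutting cases, a timing decorator (the `timed` name and the wrapper attribute are mine, not a library API):

```python
import functools
import time

def timed(fn):
    """Record the duration of the most recent call on the wrapper itself."""
    @functools.wraps(fn)                       # preserve __name__, __doc__
    def wrapper(*args, **kwargs):
        t0 = time.perf_counter()
        result = fn(*args, **kwargs)
        wrapper.last_seconds = time.perf_counter() - t0
        return result
    return wrapper

@timed
def slow_sum(n):
    return sum(range(n))

slow_sum(100_000)
print(f"last call took {slow_sum.last_seconds:.6f}s")
```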
Data code uses classes less than web code does. But when you do reach for one, modern Python gives you better tools than the class statement your intro course showed you.
Most "classes" in a data pipeline are glorified records — a bundle of fields with no interesting behaviour. @dataclass turns that into one declaration:
from dataclasses import dataclass
@dataclass
class Experiment:
    name: str
    seed: int = 0
    lr: float = 1e-3
    batch_size: int = 64
You get __init__, __repr__, equality, and — with frozen=True — immutability, all for free. Use dataclasses for configs, hyperparameters, feature specs, model cards, and any "plain old data" you would otherwise pass as a dict.
Adjacent options, each with a niche. typing.NamedTuple is a lightweight immutable record that is also a tuple — convenient when you want positional access for speed. typing.TypedDict is a dict whose keys and value types are declared, useful when you are handed JSON and want structure without classes. pydantic.BaseModel layers runtime validation on top of type hints, and is the standard for API schemas and config files.
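A quick sketch of the first two (the `Point` and `RawRow` names are illustrative):

```python
from typing import NamedTuple, TypedDict

class Point(NamedTuple):       # immutable record that is also a tuple
    x: float
    y: float

class RawRow(TypedDict):       # shape declaration for dict-like JSON
    id: str
    score: float

p = Point(1.0, 2.0)
print(p.x, p[0])               # attribute and positional access both work

row: RawRow = {"id": "a1", "score": 0.93}   # still a plain dict at runtime
```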
Write a class when you have state plus behaviour: an object that accumulates results over multiple method calls, or one whose methods share enough state that threading them through function arguments becomes ugly. Training loops, streaming statistics, simulators, and stateful estimators all qualify. Pure transformations — preprocess, tokenize, score — rarely do. Functions are fine.
Scikit-learn's fit / transform / predict convention is a rare example of an ML interface where classes earn their keep: the fitted state genuinely lives on the object. Copy the pattern when you have fitted state; resist it when you do not.
Python's standard library is unusually well-stocked. Four modules in particular pay for themselves in a data workflow, long before you reach for NumPy.
collections
Counter counts things (Counter(words).most_common(10)). defaultdict removes the key in d check from grouping code: by_label = defaultdict(list); by_label[y].append(x). deque is a double-ended queue with O(1) appends and pops at both ends — the right structure for rolling windows and BFS. OrderedDict is now historical (plain dicts preserve insertion order since 3.7), but you still see it in older codebases.
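The three workhorses in one place, on a toy word list:

```python
from collections import Counter, defaultdict, deque

words = "the cat sat on the mat the end".split()

print(Counter(words).most_common(1))   # [('the', 3)]

by_len = defaultdict(list)             # no `if key in d` dance needed
for w in words:
    by_len[len(w)].append(w)

window = deque(maxlen=3)               # rolling window: old items fall off the left
for w in words:
    window.append(w)
print(list(window))                    # the last three words
```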
itertools
The lazy combinator library. chain concatenates iterables. islice slices without materializing. groupby groups consecutive equal items. product, permutations, and combinations do what they say. accumulate gives cumulative sums (or any other fold) over a stream:
from itertools import accumulate, islice
first_1000_cumsum = list(islice(accumulate(stream), 1000))
pathlib
os.path.join is a historical artefact. Use pathlib.Path for everything filesystem:
from pathlib import Path
data_dir = Path("data")
csvs = list(data_dir.glob("**/*.csv"))
for p in csvs:
    df = pd.read_csv(p)
    out = data_dir / "parquet" / p.with_suffix(".parquet").name
    out.parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(out)
datetime and zoneinfo
Always use timezone-aware datetimes in production code; naïve datetimes are the source of roughly half of all daylight-saving bugs. datetime.now(tz=ZoneInfo("UTC")) is your friend. For pure calendar logic — "end of next quarter" — pandas offsets tend to be more ergonomic than raw datetime, which we will see in the time-series section.
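A minimal aware-datetime sketch (the New York zone is an arbitrary example; this assumes the system has timezone data available):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

now_utc = datetime.now(tz=ZoneInfo("UTC"))           # aware, not naïve
local = now_utc.astimezone(ZoneInfo("America/New_York"))

# Same instant in time, different wall-clock representation:
print(now_utc.isoformat())
print(local.isoformat())
```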
Honourable mentions: logging (use it instead of print in anything that will run unattended), json, csv, re, functools, statistics, and — for parallelism — concurrent.futures.
NumPy is a single idea executed well: a contiguous block of memory holding elements of one type, plus metadata that tells you how to interpret that block as an $n$-dimensional array.
An ndarray has four essential attributes. data is the raw bytes. dtype is the type of each element (float64, int32, bool, …). shape is the tuple of dimension lengths. strides tells NumPy how many bytes to step to advance along each axis. Everything — indexing, slicing, transposition, broadcasting — is just shape and stride manipulation on the same underlying buffer.
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32)
a.shape, a.dtype, a.strides # (2, 3), float32, (12, 4)
The creation routines fall into four groups. From existing data: np.array, np.asarray, np.fromiter. From shape alone: np.zeros, np.ones, np.empty, np.full. From sequences: np.arange, np.linspace, np.geomspace. From randomness: rng = np.random.default_rng(seed); rng.normal(size=(1000, 10)) — always use a seeded Generator, never the legacy np.random.* module functions.
In a shape like (batch, channels, height, width), each dimension is an axis. Every reduction, every aggregation, every matrix multiplication in NumPy is parameterized by axis. a.sum(axis=0) collapses the first axis; a.mean(axis=-1) collapses the last. Learning to think in axes is 80% of fluent NumPy.
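A concrete sketch on a small 2×3 array:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]
print(a.sum(axis=0))    # [3 5 7]   — collapse the rows, one value per column
print(a.mean(axis=-1))  # [1. 4.]   — collapse the last axis, one value per row
print(a.sum())          # 15        — no axis: collapse everything
```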
NumPy is fast because a single Python call dispatches to a C loop that visits a contiguous buffer. You lose that as soon as you step back into Python-land — via for loops over array elements, or Python functions applied element-wise. The whole game is staying in arrays.
Vectorization is the habit of expressing a computation as whole-array arithmetic. Broadcasting is the rule NumPy uses to make arrays of different shapes cooperate. Together they are why a ten-line Python script can process a gigabyte of data in a second.
The slow way to compute pairwise squared differences between two length-$n$ vectors is a double Python loop. The fast way is
diffs = (x[:, None] - y[None, :]) ** 2 # shape (n, n)
Two array operations, both implemented in C. For vectors of a few thousand elements each, the vectorized version is typically two to four orders of magnitude faster than the double Python loop.
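To see that the broadcast really is the double loop, here is both versions on small random vectors (size 50 is arbitrary, chosen so the loop stays cheap):

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=50), rng.normal(size=50)

loop = np.array([[(xi - yj) ** 2 for yj in y] for xi in x])  # slow: Python loops
vec = (x[:, None] - y[None, :]) ** 2                          # fast: one broadcast
assert vec.shape == (50, 50)
assert np.allclose(loop, vec)
```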
When NumPy combines two arrays of different shapes, it aligns their shapes from the right and, wherever a dimension is 1 or missing, "stretches" it to match. The classic example:
X = rng.normal(size=(1000, 5)) # 1000 rows, 5 features
mu = X.mean(axis=0) # shape (5,)
Xc = X - mu # shape (1000, 5) — mu broadcasts along axis 0
The stretching is virtual — no memory is actually duplicated. mu's stride along the missing axis is zero, which tells NumPy to reuse the same bytes for every row.
mu is a length-5 vector; NumPy virtually tiles it along the 1000 rows of X so the shapes match. No copy is made.
Broadcasting only stretches axes of length 1. To make an $n$-vector broadcast down the columns of a matrix, you must first reshape it into $(n, 1)$. The shorthand is v[:, None]. Keep this habit — [:, None] and [None, :] — and half of NumPy's apparent mystery disappears.
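The habit in miniature (the shapes here are arbitrary):

```python
import numpy as np

v = np.arange(3.0)             # shape (3,)
M = np.ones((3, 4))

# (3,) against (3, 4) does not align from the right — reshape first:
rows_scaled = M * v[:, None]   # (3, 1) broadcasts across the columns
print(rows_scaled.shape)       # (3, 4)
print(rows_scaled[2])          # [2. 2. 2. 2.] — row i scaled by v[i]
```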
NumPy has three flavours of indexing, and they compose. Mastering them is what separates "I can write NumPy" from "I can read NumPy."
Comma-separated slices select rectangular subarrays. The result is a view — it aliases the original memory:
A = np.arange(24).reshape(4, 6)
A[1:3, 2:5] # rows 1–2, cols 2–4 — a (2, 3) view
A[::-1] # all rows reversed
A[:, ::2] # every other column
Passing an array of integers picks out arbitrary rows or columns. Unlike slicing, fancy indexing returns a copy:
rows = np.array([0, 2, 3])
A[rows] # the 0th, 2nd, 3rd rows as a new (3, 6) array
A[rows, [1, 4, 0]] # picks A[0,1], A[2,4], A[3,0] — pairwise, not outer
A boolean array of the same shape selects all the True positions. This is the workhorse of filtering:
X = rng.normal(size=(1000, 3))
outliers = np.abs(X).max(axis=1) > 3
X_clean = X[~outliers]
Boolean masks compose with &, |, ~ (always parenthesize: (x > 0) & (x < 10)). Do not use and / or — those try to evaluate the whole array's truthiness and raise.
Basic slicing gives views; fancy and boolean indexing give copies. The difference matters when you assign:
A[0:2] = 0 # modifies A in place (view)
A[A < 0] = 0 # also modifies A — assignment through a boolean mask
B = A[[0, 2]] # B is a copy
B[:] = 0 # does NOT modify A
If you are unsure whether an operation gives a view or a copy, call .base on the result: if it is None, you have a fresh array; otherwise the .base attribute points to the array you are viewing into.
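The .base check in action (the .copy() makes A own its memory, so the view's .base points at A itself rather than at an upstream array):

```python
import numpy as np

A = np.arange(12).reshape(3, 4).copy()

view = A[1:, :2]          # basic slicing: a view into A's buffer
assert view.base is A

fancy = A[[0, 2]]         # fancy indexing: a fresh array owning its data
assert fancy.base is None

view[:] = 0               # writes through to A
assert A[1, 0] == 0
```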
Once your data is in an ndarray, NumPy offers three layers of numerics: element-wise universal functions, whole-array reductions, and proper linear algebra.
A ufunc is a function that operates element-wise on arrays — np.exp, np.log, np.sqrt, np.sin, np.abs, np.maximum. They broadcast, they accept an out= buffer to avoid allocation, and they are parallelized in the underlying C implementation. Prefer them over math module functions when you are working with arrays — math.log does not vectorize.
sum, mean, std, var, min, max, argmin, argmax, any, all — all accept an axis argument and a keepdims flag. The keepdims=True trick is essential for per-row normalization:
X_std = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
Cumulative variants — cumsum, cumprod, cummax — return arrays of the same shape, running the aggregation along the chosen axis.
The @ operator is matrix multiplication. np.linalg hosts everything you learned in the previous chapter — solve, inv, det, eig, svd, qr, cholesky, norm, matrix_rank, lstsq. Prefer solve(A, b) over inv(A) @ b; the former is faster and numerically better-behaved.
theta, *_ = np.linalg.lstsq(X, y, rcond=None) # ordinary least squares
U, s, Vt = np.linalg.svd(X, full_matrices=False)
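A quick check of the solve-over-inverse advice (the 4×4 system and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
b = rng.normal(size=4)

x = np.linalg.solve(A, b)    # one factorization, no explicit inverse
assert np.allclose(A @ x, b)
```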
Since NumPy 1.17, the right interface is np.random.default_rng(seed), which returns a Generator with dedicated methods: normal, uniform, integers, choice, permutation. It is faster, has better statistical properties, and — crucially — supports independent per-worker streams via spawn(), which matters for reproducible parallel code.
If NumPy is arrays with types, pandas is arrays with labels. A pandas Series is a 1D labelled array; a DataFrame is a 2D table of columns, each of which is a Series.
import pandas as pd
s = pd.Series([10, 20, 30], index=["a", "b", "c"], name="revenue")
s["b"] # 20
s[["a", "c"]]
Arithmetic between Series aligns on the index, not on position — s1 + s2 matches labels and fills missing alignments with NaN. This is the single most important thing that distinguishes pandas from a NumPy-with-column-names.
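Alignment in two lines (the labels are illustrative):

```python
import pandas as pd

s1 = pd.Series([1, 2], index=["a", "b"])
s2 = pd.Series([10, 20], index=["b", "c"])

total = s1 + s2    # matches labels, not positions
print(total)       # a: NaN, b: 12.0, c: NaN
```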
A DataFrame has a row index, a column index, and columns that may hold different dtypes. Internally it is a dict of Series. The first five minutes of almost any data exploration look the same:
df.head() # first 5 rows
df.info() # columns, dtypes, non-null counts
df.describe() # numerical summary
df.shape, df.columns, df.dtypes
The index is a real, first-class axis. Setting a useful index — a customer id, a timestamp, a (country, year) pair — enables fast label-based lookups and automatic alignment in joins. Resetting it back (df.reset_index()) is equally easy. Do not leave the default integer index on production tables; most of pandas' elegance depends on indexes that mean something.
df["col"] returns a Series. df.iloc[0] also returns a Series, but a row is a weaker abstraction — its dtype is the common parent of the cell dtypes, which is often object. This is why iterating rows with iterrows() is slow and lossy, and why the advice "never loop over a DataFrame" is nearly always right.
Think in columns, not rows. Vectorized column operations are fast and type-preserving; row iteration is slow and drops your types. If you catch yourself writing for _, row in df.iterrows():, there is usually a df.apply, groupby, or column expression waiting to replace it.
Data I/O is unglamorous and almost always where pipelines break. Pandas supports more formats than you will ever use; three of them matter.
pd.read_csv is deceptively rich. Its arguments are the difference between a one-line fix and a three-hour debugging session:
df = pd.read_csv(
    "sales.csv",
    parse_dates=["date"],
    dtype={"customer_id": "string", "amount": "float64"},
    na_values=["", "NA", "NULL", "-"],
    thousands=",",
    encoding="utf-8",
)
Inspect the dtypes after loading. If amount came back as object, something in the file is not a number and pandas is warning you by refusing to coerce.
Parquet is a columnar, compressed, typed format. It is 5–50× smaller than the equivalent CSV, 10–100× faster to read, and preserves dtypes — timestamps stay timestamps, categoricals stay categoricals, nullable ints stay nullable ints. Use Parquet for anything that will be read more than once:
df.to_parquet("sales.parquet")
df = pd.read_parquet("sales.parquet", columns=["date", "amount"]) # column pruning
pd.read_sql with a SQLAlchemy connection is the simplest bridge:
from sqlalchemy import create_engine
eng = create_engine("postgresql+psycopg://user:pw@host/db")
df = pd.read_sql("select * from orders where date >= '2025-01-01'", eng)
For anything larger than a million rows, push work into the database: filter, aggregate, and join in SQL, then pull the result. The network is usually slower than the database.
read_json handles per-line JSON (lines=True) and deeply-nested documents (less well — pd.json_normalize exists for that). read_excel works and is unavoidable when stakeholders hand you .xlsx files. HDF5 is legacy in most data-science contexts; Parquet has eaten its lunch.
Four accessors do almost all the work: [], .loc, .iloc, and .query. Learn when each is correct and 90% of your pandas code writes itself.
df["col"] picks a column. df[["a", "b"]] picks several. df[mask] filters rows with a boolean Series.
df.loc[rows, cols] is label-based: df.loc[df["age"] > 30, ["name", "age"]]. It is the preferred accessor for almost every situation where you are filtering and selecting columns at once.
df.iloc[i, j] is position-based, 0-indexed, and ignores the index entirely. Use it when you know you want the 0th row, not "the row labelled 0."
df.query("age > 30 and dept == 'sales'") is a string-based filter that reads like SQL. Useful for long predicates and for code that is read more often than written.
Pandas is at its best when you write top-to-bottom pipelines rather than building up intermediate variables:
out = (
    df
    .query("country == 'US' and year >= 2020")
    .assign(margin=lambda d: d["revenue"] - d["cost"])
    .groupby("segment", as_index=False)
    .agg(margin=("margin", "mean"), n=("margin", "size"))
    .sort_values("margin", ascending=False)
)
The chain reads as a story: filter, compute, group, aggregate, sort. The assign(column=lambda d: ...) form lets you add a column in-place without breaking the chain.
Pandas' oldest papercut. When you assign into a slice of a DataFrame, pandas can sometimes detect that you are writing to a copy, not the original — but not always. The 2.0 fix is Copy-on-Write, enabled with pd.set_option("mode.copy_on_write", True). It makes mutation safe and slice semantics predictable. Turn it on. In pandas 3.0 it will be the default.
If you have a DataFrame df and you want to both filter rows and assign to a column, write it as two steps through .loc: first build the mask (mask = ...), then assign (df.loc[mask, "col"] = value). Chained access like df[mask]["col"] = value assigns into a temporary copy and never reaches df; under CoW, recent pandas versions emit a ChainedAssignmentError warning when you try.
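The safe pattern on a toy frame (column name and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"qty": [1, -2, 3]})
mask = df["qty"] < 0

df.loc[mask, "qty"] = 0      # select and assign in one .loc call
print(df["qty"].tolist())    # [1, 0, 3]
# df[mask]["qty"] = 0        # chained: would assign into a temporary instead
```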
The split-apply-combine pattern is the centrepiece of pandas. Almost every summary statistic, every pivot, every "per customer" metric goes through groupby.
Three steps. Split the data into groups by one or more keys. Apply a function to each group. Combine the results back into a DataFrame or Series. groupby handles the splitting and combining; you specify the applying.
The modern syntax is .agg(new_name=("source_col", "func")), which is readable and gives you control over output column names:
summary = (
    df.groupby("dept")
    .agg(n=("id", "size"),
         revenue=("amount", "sum"),
         avg_ticket=("amount", "mean"),
         first_sale=("date", "min"))
    .sort_values("revenue", ascending=False)
)
.agg reduces each group to a single value. .transform returns a result the same shape as the input — one value per original row — which is how you compute within-group z-scores or demeaned series:
df["demeaned"] = df["x"] - df.groupby("group")["x"].transform("mean")
.apply is the escape hatch: it accepts any function, but in exchange it is slower and its return shape is inferred at runtime. Reach for it only when agg and transform cannot express what you need.
Real analysis rarely happens on a single table. You join; you widen; you pivot; you melt. Pandas has one function for each, and a sharp mental model of what shape you are in.
merge and join
pd.merge is pandas' SQL join. how is "inner", "left", "right", or "outer". on names the key column or columns. The most valuable argument is validate: specify the expected multiplicity ("1:1", "1:m", "m:1", "m:m") and pandas will raise if your assumption is wrong:
joined = pd.merge(
    orders, customers,
    how="left", on="customer_id",
    validate="m:1",
    indicator=True,   # adds a _merge column telling you each row's origin
)
df.join is a convenience for joining on the index — useful when you have already set informative indexes on both sides.
Data is long (tidy) when each observation is a row and each variable is a column; it is wide when one of those variables has been spread across columns. Most statistical tools prefer long; most human eyes prefer wide. Pandas has four reshaping tools that move between them:
pivot_table(index, columns, values, aggfunc) — long to wide, with aggregation on duplicates.
pivot(index, columns, values) — the same, but errors on duplicates instead of aggregating.
melt(id_vars, value_vars) — wide to long, the inverse of pivot.
stack, unstack — pivot via the index, for MultiIndex DataFrames.
wide = df.pivot_table(index="date", columns="product", values="sales", aggfunc="sum")
long = wide.reset_index().melt(id_vars="date", var_name="product", value_name="sales")
pd.concat stacks DataFrames along an axis. Vertically (axis=0, default) is the usual case — appending new rows. Horizontally (axis=1) glues columns side-by-side, aligning on the index. Both respect dtype consistency; a column of int64 concatenated with a column of float64 will widen to float64.
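The dtype widening in two lines (toy frames):

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]})    # int64
b = pd.DataFrame({"x": [3.5]})     # float64

stacked = pd.concat([a, b], ignore_index=True)
print(stacked["x"].dtype)          # float64 — widened, not truncated
print(stacked["x"].tolist())       # [1.0, 2.0, 3.5]
```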
The time-series machinery in pandas is uncommonly good — a DatetimeIndex unlocks resampling, rolling windows, shifting, and timezone-aware arithmetic that would take weeks to build by hand.
Parse your date columns at load time (parse_dates=). Set them as the index. From that point on, pandas treats dates as first-class:
df = df.set_index("timestamp").sort_index()
df.loc["2025-03"] # all rows in March 2025
df.loc["2025-03-15":"2025-03-22"] # an inclusive date range
df.loc[df.index.dayofweek < 5] # weekdays only
Resampling is the time-domain equivalent of groupby. df.resample("1D") buckets rows by day; the result is a grouper you can aggregate:
daily = df.resample("1D").agg(volume=("qty", "sum"), price=("price", "mean"))
monthly = daily.resample("1ME").sum()
The frequency string is a small DSL: "1h" hourly, "15min" fifteen-minute, "1W" weekly (ending Sunday), "1ME" month-end, "1QS" quarter-start, "1YE" year-end. Offsets compose: "2h30min". Older code uses the uppercase aliases "1H" and "1Y", which pandas 2.2 deprecated.
df["ma"] = df["price"].rolling(7).mean() computes a seven-period rolling mean. rolling accepts windows by count or by time (rolling("30D"), only on a DatetimeIndex), and by standard aggregations or custom functions. expanding is the cumulative version — the window grows from the start.
Feature engineering for supervised time series is almost entirely shift:
df["y_t-1"] = df["y"].shift(1)
df["y_t-7"] = df["y"].shift(7)
df["return"] = df["price"].pct_change()
Combine shift with groupby when you have panel data — one group per entity — so lags never cross entity boundaries: df.groupby("id")["y"].shift(1).
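A minimal panel-data lag, showing that the lag resets at each entity boundary:

```python
import pandas as pd

df = pd.DataFrame({"id": ["a", "a", "b", "b"],
                   "y":  [1, 2, 3, 4]})

df["y_lag"] = df.groupby("id")["y"].shift(1)   # never crosses id boundaries
print(df["y_lag"].tolist())                    # [nan, 1.0, nan, 3.0]
```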
If a dataset spans timezones, localize at the edge. Convert incoming timestamps to UTC on ingest (pd.to_datetime(values, utc=True) for strings; tz_localize then tz_convert("UTC") for naïve datetimes) and convert back to a user-facing timezone only at presentation. Keeping the middle tier in UTC eliminates a whole class of daylight-saving bugs.
Ninety percent of pandas performance tuning is three things: picking the right dtype, avoiding Python loops, and not loading more data than you need.
A column of three million integers stored as object (boxed Python ints) uses roughly an order of magnitude more memory and runs far slower than the same column as int32. After loading data, always check df.dtypes and downcast where safe:
df["user_id"] = df["user_id"].astype("int32")
df["category"] = df["category"].astype("category") # dict-encoded strings
df["flag"] = df["flag"].astype("boolean") # nullable bool
Categoricals are the single biggest win for datasets with low-cardinality string columns — countries, segments, product ids. They cut memory dramatically and speed up groupbys and joins.
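A quick memory comparison on a synthetic low-cardinality column (the exact ratio depends on string length and pandas version, so it is printed rather than claimed):

```python
import pandas as pd

s = pd.Series(["US", "DE", "FR"] * 100_000)   # low-cardinality strings
s_cat = s.astype("category")                  # integer codes + a tiny dictionary

obj_bytes = s.memory_usage(deep=True)
cat_bytes = s_cat.memory_usage(deep=True)
print(f"object: {obj_bytes:,} B  category: {cat_bytes:,} B")
assert cat_bytes < obj_bytes
```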
Since pandas 2.0, setting pd.options.future.infer_string = True makes string columns use Arrow-backed storage, which is both faster and properly handles missing values. Expect this to become the default; adopt it now on new projects.
For files too large for RAM, pd.read_csv supports chunksize=, which yields an iterator of DataFrames; for Parquet, pyarrow.parquet.ParquetFile.iter_batches reads row groups incrementally. Process a chunk, aggregate, discard, repeat:
total = 0
for chunk in pd.read_csv("huge.csv", chunksize=1_000_000):
    total += chunk["amount"].sum()
Pandas is excellent up to roughly 10–50 GB on a single machine. Past that, three serious alternatives: Polars (Rust, multi-threaded, Arrow-native — the fastest single-machine option), DuckDB (embedded analytical SQL engine, reads Parquet in place), and Dask (pandas-on-many-machines, best when your logic is already pandas). For real streaming, Apache Arrow plus ADBC is the direction everything is heading.
Before you reach for a bigger tool, profile. %timeit in Jupyter, cProfile for scripts, memory_profiler for allocations. A pandas script that feels slow is, nine times out of ten, one bad apply away from being ten times faster.
Every machine-learning project in Python is built on top of this chapter. Here is the map from the ideas above to the pipelines you will meet everywhere else in the compendium.
Feature engineering: groupby + transform computes within-group statistics (user-level averages, per-session counts); shift creates lags; merge joins your labels to your features.
PyTorch Datasets wrap a pandas DataFrame or a NumPy array. The __getitem__ method returns tensors built from df.iloc[i], converted via .to_numpy() and torch.from_numpy (the latter is a zero-copy view).
Text pipelines hand the model df["text"].tolist() and get back a dict of tensors or lists.
Evaluation: sklearn.metrics operates on NumPy arrays. pandas.crosstab builds confusion matrices; pivot_table builds error heatmaps by cohort.
Experiment tracking: pd.read_parquet("runs/") glob-reads every run into one table; groupby compares configs.
Serving: everything upstream exists to feed one model.predict(X) call.
Reproducibility: numpy.random.default_rng(seed) + torch.manual_seed(seed) + setting PYTHONHASHSEED before anything imports is the minimum bar. The seeds belong on a dataclass with the rest of your config.
Almost every ML paper's "data pipeline" section, if you strip the jargon, is the contents of this chapter. The rest of the compendium assumes you can move fluently between a NumPy array, a pandas DataFrame, and a torch Tensor — because every later chapter will.
The Python data stack is best learned by steeping. Keep a book open while you work, then go back to it once a month. Here is a short list — heavy on the two or three texts every practitioner eventually owns.
Reread the standard-library docs — itertools, collections, functools, dataclasses, pathlib, and datetime — once a year. Almost every "I wrote a helper for this" turns out to be a one-liner built on one of these modules.
This page is the first chapter of Part II: Programming & Software Engineering. Up next: software-engineering fundamentals, the modern Python tooling stack, writing production-grade ML code, data engineering pipelines, and deployment. The rest of Part II assumes you can read and write the code in this chapter without thinking.