Python won not because it is the fastest language — it is not — but because, once you sit inside the scientific Python stack, nearly every useful idea in statistics, machine learning, and data visualization is one import away. This chapter teaches the Python that data people actually write: the idioms that make a script short, the NumPy that makes it fast, and the pandas that makes it shaped like your data.
Read the sections in order the first time through; each one depends on earlier ones. The prose explains why each idiom is preferred; the code blocks show the shape in pixels. Paste them into a notebook as you go — the fastest way to remember .loc vs .iloc is to have typed both wrong at least once.
Conventions: Python code appears in monospace; shell commands start with $. We assume Python 3.11+, NumPy 2.x, and pandas 2.x. Where behaviour has changed recently — mostly around PyArrow-backed dtypes and copy-on-write — the text flags it.
Python is slow, dynamically typed, and full of sharp edges. It is also the default language of data science, machine learning, and most scientific computing since about 2012. Understanding why is the first step to using it well.
The language itself is ordinary. Guido van Rossum designed Python in 1989 as a readable, batteries-included successor to ABC; it has always emphasised simple syntax, strong readability, and a small set of orthogonal building blocks. None of that alone would have put it on every data scientist's laptop.
What did the work was the scientific Python stack: a deliberate, decade-long project to build numerical primitives that feel like Python but run like C. NumPy (2006) gave Python a fast multidimensional array. matplotlib (2003) gave it plotting. pandas (2008, Wes McKinney at AQR) gave it a DataFrame. scikit-learn (2010) gave it a unified ML API. Jupyter (née IPython Notebook, 2011) gave it a literate, interactive interface. By the time TensorFlow and PyTorch arrived in the mid-2010s, a generation of researchers already thought in Python, and the new frameworks exposed Python APIs because that is where their users were.
Python is not one language — it is a glue language wrapped around fast C, C++, and Fortran libraries. You write readable Python that dispatches to vectorized routines running at native speed. Get that mental model right and the rest of this chapter makes sense.
Everything that follows assumes this division of labour. The Python code in your notebook should stay short, clear, and descriptive; the heavy lifting — arithmetic over millions of elements, group-by aggregations, joins, linear algebra — happens inside compiled kernels you call with one line. When a NumPy or pandas operation feels too slow, the answer is almost never "rewrite it in Cython." It is "find the vectorized call you missed."
Before idioms, two pieces of Python's semantics decide how every data script you will ever write behaves: everything is an object, and names are just labels bound to those objects.
Python is dynamically typed: variables do not have types, values do. You never declare int x; you assign x = 3 and Python remembers that the object 3 happens to be an int. But it is also strongly typed: "3" + 3 raises a TypeError rather than silently coercing. This combination — flexible binding, strict operations — is why Python code reads clearly and still fails loudly when you mix apples and oranges.
A Python variable is a name bound to an object. Assignment does not copy; it rebinds. This matters most with mutable objects — lists, dicts, sets, DataFrames, NumPy arrays:
a = [1, 2, 3]
b = a # b refers to the same list
b.append(4)
print(a) # [1, 2, 3, 4] — same list, two names
If you want an independent copy, ask for one: b = a.copy(), b = list(a), or, for nested structures, copy.deepcopy(a). The single most common pandas bug in a Kaggle notebook is mutating a slice of a DataFrame without realizing that the slice aliases the original.
Every object has a boolean interpretation. Empty containers ([], {}, "", None, 0, 0.0) are falsy; everything else is truthy. Write if items: instead of if len(items) > 0:. NumPy arrays and pandas Series break this rule — their truthiness is ambiguous by design, and you must use .any() or .all() explicitly.
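A minimal sketch of the distinction (the list and array here are illustrative):

```python
import numpy as np

items = []
assert not items             # empty list is falsy — `if items:` reads cleanly

a = np.array([0, 1, 2])
try:
    bool(a)                  # ambiguous: some elements are truthy, some are not
except ValueError:
    pass                     # NumPy refuses to guess
print(a.any(), a.all())      # True False — be explicit instead
```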
Python's defaults are chosen to be boring in the right way. Division is float division (7 / 2 == 3.5); integer division is //. Indexing is zero-based. Slice endpoints are exclusive (a[0:3] yields three elements). Strings are unicode. None of these choices are deep — they are just consistent, and internalizing them removes an entire category of off-by-one bugs.
When a Python script surprises you, the cause is almost always one of three things: a name pointing to an object you did not expect, a mutable default argument (def f(x=[]):), or integer vs float division. Check those three before anything else.
Most of the daylight between a Python beginner and a fluent Pythonista is spent on these three ideas. Once the iteration protocol clicks, a large class of loops collapses into single, readable expressions.
Python has only one kind of loop that matters — the for loop over an iterable. There is no for (i = 0; i < n; i++). If you need the index, use enumerate:
for i, word in enumerate(words):
    print(i, word)
If you need to walk several iterables in parallel, use zip. If the iterables have different lengths and that matters, use itertools.zip_longest. Writing manual index arithmetic — for i in range(len(xs)): x = xs[i] — is a red flag.
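A short illustration of the two behaviours, with made-up lists of unequal length:

```python
from itertools import zip_longest

names = ["ann", "bob", "cy"]
ages = [34, 29]

paired = list(zip(names, ages))          # stops at the shorter input
padded = list(zip_longest(names, ages))  # pads the shorter input with None
print(paired)   # [('ann', 34), ('bob', 29)]
print(padded)   # [('ann', 34), ('bob', 29), ('cy', None)]
```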
A list comprehension is a loop that returns a list. A dict or set comprehension is the same, in braces. Comprehensions read top-to-bottom as "the thing I want, for each item, under this condition":
squares = [x*x for x in range(10)]
lookup = {word: i for i, word in enumerate(words)}
uniq_words = {word.lower() for word in corpus}
Nest carefully — two levels is the practical maximum before a comprehension becomes a short story. If you are tempted to write three levels, switch to a for-loop or factor out a helper.
A generator produces values on demand. Replacing the square brackets of a list comprehension with parentheses gives you a generator expression: (x*x for x in big_iterable). A generator function uses yield instead of return, and the function's body runs one chunk at a time:
def chunks(seq, n):
    buf = []
    for item in seq:
        buf.append(item)
        if len(buf) == n:
            yield buf
            buf = []
    if buf:
        yield buf
Generators are how Python handles streams: log files that do not fit in memory, tokenizers that produce a billion tokens, HTTP responses paged a thousand records at a time. They compose — itertools.chain, itertools.islice, and friends all operate on generators, producing generators, all without materializing the intermediate sequences.
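A tiny composition sketch: two lazily chained ranges, sliced without materializing either.

```python
from itertools import chain, islice

stream = chain(range(3), range(10, 13))   # lazily concatenates two iterables
first_four = list(islice(stream, 4))      # materializes only what is asked for
print(first_four)   # [0, 1, 2, 10]
```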
Lists hold data; generators produce it. When you are not sure which you need, write the generator — a list is one list(...) call away, and the generator version will not blow up when the data gets bigger.
In Python, functions are ordinary objects: you can pass them as arguments, return them from other functions, and stuff them into data structures. That one fact underpins most of the library APIs you will meet.
A function signature can declare positional-or-keyword arguments, positional-only arguments (before a bare /), keyword-only arguments (after a bare *), and variadic *args / **kwargs:
def score(model, X, y, /, *, metric="accuracy", weights=None, **extra):
...
The one default-argument rule you must internalize: do not use mutable defaults. def f(x=[]) shares the same list across every call, which will eventually bite. Use None as a sentinel and materialize inside the function.
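The bug and its fix, side by side (function names are illustrative):

```python
def bad(x, bucket=[]):        # one shared list, created once at def time
    bucket.append(x)
    return bucket

def good(x, bucket=None):     # sentinel: materialize a fresh list per call
    if bucket is None:
        bucket = []
    bucket.append(x)
    return bucket

print(bad(1), bad(2))     # [1, 2] [1, 2] — both calls returned the same list
print(good(1), good(2))   # [1] [2]
```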
A lambda is a single-expression anonymous function, useful exactly where a short function is more readable inline than named. Use it for sort keys and pandas column expressions; avoid it where def would be clearer:
rows.sort(key=lambda r: (r["dept"], -r["salary"]))
df["log_price"] = df["price"].apply(lambda p: math.log(p + 1))
Functions defined inside other functions can reference the enclosing scope. The inner function, plus the variables it has "closed over", is a closure. Closures are how Python-style callbacks, config factories, and parameter sweeps work:
def learning_rate(initial, decay):
    def schedule(step):
        return initial / (1 + decay * step)
    return schedule

lr = learning_rate(initial=1e-3, decay=0.01)
lr(0), lr(100)  # 0.001, 0.0005
A decorator is a function that takes a function and returns a modified one. The @dec syntax is sugar for f = dec(f). You will see decorators all over the data stack — @functools.lru_cache for memoization, @staticmethod / @classmethod for class scoping, @pytest.fixture for test setup, @jit in Numba, @app.route in Flask.
from functools import lru_cache
@lru_cache(maxsize=None)
def fib(n):
    return n if n < 2 else fib(n-1) + fib(n-2)
Write your own only when the wrapping behaviour (timing, logging, retries, validation) is genuinely cross-cutting. Otherwise a plain helper function is clearer.
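For shape, here is a sketch of one of those cross-cutting cases, a timing decorator (the `timed` name and the wrapper attribute are mine, not a library API):

```python
import functools
import time

def timed(fn):
    """Record the duration of the most recent call on the wrapper itself."""
    @functools.wraps(fn)                       # preserve __name__, __doc__
    def wrapper(*args, **kwargs):
        t0 = time.perf_counter()
        result = fn(*args, **kwargs)
        wrapper.last_seconds = time.perf_counter() - t0
        return result
    return wrapper

@timed
def slow_sum(n):
    return sum(range(n))

slow_sum(100_000)
print(f"last call took {slow_sum.last_seconds:.6f}s")
```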
Data code uses classes less than web code does. But when you do reach for one, modern Python gives you better tools than the class statement your intro course showed you.
Most "classes" in a data pipeline are glorified records — a bundle of fields with no interesting behaviour. @dataclass turns that into one declaration:
from dataclasses import dataclass
@dataclass
class Experiment:
    name: str
    seed: int = 0
    lr: float = 1e-3
    batch_size: int = 64
You get __init__, __repr__, equality, and — with frozen=True — immutability, all for free. Use dataclasses for configs, hyperparameters, feature specs, model cards, and any "plain old data" you would otherwise pass as a dict.
Adjacent options, each with a niche. typing.NamedTuple is a lightweight immutable record that is also a tuple — convenient when you want positional access for speed. typing.TypedDict is a dict whose keys and value types are declared, useful when you are handed JSON and want structure without classes. pydantic.BaseModel layers runtime validation on top of type hints, and is the standard for API schemas and config files.
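A quick sketch of the first two (the `Point` and `RawRow` names are illustrative):

```python
from typing import NamedTuple, TypedDict

class Point(NamedTuple):       # immutable record that is also a tuple
    x: float
    y: float

class RawRow(TypedDict):       # shape declaration for dict-like JSON
    id: str
    score: float

p = Point(1.0, 2.0)
print(p.x, p[0])               # attribute and positional access both work

row: RawRow = {"id": "a1", "score": 0.93}   # still a plain dict at runtime
```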
Write a class when you have state plus behaviour: an object that accumulates results over multiple method calls, or one whose methods share enough state that threading them through function arguments becomes ugly. Training loops, streaming statistics, simulators, and stateful estimators all qualify. Pure transformations — preprocess, tokenize, score — rarely do. Functions are fine.
Scikit-learn's fit / transform / predict convention is a rare example of an ML interface where classes earn their keep: the fitted state genuinely lives on the object. Copy the pattern when you have fitted state; resist it when you do not.
Python's standard library is unusually well-stocked. Four modules in particular pay for themselves in a data workflow, long before you reach for NumPy.
collections
Counter counts things (Counter(words).most_common(10)). defaultdict removes the key in d check from grouping code: by_label = defaultdict(list); by_label[y].append(x). deque is a double-ended queue with O(1) appends and pops at both ends — the right structure for rolling windows and BFS. OrderedDict is now historical (plain dicts preserve insertion order since 3.7), but you still see it in older codebases.
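The three workhorses in one place, on a toy word list:

```python
from collections import Counter, defaultdict, deque

words = "the cat sat on the mat the end".split()

print(Counter(words).most_common(1))   # [('the', 3)]

by_len = defaultdict(list)             # no `if key in d` dance needed
for w in words:
    by_len[len(w)].append(w)

window = deque(maxlen=3)               # rolling window: old items fall off the left
for w in words:
    window.append(w)
print(list(window))                    # the last three words
```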
itertools
The lazy combinator library. chain concatenates iterables. islice slices without materializing. groupby groups consecutive equal items. product, permutations, and combinations do what they say. accumulate gives cumulative sums (or any other fold) over a stream:
from itertools import accumulate, islice
first_1000_cumsum = list(islice(accumulate(stream), 1000))
pathlib
os.path.join is a historical artefact. Use pathlib.Path for everything filesystem:
from pathlib import Path
data_dir = Path("data")
csvs = list(data_dir.glob("**/*.csv"))
for p in csvs:
    df = pd.read_csv(p)
    out = data_dir / "parquet" / p.with_suffix(".parquet").name
    out.parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(out)
datetime and zoneinfo
Always use timezone-aware datetimes in production code; naïve datetimes are the source of roughly half of all daylight-saving bugs. datetime.now(tz=ZoneInfo("UTC")) is your friend. For pure calendar logic — "end of next quarter" — pandas offsets tend to be more ergonomic than raw datetime, which we will see in the time-series section.
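A minimal aware-datetime sketch (the New York zone is an arbitrary example; this assumes the system has timezone data available):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

now_utc = datetime.now(tz=ZoneInfo("UTC"))           # aware, not naïve
local = now_utc.astimezone(ZoneInfo("America/New_York"))

# Same instant in time, different wall-clock representation:
print(now_utc.isoformat())
print(local.isoformat())
```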
Honourable mentions: logging (use it instead of print in anything that will run unattended), json, csv, re, functools, statistics, and — for parallelism — concurrent.futures.
NumPy is a single idea executed well: a contiguous block of memory holding elements of one type, plus metadata that tells you how to interpret that block as an $n$-dimensional array.
An ndarray has four essential attributes. data is the raw bytes. dtype is the type of each element (float64, int32, bool, …). shape is the tuple of dimension lengths. strides tells NumPy how many bytes to step to advance along each axis. Everything — indexing, slicing, transposition, broadcasting — is just shape and stride manipulation on the same underlying buffer.
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32)
a.shape, a.dtype, a.strides # (2, 3), float32, (12, 4)
The creation routines fall into four groups. From existing data: np.array, np.asarray, np.fromiter. From shape alone: np.zeros, np.ones, np.empty, np.full. From sequences: np.arange, np.linspace, np.geomspace. From randomness: rng = np.random.default_rng(seed); rng.normal(size=(1000, 10)) — always use a seeded Generator, never the legacy np.random.* module functions.
In a shape like (batch, channels, height, width), each dimension is an axis. Every reduction, every aggregation, every matrix multiplication in NumPy is parameterized by axis. a.sum(axis=0) collapses the first axis; a.mean(axis=-1) collapses the last. Learning to think in axes is 80% of fluent NumPy.
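A concrete sketch on a small 2×3 array:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]
print(a.sum(axis=0))    # [3 5 7]   — collapse the rows, one value per column
print(a.mean(axis=-1))  # [1. 4.]   — collapse the last axis, one value per row
print(a.sum())          # 15        — no axis: collapse everything
```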
NumPy is fast because a single Python call dispatches to a C loop that visits a contiguous buffer. You lose that as soon as you step back into Python-land — via for loops over array elements, or Python functions applied element-wise. The whole game is staying in arrays.
Vectorization is the habit of expressing a computation as whole-array arithmetic. Broadcasting is the rule NumPy uses to make arrays of different shapes cooperate. Together they are why a ten-line Python script can process a gigabyte of data in a second.
The slow way to compute pairwise squared differences between two length-$n$ vectors is a double Python loop. The fast way is
diffs = (x[:, None] - y[None, :]) ** 2 # shape (n, n)
Two array operations, both implemented in C. For vectors of a few thousand elements each, the vectorized version is typically two to four orders of magnitude faster than the double Python loop.
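To see that the broadcast really is the double loop, here is both versions on small random vectors (size 50 is arbitrary, chosen so the loop stays cheap):

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=50), rng.normal(size=50)

loop = np.array([[(xi - yj) ** 2 for yj in y] for xi in x])  # slow: Python loops
vec = (x[:, None] - y[None, :]) ** 2                          # fast: one broadcast
assert vec.shape == (50, 50)
assert np.allclose(loop, vec)
```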
When NumPy combines two arrays of different shapes, it aligns their shapes from the right and, wherever a dimension is 1 or missing, "stretches" it to match. The classic example:
X = rng.normal(size=(1000, 5)) # 1000 rows, 5 features
mu = X.mean(axis=0) # shape (5,)
Xc = X - mu # shape (1000, 5) — mu broadcasts along axis 0
The stretching is virtual — no memory is actually duplicated. mu's stride along the missing axis is zero, which tells NumPy to reuse the same bytes for every row.
mu is a length-5 vector; NumPy virtually tiles it along the 1000 rows of X so the shapes match. No copy is made.
Broadcasting only stretches axes of length 1. To make an $n$-vector broadcast down the columns of a matrix, you must first reshape it into $(n, 1)$. The shorthand is v[:, None]. Keep this habit — [:, None] and [None, :] — and half of NumPy's apparent mystery disappears.
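The habit in miniature (the shapes here are arbitrary):

```python
import numpy as np

v = np.arange(3.0)             # shape (3,)
M = np.ones((3, 4))

# (3,) against (3, 4) does not align from the right — reshape first:
rows_scaled = M * v[:, None]   # (3, 1) broadcasts across the columns
print(rows_scaled.shape)       # (3, 4)
print(rows_scaled[2])          # [2. 2. 2. 2.] — row i scaled by v[i]
```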
NumPy has three flavours of indexing, and they compose. Mastering them is what separates "I can write NumPy" from "I can read NumPy."
Comma-separated slices select rectangular subarrays. The result is a view — it aliases the original memory:
A = np.arange(24).reshape(4, 6)
A[1:3, 2:5] # rows 1–2, cols 2–4 — a (2, 3) view
A[::-1] # all rows reversed
A[:, ::2] # every other column
Passing an array of integers picks out arbitrary rows or columns. Unlike slicing, fancy indexing returns a copy:
rows = np.array([0, 2, 3])
A[rows] # the 0th, 2nd, 3rd rows as a new (3, 6) array
A[rows, [1, 4, 0]] # picks A[0,1], A[2,4], A[3,0] — pairwise, not outer
A boolean array of the same shape selects all the True positions. This is the workhorse of filtering:
X = rng.normal(size=(1000, 3))
outliers = np.abs(X).max(axis=1) > 3
X_clean = X[~outliers]
Boolean masks compose with &, |, ~ (always parenthesize: (x > 0) & (x < 10)). Do not use and / or — those try to evaluate the whole array's truthiness and raise.
Basic slicing gives views; fancy and boolean indexing give copies. The difference matters when you assign:
A[0:2] = 0 # modifies A in place (view)
A[A < 0] = 0 # also modifies A — assignment through a boolean mask
B = A[[0, 2]] # B is a copy
B[:] = 0 # does NOT modify A
If you are unsure whether an operation gives a view or a copy, call .base on the result: if it is None, you have a fresh array; otherwise the .base attribute points to the array you are viewing into.
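The .base check in action (the .copy() makes A own its memory, so the view's .base points at A itself rather than at an upstream array):

```python
import numpy as np

A = np.arange(12).reshape(3, 4).copy()

view = A[1:, :2]          # basic slicing: a view into A's buffer
assert view.base is A

fancy = A[[0, 2]]         # fancy indexing: a fresh array owning its data
assert fancy.base is None

view[:] = 0               # writes through to A
assert A[1, 0] == 0
```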
Once your data is in an ndarray, NumPy offers three layers of numerics: element-wise universal functions, whole-array reductions, and proper linear algebra.
A ufunc is a function that operates element-wise on arrays — np.exp, np.log, np.sqrt, np.sin, np.abs, np.maximum. They broadcast, they accept an out= buffer to avoid allocation, and they are parallelized in the underlying C implementation. Prefer them over math module functions when you are working with arrays — math.log does not vectorize.
sum, mean, std, var, min, max, argmin, argmax, any, all — all accept an axis argument and a keepdims flag. The keepdims=True trick is essential for per-row normalization:
X_std = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
Cumulative variants — cumsum, cumprod, cummax — return arrays of the same shape, running the aggregation along the chosen axis.
The @ operator is matrix multiplication. np.linalg hosts everything you learned in the previous chapter — solve, inv, det, eig, svd, qr, cholesky, norm, matrix_rank, lstsq. Prefer solve(A, b) over inv(A) @ b; the former is faster and numerically better-behaved.
theta, *_ = np.linalg.lstsq(X, y, rcond=None) # ordinary least squares
U, s, Vt = np.linalg.svd(X, full_matrices=False)
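A quick check of the solve-over-inverse advice (the 4×4 system and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
b = rng.normal(size=4)

x = np.linalg.solve(A, b)    # one factorization, no explicit inverse
assert np.allclose(A @ x, b)
```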
Since NumPy 1.17, the right interface is np.random.default_rng(seed), which returns a Generator with dedicated methods: normal, uniform, integers, choice, permutation. It is faster, has better statistical properties, and — crucially — supports independent per-worker streams via spawn(), which matters for reproducible parallel code.
If NumPy is arrays with types, pandas is arrays with labels. A pandas Series is a 1D labelled array; a DataFrame is a 2D table of columns, each of which is a Series.
import pandas as pd
s = pd.Series([10, 20, 30], index=["a", "b", "c"], name="revenue")
s["b"] # 20
s[["a", "c"]]
Arithmetic between Series aligns on the index, not on position — s1 + s2 matches labels and fills missing alignments with NaN. This is the single most important thing that distinguishes pandas from a NumPy-with-column-names.
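Alignment in two lines (the labels are illustrative):

```python
import pandas as pd

s1 = pd.Series([1, 2], index=["a", "b"])
s2 = pd.Series([10, 20], index=["b", "c"])

total = s1 + s2    # matches labels, not positions
print(total)       # a: NaN, b: 12.0, c: NaN
```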
A DataFrame has a row index, a column index, and columns that may hold different dtypes. Internally it is a dict of Series. The first five minutes of almost any data exploration look the same:
df.head() # first 5 rows
df.info() # columns, dtypes, non-null counts
df.describe() # numerical summary
df.shape, df.columns, df.dtypes
The index is a real, first-class axis. Setting a useful index — a customer id, a timestamp, a (country, year) pair — enables fast label-based lookups and automatic alignment in joins. Resetting it back (df.reset_index()) is equally easy. Do not leave the default integer index on production tables; most of pandas' elegance depends on indexes that mean something.
df["col"] returns a Series. df.iloc[0] also returns a Series, but a row is a weaker abstraction — its dtype is the common parent of the cell dtypes, which is often object. This is why iterating rows with iterrows() is slow and lossy, and why the advice "never loop over a DataFrame" is nearly always right.
Think in columns, not rows. Vectorized column operations are fast and type-preserving; row iteration is slow and drops your types. If you catch yourself writing for _, row in df.iterrows():, there is usually a df.apply, groupby, or column expression waiting to replace it.
Data I/O is unglamorous and almost always where pipelines break. Pandas supports more formats than you will ever use; three of them matter.
pd.read_csv is deceptively rich. Its arguments are the difference between a one-line fix and a three-hour debugging session:
df = pd.read_csv(
    "sales.csv",
    parse_dates=["date"],
    dtype={"customer_id": "string", "amount": "float64"},
    na_values=["", "NA", "NULL", "-"],
    thousands=",",
    encoding="utf-8",
)
Inspect the dtypes after loading. If amount came back as object, something in the file is not a number and pandas is warning you by refusing to coerce.
Parquet is a columnar, compressed, typed format. It is 5–50× smaller than the equivalent CSV, 10–100× faster to read, and preserves dtypes — timestamps stay timestamps, categoricals stay categoricals, nullable ints stay nullable ints. Use Parquet for anything that will be read more than once:
df.to_parquet("sales.parquet")
df = pd.read_parquet("sales.parquet", columns=["date", "amount"]) # column pruning
pd.read_sql with a SQLAlchemy connection is the simplest bridge:
from sqlalchemy import create_engine
eng = create_engine("postgresql+psycopg://user:pw@host/db")
df = pd.read_sql("select * from orders where date >= '2025-01-01'", eng)
For anything larger than a million rows, push work into the database: filter, aggregate, and join in SQL, then pull the result. The network is usually slower than the database.
read_json handles per-line JSON (lines=True) and deeply-nested documents (less well — pd.json_normalize exists for that). read_excel works and is unavoidable when stakeholders hand you .xlsx files. HDF5 is legacy in most data-science contexts; Parquet has eaten its lunch.
Four accessors do almost all the work: [], .loc, .iloc, and .query. Learn when each is correct and 90% of your pandas code writes itself.
df["col"] picks a column. df[["a", "b"]] picks several. df[mask] filters rows with a boolean Series.
df.loc[rows, cols] is label-based: df.loc[df["age"] > 30, ["name", "age"]]. It is the preferred accessor for almost every situation where you are filtering and selecting columns at once.
df.iloc[i, j] is position-based, 0-indexed, and ignores the index entirely. Use it when you know you want the 0th row, not "the row labelled 0."
df.query("age > 30 and dept == 'sales'") is a string-based filter that reads like SQL. Useful for long predicates and for code that is read more often than written.
Pandas is at its best when you write top-to-bottom pipelines rather than building up intermediate variables:
out = (
    df
    .query("country == 'US' and year >= 2020")
    .assign(margin=lambda d: d["revenue"] - d["cost"])
    .groupby("segment", as_index=False)
    .agg(margin=("margin", "mean"), n=("margin", "size"))
    .sort_values("margin", ascending=False)
)
The chain reads as a story: filter, compute, group, aggregate, sort. The assign(column=lambda d: ...) form lets you add a column in-place without breaking the chain.
Pandas' oldest papercut. When you assign into a slice of a DataFrame, pandas can sometimes detect that you are writing to a copy, not the original — but not always. The 2.0 fix is Copy-on-Write, enabled with pd.set_option("mode.copy_on_write", True). It makes mutation safe and slice semantics predictable. Turn it on. In pandas 3.0 it will be the default.
If you have a DataFrame df and you want to both filter rows and assign to a column, write it as two steps through .loc: first build the mask (mask = ...), then assign (df.loc[mask, "col"] = value). Chained access like df[mask]["col"] = value assigns into a temporary copy and never reaches df; under CoW, recent pandas versions emit a ChainedAssignmentError warning when you try.
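The safe pattern on a toy frame (column name and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"qty": [1, -2, 3]})
mask = df["qty"] < 0

df.loc[mask, "qty"] = 0      # select and assign in one .loc call
print(df["qty"].tolist())    # [1, 0, 3]
# df[mask]["qty"] = 0        # chained: would assign into a temporary instead
```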
The split-apply-combine pattern is the centrepiece of pandas. Almost every summary statistic, every pivot, every "per customer" metric goes through groupby.
Three steps. Split the data into groups by one or more keys. Apply a function to each group. Combine the results back into a DataFrame or Series. groupby handles the splitting and combining; you specify the applying.
The modern syntax is .agg(new_name=("source_col", "func")), which is readable and gives you control over output column names:
summary = (
    df.groupby("dept")
    .agg(n=("id", "size"),
         revenue=("amount", "sum"),
         avg_ticket=("amount", "mean"),
         first_sale=("date", "min"))
    .sort_values("revenue", ascending=False)
)
.agg reduces each group to a single value. .transform returns a result the same shape as the input — one value per original row — which is how you compute within-group z-scores or demeaned series:
df["demeaned"] = df["x"] - df.groupby("group")["x"].transform("mean")
.apply is the escape hatch: it accepts any function, but in exchange it is slower and its return shape is inferred at runtime. Reach for it only when agg and transform cannot express what you need.
Real analysis rarely happens on a single table. You join; you widen; you pivot; you melt. Pandas has one function for each, and a sharp mental model of what shape you are in.
merge and join
pd.merge is pandas' SQL join. how is "inner", "left", "right", or "outer". on names the key column or columns. The most valuable argument is validate: specify the expected multiplicity ("1:1", "1:m", "m:1", "m:m") and pandas will raise if your assumption is wrong:
joined = pd.merge(
    orders, customers,
    how="left", on="customer_id",
    validate="m:1",
    indicator=True,   # adds a _merge column telling you each row's origin
)
df.join is a convenience for joining on the index — useful when you have already set informative indexes on both sides.
Data is long (tidy) when each observation is a row and each variable is a column; it is wide when one of those variables has been spread across columns. Most statistical tools prefer long; most human eyes prefer wide. Pandas has four reshaping tools that move between them:
pivot_table(index, columns, values, aggfunc) — long to wide, with aggregation on duplicates.
pivot(index, columns, values) — the same, but errors on duplicates instead of aggregating.
melt(id_vars, value_vars) — wide to long, the inverse of pivot.
stack, unstack — pivot via the index, for MultiIndex DataFrames.
wide = df.pivot_table(index="date", columns="product", values="sales", aggfunc="sum")
long = wide.reset_index().melt(id_vars="date", var_name="product", value_name="sales")
pd.concat stacks DataFrames along an axis. Vertically (axis=0, default) is the usual case — appending new rows. Horizontally (axis=1) glues columns side-by-side, aligning on the index. Both respect dtype consistency; a column of int64 concatenated with a column of float64 will widen to float64.
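The dtype widening in two lines (toy frames):

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]})    # int64
b = pd.DataFrame({"x": [3.5]})     # float64

stacked = pd.concat([a, b], ignore_index=True)
print(stacked["x"].dtype)          # float64 — widened, not truncated
print(stacked["x"].tolist())       # [1.0, 2.0, 3.5]
```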
The time-series machinery in pandas is uncommonly good — a DatetimeIndex unlocks resampling, rolling windows, shifting, and timezone-aware arithmetic that would take weeks to build by hand.
Parse your date columns at load time (parse_dates=). Set them as the index. From that point on, pandas treats dates as first-class:
df = df.set_index("timestamp").sort_index()
df.loc["2025-03"] # all rows in March 2025
df.loc["2025-03-15":"2025-03-22"] # an inclusive date range
df.loc[df.index.dayofweek < 5] # weekdays only
Resampling is the time-domain equivalent of groupby. df.resample("1D") buckets rows by day; the result is a grouper you can aggregate:
daily = df.resample("1D").agg(volume=("qty", "sum"), price=("price", "mean"))
monthly = daily.resample("1ME").sum()
The frequency string is a small DSL: "1h" hourly, "15min" fifteen-minute, "1W" weekly (ending Sunday), "1ME" month-end, "1QS" quarter-start, "1YE" year-end. Offsets compose: "2h30min". Older code uses the uppercase aliases "1H" and "1Y", which pandas 2.2 deprecated.
df["ma"] = df["price"].rolling(7).mean() computes a seven-period rolling mean. rolling accepts windows by count or by time (rolling("30D"), only on a DatetimeIndex), and by standard aggregations or custom functions. expanding is the cumulative version — the window grows from the start.
Feature engineering for supervised time series is almost entirely shift:
df["y_t-1"] = df["y"].shift(1)
df["y_t-7"] = df["y"].shift(7)
df["return"] = df["price"].pct_change()
Combine shift with groupby when you have panel data — one group per entity — so lags never cross entity boundaries: df.groupby("id")["y"].shift(1).
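A minimal panel-data lag, showing that the lag resets at each entity boundary:

```python
import pandas as pd

df = pd.DataFrame({"id": ["a", "a", "b", "b"],
                   "y":  [1, 2, 3, 4]})

df["y_lag"] = df.groupby("id")["y"].shift(1)   # never crosses id boundaries
print(df["y_lag"].tolist())                    # [nan, 1.0, nan, 3.0]
```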
If a dataset spans timezones, localize at the edge. Convert incoming timestamps to UTC on ingest (pd.to_datetime(values, utc=True) for strings; tz_localize then tz_convert("UTC") for naïve datetimes) and convert back to a user-facing timezone only at presentation. Keeping the middle tier in UTC eliminates a whole class of daylight-saving bugs.
Ninety percent of pandas performance tuning is three things: picking the right dtype, avoiding Python loops, and not loading more data than you need.
A column of three million integers stored as object (boxed Python ints) uses roughly an order of magnitude more memory and runs far slower than the same column as int32. After loading data, always check df.dtypes and downcast where safe:
df["user_id"] = df["user_id"].astype("int32")
df["category"] = df["category"].astype("category") # dict-encoded strings
df["flag"] = df["flag"].astype("boolean") # nullable bool
Categoricals are the single biggest win for datasets with low-cardinality string columns — countries, segments, product ids. They cut memory dramatically and speed up groupbys and joins.
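A quick memory comparison on a synthetic low-cardinality column (the exact ratio depends on string length and pandas version, so it is printed rather than claimed):

```python
import pandas as pd

s = pd.Series(["US", "DE", "FR"] * 100_000)   # low-cardinality strings
s_cat = s.astype("category")                  # integer codes + a tiny dictionary

obj_bytes = s.memory_usage(deep=True)
cat_bytes = s_cat.memory_usage(deep=True)
print(f"object: {obj_bytes:,} B  category: {cat_bytes:,} B")
assert cat_bytes < obj_bytes
```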
Since pandas 2.0, setting pd.options.future.infer_string = True makes string columns use Arrow-backed storage, which is both faster and properly handles missing values. Expect this to become the default; adopt it now on new projects.
For files too large for RAM, pd.read_csv supports chunksize=, which yields an iterator of DataFrames; for Parquet, pyarrow.parquet.ParquetFile.iter_batches reads row groups incrementally. Process a chunk, aggregate, discard, repeat:
total = 0
for chunk in pd.read_csv("huge.csv", chunksize=1_000_000):
    total += chunk["amount"].sum()
Pandas is excellent up to roughly 10–50 GB on a single machine. Past that, three serious alternatives: Polars (Rust, multi-threaded, Arrow-native — the fastest single-machine option), DuckDB (embedded analytical SQL engine, reads Parquet in place), and Dask (pandas-on-many-machines, best when your logic is already pandas). For real streaming, Apache Arrow plus ADBC is the direction everything is heading.
Before you reach for a bigger tool, profile. %timeit in Jupyter, cProfile for scripts, memory_profiler for allocations. A pandas script that feels slow is, nine times out of ten, one bad apply away from being ten times faster.
Every machine-learning project in Python is built on top of this chapter. Here is the map from the ideas above to the pipelines you will meet everywhere else in the compendium.
Feature engineering: groupby + transform computes within-group statistics (user-level averages, per-session counts); shift creates lags; merge joins your labels to your features.
PyTorch Datasets wrap a pandas DataFrame or a NumPy array. The __getitem__ method returns tensors built from df.iloc[i], converted via .to_numpy() and torch.from_numpy (the latter is a zero-copy view).
Text pipelines hand the model df["text"].tolist() and get back a dict of tensors or lists.
Evaluation: sklearn.metrics operates on NumPy arrays. pandas.crosstab builds confusion matrices; pivot_table builds error heatmaps by cohort.
Experiment tracking: pd.read_parquet("runs/") glob-reads every run into one table; groupby compares configs.
Serving: everything upstream exists to feed one model.predict(X) call.
Reproducibility: numpy.random.default_rng(seed) + torch.manual_seed(seed) + setting PYTHONHASHSEED before anything imports is the minimum bar. The seeds belong on a dataclass with the rest of your config.
Almost every ML paper's "data pipeline" section, if you strip the jargon, is the contents of this chapter. The rest of the compendium assumes you can move fluently between a NumPy array, a pandas DataFrame, and a torch Tensor — because every later chapter will.
The Python data stack is best learned by steeping. Keep a book open while you work, then go back to it once a month. Here is a short list — heavy on the two or three texts every practitioner eventually owns.
Reread the standard-library docs — itertools, collections, functools, dataclasses, pathlib, and datetime — once a year. Almost every "I wrote a helper for this" turns out to be a one-liner built on one of these modules.
This page is the first chapter of Part II: Programming & Software Engineering. Up next: software-engineering fundamentals, the modern Python tooling stack, writing production-grade ML code, data engineering pipelines, and deployment. The rest of Part II assumes you can read and write the code in this chapter without thinking.