AI for Education & Personalization: learning that adapts to the learner.

Education has been a target of personalised computing since the 1960s and a target of machine learning since the 1990s, but the 2023 arrival of capable LLMs reshaped the field in months. Modern AI for education spans knowledge tracing (modelling what each learner knows from their interaction history), adaptive learning systems (sequencing content to each learner's needs), intelligent tutoring (the long Carnegie-Mellon-rooted tradition of cognitive tutors, now being rebuilt with LLMs at the centre), automated assessment, and the institutional question that has dominated the field since 2023: how do schools and universities respond to students using ChatGPT? This chapter develops the methodology of the field, the deployment patterns that work and the ones that don't, and a substantive treatment of the institutional response to AI in academic settings — including the empirical evidence on what bans, detectors, and AI-positive pedagogies have actually produced.

Prerequisites & orientation

This chapter assumes familiarity with sequence models (Part V Ch 05) and transformer architecture (Part VI Ch 02) for the knowledge-tracing material, NLP fundamentals (Part VI Ch 01) for automated-assessment and tutoring sections, and the recommender-systems methodology of Ch 01 (which transfers directly to adaptive learning). The fairness-and-equity material of Part XV (when written) is essential context for several deployment topics, but the chapter develops what is needed inline. No background in education research is assumed; the chapter introduces concepts as they arise.

Two threads run through the chapter. The first is the modelling of the learner: every system in this chapter, from 1990s knowledge-tracing models to 2026 LLM tutors, is some attempt to represent what a particular learner knows, what they can do next, and what they are likely to learn from a given interaction. The methodology has shifted over decades but the goal has not. The second is the institutional context: education AI is deployed inside schools, universities, and tutoring services with their own incentive structures, regulatory frameworks (FERPA, COPPA, the EU AI Act's high-risk-system provisions), and pedagogic traditions. Sections 7 and 8 in particular develop the institutional response to generative AI — how schools have responded to ChatGPT, what AI-detection tools actually deliver, and what early evidence suggests works.

01

Why AI for Education Is Distinctive

Applying machine learning to education looks like applying it to any other personalisation problem — until you encounter the constraints that no other ML domain faces simultaneously: errors that affect children's life trajectories, motivation as the gatekeeper of any learning intervention, equity obligations under federal civil-rights law, FERPA and COPPA constraints on student data, and a deployment context (schools and universities) with thirty-year planning horizons and substantial institutional inertia. This section maps the constraints; the rest of the chapter develops the methodology that lives within them.

Stakes and the long tail of educational decisions

Education AI errors compound. A model that mis-tracks a learner's mastery of fractions in fifth grade can cascade into algebra struggles in seventh grade and calculus avoidance in eleventh grade. A college-admissions algorithm that mis-scores a student's writing changes the institution they attend, the career they enter, and the income trajectory that follows. The asymmetry between false negatives and false positives differs by application — a tutor that fails to challenge a strong student is a smaller harm than one that pushes a struggling student into work that produces sustained frustration — and the methodology has to engage with these asymmetries explicitly.

Motivation as the gating constraint

The single most-cited finding in education research is that motivation mediates almost everything else. A perfectly-personalised content sequence that students don't engage with produces no learning; a less optimal sequence that students attempt enthusiastically produces real gains. Education AI must therefore optimise for engagement-respecting outcomes rather than for outcome metrics in isolation, which makes the methodology resemble recommender systems (Ch 01) more than classical supervised learning. Section 6 returns to this in the context of LLM tutors, where the engagement question has been particularly visible.

Equity, accessibility, and the participation gap

Educational AI sits inside a system with substantial existing inequity — between school districts, between socioeconomic groups, between native and non-native speakers of the instruction language, between students with and without learning differences. Adding ML on top of this baseline can either reduce inequity (well-deployed AI tutors give every student access to one-on-one help that was previously available only to wealthy families) or amplify it (a model trained mostly on writing samples from native English speakers misjudges ESL learners' work). Section 9 develops the equity material in detail; the conceptual point is that "fair education AI" is not an afterthought but a deployment requirement.

FERPA, COPPA, and the regulatory envelope

Student data is heavily regulated. FERPA (the Family Educational Rights and Privacy Act, US, 1974) governs access to educational records, with substantial restrictions on third-party data sharing. COPPA (the Children's Online Privacy Protection Act, US, 1998) regulates collection of personal information from children under 13, requiring parental consent and limiting commercial use. The EU's GDPR adds parallel constraints with broader scope. State-level laws (notably Illinois SOPPA, California's SB 1177) layer additional requirements. The EU AI Act (adopted in 2024) explicitly classifies key educational uses of AI (admissions, assessment of learning outcomes, exam proctoring) as high-risk, imposing transparency, audit, and human-oversight requirements. The methodology of the chapter is shaped throughout by these constraints, and Section 9 treats the privacy material in depth.

The institutional context

Schools, universities, and tutoring services have planning horizons measured in decades. A district that adopts a curriculum sticks with it for ten years; a university's grading policies pre-date current faculty; tenure-track careers extend over generations of educational technology. The pace of AI development since 2023 has substantially exceeded what these institutions can absorb deliberately, which is why Sections 7 and 8's institutional-response material feels chaotic — the institutions are responding to a technology that arrived faster than their normal absorption capacity, often without the data or the empirical evidence they would normally require for major policy change.

Why Education AI Is Hard

Education AI is a recommender system whose users are children, whose errors compound across years, whose deployment is heavily regulated, whose users vary enormously in motivation and prior knowledge, and whose institutional context resists rapid change. Every section that follows is shaped by some combination of these constraints, and the methodology of the chapter is the methodology of working within them.

02

Knowledge Tracing

Before any AI can adapt to a learner, it has to model what that learner knows. Knowledge tracing is the discipline that does this — given a student's history of attempts on educational items, predict whether they will get the next item correct, with the latent variables interpreted as their mastery of underlying skills. The field has gone through several distinct waves of methodology, from probabilistic-graphical-model origins through deep-learning revolution to current LLM-augmented approaches.

Bayesian Knowledge Tracing

The classical approach is Bayesian Knowledge Tracing (BKT, Corbett and Anderson 1995), developed for the Carnegie Mellon cognitive-tutor program. BKT models each skill as a hidden Markov model with two states (learned, unlearned) and four parameters per skill: p(init), the probability the skill is known before practice; p(learn), the probability of transitioning from unlearned to learned at each opportunity; p(slip), the probability of answering incorrectly despite knowing the skill; and p(guess), the probability of answering correctly despite not knowing it. Given an observed sequence of correct/incorrect attempts, posterior inference on the hidden state produces a probability that the student has mastered the skill.
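As a concrete illustration, here is a minimal Python sketch of the BKT update: one Bayes step on the observed response, followed by the learning transition. Function and parameter names are illustrative, not any production system's API.

```python
def bkt_update(p_know, correct, p_learn, p_slip, p_guess):
    """One BKT step: posterior over mastery given the response,
    then the unlearned-to-learned transition."""
    if correct:
        # Correct answers come from knowing (and not slipping) or guessing
        evidence = p_know * (1 - p_slip) + (1 - p_know) * p_guess
        posterior = p_know * (1 - p_slip) / evidence
    else:
        # Incorrect answers come from slipping or from failing to guess
        evidence = p_know * p_slip + (1 - p_know) * (1 - p_guess)
        posterior = p_know * p_slip / evidence
    # A skill that is still unlearned may be acquired at this opportunity
    return posterior + (1 - posterior) * p_learn
```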

BKT's strengths are interpretability (each parameter has a clear pedagogic meaning), efficiency (parameters can be fit per skill on small data), and robustness across deployment contexts. Its weaknesses are that the binary mastery model is too coarse, that it cannot share information across related skills, and that it cannot use rich features about items or contexts. For thirty years it was the operational foundation of cognitive tutors anyway, and many production educational systems still use BKT or its variants.

Deep Knowledge Tracing

The 2015 Deep Knowledge Tracing paper (Piech et al., NeurIPS) recast knowledge tracing as a sequence-modelling problem and applied LSTMs. The model takes a sequence of (item, correct/incorrect) pairs and predicts the probability of correctness on each next item, with no hand-engineered skill model. The empirical results were striking — DKT outperformed BKT substantially on standard datasets — and the method opened the door to neural approaches that have since dominated the field.
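A minimal PyTorch sketch conveys how little machinery DKT needs; this captures the spirit of the model rather than the paper's exact configuration, and dimensions and names are illustrative.

```python
import torch
import torch.nn as nn

class DKT(nn.Module):
    """Deep Knowledge Tracing sketch: each timestep's input one-hot-encodes
    the (item, correctness) pair, giving 2 * n_items input dimensions; the
    output is, for every item, the predicted probability of answering that
    item correctly if it is presented next."""

    def __init__(self, n_items: int, hidden: int = 128):
        super().__init__()
        self.rnn = nn.LSTM(2 * n_items, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_items)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, 2 * n_items) one-hot interaction history
        h, _ = self.rnn(x)
        return torch.sigmoid(self.head(h))  # (batch, seq_len, n_items)
```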

DKT's limitations include unstable predictions (the model's belief about a student's mastery can fluctuate erratically across items), difficulty interpreting what's been learned, and weakness on long student histories. The 2017–2020 wave of refinements (DKT+, DKVMN, SAKT) addressed pieces of these issues with attention mechanisms and memory networks.

Transformer-based knowledge tracing

The 2020s wave of transformer-based methods has substantially advanced the field. SAINT (Choi et al. 2020, "Towards an Appropriate Query, Key, and Value Computation for Knowledge Tracing") uses an encoder-decoder transformer with separately-encoded item and response embeddings, achieving state-of-the-art results on the EdNet benchmark. AKT (Ghosh et al. 2020) adds attention with monotonically-decaying influence over distant items. BiDKT (Tan et al. 2022) and the various BERT-style adaptations leverage masked-prediction objectives. The 2024 generation includes LSTM-BERT hybrids (LBKT) for long sequences and increasingly LLM-based knowledge-tracing models that ingest rich item content rather than just IDs.

Foundation-model knowledge tracing

The 2024–2025 frontier increasingly uses large language models as the substrate for knowledge tracing. The pattern: prompt the LLM with a learner's history (formatted as natural-language interactions or item-by-item summaries) and ask it to predict the next response or to articulate what skills the learner has and hasn't mastered. The empirical results are mixed but encouraging — LLMs handle item content (problem text, context, learner explanations) far better than ID-only models, and they generalise to new items without retraining. The cost is high (every prediction is an LLM forward pass) and production deployments use LLM-based tracing for the most-valuable subset of decisions rather than as a default.

Standard datasets

The field's empirical foundation rests on a handful of public datasets. ASSISTments (online math practice from Worcester Polytechnic Institute) provides millions of student-item interactions with skill tags. STATICS covers engineering statics. EdNet (Riiid, the largest public KT benchmark) covers TOEIC English-language preparation. Junyi Academy covers Chinese-language K-12 mathematics. The 2024 generation of foundation-model KT methods is producing strong results on these benchmarks, but the gap between benchmark performance and operational utility — the same gap clinical AI faces (Ch 05) — remains substantial.

03

Adaptive Learning Systems

Knowledge tracing models what a learner knows; adaptive learning systems use that model to decide what to give them next. The methodology connects directly to the recommender-system framework of Ch 01 — adaptive learning is essentially personalised recommendation with the additional constraints that the recommended items have a pedagogic purpose and that learner mastery should grow over time, not just engagement.

Mastery-based progression

The dominant pedagogic principle in modern adaptive learning is mastery-based progression: a learner advances to the next topic only when they have demonstrated mastery of the prerequisite. The principle dates to Benjamin Bloom's 1968 "Learning for Mastery" and has been validated repeatedly — Bloom's "2-sigma problem" (1984) documented that one-on-one tutored students performed two standard deviations above traditional-classroom students, with mastery-based progression a major component. Adaptive systems operationalise this by holding learners on a topic until knowledge-tracing models indicate they have crossed a mastery threshold, then unlocking the next topic.

Sequencing and the zone of proximal development

Vygotsky's zone of proximal development (ZPD) — the range of tasks a learner can complete with appropriate scaffolding but not independently — is the operational target for adaptive sequencing. Items that are too easy bore the learner; items that are too hard frustrate them; the right level is challenging-but-tractable. The 85% rule (Wilson et al. 2019, derived in machine-learning settings) provides empirical justification: learning proceeds fastest when the learner answers correctly about 85% of the time. Production adaptive systems use knowledge-tracing predictions to maintain this difficulty calibration.
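The operational logic is simple enough to state in a few lines; a hedged sketch, assuming the knowledge-tracing model exposes a per-item correctness prediction.

```python
def pick_next_item(candidates, predict_p_correct, target=0.85):
    """Difficulty calibration via the 85% rule: among pedagogically
    eligible candidate items, serve the one whose predicted success
    probability is closest to the target rate."""
    return min(candidates,
               key=lambda item: abs(predict_p_correct(item) - target))
```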

Spaced repetition

Beyond difficulty calibration, adaptive systems schedule spaced repetition — re-exposing learners to material they have previously learned, at intervals calibrated to their retention curve. The classical SM-2 algorithm (developed by Piotr Woźniak for SuperMemo in the late 1980s, later adopted by Anki) and its modern descendants (the FSRS algorithm in 2024-era spaced-repetition apps) use a learner's response correctness to estimate when each item should be reviewed next. Duolingo's "Half-Life Regression" model (Settles and Meeder 2016) is the canonical production-scale spaced-repetition ML system, with the half-life of memory for each (learner, word) pair predicted from features.
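The half-life-regression model is compact enough to sketch in full; the weight and feature vectors below are illustrative stand-ins for Duolingo's actual feature set.

```python
def predicted_half_life(weights, features):
    """Half-life regression (Settles and Meeder 2016): the memory
    half-life in days for a (learner, word) pair is 2^(theta . x),
    with features such as counts of past correct and incorrect recalls."""
    return 2.0 ** sum(w * x for w, x in zip(weights, features))

def recall_probability(days_since_review, half_life_days):
    # Recall decays exponentially, halving once per half-life
    return 2.0 ** (-days_since_review / half_life_days)
```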

Item-response theory and difficulty modelling

Item-Response Theory (IRT, originating in psychometrics in the 1950s–60s) models the probability that a learner of ability θ answers an item of difficulty β correctly via a logistic function. The Rasch model (1-parameter IRT) and its multi-parameter extensions provide the substrate for most standardised testing (SAT, GRE, the various adaptive certification tests). Modern adaptive systems combine IRT-derived item-difficulty estimates with knowledge-tracing-derived learner-ability estimates to do real-time difficulty matching.
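In code, the Rasch model is a one-liner, which is part of why it remains the workhorse of difficulty calibration.

```python
import math

def rasch_p_correct(theta: float, beta: float) -> float:
    """1-parameter (Rasch) IRT: the probability that a learner of
    ability theta answers an item of difficulty beta correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))
```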

Reinforcement learning for sequencing

The 2020s have produced adaptive systems that frame sequencing as a reinforcement-learning problem: the state is the learner's current mastery, the action is the next item, the reward is some combination of immediate-correctness and longer-term-mastery proxies. RL-based sequencing has shown empirical improvements over heuristic policies in controlled studies but faces the standard educational-RL problems: the reward signal is delayed and noisy (true learning is observed weeks later, not immediately), the cost of wrong actions is real, and online exploration is constrained by the cost-of-error problem of Section 1. Production deployments generally use RL alongside human-designed scaffolds rather than as the sole sequencing logic.

04

Intelligent Tutoring Systems

The longest-running tradition in AI for education is intelligent tutoring systems (ITS) — the discipline of building computational tutors that understand both the subject matter and the learner. The Carnegie Mellon program from the 1980s onward produced the field's foundational architectures, and the methodology has direct continuity into modern LLM-based tutors covered in Section 6.

The classical ITS architecture

The canonical ITS architecture, due to John Anderson and colleagues at CMU, has four components: a domain model (formal representation of the subject matter — for algebra, the rules of equation manipulation; for chemistry, balancing-reaction procedures); a student model (representation of what the learner knows, typically via knowledge-tracing methods); a tutoring model (pedagogic strategy — when to give a hint, when to move on, when to require remediation); and a user interface (the layer the learner actually interacts with). The architecture's elegance is that each component has a clean role and can be developed independently; its limitation is that domain models for new subjects are expensive to build, gating the spread of cognitive tutors.

Model tracing and step-level guidance

The Carnegie cognitive tutors used model tracing as their tutoring approach: as the learner works through a problem, the system traces their current state against the domain model, identifying when they take a step that doesn't match any valid solution path and offering a hint, and when they take a wrong-but-recognisable step and offering targeted remediation. The discipline is cleaner than free-form tutoring because every step has a model-defined right answer, and the methodology produces detailed learner-specific data that knowledge-tracing models can use.
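The control flow of model tracing can be sketched compactly. This is an illustrative skeleton, not the CMU implementation; rules are represented as plain functions from the current problem state to the state they would produce (returning None when not applicable).

```python
def trace_step(state, learner_step, production_rules, buggy_rules):
    """Classify a learner's step: on a valid solution path, a known
    misconception (a 'buggy' rule), or unrecognised."""
    for name, rule in production_rules.items():
        if rule(state) == learner_step:
            return ("on-path", name)        # valid step, let them continue
    for name, rule in buggy_rules.items():
        if rule(state) == learner_step:
            return ("misconception", name)  # targeted remediation
    return ("off-path", None)               # offer a generic hint
```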

The CMU cognitive tutors for algebra and geometry, deployed at scale through Carnegie Learning, were among the first AI-based educational products with rigorous efficacy evidence — the largest randomised study of cognitive-tutor algebra (Pane et al. 2014, RAND) found small but statistically-significant improvements over conventional instruction for second-year users. The result was a useful corrective to the field's hype: AI tutors work but produce modest rather than transformational gains, and the gains depend on substantial deployment investment.

Constraint-based modelling

An alternative tradition, due to Stellan Ohlsson and developed at the University of Canterbury, uses constraint-based modelling. Rather than enumerating every valid solution path (which becomes intractable for complex domains), constraint-based tutors enumerate the constraints that any valid solution must satisfy. A solution that violates a constraint produces feedback; a solution that satisfies all constraints is correct. The approach is much cheaper to author for complex domains (database design, SQL, software-engineering ITS) and remains the dominant approach where solution spaces are too large for full enumeration.
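The contrast with model tracing is visible in code: rather than enumerating solution paths, the tutor checks a set of relevance/satisfaction predicates. A minimal sketch, with types and names assumed for illustration.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Constraint:
    relevance: Callable     # does this constraint apply to the solution?
    satisfaction: Callable  # if it applies, is it satisfied?
    feedback: str           # shown when relevant but violated

def evaluate_solution(solution, constraints):
    """A solution is acceptable iff no relevant constraint is violated;
    each violation carries its own pedagogic feedback."""
    return [c.feedback for c in constraints
            if c.relevance(solution) and not c.satisfaction(solution)]
```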

Dialogue-based and natural-language tutoring

Beyond step-by-step problem solving, a parallel ITS tradition has focused on tutorial dialogue — natural-language conversations between a tutor and a learner. AutoTutor (Graesser et al., University of Memphis, 1990s onward) is the canonical example, using rule-based dialogue management with semantic-similarity scoring to evaluate learner answers. The empirical results have been positive across many studies, with effect sizes comparable to human tutoring on the topics where the tutor was carefully authored.

The bridge to LLM-based tutors

Modern LLM-based tutors (Section 6) inherit substantial methodology from this tradition. Domain models become structured prompts and retrieval over curriculum content. Student models become explicit memory and dialogue-state tracking. Tutoring models become pedagogic prompting strategies. The interface is a chat. The continuity is real, and serious LLM-tutor design draws on the ITS literature — the alternative being to rediscover lessons that the field already learned over decades.

05

Automated Assessment and Feedback

Beyond presenting content, AI for education increasingly evaluates learner work — scoring essays, grading code, providing formative feedback. The methodology spans classical NLP through transformer-based evaluators to LLM-based rubric-following, with the deployment context (high-stakes vs. formative) shaping what's appropriate.

Automated essay scoring

Automated essay scoring (AES) has been a research area since Project Essay Grade in the 1960s. The 2010s wave used handcrafted-feature regression models (e-rater, IntelliMetric); the 2020s wave uses transformer-based models (BERT-style scoring, increasingly LLM-based scoring). On standardised tests with detailed rubrics, modern AES systems achieve agreement with human raters (quadratic-weighted kappa ~0.8) comparable to inter-rater agreement between two human scorers.
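Quadratic-weighted kappa, the field's standard agreement metric, is worth seeing concretely; a sketch assuming integer scores from 0 to n_levels - 1.

```python
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, n_levels):
    """Chance-corrected agreement between two raters on an ordinal scale,
    penalising disagreements by squared distance between score levels."""
    observed = np.zeros((n_levels, n_levels))
    for i, j in zip(rater_a, rater_b):
        observed[i, j] += 1
    weights = np.array([[(i - j) ** 2 for j in range(n_levels)]
                        for i in range(n_levels)]) / (n_levels - 1) ** 2
    expected = np.outer(observed.sum(1), observed.sum(0)) / observed.sum()
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()
```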

The methodology is contested. AES models can be gamed by writing that satisfies surface features (length, vocabulary diversity, syntactic complexity) without coherent argument. Studies have repeatedly shown that essays written deliberately as nonsense — but containing the right features — score well on classical AES models. The 2020s LLM-based scorers handle this better but are not immune. High-stakes deployments (TOEFL, GRE-AW) typically combine AES with human review rather than rely on AES alone, and the operational reality is that the AI is one signal among several rather than the sole grader.

Code grading and programming-feedback systems

Programming education has its own automated-assessment tradition. Unit-test-based grading (correctness against a test suite) is common for introductory programming courses but is too coarse for serious feedback — students can get partial credit for code that solves none of the actual problem, or zero credit for code with one off-by-one error. Modern systems combine test-based correctness with style and structure feedback, often via static-analysis tools (linters, type checkers). The 2024 generation increasingly uses LLMs for richer feedback: the LLM is prompted with the student's code, the assignment description, and a rubric, and produces structured feedback comparable to a teaching assistant's.

Formative versus summative assessment

The distinction matters for deployment. Formative assessment is feedback during learning, not for grading; the goal is to help the learner improve. Summative assessment is end-of-unit grading, with the score determining advancement. Different accuracy and fairness requirements apply: formative feedback can tolerate substantial noise because it's one signal among many, while summative grading affects student records and requires much higher reliability. Most production AI-assessment is formative, with summative use limited to specific high-trust applications (multiple-choice tests, exact-answer math problems, etc.).

LLM-based rubric scoring

The 2024–2026 generation of assessment AI uses LLMs to score against detailed rubrics. The pattern: prompt the LLM with the student response, the rubric, and few-shot examples of past scoring, and ask it to assign scores against each rubric dimension with justifications. Empirical results are encouraging — agreement with human raters at parity with inter-rater agreement on many standard tasks — but the failure modes are different from classical AES. LLMs can hallucinate criteria not in the rubric, can be inconsistent across similar responses, and can reflect biases from their pretraining data. Production LLM-based scoring uses calibration corpora to monitor for drift and inter-LLM agreement as a sanity check.
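The prompt pattern itself is easy to sketch. The structure below is illustrative; the dictionary keys and wording are assumptions, not any specific product's format.

```python
def build_scoring_prompt(response, rubric, scored_examples):
    """Assemble an LLM rubric-scoring prompt: rubric dimensions,
    few-shot scored examples, then the new student response."""
    dims = "\n".join(f"- {d['name']} (0-{d['max_score']}): {d['criteria']}"
                     for d in rubric)
    shots = "\n\n".join(f"Response:\n{e['response']}\nScores:\n{e['scores']}"
                        for e in scored_examples)
    return ("Score the student response against each rubric dimension.\n"
            "Give a score and a one-sentence justification per dimension.\n"
            "Use only the criteria in the rubric; do not add new ones.\n\n"
            f"Rubric:\n{dims}\n\nScored examples:\n{shots}\n\n"
            f"Student response:\n{response}\n\nScores:")
```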

Feedback quality and the engagement question

Beyond accuracy, feedback quality matters as much as the score itself. A correct score paired with vague feedback ("good effort, work on argumentation") is less useful than a slightly-noisier score paired with specific actionable feedback ("paragraph 3 lacks evidence — consider citing the data from page 5"). The 2024 generation of LLM-based feedback systems can produce specific, actionable feedback at scale, which has shifted assessment AI's value proposition from "score work fast" to "produce feedback that helps students improve." This connects directly to the LLM-tutor material of Section 6.

06

LLMs as Tutors and Learning Companions

The 2023 release of GPT-4 and subsequent capable LLMs reshaped education AI more dramatically than any previous wave. Before 2023, intelligent tutoring required years of domain-model authoring to cover any new subject. After 2023, plausibly-helpful tutoring on any subject became a prompt-engineering problem. The change is real, the empirical results are mixed, and the field is still working out which deployment patterns produce learning rather than just engagement.

Khanmigo and the production-scale LLM tutor

The most-watched deployment is Khan Academy's Khanmigo, launched in 2023 as a GPT-4-based tutor wrapped around Khan Academy's existing content library. Khanmigo's design carefully avoids the most-criticised LLM-tutor failure mode: it does not give direct answers, instead asking Socratic questions and scaffolding the learner to the answer themselves. The product expanded from ~68,000 users in 2023–24 to ~700,000 in 2024–25, with district partnerships growing from 45 to 380+. Khanmigo is now the largest production LLM-tutor deployment in K–12 education.

The empirical evidence on Khanmigo is preliminary. Khan Academy's pre-LLM efficacy research (2024) showed that students using Khan Academy for 30+ minutes per week throughout a school year saw greater-than-expected gains on standardised assessments, with the Khan-platform-with-Khanmigo intervention designed to amplify these gains rather than supplant them. Pilot studies (notably the Michigan Virtual study, Spring 2025) found that teacher confidence with AI grew over the course of the deployment and that student usage grew from monthly to weekly engagement, but rigorous learning-outcome data is still being collected. The candid Khan Academy framing in 2025 was that LLM tutors "work if students use them correctly" — an honest acknowledgement that engagement remains the gating constraint.

Duolingo Max and language-specific tutors

Duolingo Max launched in 2023 with two GPT-4-powered features: "Explain my Answer" (deeper explanation of why a particular answer is right or wrong) and "Roleplay" (simulated conversational practice with an AI character). The deployment has attracted substantial paid-subscriber adoption, and 2024 published data from Duolingo's research team showed measurable engagement gains and self-reported learning outcomes for paid Max subscribers. Language learning is in many ways the most natural LLM-tutor application — the LLM's native generation capability is exactly what conversational practice requires.

The ChatGPT-as-tutor pattern

Outside formal products, students have made ChatGPT and equivalent tools the most-used educational AI by orders of magnitude. The 2024 EDUCAUSE student survey found that ~92% of college students reported using AI tools in academic work, with the majority using ChatGPT or comparable products at least weekly. The use cases run the full spectrum: brainstorming, explanation-of-concepts, code debugging, writing assistance, full-essay generation, math problem solving. The pedagogic value varies enormously across these patterns, and Section 7 returns to this in the context of academic integrity.

The Socratic-vs-direct-answer design choice

A central LLM-tutor design choice is whether to give answers directly or to scaffold the learner toward answers themselves. Direct answers are what students want; scaffolded discovery is what produces learning. Khanmigo's deliberate Socratic framing has been the most-prominent attempt at the latter; ChatGPT's default helpful-assistant behaviour represents the former. Empirical work on this tension is sparse but suggestive: students learn more from Socratic LLM tutors but engage with them less, and the deployment that wins in adoption may not be the one that wins in learning outcomes. The 2024 research literature is increasingly investigating this trade-off explicitly.
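In the simplest case, the design choice reduces to the system prompt. The sketch below is illustrative only: Khanmigo's actual prompt is not public, and production tutors layer retrieval, memory, and guardrails on top.

```python
# Illustrative only: a minimal Socratic system prompt, not any product's.
SOCRATIC_TUTOR_PROMPT = """\
You are a patient tutor. Never state the final answer directly.
Ask one guiding question at a time, building on the student's last step.
If the student is stuck twice in a row, give a smaller hint, not the answer.
If a step contains an error, ask a question that leads the student to find it.
Keep each response under three sentences."""
```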

Hallucination, mathematics, and the reliability gap

LLMs hallucinate, and educational content is not exempt. A tutor that confidently gives wrong answers in arithmetic, that confuses chemical formulas, that misattributes quotations is actively harmful. The 2024–2026 generation of education-tuned LLMs (Khanmigo's variant of GPT-4, Duolingo's tuned models, the various university-partnership LLMs) substantially reduces hallucination through fine-tuning, retrieval over curated content, and deliberate refusal patterns when the model is uncertain. Mathematics in particular has seen substantial improvement through tool-augmented LLMs that delegate computation to dedicated math engines rather than computing in the LLM's forward pass. The reliability gap is narrowing but not closed; production tutors require ongoing monitoring and explicit boundaries on what content domains they handle.
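The tool-augmentation pattern for arithmetic is worth seeing concretely. A minimal sketch, assuming the tutor model emits an arithmetic expression as a tool call: a deterministic evaluator, not the LLM's forward pass, computes the result.

```python
import ast
import operator

# Operators the evaluator is willing to apply
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv,
        ast.Pow: operator.pow, ast.USub: operator.neg}

def evaluate(expression: str) -> float:
    """Safely evaluate an arithmetic expression string, so the tutor's
    numeric answers never depend on token-by-token LLM arithmetic."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval").body)
```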

07

AI in Schools and Universities: The Cheating Question

The single most consequential change in education caused by the 2023 LLM wave is not the formal deployment of products like Khanmigo. It is the unprompted, unauthorised, and now near-universal use of consumer LLMs by students themselves. This section documents the empirical reality of student AI use, the institutional responses that have been tried, and the technical inadequacy of AI-detection tools that institutions hoped would solve the problem for them. Section 8 then surveys the strategies that have actually moved the needle.

The adoption rate

The numbers are striking. Multiple 2024–2026 surveys converge on a rate of approximately 90% of students in higher education using AI tools at least occasionally for academic work. The 2024 Tyton Partners "Time for Class" survey found 59% of college students using generative AI at least weekly. The 2025 Anthropic and OpenAI usage reports, drawing on aggregated product telemetry, showed millions of student-coded queries per day during academic-year months. By the standards of technology adoption in education, this is among the fastest mass adoptions ever recorded.

Crucially, this adoption has been substantially undisclosed. A 2024 study from King's Business School found that 74% of students who had used AI on assignments did not disclose this to instructors, even when the syllabus required disclosure. The asymmetry between use rate (~90%) and disclosure rate (~25%) is the empirical heart of the institutional crisis.

The first wave: bans

The initial institutional response in early 2023 was prohibition. New York City public schools blocked ChatGPT on school networks; Los Angeles Unified followed; several flagship universities issued blanket "no AI tools" syllabi. These bans collapsed within months. Students used personal devices and home networks, the bans were unenforceable, and the institutions generally walked them back by the 2023–24 academic year. The exceptions (a handful of schools that maintained bans through equipment-management policies) found that the policies failed to prevent use but successfully prevented the institution from gathering useful information about how students were using AI.

The second wave: AI detectors

The second institutional hope was that AI-detection tools would let schools enforce rules they could not enforce by network blocks. Vendors promised high accuracy: Turnitin claimed 98% accuracy with less than 1% false-positive rate; GPTZero, Originality.AI, Copyleaks, and Winston AI made comparable claims. Schools deployed these tools widely in 2023–24.

The empirical reality has been disappointing in well-documented ways. Independent evaluations have repeatedly found false-positive rates substantially higher than vendor claims — a Washington Post study using Turnitin's detector found a 50% false-positive rate on a sample of human-written student work. GPTZero's false-positive rate ranges from 0.24% to 10% depending on the corpus. Commercial detectors systematically misclassify non-native-English-speaker writing as AI-generated, because the linguistic features they associate with AI (regular sentence structure, limited vocabulary diversity, formulaic phrasing) overlap with the features of skilled-but-non-native English. Neurodivergent writers and writers with formal-instruction backgrounds also score higher false-positive rates. By 2024, multiple universities (Vanderbilt, Northwestern, the University of Texas at Austin, others) had publicly disabled their Turnitin AI-detection features, citing unreliability and equity concerns.

The fundamental problem is information-theoretic. Modern LLMs produce text drawn from distributions overlapping substantially with human writing; "did a human or an LLM write this passage" is not a question with a reliable answer at the resolution institutions need for disciplinary decisions. The detection-tool industry persists because some institutions still purchase the tools, but the methodology is widely discredited among researchers and increasingly among practitioners.

The post-detection environment

By the 2024–25 academic year, the institutional consensus had shifted away from detection toward disclosure-based policies, redesigned assessment, and selective oral-defence requirements. Section 8 develops these strategies and what evidence exists for their effectiveness. The conceptual point here is that the prohibition-and-detection era ended fairly quickly, with the institutions that committed hardest to it generally finding their commitments untenable; the field has had to develop different approaches.

08

Institutional Strategies and Their Results

Once detection failed and bans collapsed, schools and universities turned to a different set of approaches: disclosure-based policies, redesigned assessment that's harder to AI-cheat, oral-defence components, and "AI-positive" pedagogies that integrate AI use into the curriculum. This section surveys the strategies and what early evidence exists for each. The empirical record is still thin — most strategies have been deployed for only one or two academic years — but enough has accumulated to draw provisional conclusions.

INSTITUTIONAL RESPONSE TO STUDENT AI USE, 2023–2026: strategies tried · effectiveness · what the evidence shows
Wave 1 (2023) · Bans: network blocks, syllabus prohibition · collapsed in months · unenforceable, lost the data signal
Wave 2 (2023–24) · AI detectors: Turnitin, GPTZero, Originality · largely disabled by 2024–25 · false positives, ESL bias, ~50% on some samples
Wave 3 (2025+) · Disclosure + redesign: policy, oral defence, process-based grading · helps at the margin · expensive, partial, no full solution
Emerging · AI-positive pedagogy: assignments that require AI use as a tool (critique, compare, scaffold critical thinking) · most promising · most teacher-investment-intensive
The institutional response to student AI use has moved through three waves between 2023 and 2026, with a fourth approach ("AI-positive pedagogy") emerging as the most promising though most labour-intensive. Wave 1 (bans) collapsed within months because students used personal devices and the policies were unenforceable. Wave 2 (AI detectors) was largely disabled by 2024–25 across major universities because false-positive rates and disparate impact on ESL and neurodivergent writers made the tools unusable for disciplinary decisions. Wave 3 (disclosure-based policy plus assessment redesign — oral defence, process-based grading, personalised assignments) helps at the margin but is expensive and only partial. The emerging AI-positive approach treats AI as a tool whose use is part of the learning objective itself; early evidence is encouraging but the approach requires substantial faculty effort to redesign assignments and assessment.

Disclosure-based policies

The dominant 2025 model is some form of disclosure-based policy: students may use AI tools, but must disclose how and where. The Harvard Graduate School of Education policy is representative: students can use AI for clarification, brainstorming, and refining their own ideas, but must attribute AI contributions like any other source. Princeton, Columbia, the University of Texas at Austin, Stanford, and most of the AAU member institutions have versions of this approach.

The empirical record is mixed. Compliance rates are modest (the King's Business School finding that 74% of students using AI did not disclose suggests that disclosure-based policies work only when paired with other accountability mechanisms). The policies have nonetheless reduced the most-blatant AI-cheating cases (full-essay-generation submitted as original work), particularly when paired with class-level discussion of expectations and consistent enforcement. The policies have not stopped the broader pattern of AI-augmented assignment completion.

Redesigned assessment

A more ambitious strategy is redesigned assessment: change the assignments themselves so that AI use is either irrelevant or explicitly part of the work. Several patterns have emerged:

In-class writing and oral examination: written work in proctored conditions, plus oral defence of submitted work. This approach mirrors the European university tradition of oral examinations and has been adopted by select courses at most major US universities. The evidence suggests it does prevent AI use during the assessment, but the time and labour costs are substantial.

Process-based grading: instead of grading the final product, grade the process — drafts, revision history, in-class work. Tools like Google Docs version history and dedicated process-tracking software (the Turnitin Draft Coach, Memorial University's "ProseSnap") let instructors see the writing process unfold. The methodology preserves out-of-class assignments while making AI completion visible.

Personalised, context-rich assignments: assignments that draw on specific in-class discussion, that require integration of unique sources, or that ask for analysis of student-specific cases are harder for off-the-shelf LLMs to complete effectively. The 2024 wave of "AI-resistant" pedagogy at scale has produced substantial faculty effort here, with mixed reception (some students appreciate the more personal nature; others find the assignments confusing).

AI-positive pedagogy

The most thoroughly-redesigned approach is AI-positive pedagogy: assignments that explicitly require AI use, with AI as a tool whose use is part of the learning objective. Students might be asked to use ChatGPT to generate a draft, then critique its argumentation; to use AI to explore counter-arguments to their own positions; to compare AI-generated and student-generated solutions to a problem and identify what each gets right or wrong. The methodology treats AI as a collaborator rather than contraband.

The empirical evidence here is most positive but also most preliminary. Studies from Wharton (Mollick and colleagues at the University of Pennsylvania) and several MOOC platforms show that students in well-designed AI-integrated assignments report higher engagement and demonstrate better critical thinking about source evaluation than comparison groups. The interventions are also the most teacher-investment-intensive — the learning objective has to be re-articulated, the assignment redesigned, the assessment rubric updated, and the class discussion shifted. Scaling this beyond early-adopter faculty is a substantial institutional challenge.

AI literacy as a curricular goal

Several institutions have moved beyond per-course adjustments to AI literacy as a curricular goal. The University of California, Berkeley, the University of Florida (with their AI-across-the-curriculum initiative), Estonia's K-12 AI curriculum, and the various 2025 European national initiatives explicitly add AI fluency to graduation requirements. The methodology mirrors information-literacy programs of the 1990s and 2000s — AI is treated as a foundational skill rather than a subject-specific consideration. Whether these initiatives produce measurable differences in student outcomes will not be known for years; the institutional commitment is substantial.

What the evidence shows so far

Pulling the strands together, the early evidence from 2024–2026 supports several claims with reasonable confidence. Bans don't work — students use personal devices, the prohibition is unenforceable, and the institution loses access to information that could inform policy. AI detectors don't work reliably enough for disciplinary use — false-positive rates are too high and disparate impact too severe to defensibly use detector outputs as the basis for academic-integrity violations. Disclosure policies help at the margin — they reduce the most-blatant cheating cases without solving the broader pattern. Assessment redesign works but is expensive — both in faculty time and in the trade-off with traditional learning objectives. AI-positive pedagogy is most promising in early evidence but requires the biggest pedagogic shift. The institutions that will navigate the AI transition most successfully are those that engage with the methodology pragmatically, monitor the evidence as it accumulates, and accept that no single strategy will solve the problem fully.

09

Equity, Privacy, and Educational Data

Educational AI carries equity and privacy obligations that go beyond what most ML domains face. Students are minors in many cases; education is a federally-protected interest with civil-rights implications; and the data is regulated by FERPA, COPPA, GDPR, and a growing patchwork of state laws. This section develops the equity and privacy material that has been touched on throughout the chapter.

Disparate impact in educational AI

Several documented failure modes affect protected groups disproportionately. Automated essay scoring systematically scores writing by non-native English speakers lower than human raters do, because the surface features the model has learned to associate with quality include native-speaker stylistic patterns. AI-detection tools (Section 7) flag non-native-speaker, neurodivergent, and formal-register writing as AI-generated at higher rates. Knowledge-tracing models trained primarily on majority-language students underperform on minority-language students with comparable underlying knowledge. Adaptive sequencing can reinforce stereotype-aligned course paths if the training data reflects historical discrimination.

The methodology for addressing this draws on the fairness chapter of Part XV (when written) and on the parallel material in Ch 05 (Healthcare). The standard toolkit includes subgroup-performance auditing, careful training-data curation, fairness-constrained optimisation, and post-deployment monitoring with explicit threshold-equity audits. The institutional implementation is uneven — some major edtech vendors run substantial fairness audits; others publicly claim "fairness through unawareness" approaches that the literature has long discredited.
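The first step of that toolkit is mechanical enough to sketch. A minimal subgroup accuracy audit, with the record format assumed for illustration; real audits extend this to calibration, false-positive rates, and threshold equity per group.

```python
from collections import defaultdict

def subgroup_accuracy(records):
    """Per-group accuracy from (group, y_true, y_pred) triples, making
    performance gaps between subgroups visible before deployment."""
    hits, totals = defaultdict(int), defaultdict(int)
    for group, y_true, y_pred in records:
        totals[group] += 1
        hits[group] += int(y_true == y_pred)
    return {g: hits[g] / totals[g] for g in totals}
```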

FERPA and student-record privacy

The Family Educational Rights and Privacy Act (FERPA, 1974) governs access to and disclosure of educational records in any institution receiving federal funds. The rules: educational records cannot be shared with third parties without parental consent (or student consent for those over 18); students have the right to inspect and seek amendment of their records; certain "directory information" can be released absent objection but other categories cannot.

For educational AI, FERPA imposes substantial constraints on data flows. Student work and performance data are educational records and cannot be sent to third-party AI services without appropriate consent and contractual protection. The "school official" exception allows AI vendors operating under direct institutional control to access records, but the contracts must be specific, the use must be educational, and data-resale is forbidden. The 2024 Department of Education guidance on AI in K–12 settings has substantially clarified what's permissible, but compliance requires contractual diligence that smaller districts and institutions often lack.

COPPA and the under-13 problem

COPPA (the Children's Online Privacy Protection Act, 1998) requires verifiable parental consent for collection of personal information from children under 13, and limits commercial use of that data. The EU's GDPR sets a parallel age of digital consent at 16 by default, with member states permitted to lower it to 13. For K–12 educational AI, this means consent flows must be carefully managed — typically through schools acting on parents' behalf — and that data retention, sharing, and use must be tightly bounded.

The 2023–2026 wave of LLM tutors has produced substantial COPPA compliance work. Khanmigo's deployment in K–12 districts requires district-level Data Processing Agreements; OpenAI's school-targeted offerings include FERPA/COPPA-compliant variants; Anthropic and Google have similar education-tier products. The compliance work is real but doable; the more problematic case is consumer LLMs being used by under-13 students without parental knowledge, where the legal status is murky and enforcement is uneven.

Educational data brokers and the resale risk

Beyond formal AI deployment, the broader edtech ecosystem includes substantial data-broker activity — companies that aggregate student data across multiple sources for commercial use. The 2014 Common Sense Media reports first highlighted the scope; the 2022–2024 wave of state-level legislation (Colorado, California, New York) has constrained but not eliminated the practice. Educational AI deployments need to be carefully evaluated for data-resale provisions in vendor contracts, and most states now require explicit non-resale clauses for student data.

Transparency and the AI Act

The EU AI Act (adopted in 2024) explicitly classifies key educational uses of AI as high-risk, imposing requirements for transparency, audit, human oversight, and quality management. Specific obligations include: documenting training data and methodology, providing information about the AI's capabilities and limitations to teachers and students, ensuring meaningful human oversight of consequential decisions, monitoring for accuracy and bias, and supporting incident reporting. The compliance burden is substantial; the timeline (high-risk obligations enforceable from mid-2026) has driven major edtech vendors to redesign their products and documentation. Equivalent requirements are emerging in other jurisdictions, and the global trajectory is toward more rather than less regulation of education AI.

10

Applications and Frontier

Beyond the core areas of knowledge tracing, adaptive learning, ITS, assessment, and LLM tutors, AI for education appears in many specialised applications. This final section surveys the application landscape and the frontier where modern AI is reshaping the field.

Language learning

Language learning is the application area where AI has produced the largest measurable consumer impact. Duolingo (with its half-life-regression spaced repetition, the Duolingo Max LLM features, and increasingly LLM-driven content generation), Babbel, Busuu, and the various conversation-practice apps (TalkPal, Tutor Lily, the Lingo Mate family) collectively reach hundreds of millions of users. The methodology combines spaced repetition, IRT-based difficulty calibration, and increasingly LLM-driven generation of personalised content and conversation partners. Empirical outcomes data is uneven — some published studies show measurable language-skill gains, others suggest engagement-without-mastery patterns — but the market scale and continued usage growth suggest the products are providing real value at the margin.

Special needs and accessibility

AI for students with learning differences and disabilities is a substantial application area with high social value and lower commercial visibility. Speech-to-text and text-to-speech tools support students with dyslexia and writing-output difficulties. Communication-augmentation systems support students with speech impairments. Real-time-captioning systems support deaf and hard-of-hearing students. The 2023–2026 generation of LLM-based tools has substantially expanded what's possible — particularly real-time content simplification, multimodal explanation generation, and individualised scaffolding for students with cognitive differences. The deployment pattern combines AI tools with substantial human-specialist support; AI augments rather than replaces special-education expertise.

Lifelong learning and corporate training

Beyond formal education, AI for lifelong learning has produced a substantial commercial sector. Coursera, edX, Udemy, LinkedIn Learning, and the corporate-training vendors (Pluralsight, Cornerstone, Workday Learning) all use ML for content recommendation, skill modelling, and increasingly LLM-based tutoring. The deployment context differs from K–12 and higher-ed in important ways: adult learners are intrinsically motivated, the assessment stakes are typically lower, and the regulatory framework (FERPA, COPPA) doesn't apply. AI tutors may have their largest measurable impact in this segment precisely because the engagement question that gates K–12 deployment is less binding.

Tutoring at scale and the "Bloom problem"

Bloom's 1984 "2-sigma problem" — the observation that one-on-one human tutoring produced gains two standard deviations above conventional classroom instruction — has been one of educational AI's organising aspirations. If LLM tutors can achieve even a fraction of human-tutor effectiveness at scale, the cost-effectiveness of education would change dramatically. The 2024–2026 evidence is preliminary: rigorous experimental work (Mollick's various Wharton studies, the Khan Academy efficacy program) is still accumulating. Initial results suggest LLM tutors produce real gains but smaller than human one-on-one tutoring, and that the gap depends on tutor design (Socratic vs. direct), student characteristics (motivation, prior knowledge), and topic structure. Whether closing this gap is feasible remains an open empirical question.

Foundation models for education

The 2024–2026 wave of education-specific foundation models is reshaping the methodology of every section of this chapter. Khan Academy partners with OpenAI on a tuned model for Khanmigo. Pearson, Cengage, and McGraw-Hill have developed proprietary-curriculum-trained variants. Several research labs (Carnegie Mellon's Open Learning Initiative, Stanford's AI Lab, the various LearnLab projects) have produced education-tuned open-weight models. The methodology connects directly to the foundation-model material of Part VI Ch 02 — the education layer is fine-tuning, retrieval, and rubric-based prompting on top of general-purpose models, with the curriculum content and pedagogic framing as the customisation.

Frontier methods

Several frontiers are particularly active in 2026. Multimodal learning analytics combine text, voice, video, and physiological signals (eye-tracking, gesture, expression) into richer learner models. Causal inference for educational interventions applies the methodology of Part XIII Ch 03 to learning experiments, with substantial 2024–2026 work on causal effects of LLM tutoring. Agentic tutors orchestrate multi-tool workflows for complex tutoring tasks (research projects, multi-step problem solving) with substantial recent demonstration work. Educational digital twins simulate individual learners' future trajectories under different intervention strategies, supporting policy decision-making. Open-source educational LLMs (the OLMo-Edu lineage, the various community-driven projects) are emerging as alternatives to proprietary commercial systems for institutions that need local control over models.

What this chapter does not cover

Several adjacent areas are out of scope. The substantial cognitive-science literature on how learning works (memory, transfer, motivation, metacognition) is essential context but is its own substantial field. The economics and policy of education — funding, equity-of-access, the labour-market consequences of changing skill demands — are policy questions rather than technical ones. The specific methodology of standardised testing and psychometrics overlaps with the assessment material of Section 5 but has its own technical traditions. The history of educational technology (programmed learning, computer-aided instruction, MOOCs) is the prologue to the AI era but is treated mostly through a separate lens. And the ongoing political debates about AI's role in education — automation of teaching, the future of universities, credentialing — are important but treated through policy rather than ML lenses.

Further reading

Foundational papers and references for AI in education. The Corbett–Anderson BKT paper, the Piech et al. DKT paper, a learning-sciences textbook, and the EDUCAUSE survey on student AI use together make the right starting kit for serious work in the field.