The ES-LLMs Public Sandbox Is Now Live: From Untamed Black Box to Interpretable Pedagogical Orchestration

Nizam Kadir | Machine Pedagogical Intelligence & AI in Education

We have officially launched the public sandbox for the Ensemble of Specialised Large Language Models (ES-LLMs)—the core architecture behind our AIED 2026 paper—on the LearnAdapt platform.

👉 Try the live sandbox: https://learnadaptresearch.org/aied2026
📄 Read the full paper: https://arxiv.org/abs/2603.23990
🛠️ Explore the code: https://github.com/nizamkadirteach/aied2026-esllms

For a long time, the deployment of Large Language Models (LLMs) in education has relied on monolithic systems. These models are undeniably fluent, but they operate as black boxes. Their pedagogical decisions are implicit, difficult to audit, and often misaligned with how learning actually works. In practice, they tend to optimise for immediate user satisfaction rather than long-term understanding—frequently giving answers too early and bypassing the very struggle that supports learning.

This is not just a technical limitation. It is a pedagogical failure.

The Control Problem in AI Tutoring

Monolithic LLM tutors suffer from what we describe as a control problem. While they can generate coherent explanations, they lack the structural discipline required in educational settings. Constraints such as “attempt before hint” or “limit scaffolding to encourage effort” are difficult to enforce reliably through prompts alone.

Our research identifies a phenomenon we call the Mastery Gain Paradox: students appear to perform better in the short term because they receive excessive assistance, but their underlying mastery does not improve and may even decline.

This raises a fundamental question:

How do we retain the fluency of LLMs while enforcing the rigour of pedagogy?

A Structural Shift: Decoupling Decision from Language

The ES-LLMs architecture addresses this by separating what the tutor decides from how the tutor speaks.

Instead of relying on a single model to manage everything—assessment, feedback, scaffolding, motivation, and ethics—we introduce a structured, multi-agent system coordinated by a deterministic orchestrator.

At a high level, the system operates in three layers:

Learner and Context Modelling
The system continuously updates a student model using Bayesian Knowledge Tracing (BKT), alongside signals such as error streaks, hint usage, and latency. This produces a rich, interpretable snapshot of the learner’s state.
Deterministic Orchestration
A rule-based orchestrator evaluates this state and determines the appropriate pedagogical action. It enforces constraints explicitly—such as preventing hints before an attempt—ensuring that instructional policies are followed consistently.
Specialised Agent Ensemble + LLM Rendering
A set of specialised agents (AssessmentBot, FeedbackBot, ScaffoldBot, MotivatorBot, EthicsBot, and TutorBot) propose actions based on the learner’s state. The orchestrator selects the appropriate action, and only then is a single LLM call used to render the response in natural language.
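
To make the learner-modelling layer concrete, here is a minimal sketch of the standard BKT update for a single skill. The parameter values and the names BKTParams and bkt_update are illustrative assumptions, not the fitted values or identifiers used in ES-LLMs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BKTParams:
    # Illustrative values only; the post does not report fitted parameters.
    p_transit: float = 0.15  # chance of learning the skill after a step
    p_slip: float = 0.10     # chance of erring despite knowing the skill
    p_guess: float = 0.20    # chance of answering correctly without it


def bkt_update(p_mastery: float, correct: bool,
               params: BKTParams = BKTParams()) -> float:
    """Return the mastery estimate after observing one answer."""
    if correct:
        known = p_mastery * (1.0 - params.p_slip)
        unknown = (1.0 - p_mastery) * params.p_guess
    else:
        known = p_mastery * params.p_slip
        unknown = (1.0 - p_mastery) * (1.0 - params.p_guess)
    # Bayes' rule: probability the skill was already known, given the answer.
    posterior = known / (known + unknown)
    # Account for the chance that the skill was acquired on this step.
    return posterior + (1.0 - posterior) * params.p_transit


if __name__ == "__main__":
    p = 0.30  # hypothetical prior mastery
    for answer in (False, False, True, True):
        p = bkt_update(p, answer)
        print(f"correct={answer!s:<5} mastery={p:.3f}")
```

The resulting probability is the kind of interpretable signal a rule-based orchestrator can threshold on, alongside error streaks, hint usage, and latency.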

The critical point is this:
The LLM no longer decides pedagogy—it only expresses it.
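
A small sketch of that separation follows. Only the agent names come from the architecture described here; the priorities, thresholds, action labels, and function names are hypothetical, and the LLM is passed in as a plain callable so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class LearnerState:
    mastery: float          # BKT mastery estimate in [0, 1]
    attempts_on_item: int   # attempts made on the current item
    error_streak: int       # consecutive incorrect answers
    hint_requested: bool


@dataclass(frozen=True)
class Proposal:
    agent: str     # e.g. "ScaffoldBot"
    action: str    # e.g. "give_hint"
    priority: int  # lower number wins; this ordering is hypothetical


def propose(state: LearnerState) -> list[Proposal]:
    """Each agent inspects the state and may propose one action."""
    proposals = [Proposal("TutorBot", "prompt_attempt", priority=9)]
    if state.hint_requested:
        proposals.append(Proposal("ScaffoldBot", "give_hint", priority=2))
    if state.error_streak >= 3:  # assumed threshold
        proposals.append(Proposal("MotivatorBot", "encourage", priority=3))
    return proposals


def violates_constraints(p: Proposal, state: LearnerState) -> bool:
    """Hard constraint: no hint before the learner has attempted the item."""
    return p.action == "give_hint" and state.attempts_on_item == 0


def orchestrate(state: LearnerState) -> Proposal:
    """Deterministic selection: drop blocked proposals, take top priority."""
    allowed = [p for p in propose(state) if not violates_constraints(p, state)]
    return min(allowed, key=lambda p: p.priority)


def render(decision: Proposal, llm: Callable[[str], str]) -> str:
    """The single LLM call: it words the decision but cannot change it."""
    return llm(f"Write one short tutor turn that performs '{decision.action}'.")


if __name__ == "__main__":
    state = LearnerState(mastery=0.4, attempts_on_item=0,
                         error_streak=0, hint_requested=True)
    decision = orchestrate(state)
    print(decision.agent, decision.action)
    print(render(decision, llm=lambda prompt: f"[stub LLM] {prompt}"))
```

Because the hint is filtered out before the renderer is ever called, no phrasing of the prompt or the learner's request can reintroduce it.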

The Architecture in Practice

As illustrated in the system architecture (see Fig. 1 in the paper), the pipeline integrates:

A student model grounded in BKT mastery estimates
A context builder that aggregates learner state and curriculum
A deterministic orchestrator that applies priority rules and constraint checks
A set of specialised agents, each responsible for a distinct pedagogical function
A single LLM renderer that converts structured decisions into concise dialogue

All decisions, constraints, and agent activations are logged, enabling full traceability and auditability.
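
As a rough illustration of what such a log entry could contain, the sketch below serialises one tutoring turn as a JSON line. The field names and the trace_record helper are assumptions made for this example; the actual log schema is not described in this post.

```python
import json
from datetime import datetime, timezone


def trace_record(turn: int, state: dict, proposals: list[dict],
                 blocked: list[dict], selected: dict) -> str:
    """Serialise one turn's decision trail as a single JSON line."""
    record = {
        "turn": turn,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "state": state,          # learner snapshot the rules were applied to
        "proposals": proposals,  # what each agent suggested
        "blocked": blocked,      # proposals rejected by a constraint, with reason
        "selected": selected,    # the action handed to the LLM renderer
    }
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    hint = {"agent": "ScaffoldBot", "action": "give_hint"}
    line = trace_record(
        turn=1,
        state={"mastery": 0.4, "attempts_on_item": 0},
        proposals=[hint, {"agent": "TutorBot", "action": "prompt_attempt"}],
        blocked=[{**hint, "reason": "attempt_before_hint"}],
        selected={"agent": "TutorBot", "action": "prompt_attempt"},
    )
    print(line)
```

Storing the blocked proposals next to the selected one lets a reviewer see both what the tutor did and what it was prevented from doing.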

What This Enables

1. Interpretability by Design

Every instructional decision can be traced back to:

learner state (e.g. mastery probability)
rule-based policies
selected agent actions

This moves tutoring systems from opaque outputs to explainable processes.

2. Enforced Pedagogical Integrity

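Unlike prompt-based systems, ES-LLMs enforces constraints in the orchestrator's rules rather than requesting them through prompts, so adherence does not depend on the model's behaviour. In our simulations, the architecture achieved 100% compliance with rules such as “attempt before hint”.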

This ensures that learning principles—such as productive struggle—are not overridden by model behaviour.
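
One way a compliance figure of this kind could be computed is to replay logged turns and count how many hints were preceded by an attempt. The sketch below is not the paper's evaluation code; the field names and the toy data are invented purely for illustration.

```python
def attempt_before_hint_compliance(turns: list[dict]) -> float:
    """Fraction of delivered hints that followed at least one attempt."""
    hints = [t for t in turns if t["action"] == "give_hint"]
    if not hints:
        return 1.0  # no hints delivered, so the rule was never at risk
    compliant = sum(1 for t in hints if t["attempts_on_item"] > 0)
    return compliant / len(hints)


if __name__ == "__main__":
    # Toy trace for illustration only; these are not results from the paper.
    toy_turns = [
        {"action": "prompt_attempt", "attempts_on_item": 0},
        {"action": "give_hint", "attempts_on_item": 1},
        {"action": "encourage", "attempts_on_item": 2},
        {"action": "give_hint", "attempts_on_item": 3},
    ]
    print(f"{attempt_before_hint_compliance(toy_turns):.0%}")
```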

3. Coordinated Multi-Agent Teaching

The system can combine multiple pedagogical intents within a single response. For example, it can:

provide targeted remediation
offer encouragement during frustration
maintain ethical safeguards

These are not emergent behaviours—they are explicitly orchestrated.

4. Efficiency Without Compromise

Interestingly, the architecture also improves system performance:

54% reduction in token usage
22% reduction in latency
improved hint efficiency and learning fidelity

These gains arise from separating decision logic from language generation, allowing for more compact and focused LLM usage.

Empirical Validation

The system was evaluated through:

Monte Carlo simulations (N = 2,400)
Human expert review (N = 6)
Multi-LLM evaluation panels

Across these evaluations, ES-LLMs was preferred in:

91.7% of cases by human experts
79.2% of cases by LLM judges

It outperformed monolithic baselines across all seven pedagogical dimensions, particularly in scaffolding, feedback quality, and trustworthiness.

Beyond Education: A General Design Pattern

While developed for adaptive tutoring, the implications of ES-LLMs extend further.

The architecture exemplifies a broader principle:

Separate policy (decision-making) from generation (expression).

This pattern is relevant in any domain where:

rules must be enforced reliably
decisions must be auditable
language must remain flexible

Examples include healthcare systems, compliance workflows, and customer operations.

Closing Thoughts

The future of AI in education is not simply about more powerful models. It is about better-structured systems.

Monolithic LLMs gave us fluency.
What we need now is control, transparency, and pedagogical alignment.

The ES-LLMs public sandbox is now live. I would strongly encourage you to try it, test its limits, and see how a structured, interpretable approach to AI tutoring fundamentally changes the interaction.