One word is doing too much work. A long piece on what people actually mean when they say "AI" in 2026 — neural network computing, natural language processing, large language models, generative AI, and agentic AI — what each one is, how they fit together, and which conversation you are actually in.
The word AI has stopped carrying its weight. It used to be a research field. Then it became a product category. By 2026 it is a procurement label, a regulatory shorthand, a marketing prefix, a board-meeting agenda item, and a fundraising slogan — and the same three letters mean a different thing in each of those rooms.
This is not a small problem. When the audit committee asks "how exposed are we to AI risk?", the answer depends entirely on which AI they mean. When the regulator asks for an inventory of your AI systems, the answer depends entirely on which AI counts. When an engineer says "we use AI for that", the next thirty minutes depend entirely on which AI they meant.
There are five conversations hiding inside the word, and they are not the same conversation. Neural network computing. Natural language processing. Large language models. Generative AI. Agentic AI. They overlap, they nest, and they have arrived in production at different times for different reasons. Pretending they are one thing is the source of most of the bad decisions being made about AI in boardrooms in 2026.
This piece sorts them out. It is long on purpose. If you only want the headline, here it is: neural networks are the substrate, NLP is the older discipline, LLMs are one kind of neural network applied to language, generative AI is what you get when you let those networks produce outputs rather than just classify inputs, and agentic AI is what you get when you wire those outputs to a loop that can act. The rest of the piece is the working.
Neural network computing — the substrate
A neural network, in the engineering sense, is a function approximator. You give it inputs, you adjust a very large number of internal parameters until it produces the outputs you want on the examples you trained it on, and you hope — within bounds you do not always understand — that it generalises to inputs it has not seen. That is the whole trick. The neural part is a borrowed metaphor from the 1940s; the architecture does not work the way a brain works, and the people who built the modern field stopped pretending otherwise years ago.
What makes a neural network different from a hand-written program is the source of its behaviour. A traditional program is a sequence of instructions a human wrote down. A neural network is a sequence of multiplications and additions whose specific numbers were determined by an automated training process. The behaviour is in the numbers — billions of them in a modern model — and nobody read each one. You can inspect any individual weight, but you will not learn much by doing so. The system's behaviour is emergent from the aggregate.
Three properties of neural networks matter for everything that follows. First, they are differentiable, which is what makes the training process work: you can compute, for every weight, how much that weight contributed to the error on a given example, and nudge it accordingly. Second, they are statistical, which is what makes them powerful and what makes them fail: they are extracting patterns from data, not learning rules, and a pattern that holds in the training distribution will not always hold elsewhere. Third, they are opaque to introspection by their authors, which is what makes governance hard: why did it do that is a research question, not a debugging question.
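The first property, differentiability, is easier to see in a toy than in prose. What follows is a minimal sketch, assuming a single artificial neuron and a four-row training set for logical OR; the function names and the learning rate are illustrative choices, not anything a production framework would use. It shows the loop described above: compute how much each weight contributed to the error, then nudge it.

```python
import math

def sigmoid(z: float) -> float:
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, steps=2000, lr=0.5):
    """Fit two weights and a bias by plain gradient descent on log loss."""
    w1, w2, b = 0.0, 0.0, 0.0
    n = len(examples)
    for _ in range(steps):
        g1 = g2 = gb = 0.0
        for (x1, x2), y in examples:
            p = sigmoid(w1 * x1 + w2 * x2 + b)
            err = p - y          # gradient of log loss w.r.t. the pre-activation
            g1 += err * x1       # contribution of w1 to the error
            g2 += err * x2       # contribution of w2 to the error
            gb += err            # contribution of the bias
        # Nudge every parameter against its average gradient.
        w1 -= lr * g1 / n
        w2 -= lr * g2 / n
        b -= lr * gb / n
    return w1, w2, b

def predict(params, x1, x2):
    """Return the neuron's output, read as a probability of class 1."""
    w1, w2, b = params
    return sigmoid(w1 * x1 + w2 * x2 + b)

# Toy training set: logical OR.
OR_EXAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
params = train(OR_EXAMPLES)
```

Nobody wrote the rule "output 1 if either input is 1"; the three numbers in `params` encode it. A modern model does the same thing with billions of parameters, which is where the opacity comes from.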
Neural networks are not new. The mathematics goes back to the 1940s. The training algorithms that made them tractable are from the 1980s. They went through a long winter in the 1990s, when other approaches were better for most practical problems, and came back into favour in the early 2010s when three things happened at once: there was suddenly a lot of training data on the internet, graphics cards turned out to be good at the kind of arithmetic neural networks need, and a series of architectural improvements made deeper networks trainable. By the mid-2010s neural networks were the state of the art across most of what was then called machine learning — image classification, speech recognition, machine translation, recommendation systems, fraud detection.
This is the point that gets lost in 2026. When a bank says they have been using AI for ten years, they mean neural networks. When a fraud team detects an unusual transaction pattern, that is a neural network — a small, purpose-built, deterministic one — doing what neural networks have been doing for over a decade. When a hospital's imaging system flags a possible tumour, that is a neural network. When your phone unlocks because it recognises your face, that is a neural network. None of these are large language models. None of them are generative. None of them are agents. They are the substrate, and most of the AI that actually runs in production in 2026 is still this kind of AI.
I have argued this point before and I will argue it again: the neural-network-as-classifier shape is still the dominant operational form of AI in regulated industries, and any conversation about AI risk that treats the classifier and the chatbot as the same category is making a category error, because the two kinds of system do not fail in the same ways. The classifier does not write you a sonnet. It does not invent a fact. It does one thing on a fixed schema and produces a number. The governance conversation about that kind of system is a decade old and largely solved. The governance conversation about the other kinds is barely started.
Natural language processing — the older discipline
Natural language processing was an academic field before it was a product category. The premise — can a computer be made to do useful work with text and speech — is old enough to have weathered three or four winters of its own. For most of its history, NLP was a collection of narrow, brittle, hand-tuned components: a part-of-speech tagger that knew book could be a noun or a verb, a parser that turned a sentence into a tree, a named-entity recogniser that knew London was a place, a sentiment classifier that knew terrible was negative.
These systems worked. They still work. The compliance team at a bank in 2026 that is scanning emails for regulatory keywords is, at the engineering layer, running an NLP pipeline that would look familiar to a researcher from 2008. The call-centre analytics that produces a sentiment score every Monday is a pipeline of trained classifiers. The transcript redaction tool that pulls personal data out of legal disclosure is a pipeline of recognisers. The fact that LLMs can now do these things in one shot does not make the older pipelines obsolete — it makes them an option, and often a cheaper, more deterministic, more auditable one.
Three things distinguish classical NLP from what came after. First, it is task-specific by design. You build a sentiment classifier for product reviews, you build a named-entity recogniser for medical notes, and you build a parser for English news text. Each component has a job. Second, it is composable. You stitch components into a pipeline — tokenise, tag, parse, classify — and the pipeline's behaviour is the composition of the components, each of which you can inspect and replace. Third, it is mostly deterministic at the consequential layer. The model that does the classification might be statistical, but the decision boundary it produces is fixed at training time, and the pipeline that uses it can be tested, monitored, and bounded.
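The composability point can be sketched in a few lines. This is a deliberately crude illustration, assuming a hand-written sentiment lexicon and a three-stage pipeline; the word lists and stage names are invented for the example, and a real deployment would use trained components at each stage. What it shows is the shape: each stage has one job, and each can be inspected, tested, or swapped without touching the others.

```python
import re
from typing import Callable, List

# Hypothetical lexicon, standing in for a trained sentiment component.
NEGATIVE = {"terrible", "awful", "broken"}
POSITIVE = {"great", "excellent", "reliable"}

def tokenise(text: str) -> List[str]:
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def score(tokens: List[str]) -> int:
    """Count positive words minus negative words."""
    return sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)

def classify(score_value: int) -> str:
    """Apply a fixed decision boundary to the score."""
    if score_value > 0:
        return "positive"
    if score_value < 0:
        return "negative"
    return "neutral"

def pipeline(text: str, stages: List[Callable]):
    """Run the input through each stage in order."""
    value = text
    for stage in stages:
        value = stage(value)
    return value

STAGES = [tokenise, score, classify]
```

Because the decision boundary in `classify` is fixed, the same input always produces the same verdict, which is the property the audit conversation depends on.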
What changed in the 2010s was the move from feature engineering to representation learning. Classical NLP relied on humans deciding which features of the text mattered — word stems, part-of-speech tags, syntactic structure — and the system learned how to combine those features. The shift began with word embeddings in 2013, which produced learned numerical representations of words that captured meaningful semantic relationships, and accelerated through 2017 when the transformer architecture appeared. The transformer did not just improve translation, which is what its original paper was about. It produced a general-purpose engine for sequence-to-sequence problems that absorbed most of NLP over the next five years.
By 2020, the boundary between NLP-as-a-field and the deep-learning research community had largely dissolved. By 2024, almost no new NLP system was being built from scratch with the classical-pipeline approach when a language model could be fine-tuned to do the job. By 2026, the field exists in two layers: the operational deployments that still run the older shape because they work, are cheap, and meet a governance bar that newer systems struggle to meet; and the frontier work, which is now indistinguishable from large language model research and has stopped being called NLP in the new literature.
The reason this matters for the wider 2026 conversation is that NLP did not disappear. It is the part of the picture that has been doing useful work in production for the longest, in the most regulated industries, with the best-understood governance. When a regulator asks about your AI use, the answer probably includes a half-dozen NLP systems that nobody on the management team thinks of as AI any more, because they have been there for so long. They count. They are also, often, the easiest part of the inventory to explain.
Large language models — what they actually are
A large language model is a neural network trained to predict the next token in a sequence. That is the entire mechanism. Everything else — the conversational behaviour, the apparent reasoning, the ability to write code, the willingness to follow instructions, the curious failures — is downstream of that single objective applied at very large scale across very large amounts of text.
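To make the objective concrete, here is a minimal sketch of next-token prediction, assuming the crudest possible model: a table of which word followed which in a ten-word corpus. It is not a neural network and it is nothing like an LLM in capacity; the function names and the corpus are invented for illustration. The objective, though, is the same one.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count, for every token, which tokens followed it in the corpus."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, prev):
    """Return the most frequent follower of prev, or None if prev was never seen."""
    followers = counts.get(prev)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram(corpus)
```

Asked what follows "the", this model answers "cat", because that is what followed it most often in training. It has no idea whether a cat exists. Scale the table up into a transformer and the corpus up into trillions of tokens and the answers become far better, but the relationship between the model and the truth does not change in kind.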
The architecture is the transformer, which I will not re-derive in full here. The two ideas that matter are tokenisation and attention. Tokenisation breaks text into chunks — not always words, often pieces of words — and converts each chunk into a number. The model never sees text; it sees a sequence of integers. Attention is the mechanism that lets the model, when predicting the next token, weigh up which earlier tokens are relevant and how much. The combination of attention with the surrounding feed-forward layers, repeated many times in a stack, is what gives a transformer its capacity to track context, recognise structure, and produce outputs that look like reasoning even when the underlying computation is statistical pattern completion.
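Both ideas fit in a short sketch. The assumptions are loud ones: whitespace splitting stands in for a real subword tokeniser, the vectors are two-dimensional and hand-picked, and only a single attention step for a single query is shown, with none of the learned projections a transformer actually uses. The helper names are illustrative.

```python
import math

def build_vocab(tokens):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for t in tokens:
        vocab.setdefault(t, len(vocab))
    return vocab

def softmax(xs):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query over earlier positions."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)            # how relevant each earlier token is
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Tokenisation: the model sees integers, not text.
words = "the cat saw the dog".split()
vocab = build_vocab(words)
ids = [vocab[w] for w in words]

# Attention: the query is closest to the first key, so the first value dominates.
weights, output = attention([5.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[10.0, 0.0], [0.0, 10.0]])
```

The output is a weighted blend of the values, with the weights decided by how well the query matches each key. A real transformer repeats this across many heads and many layers, with the queries, keys and values all learned.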
Training happens in stages. The first stage is pre-training: the model is shown enormous quantities of text — books, web pages, code repositories, academic papers, transcripts, forum posts — and trained to predict the next token over and over again until its predictions are good enough that the patterns of human language and human knowledge have been captured implicitly in its weights. The second stage is alignment or post-training: a much smaller volume of curated examples, often with reinforcement learning from human or AI feedback, teaches the model to behave conversationally, follow instructions, refuse certain categories of request, and reason in particular ways. By 2026, the post-training step has become at least as important to a model's character as the pre-training step, and the proprietary recipes are where most of the differentiation between frontier vendors now sits.
The scale is the part that does not have a good intuitive analogue. A frontier model in 2026 has on the order of hundreds of billions to a few trillion parameters. It was trained on trillions of tokens of text. Training the model cost somewhere between tens and hundreds of millions of pounds in compute alone. Inference — running the model to answer a question — is cheap by comparison, but still meaningful at scale, and the price-per-token has been falling roughly an order of magnitude per year for the same capability tier, which is the single most important economic fact about AI in 2026.
The behaviour of these models is what makes them interesting and what makes them dangerous to reason about loosely. They produce text that is coherent, fluent, often correct, and sometimes wrong in ways that are confidently expressed and difficult to spot. They have no internal mechanism for tracking which of their statements are grounded in their training data and which are pattern-completion fabrications — hallucinations, in the usual jargon, though the term flatters the model by implying something more cognitive than what is happening. They are not reasoning in the formal sense; they are producing text that looks like the text reasoning produces. Sometimes that is enough. Sometimes it is not.
The capabilities frontier in 2026 is genuinely impressive. Frontier models from Anthropic, OpenAI, Google, xAI, and DeepSeek now cluster within a few points of each other on the benchmarks that have not yet saturated. They can solve graduate-level reasoning problems that were considered out of reach two years ago. They can write production-grade code in most languages. They can hold a multi-step technical conversation with a domain expert and contribute usefully. They can also tell you with great confidence that a real library function exists when it does not, that a real person said something they did not say, and that a real event happened differently from how it did. Both behaviours come from the same mechanism.
The procurement-relevant fact about LLMs in 2026 is that there are now more legitimate options than there were eighteen months ago. Hosted frontier models from the big vendors are still the most capable for the hardest problems, and most regulated buyers still use them. Open-weight models — Llama, Mistral, Qwen, DeepSeek, GLM and others — have caught up at the middle of the capability range and are closing on the frontier. They can be self-hosted, fine-tuned, audited, and deployed on premises in a way that the hosted models cannot. For organisations with strict data-residency, audit, or sovereignty requirements, the open-weight option has moved from interesting in theory to legitimate in production inside about twelve months. This is the trajectory that will shape most of the procurement decisions of the next twenty-four months.
The thing to hold onto is that an LLM is a neural network. It sits inside the broader category of neural network computing; it is one application of that mathematics. It happens to be the application that has captured every conversation about AI in 2026, but it is not the whole of AI, and pretending otherwise will get the inventory question wrong.
Generative AI — the umbrella that does not quite hold up
Generative AI is a marketing category that has become a procurement category. Whether it is a useful technical category is more debatable. The idea is that there is a class of model whose job is to produce content — text, images, code, audio, video, three-dimensional models, molecular structures — rather than to classify or score existing content. The contrast is with discriminative models, which produce a verdict on an input. A discriminative model says this email is spam. A generative model produces a new email.
The distinction is real but the boundary is fuzzy. Most modern frontier systems do both. An LLM can classify your email as spam if you ask it to; the classifier is just a constrained instance of the generator producing a one-token answer. Some generative image models, generative adversarial networks in particular, are trained against a discriminative model. The terms are most useful for distinguishing the deployment shape rather than the underlying mathematics: when you deploy a model to produce something a person will read or look at, the operational risks are different from when you deploy it to produce a score someone will act on.
The generative AI category in 2026 includes large language models — by far the largest part of the spend and the attention — but also image generators (Midjourney, Imagen, the open-weight Flux family), video generators (Sora, Veo, Runway), audio generators (Suno, ElevenLabs for voice, the music-generation tools), and an increasingly sophisticated set of multi-modal systems that combine these in one model. The biggest 2026 shift has been the maturation of video generation. The synthetic video produced by frontier models is now consistent enough across shots, controllable enough by prompt, and cheap enough by minute that the practical question for most organisations is not can it be done but how do we know what we are looking at. The provenance and authentication question that started with the deepfake conversation in 2018 is now an operational problem with a procurement budget attached.
The risks of generative AI sit in two layers. The first is the risks the system itself creates: hallucinated facts, fabricated citations, plausible-looking code that does not work, images that fail in subtle ways, content that infringes copyright in ways that are hard to detect until someone notices. The second is the risks the system enables others to create: phishing emails that no longer have the spelling mistakes that made them spottable, voice clones used to authorise fraudulent payments, synthetic intimate imagery of real people, AI-generated child sexual abuse material at industrial volumes, automated influence operations. The 2026 reality is that both layers are now in everyday operational scope, and treating them as separate conversations — our use of AI versus AI being used against us — produces an incomplete picture of the actual exposure.
The procurement question about generative AI in 2026 is not should we use it. The conversation has moved past that. It is which categories of work in our organisation are appropriate for generative output, with what review layer, by whom, with what audit trail, and what is the failure mode when the output is wrong and acted upon anyway. The organisations that have answered those questions clearly are getting useful work out of generative systems. The organisations that have answered we have a policy — without the per-category detail — are mostly running uncontrolled experiments and finding out the failure modes the slow way.
Agentic AI — the layer on top
An agent, in the modern sense, is a system that can take an instruction, decide on a sequence of actions to carry it out, take those actions in the world, observe the results, and adjust. The headline distinction between an LLM and an agent is the loop. An LLM responds. An agent responds, then acts, then responds to what its action produced, then acts again, until some terminating condition is met or its supervisor stops it.
The components of an agent in 2026 are reasonably standardised. There is a model, usually an LLM, sometimes more than one, doing the reasoning. There is a set of tools — functions the model is allowed to call — that let it do things outside the conversation: read a file, query a database, search the web, send an email, run a shell command, call an API, post a payment. There is a memory of some kind, which might be the conversation history, a vector database, a structured store of facts the model has learned about the task, or all three. There is an orchestration layer that decides when to call the model, how to format its responses, what tools to expose, and when to stop. By the end of 2025 this stack had crystallised; by mid-2026 most enterprise agent deployments are some shape of it.
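The four components can be sketched together. Everything named here is hypothetical: `stub_model` is a hard-coded stand-in for an LLM, `lookup_balance` is an invented tool, and the memory is just a list of strings. No real framework's API is being shown. The point is the loop: ask the model, run the tool, record what happened, ask again, and stop when told to or when the step limit is hit.

```python
from typing import Callable, Dict, List, Tuple

def stub_model(memory: List[str]) -> Tuple[str, str]:
    """Hypothetical stand-in for an LLM: returns (action, argument)."""
    if not any(m.startswith("observation:") for m in memory):
        return ("lookup_balance", "ACC-1")
    latest = memory[-1].split(": ", 1)[1]
    return ("finish", "Balance retrieved: " + latest)

# Tools: the only things the agent can do outside the conversation.
TOOLS: Dict[str, Callable[[str], str]] = {
    "lookup_balance": lambda account: {"ACC-1": "120.50"}.get(account, "unknown"),
}

def run_agent(task: str, model, tools, max_steps: int = 5):
    """Orchestration layer: call the model, run tools, record results, stop."""
    memory = ["task: " + task]                      # memory: the running history
    for _ in range(max_steps):
        action, arg = model(memory)
        if action == "finish":
            return arg, memory
        if action not in tools:
            memory.append("observation: unknown tool " + action)
            continue
        memory.append("action: " + action + "(" + arg + ")")
        memory.append("observation: " + tools[action](arg))
    return "stopped: step limit reached", memory

answer, history = run_agent("What is the balance on ACC-1?", stub_model, TOOLS)
```

Swap the stub for a real model and the lambda for a real API and the structure is unchanged, which is why the tool list and the step limit, rather than the model, are where most of the governance leverage sits.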
The conceptual breakthrough that made agents work in 2024–25 was the realisation that the language model itself could be used to plan, reflect, and select tools, rather than being treated as a single-shot completion engine. The Model Context Protocol, introduced by Anthropic in late 2024, has become the de-facto standard for connecting tools to models in 2026, and is doing for agent-to-tool integration what HTTP did for browser-to-server in the 1990s — making the components portable, the marketplace broader, and the lock-in lower. An MCP-compliant Salesforce connector works with an agent running on top of Claude, an agent running on top of GPT, or an agent running on top of a self-hosted Llama, without modification. This is the kind of standardisation that quietly reshapes a market.
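For a sense of what the standardisation looks like on the wire, here is a rough sketch of a tool invocation message. It assumes, based on my reading of the published specification, that MCP messages are JSON-RPC 2.0 and that a tool is invoked with the `tools/call` method carrying a `name` and an `arguments` object; check the current specification before relying on any field. The tool name `crm_lookup_account` and its argument are invented for illustration.

```python
import json

def tools_call_request(request_id: int, tool_name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request shaped like an MCP tool invocation (assumed shape)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

# Hypothetical tool and argument, for illustration only.
request = tools_call_request(1, "crm_lookup_account", {"account_id": "ACC-1"})
wire_format = json.dumps(request)
```

Nothing in the message says which model asked for the call. That is the portability argument in miniature: the connector sees a request in a common shape, whatever sits on the other side.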
The economics of agents are different from the economics of LLMs because the loop changes the cost. An LLM call costs a known amount; an agent call costs that amount multiplied by however many turns the loop takes, which is often unpredictable and occasionally pathological. The early agent deployments of 2024 were notorious for racking up bills because a model would get stuck retrying a tool that was failing, or take fifteen turns to do something that should have taken three, or hit a recursion that nobody had bounded. By 2026 most of the production frameworks have solved this with hard step limits, cost ceilings, and instrumentation that surfaces what the loop is actually doing. The vendors who skipped that step are now being sued by their customers.
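The controls are simple to state in code. This is a minimal sketch, assuming each turn reports its own cost; the class and function names are invented, and production frameworks implement the idea with more instrumentation than is shown here.

```python
class BudgetExceeded(Exception):
    """Raised when a loop breaches its step limit or cost ceiling."""

class LoopBudget:
    def __init__(self, max_steps: int, max_cost: float):
        self.max_steps = max_steps
        self.max_cost = max_cost
        self.steps = 0
        self.spent = 0.0

    def charge(self, cost: float) -> None:
        """Record one turn and stop the loop if either limit is breached."""
        self.steps += 1
        self.spent += cost
        if self.steps > self.max_steps:
            raise BudgetExceeded("step limit")
        if self.spent > self.max_cost:
            raise BudgetExceeded("cost ceiling")

def run_with_budget(turn_costs, budget: LoopBudget):
    """Run turns until they finish or the budget stops them.

    Returns (turns completed, reason for stopping).
    """
    completed = 0
    try:
        for cost in turn_costs:
            budget.charge(cost)
            completed += 1
    except BudgetExceeded as exc:
        return completed, str(exc)
    return completed, "ok"

# A loop that wants fifteen turns is cut off after ten.
stuck = run_with_budget([0.02] * 15, LoopBudget(max_steps=10, max_cost=1.0))
```

The total bill is the per-turn cost multiplied by the number of turns, so bounding either factor bounds the bill. A turn that would breach the budget is not counted as completed.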
The governance conversation about agents is younger than the agent technology, and that gap is where the interesting failures are happening. Agents inherit the credentials of whoever runs them. An agent operating under a senior engineer's credentials has senior engineer powers; an agent operating under an admin account has admin powers. There is no equivalent in current frameworks of the human hesitation that stops a senior engineer from dropping the production database during a long Tuesday afternoon. The famous nine-second incident — an AI coding agent deleting a production database and its backups in nine seconds, on the basis of an instruction the founder did not realise the agent would take as a command — is the operational shape of this risk, and it is the shape we will be seeing repeated across industries until the governance catches up.
The Gartner prediction that circulated in early 2026, that more than forty per cent of agentic AI projects will be cancelled by the end of 2027, gets quoted as if it is a damning statistic about the technology. It is not. It is the normal pattern of an immature category meeting an enterprise procurement bar. The cancellations are happening because the early deployments were sold on capability and procured without the discipline that any production-class system needs: scope, audit, cost ceiling, irreversibility controls, human-in-the-loop for consequential actions, and a clear answer to "what does this agent see and do" that survives contact with the audit committee. The deployments that survive 2026 are the ones whose buyers asked those questions before they signed.
The categories of agent deployment that I see making it through to durable production work are the constrained ones: agents that operate inside a bounded action vocabulary, with deterministic decision-making at the consequential layer, with human review on irreversible actions, with audit trails that the agent cannot suppress, and with credential scopes designed for the agent rather than borrowed from a person. The categories that are mostly failing are the unbounded-operator agents — the demos where an agent is given a chat window and a credit card and told to just figure it out. They are entertaining at conferences. They are not yet trustworthy enough to put into a regulated production environment.
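The constrained shape can be sketched as a gate that sits between the agent and its tools. The assumptions: action names are plain strings, human approval arrives as a boolean, and the audit log is an in-memory list. All of the names are invented. In a real deployment the log would live somewhere the agent's credentials cannot write to, which is the part this toy cannot show.

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class ActionPolicy:
    allowed: Set[str]                  # the bounded action vocabulary
    irreversible: Set[str]             # actions that need a human first
    audit_log: List[str] = field(default_factory=list)

    def authorise(self, action: str, human_approved: bool = False) -> bool:
        """Decide deterministically whether an action may run, and log the decision."""
        if action not in self.allowed:
            decision = "denied: outside action vocabulary"
        elif action in self.irreversible and not human_approved:
            decision = "held: needs human approval"
        else:
            decision = "allowed"
        # The gate writes the log entry; the agent never does.
        self.audit_log.append(f"{action} -> {decision}")
        return decision == "allowed"

policy = ActionPolicy(
    allowed={"read_record", "draft_email", "issue_refund"},
    irreversible={"issue_refund"},
)
```

The decision logic contains no model call. Whatever the agent asks for, the gate's answer is a function of the action name and the approval flag alone, and refusals are logged as faithfully as approvals.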
How the categories fit together
If you draw the picture cleanly, you get something like a stack. Neural networks are the bottom layer — the mathematics, the substrate, the way any of this works at all. NLP sits above that as one of the application domains that uses neural networks, alongside computer vision, speech, recommendation, and several other older fields. Large language models are one particular kind of neural network applied to language; they have absorbed most of what used to be called NLP and added a lot more besides. Generative AI is the deployment category that uses LLMs and their multi-modal cousins to produce content rather than score it. Agentic AI is the orchestration layer above generative AI, where you take a model and wire it into a loop that can act.
That picture is approximately right and worth keeping in your head. The bit that needs adding is the older operational reality I have stressed throughout this piece: the neural-network-classifier layer never went away. Most of the actual AI in production in a 2026 regulated environment is a fraud detector or an imaging classifier or a recommendation engine or an NLP pipeline that has been there for years and is not what anyone means when they say our AI strategy. Any honest inventory has to include those, because regulators, courts, and incident responders are going to.
The other complication is that the categories overlap. An LLM is a neural network and a piece of NLP and a generative model. A modern agent has at least one LLM inside it, often more, and frequently calls non-LLM neural networks as tools — a small classifier to triage an incoming email, a vision model to read a screenshot, a fraud-detection model whose verdict the agent then explains. A generative system can include a discriminative model as a quality filter on its outputs. A piece of classical NLP can be wrapped in an LLM that translates between the user and the pipeline. The clean stack is a useful first approximation; the actual systems running in 2026 are tangled in the way real systems always are.
Where 2026 actually sits
A short list of the things that are now operationally true that were not when most of this conversation started.
Frontier model performance is converging. The gap between the best closed model and the best open-weight model on most benchmarks is now small enough that for many production deployments the question of which to pick is decided by deployment shape rather than capability. The vendors who used to compete on our model is smarter are now competing on cost, latency, context length, multi-modal coverage, tool-use behaviour, and how their model behaves inside an agent loop. The capability frontier still moves; it just moves more slowly relative to the diffusion of last year's frontier into open weights.
Inference cost has fallen another order of magnitude. A token from a frontier-tier model in mid-2026 costs roughly a tenth of what an equivalent-quality token cost in mid-2025, which itself was roughly a tenth of the mid-2024 price. The price-per-token curve has been the most consistent thing in the field for three years and shows no sign of breaking. The implication for procurement is that the cost model you signed off twelve months ago is materially wrong; the implication for security is that the same compute budget can now power ten times the activity, defensively or offensively.
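The arithmetic is worth writing down, because it compounds. This is a back-of-envelope sketch, assuming the tenfold yearly decline described above continues unchanged, which is an extrapolation rather than a forecast; the function names and the starting price are invented for illustration.

```python
def projected_price(price_now: float, years: int, yearly_factor: float = 0.1) -> float:
    """Price per unit after a number of years of constant proportional decline."""
    return price_now * yearly_factor ** years

def activity_multiplier(years: int, yearly_factor: float = 0.1) -> float:
    """How much more activity a fixed budget buys after the same decline."""
    return 1 / yearly_factor ** years

# Hypothetical starting point: 10.0 currency units per million tokens.
two_years_on = projected_price(10.0, years=2)
same_budget_buys = activity_multiplier(years=2)
```

Two years at this rate takes a price to roughly a hundredth of where it started, and lets a fixed budget pay for roughly a hundred times the activity. A cost model that is twelve months old is out by a factor of ten on its largest variable.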
Open-weight is a serious option. By mid-2026 the open-weight models — DeepSeek's reasoning models, the Llama family, Mistral, Qwen, GLM — are operationally credible alternatives to the hosted frontier for a wide range of work. For regulated buyers with data-residency or audit requirements, the on-premises deployment of a frontier-capable open-weight model is now a procurement option rather than a research project. The supply chain implications — what does it mean that several of the leading open-weight providers are based in China, what do we know about the training data, what is the threat model for a model whose weights we self-host — are being worked out in real time and will not be resolved this year.
MCP is the standard. The Model Context Protocol has become the integration layer for agentic systems the way ODBC became the integration layer for databases. Anyone shipping an agent in 2026 is either MCP-compliant or explaining why not. The marketplace of MCP-compliant tools is now in the thousands and is one of the better leading indicators of how this stack is going to evolve.
Regulation is no longer ahead of the technology, and is no longer behind it either. The UK Cyber Security and Resilience Bill, the ICO's five-step guide, the BoE/FCA/HMT joint statement on AI in financial services, the Cabinet Office letter to permanent secretaries — the regulatory apparatus has now produced enough material that the question of what does compliance look like has an answer, even if the answer is still being refined. The polite phase, as I described it elsewhere, is over. The procurement bar has risen accordingly.
The agentic-AI deployment failure rate is roughly what you would expect for an immature category. The cancellations make a good headline but the survivors are converging on a recognisable shape. Two years from now most of the discussion will be about the survivors and the shape, not about the cancellations and the hype.
What this picture should change for the reader
If you sit on a board, the practical thing to take from this is that the question "are we using AI safely?" is not a question. It is a placeholder for at least five separate questions, and a management answer that conflates them tells you nothing useful. The questions worth asking, in order:
Which of our existing systems use the older neural-network-classifier shape, and are they on the inventory?
Which of our operational pipelines use classical NLP, and are they on the inventory?
Where are we using generative models, by whom, with what review layer, and what is the failure mode?
Which of our deployed agents have credentials at production scope, and what could each one do in nine seconds if it misinterpreted an instruction?
What is our supply-chain exposure to model vendors, hosting providers, and the data-residency implications of each?
An organisation that can answer those five with specifics is in better shape than most. An organisation that answers them with "we have a policy" is in the same place most boards were two years ago, which is to say, behind where the regulators are heading.
If you are a CISO or a senior practitioner, the operational shape that has emerged is the one I sketched in the agentic section: constrained action vocabularies, deterministic decision-making at the consequential layer, audit trails the agent cannot suppress, credential scopes designed for the agent, human-in-the-loop on irreversible actions. The vendors who engineered for that shape years ago are now well-positioned. The vendors who treated the conversational demo as the product are doing the work now. The category is going to get its proper procurement-class name some time in the next eighteen months; the architectural disciplines underneath it are already settled.
If you are an engineer, the change to your day-to-day is that the LLM has become a primitive in the same way that the database, the cache, and the HTTP client are primitives. You will be using it. The interesting work is no longer can the model do this — almost any model can do almost anything you ask it once — but can the system around the model do this reliably, at cost, under failure, with the audit story I need. That is a systems engineering problem dressed up as an AI problem, and the engineers who treat it as a systems engineering problem are producing systems that work.
A short close
The reason this piece is long is that the conversation has been allowed to be short for too long. AI became a single word that hid five different conversations, and every one of those conversations has its own technical shape, its own procurement story, its own governance pattern, and its own failure modes. Holding them as one category is the source of most of the bad decisions, and most of the good decisions begin with the question which one are we actually talking about.
Neural network computing is forty years old in its modern form and is doing most of the operational work in regulated industries today. Natural language processing is older than that and is still in production wherever a deterministic, auditable, narrow component is the right answer. Large language models are five years old in their public form and have changed the shape of what a single piece of software can do. Generative AI is the deployment category around them; it is real, it is consequential, and it is being held to a higher procurement bar than it was twelve months ago. Agentic AI is the youngest and the loudest of the five; it is finding its operational shape now, and the shape that survives is going to look much more like the disciplined classifiers of 2018 than like the unbounded operator demos of 2024.
In 2026 the right answer to "what is AI?" is "which of the five do you mean?". Anyone who treats the word as if it has a single referent is going to be wrong about something important, soon, and probably in production.