Open-source models and the on-prem option

Post 5 of the AI series. Llama 3, Mistral, Mixtral. The serious open-source LLM era arrived in 2024. What it means for security teams who do not want to send data to a hyperscaler, and how the on-prem path reads against EmilyAI's single-tin posture.

_Post 5 of the AI in cyber series._

The serious open-source LLM era arrived in April 2024. Meta released Llama 3 — 8B and 70B parameter variants — under terms that allowed commercial use up to a usage threshold most enterprises would not hit. Mistral had already shipped credible open-weight models through 2023. Mixtral 8x7B — a mixture-of-experts model that punches above its parameter count — was being adopted in enterprise pilots through the spring.

For security teams, this matters in a specific and underdiscussed way. The hosted LLM products — OpenAI's, Anthropic's, Google's — require sending data to the vendor's infrastructure. For some customers, in some jurisdictions, on some data classes, that is structurally not acceptable. The open-source LLM wave has, for the first time, given those customers a path to LLM-based security tooling that does not cross their data boundary.

This post is about what that path looks like, what it actually delivers, and how it reads against EmilyAI's single-tin on-premises deployment posture.

What is operationally different in mid-2024

Three things have shifted since the start of the year.

Open-source models have caught up with the hosted incumbents on most enterprise benchmarks. Llama 3 70B is, on most published evaluations, within striking distance of GPT-4 and Claude 3 Sonnet. Mixtral 8x22B is roughly comparable on some tasks and behind on others. The gap that existed in early 2023 — the hosted models are clearly better at the tasks I care about — is narrow enough that, for many use cases, the open-source option is the right one.

The hardware to run them locally is now affordable. A pair of NVIDIA L40S GPUs (the cards EmilyAI uses) can run Llama 3 70B at production-grade inference speeds. A single A100 or H100 will do likewise. The total hardware cost for a server that can serve a 70B-parameter model is in the range of £30,000 to £70,000 depending on configuration — well within the reach of any enterprise that considered the hosted LLM offering at all.

The inference tooling has matured. vLLM, TensorRT-LLM, SGLang, and several other inference servers can take an open-weight model and serve it at high throughput on a small cluster. The DIY-then-tune workflow is now a Saturday rather than a project.

The combination — open weights, affordable hardware, mature inference stack — means that, for the first time, an enterprise can deploy an LLM-based security tool entirely within its own infrastructure. The hosted LLM is no longer the only option.

What this does not solve

Three things the open-source path does not solve, that the hosted LLM does not solve either, and that are easy to overlook in the enthusiasm.

Non-determinism. Llama 3, Mistral, Mixtral, and every other current LLM is non-deterministic in the same way GPT-4 and Claude are. The discussion in post 2 of this series applies. Running an LLM locally does not, by itself, make it reproducible.

The continuous learning loop. The hosted LLMs have the benefit (and the risk) of being updated centrally as the provider retrains. Open-source models you download once and then operate. The model version is fixed at the moment you took the snapshot. Retraining or fine-tuning is your work to do.

The verdict-versus-augmentation question. An open-source LLM is the same shape of tool as a hosted one. It is good at the augmentation tasks (query translation, alert summarisation, knowledge-base navigation). It is not, in its raw form, a tier-two SOC analyst.

What it does solve

Specifically, three things.

Data sovereignty. Your customer data does not leave your infrastructure. For UK firms with data-residency requirements, this is the largest single benefit. The legal review of can we put this data into a US-hosted LLM becomes can we put this data into a model running on a UK-based server we control. The latter question is much easier to answer in most regulatory regimes.

Cost predictability. Hosted LLM inference is metered. Open-source LLM inference on your own hardware is paid up front. For a security team that processes high alert volumes, the on-prem option is meaningfully cheaper at scale — and the cost is predictable rather than usage-dependent.

Model stability. The model you deployed last quarter is the model you are still running, unless you choose to update. The hosted-LLM problem of the model has changed under us and our prompts no longer behave the same way is absent.

EmilyAI's single-tin posture

This is where the comparison gets interesting.

EmilyAI's on-premises deployment option — a single Dell PowerEdge R760, racked at the customer's site, with the same configuration management material that deploys our hosted topology — was designed in 2018 for customers who would not put data into a third-party-hosted SaaS. At the time, this was a small minority. By 2024, it is a significantly larger minority and increasingly the default expectation in some sectors (CNI, defence, certain financial services).

The R760 specification supports Llama 3 70B alongside EmilyAI's own analyst model. A customer who wants both — EmilyAI as the deterministic analyst, an open-source LLM as the augmentation tool for human analysts — can have both on the same box. The architecture supports it because the analyst core is decoupled from the augmentation layer.

The deliberate position we took in 2018 — the analyst is purpose-built and deterministic; the augmentation layer is a separate concern — turns out to be the right shape for the on-premises LLM era. The analyst is not where the LLM lives; the LLM is an augmentation tool the analyst can use, and the LLM is replaceable as the open-source models improve without touching the analyst.

The on-premises customer who buys an EmilyAI deployment in 2024 is buying, in effect, a complete on-premises SOC AI stack — analyst plus optional LLM augmentation, with the entire pipeline running on their hardware, with their data, with their auditable trail.

What I think happens next

Three predictions.

Open-source LLMs will keep closing the gap. Llama 4, Mistral's next release, the next Mixtral. The gap to the hosted incumbents will continue to narrow. By the end of 2024 the gap will be uninteresting for most security tasks.

The hosted LLM advantage will shift to high-end reasoning. OpenAI, Anthropic, and Google will keep moving up the model-size and reasoning-capability axis. The gap at the top will persist, and may grow, even as the gap in the middle disappears.

Hybrid deployment becomes the default. Customers will run open-source LLMs locally for the bulk of their work, with hosted LLM access for the small minority of tasks that need the high-end capability. The architecture that supports this — open-source LLM as the default, hosted LLM as the escalation path — is what most enterprise security stacks will look like by 2026.

What is next

In eight weeks (slightly off-cadence to take in CrowdStrike's July outage): a piece on single-vendor concentration in security AI, what the CrowdStrike Falcon update event teaches us about it, and the architectural defences worth building now.