How to Run an LLM Locally: Tools, Models, and Step-by-Step Setup

Mon, Feb 23, 2026 · 10 min read

Running a local LLM means downloading one of the many available large language models to your own machine and running it without any internet connection or API calls. Your prompts and responses never leave your computer. No cloud services involved. No per-token pricing. No data being logged by OpenAI, Google, or anyone else.

The tools for running local LLMs have gotten remarkably simple in the past year. What used to require compiling C++ from source and manually downloading model weights is now a one-line install and a single command. This tutorial walks you through the five main ways to run an LLM locally, which models to choose for your hardware, and how to get from zero to a working local LLM setup on Windows, macOS, or Linux.

The five tools for running LLMs locally

1. Ollama — the developer favorite

Ollama is the most popular tool for running local LLMs. It's a CLI application that handles downloading, managing, and serving models with minimal setup. Install it, then:

ollama pull llama3.2:3b
ollama run llama3.2:3b

That's it. You're running a local LLM. Type a prompt, get a response. Ollama also exposes a local API on port 11434 that's compatible with the OpenAI API format, so you can point any chatbot, coding assistant, or automation tool at http://localhost:11434 and it works.

Ollama runs on macOS, Linux, and Windows. On Apple Silicon Macs, it automatically uses the GPU for acceleration. On Linux with an NVIDIA GPU, it detects CUDA and uses the graphics card. The homepage at ollama.com has a one-click installer for each platform.

2. LM Studio — the desktop app

LM Studio gives you a graphical interface for downloading and running local models. It's the best option if you want to browse models visually, adjust parameters with sliders, and chat through a clean UI instead of a terminal.

LM Studio pulls models from Hugging Face and supports the GGUF format (the standard quantization format for local models). It runs on Windows, macOS, and Linux, and includes a built-in localhost server so you can use it as a backend for other tools. For developers who want both a chat interface and an API endpoint, LM Studio covers both.

3. GPT4All — the all-in-one for beginners

GPT4All is designed for people who want to run models locally without any technical setup. Download the installer, pick a model from the built-in catalog, and start chatting. It works entirely offline and runs on the CPU — no GPU required.

The trade-off is performance. GPT4All's models are optimized for CPU inference, which means they're smaller and slower than GPU-accelerated alternatives. But for a first experience with running local LLMs, the convenience is hard to beat. It also supports local document search — drop in PDFs and ask questions about them.

4. llama.cpp — maximum control

llama.cpp is the open-source engine that powers most local LLM tools, including Ollama and LM Studio under the hood. Using it directly gives you maximum control over quantization, batch sizes, context windows, and memory allocation.

It's a command line tool written in C/C++ that runs on almost anything — including machines without a GPU. Performance on CPU is surprisingly good for smaller models. When you see a GGUF file on Hugging Face, that's a model packaged for llama.cpp.

./llama-cli -m ./models/llama3-8b-q4.gguf -p "Explain Docker in one paragraph" -n 200

This is for developers who want to understand exactly what's happening. Most people should use Ollama instead.

5. llamafile — single-file simplicity

llamafile from Mozilla packages a model and the inference engine into a single executable file. Download one file, make it executable, run it. No dependencies, no installer, no Docker. The same file runs on Windows, macOS, and Linux.

chmod +x llama-3.2-1b.llamafile
./llama-3.2-1b.llamafile

This opens a web UI in your browser at localhost:8080. llamafile is the ultimate "just make it work" solution — especially useful for sharing models with non-technical colleagues.

Choosing a model for your hardware

The model you choose depends on two things: how much VRAM (GPU memory) or RAM you have, and what you need the model to do.

GPU setups

VRAM | Model size | Recommended models
6-8GB | Up to 7B parameters | Llama 3.2 3B, Gemma 3 4B, Phi-3 Mini
12-16GB | Up to 13B parameters | Llama 3.1 8B, DeepSeek Coder 6.7B, Mistral 7B
24GB+ | Up to 70B parameters | Llama 3.1 70B, Qwen 2.5 72B, Mixtral 8x7B

NVIDIA GPUs with CUDA support give the best performance. An RTX 4090 with 24GB of VRAM is the sweet spot for running larger models at interactive speeds.

CPU-only setups

If you don't have a discrete GPU, you can still run models using system RAM:

RAM | Recommended models | Speed
8GB | Llama 3.2 1B, Phi-3 Mini | ~8-12 tokens/sec
16GB | Llama 3.2 3B, Gemma 3 4B | ~5-10 tokens/sec
32GB+ | Llama 3.1 8B, Mistral 7B | ~3-8 tokens/sec

Apple Silicon Macs are the exception — they use unified memory, so a Mac with 16GB of RAM effectively has 16GB of "VRAM" for model inference. This makes Macs disproportionately good at running local models compared to Windows or Linux machines with the same RAM but no dedicated GPU.
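The tables above boil down to a simple memory-budget lookup. Here's a rough Python sketch of that decision — the thresholds, headroom factor, and Ollama model tags are illustrative choices drawn from the tables, not an official sizing guide:

```python
def recommend_model(memory_gb: float, unified: bool = False) -> str:
    """Pick an Ollama model tag from available memory, per the tables above.

    memory_gb is VRAM for a discrete GPU, or total RAM on Apple Silicon
    (unified=True), where system memory doubles as GPU memory.
    """
    # On unified-memory Macs, leave headroom for the OS and other apps
    # (the 0.75 factor is an assumption, not a measured figure).
    budget = memory_gb * 0.75 if unified else memory_gb
    if budget >= 24:
        return "llama3.1:70b"  # 70B-class models need 24GB+
    if budget >= 12:
        return "llama3.1:8b"   # 8B-13B models fit in 12-16GB
    if budget >= 6:
        return "llama3.2:3b"   # small models for 6-8GB
    return "llama3.2:1b"       # low-memory / CPU-only fallback

print(recommend_model(24))                # → llama3.1:70b
print(recommend_model(16, unified=True))  # 16GB Mac, ~12GB budget → llama3.1:8b
```

In practice you'd also weigh what the model is for — a coding task may justify a slower, larger model than casual chat does.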

Quantization matters

Every model listed above assumes you're running a quantized version — typically Q4_K_M or Q5_K_M. Quantization compresses the model to roughly 4-5 bits per parameter instead of the original 16, cutting memory usage by 70-80% with minimal quality loss.

Ollama handles quantization automatically. LM Studio and llama.cpp let you choose specific quantization levels from the GGUF files on Hugging Face. Lower quantization levels (Q2, Q3) save more memory but degrade quality noticeably. Q4 is the sweet spot for most use cases.
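The arithmetic behind that 70-80% saving is easy to check. A quick sketch — the ~20% overhead factor for KV cache and runtime buffers is a rough assumption, and Q4_K_M's ~4.5 bits per parameter is an average, since it mixes block sizes:

```python
def model_memory_gb(params_billion: float, bits_per_param: float,
                    overhead: float = 1.2) -> float:
    """Approximate memory footprint of a model's weights plus runtime overhead."""
    weight_bytes = params_billion * 1e9 * bits_per_param / 8
    return weight_bytes * overhead / 1e9

fp16 = model_memory_gb(8, 16)   # unquantized 8B model, 16 bits/parameter
q4   = model_memory_gb(8, 4.5)  # Q4_K_M averages ~4.5 bits/parameter
print(f"FP16: {fp16:.1f} GB, Q4_K_M: {q4:.1f} GB ({1 - q4/fp16:.0%} smaller)")
# → FP16: 19.2 GB, Q4_K_M: 5.4 GB (72% smaller)
```

That's why an 8B model that would never fit in 8GB of VRAM at full precision runs comfortably once quantized.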

Step-by-step: your first local LLM (with Ollama)

Here's the complete tutorial from zero to a working chatbot:

1. Install Ollama

On macOS:

brew install ollama

On Linux:

curl -fsSL https://ollama.com/install.sh | sh

On Windows: download the installer from the Ollama homepage.

2. Pull a model

ollama pull llama3.2:3b

This downloads a roughly 2GB quantized model. The first pull takes a few minutes depending on your connection.

3. Start chatting

ollama run llama3.2:3b

You're now running an LLM locally. Type a prompt, get a response. The model name in the command tells Ollama which model to load — ollama run loads the model, allocates memory, and starts an interactive session.

4. Use the API

In another terminal:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "What is Docker?"}]
  }'

This API endpoint is compatible with OpenAI's format — no API key required for local inference. Any tool that works with the OpenAI API — Python libraries, Node.js SDKs, workflow automation platforms — works with your local model once you change the base URL.
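From Python, the same call needs nothing beyond the standard library. A minimal sketch, assuming Ollama is serving on its default port; the actual network call is left commented out so the snippet also works without a running server:

```python
import json
import urllib.request

def build_chat_request(prompt: str, model: str = "llama3.2:3b",
                       base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build an OpenAI-format chat completion request for a local Ollama server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},  # no API key needed locally
        method="POST",
    )

req = build_chat_request("What is Docker?")
# with urllib.request.urlopen(req) as resp:  # requires `ollama serve` running
#     reply = json.load(resp)["choices"][0]["message"]["content"]
print(req.full_url)  # → http://localhost:11434/v1/chat/completions
```

The same request works against LM Studio's localhost server too — only the port changes.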

The best open source LLMs right now

Here's what's worth running locally in February 2026:

Llama 3.2 (Meta) — the default recommendation. Available in 1B and 3B sizes, with 8B and 70B siblings in the Llama 3.1 family. The 3B model is the best balance of quality and speed for most hardware. Openly licensed, including for commercial use.

DeepSeek Coder V2 — the best open source LLM for coding tasks. It outperforms many cloud models on code benchmarks and is available in multiple sizes.

Mistral 7B / Mixtral — strong general-purpose models from the French lab Mistral AI. Mixtral uses a mixture-of-experts architecture that's faster than its parameter count suggests.

Gemma 3 — Google's compact models. The 1B and 4B versions are surprisingly capable for their size and run well on limited hardware.

Qwen 2.5 — Alibaba's model family. Excellent at multilingual tasks and coding. The 72B version is competitive with ChatGPT for many tasks.

Browse the full catalog at the Ollama model library or Hugging Face. New open source models ship weekly.

Use cases: what local LLMs are good for

Privacy-sensitive tasks — medical notes, legal documents, financial data. Anything you can't send to cloud services.

Offline work — running local llms on a laptop during flights, in remote locations, or in air-gapped environments.

Development and testing — prototyping AI integrations without burning API credits. Test your prompts locally before deploying against a cloud endpoint.

AI agent backends — running a model behind a chatbot, coding assistant, or docs search without per-request costs. High-volume workflows that would cost hundreds of dollars per month on cloud services run for free locally.

Learning and experimentation — exploring how different function calls, parameters, and prompts affect model behavior. Local models let you experiment without worrying about costs. You can even start fine-tuning models on your own data using tools like Unsloth or the Hugging Face Transformers library.

The latency is real — local models on a CPU are slower than a cloud API call. But for many use cases, the privacy guarantees, zero per-token cost, and complete control over the stack make that trade-off worth it.
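That latency gap is easy to quantify from the throughput numbers in the CPU table above. A trivial sketch — the 60 tokens/sec GPU figure is an assumed ballpark for comparison, not a benchmark:

```python
def response_seconds(tokens: int, tokens_per_sec: float) -> float:
    """Time to generate a response of a given length at a given throughput."""
    return tokens / tokens_per_sec

# A 300-token answer at a CPU speed from the table vs. an assumed GPU-class rate:
print(f"CPU at 8 tok/s: {response_seconds(300, 8):.0f}s")    # → 38s
print(f"GPU at 60 tok/s: {response_seconds(300, 60):.0f}s")  # → 5s
```

Half a minute per answer is fine for batch jobs and background agents, and painful for interactive chat — which is why matching model size to hardware matters so much.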

Tips for getting the most out of local models

Start with Ollama. It has the lowest friction of any tool and supports the widest range of models. Once you're comfortable, explore LM Studio for the GUI experience or llama.cpp for maximum control.

Use the right model size for your hardware. Running models that exceed your VRAM forces the system to swap between GPU and CPU, which destroys performance. A fast run with a smaller model always beats a slow run with a larger one.

Keep models updated. New quantization techniques and model releases ship constantly. Ollama makes updating easy — ollama pull model-name always grabs the latest version available.

Set up a local API endpoint early. Even if you start with the CLI for interactive chat, having the localhost API running lets you connect any tool in your documentation, workflow, or development stack to your local model.
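Before wiring a tool to that endpoint, it's worth confirming something is actually listening. A small readiness check using only the standard library — 11434 is Ollama's default port; adjust it for LM Studio or llamafile:

```python
import socket

def endpoint_is_up(host: str = "localhost", port: int = 11434,
                   timeout: float = 0.5) -> bool:
    """Return True if a local model server is accepting TCP connections."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # connection refused, timeout, unresolvable host, etc.
        return False

if endpoint_is_up():
    print("Ollama API is reachable on port 11434")
else:
    print("Start it with `ollama serve` (or launch the Ollama app)")
```

This only proves a socket is open, not that a model is loaded — but it catches the most common failure (the server simply isn't running) before a tool times out mysteriously.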

Monitor your resources. On Linux, nvidia-smi shows GPU usage. On macOS, Activity Monitor shows unified memory use. Understanding where your bottlenecks are helps you choose better models and configurations.

The local LLM ecosystem is moving fast. Models that required enterprise hardware two years ago run on laptops today. The gap between cloud and local keeps shrinking — and for privacy, cost, and control, running LLMs locally is increasingly the right default.
