Not every AI application should send data to external APIs. Healthcare, legal, and financial applications often require on-device inference for privacy and compliance. LocalLLM makes running quantized language models locally as simple as calling an API.
## Features
- One-line installation: `pip install localllm`
- OpenAI-compatible API: Drop-in replacement for existing code (see the client sketch after this list)
- Automatic quantization: Convert models to GGUF format
- Memory management: Intelligent model loading/unloading
- Streaming support: Real-time token generation
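
If the bundled FastAPI server exposes an OpenAI-compatible endpoint, existing code built on the official `openai` client should only need a different base URL. This is a minimal sketch; the host, port, and model name below are assumptions for illustration, so check the server's startup output for the actual values.

```python
from openai import OpenAI

# Point the standard OpenAI client at the local server instead of api.openai.com.
# The URL and model name are placeholders; adjust them to match your setup.
client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local LocalLLM server address
    api_key="not-needed-locally",         # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="mistral-7b-instruct-q4",  # assumed model identifier
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms"}],
)
print(response.choices[0].message.content)
```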
## Quick Start
```python
from localllm import LocalLLM

# Initialize with any GGUF model
llm = LocalLLM("mistral-7b-instruct-q4.gguf")

# Use like any LLM API
response = llm.complete("Explain quantum computing in simple terms")
print(response.text)
```
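
The feature list also mentions real-time token streaming. A minimal sketch of how that might look is below; the `stream()` method name and the per-chunk `text` attribute are assumptions for illustration, not the package's documented API.

```python
from localllm import LocalLLM

llm = LocalLLM("mistral-7b-instruct-q4.gguf")

# Hypothetical streaming interface: iterate over tokens as they are generated.
# The method and attribute names here are assumptions, not a documented API.
for chunk in llm.stream("Explain quantum computing in simple terms"):
    print(chunk.text, end="", flush=True)
print()
```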
## Performance

On an M2 MacBook Pro with 16 GB of RAM:
| Model | Tokens/sec | Memory |
|---|---|---|
| Mistral 7B Q4 | 45 | 4.2 GB |
| Llama 2 13B Q4 | 28 | 7.8 GB |
| Phi-2 Q8 | 62 | 3.1 GB |
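
As a rough sanity check on the memory column: a 4-bit (Q4) quantized model stores about half a byte per parameter, plus some allowance for the KV cache and runtime buffers. The sketch below does that back-of-the-envelope arithmetic; the overhead figure is a loose assumption, not a measurement.

```python
def estimate_q4_memory_gb(params_billion: float, overhead_gb: float = 0.7) -> float:
    """Rough memory estimate for a 4-bit quantized model.

    4 bits per parameter = 0.5 bytes per parameter; overhead_gb is an assumed
    allowance for the KV cache and runtime buffers, not a measured value.
    """
    weights_gb = params_billion * 0.5  # billions of parameters * 0.5 bytes each = GB
    return weights_gb + overhead_gb

# 7B at Q4: ~3.5 GB of weights plus overhead, in the ballpark of the 4.2 GB above.
print(f"{estimate_q4_memory_gb(7):.1f} GB")
```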
## Why I Built This
Cloud LLM APIs are convenient, but they’re not always appropriate. When building a medical note-taking app, I needed inference that never left the device. Existing solutions required too much setup.
LocalLLM abstracts the complexity of llama.cpp, quantization, and memory management into a simple Python package.
## Collaboration
Built with Alex Chen, who contributed the FastAPI server and streaming implementation.