Not every AI application should send data to external APIs. Healthcare, legal, and financial applications often require on-device inference for privacy and compliance. LocalLLM makes running quantized language models locally as simple as calling an API.
## Features
- One-line installation: `pip install localllm`
- OpenAI-compatible API: Drop-in replacement for existing code (see the client sketch after this list)
- Automatic quantization: Convert models to GGUF format
- Memory management: Intelligent model loading/unloading
- Streaming support: Real-time token generation
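
If the bundled FastAPI server exposes an OpenAI-compatible endpoint, existing code built on the official `openai` client should only need a different base URL. This is a minimal sketch; the host, port, and model name below are assumptions for illustration, so check the server's startup output for the actual values.

```python
from openai import OpenAI

# Point the standard OpenAI client at the local server instead of api.openai.com.
# The URL and model name are placeholders; adjust them to match your setup.
client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local LocalLLM server address
    api_key="not-needed-locally",         # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="mistral-7b-instruct-q4",  # assumed model identifier
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms"}],
)
print(response.choices[0].message.content)
```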
## Quick Start
```python
from localllm import LocalLLM

# Initialize with any GGUF model
llm = LocalLLM("mistral-7b-instruct-q4.gguf")

# Use like any LLM API
response = llm.complete("Explain quantum computing in simple terms")
print(response.text)
```
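
The feature list also mentions real-time token streaming. A minimal sketch of how that might look is below; the `stream()` method name and the per-chunk `text` attribute are assumptions for illustration, not the package's documented API.

```python
from localllm import LocalLLM

llm = LocalLLM("mistral-7b-instruct-q4.gguf")

# Hypothetical streaming interface: iterate over tokens as they are generated.
# The method and attribute names here are assumptions, not a documented API.
for chunk in llm.stream("Explain quantum computing in simple terms"):
    print(chunk.text, end="", flush=True)
print()
```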
## Performance

On an M2 MacBook Pro with 16 GB of RAM:
| Model | Tokens/sec | Memory |
|---|---|---|
| Mistral 7B Q4 | 45 | 4.2 GB |
| Llama 2 13B Q4 | 28 | 7.8 GB |
| Phi-2 Q8 | 62 | 3.1 GB |
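
As a rough sanity check on the memory column: a 4-bit (Q4) quantized model stores about half a byte per parameter, plus some allowance for the KV cache and runtime buffers. The sketch below does that back-of-the-envelope arithmetic; the overhead figure is a loose assumption, not a measurement.

```python
def estimate_q4_memory_gb(params_billion: float, overhead_gb: float = 0.7) -> float:
    """Rough memory estimate for a 4-bit quantized model.

    4 bits per parameter = 0.5 bytes per parameter; overhead_gb is an assumed
    allowance for the KV cache and runtime buffers, not a measured value.
    """
    weights_gb = params_billion * 0.5  # billions of parameters * 0.5 bytes each = GB
    return weights_gb + overhead_gb

# 7B at Q4: ~3.5 GB of weights plus overhead, in the ballpark of the 4.2 GB above.
print(f"{estimate_q4_memory_gb(7):.1f} GB")
```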
## Why I Built This
Cloud LLM APIs are convenient, but they’re not always appropriate. When building a medical note-taking app, I needed inference that never left the device. Existing solutions required too much setup.
LocalLLM abstracts the complexity of llama.cpp, quantization, and memory management into a simple Python package.
## Collaboration
Built with Alex Chen, who contributed the FastAPI server and streaming implementation.