Not every AI application should send data to external APIs. Healthcare, legal, and financial applications often require on-device inference for privacy and compliance. LocalLLM makes running quantized language models locally as simple as calling an API.

Features

  • One-line installation: pip install localllm
  • OpenAI-compatible API: Drop-in replacement for existing client code (see the sketch after this list)
  • Automatic quantization: Convert models to GGUF format
  • Memory management: Intelligent model loading/unloading
  • Streaming support: Real-time token generation
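
Because the server speaks the OpenAI wire format, existing client code can be repointed at it with only a base-URL change. The snippet below is a minimal sketch, not documented behavior: the local port, the /v1 route, and the registered model name are all assumptions, and how the server is launched is not covered here.

from openai import OpenAI

# Point the standard OpenAI client at the local server instead of the cloud.
# Assumed: the LocalLLM server is already running on port 8000 and serves
# OpenAI-style /v1 endpoints; adjust base_url and model name to your setup.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="mistral-7b-instruct-q4",  # whatever name the local server registers
    messages=[{"role": "user", "content": "Summarize this note in two sentences."}],
)
print(response.choices[0].message.content)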

Quick Start

from localllm import LocalLLM

# Initialize with any GGUF model
llm = LocalLLM("mistral-7b-instruct-q4.gguf")

# Use like any LLM API
response = llm.complete("Explain quantum computing in simple terms")
print(response.text)
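
Streaming follows the same pattern. The sketch below assumes complete() accepts a stream=True flag and yields chunks carrying the new text in a .text attribute; the actual signature may differ.

from localllm import LocalLLM

llm = LocalLLM("mistral-7b-instruct-q4.gguf")

# Assumed streaming interface: complete(stream=True) returns an iterator of
# chunks, each exposing the newly generated text as .text.
for chunk in llm.complete("Explain quantum computing in simple terms", stream=True):
    print(chunk.text, end="", flush=True)
print()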

Performance

On an M2 MacBook Pro with 16GB RAM:

Model             Tokens/sec   Memory
Mistral 7B Q4     45           4.2 GB
Llama 2 13B Q4    28           7.8 GB
Phi-2 Q8          62           3.1 GB

Why I Built This

Cloud LLM APIs are convenient, but they’re not always appropriate. When building a medical note-taking app, I needed inference that never left the device. Existing solutions required too much setup.

LocalLLM abstracts the complexity of llama.cpp, quantization, and memory management into a simple Python package.
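
As a rough illustration of what that abstraction can look like when switching models on a memory-constrained machine, here is a sketch. The unload() call and the phi-2-q8.gguf filename are hypothetical, shown only to convey the intended workflow.

from localllm import LocalLLM

# Hypothetical workflow, for illustration only: the real method name for
# unloading and the exact GGUF filenames may differ.
llm = LocalLLM("mistral-7b-instruct-q4.gguf")
print(llm.complete("Draft a visit-note template.").text)

# Free the 7B weights before loading a second model on a 16 GB machine.
llm.unload()                      # hypothetical explicit unload
llm = LocalLLM("phi-2-q8.gguf")   # smaller model, lower memory footprint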

Collaboration

Built with Alex Chen, who contributed the FastAPI server and streaming implementation.