Inference
Alith is designed to provide comprehensive integration support for modern inference engines through a unified interface architecture. Our multi-backend design supports the engines below; a sketch of swapping backends behind the shared interface follows the list.
Core Inference Engines
- Llamacpp: Lightweight CPU inference with GGUF quantization support.
- MistralRs: A Rust-native engine that applies low-level optimizations for Mistral-family models, ideal for low-latency streaming scenarios (e.g., chatbots).
- vLLM: High-throughput GPU serving with PagedAttention.
- SGLang: Advanced structured generation for complex workflows.
- ONNX Runtime: Production-grade execution with cross-platform optimizations.
- Python: Native Python runtime integration for prototyping and production, supporting popular frameworks (PyTorch, TensorFlow) and custom scripting.
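The unified interface means the agent code does not change when the backend does. Below is a minimal sketch of swapping engines at compile time via Cargo features; the feature-gated aliasing is illustrative, not part of the Alith API, and assumes exactly one of the two features is enabled:

```rust
use alith::{Agent, Chat};

// Pick a backend via Cargo features; both engine types are constructed
// from the same local GGUF checkpoint (see the examples below).
#[cfg(feature = "llamacpp")]
use alith::inference::LlamaEngine as Engine;
#[cfg(all(feature = "mistralrs", not(feature = "llamacpp")))]
use alith::inference::MistralRsEngine as Engine;

#[tokio::main]
async fn main() -> Result<(), anyhow::Error> {
    let model = Engine::new("/root/models/qwen2.5-1.5b-instruct-q5_k_m.gguf").await?;
    // The agent code is identical regardless of which engine backs it.
    let agent = Agent::new("simple agent", model);
    println!("{}", agent.prompt("Calculate 10 - 3").await?);
    Ok(())
}
```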
Custom Operator Ecosystem
We extend framework capabilities through platform-specific optimizations:
- Triton custom kernels for PyTorch acceleration
- CUDA/HIP kernels for GPU-specific optimizations
Integrations
Llamacpp
```rust
use alith::{Agent, Chat, inference::LlamaEngine};

#[tokio::main]
async fn main() -> Result<(), anyhow::Error> {
    // Load a local GGUF checkpoint with the llama.cpp backend.
    let model = LlamaEngine::new("/root/models/qwen2.5-1.5b-instruct-q5_k_m.gguf").await?;
    let agent = Agent::new("simple agent", model);
    println!("{}", agent.prompt("Calculate 10 - 3").await?);
    Ok(())
}
```
Note: we need to enable the `llamacpp` feature to run the code.
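Features are enabled in `Cargo.toml`. A minimal sketch, assuming alith comes from crates.io (the version is a placeholder; pin the release you actually use):

```toml
[dependencies]
# Placeholder version; the `llamacpp` feature name matches the note above.
alith = { version = "*", features = ["llamacpp"] }
tokio = { version = "1", features = ["full"] }
anyhow = "1"
```

The same pattern applies to the `mistralrs` and `ort` features used in the examples below. (The q5_k_m suffix on the model filename denotes a 5-bit k-quant GGUF quantization, per the GGUF support noted above.)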
MistralRs
```rust
use alith::{Agent, Chat, inference::MistralRsEngine};

#[tokio::main]
async fn main() -> Result<(), anyhow::Error> {
    // The same GGUF checkpoint as above, served by the mistral.rs backend.
    let model = MistralRsEngine::new("/root/models/qwen2.5-1.5b-instruct-q5_k_m.gguf").await?;
    let agent = Agent::new("simple agent", model);
    println!("{}", agent.prompt("Calculate 10 - 3").await?);
    Ok(())
}
```
Note: we need to enable the `mistralrs` feature to run the code.
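MistralRsEngine loads the same GGUF checkpoint as the llama.cpp example above; only the engine type and feature flag change, while the agent code stays identical.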
ONNX Runtime
```rust
use alith::{
    Agent, Chat,
    inference::engines::ort::{GraphOptimizationLevel, ort_init, present::GPT2},
};

#[tokio::main]
async fn main() -> Result<(), anyhow::Error> {
    // Initialize the ONNX Runtime environment before loading a model.
    ort_init()?;
    // Load the GPT-2 example model with basic graph optimizations enabled.
    let model = GPT2::new(
        "https://cdn.pyke.io/0/pyke:ort-rs/example-models@0.0.0/gpt2.onnx",
        "tokenizer.json",
        GraphOptimizationLevel::Level1,
        1,
    )?;
    let agent = Agent::new("simple agent", model);
    println!("{}", agent.prompt("Calculate 10 - 3").await?);
    Ok(())
}
```
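GraphOptimizationLevel controls how aggressively ONNX Runtime rewrites the model graph before execution; Level1 applies basic, semantics-preserving passes such as constant folding and redundant-node elimination. Here GPT2::new loads an example ONNX model from a remote URL together with a tokenizer.json file.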
Note: we need to enable the `ort` feature to run the code.