
Inference

Alith is designed to provide comprehensive integration support for modern inference engines through a unified interface architecture. Our multi-backend solution supports the engines below; a sketch of the shared interface follows the list.

Core Inference Engines

  • Llamacpp: Lightweight CPU inference with GGUF quantization support.
  • MistralRs: Built in Rust, leveraging low-level optimizations for Mistral-family models; ideal for scenarios requiring low-latency streaming (e.g., chatbots).
  • vLLM: High-throughput GPU serving with PagedAttention.
  • SGLang: Advanced structured generation for complex workflows.
  • ONNX Runtime: Production-grade execution with cross-platform optimizations.
  • Python: Native Python runtime integration for prototyping and production, supporting popular frameworks (PyTorch, TensorFlow) and custom scripting.
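Because every backend plugs into the same agent interface, engines can be swapped without changing agent code. The sketch below is illustrative, not part of the Alith API: it assumes the engines implement the Chat trait (imported in the examples further down) and that Agent::new accepts any such type; the helper run_with is hypothetical.

use alith::{Agent, Chat, inference::LlamaEngine};

// Hypothetical helper: runs the same prompt on any backend implementing `Chat`.
async fn run_with(model: impl Chat) -> Result<String, anyhow::Error> {
    let agent = Agent::new("simple agent", model);
    Ok(agent.prompt("Calculate 10 - 3").await?)
}

#[tokio::main]
async fn main() -> Result<(), anyhow::Error> {
    // Swap in MistralRsEngine, an ONNX preset, etc. without touching run_with.
    let model = LlamaEngine::new("/root/models/qwen2.5-1.5b-instruct-q5_k_m.gguf").await?;
    println!("{}", run_with(model).await?);
    Ok(())
}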

Custom Operator Ecosystem

We extend framework capabilities through platform-specific optimizations (a CPU reference for one such fused operator follows the list):

  • Triton custom kernels for PyTorch acceleration
  • CUDA/HIP kernels for GPU-specific optimizations
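To make the idea concrete, here is a plain-Rust CPU reference for a fused bias-add + GeLU, a common fusion target; a Triton or CUDA/HIP kernel would compute the same math in a single GPU pass, avoiding the intermediate memory round-trip between the two steps. This is an illustrative sketch, not an Alith API:

/// CPU reference for a fused bias + GeLU operator (tanh approximation).
/// A custom GPU kernel fuses the bias-add and activation into one pass.
fn fused_bias_gelu(x: &mut [f32], bias: &[f32]) {
    let n = bias.len();
    for (i, v) in x.iter_mut().enumerate() {
        let y = *v + bias[i % n]; // bias-add, broadcast across rows
        // GeLU (tanh approx): 0.5 * y * (1 + tanh(sqrt(2/pi) * (y + 0.044715 * y^3)))
        let t = (0.797_884_6_f32 * (y + 0.044_715 * y * y * y)).tanh();
        *v = 0.5 * y * (1.0 + t);
    }
}

fn main() {
    let mut x = vec![0.5_f32, -1.0, 2.0, 0.0];
    let bias = vec![0.1_f32, -0.1];
    fused_bias_gelu(&mut x, &bias);
    println!("{x:?}"); // fused result, computed in one loop
}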

Integrations

Llamacpp

use alith::{Agent, Chat, inference::LlamaEngine};

#[tokio::main]
async fn main() -> Result<(), anyhow::Error> {
    // Load a local GGUF model with the llama.cpp backend.
    let model = LlamaEngine::new("/root/models/qwen2.5-1.5b-instruct-q5_k_m.gguf").await?;
    let agent = Agent::new("simple agent", model);
    println!("{}", agent.prompt("Calculate 10 - 3").await?);
    Ok(())
}

Note: the llamacpp Cargo feature must be enabled to run this code.
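A minimal sketch of enabling the feature in Cargo.toml, with the version left as a placeholder (check the crate for the current release); the mistralrs and ort features used in the later examples are enabled the same way:

# Cargo.toml: turn on the backend you need via a feature flag.
[dependencies]
alith = { version = "*", features = ["llamacpp"] }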

MistralRs

use alith::{Agent, Chat, inference::MistralRsEngine};

#[tokio::main]
async fn main() -> Result<(), anyhow::Error> {
    // Load the same GGUF model through the mistral.rs backend.
    let model = MistralRsEngine::new("/root/models/qwen2.5-1.5b-instruct-q5_k_m.gguf").await?;
    let agent = Agent::new("simple agent", model);
    println!("{}", agent.prompt("Calculate 10 - 3").await?);
    Ok(())
}

Note: the mistralrs Cargo feature must be enabled to run this code.

ONNX Runtime

use alith::{
    Agent, Chat,
    inference::engines::ort::{GraphOptimizationLevel, ort_init, present::GPT2},
};

#[tokio::main]
async fn main() -> Result<(), anyhow::Error> {
    // Initialize the ONNX Runtime environment before loading any model.
    ort_init()?;
    let model = GPT2::new(
        "https://cdn.pyke.io/0/pyke:ort-rs/example-models@0.0.0/gpt2.onnx",
        "tokenizer.json",
        GraphOptimizationLevel::Level1,
        1,
    )?;
    let agent = Agent::new("simple agent", model);
    println!("{}", agent.prompt("Calculate 10 - 3").await?);
    Ok(())
}

Note: the ort Cargo feature must be enabled to run this code.
