PLATFORM // DEPLOY
Three deployment paths. One OpenAI-compatible API. Serve your fine-tuned model via Python SDK, NVIDIA NIM containers, or Vi Cloud with built-in monitoring, quantization, and model versioning.
DEPLOYMENT PATHS
Deploy your fine-tuned VLM exactly where you need it. Local inference with the Python SDK, containerized serving with NVIDIA NIM, or fully managed hosting on Vi Cloud.
Local Inference
Python SDK for running your fine-tuned model on any modern NVIDIA GPU (CUDA 11.2+). 4-bit and 8-bit quantization built in. Load, predict, and batch process with three lines of code.
Containerized Deployment
Production-grade Docker containers with NVIDIA NIM runtime. OpenAI-compatible API endpoint out of the box. Deploy on any Kubernetes cluster or cloud VM.
Managed Inference
Fully managed vLLM hosting on Datature infrastructure. Zero DevOps. Pay per token. Built-in monitoring, auto-scaling, and 99.9% SLA.
API REFERENCE
Every deployment path exposes the same OpenAI-compatible REST API. Switch from GPT-4V to your fine-tuned Vi model by changing one line: the base URL.
Multi-turn chat with image inputs. Supports streaming, temperature, top-p, max tokens, and stop sequences.
Single-shot prediction. Send an image and prompt, receive structured JSON output with bounding boxes, captions, and confidence scores.
List all deployed model versions. Returns model ID, quantization level, status, and creation timestamp.
Extract visual embeddings from images. Returns 1024-dim vectors for similarity search, clustering, and retrieval-augmented generation.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.vi.datature.com/v1",
    api_key="vi-sk-...",
)

response = client.chat.completions.create(
    model="vi-2.1-dpo",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe defects"},
                {"type": "image_url", "image_url": {"url": "..."}},
            ],
        }
    ],
    stream=True,
)

# With stream=True, iterate over chunks as tokens arrive.
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")
OBSERVABILITY
Real-time dashboards for every deployment. Track token usage, latency percentiles, error rates, and user feedback scores. Set alerts on any metric with Slack, PagerDuty, or webhook integrations.
48,291
Total Requests (24h)
27ms
Avg Latency (p50)
0.02%
Error Rate
99.98%
Uptime (30d)
Track input and output tokens per request, per model, per user. Hourly, daily, and monthly aggregations. Export usage data for billing reconciliation.
P50, P90, P95, and P99 latency tracking with real-time histograms. Automated alerting when P99 exceeds configurable thresholds.
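For intuition on what those percentiles mean, here is a nearest-rank percentile over a window of latency samples, with a threshold check like the automated alerting described above. The sample values and the 150 ms threshold are illustrative only.

```python
import math

def percentile(samples, p):
    # Nearest-rank percentile: P50, P90, P95, P99 over latency samples (ms).
    ranked = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ranked)))
    return ranked[rank - 1]

latencies = [12, 27, 31, 18, 95, 22, 40, 210, 25, 29]
p50 = percentile(latencies, 50)   # typical request
p99 = percentile(latencies, 99)   # tail latency
alert = p99 > 150                 # fire when P99 exceeds a configurable threshold
```

Real dashboards compute these over streaming histograms rather than sorted lists, but the semantics are identical.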
Categorized error tracking: 4xx client errors, 5xx server errors, timeouts, and OOM events. Automatic retry metrics and circuit breaker status.
Collect thumbs-up/down and 1-5 star ratings on model responses. Aggregate scores by model version. Use low-scoring responses to build DPO preference pairs.
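One way low-scoring responses become DPO preference pairs: group rated responses by prompt and pair the best-rated answer (chosen) against the worst-rated (rejected). This grouping logic is a sketch; the record fields (`prompt`, `text`, `rating`) are assumed, not the platform's export schema.

```python
def build_preference_pairs(responses):
    # Group rated responses by prompt; pair the highest-rated answer with
    # the lowest-rated one to form (chosen, rejected) DPO training pairs.
    by_prompt = {}
    for r in responses:
        by_prompt.setdefault(r["prompt"], []).append(r)
    pairs = []
    for prompt, group in by_prompt.items():
        if len(group) < 2:
            continue
        group.sort(key=lambda r: r["rating"], reverse=True)
        chosen, rejected = group[0], group[-1]
        if chosen["rating"] > rejected["rating"]:
            pairs.append({"prompt": prompt,
                          "chosen": chosen["text"],
                          "rejected": rejected["text"]})
    return pairs
```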
Define alert rules on any metric: latency spikes, error rate increases, token budget exceeded. Deliver via Slack, PagerDuty, email, or custom webhooks.
Full request and response logging with configurable retention. Search by prompt text, model version, user ID, or response content. GDPR-compliant redaction options.
QUANTIZATION
Trade precision for speed and memory. Vi supports GPTQ quantization at 4-bit and 8-bit levels, plus full FP16 for maximum accuracy. Choose per deployment based on your latency and hardware constraints.
Vi calibrates quantization parameters on your own data, with no manual configuration needed. The calibration set is drawn automatically from your validation split.
Critical attention layers stay at higher precision while feedforward layers are aggressively quantized. This preserves accuracy where it matters most.
Quantize any checkpoint from the Training tab. Select your target precision, click Convert, and receive a deployment-ready artifact in minutes.
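A back-of-envelope estimate of what the precision trade buys you: weight memory is roughly parameters times bits per weight. The 8B parameter count below is illustrative, and this ignores activation and KV-cache memory, which add to the real footprint.

```python
def weight_memory_gb(n_params, bits):
    # Approximate weight footprint: parameters x bits per weight, in GB.
    return n_params * bits / 8 / 1024**3

n = 8_000_000_000                 # e.g. an 8B-parameter checkpoint (illustrative)
fp16 = weight_memory_gb(n, 16)    # full precision: ~14.9 GB
int4 = weight_memory_gb(n, 4)     # 4-bit GPTQ: ~3.7 GB, a 4x reduction
```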
MODEL VERSIONING
Every training run produces a versioned checkpoint. Deploy any version, roll back in seconds, or run A/B tests between two versions with configurable traffic splits.
Semantic Versioning
Each checkpoint is tagged with a version ID (e.g., vi-2.1-dpo). Training metadata, hyperparameters, and eval scores are stored alongside the weights.
Instant Rollback
Roll back to any previous version with zero downtime. The previous model is kept warm in memory for instant failover. Average rollback time under 3 seconds.
A/B Serving
Split traffic between two model versions by percentage. Compare response quality, latency, and user feedback in real time. Promote the winner with one click.
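Percentage-based traffic splitting reduces to weighted random routing per request. A minimal sketch, assuming deployment records carry a `traffic` share like those in the deployment history below; the actual routing happens server-side, not in client code.

```python
import random

def route(versions, rng=random):
    # Pick a model version with probability proportional to its traffic share.
    names = [v["version"] for v in versions]
    weights = [v["traffic"] for v in versions]
    return rng.choices(names, weights=weights, k=1)[0]

deployments = [
    {"version": "vi-2.1-dpo", "traffic": 90},
    {"version": "vi-2.0-sft", "traffic": 10},
]
```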
Deployment History
Full audit trail of every deployment event: who deployed, when, which version, and the rollback chain. Exportable for compliance documentation.
[
  {
    "version": "vi-2.1-dpo",
    "status": "active",
    "traffic": 90,
    "deployed_at": "2025-03-28T14:30:00Z",
    "bert_f1": 0.96
  },
  {
    "version": "vi-2.0-sft",
    "status": "canary",
    "traffic": 10,
    "deployed_at": "2025-03-25T09:15:00Z",
    "bert_f1": 0.93
  },
  {
    "version": "vi-1.9-sft",
    "status": "archived",
    "traffic": 0,
    "rollback_available": true
  }
]
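The deployment history above is plain JSON, so it can be inspected programmatically, for example to find the active version and the available rollback targets. This assumes the schema shown (status, traffic, rollback_available fields); a trimmed copy is inlined here for the sketch.

```python
import json

history = json.loads("""[
  {"version": "vi-2.1-dpo", "status": "active", "traffic": 90},
  {"version": "vi-2.0-sft", "status": "canary", "traffic": 10},
  {"version": "vi-1.9-sft", "status": "archived", "traffic": 0,
   "rollback_available": true}
]""")

# The version currently serving the bulk of traffic.
active = next(d for d in history if d["status"] == "active")

# Archived versions that can still be restored instantly.
rollback_targets = [d["version"] for d in history
                    if d.get("rollback_available")]
```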
Vi SDK, NVIDIA NIM, and Vi Cloud deployment included. Start free today.