PLATFORM // DEPLOY
Three deployment paths. One OpenAI-compatible API. Serve your fine-tuned model via Python SDK, NVIDIA NIM containers, or Vi Cloud with built-in monitoring, quantization, and model versioning.
DEPLOYMENT PATHS
Deploy your fine-tuned VLM exactly where you need it. Local inference with the Python SDK, containerized serving with NVIDIA NIM, or fully managed hosting on Vi Cloud.
Local Inference
Python SDK for running your fine-tuned model on any modern NVIDIA GPU (CUDA 11.2+). 4-bit and 8-bit quantization built in. Load, predict, and batch process with three lines of code.
Containerized Deployment
Production-grade Docker containers with NVIDIA NIM runtime. OpenAI-compatible API endpoint out of the box. Deploy on any Kubernetes cluster or cloud VM.
Managed Inference
Fully managed vLLM hosting on Datature infrastructure. Zero DevOps. Pay per token. Built-in monitoring, auto-scaling, and 99.9% SLA.
API REFERENCE
Every deployment path exposes the same OpenAI-compatible REST API. Switch from GPT-4V to your fine-tuned Vi model by changing one line: the base URL.
Multi-turn chat with image inputs. Supports streaming, temperature, top-p, max tokens, and stop sequences.
Single-shot prediction. Send an image and prompt, receive structured JSON output with bounding boxes, captions, and confidence scores.
List all deployed model versions. Returns model ID, quantization level, status, and creation timestamp.
Extract visual embeddings from images. Returns 1024-dim vectors for similarity search, clustering, and retrieval-augmented generation.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.vi.datature.com/v1",
    api_key="vi-sk-...",
)

response = client.chat.completions.create(
    model="vi-2.1-dpo",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe defects"},
                {"type": "image_url", "image_url": {"url": "..."}},
            ],
        }
    ],
    stream=True,
)

# With stream=True, iterate over chunks as tokens arrive.
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")
OBSERVABILITY
Real-time dashboards for every deployment. Track token usage, latency percentiles, error rates, and user feedback scores. Set alerts on any metric with Slack, PagerDuty, or webhook integrations.
48,291
Total Requests (24h)
27ms
Avg Latency (p50)
0.02%
Error Rate
99.98%
Uptime (30d)
Track input and output tokens per request, per model, per user. Hourly, daily, and monthly aggregations. Export usage data for billing reconciliation.
P50, P90, P95, and P99 latency tracking with real-time histograms. Automated alerting when P99 exceeds configurable thresholds.
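For intuition on what those percentiles mean, here is a nearest-rank percentile over a window of latency samples, with a threshold check like the automated alerting described above. The sample values and the 150 ms threshold are illustrative only.

```python
import math

def percentile(samples, p):
    # Nearest-rank percentile: P50, P90, P95, P99 over latency samples (ms).
    ranked = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ranked)))
    return ranked[rank - 1]

latencies = [12, 27, 31, 18, 95, 22, 40, 210, 25, 29]
p50 = percentile(latencies, 50)   # typical request
p99 = percentile(latencies, 99)   # tail latency
alert = p99 > 150                 # fire when P99 exceeds a configurable threshold
```

Real dashboards compute these over streaming histograms rather than sorted lists, but the semantics are identical.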
Categorized error tracking: 4xx client errors, 5xx server errors, timeouts, and OOM events. Automatic retry metrics and circuit breaker status.
Collect thumbs-up/down and 1-5 star ratings on model responses. Aggregate scores by model version. Use low-scoring responses to build DPO preference pairs.
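One way low-scoring responses become DPO preference pairs: group rated responses by prompt and pair the best-rated answer (chosen) against the worst-rated (rejected). This grouping logic is a sketch; the record fields (`prompt`, `text`, `rating`) are assumed, not the platform's export schema.

```python
def build_preference_pairs(responses):
    # Group rated responses by prompt; pair the highest-rated answer with
    # the lowest-rated one to form (chosen, rejected) DPO training pairs.
    by_prompt = {}
    for r in responses:
        by_prompt.setdefault(r["prompt"], []).append(r)
    pairs = []
    for prompt, group in by_prompt.items():
        if len(group) < 2:
            continue
        group.sort(key=lambda r: r["rating"], reverse=True)
        chosen, rejected = group[0], group[-1]
        if chosen["rating"] > rejected["rating"]:
            pairs.append({"prompt": prompt,
                          "chosen": chosen["text"],
                          "rejected": rejected["text"]})
    return pairs
```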
Define alert rules on any metric: latency spikes, error rate increases, token budget exceeded. Deliver via Slack, PagerDuty, email, or custom webhooks.
Full request and response logging with configurable retention. Search by prompt text, model version, user ID, or response content. GDPR-compliant redaction options.
QUANTIZATION
Trade precision for speed and memory. Vi supports GPTQ quantization at 4-bit and 8-bit levels, plus full FP16 for maximum accuracy. Choose per deployment based on your latency and hardware constraints.
Vi calibrates quantization parameters on your own data, with no manual configuration needed. The calibration set is drawn automatically from your validation split.
Critical attention layers stay at higher precision while feedforward layers are aggressively quantized. This preserves accuracy where it matters most.
Quantize any checkpoint from the Training tab. Select your target precision, click Convert, and receive a deployment-ready artifact in minutes.
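A back-of-envelope estimate of what the precision trade buys you: weight memory is roughly parameters times bits per weight. The 8B parameter count below is illustrative, and this ignores activation and KV-cache memory, which add to the real footprint.

```python
def weight_memory_gb(n_params, bits):
    # Approximate weight footprint: parameters x bits per weight, in GB.
    return n_params * bits / 8 / 1024**3

n = 8_000_000_000                 # e.g. an 8B-parameter checkpoint (illustrative)
fp16 = weight_memory_gb(n, 16)    # full precision: ~14.9 GB
int4 = weight_memory_gb(n, 4)     # 4-bit GPTQ: ~3.7 GB, a 4x reduction
```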
MODEL VERSIONING
Every training run produces a versioned checkpoint. Deploy any version, roll back in seconds, or run A/B tests between two versions with configurable traffic splits.
Semantic Versioning
Each checkpoint is tagged with a version ID (e.g., vi-2.1-dpo). Training metadata, hyperparameters, and eval scores are stored alongside the weights.
Instant Rollback
Roll back to any previous version with zero downtime. The previous model is kept warm in memory for instant failover. Average rollback time under 3 seconds.
A/B Serving
Split traffic between two model versions by percentage. Compare response quality, latency, and user feedback in real time. Promote the winner with one click.
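Percentage-based traffic splitting reduces to weighted random routing per request. A minimal sketch, assuming deployment records carry a `traffic` share like those in the deployment history below; the actual routing happens server-side, not in client code.

```python
import random

def route(versions, rng=random):
    # Pick a model version with probability proportional to its traffic share.
    names = [v["version"] for v in versions]
    weights = [v["traffic"] for v in versions]
    return rng.choices(names, weights=weights, k=1)[0]

deployments = [
    {"version": "vi-2.1-dpo", "traffic": 90},
    {"version": "vi-2.0-sft", "traffic": 10},
]
```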
Deployment History
Full audit trail of every deployment event: who deployed, when, which version, and the rollback chain. Exportable for compliance documentation.
[
  {
    "version": "vi-2.1-dpo",
    "status": "active",
    "traffic": 90,
    "deployed_at": "2025-03-28T14:30:00Z",
    "bert_f1": 0.96
  },
  {
    "version": "vi-2.0-sft",
    "status": "canary",
    "traffic": 10,
    "deployed_at": "2025-03-25T09:15:00Z",
    "bert_f1": 0.93
  },
  {
    "version": "vi-1.9-sft",
    "status": "archived",
    "traffic": 0,
    "rollback_available": true
  }
]
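The deployment history above is plain JSON, so it can be inspected programmatically, for example to find the active version and the available rollback targets. This assumes the schema shown (status, traffic, rollback_available fields); a trimmed copy is inlined here for the sketch.

```python
import json

history = json.loads("""[
  {"version": "vi-2.1-dpo", "status": "active", "traffic": 90},
  {"version": "vi-2.0-sft", "status": "canary", "traffic": 10},
  {"version": "vi-1.9-sft", "status": "archived", "traffic": 0,
   "rollback_available": true}
]""")

# The version currently serving the bulk of traffic.
active = next(d for d in history if d["status"] == "active")

# Archived versions that can still be restored instantly.
rollback_targets = [d["version"] for d in history
                    if d.get("rollback_available")]
```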
Vi SDK, NVIDIA NIM, and Vi Cloud deployment included. Start free today.