PLATFORM // DEPLOY

DEPLOY YOUR VLM
TO PRODUCTION.

Three deployment paths. One OpenAI-compatible API. Serve your fine-tuned model via Python SDK, NVIDIA NIM containers, or Vi Cloud with built-in monitoring, quantization, and model versioning.


DEPLOYMENT PATHS

THREE WAYS TO SHIP.

Deploy your fine-tuned VLM exactly where you need it. Local inference with the Python SDK, containerized serving with NVIDIA NIM, or fully managed hosting on Vi Cloud.

Vi SDK

Local Inference

Python SDK for running your fine-tuned model on any modern NVIDIA GPU (CUDA 11.2+). 4-bit and 8-bit quantization built in. Load, predict, and batch process with three lines of code.

  • pip install datature-vi
  • 4-bit / 8-bit GPTQ Quantization
  • Runs on Any Modern NVIDIA GPU (8GB+)
  • Batch Inference with Auto-Batching
  • Offline Mode (No Internet Required)
  • JSON Decode Output Parsing
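The auto-batching listed above can be sketched in plain Python: incoming inputs are grouped into fixed-size batches before each forward pass. The helper name `auto_batch` and the batch size are illustrative, not the `datature-vi` API:

```python
from typing import Iterable, Iterator, List


def auto_batch(items: Iterable[str], max_batch: int = 8) -> Iterator[List[str]]:
    """Group inputs into batches of at most max_batch, yielding the
    final partial batch so no input is dropped."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == max_batch:
            yield batch
            batch = []
    if batch:
        yield batch
```

Fixed-size grouping like this keeps GPU utilization high for bulk jobs while still flushing a trailing partial batch.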

NVIDIA NIM

Containerized Deployment

Production-grade Docker containers with NVIDIA NIM runtime. OpenAI-compatible API endpoint out of the box. Deploy on any Kubernetes cluster or cloud VM.

  • docker pull and run
  • OpenAI-Compatible /v1/ Endpoints
  • Streaming Token Generation
  • Kubernetes / Helm Charts Included
  • Auto-Scaling with HPA
  • Health Check and Readiness Probes

Vi Cloud

Managed Inference

Fully managed vLLM hosting on Datature infrastructure. Zero DevOps. Pay per token. Built-in monitoring, auto-scaling, and 99.9% SLA.

  • One-Click Deploy from Training
  • vLLM Backend with PagedAttention
  • Auto-Scaling (0 to N Replicas)
  • 99.9% Uptime SLA
  • Built-in Rate Limiting
  • Usage-Based Pricing

API REFERENCE

OPENAI-COMPATIBLE. DROP-IN REPLACEMENT.

Every deployment path exposes the same OpenAI-compatible REST API. Switch from GPT-4V to your fine-tuned Vi model by changing one line: the base URL.

POST /v1/chat/completions

Multi-turn chat with image inputs. Supports streaming, temperature, top-p, max tokens, and stop sequences.

POST /v1/predict

Single-shot prediction. Send an image and prompt, receive structured JSON output with bounding boxes, captions, and confidence scores.

GET /v1/models

List all deployed model versions. Returns model ID, quantization level, status, and creation timestamp.

POST /v1/embeddings

Extract visual embeddings from images. Returns 1024-dim vectors for similarity search, clustering, and retrieval-augmented generation.
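Assuming the endpoint returns plain float vectors, similarity search over those 1024-dim embeddings reduces to cosine similarity. A minimal, dependency-free sketch:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors: the dot product
    divided by the product of their magnitudes. Ranges from -1 to 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

At production scale you would hand the vectors to a vector index rather than comparing pairwise, but the scoring function is the same.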

request.py
Python

from openai import OpenAI

client = OpenAI(
    base_url="https://api.vi.datature.com/v1",
    api_key="vi-sk-..."
)

response = client.chat.completions.create(
    model="vi-2.1-dpo",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe defects"},
                {"type": "image_url", "image_url": {"url": "..."}}
            ]
        }
    ],
    stream=True
)
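With stream=True, the OpenAI Python client yields chunks whose content arrives in choices[0].delta. The deltas can be collected into the full response like this (the collect_stream helper is illustrative, not part of any SDK):

```python
def collect_stream(chunks) -> str:
    """Join streamed token deltas into the full response text.
    Chunks whose delta content is None (e.g. the final chunk) are skipped."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
    return "".join(parts)
```

For interactive use you would print each delta as it arrives instead of joining at the end; the iteration pattern is identical.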

OBSERVABILITY

MONITOR EVERY REQUEST.

Real-time dashboards for every deployment. Track token usage, latency percentiles, error rates, and user feedback scores. Set alerts on any metric with Slack, PagerDuty, or webhook integrations.

  • Total Requests (24h): 48,291
  • Avg Latency (p50): 27ms
  • Error Rate: 0.02%
  • Uptime (30d): 99.98%

Token Usage Analytics

Track input and output tokens per request, per model, per user. Hourly, daily, and monthly aggregations. Export usage data for billing reconciliation.

Latency Percentiles

P50, P90, P95, and P99 latency tracking with real-time histograms. Automated alerting when P99 exceeds configurable thresholds.

Error Rate Monitoring

Categorized error tracking: 4xx client errors, 5xx server errors, timeouts, and OOM events. Automatic retry metrics and circuit breaker status.

Feedback Scoring

Collect thumbs-up/down and 1-5 star ratings on model responses. Aggregate scores by model version. Use low-scoring responses to build DPO preference pairs.

Custom Alerts

Define alert rules on any metric: latency spikes, error rate increases, token budget exceeded. Deliver via Slack, PagerDuty, email, or custom webhooks.

Request Logging

Full request and response logging with configurable retention. Search by prompt text, model version, user ID, or response content. GDPR-compliant redaction options.
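As a sketch of how low-scoring responses become DPO preference pairs: group logged responses by prompt, then pair each highly rated answer with each poorly rated one. Field names like rating and text are hypothetical, not the Vi export schema:

```python
def build_preference_pairs(responses, threshold_hi=4, threshold_lo=2):
    """Pair high- and low-rated responses to the same prompt into
    (chosen, rejected) records suitable for DPO training.
    Ratings are assumed to be on the 1-5 star scale described above."""
    by_prompt = {}
    for r in responses:
        by_prompt.setdefault(r["prompt"], []).append(r)

    pairs = []
    for prompt, group in by_prompt.items():
        chosen = [r for r in group if r["rating"] >= threshold_hi]
        rejected = [r for r in group if r["rating"] <= threshold_lo]
        for c in chosen:
            for j in rejected:
                pairs.append({"prompt": prompt, "chosen": c["text"], "rejected": j["text"]})
    return pairs
```

Responses with middling ratings fall into neither bucket, which keeps the preference signal unambiguous.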

QUANTIZATION

RIGHT-SIZE YOUR INFERENCE.

Trade precision for speed and memory. Vi supports GPTQ quantization at 4-bit and 8-bit levels, plus full FP16 for maximum accuracy. Choose per deployment based on your latency and hardware constraints.

Metric         | 4-Bit GPTQ         | 8-Bit GPTQ          | FP16
VRAM Usage     | 4.2 GB             | 7.8 GB              | 14.6 GB
Throughput     | 142 tok/s          | 118 tok/s           | 86 tok/s
Latency (p50)  | 18ms               | 24ms                | 32ms
BERTScore F1   | 0.91               | 0.94                | 0.96
Min GPU        | RTX 3060 (8GB)     | RTX 3080 (10GB)     | RTX 4090 (24GB)
Best For       | Edge / High Volume | Balanced Production | Max Accuracy
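The VRAM figures above translate into sizeable savings relative to FP16. A quick check using the table's numbers:

```python
# VRAM usage per precision, in GB, taken from the table above.
vram = {"4-bit": 4.2, "8-bit": 7.8, "fp16": 14.6}

# Percent VRAM saved versus the FP16 baseline, rounded to one decimal.
savings = {k: round(100 * (1 - v / vram["fp16"]), 1) for k, v in vram.items()}
# 4-bit saves about 71% of VRAM; 8-bit about 47%.
```

That 71% reduction is what lets the 4-bit variant fit on an 8 GB card while FP16 needs 24 GB of headroom.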

Automatic Calibration

Vi calibrates quantization parameters on your training data. No manual configuration needed. Calibration dataset is drawn from your validation split automatically.

Mixed Precision Layers

Critical attention layers stay at higher precision while feedforward layers are aggressively quantized. This preserves accuracy where it matters most.

One-Click Conversion

Quantize any checkpoint from the Training tab. Select your target precision, click Convert, and receive a deployment-ready artifact in minutes.

MODEL VERSIONING

EVERY CHECKPOINT. TRACKED.

Every training run produces a versioned checkpoint. Deploy any version, roll back in seconds, or run A/B tests between two versions with configurable traffic splits.

  • Semantic Versioning

    Each checkpoint is tagged with a version ID (e.g., vi-2.1-dpo). Training metadata, hyperparameters, and eval scores are stored alongside the weights.

  • Instant Rollback

    Roll back to any previous version with zero downtime. The previous model is kept warm in memory for instant failover. Average rollback time under 3 seconds.

  • A/B Serving

    Split traffic between two model versions by percentage. Compare response quality, latency, and user feedback in real time. Promote the winner with one click.

  • Deployment History

    Full audit trail of every deployment event: who deployed, when, which version, and the rollback chain. Exportable for compliance documentation.
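A/B traffic splitting of the kind described above is commonly implemented as deterministic hash-based routing, so a given user always lands on the same version for the duration of a test. A minimal sketch (not the Vi Cloud implementation):

```python
import hashlib


def route_version(user_id: str, canary: str, stable: str, canary_pct: int = 10) -> str:
    """Deterministically route a user to a model version.
    Hashing the user ID into one of 100 buckets gives a stable assignment:
    the same user always sees the same version for a given split."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return canary if bucket < canary_pct else stable
```

Hash-based assignment avoids sticky-session state on the router and makes per-user feedback comparisons between the two versions meaningful.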

deployment-history.json
JSON

[
  {
    "version": "vi-2.1-dpo",
    "status": "active",
    "traffic": 90,
    "deployed_at": "2025-03-28T14:30:00Z",
    "bert_f1": 0.96
  },
  {
    "version": "vi-2.0-sft",
    "status": "canary",
    "traffic": 10,
    "deployed_at": "2025-03-25T09:15:00Z",
    "bert_f1": 0.93
  },
  {
    "version": "vi-1.9-sft",
    "status": "archived",
    "traffic": 0,
    "rollback_available": true
  }
]

DEPLOY TODAY.
SCALE TOMORROW.

Vi SDK, NVIDIA NIM, and Vi Cloud deployment included. Start free today.