Datasets · ENDATA

01 · By use case · datasets organized along the model lifecycle

Across a model's lifecycle, data needs are continuous: pre-training builds foundational language and multimodal capabilities; SFT/RLHF aligns the model with human preferences; RAG injects real-time knowledge; agent data teaches the model to use tools. ENDATA delivers productized datasets covering the entire lifecycle.

PRE-TRAINING CORPUS · FLAGSHIP · TB-scale · 120+ languages / dialects

Pre-training Corpora · vertical-native at scale

Drawn from film & TV scripts, long-form social posts, e-commerce reviews, vertical long-form articles and institutional reports. Five processing layers (deduplication, cleaning, quality scoring, copyright verification and compliance filtering) prepare the corpus for direct use in pre-training and continued pre-training.

// schema example
{
  "task": "pretrain",
  "modality": ["text", "image"],
  "domain": "entertainment",
  "license": "commercial",
  "size_tb": 2.4,
  "languages": ["zh-CN", "en", "..."],
  "quality_score": 0.91
}
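A consumer script might check each metadata record against the fields shown in the schema example before ingesting it. The sketch below is illustrative only: the field names come from the example, but the expected types, the assumption that `quality_score` lies in [0, 1], and the `validate_record` helper are hypothetical, not ENDATA's published specification.

```python
import json

# Field names taken from the schema example; the expected types are assumptions.
EXPECTED_TYPES = {
    "task": str,
    "modality": list,
    "domain": str,
    "license": str,
    "size_tb": (int, float),
    "languages": list,
    "quality_score": (int, float),
}


def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record looks well-formed."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"wrong type for {field}")
    score = record.get("quality_score")
    # Assumption: quality scores are normalized to the range [0, 1].
    if isinstance(score, (int, float)) and not 0.0 <= score <= 1.0:
        problems.append("quality_score outside [0, 1]")
    return problems


# Sample record mirroring the schema example (language list shortened for the demo).
example = json.loads("""
{
  "task": "pretrain",
  "modality": ["text", "image"],
  "domain": "entertainment",
  "license": "commercial",
  "size_tb": 2.4,
  "languages": ["zh-CN", "en"],
  "quality_score": 0.91
}
""")

print(validate_record(example))  # prints []
```

Returning a list of problems rather than raising on the first one lets a batch job report every defect in a record at once.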

MinHash-LSH dedup · Quality scoring · Copyright traceable · Compliance filtered · Multilingual

SFT · RLHF · Millions of dialogue pairs

SFT / RLHF Alignment Data

High-quality instruction pairs, multi-turn dialogue, chain-of-thought (CoT), and human preference annotations — supporting supervised fine-tuning and reinforcement learning alignment.

Instruction pairs · Multi-turn · Chain-of-thought · Preference pairs
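To make the four data types above concrete, here is one way such records could be modeled in a training script. This is a sketch under assumptions: the class names, field names and the alternation check are hypothetical and do not describe ENDATA's delivery format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Turn:
    role: str      # "user" or "assistant"
    content: str


@dataclass
class InstructionSample:
    """One SFT sample: a single- or multi-turn dialogue with optional reasoning."""
    turns: list[Turn]
    chain_of_thought: Optional[str] = None


@dataclass
class PreferencePair:
    """One preference sample: two candidate replies to the same prompt."""
    prompt: str
    chosen: str
    rejected: str


def is_well_formed(sample: InstructionSample) -> bool:
    """Check that turns alternate user/assistant and end with the assistant."""
    if not sample.turns or len(sample.turns) % 2 != 0:
        return False
    expected = ["user", "assistant"] * (len(sample.turns) // 2)
    return [t.role for t in sample.turns] == expected


sample = InstructionSample(
    turns=[
        Turn("user", "Summarize this review in one sentence."),
        Turn("assistant", "The reviewer liked the plot but found the pacing slow."),
    ],
    chain_of_thought="Identify the main opinion, then the main complaint.",
)
pair = PreferencePair(
    prompt="Write a tagline for a mystery drama.",
    chosen="Every answer hides another question.",
    rejected="It is a show.",
)
print(is_well_formed(sample))  # prints True
```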

RAG KNOWLEDGE BASE · Structured · Vectorized

RAG Knowledge Bases

Vertical knowledge graphs + vectorized chunks + metadata indexing. Industry-tunable (Film IP KB / Product KB / KOL KB / Social Event KB) and pluggable into retrieval-augmented pipelines out of the box.

Film IP KB · Product KB · KOL KB · Social Event KB · Embedding-ready
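The combination of vectorized chunks and metadata indexing described above can be illustrated with a toy retriever that filters on a metadata field first and ranks by vector similarity second. Everything here is an assumption made for illustration: the `kb` field, the chunk ids and the hand-written 3-dimensional vectors stand in for real metadata and model-generated embeddings.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy chunks with hand-written vectors; real embeddings would come from a model.
CHUNKS = [
    {"id": "c1", "kb": "film_ip", "text": "Plot summary of a drama series", "vec": [0.9, 0.1, 0.0]},
    {"id": "c2", "kb": "product", "text": "Specifications for a phone case", "vec": [0.1, 0.9, 0.1]},
    {"id": "c3", "kb": "film_ip", "text": "Cast list for the same series", "vec": [0.7, 0.2, 0.2]},
]


def retrieve(query_vec: list[float], kb: str, top_k: int = 2) -> list[str]:
    """Filter by the metadata field `kb`, then rank the survivors by cosine similarity."""
    candidates = [c for c in CHUNKS if c["kb"] == kb]
    ranked = sorted(candidates, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return [c["id"] for c in ranked[:top_k]]


print(retrieve([1.0, 0.0, 0.0], kb="film_ip"))  # prints ['c1', 'c3']
```

Filtering on metadata before ranking keeps results from one knowledge base out of queries meant for another; a production pipeline would typically push both steps into a vector store.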

AGENT TOOL-CALL · Function-level annotation

Agent Tool-call Data

Built for AI Agent training: function signatures, call sequences, tool selection paths, multi-step reasoning chains, recovery trajectories. Covering real business scenarios across e-commerce, marketing and content agents.

Function signatures · Multi-step traces · Tool selection · Recovery paths
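A tool-call trajectory with function signatures, a call sequence and a recovery step might be represented and sanity-checked as below. This is a hypothetical sketch: the tool names, the `status` and `recovers` fields and the `check_trace` helper are invented for illustration and are not ENDATA's annotation format.

```python
# Hypothetical tool registry: tool name -> required argument names.
TOOLS = {
    "search_products": {"query"},
    "get_price": {"sku"},
}

# A hypothetical annotated trajectory, including one failed step and its recovery.
TRACE = [
    {"tool": "search_products", "args": {"query": "phone case"}, "status": "ok"},
    {"tool": "get_price", "args": {"sku": "UNKNOWN"}, "status": "error"},
    {"tool": "get_price", "args": {"sku": "SKU-123"}, "status": "ok", "recovers": 1},
]


def check_trace(trace: list[dict]) -> list[str]:
    """Return a list of problems found in the trajectory (empty if none)."""
    problems = []
    for i, step in enumerate(trace):
        required = TOOLS.get(step["tool"])
        if required is None:
            problems.append(f"step {i}: unknown tool {step['tool']}")
        elif not required <= set(step["args"]):
            problems.append(f"step {i}: missing arguments")
        target = step.get("recovers")
        # A recovery step must point back at an earlier step that failed.
        if target is not None and not (0 <= target < i and trace[target]["status"] == "error"):
            problems.append(f"step {i}: invalid recovery reference")
    return problems


print(check_trace(TRACE))  # prints []
```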

02 · By modality · depth in each of the three modalities

Video, image, text — ENDATA has accumulated AI-grade processing chains in each modality. Far from being silos, the three modalities are cross-indexed on the same underlying data foundation.

VIDEO MODALITY · 2.3B+ clips

Video Datasets

Series, variety shows, short videos, livestreams, product demos, licensed IP videos. Bundled with captions, action labels and POV classification.

Frame Rate: 30fps
Resolution: 1080p+


IMAGE MODALITY · 2.1B+ SKU · 14M+ IP

Image Datasets

Posters & assets, KOL images, UGC images, product hero shots (SKU), licensed images. With captions, tags, aesthetic scoring and copyright status.

Caption-aligned · Aesthetic scoring · Copyright tags · Multi-resolution

TEXT MODALITY · 23B+ corpora

Text Datasets

Scripts, reviews, danmaku (bullet comments), threads, posts, product reviews, contracts & licensing texts. Denoised, deduplicated, with sentiment labels and domain classification.

Sentiment · Domain class · Entity extraction · Keywords

MULTIMODAL ALIGNED · Naturally cross-modal

Multimodal Alignment · tri-modal natively linked

Series video, poster art and review text from the same TV drama are cross-indexed inside ENDATA's data foundation — making this a true 3-D corpus. Supports multimodal model training, cross-modal retrieval evaluation, and video understanding benchmarks.

Video-Text pairs · Image-Text pairs · Video-Image align · Audio-Visual · Cross-modal retrieval · Video understanding bench
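Cross-indexing video, image and text assets of the same work amounts to grouping records by a shared key. The sketch below assumes a hypothetical `work_id` field and made-up URIs; how ENDATA actually links the modalities is not specified here.

```python
from collections import defaultdict

# Hypothetical records; `work_id` stands in for whatever shared key links the modalities.
RECORDS = [
    {"work_id": "drama-001", "modality": "video", "uri": "video/ep01.mp4"},
    {"work_id": "drama-001", "modality": "image", "uri": "image/poster.jpg"},
    {"work_id": "drama-001", "modality": "text", "uri": "text/reviews.jsonl"},
    {"work_id": "drama-002", "modality": "text", "uri": "text/script.txt"},
]


def build_index(records):
    """Group asset URIs as index[work_id][modality] -> list of URIs."""
    index = defaultdict(lambda: defaultdict(list))
    for r in records:
        index[r["work_id"]][r["modality"]].append(r["uri"])
    return index


def aligned_works(index, modalities=("video", "image", "text")):
    """Works that have at least one asset in every requested modality."""
    return sorted(w for w, by_mod in index.items() if all(m in by_mod for m in modalities))


index = build_index(RECORDS)
print(aligned_works(index))  # prints ['drama-001']
```

Selecting only works that are present in every modality yields the candidates for video-text, image-text and video-image pairs.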

METADATA SCHEMA

Data Specification · Schema Documentation

Each dataset ships with a machine-readable metadata schema, a field dictionary, a licensing chain and compliance markers, all directly ingestible by your scripts.


03 · Custom data services · end-to-end on demand

When productized datasets don't fit a unique training objective, ENDATA delivers a complete pipeline: acquisition, cleaning, structuring, labeling, embedding and compliance audit. QA runs in two layers, algorithmic pre-labeling followed by expert review, and private deployment is supported.

01 · COLLECT

Acquisition

Multi-source acquisition + licensed partnerships — directed crawling, partner data ingestion, UGC recruiting.

02 · CLEAN

Clean & Dedupe

Denoising, deduplication (MinHash-LSH), low-quality sample filtering, compliance-risk sample removal.
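For readers unfamiliar with MinHash-LSH, the following standard-library sketch shows the general technique: shingle each document, compute a MinHash signature, then use LSH banding to find candidate duplicates. The parameters (64 hashes, 16 bands, 5-character shingles), the salted BLAKE2b hash family and the function names are illustrative assumptions, not ENDATA's production settings.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_HASHES = 64   # signature length (illustrative)
BANDS = 16        # 16 bands of 4 rows each


def shingles(text: str, k: int = 5) -> set[str]:
    """Character k-shingles of the lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash(shingle_set: set[str]) -> list[int]:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    sig = []
    for seed in range(NUM_HASHES):
        salt = seed.to_bytes(8, "little")  # blake2b accepts a salt of up to 16 bytes
        sig.append(min(
            int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(), "big")
            for s in shingle_set
        ))
    return sig


def candidate_pairs(docs: dict[str, str]) -> set[tuple[str, str]]:
    """LSH banding: documents that agree on any full band become candidate duplicates."""
    rows = NUM_HASHES // BANDS
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        sig = minhash(shingles(text))
        for b in range(BANDS):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


docs = {
    "a": "Great phone case, fits perfectly and arrived fast.",
    "b": "Great  phone case, fits perfectly and arrived FAST.",  # same text after normalization
    "c": "0123456789 9876543210 0123456789",
}
print(candidate_pairs(docs))
```

Candidate pairs are only a shortlist; a real pipeline would usually confirm each pair with an exact Jaccard or edit-distance check before dropping a document.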

03 · STRUCTURE

Structure / Label

Schema definition + algorithmic pre-labeling + expert review — three-layer annotation across text, image and video modalities.
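One common way to combine algorithmic pre-labeling with expert review is to auto-accept confident pre-labels and queue the rest for a human. The sketch below assumes that pattern; the confidence field, the 0.85 threshold and the helper names are hypothetical and are not taken from ENDATA's process.

```python
REVIEW_THRESHOLD = 0.85  # assumed cutoff, chosen only for illustration


def route(prelabels, threshold=REVIEW_THRESHOLD):
    """Split algorithmic pre-labels into auto-accepted items and an expert review queue."""
    accepted, review_queue = [], []
    for item in prelabels:
        (accepted if item["confidence"] >= threshold else review_queue).append(item)
    return accepted, review_queue


def apply_review(review_queue, expert_labels):
    """Replace each queued pre-label with the expert's decision, keyed by item id."""
    return [
        {**item, "label": expert_labels[item["id"]], "reviewed": True}
        for item in review_queue
    ]


prelabels = [
    {"id": "a", "label": "positive", "confidence": 0.97},
    {"id": "b", "label": "neutral", "confidence": 0.55},
]
accepted, queue = route(prelabels)
final = accepted + apply_review(queue, {"b": "negative"})
print([(item["id"], item["label"]) for item in final])  # prints [('a', 'positive'), ('b', 'negative')]
```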

04 · DELIVER

Embed / Deliver

Embedding processing + compliance audit + private deployment — directly pluggable into your training pipeline.

04 · Licensing & deployment models · matched to customer scale

From on-demand licensing to private deployment, ENDATA's data services match the needs of customers ranging from startups to flagship vendors.

ON-DEMAND

On-Demand Licensing

Priced by dataset, scale or modality. Suited to startup teams and research institutions with well-defined data needs.

Per-token / record / GB billing
Self-serve dataset catalog
Standard commercial license
Baseline compliance backing

SUBSCRIPTION · RECOMMENDED

Subscription · Continuous Supply

Annual or multi-year subscriptions with continuously updated data, on-demand customization, and dedicated customer success.

Multimodal / multi-domain bundles
Monthly / quarterly updates
Priority lane for custom processing
Dedicated CSM
SLA-backed support

PRIVATE DEPLOYMENT

Private Deployment

Data delivered into the customer's private cloud or on-prem IDC — for foundation-model vendors and compliance-sensitive industries.

Private cloud / on-prem deployment
Full data governance toolchain
End-to-end model training support
Auditable compliance chain
Deep technical engagement

DATASETS · END-TO-END DATA FOR AI
A high-quality, compliant, vertical data arsenal
