01By use case · datasets organized along the model lifecycle
From a model's lifecycle perspective, data needs are continuous: pre-training builds foundational language and multimodal capabilities; SFT/RLHF aligns to human preferences; RAG injects real-time knowledge; Agent data teaches the model to use tools. ENDATA delivers productized datasets covering the entire lifecycle.
Pre-training Corpora · vertical-native at scale
Drawn from film & TV scripts, long-form social posts, e-commerce reviews, vertical long articles and institutional reports. Five-layer processing — deduplication, cleaning, quality scoring, copyright verification, compliance filtering — directly supporting pre-training and continued pre-training.
SFT / RLHF Alignment Data
High-quality instruction pairs, multi-turn dialogue, chain-of-thought (CoT), and human preference annotations — supporting supervised fine-tuning and reinforcement learning alignment.
RAG Knowledge Bases
Vertical knowledge graphs + vectorized chunks + metadata indexing. Industry-tunable (Film IP KB / Product KB / KOL KB / Social Event KB) and pluggable into retrieval-augmented pipelines out of the box.
Agent Tool-call Data
Built for AI Agent training: function signatures, call sequences, tool selection paths, multi-step reasoning chains, recovery trajectories. Covering real business scenarios across e-commerce, marketing and content agents.
02By modality · depth in each of the three modalities
Video, image, text — ENDATA has accumulated AI-grade processing chains in each modality. Far from being silos, the three modalities are cross-indexed on the same underlying data foundation.
Video Datasets
Series, variety, short videos, livestream, product demos, licensed IP videos. Bundled with captions, action labels and POV classification.
View VLA details →Image Datasets
Posters & assets, KOL images, UGC images, product hero shots (SKU), licensed images. With captions, tags, aesthetic scoring and copyright status.
Text Datasets
Scripts, reviews, danmaku, threads, posts, product reviews, contracts & licensing texts. Denoised, deduplicated, with sentiment labels and domain classification.
Multimodal Alignment · tri-modal natively linked
Series video, poster art and review text from the same TV drama are cross-indexed inside ENDATA's data foundation — making this a true 3-D corpus. Supports multimodal model training, cross-modal retrieval evaluation, and video understanding benchmarks.
Data Specification · Schema Documentation
Each dataset ships with machine-readable metadata schema, field dictionary, licensing chain and compliance markers — directly ingestible by your scripts.
Download Schema Docs →03Custom data services · end-to-end on demand
When productized datasets don't fit a unique training objective, ENDATA delivers a complete pipeline — acquisition, cleaning, structuring, labeling, embedding and compliance audit. Two-layer QA combining algorithmic pre-labeling and expert review, with private deployment supported.
Acquisition
Multi-source acquisition + licensed partnerships — directed crawling, partner data ingestion, UGC recruiting.
Clean & Dedupe
Denoising, deduplication (MinHash-LSH), low-quality sample filtering, compliance-risk sample removal.
Structure / Label
Schema definition + algorithmic pre-labeling + expert review — three-layer annotation across text, image and video modalities.
Embed / Deliver
Embedding processing + compliance audit + private deployment — directly pluggable into your training pipeline.
04Licensing & deployment models · matched to customer scale
From on-demand licensing to private deployment, ENDATA's data services match the needs of startups all the way to flagship vendors.
On-Demand Licensing
Priced by dataset, scale or modality. Suited to startup teams and research institutions with well-defined data needs.
- Per-token / record / GB billing
- Self-serve dataset catalog
- Standard commercial license
- Baseline compliance backing
Subscription · Continuous Supply
Annual or multi-year subscriptions with continuously updated data, on-demand customization, and dedicated customer success.
- Multimodal / multi-domain bundles
- Monthly / quarterly updates
- Priority lane for custom processing
- Dedicated CSM
- SLA-backed support
Private Deployment
Data delivered into the customer's private cloud or on-prem IDC — for foundation-model vendors and compliance-sensitive industries.
- Private cloud / on-prem deployment
- Full data governance toolchain
- End-to-end model training support
- Auditable compliance chain
- Deep technical engagement