Metadata Schema
ENDATA datasets use a unified metadata schema that is compatible with mainstream formats such as RLDS, LeRobot v3, and WebDataset and adds ENDATA-specific extension fields. Each dataset package ships with a metadata.json file that describes its structure, licensing, and quality metrics.
Standard Fields
| Field | Type | Description |
|---|---|---|
| dataset_id | string | Unique identifier, format EN-YYYYMM-XXXX |
| modality | array | One or more of "video", "image", "text" |
| domain | string | film_tv / social / ecommerce / copyright |
| task_family | array | pretrain / sft / rlhf / rag / agent |
| license_type | string | commercial / research / restricted |
| language | array | ISO 639-1 language codes |
| size_bytes | int64 | Package size in bytes |
| record_count | int64 | Number of samples |
| quality_score | float | 0.0–1.0 ENDATA quality assessment score |
| compliance_audit | object | Copyright audit info incl. licensing chain ID |
Example metadata.json
{
  "dataset_id": "EN-202603-0042",
  "modality": ["video", "text"],
  "domain": "film_tv",
  "task_family": ["pretrain", "sft"],
  "license_type": "commercial",
  "language": ["zh", "en"],
  "size_bytes": 2576980377,
  "record_count": 142800,
  "quality_score": 0.94,
  "compliance_audit": {
    "audit_id": "CA-2026-0178",
    "copyright_chains": 3,
    "verified_by": "ENDATA Legal Team"
  }
}
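A metadata.json file can be checked against the Standard Fields table before it enters a pipeline. The following is a minimal sketch, not an official ENDATA validator: the function name `validate_metadata` is hypothetical, the ID pattern assumes six digits for YYYYMM and four digits for XXXX (as in the example), and it assumes array fields must be non-empty.

```python
import re

# Allowed values copied from the Standard Fields table.
MODALITIES = {"video", "image", "text"}
DOMAINS = {"film_tv", "social", "ecommerce", "copyright"}
TASK_FAMILIES = {"pretrain", "sft", "rlhf", "rag", "agent"}
LICENSE_TYPES = {"commercial", "research", "restricted"}

# Assumption: YYYYMM is six digits and XXXX is four digits, as in EN-202603-0042.
DATASET_ID_RE = re.compile(r"^EN-\d{6}-\d{4}$")


def validate_metadata(meta: dict) -> list[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []

    if not DATASET_ID_RE.match(str(meta.get("dataset_id", ""))):
        problems.append("dataset_id does not match EN-YYYYMM-XXXX")

    def check_array(field, allowed):
        values = meta.get(field)
        # Assumption: these arrays must contain at least one value.
        if not isinstance(values, list) or not values:
            problems.append(f"{field} must be a non-empty array")
        elif not set(values) <= allowed:
            problems.append(f"{field} has unknown values: {sorted(set(values) - allowed)}")

    check_array("modality", MODALITIES)
    check_array("task_family", TASK_FAMILIES)

    if meta.get("domain") not in DOMAINS:
        problems.append("domain is not one of the documented values")
    if meta.get("license_type") not in LICENSE_TYPES:
        problems.append("license_type is not one of the documented values")

    languages = meta.get("language")
    if not isinstance(languages, list) or not all(
        isinstance(code, str) and re.fullmatch(r"[a-z]{2}", code) for code in languages
    ):
        problems.append("language must be an array of two-letter ISO 639-1 codes")

    for field in ("size_bytes", "record_count"):
        value = meta.get(field)
        if not isinstance(value, int) or isinstance(value, bool) or value < 0:
            problems.append(f"{field} must be a non-negative integer")

    score = meta.get("quality_score")
    if (
        not isinstance(score, (int, float))
        or isinstance(score, bool)
        or not 0.0 <= score <= 1.0
    ):
        problems.append("quality_score must be a number between 0.0 and 1.0")

    if not isinstance(meta.get("compliance_audit"), dict):
        problems.append("compliance_audit must be an object")

    return problems
```

Calling `validate_metadata(json.load(f))` on the example file above should return an empty list. The sketch only checks shape and enumerations; it does not verify the audit record itself.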
VLA-specific Extensions
Video datasets include additional action-sequence fields, compatible with RLDS and LeRobot v3:
{
  "vla_schema": "rlds_v1.2",
  "fps": 30,
  "resolution": "1080p",
  "action_dim": 7,
  "observation_keys": ["image", "depth", "proprio"],
  "task_annotations": true,
  "language_instruction": true
}
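The extension block can be used to sanity-check individual steps of an episode. The sketch below assumes an RLDS-style step, that is, a dict with an "observation" dict and an "action" sequence; the exact record layout of ENDATA packages is not specified here, so `check_step` and the step layout are illustrative assumptions.

```python
def check_step(step: dict, vla: dict) -> list[str]:
    """Compare one step against the VLA extension block and list any mismatches."""
    problems = []

    # Every declared observation key should be present in the step.
    observation = step.get("observation", {})
    missing = [key for key in vla["observation_keys"] if key not in observation]
    if missing:
        problems.append(f"missing observation keys: {missing}")

    # The action vector length should equal action_dim.
    action = step.get("action", [])
    if len(action) != vla["action_dim"]:
        problems.append(f"action has {len(action)} values, expected {vla['action_dim']}")

    # Assumption: when language_instruction is true, each step carries the instruction.
    if vla.get("language_instruction") and not step.get("language_instruction"):
        problems.append("language_instruction is declared but missing from the step")

    return problems


def clip_duration_seconds(num_frames: int, fps: int) -> float:
    """Duration of a clip given its frame count and the declared fps."""
    return num_frames / fps
```

With the block above (`fps` of 30), a 90-frame clip lasts `clip_duration_seconds(90, 30)` seconds, which is 3.0.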
Sample Datasets
ENDATA provides free small-scale sample packages (100–5,000 records) for technical evaluation and pipeline integration testing. Sample data has the same structure as the full datasets, including complete metadata, annotations, and licensing information.
Available Samples
| Dataset | Size | Modality | Format | Apply |
|---|---|---|---|---|
| Film/TV SFT Sample | 1,000 records | Text | JSONL | Request → |
| Short-video VLA Action | 200 clips | Video | RLDS | Request → |
| E-commerce Multimodal | 500 pairs | Image+Text | Parquet | Request → |
| Social RAG Knowledge | 2,000 records | Text | JSONL | Request → |
| Cross-border Multilingual | 3,000 records | Text | CSV | Request → |
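Because sample packages carry the same metadata as full datasets, a first integration test can compare the number of records in a JSONL sample with the `record_count` field. This is a sketch under two assumptions: that a JSONL sample holds one JSON record per line, and that the sample's metadata.json reports the sample's own record count. The helper names are hypothetical.

```python
import json


def count_jsonl_records(lines) -> int:
    """Count JSON records in an iterable of text lines, skipping blank lines."""
    count = 0
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        try:
            json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number} is not valid JSON") from exc
        count += 1
    return count


def record_count_matches(lines, metadata: dict) -> bool:
    """True when the number of records equals metadata['record_count']."""
    return count_jsonl_records(lines) == metadata["record_count"]
```

Any iterable of lines works, so an open file object can be passed directly, for example `record_count_matches(open("sample.jsonl", encoding="utf-8"), metadata)`, where the file name is a placeholder.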
Request Process
① Email cs@endata.com.cn with your company name, the dataset name, and the intended use
② Our commercial team replies within 1 business day with an NDA signing link
③ A sample download link is delivered within 24 hours of NDA signing (valid for 7 days)
Integration Methods
ENDATA datasets support three integration methods to meet different security levels and infrastructure requirements:
| Method | Use Case | Security |
|---|---|---|
| ① API Pull | Standard subscription, continuous updates | TLS 1.3 encryption |
| ② Object Storage Push | Large batch data delivery | End-to-end encryption |
| ③ Private Cloud | High-security scenarios (finance/medical/defense) | Physical isolation |
① API Integration (Recommended)
POST https://api.endata.com.cn/v1/datasets/query
Authorization: Bearer <YOUR_API_KEY>
Content-Type: application/json

{
  "dataset_id": "EN-202603-0042",
  "filters": { "task_family": "sft", "language": "zh" },
  "limit": 1000,
  "format": "jsonl"
}
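The same request can be assembled with the Python standard library. The sketch below only builds the request object and does not send it; the helper name `build_query_request` is hypothetical, and the response format is not described here, so reading the response is left to the caller.

```python
import json
import urllib.request

QUERY_URL = "https://api.endata.com.cn/v1/datasets/query"


def build_query_request(api_key, dataset_id, filters, limit=1000, fmt="jsonl"):
    """Build (but do not send) the POST request shown above."""
    payload = {
        "dataset_id": dataset_id,
        "filters": filters,
        "limit": limit,
        "format": fmt,
    }
    return urllib.request.Request(
        QUERY_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_query_request(
    "YOUR_API_KEY",
    "EN-202603-0042",
    {"task_family": "sft", "language": "zh"},
)
# To send it: urllib.request.urlopen(request) returns a file-like response object.
```

Keeping request construction separate from sending makes it easy to inspect the payload in tests before any credentials are used.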
② Object Storage Push (OSS / S3)
Datasets can be pushed directly to your AWS S3, Aliyun OSS, or Tencent Cloud COS bucket. Encryption keys are held by the customer.
# Configuration example (Aliyun OSS)
{
  "delivery_mode": "oss_push",
  "endpoint": "oss-cn-shanghai.aliyuncs.com",
  "bucket": "your-private-bucket",
  "prefix": "endata/datasets/",
  "encryption": "SSE-KMS",
  "kms_key_id": "your-kms-key-id"
}
③ Private Cloud Deployment
For enterprises that prohibit data egress, ENDATA engineers provide:
· AES-256 encrypted offline media delivery
· Optional Docker images for a private annotation platform
· Pipeline adapters (PyTorch / JAX / TF)
REST API v1
Base URL
https://api.endata.com.cn/v1
Authentication
All requests require an API Key in the Authorization header:
Authorization: Bearer YOUR_API_KEY
Main Endpoints
| Method | Path | Description |
|---|---|---|
| GET | /datasets | List available datasets |
| GET | /datasets/{id} | Get dataset details |
| POST | /datasets/query | Query with filters and stream results |
| GET | /datasets/{id}/schema | Get schema definition |
| POST | /datasets/{id}/export | Trigger batch export |
| GET | /jobs/{job_id} | Query export job status |
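The export and job endpoints suggest a trigger-then-poll workflow: start an export, then query the job until it finishes. The response bodies are not documented here, so the sketch below is an assumption throughout: it supposes that the job lookup returns a dict with a "status" field, and the status values "succeeded" and "failed" are hypothetical. The HTTP call is injected as `get_job` so that the polling logic stays independent of any client library.

```python
import time


def wait_for_job(get_job, job_id, poll_seconds=5.0, timeout_seconds=600.0, sleep=time.sleep):
    """Poll get_job(job_id) until the job reaches a terminal state or time runs out.

    get_job is any callable that performs GET /jobs/{job_id} and returns a dict.
    """
    terminal = {"succeeded", "failed"}  # hypothetical status values
    waited = 0.0
    while True:
        job = get_job(job_id)
        if job.get("status") in terminal:
            return job
        if waited >= timeout_seconds:
            raise TimeoutError(f"job {job_id} not finished after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

With the default 5-second interval, a poller issues 12 requests per minute, well under the standard rate limit.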
Rate Limits
Default: 100 requests per minute (standard subscription). Enterprise subscriptions support custom limits. Requests that exceed the limit receive HTTP 429.
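A client can handle HTTP 429 by retrying with exponential backoff. The following is a generic sketch, not ENDATA-specific behavior: `call_with_backoff` is a hypothetical helper, `send` stands for whatever function performs the request and returns a `(status_code, body)` pair, and the delays are illustrative. Whether the API returns a Retry-After header is not stated, so the sketch does not rely on one.

```python
import time


def call_with_backoff(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a status other than 429 or attempts run out.

    send is any zero-argument callable returning (status_code, body).
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            # Wait 1s, 2s, 4s, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
    # Still rate limited after the last attempt; hand the 429 back to the caller.
    return status, body
```

At 100 requests per minute, spacing calls at least 0.6 seconds apart keeps a single client under the default limit without triggering retries.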
pip install endata-sdk

from endata import Client

client = Client(api_key="YOUR_KEY")

# Get dataset
ds = client.datasets.get("EN-202603-0042")

# Stream training data; train() stands for your own training step
for batch in ds.stream(batch_size=256):
    train(batch)
Copyright Authorization & Data Security
Three-Layer Authorization System
One of ENDATA's core advantages is a complete, traceable copyright authorization chain. Each dataset undergoes three layers of compliance review:
| Layer | Content | Documentation |
|---|---|---|
| Layer 1 | Picture/content rights holder authorization | Commercial License Agreement |
| Layer 2 | Music/score rights handling | Music license or removal/replacement statement |
| Layer 3 | Portrait/personal information | Portrait license or anonymization report |
Data Security Certifications
ISO 27701 Privacy Information Management System · GDPR + China PIPL compliance
ISO 20000 IT Service Management · full data product delivery lifecycle
Data Security Management Certification · State Administration for Market Regulation
AI Data Annotation Service Capability · China Academy of Information and Communications Technology
Customer Data Isolation
Each customer's datasets are stored in dedicated encrypted namespaces with RBAC-controlled access. Customer data is never used for ENDATA's internal model training or third-party sharing.
Compliance Document Package
Verified customers can request:
· Complete copyright authorization chain documentation
· ISO 27001 / 27701 certification copies
· Data Security Management Certificate
· Data Processing Agreement (DPA) template
· Compliance due diligence support materials
Frequently Asked Questions
Q: What's the minimum dataset size for purchase?
Standard datasets start from 100GB or 100K records. Custom subset orders can be tailored to specific scenarios; please contact our team to discuss requirements.
Q: Do you provide custom annotation services?
Yes. ENDATA has CAICT-certified annotation capability for custom labeling requirements including bounding boxes, semantic segmentation, action labeling, and instruction tuning.
Q: How is data updated and versioned?
Standard subscription datasets are continuously refreshed. Each version has a unique version ID; customers can subscribe to incremental updates or pin to a specific version.
Q: What is the typical delivery timeline?
Standard datasets: 3–5 business days from contract. Custom datasets: 2–8 weeks depending on scale and annotation complexity.
Q: Do you support overseas data delivery?
Yes. ENDATA has cross-border data delivery experience and has completed China's security assessment for outbound data transfer for select datasets. Contact us for a dataset-specific compliance review.