TECHNICAL DOCUMENTATION

Schema · Samples · Integration · API · Compliance

One-stop technical reference for ENDATA datasets, covering metadata specifications, sample data requests, multi-protocol integration, the REST API and Python SDK, and copyright and security compliance notes.

SCHEMA SPECIFICATIONS

Metadata Schema

ENDATA datasets adopt a unified metadata schema that is compatible with mainstream formats such as RLDS, LeRobot v3, and WebDataset and adds ENDATA-specific extension fields. Each dataset package ships with a metadata.json file describing its structure, licensing, and quality metrics.

Standard Fields

Field | Type | Description
dataset_id | string | Unique identifier, format EN-YYYYMM-XXXX
modality | array | ["video", "image", "text"] tri-modality enumeration
domain | string | film_tv / social / ecommerce / copyright
task_family | array | pretrain / sft / rlhf / rag / agent
license_type | string | commercial / research / restricted
language | array | ISO 639-1 language codes
size_bytes | int64 | Package size in bytes
record_count | int64 | Number of samples
quality_score | float | 0.0–1.0 ENDATA quality assessment score
compliance_audit | object | Copyright audit info incl. licensing chain ID

Example metadata.json

{
  "dataset_id": "EN-202603-0042",
  "modality": ["video", "text"],
  "domain": "film_tv",
  "task_family": ["pretrain", "sft"],
  "license_type": "commercial",
  "language": ["zh", "en"],
  "size_bytes": 2576980377,
  "record_count": 142800,
  "quality_score": 0.94,
  "compliance_audit": {
    "audit_id": "CA-2026-0178",
    "copyright_chains": 3,
    "verified_by": "ENDATA Legal Team"
  }
}
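
A consumer may want to sanity-check a metadata.json file against the Standard Fields table before feeding a package into a pipeline. The following is a minimal sketch, not an official ENDATA validator: the function name validate_metadata is hypothetical, the enumerations are copied from the table, and the ID pattern assumes that XXXX is four digits, as in the example ID (the source does not say so). The language codes are not checked.

```python
import re

# Allowed values copied from the Standard Fields table.
MODALITIES = {"video", "image", "text"}
DOMAINS = {"film_tv", "social", "ecommerce", "copyright"}
TASK_FAMILIES = {"pretrain", "sft", "rlhf", "rag", "agent"}
LICENSE_TYPES = {"commercial", "research", "restricted"}

# Assumption: XXXX in EN-YYYYMM-XXXX is four digits, as in the example ID.
DATASET_ID_RE = re.compile(r"^EN-\d{4}(0[1-9]|1[0-2])-\d{4}$")


def validate_metadata(meta: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []

    if not DATASET_ID_RE.match(str(meta.get("dataset_id", ""))):
        problems.append("dataset_id does not match EN-YYYYMM-XXXX")

    # Array-valued enumerations: every entry must be a known value.
    for field, allowed in (("modality", MODALITIES), ("task_family", TASK_FAMILIES)):
        values = meta.get(field)
        if not isinstance(values, list) or not values:
            problems.append(f"{field} must be a non-empty array")
        elif not all(isinstance(v, str) and v in allowed for v in values):
            problems.append(f"{field} contains unknown values")

    if meta.get("domain") not in DOMAINS:
        problems.append("domain is not a known value")
    if meta.get("license_type") not in LICENSE_TYPES:
        problems.append("license_type is not a known value")

    # bool is a subclass of int in Python, so exclude it explicitly.
    for field in ("size_bytes", "record_count"):
        value = meta.get(field)
        if not isinstance(value, int) or isinstance(value, bool) or value < 0:
            problems.append(f"{field} must be a non-negative integer")

    score = meta.get("quality_score")
    if isinstance(score, bool) or not isinstance(score, (int, float)) or not 0.0 <= score <= 1.0:
        problems.append("quality_score must be between 0.0 and 1.0")

    if not isinstance(meta.get("compliance_audit"), dict):
        problems.append("compliance_audit must be an object")

    return problems


example = {
    "dataset_id": "EN-202603-0042",
    "modality": ["video", "text"],
    "domain": "film_tv",
    "task_family": ["pretrain", "sft"],
    "license_type": "commercial",
    "language": ["zh", "en"],
    "size_bytes": 2576980377,
    "record_count": 142800,
    "quality_score": 0.94,
    "compliance_audit": {"audit_id": "CA-2026-0178", "copyright_chains": 3},
}
print(validate_metadata(example))  # prints []
```

Returning a list of problems rather than raising on the first one lets a pipeline report everything wrong with a package in a single pass.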

VLA-specific Extensions

Video datasets include additional action-sequence fields, compatible with RLDS and LeRobot v3:

{
  "vla_schema": "rlds_v1.2",
  "fps": 30,
  "resolution": "1080p",
  "action_dim": 7,
  "observation_keys": ["image", "depth", "proprio"],
  "task_annotations": true,
  "language_instruction": true
}
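
To make the extension fields concrete, the sketch below loads the block above into a small typed record and derives two quantities from it: the number of frames in a clip of a given duration (from fps) and the shape of an episode's action array (from action_dim). The class and method names are hypothetical and are not part of any ENDATA, RLDS, or LeRobot API. How actions align with video frames is not stated in the source, so the number of steps is left to the caller.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class VlaExtension:
    vla_schema: str
    fps: int
    resolution: str
    action_dim: int
    observation_keys: tuple
    task_annotations: bool
    language_instruction: bool

    @classmethod
    def from_dict(cls, raw: dict) -> "VlaExtension":
        """Build the record, failing loudly on missing or non-positive fields."""
        names = [f.name for f in fields(cls)]
        missing = [name for name in names if name not in raw]
        if missing:
            raise ValueError(f"missing VLA fields: {missing}")
        if raw["fps"] <= 0 or raw["action_dim"] <= 0:
            raise ValueError("fps and action_dim must be positive")
        # Keep only known fields and freeze the key list as a tuple.
        values = {name: raw[name] for name in names}
        values["observation_keys"] = tuple(raw["observation_keys"])
        return cls(**values)

    def frame_count(self, seconds: float) -> int:
        """Number of video frames in a clip of the given duration."""
        return round(self.fps * seconds)

    def action_shape(self, num_steps: int) -> tuple:
        """Shape of an episode's action array: one action_dim vector per step."""
        return (num_steps, self.action_dim)


ext = VlaExtension.from_dict({
    "vla_schema": "rlds_v1.2",
    "fps": 30,
    "resolution": "1080p",
    "action_dim": 7,
    "observation_keys": ["image", "depth", "proprio"],
    "task_annotations": True,
    "language_instruction": True,
})
print(ext.frame_count(10))    # prints 300
print(ext.action_shape(300))  # prints (300, 7)
```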

SAMPLE DOWNLOADS

Sample Datasets

ENDATA provides free small-scale sample packages (100–5,000 records) for technical evaluation and pipeline integration testing. Sample packages have the same structure as the full datasets, including complete metadata, annotations, and licensing information.

Available Samples

Dataset | Size | Modality | Format
Film/TV SFT Sample | 1,000 records | Text | JSONL
Short-video VLA Action | 200 clips | Video | RLDS
E-commerce Multimodal | 500 pairs | Image+Text | Parquet
Social RAG Knowledge | 2,000 records | Text | JSONL
Cross-border Multilingual | 3,000 records | Text | CSV

Request Process

① Email cs@endata.com.cn with your company name, the dataset name, and the intended use
② Our commercial team replies within 1 business day with an NDA signing link
③ The sample download link is delivered within 24 hours of NDA signing and stays valid for 7 days

INTEGRATION GUIDE

Integration Methods

ENDATA datasets support three integration methods to meet different security levels and infrastructure requirements:

Method | Use Case | Security
① API Pull | Standard subscription, continuous updates | TLS 1.3 encryption
② Object Storage Push | Large batch data delivery | End-to-end encryption
③ Private Cloud | High-security scenarios (finance/medical/defense) | Physical isolation

① API Integration (Recommended)

POST https://api.endata.com.cn/v1/datasets/query
Authorization: Bearer <YOUR_API_KEY>
Content-Type: application/json

{
  "dataset_id": "EN-202603-0042",
  "filters": { "task_family": "sft", "language": "zh" },
  "limit": 1000,
  "format": "jsonl"
}
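
The same request can be assembled with Python's standard library alone. The sketch below builds the POST request shown above without sending it; the function names are hypothetical. The response format is not documented here, so iter_records rests on the assumption that a "jsonl" response body is newline-delimited JSON.

```python
import json
import urllib.request

QUERY_URL = "https://api.endata.com.cn/v1/datasets/query"


def build_query_request(api_key, dataset_id, filters, limit=1000, fmt="jsonl"):
    """Build (but do not send) the POST request for /datasets/query."""
    body = {"dataset_id": dataset_id, "filters": filters, "limit": limit, "format": fmt}
    return urllib.request.Request(
        QUERY_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def iter_records(request, timeout=30):
    """Send the request and yield one parsed record per non-empty line.

    Assumption: with "format": "jsonl" the body is newline-delimited JSON.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        for line in response:
            if line.strip():
                yield json.loads(line)


request = build_query_request(
    api_key="YOUR_API_KEY",
    dataset_id="EN-202603-0042",
    filters={"task_family": "sft", "language": "zh"},
)
print(request.get_method(), request.full_url)
```

Separating request construction from sending keeps the payload easy to inspect and log before any network call is made.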

② Object Storage Push (OSS / S3)

Datasets can be pushed directly to your AWS S3, Aliyun OSS, or Tencent Cloud COS bucket. Encryption keys are held by the customer.

# Configuration example (Aliyun OSS)
{
  "delivery_mode": "oss_push",
  "endpoint": "oss-cn-shanghai.aliyuncs.com",
  "bucket": "your-private-bucket",
  "prefix": "endata/datasets/",
  "encryption": "SSE-KMS",
  "kms_key_id": "your-kms-key-id"
}
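
Before submitting a push configuration, it can help to check it locally. The sketch below is illustrative only: the function name is hypothetical, and which keys are mandatory is an assumption inferred from the example above, not a documented rule.

```python
# Assumption: these keys are mandatory because they all appear in the example.
REQUIRED_KEYS = ("delivery_mode", "endpoint", "bucket", "prefix", "encryption")


def check_push_config(config: dict) -> list[str]:
    """Return a list of problems found in a push-delivery config; empty means OK."""
    problems = [f"missing or empty key: {key}" for key in REQUIRED_KEYS if not config.get(key)]

    # The customer holds the encryption key, so SSE-KMS needs a key reference.
    if config.get("encryption") == "SSE-KMS" and not config.get("kms_key_id"):
        problems.append("SSE-KMS requires kms_key_id")

    # The example endpoint is a bare host name, so flag a scheme or path here.
    endpoint = str(config.get("endpoint", ""))
    if "://" in endpoint or "/" in endpoint:
        problems.append("endpoint should be a host name without scheme or path")

    return problems


config = {
    "delivery_mode": "oss_push",
    "endpoint": "oss-cn-shanghai.aliyuncs.com",
    "bucket": "your-private-bucket",
    "prefix": "endata/datasets/",
    "encryption": "SSE-KMS",
    "kms_key_id": "your-kms-key-id",
}
print(check_push_config(config))  # prints []
```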

③ Private Cloud Deployment

For enterprises that prohibit data egress, ENDATA engineers provide:

· AES-256 encrypted offline media delivery
· Optional Docker images for a private annotation platform
· Pipeline adapters (PyTorch / JAX / TF)

API REFERENCE

REST API v1

Base URL

https://api.endata.com.cn/v1

Authentication

All requests require an API Key in the Authorization header:

Authorization: Bearer YOUR_API_KEY

Main Endpoints

Method | Path | Description
GET | /datasets | List available datasets
GET | /datasets/{id} | Get dataset details
POST | /datasets/query | Conditional query and stream
GET | /datasets/{id}/schema | Get Schema definition
POST | /datasets/{id}/export | Trigger batch export
GET | /jobs/{job_id} | Query export job status
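
The two export endpoints suggest an asynchronous flow: POST /datasets/{id}/export triggers a job and GET /jobs/{job_id} reports its status. The polling loop below is a sketch under stated assumptions. The response schema is not documented here, so the "status" field and its "succeeded" and "failed" values are hypothetical, as is the function name. The HTTP call is passed in as fetch_status so the loop can be exercised without network access.

```python
import time


def wait_for_job(fetch_status, job_id, poll_seconds=5.0, timeout_seconds=600.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll a job until it reaches a terminal state and return its final record.

    fetch_status(job_id) -> dict is supplied by the caller and is expected to
    wrap GET /jobs/{job_id}. The "status" key and its values are assumptions.
    """
    deadline = clock() + timeout_seconds
    while True:
        job = fetch_status(job_id)
        status = job.get("status")
        if status == "succeeded":
            return job
        if status == "failed":
            raise RuntimeError(f"export job {job_id} failed: {job}")
        if clock() >= deadline:
            raise TimeoutError(f"export job {job_id} not finished after {timeout_seconds}s")
        sleep(poll_seconds)


# Usage with a stand-in for the real HTTP call.
fake_responses = iter([{"status": "running"}, {"status": "running"}, {"status": "succeeded"}])
waits = []
result = wait_for_job(lambda job_id: next(fake_responses), "job-123", sleep=waits.append)
print(result["status"], waits)  # prints succeeded [5.0, 5.0]
```

A fixed polling interval of a few seconds stays well inside the default rate limit of 100 requests per minute.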

Rate Limits

Default: 100 requests per minute (standard subscription). Enterprise subscriptions support custom limits. Requests over the limit receive HTTP 429.
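
A client can handle HTTP 429 by retrying with exponential backoff. The helper below is a minimal sketch, not part of the ENDATA SDK: its name and parameters are hypothetical, and because no Retry-After header is documented here it uses a fixed doubling schedule. The request itself is passed in as send so the retry logic stays independent of any HTTP library.

```python
import time


def call_with_backoff(send, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call send() until it returns a status other than 429.

    send() -> (status_code, body) is supplied by the caller. Delays double on
    each rate-limited attempt (1 s, 2 s, 4 s, ...) and are capped at max_delay.
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")


# Usage with a stand-in that is rate limited twice before succeeding.
responses = iter([(429, ""), (429, ""), (200, "ok")])
delays = []
print(call_with_backoff(lambda: next(responses), sleep=delays.append))  # prints (200, 'ok')
print(delays)  # prints [1.0, 2.0]
```

As a rule of thumb, spacing requests at least 0.6 seconds apart (60 s / 100) keeps a single client under the default limit, so retries should rarely be needed.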

Python SDK Quick Start

pip install endata-sdk

from endata import Client

client = Client(api_key="YOUR_KEY")

# Fetch a dataset handle by its dataset_id
ds = client.datasets.get("EN-202603-0042")

# Stream training data in batches of 256 records
for batch in ds.stream(batch_size=256):
    train(batch)  # train() stands for your own training step

COMPLIANCE NOTES

Copyright Authorization & Data Security

Three-Layer Authorization System

One of ENDATA's core advantages is a complete, traceable copyright authorization chain. Each dataset undergoes three layers of compliance review:

Layer | Content | Documentation
Layer 1 | Picture/content rights holder authorization | Commercial License Agreement
Layer 2 | Music/score rights handling | Music license or removal/replacement statement
Layer 3 | Portrait/personal information | Portrait license or anonymization report

Data Security Certifications

ISO 27001 Information Security Management System · covering full data lifecycle
ISO 27701 Privacy Information Management System · GDPR + China PIPL compliance
ISO 20000 IT Service Management · full data product delivery lifecycle
Data Security Management Certification · State Administration for Market Regulation
AI Data Annotation Service Capability · China Academy of Information and Communications Technology

Customer Data Isolation

Each customer's datasets are stored in dedicated encrypted namespaces with RBAC-controlled access. Customer data is never used for ENDATA's internal model training and is never shared with third parties.

Compliance Document Package

Verified customers can request:

· Complete copyright authorization chain documentation
· ISO 27001 / 27701 certification copies
· Data Security Management Certificate
· Data Processing Agreement (DPA) template
· Compliance due diligence support materials

FAQ

Frequently Asked Questions

Q: What's the minimum dataset size for purchase?

Standard datasets start from 100 GB or 100K records. Custom subset orders can be scoped to specific scenarios; please contact our team to discuss requirements.

Q: Do you provide custom annotation services?

Yes. ENDATA has CAICT-certified annotation capability for custom labeling requirements including bounding boxes, semantic segmentation, action labeling, and instruction tuning.

Q: How is data updated and versioned?

Standard subscription datasets are continuously refreshed. Each version has a unique version ID; customers can subscribe to incremental updates or pin to a specific version.

Q: What is the typical delivery timeline?

Standard datasets: 3–5 business days from contract signing. Custom datasets: 2–8 weeks, depending on scale and annotation complexity.

Q: Do you support overseas data delivery?

Yes. ENDATA has cross-border data delivery experience and has completed China's security assessment for outbound data transfer for select datasets. Contact us for a dataset-specific compliance review.