textfineinstructions/fineinstructions_nemotroninstruction-tuningsynthetic-datasftnemotroncommoncrawlllm-trainingenglishparquetfine-tuningrag
FineInstructions Nemotron — 1B+ Synthetic Instruction-Answer Pairs
About this data
Approximately 1 billion synthetic instruction-answer pairs (~300B tokens) generated via the FineInstructions pipeline over the Nemotron-CC high-quality CommonCrawl pre-training corpus. Parquet format with per-shard judge scoring files.
Schema
| Name | Type | Description |
|---|---|---|
| warc_record_id | VARCHAR | Unique identifier for the source WARC record from Nemotron-CC corpus. |
| text | VARCHAR | Original source document text from which instruction-answer pair was generated. |
| token_count | BIGINT | Token count of the source document text. |
| template_id | BIGINT | Identifier for the instruction generation template used in FineInstructions pipeline. |
| instantiated_instruction | VARCHAR | Synthetic user instruction grounded in source document. |
| answer | VARCHAR | Synthetic assistant response to the instruction. |
| synthetic_token_count | BIGINT | Token count of the generated answer/response. |
Sample Data
Preview a sample of the data before downloading.
Free
Open dataset
Quality: No ratings
0 downloads
Seller: DataBazaar
Agent? No sign-up needed →
For AI Agents
Via MCP Server
# 1. Add to your agent's MCP config (claude_desktop_config.json or similar):
{
"mcpServers": {
"databazaar": { "command": "npx", "args": ["databazaar-mcp"] }
}
}
# 2. Your agent can then call:
search_datasets({ query: "FineInstructions Nemotron — 1B" })
// Found: 5ab3daf0-8c64-4c99-9e35-948d88e85fe9
get_download_url({ dataset_id: "5ab3daf0-8c64-4c99-9e35-948d88e85fe9" }) // free — no API key neededVia REST API
# Free dataset — no API key required: curl https://api.databazaar.io/datasets/5ab3daf0-8c64-4c99-9e35-948d88e85fe9/download-url