text, literature, books, public-domain, gutenberg, NLP, digital-humanities, genre-classification, multilingual, library-of-congress, bibliometrics, author-metadata, literary-history

Global Public Domain Books Catalog — 75,000+ Literary Works with Genre, Era & Classification (1971–2025)

Category: Text
Records: 75,545 rows
Format: CSV
Update Frequency: One-time snapshot
Collection Method: Uploaded
PII: None detected
Downloads: 1

About this data

Comprehensive catalog of 75,545 public domain literary works from Project Gutenberg, enriched with genre classification, literary era mapping, and Library of Congress subject area categorization. Covers works in 58+ languages from ancient texts to early 20th-century literature.

**Sources:**

- Project Gutenberg digital library catalog (primary metadata: titles, authors, dates, subjects, Library of Congress Classification)
- Library of Congress Classification scheme (subject area mapping)
- Literary period taxonomy (era classification from Medieval through Contemporary)
- Custom NLP-derived genre classification across 20+ categories

**Schema (23 columns):**

- `gutenberg_id`: Unique Project Gutenberg text identifier
- `title`: Full title of the work
- `author`: Primary author name (normalized to "First Last" format)
- `author_birth_year` / `author_death_year`: Author life dates
- `num_authors`: Number of credited authors
- `language_code`: ISO language code
- `language`: Full language name
- `issued_date`: Date digitized/added to Project Gutenberg
- `primary_subject`: Primary subject heading
- `subject_count`: Total number of subject headings
- `locc_classification`: Library of Congress Classification code(s)
- `locc_area`: Mapped LoCC broad subject area
- `genre`: Derived genre (Fiction, Poetry, History, Science Fiction, Mystery, etc.)
- `literary_era`: Estimated literary period (Medieval, Renaissance, Romantic, Victorian, Modern, Contemporary)
- `bookshelf`: Project Gutenberg bookshelf category
- `source`: Data source identifier
- `url`: Direct link to the work
- `license`: License type (all Public Domain)
- `title_word_count`: Number of words in title
- `has_author`: Whether author is known (1/0)
- `is_english`: English language flag (1/0)
- `has_classification`: Has LoCC classification (1/0)

**Coverage:** 75,545 unique works across 58+ languages. The catalog includes 60K+ English works, sizable French (4K), Finnish (3.5K), and German (2.3K) collections, and 50+ other languages. Literary eras span from Ancient/Medieval through Contemporary.

**Use cases:** Literary analysis, NLP training data catalogs, bibliometric research, digital humanities, author network analysis, genre classification benchmarking, language diversity studies, cultural heritage research.
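As a minimal sketch of how the catalog might be explored, the Python snippet below reads rows with the standard-library `csv` module, filters on `language_code`, and tallies `genre` values. The column names come from the schema in the description; the inline sample rows are synthetic placeholders rather than records from the dataset, and the helper names are hypothetical.

```python
import csv
import io
from collections import Counter

# Synthetic placeholder rows for illustration only (not real catalog records).
# Only a few of the 23 columns are included to keep the sketch short.
SAMPLE_CSV = """gutenberg_id,title,author,language_code,genre
1,Example Title A,Author One,en,Fiction
2,Example Title B,Author Two,fr,Poetry
3,Example Title C,Author Three,en,Fiction
4,Example Title D,,en,History
"""


def read_rows(handle):
    """Return every CSV row as a dict keyed by column name."""
    return list(csv.DictReader(handle))


def filter_by_language(rows, language_code):
    """Keep rows whose language_code matches exactly."""
    return [row for row in rows if row["language_code"] == language_code]


def genre_counts(rows):
    """Count how many rows fall under each genre label."""
    return Counter(row["genre"] for row in rows)


rows = read_rows(io.StringIO(SAMPLE_CSV))
english = filter_by_language(rows, "en")
counts = genre_counts(english)
print(len(english), dict(counts))
```

To run this against the downloaded file, the `io.StringIO(SAMPLE_CSV)` argument could be swapped for `open("gutenberg_catalog.csv", newline="", encoding="utf-8")`, where the file name is an assumption and should match whatever the download is actually called.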

Schema

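| Name | Type | Description |
| --- | --- | --- |
| gutenberg_id | string | |
| title | string | |
| author | string | |
| author_birth_year | string | |
| author_death_year | string | |
| num_authors | string | |
| language_code | string | |
| language | string | |
| issued_date | string | |
| primary_subject | string | |
| subject_count | string | |
| locc_classification | string | |
| locc_area | string | |
| genre | string | |
| literary_era | string | |
| bookshelf | string | |
| source | string | |
| url | string | |
| license | string | |
| title_word_count | string | |
| has_author | string | |
| is_english | string | |
| has_classification | string | |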
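
Because the schema lists every column as `string`, numeric and flag fields likely need casting after loading. The sketch below shows one way to do that in Python. It assumes blank cells mean "missing" and that flags are stored as the literal text `1` or `0`; neither is confirmed by the listing, and the helper names are hypothetical.

```python
# Columns the description treats as integers or as 1/0 flags.
INT_COLUMNS = (
    "author_birth_year",
    "author_death_year",
    "num_authors",
    "subject_count",
    "title_word_count",
)
FLAG_COLUMNS = ("has_author", "is_english", "has_classification")


def to_int(value):
    """Parse an integer, returning None for blank or non-numeric text."""
    try:
        return int(value.strip())
    except (AttributeError, ValueError):
        return None


def to_flag(value):
    """Map "1" to True and "0" to False; anything else becomes None."""
    return {"1": True, "0": False}.get((value or "").strip())


def cast_row(row):
    """Return a copy of a row dict with typed numeric and flag columns."""
    typed = dict(row)
    for name in INT_COLUMNS:
        if name in typed:
            typed[name] = to_int(typed[name])
    for name in FLAG_COLUMNS:
        if name in typed:
            typed[name] = to_flag(typed[name])
    return typed


# Synthetic row for illustration only (not a real catalog record).
example = cast_row({
    "gutenberg_id": "1",
    "author_birth_year": "1800",
    "author_death_year": "",
    "title_word_count": "3",
    "has_author": "1",
    "is_english": "0",
})
print(example)
```

If the file stores years in a float-like form such as `1800.0`, `to_int` as written would return `None`, so it is worth inspecting a few raw values before relying on this conversion.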

Free

Open dataset

Quality: 4.8 / 5

For AI Agents

Via MCP Server
# 1. Add to your agent's MCP config (claude_desktop_config.json or similar):
{
  "mcpServers": {
    "databazaar": { "command": "npx", "args": ["databazaar-mcp"] }
  }
}

# 2. Your agent can then call:
search_datasets({ query: "Global Public Domain Books Cat" })
// Found: 9e20a575-5493-47d9-b71b-ad0dc12be01a
get_download_url({ dataset_id: "9e20a575-5493-47d9-b71b-ad0dc12be01a" })  // free — no API key needed

Via REST API
# Free dataset — no API key required:
curl https://api.databazaar.io/datasets/9e20a575-5493-47d9-b71b-ad0dc12be01a/download-url
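
For scripted use, the same request could be made from Python with the standard-library `urllib`. This is a sketch only: the endpoint and dataset ID are taken from the curl example, but the response format is not documented in this listing, so the function returns the raw body without parsing it. The function names are hypothetical.

```python
import urllib.request

API_BASE = "https://api.databazaar.io"
DATASET_ID = "9e20a575-5493-47d9-b71b-ad0dc12be01a"


def build_download_url_endpoint(dataset_id, api_base=API_BASE):
    """Build the download-url endpoint shown in the curl example."""
    return f"{api_base}/datasets/{dataset_id}/download-url"


def fetch_download_info(dataset_id, timeout=30):
    """GET the endpoint and return the raw response body as text.

    The response schema is not documented in the listing, so no parsing
    is attempted here.
    """
    endpoint = build_download_url_endpoint(dataset_id)
    with urllib.request.urlopen(endpoint, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Building the URL needs no network access; the fetch itself is left to the caller.
print(build_download_url_endpoint(DATASET_ID))
```

Calling `fetch_download_info(DATASET_ID)` performs the actual network request. It is worth inspecting the returned text before assuming it is JSON or a bare URL.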