Finepdfs-Sample-75K-Meta

Component description / functionalities

This dataset is a metadata-enriched multilingual sample of the original FinePDFs dataset. FinePDFs is a large-scale collection of document-level texts extracted from PDF files, sourced primarily from Common Crawl. The dataset emphasizes high-quality document extraction, structural coherence, and large-scale coverage of technical, scientific, educational, and administrative content commonly distributed in PDF form. This release contains a 75,000-document sample drawn from FinePDFs and enriched with additional symbolic metadata layers designed to improve interpretability, licensing awareness, and semantic analysis of document-based web content.
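
As a rough illustration of how the licensing and language metadata might be used, the sketch below filters a few in-memory records. The field names (`text`, `language`, `license`), the sample values, and the helper `filter_records` are all hypothetical and are not taken from the dataset's actual schema; adapt them to the real column names before use.

```python
from typing import Iterable, Iterator

# Hypothetical records: the field names and values are illustrative only
# and do not reflect the real schema of the dataset.
SAMPLE_RECORDS = [
    {"text": "Annual report", "language": "en", "license": "CC-BY-4.0"},
    {"text": "Relazione annuale", "language": "it", "license": "unknown"},
    {"text": "Rapport annuel", "language": "fr", "license": "CC0-1.0"},
]


def filter_records(
    records: Iterable[dict],
    languages: set,
    allowed_licenses: set,
) -> Iterator[dict]:
    """Yield records whose language and license are both in the allowed sets."""
    for record in records:
        if (
            record.get("language") in languages
            and record.get("license") in allowed_licenses
        ):
            yield record


selected = list(
    filter_records(
        SAMPLE_RECORDS,
        languages={"en", "fr", "it"},
        allowed_licenses={"CC-BY-4.0", "CC0-1.0"},
    )
)
print([r["language"] for r in selected])
```

Using a generator keeps the filter usable on a streamed copy of the sample, where documents are read one at a time rather than loaded into memory.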

IPCEI CIS Reference Architecture

AI Layer
Data Layer

Open source license

ODC-BY-1.0

Keywords

pdf-text-corpus
document-ai
pdf-extraction-dataset
web-scraped-documents
commoncrawl-pdfs
large-scale-text-corpus
multilingual-documents
text-from-pdfs
dataset-sample
metadata-enriched-dataset
multilingual-dataset
english-italian-french-german-spanish
european-languages
cross-lingual-corpus
pretraining-corpus
llm-pretraining-data
language-model-training
document-understanding
long-context-training
retrieval-augmented-generation-data
structured-metadata
topic-classification
readability-metrics
quality-filtering
license-aware-dataset
common-crawl-derived
pdf-ocr-extraction
docling-rolmocr
web-document-mining
data-governance
dataset-enrichment