Component description / functionalities
This dataset is a metadata-enriched, multilingual sample of the original FinePDFs dataset. FinePDFs is a large-scale collection of document-level texts extracted from PDF files, sourced primarily from Common Crawl. It emphasizes high-quality document extraction, structural coherence, and broad coverage of the technical, scientific, educational, and administrative content commonly distributed as PDF. This release contains a 75,000-document sample drawn from FinePDFs and enriched with additional symbolic metadata layers designed to improve interpretability, licensing awareness, and semantic analysis of document-based web content.