
Building FineWeb-Legal: A 10B Token Pilot


I’ve been wanting to work on legal AI for a while. The problem is data. The good stuff (case law, statutes, court filings) lives behind expensive paywalls. Westlaw and LexisNexis aren’t exactly giving away access.

So I built something myself.

The idea

FineWeb is this massive dataset of web crawl data that HuggingFace released. It’s already cleaned up and ready to use. I figured: there’s got to be legal content in there. Court opinions get published online. Government sites post regulations. Law schools put up research papers.

Someone just has to filter that data.

What I built

  1. A heuristic filter to find candidate legal documents, looking for keywords like “plaintiff” and “statute” and citation patterns like “U.S.C.” or “F.3d” (a rough sketch follows this list)
  2. A classifier trained on 6,500 samples I annotated with Mistral
  3. A pipeline to score millions of documents
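
To give a feel for step 1, here is a minimal sketch of that kind of heuristic pre-filter. The keyword list, citation regex, and threshold are illustrative, not the exact ones I used:

```python
import re

# Keywords that tend to show up in legal prose (illustrative list).
LEGAL_KEYWORDS = re.compile(
    r"\b(plaintiff|defendant|statute|appellant|appellee|pursuant to)\b",
    re.IGNORECASE,
)

# Citation patterns like "42 U.S.C. § 1983", "123 F.3d 456", "550 U.S. 544".
CITATION_PATTERN = re.compile(
    r"\b\d+\s+(U\.S\.C\.|U\.S\.|F\.\s?(2d|3d|4th)|S\.\s?Ct\.)\s+§?\s*\d+"
)

def looks_legal(text: str, min_keyword_hits: int = 3) -> bool:
    """Cheap candidate filter: keep a document if it cites a statute or
    case reporter, or mentions enough legal keywords."""
    keyword_hits = len(LEGAL_KEYWORDS.findall(text))
    citation_hits = len(CITATION_PATTERN.findall(text))
    return citation_hits > 0 or keyword_hits >= min_keyword_hits
```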

The classifier ended up at 97.99% F1, which is better than I expected. It’s a LoRA adapter on Gemma-Embedding-300M, so it’s small and fast. I’m renting some GPUs for a few experiments, but I don’t have an infinite budget, so I try to keep things at a tiny scale most of the time.
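
Scoring documents then looks roughly like the sketch below. Everything here is hedged: the adapter repo name is hypothetical, the base checkpoint id is my guess at where Gemma-Embedding-300M lives on the Hub, and wrapping it with a sequence-classification head (and label 1 meaning “legal”) is an assumption about the wiring, not a description of the exact setup:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel

BASE = "google/embeddinggemma-300m"              # assumed base checkpoint id
ADAPTER = "NoeFlandre/fineweb-legal-classifier"  # hypothetical adapter repo

tokenizer = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForSequenceClassification.from_pretrained(BASE, num_labels=2)
model = PeftModel.from_pretrained(base, ADAPTER).eval()

@torch.no_grad()
def legal_score(texts: list[str]) -> list[float]:
    """Probability that each document is legal content."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    probs = model(**batch).logits.softmax(dim=-1)
    return probs[:, 1].tolist()  # assumes label 1 == "legal"

print(legal_score(["The plaintiff filed suit under 42 U.S.C. § 1983 ..."]))
```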

The result

52,132 documents. 66.9 million words. Mostly case law from sites like openjurist.org and findacase.com, plus Federal Register filings and some academic content.

I split it into three tiers:

What’s next

This is just the 10-billion-token sample. The full FineWeb corpus is 18.5T tokens. Scaling the pipeline up is the next step (though it might get expensive GPU-wise).

If you want to use the dataset:

https://huggingface.co/datasets/NoeFlandre/fineweb-legal-pilot
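
Loading it should be a one-liner with the datasets library; the split name and the text column here are assumptions about how the pilot is laid out on the Hub:

```python
from datasets import load_dataset

# "train" split and "text" column are assumptions about the repo layout.
ds = load_dataset("NoeFlandre/fineweb-legal-pilot", split="train")
print(ds[0]["text"][:200])
```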

If you want the code and the classifier:

https://github.com/NoeFlandre/fineweb-legal

3rd January 2026

Happy New Year!

