Migrating to v0.9.x
Multiformat and binary formats
chunk_file() encoding
chunk_file() / Chunker.chunk_file() now take an explicit encoding argument (default: utf-8). If you relied on the implicit default, behavior is unchanged for UTF-8 text files; files in any other encoding need the encoding passed explicitly.
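As a rough illustration of why the explicit argument matters, the sketch below uses only the standard library to show a non-UTF-8 file failing under the default and succeeding once the encoding is named. The helper read_text_explicit() is a hypothetical stand-in for the decode step, not part of the library, and the chunk_file() call in the closing comment assumes the keyword is spelled encoding, as the text above suggests.

```python
import tempfile
from pathlib import Path


def read_text_explicit(path: Path, encoding: str = "utf-8") -> str:
    """Hypothetical stand-in for the decode step (not a library function)."""
    return path.read_text(encoding=encoding)


with tempfile.TemporaryDirectory() as tmp:
    legacy = Path(tmp) / "legacy.txt"
    # Write a Latin-1 encoded file; the byte 0xE9 is not valid UTF-8 here.
    legacy.write_bytes("café".encode("latin-1"))

    try:
        read_text_explicit(legacy)  # falls back to the utf-8 default
        utf8_ok = True
    except UnicodeDecodeError:
        utf8_ok = False

    # Naming the real encoding decodes the file correctly.
    decoded = read_text_explicit(legacy, encoding="latin-1")

# Assumed library equivalent (keyword name not verified here):
#   chunk_file(str(legacy), encoding="latin-1")
```

If your corpus mixes encodings, it is probably safest to record the encoding per file and pass it through rather than rely on the default.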
Structured formats (.ipynb, .tex, .pdf, .docx) are routed through dedicated loaders when you use chunk_file() / chunk_directory(). Do not read .pdf / .docx files as plain text before chunking; pass the path to chunk_file() instead.
Chunker.chunk() with strings
.ipynb and .tex content can be passed to chunk() as a string. .pdf and .docx cannot, because they are binary formats; use chunk_file() instead.
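The rules above can be sketched as a small pre-flight check that a pipeline might run before deciding how to hand a file to the chunker. The function input_mode() and its return labels are hypothetical names chosen for this illustration; only the extension rules come from the text above.

```python
from pathlib import Path

# Extensions taken from the migration notes above.
STRING_OR_PATH = {".ipynb", ".tex"}  # may be passed to chunk() as string content
PATH_ONLY = {".pdf", ".docx"}        # binary; must go through chunk_file()


def input_mode(path: str) -> str:
    """Return how a file may be handed to the chunker (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    if suffix in PATH_ONLY:
        return "path"
    if suffix in STRING_OR_PATH:
        return "path-or-string"
    # Everything else is treated as ordinary text in this sketch.
    return "plain-text"


modes = {name: input_mode(name) for name in ("report.PDF", "notes.tex", "main.py")}
```

A check like this makes it harder to accidentally open a .pdf or .docx in text mode and pass the garbage to chunk().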
New APIs (additive)
dedup_chunks() for near-duplicate removal
evaluate_chunks() and the CLI command omnichunk eval for offline quality metrics
chunk_from_dict() for JSONL round-trips
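For readers unfamiliar with the JSONL round-trip that chunk_from_dict() is meant to support, here is a minimal standard-library sketch of the serialization side. The record fields ("text", "index") and both helper functions are hypothetical; the real chunk dictionary layout and the signature of chunk_from_dict() are not shown in these notes and should be checked against the library documentation.

```python
import json


def to_jsonl(records: list[dict]) -> str:
    """Serialize one JSON object per line (hypothetical helper)."""
    # json.dumps escapes newlines inside strings, so each record stays on one line.
    return "\n".join(json.dumps(record) for record in records)


def from_jsonl(payload: str) -> list[dict]:
    """Parse JSONL back into dictionaries, skipping blank lines."""
    return [json.loads(line) for line in payload.split("\n") if line.strip()]


# Illustrative records only; the field names are assumptions.
records = [
    {"text": "first chunk\nwith a line break", "index": 0},
    {"text": "second chunk", "index": 1},
]

restored = from_jsonl(to_jsonl(records))
# Each restored dict would presumably then be passed to chunk_from_dict().
```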
Breaking changes
Review the changelog for behavior changes in loader edge cases; typical chunking pipelines remain compatible.