Migrating to v0.9.x

Multiformat and binary formats

chunk_file() encoding

chunk_file() and Chunker.chunk_file() now take an explicit encoding argument (default: utf-8). If you relied on the previous implicit default, behavior is unchanged for UTF-8 text files; for files in any other encoding, pass encoding explicitly.
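To see why an explicit encoding matters for non-UTF-8 files, the following standard-library sketch writes a Latin-1 file and shows that decoding it as UTF-8 fails. The file name is hypothetical, and the commented omnichunk call at the end is an assumption about the call shape; this guide only states that an encoding argument exists.

```python
import tempfile
from pathlib import Path


def decodes_as(path: Path, encoding: str) -> bool:
    """Return True if the file at `path` decodes cleanly with `encoding`."""
    try:
        path.read_bytes().decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


with tempfile.TemporaryDirectory() as tmp:
    legacy = Path(tmp) / "legacy_notes.txt"  # hypothetical file name
    legacy.write_bytes("café".encode("latin-1"))  # b"caf\xe9" is not valid UTF-8

    utf8_ok = decodes_as(legacy, "utf-8")      # decoding with the default fails
    latin1_ok = decodes_as(legacy, "latin-1")  # decoding with the real encoding works

# Assumed call shape for such a file (keyword name taken from the text above):
#   chunk_file("legacy_notes.txt", encoding="latin-1")
```

A file like this is the case where relying on the utf-8 default is not enough.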

Structured formats (.ipynb, .tex, .pdf, .docx) are routed through dedicated loaders when you use chunk_file() or chunk_directory(). Do not read .pdf or .docx files as plain text before chunking; pass the path to chunk_file() instead.

Chunker.chunk() with strings

.ipynb and .tex content can be passed to chunk() as a string. .pdf and .docx content cannot, because these are binary formats; use chunk_file() instead.
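The rule above can be expressed as a small pre-check before deciding how to hand a file to the chunker. This is a minimal sketch and not part of omnichunk: the helper name can_chunk_as_string and the suffix set are hypothetical, built only from the extensions listed in this guide.

```python
from pathlib import Path

# Binary formats named in this guide; these must be passed by path to chunk_file().
BINARY_STRUCTURED = {".pdf", ".docx"}


def can_chunk_as_string(path: str) -> bool:
    """Return False for binary formats whose content cannot be passed as a string."""
    return Path(path).suffix.lower() not in BINARY_STRUCTURED
```

Callers would read the file and pass its text to chunk() only when the check returns True, and otherwise pass the path to chunk_file().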

New APIs (additive)

  • dedup_chunks() for near-duplicate removal
  • evaluate_chunks() and the omnichunk eval CLI command for offline quality metrics
  • chunk_from_dict() for JSONL round-trips
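For context on the JSONL round-trip mentioned in the list above: JSONL stores one JSON object per line. The sketch below uses only the standard library and plain dicts. The field names are hypothetical, and the point where each loaded dict would presumably be handed to chunk_from_dict() is left as a comment, since its exact signature is not given in this guide.

```python
import io
import json

# Hypothetical chunk records; real chunk fields may differ.
records = [
    {"text": "def f(): ...", "index": 0},
    {"text": "class A: ...", "index": 1},
]

# Write: one JSON object per line.
buffer = io.StringIO()
for record in records:
    buffer.write(json.dumps(record) + "\n")

# Read back: each non-empty line is parsed into a dict.
buffer.seek(0)
loaded = [json.loads(line) for line in buffer if line.strip()]
# Assumption: each dict in `loaded` is the kind of input chunk_from_dict() rebuilds a chunk from.
```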

Breaking changes

Review the changelog for edge-case behavior changes in loaders; typical chunking pipelines remain compatible.