Skip to content

bytesense v0.1.0

Release date: 2025-03-26

This is the first public release of bytesense: a charset / encoding detector for raw bytes that stays dependency-free at runtime, explains its decisions with a why field, and can optionally use a Rust extension for faster scoring—without shipping ML models or extra Python packages.

Highlights

  • Drop-in style API next to chardet / charset-normalizer via detect(), plus richer helpers (from_bytes, from_path, from_fp, is_binary).
  • Streaming-first design: detect_stream(), StreamDetector with stability-based stopping, snapshots, and optional HTML/XML in-band encoding hints (meta charset, XML declaration).
  • Mojibake repair: repair() / repair_bytes(), is_mojibake(), structured RepairResult.
  • Multi-encoding documents: detect_multi() with DocumentSegment results.
  • Standalone HTTP/content hints: hint_from_http_headers(), hint_from_content(), best_hint().
  • Fingerprinting over byte distributions with a pre-computed lookup table for consistent, fast candidate ranking.
  • CLI entry point: bytesense <file>.
  • Optional acceleration: pip install "bytesense[fast]" for the native wheel where available.
  • Typing: py.typed and full annotations; mypy-friendly layout.
  • CI: GitHub Actions across Linux, macOS, Windows and Python 3.8–3.13.

Notes for integrators

  • Runtime Python dependencies remain zero; the Rust piece is optional and falls back to pure Python.
  • Benchmarks and accuracy tables in the repo are intended as regression checks on synthetic and published corpora—always validate on your own traffic.

Full change list

See CHANGELOG.md for the detailed Added / Changed / Fixed sections for this version.