rsync_ext: Unicode decode errors block all subsequent lines #31

Open
opened 2026-02-05 04:05:42 +00:00 by snegov · 0 comments


Problem

When rsync_ext() encounters a line with invalid UTF-8 bytes, it never advances past that line, so every remaining line is skipped, including valid ones.

Root Cause

Lines 124-131 in curateipsum/fs.py:

try:
    prev_line = prev_line.decode("utf-8").strip()
except UnicodeDecodeError:
    _lg.error("Can't process rsync line: %s", prev_line)
    continue  # ← BUG: skips both yield AND prev_line update
_lg.debug("Rsync itemize line: %s", prev_line)
yield _parse_rsync_output(prev_line)
prev_line = line

The continue on line 128 skips:

  • Line 130: the yield (expected, we want to skip the invalid line)
  • Line 131: prev_line = line (NOT expected, causes the bug)

Example

If rsync outputs:

  1. b"\xff\xfe invalid" (invalid UTF-8)
  2. b">f+++++++++ valid1.txt\n"
  3. b">f+++++++++ valid2.txt\n"

Actual behavior:

  • Iteration 1: Store invalid line in prev_line
  • Iteration 2: Try decode invalid → error → continue → prev_line STILL invalid
  • Iteration 3: Try decode invalid → error → continue → prev_line STILL invalid
  • Result: 0 lines processed

Expected behavior:

  • Skip the invalid line, process valid1.txt and valid2.txt
  • Result: 2 lines processed
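
The stuck state in the actual-behavior trace can be reproduced outside the project with a minimal sketch. It is a simplified stand-in for the loop in rsync_ext(), not the real function: the name buggy_lines, the first-iteration buffering, and the flush after the loop are assumptions made for illustration, and the call to _parse_rsync_output() is left out so that the decoded string is yielded directly.

```python
import logging
from typing import Iterable, Iterator, Optional

_lg = logging.getLogger("rsync_repro")


def buggy_lines(raw_lines: Iterable[bytes]) -> Iterator[str]:
    """Hypothetical model of the rsync_ext() loop with the bug kept in."""
    prev_line: Optional[bytes] = None
    for line in raw_lines:
        if prev_line is None:
            # Assumed setup: the first iteration only buffers the line.
            prev_line = line
            continue
        try:
            decoded = prev_line.decode("utf-8").strip()
        except UnicodeDecodeError:
            _lg.error("Can't process rsync line: %s", prev_line)
            continue  # BUG: prev_line still holds the undecodable bytes
        yield decoded
        prev_line = line
    # Assumed flush of the last buffered line once the input is exhausted.
    if prev_line is not None:
        try:
            yield prev_line.decode("utf-8").strip()
        except UnicodeDecodeError:
            _lg.error("Can't process rsync line: %s", prev_line)


rsync_output = [
    b"\xff\xfe invalid",
    b">f+++++++++ valid1.txt\n",
    b">f+++++++++ valid2.txt\n",
]
processed = list(buggy_lines(rsync_output))
print(processed)  # prints []
```

Every iteration after the first retries the same invalid bytes, so neither valid line is ever decoded.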

Impact

If a backup contains files with names in non-UTF-8 encoding (e.g., legacy Windows-1251 Cyrillic filenames), rsync_ext() will fail to process ANY files after the first invalid filename, silently losing sync data.
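
As an illustration of how such a filename ends up undecodable, the sketch below encodes a Cyrillic name as Windows-1251 and then tries to read the bytes back as UTF-8. The filename is an arbitrary example, not taken from a real backup.

```python
# A Cyrillic filename as a legacy Windows-1251 system would store it.
name = "файл.txt"
raw = name.encode("cp1251")
print(raw)  # prints b'\xf4\xe0\xe9\xeb.txt'

# The same bytes are not valid UTF-8, so strict decoding fails.
try:
    raw.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError as exc:
    decoded_ok = False
    print("decode failed:", exc.reason)

# Decoding with the original code page recovers the name.
assert raw.decode("cp1251") == name
```

Any single file like this in the rsync output is enough to trigger the decode error described in this issue.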

Solution

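Update the except block (the continue on line 128) so that prev_line advances to the current line before continuing. This restores the invariant that prev_line always holds the most recently read, not yet processed line: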

except UnicodeDecodeError:
    _lg.error("Can't process rsync line: %s", prev_line)
    prev_line = line  # ← Add this line
    continue
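
The effect of the proposed change can be checked with a small self-contained sketch. As before, this is a simplified model rather than the real rsync_ext(): the name fixed_lines, the first-iteration buffering, and the flush after the loop are assumptions, and the decoded string is yielded instead of being passed to _parse_rsync_output().

```python
import logging
from typing import Iterable, Iterator, Optional

_lg = logging.getLogger("rsync_fix_sketch")


def fixed_lines(raw_lines: Iterable[bytes]) -> Iterator[str]:
    """Hypothetical model of the rsync_ext() loop with the fix applied."""
    prev_line: Optional[bytes] = None
    for line in raw_lines:
        if prev_line is None:
            # Assumed setup: the first iteration only buffers the line.
            prev_line = line
            continue
        try:
            decoded = prev_line.decode("utf-8").strip()
        except UnicodeDecodeError:
            _lg.error("Can't process rsync line: %s", prev_line)
            prev_line = line  # the fix: drop the bad line and move on
            continue
        yield decoded
        prev_line = line
    # Assumed flush of the last buffered line once the input is exhausted.
    if prev_line is not None:
        try:
            yield prev_line.decode("utf-8").strip()
        except UnicodeDecodeError:
            _lg.error("Can't process rsync line: %s", prev_line)


rsync_output = [
    b"\xff\xfe invalid",
    b">f+++++++++ valid1.txt\n",
    b">f+++++++++ valid2.txt\n",
]
result = list(fixed_lines(rsync_output))
print(result)  # prints ['>f+++++++++ valid1.txt', '>f+++++++++ valid2.txt']
```

With the extra assignment, the invalid line is logged once and skipped, and both valid lines come through. An alternative shape, if it suits the surrounding code, is try/except/else with the yield in the else branch and a single prev_line = line after the whole statement, so the update cannot be skipped on either path.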

Discovery

Found while writing tests for rsync_ext(). See test_handles_unicode_decode_error in tests/test_fs.py, which documents the current (buggy) behavior.


Reference: snegov/cura-te-ipsum#31