(Flawed) Iterative heap sort for files

I've recently been working on a tiny, self-contained Python time-series database library for a side project. It stores its data in human-readable text files, one per "table" per day. For speed, there are no checks that appended data is in the correct order, i.e. that each newly added time is the same as or later than the last time already in the file. That means data can become disordered if you happen to write two sensor readings out of time order, so you might want to re-sort the data later, and the library therefore needs a sorting function.
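
To make this concrete, here is roughly what one day's file looks like and how it can go wrong. The exact contents are my own illustration (an ISO time followed by space-separated readings, which is the format the parsing helpers further down expect):

09:00:00 21.4
09:00:02 21.5
09:00:01 21.6

The third line was appended after the second, so the file is now out of order and needs re-sorting.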

Partly to keep the library suitable for small Internet of Things devices, and partly as a challenge to myself, I have tried to make the database as memory-efficient as I can by relying on Python's iterators and generators wherever possible. With Python 3 this is usually pretty easy, but when it comes to sorting files it gets a little more difficult: most examples on the web for sorting files in Python say something along the lines of "it's really simple: just load the file contents into a list and call its (very fast) sort method, then write the list back to the file!" But in my case I don't want to consume potentially gigabytes of memory loading a huge text file for sorting; I want to read only one line at a time into memory.
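
For reference, this is the kind of in-memory approach those examples describe. It's a minimal sketch, assuming the line format shown above; it's fine for small files, but it holds the entire file in memory at once:

from datetime import time

def sort_in_memory(path):
    # Read every line of the file into a list: memory use grows with the file size.
    with open(path) as fp:
        lines = fp.readlines()
    # Sort by the ISO time at the start of each line.
    lines.sort(key=lambda line: time.fromisoformat(line.split()[0]))
    # Write the sorted lines back over the original file.
    with open(path, "w") as fp:
        fp.writelines(lines)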

After a bit of research, and racking my brain for hints from my early undergraduate years studying computer science, I figured I should use a heapsort, where the heaps are other files on the file system. I would scan through the unsorted file line-by-line and separate lines into two files depending on whether each time is in order or not. That gives me the original data split between two files, one sorted and one unsorted. I can then recursively apply the same split to the unsorted file until nothing is left to go into a new unsorted file. At that point I have n sorted files, and I can call Python's built-in heapq.merge function to merge their contents in time order, line-by-line, and save the result back to the original file. By using files as my heaps, I trade extra file system usage during the sort for minimal memory use: only enough to run the Python interpreter plus a few variables and the call stack.
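
The piece that does the final merging, heapq.merge, lazily yields items from several already-sorted iterables in key order, pulling one item at a time from each rather than materialising everything in memory. Here's a minimal sketch on plain lists, just to show the key argument (the real implementation below feeds it open file objects instead):

from heapq import merge

runs = [["09:00:00 21.4\n", "09:00:02 21.5\n"], ["09:00:01 21.6\n"]]
# Yields "09:00:00 21.4", "09:00:01 21.6", "09:00:02 21.5" in that order.
for line in merge(*runs, key=lambda line: line.split()[0]):
    print(line, end="")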

Here is my implementation:

from pathlib import Path
from tempfile import NamedTemporaryFile
from heapq import merge
from datetime import time

def sort(path):
    """Sort file.

    This works best for almost-sorted files. The algorithm is essentially a memory-efficient
    heapsort.

    Like all heapsorts, this is not stable. Measurements made at identical times may be
    swapped in the sorted file.
    """
    def subsort_file(fp):
        """Recursively split the lines of fp into sorted temporary files.

        Returns a list of paths to the sorted temporary files.
        """
        # Open two files: one for sorted lines, one for unsorted. The sorted file is not deleted
        # on close, but rather deleted later when we merge it into the main file. The unsorted
        # file is deleted at the end of this function.
        with NamedTemporaryFile(
            mode="r+",
            prefix=f"{path.name}_sorted",
            dir=str(path.parent),
            delete=False,
        ) as sub_fp_sorted, NamedTemporaryFile(
            mode="r+",
            prefix=f"{path.name}_unsorted",
            dir=str(path.parent),
            delete=True,
        ) as sub_fp_unsorted:
            last_time = None
            has_unsorted = False

            # Read the input one line at a time rather than loading it all into memory.
            for line in fp:
                line_time, line_data = _parse_line_time(line)

                # Choose whether to write this line into the sorted or unsorted file.
                if last_time is None or line_time > last_time:
                    target = sub_fp_sorted
                    # Update the last sorted time.
                    last_time = line_time
                else:
                    target = sub_fp_unsorted
                    has_unsorted = True

                target.write(_format_line(line_time, line_data))

            sorted_paths = [Path(sub_fp_sorted.name)]

            if has_unsorted:
                # Rewind the unsorted heap so the recursive call reads it from the start.
                sub_fp_unsorted.seek(0)
                # Sort the unsorted heap.
                sorted_paths.extend(subsort_file(sub_fp_unsorted))

            return sorted_paths

    # Create the output buffer next to the original file so the final replace below
    # stays on the same file system.
    with open(str(path), "r") as fp_existing, NamedTemporaryFile(
        "w", dir=str(path.parent), delete=False
    ) as fp_temp:
        heap_paths = subsort_file(fp_existing)
        # Open the heap files for reading.
        heap_files = [heap_path.open() for heap_path in heap_paths]
        # Merge the sorted subfiles and write to buffer.
        merged_lines = merge(*heap_files, key=_parse_line_time)
        fp_temp.writelines(merged_lines)

        # Delete the temporary files.
        for heap_file, heap_path in zip(heap_files, heap_paths):
            heap_file.close()
            heap_path.unlink()

    # Overwrite the unsorted file with the sorted one.
    Path(fp_temp.name).replace(path)

def _parse_line_time(line):
    pieces = line.split()
    return time.fromisoformat(pieces[0]), pieces[1:]

def _format_line(tick, data):
    return " ".join([str(tick)] + [str(piece) for piece in data]) + "\n"
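
Using it is then just a matter of handing sort a pathlib.Path pointing at a data file (the file name here is invented for illustration):

from pathlib import Path

sort(Path("temperature_2020-06-14.txt"))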

This works pretty well for the purpose. Sorting a 1000-line file with a few out-of-order lines takes a fraction of a second, and a million-line file takes about 20 seconds, all while parsing and formatting timestamps - not bad.

But why does the title say this is "flawed"?

Where it falls down is when a file contains more than just a few out-of-order lines: because the algorithm recursively splits the remainder into two files, every subsequent out-of-order run spawns a new temporary file. Generating a million random readings and trying to sort them, I eventually ran out of hard drive space because the algorithm had recursively created almost as many temporary files of unsorted data as there were out-of-order lines - on the order of half a million shrinking copies of the million-line file, so the total data buffered grows roughly quadratically with the file length... oh dear. (I should still point out that memory use was barely scratching 17 MB, though!) It also gets horrendously slow, because it is continually having to buffer data to temporary files on the hard drive.

So, there are definitely situations where this approach works well, like sorting almost-ordered data in a RAM-constrained environment, and there are definitely situations where it doesn't work well (or at all), like highly disordered data. In the latter case, other algorithms that don't constrain memory quite so tightly (such as the classic external merge sort sketched below) give much better results.
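
For reference, here is a minimal sketch of that external merge sort approach, reusing the _parse_line_time helper from above (the function name and default chunk size are my own, and this isn't part of the library): read a bounded number of lines at a time, sort each chunk in memory, write it to a temporary file, then heapq.merge the chunks back together. Memory use is bounded by the chunk size rather than by how disordered the data is, and the number of temporary files is the line count divided by the chunk size rather than the number of out-of-order runs.

from heapq import merge
from itertools import islice
from pathlib import Path
from tempfile import NamedTemporaryFile

def external_merge_sort(path, lines_per_chunk=100_000):
    """Sort a file by splitting it into fixed-size sorted chunks and merging them."""
    chunk_paths = []
    with open(str(path), "r") as fp:
        while True:
            # Read at most lines_per_chunk lines; this bounds memory use.
            chunk = [
                # Normalise line endings so sorted runs concatenate cleanly.
                line if line.endswith("\n") else line + "\n"
                for line in islice(fp, lines_per_chunk)
            ]
            if not chunk:
                break
            # Sort the chunk in memory and write it out as a temporary run.
            chunk.sort(key=_parse_line_time)
            with NamedTemporaryFile(
                mode="w", dir=str(path.parent), delete=False
            ) as chunk_fp:
                chunk_fp.writelines(chunk)
                chunk_paths.append(Path(chunk_fp.name))

    # Merge the sorted runs line-by-line into a new file, then swap it into place.
    chunk_files = [chunk_path.open() for chunk_path in chunk_paths]
    with NamedTemporaryFile(
        mode="w", dir=str(path.parent), delete=False
    ) as fp_temp:
        fp_temp.writelines(merge(*chunk_files, key=_parse_line_time))

    for chunk_file, chunk_path in zip(chunk_files, chunk_paths):
        chunk_file.close()
        chunk_path.unlink()

    Path(fp_temp.name).replace(path)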

It's also worth noting that heapsort is not stable, so lines with identical times are not guaranteed to appear in the same order in the sorted file. For me this didn't matter, since if two readings have the same time I don't care which order they are stored in, but for other uses it might.

And finally, of course, it's hard to beat C. Calling the UNIX sort command on the randomly ordered, million-line file that my algorithm used to nuke the free space on my hard drive takes about 1 second:

sort -n -k 1 -o myfile.txt myfile.txt

Memory usage here peaks at about 80 MB. So, it turns out that this algorithm is more or less only useful if you really can't afford an extra 60 or so MB, or if you're on a system with no equivalent sort tool. "Flawed" indeed...