This makes me really happy 🙂
https://youtu.be/GBkzloes26U
QuickBASIC comes with help files that are encoded in a proprietary binary format (with two layers of compression, no less!). Fortunately, someone out there put in the work reverse-engineering the format and wrote a document accurately describing how to decode and interpret the bytes. The document does an excellent job describing what could otherwise be a nightmare of complex binary formats.
The QuickHelp format employs two layers of compression. First, the text is compressed using a combination of tokenization and run-length encoding.
The tokenization involves identifying words or phrases that are repeated and adding them to a table, after which every instance of them in the text can be replaced with a reference to that table. Run-length encoding is much simpler: If you see the same byte multiple times in a row, just encode how many times it was repeated. A horizontal rule of 78 horizontal line characters can then be stored as "repeat this character 78 times" rather than as 78 literal characters.
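To make the idea concrete, here's a minimal Python sketch of expanding text that was packed this way. The two control bytes (`PHRASE` and `REPEAT`) and the phrase table are made up for illustration; the real QuickHelp format uses its own control codes and table layout, which this does not attempt to reproduce.

```python
# Toy decoder for a made-up scheme. The control bytes below are
# illustrative assumptions, not the real QuickHelp encoding.
PHRASE = 0x01  # next byte is an index into the phrase table
REPEAT = 0x02  # next two bytes are (byte value, repeat count)


def expand(data: bytes, phrases: list[bytes]) -> bytes:
    """Expand phrase references and run-length runs back into plain bytes."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if b == PHRASE:
            # Replace the reference with the phrase it points to.
            out += phrases[data[i + 1]]
            i += 2
        elif b == REPEAT:
            # Emit the given byte the given number of times.
            out += bytes([data[i + 1]]) * data[i + 2]
            i += 3
        else:
            # Anything else is a literal byte.
            out.append(b)
            i += 1
    return bytes(out)


phrases = [b"PRINT", b"statement"]
packed = (
    bytes([PHRASE, 0]) + b" " + bytes([PHRASE, 1]) + b"\n"
    + bytes([REPEAT, ord("-"), 78])
)
text = expand(packed, phrases)
```

In this toy example, 9 packed bytes expand to a heading line followed by a 78-character rule.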
Then, after the tokenization pass, the resulting byte stream is compressed using Huffman compression. Huffman compression is based on a simple idea: Instead of using a rigid scheme of 8 bits for every byte, use fewer bits for bytes that show up more commonly, at the expense of rare bytes, which then take more than 8 bits. As long as there is a noticeable bias toward some byte values, it can compress things quite effectively. As with many compression algorithms, Huffman compression requires you to treat a file that is really a stream of bytes as a stream of bits, automatically transitioning from one byte to the next as needed.
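Here's a rough Python sketch of what decoding looks like: a generator that turns bytes into a stream of bits, and a decoder that accumulates bits until they match a code. The code table, the most-significant-bit-first order, and the explicit symbol count are all assumptions for the sake of the example; QuickHelp stores its Huffman tree in its own layout, which isn't shown here.

```python
from typing import Iterator


def bits(data: bytes) -> Iterator[int]:
    """Yield bits one at a time, most significant bit first (an assumption)."""
    for byte in data:
        for shift in range(7, -1, -1):
            yield (byte >> shift) & 1


def huffman_decode(data: bytes, codes: dict[str, int], count: int) -> bytes:
    """Decode `count` symbols using a prefix-free table of bit strings."""
    out = bytearray()
    current = ""
    for bit in bits(data):
        if len(out) == count:
            break  # the remaining bits are just padding
        current += str(bit)
        if current in codes:
            out.append(codes[current])
            current = ""
    return bytes(out)


# Hypothetical table: space is the most common byte, so it gets the shortest code.
codes = {"0": ord(" "), "10": ord("e"), "110": ord("t"), "111": ord("z")}

# The bits 110 10 0 10 0 111, padded with zeros to fill two bytes.
encoded = bytes([0b11010010, 0b01110000])
decoded = huffman_decode(encoded, codes, 6)
```

Six symbols fit into 12 bits here instead of 48. Matching bit strings against a dictionary is slow but easy to follow; a real decoder would more likely walk a tree one bit at a time.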
To get back the original data, you have to undo these steps in reverse order: first the Huffman compression and then the keyword compression. If you get even one detail wrong, everything after that point will almost certainly be indecipherable noise. Fortunately, though, the documentation I found was precise and accurate enough that it was relatively straightforward to write the code, and it works a treat. 😃
Once the encoding in the file is sorted out, you then have the semantic meaning of the data to worry about. In a QuickHelp database, as it's called, there's a list of Topics, and each Topic can be linked to by one or more Context Strings. Within the text of a Topic, each line stores its text as a series of spans, each with its own formatting, and then a series of links, each of which specifies a start & end character on the row and the Context String or Topic Index (index into the list of topics) to which to link.
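As a rough picture of that structure, here's a sketch using Python dataclasses. All of the class and field names are my own placeholders for this post, and the `bold`/`italic` flags are just a stand-in for whatever formatting attributes a span actually carries.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    text: str
    bold: bool = False    # formatting flags are illustrative assumptions
    italic: bool = False


@dataclass
class Link:
    start: int  # first linked character on the line
    end: int    # last linked character on the line
    context: Optional[str] = None      # target Context String, or...
    topic_index: Optional[int] = None  # ...target index into the topic list


@dataclass
class Line:
    spans: list[Span] = field(default_factory=list)
    links: list[Link] = field(default_factory=list)


@dataclass
class Topic:
    lines: list[Line] = field(default_factory=list)


@dataclass
class HelpDatabase:
    topics: list[Topic] = field(default_factory=list)
    # Several context strings may map to the same topic index.
    contexts: dict[str, int] = field(default_factory=dict)

    def resolve(self, link: Link) -> Topic:
        """Follow a link by topic index if present, else by context string."""
        if link.topic_index is not None:
            return self.topics[link.topic_index]
        return self.topics[self.contexts[link.context]]


db = HelpDatabase(
    topics=[
        Topic([Line([Span("PRINT statement", bold=True)])]),
        Topic([Line([Span("See "), Span("PRINT")], [Link(4, 8, context="PRINT")])]),
    ],
    contexts={"PRINT": 0},
)
target = db.resolve(db.topics[1].lines[0].links[0])
```

Here the link covering characters 4 through 8 of "See PRINT" resolves, via the "PRINT" context string, to the first topic.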
The lookup of help topics for keywords is pretty simple: the context string is simply the keyword. But help topics providing contextual help for menus and dialogs are a bit less obvious. Reverse-engineering the mappings required some trial and error with the actual QuickBASIC, checking which help page, with what text, appeared from each dialog. QBX only has a small subset of the dialogs anyway, but it was important that the help context strings be mapped correctly.
In this video, the program you're seeing is my QBX project, but the help data is coming from the actual BAS7QCK.HLP file from a QuickBASIC 7.1 installation.