Rzk Formatter¶

Version 0.7.0 (and later 0.7.3 with bug fixes) of Rzk has recently been released, and it brings with it an exciting new feature: the auto formatter! It has been under work for over 3 weeks and came with unique design challenges that I will attempt to unpack in this post.

Goal¶

Basically, the target is to write a program that takes in Rzk code as input and makes sure it abides by the style guide (developed mainly by Fredrik Bakke and other contributors to the sHoTT project) with minimal modification otherwise. For example, comments should stay intact and semantics should be preserved.

We decided to not support any customization (at least initially), making the formatter heavily opinionated.

Challenges¶

Designing a formatter has both technical and nontechnical challenges. From the nontechnical side, and especially since the formatter is not customizable, there is the challenge of having the formatting requirements formulated in a precise enough language that can be translated easily into code, assuming there is even consensus among the community on these requirements.

From the technical perspective, the parser generator currently used (BNFC) loses a lot of the information necessary for reconstructing the source program, so a pretty-printer based on the parse result is not feasible (for example, all comments would be lost). Even after this difficulty is overcome, there are still more challenges regarding the formatting itself. For example, if a given line is too long and it is necessary to wrap it by inserting a new line somewhere in the line, there are \(O(2^n)\) possible choices (where \(n\) is the number of locations in the code where a new line could be inserted) to compare between for inserting line breaks. For example, consider the following Rzk code:

#lang rzk-1
#define id (A : U) (x : A) : A := x

which has 11 positions where a new line can be inserted, as shown by these markers:

#define id (A : U) (x : A) : A := x
       ^  ^  ^ ^  ^  ^ ^  ^ ^ ^  ^

This amounts to 2048 combinations that have to be ranked for such a simple line. Even the ranking itself is not a straightforward problem to tackle.

Additionally, to properly indent each line, it is necessary to understand the program semantics, which increases the complexity of the formatter's parser.

Possible approaches (and their tradeoffs)¶

Multiple possible solutions exist for the above difficulties, each with their pros and cons. The alternatives we considered for Rzk's formatter can be summarized as follows:

Implement a Prettier plugin
Reuse the existing parser generated by BNFC
Write a dedicated parser for the formatter specifically
Reuse the exiting lexer generated by BNFC

Prettier plugin¶

Seeing as some formalization projects were already making use of Prettier, a plugin to support Rzk seemed like a reasonable option. However, that option didn't seem so attractive for a few reasons. First of all, it would mean forcing all users to install Prettier, which requires having the JavaScript toolchain (in particular, npm) installed, but not all Rzk developers use JavaScript. It would be better to have an in-house solution.

Additionally, Prettier plugins can only be written in JavaScript, which means having to maintain a separate parser written from scratch in JS, which is not maintainable or even desirable.

BNFC's parser¶

Rzk makes use of BNFC which produces a lexer, parser, AST, and pretty-printer for a given Labelled BNF grammar file. The generated pretty-printer is not configurable, so it would take a lot of dirty patching to have it follow the style guide of Rzk code.

As for the abstract syntax tree (AST) it generates, it is also not suitable particularly for being too abstract. It loses all information about spaces, parentheses, and basically any syntactic construction, which is useful for a compiler or typechecker, but not so much for a formatter.

Dedicated parser¶

An alternative to the generated parser would be to manually write another parser that keeps all the information relevant to the formatter in a concrete syntax tree (CST). This would be the most customizable option, but it would also be a lot of repeated work (since we already have the generated parser) that is additionally more error-prone.

Writing a parser manually is time-consuming and redundant since it can easily be automatically generated from the much more readable grammar file.

BNFC's lexer¶

Since BNFC generates a lexer that is used by the parser, a more rudimentary approach would be to make use of that lexer for our formatter. As the lexer simply tokenizes the input source code, it does not make any decisions regarding what to throw away and keeps all syntax constructions (except comments). Therefore, this approach would have the least amount of syntactic information lost among the automatically generated tools.

On the hand, working with a stream of tokens is not as easy as working with a parsed tree. First, tokens do not carry almost any semantics with them, so it is difficult to understand whether an identifier refers to a parameter, a function application, or if it's even still being declared. Also, without having semantics, it is difficult to judge out which line split would make more sense when inserting line breaks in a long line, or to figure out how much a given line should be indented.

Our chosen solution¶

The approach that seemed the most suitable of the above was the last one: use BNFC's lexer. After going through the style guide a few times, it was clear that the vast majority of the rules/recommendations did not involve understanding much semantics. For example, Unicode replacement for ASCII sequences, moving := to a new line, and adding a space after \\ in a lambda term can all be applied blindly.

For the few rules that require having some semantics (like distinguishing between : for definition conclusion and for other types), it was enough to just store some state that is preserved (and updated) while traversing the tokens; no additional semantic analysis was necessary.

The actual implementation in Haskell is just pattern matching on tokens with some conditions and applying edits to the text in response. For example, the following is an excerpt of the pattern match for opening parenthesis:

-- Ensure exactly one space after the first open paren of a line
go :: FormatState -> [Token] -> [FormattingEdit]
go s (Token "(" line col : tks)
  | precededBySingleCharOnly && spacesAfter /= 1 && not isLastNonSpaceChar
    = FormattingEdit line (col + 1) line (spaceColcol + 1 + spacesAfter) " "
    : go (incParensDepth s) tks
  -- Remove extra spaces if it's *not* the first open paren on a new line
  | not precededBySingleCharOnly && spacesAfter > 0
    = FormattingEdit line (col + 1) line (col + 1 + spacesAfter) ""
    : go (incParensDepth s) tks
  | otherwise = go (incParensDepth s) tks
  where
    -- Redacted for brevity

where go is a helper function containing the core logic used inside format :: String -> String. It simply operates on the current state and the matched token(s) and returns a list of edits to be applied to the text. The FormattingEdit structure provides compatibility with LSP (for the VS Code extension) and is easy to reuse for the CLI. Additionally, we try to make sure that these edits perform the minimal changes needed so that nothing else is affected by mistake (such as comments).

While this implementation is more error-prone than an auto-generated pretty-printer, it is much more flexible and unit tests have been added that verify the various rules to make sure no functionality gets broken by mistake. In fact, the unit tests helped catch a bug with the formatter right as we thought we were ready to release it.

Implemented rules¶

The formatter currently implements the following features:

Add a space after the opening parenthesis to help with the code tree structure
Put the definition conclusion (type, starting with :) and construction (body, starting with :=) on new lines
Add a space after the \\ of a lambda term and around binary operators (like , and =)
Move binary operators to the beginning of the next line if they appear at the end of the previous one
Replace common ASCII sequences with their Unicode equivalent
Removes extraneous spaces in some places like before ) or between #lang and rzk-1 (though it's not comprehensive yet)

Shortcomings¶

No software is ever perfect, especially from the first release, and our formatter is definitely no exception 🙂.

For example, if the source code contained tab characters, instead of throwing its hands up in the air and angrily shouting "I can't deal with this", the formatter decides to silently take revenge and mess up your code by deleting random characters around the tab character, but only every once in a while! (I really wish I was joking 😅)

Thankfully, the accompanying VS Code extension configures some default settings to avoid this situation by always inserting spaces.

Aside from bugs like this, there are also features that are simply not yet implemented. For example, these include inserting line breaks in very long lines and inferring the correct indentation level a line should have. For the latter, we decided that it might be too constraining to force a specific indentation on the users when the line might be more readable using a different indentation, and the formatter is not smart enough to decide when it should adjust indentation or leave it as is.

Lastly, some desirable features have been identified as belonging more to the linter (which we do not have yet) rather than the formatter, such as detecting unnecessary parenthesis, deprecated syntax, and expressions that could be simplified. The linter is currently not on the roadmap yet, but we hope to have it sometime in the future.

How to use the formatter¶

The formatter is shipped as part of the rzk executable under the subcommand format. It is also integrated in the language server and used by the VS Code extension. Additionally, the GitHub Action now has an option to check the formatting as well.

CLI¶

The command looks like so:

rzk format [--check] [--write] [file_path]...

It follows the convention of other formatters (Prettier and Fourmolu, to name a couple) in that usage with any flags (rzk format file.rzk) outputs the result to stdout, using --write will modify the files in place, and --check is only for verifying that the files are already well-formatted (and exits with a non-zero code if they aren't). Naturally, --write and --check are mutually exclusive and it would be an error to specify both.

If no file paths are given, then they will be read from rzk.yaml if it exists (just like with typecheck).

VS Code¶

The accompanying release of the VS Code extension integrated formatter support so you can use the built-in Format Document command to format Rzk and Literate Rzk Markdown files. It also comes with new default settings for Rzk file that ensure consistent indentation (2 spaces), enable formatting on save, and add a ruler at 80 chars to help users identify long lines since the formatter cannot wrap them just yet.

The formatter can be turned off manually by setting the rzk.format.enable option to false.

GitHub Action¶

A new option has been added to the rzk-action called check-formatting that, well, checks the formatting of the input files. It is off by default, and cannot be used with rzk-version lower than v0.7.1.

Conclusion and Feedback¶

This was an exciting milestone for the Rzk toolset and for me personally as I have never written a similar piece of code before. That was an interesting engineering and research challenge, and I enjoyed it all the way.

Contributions have been made to popular Rzk formalization repos to autoformat their content, such as sHoTT and the HoTT Book.

I encourage everyone to use the formatter and help us by reporting any possible bugs on our GitHub issues page (using the label formatter). There also exists a Zulip thread about the autoformatter, so feedback and discussions are welcome there as well. Many thanks to Fredrik Bakke for his helpful feedback on the sHoTT formatting PR!

See you in future blog posts :)