Parsing Protobuf Definitions with Tree-sitter
If you work with Protocol Buffers (protobuf), you can save yourself a lot of
time, boredom, and headaches by parsing your definitions to build tools and
generate code.
The usual tool for doing that is protoc. It supports plugins to generate
output of various kinds: language bindings, documentation, etc. But if you want
to do anything custom, you are faced with either using something limited like
protoc-gen-gotemplate or writing your own plugin. protoc-gen-gotemplate works
well, but you can’t build complex logic into the workflow. You are limited to
what is possible in a simple Go template.
It’s also possible to use protoreflect from Go to process the compiled results
at runtime. This is painful. Really painful.
So, at work, we had made limited use of the protobuf definitions other than
for their main purpose and for documentation and package configuration via
custom options (these are supported in protobuf). Writing the protoreflect
code to make that work is not something I want to repeat.
Then I recently revamped my editor setup and moved from Vim to Neovim. In the
process I realized how awesome the Tree-sitter parsing library is and that it
was probably going to support extracting everything I wanted to get from our
protobuf definitions. Neovim uses Tree-sitter extensively.
Why This Matters
Our evented and event-sourced backend at Mozi relies on protobuf for schema
definitions and serialization of events. We use these same schemas everywhere
from the frontend all the way to the backend. This means our whole system is
working on the exact same entity definitions throughout. Good Stuff™.
In Go, the bindings are not really native structs and require a lot of
GetXYZ() and GetValue() calls chained with nil checking to work around the
fact that nil and zero values are encoded the same way in Protobuf. You also
can’t use them in conjunction with anything that uses struct tags because you
can’t apply tags. I am told by the mobile devs that the Swift bindings are
similarly unfriendly.
We use a mapping layer to paper over this and to make these easier to work with in our Go code, in data stores, and with off-the-shelf libraries.
We were maintaining custom mappings by hand. That’s a waste of time, and even getting GPT to write the transformations back and forth is annoying and invariably requires tweaking. So I wanted a solution that was much more automatic and repeatable.
Here’s what I did.
Example Definition
First we’ll have a look at one protobuf definition. Then we’ll talk about
extracting the information we want from it.
Imagine that we’re working with the following fairly typical protobuf message
definition. We want to be able to extract the name of the message, the enum
names and values, and the fields and their types. Here we are not particularly
interested in the field numbers, but you could also extract them, of course.
This typical message contains a single enum type and 4 fields. Real life
messages will contain many more fields, but this is enough for us in this post.
Looking at this, we could hack something to parse this fairly simple example
using regexes or other string matching. But it would end up being pretty
brittle. You could even break most trivial parsers by commenting out one or
more lines of valid code with /* */ style comments. So let’s take a look at
how we could get the data we need using a real parser: Tree-sitter.
Parsing and Querying the Document
Tree-sitter has numerous bindings that enable parsing programming languages and
data formats, and protobuf is supported. There are also good Go bindings for
Tree-sitter that make it possible to interact with all of this in a
straightforward way from Go code. We’ll use the
github.com/smacker/go-tree-sitter package and the associated protobuf bindings.
The library supports various methods of access to the parsed tree, but the one we’ll use here is a query expression that will extract only the data we care about.
We can use an S-expression to query the parsed tree. But we need to understand what the parsed tree looks like before we can query it. How do we visualize what is in the AST? One way would be to use the online playground, but that lacks support for Protobuf. Because I was already working in Neovim, I decided to use the excellent built-in visualization and query tools!
Inside Neovim you can run :InspectTree on any open document where the bindings
are included, and see a nice tree. Here is me running the inspector on the
source code for this blog post. (See if you can spot the code error.)
In :InspectTree, if I highlight things in the document, I see them reflected
in the tree, and vice versa. This is invaluable for working with the queries,
since we can identify what each element in the AST actually is in the document,
live.
We can do the same thing for our Protobuf document. Then, it’s a matter of constructing a query to find and extract the parts of the document we want:
- Message Name
- Enum names, keys, and values
- Field names and types
Writing a query using the Neovim tools is also nice, and straightforward. From
the :InspectTree panel, you can open the query editor by typing :EditQuery.
This brings up another pane where we can type queries and see them reflected in
the original document via highlighting and annotation.
This is what writing a query looks like in the Neovim query window:
When I put the cursor over the named capture @name in the query, it
highlights any matched parts of the document. There are many ways to write the
queries that we might use here. You essentially just walk through the tree in
the viewer and mark the things you’d like to return as named captures.
The simplest query, shown in the screenshot, is to simply extract the message name:
Here we found, by inspecting the tree, that a message_name type is always
followed by an identifier. If we capture the identifier as @name we can
then refer to that capture when we want the message name. Then we can just
build it up from there.
Here you can see me traversing a query that I built, and how the editor highlights the matches:
This is an example of a single query that will extract all of our required data
from the protobuf definition:
Captures from the document will be returned by Tree-sitter in order. This is very helpful. We can then walk the results to generate a structure that is more easily referenced in code. So let’s take a look at some Go code to interact with this document using the query we built.
Working with Tree-sitter from Go
We need to import the two packages mentioned earlier. This is truncated for clarity: you will need a few other simple stdlib imports.
We need some kind of data structure to store our parsed info in. The simplest starting point is something like this:
You could, of course, use a more structured type if that suits your purpose better.
Then we need a function to read in the file and run it through the parser:
You will note in the above that the majority of the hard work is being done by
a function we have not seen yet: GetMessageFields(). That should look
something like this:
Here we define the query, ask Tree-sitter to kick off the query, and then we loop over the matches, inspecting their name and then building up the maps.
The last piece of code to show is the queryTree() function that kicks off the
query and cursor. It looks like this:
And that’s pretty much the meat of it. We can call ParseMessage() and we get
back a Message{} struct that is populated with our message name, fields, and
enums. In JSON representation, it would look something like this:
And that’s it! It’s up to you what you do with this, but that gets you started. If you need to parse sub-types, you could design a query to do that. If you want to parse RPC definitions, you could do that, too. We use this information to generate our bindings (which includes some logic).
Conclusion
This basis for tooling has been pretty good for us. I will undoubtedly bring Tree-sitter and the Neovim tooling to bear on other problems in the future. Hopefully this overview gives you a starting point.