In this post, I will detail the steps I followed in building a JSON parser, in enough detail that anyone with some programming knowledge and a willingness to learn what a parser and a lexer are can follow along.
This JSON parser is built as part of the Coding Challenges series by John Crickett, and this particular attempt at the challenge is probably not the best out there. I expect there are better ways to do several things, but the primary aim for me was to have fun building.
STEP 0 - Grammar as the grand start
Grammar, in the theory of computation, refers to a set of rules that define acceptable strings in a language. Quite similar to grammar in the English language, if you ask me. Think of how you know "I can sing" is a correct sentence but "I sing can" isn't.
The role of the parser, in a compilation process, is to validate syntactic structure. It accepts a sequence of tokens (to be discussed next) and checks that the sequence is valid according to the language's grammar.
Right before the parser gets into action, there is the lexer. The lexer, also usually part of the compiler, is responsible for reading the text you write and breaking it into meaningful tokens. A token is the simplest meaningful unit in a language. Take a look at the examples here to better understand what tokens are.
Back to the grammar: for the parser to know what sequence of tokens is valid, we clearly define the acceptable sequences, i.e. the grammar. As a similar reference, in the English language modal verbs such as "can" always come before the main verb, such as "sing", which is why the sentence "I can sing" is valid and the other isn't.
I came up with the following grammar for JSON. The grammar references json.org for notation and structure.
The notation above may be strange if you're unfamiliar with the topic; it is a common way of representing grammar using production rules (each line with an arrow). This particular set of production rules is interpreted as follows, from top to bottom:
- An object (O) is identified by an open brace, a list of members (N), followed by a close brace. A member refers to a key-value pair
- A list of members (N) is represented either by an empty string, a member (M), or a member (M) followed by a comma, followed by a list of members (N)
- A member (M) is identified as a string followed by a colon (:) followed by a value (V)
- A value (V) is identified as either an object (O), an array (A), a boolean (b), null (n), a string (s), or a number
- An array (A) is identified as a left bracket followed by a list of elements (L) followed by a right bracket
- A list of elements (L) is identified as a single value, a value followed by a comma followed by another list of elements, or an empty string
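Put back into the arrow notation, the rules above look roughly like this (my reconstruction from the descriptions, with ε standing for the empty string):

```
O → { N }
N → ε | M | M , N
M → string : V
V → O | A | b | n | s | number
A → [ L ]
L → ε | V | V , L
```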
STEP 1.1 Lexer, but just open and close
For this step, I built a lexer that recognizes the { and } characters as tokens. To begin with the lexer, a Lexer class:
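The snippet below is a minimal sketch of such a class, reconstructed from the description that follows; the attribute and method names match the post, but the details are my assumption:

```python
class Lexer:
    def __init__(self, text):
        self.text = text          # the raw JSON input
        self.position = 0         # pointer into the text
        self.line_number = 1      # kept around for error reporting

    def readCharacter(self):
        """Return the character at the current position and advance."""
        if self.position >= len(self.text):
            return None           # signals we've read all of the text
        character = self.text[self.position]
        self.position += 1
        return character
```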
The class has a readCharacter method, which returns the character at the current position on every call and increments the position. The idea is to use the position as a pointer and iterate through every character in the text input. I hope to use the line_number attribute for error reporting.
To recognize the different types of tokens, I use a Token class with a type attribute and a literal attribute for actually storing the string/character:
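A sketch of such a Token class, with the token types stored as simple string constants (the constant names are my choice of representation):

```python
LEFT_BRACE = "LEFT_BRACE"
RIGHT_BRACE = "RIGHT_BRACE"
EOF = "EOF"   # marks that all of the text has been read

class Token:
    def __init__(self, type, literal):
        self.type = type        # one of the constants above
        self.literal = literal  # the actual string/character
```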
The only types known for now are the left brace, the right brace, and an EOF type I use to recognize when we have read all of the text.
The parser mostly needs the lexer to produce the next token in the sequence, so I make a method in the Lexer class for this; every time it's called, the lexer returns the next token in the sequence.
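A sketch of that method, bundled with the pieces it needs so it runs on its own (the error message format is my assumption):

```python
LEFT_BRACE, RIGHT_BRACE, EOF = "LEFT_BRACE", "RIGHT_BRACE", "EOF"

class Token:
    def __init__(self, type, literal):
        self.type, self.literal = type, literal

class Lexer:
    def __init__(self, text):
        self.text, self.position, self.line_number = text, 0, 1

    def readCharacter(self):
        if self.position >= len(self.text):
            return None
        character = self.text[self.position]
        self.position += 1
        return character

    def nextToken(self):
        """Return the next token in the sequence on every call."""
        character = self.readCharacter()
        if character is None:
            return Token(EOF, "")
        if character == "{":
            return Token(LEFT_BRACE, character)
        if character == "}":
            return Token(RIGHT_BRACE, character)
        raise ValueError(
            "unrecognized character %r at position %d" % (character, self.position)
        )
```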
Notes on the Exceptions using self.position
The idea of raising errors with messages that include the position was to replicate error messages that point you to the particular position on a line where the error occurred. This worked well with single-line text, but getting it accurate for multiple lines would require work I am yet to get to as of now.
That's all for the basic lexer 🤩. Way easier than the earlier explanation of what a lexer is, right?
I wrote some unit tests to test things out, but I'll leave that out of this post. The final set of tests can be found in the GitHub repo.
STEP 1.2 Parser, but just open and close
Next comes building the parser capable of parsing an empty object {}. As with the Lexer, I use a Parser class for everything related:
The class uses an instance of the lexer class for getting tokens from the input text. The nextToken method is used for setting the currentToken attribute to the next token to be evaluated. A peekToken attribute also exists in the class for the token right after the current token. This allows the parser to take a "peek" ahead.
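A sketch of the class and its nextToken helper. A tiny stand-in lexer is included here so the snippet runs on its own; in the real code the parser receives the Lexer built earlier:

```python
class Token:
    def __init__(self, type, literal):
        self.type, self.literal = type, literal

class FakeLexer:
    """Stand-in that serves tokens from a list, for illustration only."""
    def __init__(self, tokens):
        self.tokens = list(tokens)

    def nextToken(self):
        return self.tokens.pop(0) if self.tokens else Token("EOF", "")

class Parser:
    def __init__(self, lexer):
        self.lexer = lexer          # used for getting tokens from the input text
        self.currentToken = None    # the token being evaluated
        self.peekToken = None       # the token right after it, the "peek" ahead

    def nextToken(self):
        """Shift the two-token window one token forward."""
        self.currentToken = self.peekToken
        self.peekToken = self.lexer.nextToken()
```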
To parse the JSON, I explored the concept of a recursive descent parser (more about that here). Parsers are split into top-down parsers and bottom-up parsers. Recursive descent parsing is an approach to top-down parsing I found fairly easy to implement. The idea of recursive descent parsing, as I came to understand it, is to have a procedure/function representing each non-terminal in my grammar (the capital letters in the grammar above). Each of these functions attempts to parse the following tokens according to the right-hand side of the production rule. If another (or the same) non-terminal is part of the production rule, the function for that non-terminal is called, hence the recursive part of the name.
Getting into actual coding: first, I need a function that actually starts the parsing.
The parseProgram method is to be called to begin the process. It calls self.nextToken twice to initialize currentToken and peekToken to actual tokens. JSON, an object notation, is a bunch of members wrapped in an object, so the first token to expect in valid JSON should be the start of an object: { , the left brace.
If the currentToken is a left brace, then it's fair to assume we're dealing with an object, so we try to parse the object by calling the parseObject method shown below:
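The two procedures, sketched together so the snippet parses {} end to end. The stand-in lexer and the bare {} return value are my simplifications; the real code works with the Lexer from earlier and returns node objects:

```python
LEFT_BRACE, RIGHT_BRACE, EOF = "LEFT_BRACE", "RIGHT_BRACE", "EOF"

class Token:
    def __init__(self, type, literal):
        self.type, self.literal = type, literal

class FakeLexer:
    """Stand-in that serves tokens from a list, for illustration only."""
    def __init__(self, tokens):
        self.tokens = list(tokens)

    def nextToken(self):
        return self.tokens.pop(0) if self.tokens else Token(EOF, "")

class Parser:
    def __init__(self, lexer):
        self.lexer = lexer
        self.currentToken = None
        self.peekToken = None

    def nextToken(self):
        self.currentToken = self.peekToken
        self.peekToken = self.lexer.nextToken()

    def parseProgram(self):
        # called twice so currentToken and peekToken hold actual tokens
        self.nextToken()
        self.nextToken()
        if self.currentToken.type != LEFT_BRACE:
            raise ValueError("valid JSON must start with an object: expected {")
        return self.parseObject()

    def parseObject(self):
        # for now an object is just {}: check that the next token closes it
        if self.peekToken.type != RIGHT_BRACE:
            raise ValueError("expected } to close the object")
        return {}   # the real code returns an ObjectNode here
```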
In this method, I check that the next token is a right brace. And that's all: the parser now recognizes the JSON {}.
Nodes - Note on the node objects seen so far
The ObjectNode and ProgramNode classes seen so far refer to nodes of an Abstract Syntax Tree (AST). The parser, as part of the compilation process, is responsible for generating an abstract syntax tree to be utilized by the following stages of the process. In the code so far, I have created these nodes, but I have chosen not to discuss the nodes or the AST to keep the post less complex.
STEP 2.1 More strings, more power to the Lexer
To allow the Lexer to recognize strings, both for member names and member values, I add a string token type to the set of known types.
In the lexer's nextToken method, we recognize strings as a sequence of characters that starts and ends with a quotation mark.
When the currently read character is a quotation mark ("), the set of characters that follow should make up a string, and I use the readString method to read these characters into one Python string.
The nextToken method uses the string literal returned to create a token and returns that token.
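A sketch of readString and the branch in nextToken that calls it (the surrounding Lexer scaffolding is repeated so the snippet runs on its own; the trailing ... stands for the branches for the other token types):

```python
STRING = "STRING"

class Token:
    def __init__(self, type, literal):
        self.type, self.literal = type, literal

class Lexer:
    def __init__(self, text):
        self.text, self.position, self.line_number = text, 0, 1

    def readCharacter(self):
        if self.position >= len(self.text):
            return None
        character = self.text[self.position]
        self.position += 1
        return character

    def readString(self):
        """Read characters up to the closing quote into one Python string."""
        characters = []
        character = self.readCharacter()
        while character != '"':
            if character is None:
                raise ValueError("unterminated string at position %d" % self.position)
            characters.append(character)
            character = self.readCharacter()
        return "".join(characters)

    def nextToken(self):
        character = self.readCharacter()
        if character == '"':   # a quote identifies the beginning of a string
            return Token(STRING, self.readString())
        ...                    # braces, EOF and friends handled as before
```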
While we're on strings: the lexer should also recognize commas and colons. The token types are added just as the one for strings was:
The nextToken method can easily recognize these characters, so the tokens are easily created.
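Since each of these is a single character, the branches are one-liners (again a self-contained sketch, with ... standing for the branches shown earlier):

```python
COMMA, COLON = "COMMA", "COLON"

class Token:
    def __init__(self, type, literal):
        self.type, self.literal = type, literal

class Lexer:
    def __init__(self, text):
        self.text, self.position = text, 0

    def readCharacter(self):
        if self.position >= len(self.text):
            return None
        character = self.text[self.position]
        self.position += 1
        return character

    def nextToken(self):
        character = self.readCharacter()
        # single-character tokens are created directly
        if character == ",":
            return Token(COMMA, character)
        if character == ":":
            return Token(COLON, character)
        ...   # braces, strings and EOF handled as before
```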
STEP 2.2 What about spaces
So far, the lexer has iterated through the text assuming every character is a recognized one; otherwise, it throws an error. But in writing JSON, {"hey":"there"} is just as valid as { "hey": "there" }. The space between tokens should be ignored, yet the lexer currently throws an error whenever it meets a space character. I fix this by making the space character a recognized one that is not returned as a token; instead, the lexer keeps reading the next characters until it gets to a valid token.
The skip-spaces method keeps reading characters until we get to a non-space character, then we call nextToken again so we get the next viable token.
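A sketch of the idea; the post only calls it the "skip spaces method", so the name skipSpaces is my guess:

```python
LEFT_BRACE = "LEFT_BRACE"

class Token:
    def __init__(self, type, literal):
        self.type, self.literal = type, literal

class Lexer:
    def __init__(self, text):
        self.text, self.position, self.line_number = text, 0, 1

    def readCharacter(self):
        if self.position >= len(self.text):
            return None
        character = self.text[self.position]
        self.position += 1
        return character

    def skipSpaces(self):
        """Keep reading characters until a non-space character is reached."""
        character = self.readCharacter()
        while character is not None and character.isspace():
            if character == "\n":
                self.line_number += 1   # track lines for error reporting
            character = self.readCharacter()
        if character is not None:
            self.position -= 1          # step back so the character isn't lost

    def nextToken(self):
        character = self.readCharacter()
        if character is not None and character.isspace():
            self.position -= 1          # let skipSpaces see the space again
            self.skipSpaces()
            return self.nextToken()     # now get the next viable token
        if character == "{":
            return Token(LEFT_BRACE, character)
        ...                             # other token types handled as before
```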
Skipping the spaces is a good enough solution for handling spaces in JSON. In a programming language where whitespace is actually a valid/required part of the syntax, the space might actually be kept as a token, I think?
STEP 2.3 Parserâs first steps
To take some baby steps, the parser should be able to parse {"name": "samuel"}: an object with a single member whose key and value are strings. The lexer already recognizes string tokens and the colon token, so I only need the object-parsing procedure to call the procedure that parses a member. Remember, I said a procedure/method will exist for each non-terminal.
To add this, I updated the parseObject procedure as such:
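A sketch of the updated parseObject together with a parseMember procedure (the stand-in lexer and the plain dict return value are my simplifications; the real code uses the Lexer from earlier and returns AST nodes):

```python
LEFT_BRACE, RIGHT_BRACE, STRING, COLON, EOF = (
    "LEFT_BRACE", "RIGHT_BRACE", "STRING", "COLON", "EOF")

class Token:
    def __init__(self, type, literal):
        self.type, self.literal = type, literal

class FakeLexer:
    """Stand-in that serves tokens from a list, for illustration only."""
    def __init__(self, tokens):
        self.tokens = list(tokens)

    def nextToken(self):
        return self.tokens.pop(0) if self.tokens else Token(EOF, "")

class Parser:
    def __init__(self, lexer):
        self.lexer = lexer
        self.currentToken = None
        self.peekToken = None

    def nextToken(self):
        self.currentToken = self.peekToken
        self.peekToken = self.lexer.nextToken()

    def parseProgram(self):
        self.nextToken()
        self.nextToken()
        if self.currentToken.type != LEFT_BRACE:
            raise ValueError("valid JSON must start with an object: expected {")
        return self.parseObject()

    def parseObject(self):
        members = {}
        self.nextToken()                           # move past the {
        if self.currentToken.type == STRING:       # looks like a member
            key, value = self.parseMember()
            members[key] = value
        if self.currentToken.type != RIGHT_BRACE:
            raise ValueError("expected } to close the object")
        return members                             # the real code returns an ObjectNode

    def parseMember(self):
        key = self.currentToken.literal
        if self.peekToken.type != COLON:
            raise ValueError("expected : after the member name")
        self.nextToken()                           # skip to the colon
        self.nextToken()                           # skip past it to the value
        if self.currentToken.type != STRING:
            raise ValueError("expected a string value")
        value = self.currentToken.literal
        self.nextToken()                           # move past the value
        return key, value
```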
The change adds a check to see if the current token is a string type. If it is, it's okay to assume we're now dealing with a member (a key-value pair), so I call the function to parse a member. That function checks that the next token after the string (the key) is a colon, throwing an error otherwise, then skips past the colon and checks that the now-current token is also a string (the value).
With these changes, the parser can now parse the JSON {"name": "samuel"}.
STEP 2.4 Enough steps to walk
For the next step, I update the parser to recognize JSON with multiple key-value pairs, like {"name": "samuel", "random": "yes"}. To do this, the parser needs to know to expect a comma after every member except the last one.
With this change, after parsing a member the parser checks if the next token is a comma. If it is, it skips the comma so the loop parses the next member in the next iteration. If it isn't a comma, then it's expected to be a right brace.
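The loop described above might look like this sketch (the stand-in lexer and plain dicts instead of AST nodes are my simplifications):

```python
LEFT_BRACE, RIGHT_BRACE, STRING, COLON, COMMA, EOF = (
    "LEFT_BRACE", "RIGHT_BRACE", "STRING", "COLON", "COMMA", "EOF")

class Token:
    def __init__(self, type, literal):
        self.type, self.literal = type, literal

class FakeLexer:
    """Stand-in that serves tokens from a list, for illustration only."""
    def __init__(self, tokens):
        self.tokens = list(tokens)

    def nextToken(self):
        return self.tokens.pop(0) if self.tokens else Token(EOF, "")

class Parser:
    def __init__(self, lexer):
        self.lexer = lexer
        self.currentToken = None
        self.peekToken = None

    def nextToken(self):
        self.currentToken = self.peekToken
        self.peekToken = self.lexer.nextToken()

    def parseProgram(self):
        self.nextToken()
        self.nextToken()
        if self.currentToken.type != LEFT_BRACE:
            raise ValueError("valid JSON must start with an object: expected {")
        return self.parseObject()

    def parseObject(self):
        members = {}
        self.nextToken()                       # move past the {
        while self.currentToken.type == STRING:
            key, value = self.parseMember()
            members[key] = value
            if self.currentToken.type == COMMA:
                self.nextToken()               # skip the comma; the loop parses the next member
            else:
                break                          # no comma, so this must be the last member
        if self.currentToken.type != RIGHT_BRACE:
            raise ValueError("expected } to close the object")
        return members

    def parseMember(self):
        key = self.currentToken.literal
        if self.peekToken.type != COLON:
            raise ValueError("expected : after the member name")
        self.nextToken()                       # skip to the colon
        self.nextToken()                       # skip past it to the value
        if self.currentToken.type != STRING:
            raise ValueError("expected a string value")
        value = self.currentToken.literal
        self.nextToken()                       # move past the value
        return key, value
```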
That's it! The parser can now parse JSON with all its values as strings.
STEP 3.1 A bigger lexer
To improve the lexer’s capability, I updated it to recognize numerics, booleans, null, left bracket, and right bracket as tokens.
As usual, I added the new token types:
With the token types added, the lexer can be updated to identify the relevant tokens. Identifying the left and right brackets is easier to do:
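A sketch of the new types and the two bracket branches, which mirror the brace handling (... stands for the branches shown in earlier steps):

```python
# the new token types, added alongside the existing ones
NUMBER, BOOLEAN, NULL = "NUMBER", "BOOLEAN", "NULL"
LEFT_BRACKET, RIGHT_BRACKET = "LEFT_BRACKET", "RIGHT_BRACKET"

class Token:
    def __init__(self, type, literal):
        self.type, self.literal = type, literal

class Lexer:
    def __init__(self, text):
        self.text, self.position = text, 0

    def readCharacter(self):
        if self.position >= len(self.text):
            return None
        character = self.text[self.position]
        self.position += 1
        return character

    def nextToken(self):
        character = self.readCharacter()
        # brackets are single-character tokens, like the braces
        if character == "[":
            return Token(LEFT_BRACKET, character)
        if character == "]":
            return Token(RIGHT_BRACKET, character)
        ...   # numerics, booleans and null need the extra read methods
```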
Identifying booleans, null, and numerics required extra functions for actually reading the next sequence of characters. First, a readNumeric method in the lexer for reading numerics:
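A sketch of readNumeric, including the step backward the post describes (the exact set of accepted characters is my assumption):

```python
NUMBER = "NUMBER"

class Token:
    def __init__(self, type, literal):
        self.type, self.literal = type, literal

class Lexer:
    def __init__(self, text):
        self.text, self.position = text, 0

    def readCharacter(self):
        if self.position >= len(self.text):
            return None
        character = self.text[self.position]
        self.position += 1
        return character

    def readNumeric(self):
        """Read characters until one that isn't part of a number."""
        self.position -= 1   # re-read the digit nextToken already consumed
        characters = []
        character = self.readCharacter()
        while character is not None and (character.isdigit() or character in "+-.eE"):
            characters.append(character)
            character = self.readCharacter()
        if character is not None:
            self.position -= 1   # step back so the next call reads the right character
        return "".join(characters)

    def nextToken(self):
        character = self.readCharacter()
        if character is not None and (character.isdigit() or character == "-"):
            return Token(NUMBER, self.readNumeric())
        ...   # other token types handled as before
```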
The method reads a new character until it gets to one that isn't numeric. Right before it returns, I adjust the position attribute one step backward so the next call to nextToken reads the right character.
Also, a readKeyword method in which I read booleans (true or false) and the null type.
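A sketch of readKeyword; the bail-out once the word grows longer than any known keyword is my addition to keep the sketch from looping forever:

```python
BOOLEAN, NULL = "BOOLEAN", "NULL"
KEYWORDS = ("true", "false", "null")

class Token:
    def __init__(self, type, literal):
        self.type, self.literal = type, literal

class Lexer:
    def __init__(self, text):
        self.text, self.position = text, 0

    def readCharacter(self):
        if self.position >= len(self.text):
            return None
        character = self.text[self.position]
        self.position += 1
        return character

    def readKeyword(self):
        """Read characters until they spell a known keyword."""
        self.position -= 1   # re-read the letter nextToken already consumed
        characters = []
        while True:
            character = self.readCharacter()
            if character is None or len(characters) >= len("false"):
                raise ValueError("unknown keyword at position %d" % self.position)
            characters.append(character)
            word = "".join(characters)
            if word in KEYWORDS:   # the check readString doesn't have
                return word

    def nextToken(self):
        character = self.readCharacter()
        if character is not None and character.isalpha():   # letters start a keyword
            word = self.readKeyword()
            return Token(NULL if word == "null" else BOOLEAN, word)
        ...   # other token types handled as before
```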
This method is similar to readString but with the addition of the if condition, which checks whether the string we have so far is in the known list of keywords. There's also an important difference in when the functions are called: readString is called when we get to a quote character, which identifies the beginning of a string, while readKeyword is called when we get to an alphabetic character.
With that sorted, the lexer can now read and identify those token types.
STEP 3.2 A bigger parser
With the lexer able to identify all possible tokens, the parser can be updated to parse all the possible types for a value.
I modified the procedure for parsing a member: instead of just checking that the value is a string, the parser should attempt to parse the value using the parseValue procedure.
The procedure checks that the current token type is one of the accepted value types. You'll also notice another procedure for parsing arrays: if the current token type is a left bracket, then we should be at the beginning of an array, so the parseArray function is called.
Parsing an array generally involves checking that, until we meet the right bracket that closes the array, we keep getting a sequence of acceptable tokens always split by commas.
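Putting the pieces together, parseValue and parseArray might look like the sketch below. It is a complete minimal parser so the snippet runs on its own; the stand-in lexer and the plain dict/list return values are my simplifications, while the real code consumes the Lexer from earlier and returns AST nodes:

```python
(LEFT_BRACE, RIGHT_BRACE, LEFT_BRACKET, RIGHT_BRACKET, STRING,
 NUMBER, BOOLEAN, NULL, COLON, COMMA, EOF) = (
    "LEFT_BRACE", "RIGHT_BRACE", "LEFT_BRACKET", "RIGHT_BRACKET", "STRING",
    "NUMBER", "BOOLEAN", "NULL", "COLON", "COMMA", "EOF")

class Token:
    def __init__(self, type, literal):
        self.type, self.literal = type, literal

class FakeLexer:
    """Stand-in that serves tokens from a list, for illustration only."""
    def __init__(self, tokens):
        self.tokens = list(tokens)

    def nextToken(self):
        return self.tokens.pop(0) if self.tokens else Token(EOF, "")

class Parser:
    def __init__(self, lexer):
        self.lexer = lexer
        self.currentToken = None
        self.peekToken = None

    def nextToken(self):
        self.currentToken = self.peekToken
        self.peekToken = self.lexer.nextToken()

    def parseProgram(self):
        self.nextToken()
        self.nextToken()
        if self.currentToken.type != LEFT_BRACE:
            raise ValueError("valid JSON must start with an object: expected {")
        return self.parseObject()

    def parseObject(self):
        members = {}
        self.nextToken()                        # move past the {
        while self.currentToken.type == STRING:
            key, value = self.parseMember()
            members[key] = value
            if self.currentToken.type == COMMA:
                self.nextToken()                # skip the comma before the next member
            else:
                break
        if self.currentToken.type != RIGHT_BRACE:
            raise ValueError("expected } to close the object")
        self.nextToken()                        # move past the } (matters when nested)
        return members

    def parseMember(self):
        key = self.currentToken.literal
        if self.peekToken.type != COLON:
            raise ValueError("expected : after the member name")
        self.nextToken()                        # skip to the colon
        self.nextToken()                        # skip past it to the value
        return key, self.parseValue()

    def parseValue(self):
        token = self.currentToken
        if token.type == LEFT_BRACE:
            return self.parseObject()           # same procedure used at the very start
        if token.type == LEFT_BRACKET:
            return self.parseArray()
        if token.type in (STRING, NUMBER, BOOLEAN, NULL):
            self.nextToken()                    # move past the literal value
            return token.literal                # raw literal; the real code builds nodes
        raise ValueError("unexpected token %s in value position" % token.type)

    def parseArray(self):
        elements = []
        self.nextToken()                        # move past the [
        while self.currentToken.type != RIGHT_BRACKET:
            elements.append(self.parseValue())
            if self.currentToken.type == COMMA:
                self.nextToken()                # a comma splits every two elements
        self.nextToken()                        # move past the ]
        return elements
```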
In the parseValue method, you'll also notice a parseObject call; this calls the same function we call when we begin to parse the program because, remember, the first thing in JSON is an object.
That's it. The parser is now capable of parsing JSON as complex as this:
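For instance, a document along these lines (my own sample, not necessarily the exact one from the original post) now parses without errors:

```json
{
  "name": "samuel",
  "numbers": [1, 2.5, -3],
  "flags": {"active": true, "deleted": false},
  "nothing": null
}
```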
There are several tests for the parser in the codebase for automated checks, but running the parser itself amounts to reading a file and handing its contents to the parser. The parser in this case would attempt to parse test.json and throw an error if the JSON in the file is invalid.
I had fun building this and look forward to doing more similar projects. Until then, cheers 🥂