SAX

The term "SAX" originates from Simple API for XML. We borrow this term and apply it to the parsing and generation of JSON.

In Merak, Reader (a typedef of GenericReader<...>) is a SAX-style parser for JSON, while Writer (a class template) is a SAX-style generator for JSON.

Reader

Reader parses a JSON from an input stream. As it reads characters from the stream, it analyzes them based on JSON syntax and sends events to a handler.

For example, consider the following JSON:

{
    "hello": "world",
    "t": true,
    "f": false,
    "n": null,
    "i": 123,
    "pi": 3.1416,
    "a": [1, 2, 3, 4]
}

When a Reader parses this JSON, it sequentially sends the following events to the handler:

StartObject()
Key("hello", 5, true)
String("world", 5, true)
Key("t", 1, true)
Bool(true)
Key("f", 1, true)
Bool(false)
Key("n", 1, true)
Null()
Key("i", 1, true)
Uint(123)
Key("pi", 2, true)
Double(3.1416)
Key("a", 1, true)
StartArray()
Uint(1)
Uint(2)
Uint(3)
Uint(4)
EndArray(4)
EndObject(7)

These events can be easily mapped to the JSON structure, except for some event parameters that require further explanation. You can refer to the simplereader example, which produces exactly the same output as above:

#include "merak/json/reader.h"
#include <iostream>

using namespace merak::json;
using namespace std;

// Prints every event it receives; returning true tells the Reader to continue.
struct MyHandler : public BaseReaderHandler<UTF8<>, MyHandler> {
    bool Null() { cout << "Null()" << endl; return true; }
    bool Bool(bool b) { cout << "Bool(" << boolalpha << b << ")" << endl; return true; }
    bool Int(int i) { cout << "Int(" << i << ")" << endl; return true; }
    bool Uint(unsigned u) { cout << "Uint(" << u << ")" << endl; return true; }
    bool Int64(int64_t i) { cout << "Int64(" << i << ")" << endl; return true; }
    bool Uint64(uint64_t u) { cout << "Uint64(" << u << ")" << endl; return true; }
    bool Double(double d) { cout << "Double(" << d << ")" << endl; return true; }
    bool String(const char* str, SizeType length, bool copy) {
        cout << "String(" << str << ", " << length << ", " << boolalpha << copy << ")" << endl;
        return true;
    }
    bool StartObject() { cout << "StartObject()" << endl; return true; }
    bool Key(const char* str, SizeType length, bool copy) {
        cout << "Key(" << str << ", " << length << ", " << boolalpha << copy << ")" << endl;
        return true;
    }
    bool EndObject(SizeType memberCount) { cout << "EndObject(" << memberCount << ")" << endl; return true; }
    bool StartArray() { cout << "StartArray()" << endl; return true; }
    bool EndArray(SizeType elementCount) { cout << "EndArray(" << elementCount << ")" << endl; return true; }
};

int main() {
    const char json[] = " { \"hello\" : \"world\", \"t\" : true , \"f\" : false, \"n\": null, \"i\":123, \"pi\": 3.1416, \"a\":[1, 2, 3, 4] } ";

    MyHandler handler;
    Reader reader;
    StringStream ss(json);
    reader.Parse(ss, handler);

    return 0;
}

Note that Merak uses templates to statically bind the Reader type and the handler type, instead of using classes with virtual functions. This paradigm can improve performance by inlining functions.
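
The static binding described above can be sketched without the library. In the following minimal sketch, EmitSample and CountingHandler are hypothetical names that are not part of Merak. It only illustrates the idea: the event source is a template over the handler type, so every handler call is resolved at compile time and no virtual functions are involved.

```cpp
// Hypothetical event source: templated on the handler type, so each call
// below is resolved at compile time and can be inlined by the compiler.
template <typename Handler>
bool EmitSample(Handler& handler) {
    // Sends the events that correspond to the JSON text [true, null].
    return handler.StartArray()
        && handler.Bool(true)
        && handler.Null()
        && handler.EndArray(2);
}

// Hypothetical handler: a plain struct with no virtual functions.
struct CountingHandler {
    unsigned events;
    CountingHandler() : events(0) {}
    bool Null() { ++events; return true; }
    bool Bool(bool) { ++events; return true; }
    bool StartArray() { ++events; return true; }
    bool EndArray(unsigned) { ++events; return true; }
};
```

Calling EmitSample(handler) with a CountingHandler instantiates the template for that exact type. A handler that lacks one of the required functions is rejected at compile time rather than at run time.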

Handler

As shown in the previous example, the user needs to implement a handler to process events (function calls) from the Reader. The handler must include the following member functions:

class Handler {
    bool Null();
    bool Bool(bool b);
    bool Int(int i);
    bool Uint(unsigned i);
    bool Int64(int64_t i);
    bool Uint64(uint64_t i);
    bool Double(double d);
    bool RawNumber(const Ch* str, SizeType length, bool copy);
    bool String(const Ch* str, SizeType length, bool copy);
    bool StartObject();
    bool Key(const Ch* str, SizeType length, bool copy);
    bool EndObject(SizeType memberCount);
    bool StartArray();
    bool EndArray(SizeType elementCount);
};

  • Null() is called when the Reader encounters a JSON null value.
  • Bool(bool) is called when the Reader encounters a JSON true or false value.
  • When the Reader encounters a JSON number, it selects an appropriate C++ type mapping and then calls exactly one of Int(int), Uint(unsigned), Int64(int64_t), Uint64(uint64_t), and Double(double). If the kParseNumbersAsStrings option is enabled, the Reader will call RawNumber() instead.
  • When the Reader encounters a JSON string, it calls String(const char* str, SizeType length, bool copy):
    • The first parameter is a pointer to the string.
    • The second parameter is the length of the string (excluding the null terminator). Note that Merak supports null characters \0 in strings, in which case strlen(str) < length.
    • The final copy parameter indicates whether the handler needs to copy the string. In normal parsing, copy = true; copy = false only when using in situ parsing. Additionally, note that the character type is related to the target encoding, which we will discuss later.
  • When the Reader encounters the start of a JSON object, it calls StartObject(). A JSON object is a collection of key-value pairs (members). If the object contains members, the Reader first calls Key() for the member name, then calls the function that corresponds to the type of the value. It repeats this for each member and finally calls EndObject(SizeType memberCount). Note that the memberCount parameter is only for the handler's reference, and the user may not need it.
  • JSON arrays are similar to objects but simpler. At the start of an array, the Reader calls StartArray(). If the array contains elements, it calls the corresponding function according to the element type. Similarly, it finally calls EndArray(SizeType elementCount), where the elementCount parameter is only for the handler's reference.
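
The length parameter of String() matters because a JSON string may contain an escaped null character (\u0000). The following minimal sketch uses only the standard library. The names CopyEvent, NaiveLength, kEventData and kEventLength are hypothetical and only illustrate why a handler should build its copy from str and length instead of relying on strlen().

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: copies a string event using the explicit length,
// so embedded '\0' characters are preserved.
std::string CopyEvent(const char* str, std::size_t length) {
    return std::string(str, length);
}

// What a handler would see if it ignored the length parameter.
std::size_t NaiveLength(const char* str) {
    return std::strlen(str);  // stops at the first '\0'
}

// Bytes a handler would receive for the JSON string "a\u0000b":
// three characters, followed by the terminating null.
const char kEventData[] = {'a', '\0', 'b', '\0'};
const std::size_t kEventLength = 3;
```

Here NaiveLength(kEventData) is 1, while CopyEvent(kEventData, kEventLength) keeps all three characters.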

Each handler function returns a bool. Normally, they should return true. If the handler encounters an error, it can return false to notify the event sender to stop further processing.

For example, when parsing a JSON with the Reader, if the handler detects that the JSON does not conform to the required schema, the handler can return false to make the Reader stop subsequent parsing. The Reader will then enter an error state, marked with the kParseErrorTermination error code.
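
The return-value protocol can be illustrated with a small stand-alone sketch. EmitInts and NonNegativeHandler below are hypothetical and are not Merak code. The event source stops as soon as the handler returns false, which is the behavior described above for the Reader.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical event source: sends one Int() event per element and stops
// as soon as the handler returns false.
template <typename Handler>
bool EmitInts(const std::vector<int>& values, Handler& handler) {
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (!handler.Int(values[i]))
            return false;  // the handler asked to stop; later events are skipped
    }
    return true;
}

// Hypothetical handler enforcing a schema rule: only non-negative numbers.
struct NonNegativeHandler {
    std::vector<int> accepted;
    bool Int(int i) {
        if (i < 0)
            return false;  // schema violation: abort processing
        accepted.push_back(i);
        return true;
    }
};
```

For the values 1, -2, 3 the handler accepts only the first value, and EmitInts() returns false without ever sending the third event.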

GenericReader

As mentioned earlier, Reader is a typedef of the GenericReader template class:

namespace merak::json {

template <typename SourceEncoding, typename TargetEncoding, typename Allocator = MemoryPoolAllocator<> >
class GenericReader {
// ...
};

typedef GenericReader<UTF8<>, UTF8<> > Reader;

} // namespace merak::json

Reader uses UTF-8 as both the source and target encoding. The source encoding refers to the encoding of the JSON stream; the target encoding refers to the encoding used for the str parameter of String(). For example, to parse a UTF-8 stream and output to UTF-16 string events, you need to define a reader like this:

GenericReader<UTF8<>, UTF16<> > reader;

Note that the default character type of UTF16 is wchar_t. Therefore, this reader will call the handler's String(const wchar_t*, SizeType, bool).

The third template parameter Allocator is the type of allocator for internal data structures (actually a stack).

Parsing

The only function of Reader is to parse JSON.

template <unsigned parseFlags, typename InputStream, typename Handler>
bool Parse(InputStream& is, Handler& handler);

// Uses parseFlags = kDefaultParseFlags
template <typename InputStream, typename Handler>
bool Parse(InputStream& is, Handler& handler);

If an error occurs during parsing, it returns false. The user can call bool HasParseError(), ParseErrorCode GetParseErrorCode(), and size_t GetErrorOffset() to get the error state. In fact, Document uses these Reader functions to get parsing errors. Please refer to DOM for details about parsing errors.

Writer

Reader converts (parses) JSON into events; Writer does the exact opposite. It converts events into JSON.

Writer is very easy to use. If your application only needs to convert some data into JSON, it may be more convenient to use Writer directly than to build a Document and then convert it to JSON with Writer.

In the simplewriter example, we do the exact opposite of simplereader:

#include "merak/json/writer.h"
#include "merak/json/stringbuffer.h"
#include <iostream>

using namespace merak::json;
using namespace std;

int main() {
    StringBuffer s;
    Writer<StringBuffer> writer(s);

    writer.StartObject();
    writer.Key("hello");
    writer.String("world");
    writer.Key("t");
    writer.Bool(true);
    writer.Key("f");
    writer.Bool(false);
    writer.Key("n");
    writer.Null();
    writer.Key("i");
    writer.Uint(123);
    writer.Key("pi");
    writer.Double(3.1416);
    writer.Key("a");
    writer.StartArray();
    for (unsigned i = 0; i < 4; i++)
        writer.Uint(i);
    writer.EndArray();
    writer.EndObject();

    cout << s.GetString() << endl;

    return 0;
}

Output:

{"hello":"world","t":true,"f":false,"n":null,"i":123,"pi":3.1416,"a":[0,1,2,3]}

String() and Key() each have two overloads. One has 3 parameters, as per the handler concept, which can handle strings with null characters. The other is the simpler version used above.

Note that EndArray() and EndObject() in the example code have no parameters. A SizeType parameter can be passed, but it will be ignored by Writer.

You may wonder why not use sprintf() or std::stringstream to build a JSON?

There are several reasons:

  1. Writer guarantees well-formed output as long as the events arrive in a valid order. An incorrect event order (e.g., Int() directly after StartObject()) causes an assertion failure in debug mode.
  2. Writer::String() can handle string escaping (e.g., converting the code point U+000A to \n) and perform Unicode transcoding.
  3. Writer handles number output consistently.
  4. Writer implements the event handler concept. It can be used to process events from Reader, Document, or other event generators.
  5. Writer is optimized for different platforms.

In any case, using the Writer API to generate JSON is even simpler than these ad-hoc methods.
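
String escaping (reason 2) is easy to underestimate when building JSON by hand. The function below is a rough sketch of what an ad-hoc approach would have to implement itself. EscapeJsonString is a hypothetical helper, not Merak's implementation, and it covers only the escapes JSON requires (quotation mark, backslash, and control characters below U+0020), without any Unicode transcoding.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: wraps the input in quotes and escapes every
// character that may not appear unescaped inside a JSON string.
std::string EscapeJsonString(const std::string& in) {
    std::string out = "\"";
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        switch (c) {
        case '"':  out += "\\\""; break;
        case '\\': out += "\\\\"; break;
        case '\b': out += "\\b"; break;
        case '\f': out += "\\f"; break;
        case '\n': out += "\\n"; break;
        case '\r': out += "\\r"; break;
        case '\t': out += "\\t"; break;
        default:
            if (c < 0x20) {
                // Remaining control characters use the six-character \uXXXX form.
                char buf[7];
                std::snprintf(buf, sizeof(buf), "\\u%04X", static_cast<unsigned>(c));
                out += buf;
            } else {
                out += static_cast<char>(c);
            }
        }
    }
    out += '"';
    return out;
}
```

Even this sketch omits encoding validation and transcoding, both of which the text above attributes to Writer::String().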

Template

Writer has a slight design difference from Reader. Writer is a template class, not a typedef. There is no GenericWriter. The declaration of Writer is as follows:

namespace merak::json {

template<typename OutputStream, typename SourceEncoding = UTF8<>, typename TargetEncoding = UTF8<>, typename Allocator = CrtAllocator<> >
class Writer {
public:
    Writer(OutputStream& os, Allocator* allocator = 0, size_t levelDepth = kDefaultLevelDepth);
    // ...
};

} // namespace merak::json

  • The OutputStream template parameter is the type of the output stream. Its type cannot be inferred automatically and must be provided by the user.
  • The SourceEncoding template parameter specifies the encoding of String(const Ch*, ...).
  • The TargetEncoding template parameter specifies the encoding of the output stream.
  • Allocator is the type of allocator used to allocate internal data structures (a stack).

writeFlags is a combination of the following bit flags:

Write Bit Flag               Meaning
kWriteNoFlags                No flags.
kWriteDefaultFlags           Default writing options. Equivalent to the macro RAPIDJSON_WRITE_DEFAULT_FLAGS, which is defined as kWriteNoFlags.
kWriteValidateEncodingFlag   Validate the encoding of JSON strings.
kWriteNanAndInfFlag          Allow writing Infinity, -Infinity, and NaN.

In addition, the constructor of Writer has a levelDepth parameter, which affects the initial memory allocation for storing level information.

PrettyWriter

The output of Writer is the most compact JSON without whitespace characters, suitable for network transmission or storage, but not for human reading.

Therefore, Merak provides a PrettyWriter that adds indentation and line breaks to the output.

The usage of PrettyWriter is almost the same as Writer, except that PrettyWriter provides a SetIndent(Ch indentChar, unsigned indentCharCount) function. The default indentation is 4 spaces.

Completeness and Reset

A Writer can only output a single JSON, whose root node can be of any JSON type. When a single root node event (such as String()) is processed, or the matching final EndObject() or EndArray() event is processed, the output JSON is well-formed and complete. The user can call Writer::IsComplete() to detect completeness.

When the JSON is complete, the Writer cannot accept new events; otherwise, its output will be invalid (e.g., having more than one root node). To reuse the Writer object, the user can call Writer::Reset(OutputStream& os) to reset all its internal states and set a new output stream.
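
The notion of completeness can be illustrated with a toy model. MiniWriter below is a hypothetical class, not Merak's Writer: it supports only arrays and integers, writes to an internal std::string, and its Reset() takes no stream argument. It shows one plausible way to decide completeness, namely by tracking the nesting depth and whether a root value has been started.

```cpp
#include <cassert>
#include <string>

// Hypothetical miniature writer (arrays and integers only) that tracks the
// nesting depth to decide when the single root value is finished.
class MiniWriter {
public:
    MiniWriter() : depth_(0), hasRoot_(false), needComma_(false) {}

    bool StartArray() { Prefix(); out_ += '['; ++depth_; needComma_ = false; return true; }
    bool EndArray() {
        assert(depth_ > 0);  // an unmatched EndArray() is an invalid event order
        out_ += ']'; --depth_; needComma_ = true; return true;
    }
    bool Int(int i) { Prefix(); out_ += std::to_string(i); needComma_ = true; return true; }

    // Complete once a root value has been written and every array is closed.
    bool IsComplete() const { return hasRoot_ && depth_ == 0; }

    // Forgets all state so the object can produce another JSON.
    void Reset() { out_.clear(); depth_ = 0; hasRoot_ = false; needComma_ = false; }

    const std::string& GetString() const { return out_; }

private:
    // Writes the separating comma inside an array and records that a root exists.
    void Prefix() {
        if (depth_ > 0 && needComma_) out_ += ',';
        hasRoot_ = true;
    }

    std::string out_;
    unsigned depth_;
    bool hasRoot_;
    bool needComma_;
};
```

After the events StartArray(), Int(1), StartArray(), Int(2), EndArray(), the model is not yet complete; the final EndArray() closes the root and IsComplete() becomes true. After Reset(), a single Int() event is again a complete JSON.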

Techniques

Parsing JSON to Custom Data Structures

The parsing function of Document relies entirely on Reader. In fact, Document is a handler that receives events to build a DOM when parsing JSON.

Users can use Reader directly to build other data structures. This eliminates the step of building a DOM, thereby reducing memory overhead and improving performance.

In the messagereader example, ParseMessages() parses a JSON that should be an object containing key-value pairs:

#include "merak/json/reader.h"
#include "merak/json/error/en.h"
#include <iostream>
#include <string>
#include <map>

using namespace std;
using namespace merak::json;

typedef map<string, string> MessageMap;

// A small state machine that accepts only a flat object of string members.
struct MessageHandler
    : public BaseReaderHandler<UTF8<>, MessageHandler> {
    MessageHandler() : state_(kExpectObjectStart) {
    }

    bool StartObject() {
        switch (state_) {
        case kExpectObjectStart:
            state_ = kExpectNameOrObjectEnd;
            return true;
        default:
            return false;
        }
    }

    bool String(const char* str, SizeType length, bool) {
        switch (state_) {
        case kExpectNameOrObjectEnd:
            name_ = string(str, length);
            state_ = kExpectValue;
            return true;
        case kExpectValue:
            messages_.insert(MessageMap::value_type(name_, string(str, length)));
            state_ = kExpectNameOrObjectEnd;
            return true;
        default:
            return false;
        }
    }

    bool EndObject(SizeType) { return state_ == kExpectNameOrObjectEnd; }

    bool Default() { return false; } // All other events are invalid.

    MessageMap messages_;
    enum State {
        kExpectObjectStart,
        kExpectNameOrObjectEnd,
        kExpectValue,
    } state_;
    std::string name_;
};

void ParseMessages(const char* json, MessageMap& messages) {
    Reader reader;
    MessageHandler handler;
    StringStream ss(json);
    if (reader.Parse(ss, handler))
        messages.swap(handler.messages_); // Only change it on success.
    else {
        ParseErrorCode e = reader.GetParseErrorCode();
        size_t o = reader.GetErrorOffset();
        cout << "Error: " << GetParseError_En(e) << endl;
        cout << " at offset " << o << " near '" << string(json).substr(o, 10) << "...'" << endl;
    }
}

int main() {
    MessageMap messages;

    const char* json1 = "{ \"greeting\" : \"Hello!\", \"farewell\" : \"bye-bye!\" }";
    cout << json1 << endl;
    ParseMessages(json1, messages);

    for (MessageMap::const_iterator itr = messages.begin(); itr != messages.end(); ++itr)
        cout << itr->first << ": " << itr->second << endl;

    cout << endl << "Parse a JSON with invalid schema." << endl;
    const char* json2 = "{ \"greeting\" : \"Hello!\", \"farewell\" : \"bye-bye!\", \"foo\" : {} }";
    cout << json2 << endl;
    ParseMessages(json2, messages);

    return 0;
}

Output:

{ "greeting" : "Hello!", "farewell" : "bye-bye!" }
farewell: bye-bye!
greeting: Hello!

Parse a JSON with invalid schema.
{ "greeting" : "Hello!", "farewell" : "bye-bye!", "foo" : {} }
Error: Terminate parsing due to Handler error.
 at offset 59 near '} }...'

The first JSON (json1) is successfully parsed into MessageMap. Since MessageMap is a std::map, the printing order is sorted by key, which is different from the order in the JSON.

In the second JSON (json2), the value of foo is an empty object. Since it is an object, MessageHandler::StartObject() will be called. However, with state_ = kExpectValue, this function will return false, causing the parsing process to terminate. The error code is kParseErrorTermination.

Filtering JSON

As mentioned earlier, Writer can process events emitted by Reader. The example/condense/condense.cpp example simply sets Writer as the handler of a Reader, thus removing all whitespace characters from the JSON. The example/pretty/pretty.cpp example uses the same relationship, only replacing Writer with PrettyWriter. Therefore, pretty can reformat the JSON by adding indentation and line breaks.

In fact, we can use the SAX-style API to add (multiple) intermediate layers to filter the content of JSON. For example, the capitalize example can convert all JSON strings to uppercase:

#include "merak/json/reader.h"
#include "merak/json/writer.h"
#include "merak/json/filereadstream.h"
#include "merak/json/filewritestream.h"
#include "merak/json/error/en.h"
#include <vector>
#include <cctype>
#include <cstdio>

using namespace merak::json;

// Forwards every event to the output handler, upper-casing strings on the way.
template<typename OutputHandler>
struct CapitalizeFilter {
    CapitalizeFilter(OutputHandler& out) : out_(out), buffer_() {
    }

    bool Null() { return out_.Null(); }
    bool Bool(bool b) { return out_.Bool(b); }
    bool Int(int i) { return out_.Int(i); }
    bool Uint(unsigned u) { return out_.Uint(u); }
    bool Int64(int64_t i) { return out_.Int64(i); }
    bool Uint64(uint64_t u) { return out_.Uint64(u); }
    bool Double(double d) { return out_.Double(d); }
    bool RawNumber(const char* str, SizeType length, bool copy) { return out_.RawNumber(str, length, copy); }
    bool String(const char* str, SizeType length, bool) {
        buffer_.clear();
        for (SizeType i = 0; i < length; i++) {
            // Cast to unsigned char first: passing a negative char to toupper() is undefined.
            buffer_.push_back(static_cast<char>(std::toupper(static_cast<unsigned char>(str[i]))));
        }
        // An empty vector has no front(), so pass an empty literal for zero-length strings.
        return out_.String(buffer_.empty() ? "" : &buffer_.front(), length, true); // true = the output handler needs to copy the string
    }
    bool StartObject() { return out_.StartObject(); }
    bool Key(const char* str, SizeType length, bool copy) { return String(str, length, copy); }
    bool EndObject(SizeType memberCount) { return out_.EndObject(memberCount); }
    bool StartArray() { return out_.StartArray(); }
    bool EndArray(SizeType elementCount) { return out_.EndArray(elementCount); }

    OutputHandler& out_;
    std::vector<char> buffer_;
};

int main(int, char*[]) {
    // Prepare JSON reader and input stream.
    Reader reader;
    char readBuffer[65536];
    FileReadStream is(stdin, readBuffer, sizeof(readBuffer));

    // Prepare JSON writer and output stream.
    char writeBuffer[65536];
    FileWriteStream os(stdout, writeBuffer, sizeof(writeBuffer));
    Writer<FileWriteStream> writer(os);

    // The reader parses the input stream and the filter passes the events on to the writer.
    CapitalizeFilter<Writer<FileWriteStream> > filter(writer);
    if (!reader.Parse(is, filter)) {
        fprintf(stderr, "\nError(%u): %s\n", (unsigned)reader.GetErrorOffset(), GetParseError_En(reader.GetParseErrorCode()));
        return 1;
    }

    return 0;
}

Note that you cannot simply convert JSON to uppercase as a string. For example:

["Hello\nWorld"]

Simply converting the entire JSON to uppercase turns the escape sequence \n into \N, which is not a valid JSON escape:

["HELLO\NWORLD"]

Whereas capitalize will produce the correct result:

["HELLO\nWORLD"]

We can also develop more complex filters. However, since the SAX-style API only provides information about a single event at a time, the user needs to record some contextual information themselves (e.g., the path from the root node, or other related values). In some cases, a solution based on DOM is easier to implement than one based on SAX.
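
As a rough illustration of such bookkeeping, the sketch below keeps the path from the root node while events arrive. PathTracker is a hypothetical handler fragment, not part of Merak. It handles only objects and string values, and it uses std::size_t where a real handler would use SizeType.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical handler fragment: remembers the chain of keys leading to the
// current value, because a single SAX event carries no such context.
struct PathTracker {
    std::vector<std::string> path;     // keys of the enclosing members
    std::vector<std::string> strings;  // "path=value" for each string value seen

    bool StartObject() { return true; }
    bool Key(const char* str, std::size_t length, bool) {
        path.push_back(std::string(str, length));
        return true;
    }
    bool String(const char* str, std::size_t length, bool) {
        strings.push_back(Join() + "=" + std::string(str, length));
        if (!path.empty()) path.pop_back();  // the value of the current key is done
        return true;
    }
    bool EndObject(std::size_t) {
        // A nested object was the value of a key; the root object has no key.
        if (!path.empty()) path.pop_back();
        return true;
    }

    // Joins the current keys with dots, e.g. "a.b".
    std::string Join() const {
        std::string joined;
        for (std::size_t i = 0; i < path.size(); ++i) {
            if (i > 0) joined += '.';
            joined += path[i];
        }
        return joined;
    }
};
```

Feeding it the events for {"a":{"b":"x"},"c":"y"} records a.b=x and c=y. A DOM would provide the same information through plain member lookups, which is why a DOM can be the simpler choice for such tasks.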