SAX
The term "SAX" originates from Simple API for XML. We borrow this term and apply it to the parsing and generation of JSON.
In Merak, Reader (a typedef of GenericReader<...>) is a SAX-style parser for JSON, while Writer (a class template) is a SAX-style generator for JSON.
Reader
Reader parses a JSON from an input stream. As it reads characters from the stream, it analyzes them based on JSON syntax and sends events to a handler.
For example, consider the following JSON:
{
"hello": "world",
"t": true ,
"f": false,
"n": null,
"i": 123,
"pi": 3.1416,
"a": [1, 2, 3, 4]
}
When a Reader parses this JSON, it sequentially sends the following events to the handler:
StartObject()
Key("hello", 5, true)
String("world", 5, true)
Key("t", 1, true)
Bool(true)
Key("f", 1, true)
Bool(false)
Key("n", 1, true)
Null()
Key("i", 1, true)
Uint(123)
Key("pi", 2, true)
Double(3.1416)
Key("a", 1, true)
StartArray()
Uint(1)
Uint(2)
Uint(3)
Uint(4)
EndArray(4)
EndObject(7)
These events map directly onto the JSON structure, although some event parameters require further explanation. The simplereader example below produces the same sequence of events (its handler prints string arguments without the quotation marks shown above):
#include "merak/json/reader.h"
#include <cstdint>
#include <iostream>
using namespace merak::json;
using namespace std;
// Prints every event it receives and always asks the Reader to continue.
struct MyHandler : public BaseReaderHandler<UTF8<>, MyHandler> {
    bool Null() { cout << "Null()" << endl; return true; }
    bool Bool(bool b) { cout << "Bool(" << boolalpha << b << ")" << endl; return true; }
    bool Int(int i) { cout << "Int(" << i << ")" << endl; return true; }
    bool Uint(unsigned u) { cout << "Uint(" << u << ")" << endl; return true; }
    bool Int64(int64_t i) { cout << "Int64(" << i << ")" << endl; return true; }
    bool Uint64(uint64_t u) { cout << "Uint64(" << u << ")" << endl; return true; }
    bool Double(double d) { cout << "Double(" << d << ")" << endl; return true; }
    bool String(const char* str, SizeType length, bool copy) {
        cout << "String(" << str << ", " << length << ", " << boolalpha << copy << ")" << endl;
        return true;
    }
    bool StartObject() { cout << "StartObject()" << endl; return true; }
    bool Key(const char* str, SizeType length, bool copy) {
        cout << "Key(" << str << ", " << length << ", " << boolalpha << copy << ")" << endl;
        return true;
    }
    bool EndObject(SizeType memberCount) { cout << "EndObject(" << memberCount << ")" << endl; return true; }
    bool StartArray() { cout << "StartArray()" << endl; return true; }
    bool EndArray(SizeType elementCount) { cout << "EndArray(" << elementCount << ")" << endl; return true; }
};
int main() {
    const char json[] = " { \"hello\" : \"world\", \"t\" : true , \"f\" : false, \"n\": null, \"i\":123, \"pi\": 3.1416, \"a\":[1, 2, 3, 4] } ";
    MyHandler handler;
    Reader reader;
    StringStream ss(json);
    reader.Parse(ss, handler);
}
Note that Merak uses templates to statically bind the Reader type and the handler type, instead of using classes with virtual functions. This paradigm can improve performance by inlining functions.
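The static binding described above can be illustrated without Merak at all. The following is a minimal sketch, not Merak's implementation: EmitSampleArray is a hypothetical stand-in for a parser, RecordingHandler is a hypothetical handler, and unsigned stands in for SizeType. Because the handler type is a template parameter, no base class or virtual function is involved.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical event source standing in for a parser. Handler is a template
// parameter, so every call below is bound at compile time and can be inlined.
template <typename Handler>
bool EmitSampleArray(Handler& handler) {
    return handler.StartArray()
        && handler.Uint(1)
        && handler.Uint(2)
        && handler.EndArray(2);
}

// No base class and no virtual functions: the handler only has to provide
// the member functions that the event source actually calls.
struct RecordingHandler {
    std::vector<std::string> log;
    unsigned seen = 0;

    bool StartArray() { log.push_back("StartArray()"); return true; }
    bool Uint(unsigned u) {
        ++seen;
        log.push_back("Uint(" + std::to_string(u) + ")");
        return true;
    }
    bool EndArray(unsigned elementCount) {
        assert(elementCount == seen);  // the reported count matches what we saw
        log.push_back("EndArray(" + std::to_string(elementCount) + ")");
        return true;
    }
};
```

Calling EmitSampleArray(h) on a RecordingHandler h leaves four entries in h.log, from StartArray() to EndArray(2).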
Handler
As shown in the previous example, the user needs to implement a handler to process events (function calls) from the Reader. The handler must include the following member functions:
class Handler {
    bool Null();
    bool Bool(bool b);
    bool Int(int i);
    bool Uint(unsigned i);
    bool Int64(int64_t i);
    bool Uint64(uint64_t i);
    bool Double(double d);
    bool RawNumber(const Ch* str, SizeType length, bool copy);
    bool String(const Ch* str, SizeType length, bool copy);
    bool StartObject();
    bool Key(const Ch* str, SizeType length, bool copy);
    bool EndObject(SizeType memberCount);
    bool StartArray();
    bool EndArray(SizeType elementCount);
};
- Null() is called when the Reader encounters a JSON null value.
- Bool(bool) is called when the Reader encounters a JSON true or false value.
- When the Reader encounters a JSON number, it selects an appropriate C++ type mapping and then calls exactly one of Int(int), Uint(unsigned), Int64(int64_t), Uint64(uint64_t), or Double(double). If the kParseNumbersAsStrings option is enabled, the Reader calls RawNumber() instead.
- When the Reader encounters a JSON string, it calls String(const char* str, SizeType length, bool copy):
  - The first parameter is a pointer to the string.
  - The second parameter is the length of the string (excluding the null terminator). Note that Merak supports null characters \0 in strings, in which case strlen(str) < length.
  - The final copy parameter indicates whether the handler needs to copy the string. In normal parsing, copy = true; copy = false only when using in situ parsing. Additionally, note that the character type depends on the target encoding, which we will discuss later.
- When the Reader encounters the start of a JSON object, it calls StartObject(). A JSON object is a collection of key-value pairs (members). If the object contains members, the Reader first calls Key() for the member name, then calls the function corresponding to the type of the value. It repeats this for each member until it finally calls EndObject(SizeType memberCount). Note that the memberCount parameter is only for the handler's reference, and the user may not need it.
- JSON arrays are similar to objects but simpler. At the start of an array, the Reader calls StartArray(). If the array contains elements, it calls the function corresponding to the type of each element. Finally, it calls EndArray(SizeType elementCount), where the elementCount parameter is only for the handler's reference.
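The relationship between str and length matters as soon as a string contains \0. The sketch below is an illustration under stated assumptions rather than Merak code: StringCollector is a hypothetical handler fragment, and unsigned stands in for SizeType. It copies the event using the explicit length, so embedded null characters survive.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical handler fragment: stores the most recent string event.
struct StringCollector {
    std::string last;

    bool String(const char* str, unsigned length, bool /*copy*/) {
        // strlen stops at the first '\0', so it can only under-report the length.
        assert(std::strlen(str) <= length);
        // Copy with the explicit length instead of relying on the terminator.
        last.assign(str, length);
        return true;
    }
};
```

For the five-character input "ab\0cd", std::strlen reports 2, while the collected string keeps all 5 characters.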
Each handler function returns a bool. Normally, they should return true. If the handler encounters an error, it can return false to notify the event sender to stop further processing.
For example, when parsing a JSON with the Reader, if the handler detects that the JSON does not conform to the required schema, the handler can return false to make the Reader stop subsequent parsing. The Reader will then enter an error state, marked with the kParseErrorTermination error code.
GenericReader
As mentioned earlier, Reader is a typedef of the GenericReader template class:
namespace merak::json {
template <typename SourceEncoding, typename TargetEncoding, typename Allocator = MemoryPoolAllocator<> >
class GenericReader {
// ...
};
typedef GenericReader<UTF8<>, UTF8<> > Reader;
} // namespace merak::json
Reader uses UTF-8 as both the source and target encoding. The source encoding refers to the encoding of the JSON stream; the target encoding refers to the encoding used for the str parameter of String(). For example, to parse a UTF-8 stream and output to UTF-16 string events, you need to define a reader like this:
GenericReader<UTF8<>, UTF16<> > reader;
Note that the default character type of UTF16<> is wchar_t. Therefore, this reader will call the handler's String(const wchar_t*, SizeType, bool).
The third template parameter Allocator is the type of allocator for internal data structures (actually a stack).
Parsing
The only function of Reader is to parse JSON.
template <unsigned parseFlags, typename InputStream, typename Handler>
bool Parse(InputStream& is, Handler& handler);
// Uses parseFlags = kDefaultParseFlags
template <typename InputStream, typename Handler>
bool Parse(InputStream& is, Handler& handler);
If an error occurs during parsing, it returns false. The user can call bool HasParseError(), ParseErrorCode GetParseErrorCode(), and size_t GetErrorOffset() to get the error state. In fact, Document uses these Reader functions to get parsing errors. Please refer to DOM for details about parsing errors.
Writer
Reader converts (parses) JSON into events; Writer does the exact opposite. It converts events into JSON.
Writer is very easy to use. If your application only needs to convert some data into JSON, it may be more convenient to use Writer directly than to build a Document and then convert it to JSON with Writer.
In the simplewriter example, we do the exact opposite of simplereader:
#include "merak/json/writer.h"
#include "merak/json/stringbuffer.h"
#include <iostream>
using namespace merak::json;
using namespace std;
int main() {
    StringBuffer s;
    Writer<StringBuffer> writer(s);
    writer.StartObject();
    writer.Key("hello");
    writer.String("world");
    writer.Key("t");
    writer.Bool(true);
    writer.Key("f");
    writer.Bool(false);
    writer.Key("n");
    writer.Null();
    writer.Key("i");
    writer.Uint(123);
    writer.Key("pi");
    writer.Double(3.1416);
    writer.Key("a");
    writer.StartArray();
    for (unsigned i = 0; i < 4; i++)
        writer.Uint(i);
    writer.EndArray();
    writer.EndObject();
    // The buffer now holds the complete JSON text.
    cout << s.GetString() << endl;
}
Output:
{"hello":"world","t":true,"f":false,"n":null,"i":123,"pi":3.1416,"a":[0,1,2,3]}
String() and Key() each have two overloads. One has 3 parameters, as per the handler concept, which can handle strings with null characters. The other is the simpler version used above.
Note that EndArray() and EndObject() in the example code have no parameters. A SizeType parameter can be passed, but it will be ignored by Writer.
You may wonder why you should not simply use sprintf() or std::stringstream to build JSON.
There are several reasons:
- Writer will definitely output a well-formed JSON. If there is an incorrect event order (e.g., Int() follows StartObject()), it will cause an assertion failure in debug mode.
- Writer::String() can handle string escaping (e.g., converting the code point U+000A to \n) and perform Unicode transcoding.
- Writer handles number output consistently.
- Writer implements the event handler concept. It can be used to process events from Reader, Document, or other event generators.
- Writer is optimized for different platforms.
In any case, using the Writer API to generate JSON is even simpler than these ad-hoc methods.
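To make the escaping point concrete, here is a minimal sketch of what an ad-hoc approach would have to reimplement. EscapeJsonString is a hypothetical helper, not part of Merak and not how Writer is implemented; it covers only the quotation mark, the backslash, and ASCII control characters, and performs no Unicode validation or transcoding.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: returns `in` as a quoted JSON string literal.
std::string EscapeJsonString(const std::string& in) {
    std::string out = "\"";
    for (unsigned char c : in) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\b': out += "\\b";  break;
            case '\f': out += "\\f";  break;
            case '\n': out += "\\n";  break;
            case '\r': out += "\\r";  break;
            case '\t': out += "\\t";  break;
            default:
                if (c < 0x20) {
                    // Remaining control characters need the \uXXXX form.
                    char buf[7];
                    std::snprintf(buf, sizeof(buf), "\\u%04X", static_cast<unsigned>(c));
                    out += buf;
                } else {
                    out += static_cast<char>(c);
                }
        }
    }
    out += '"';
    assert(out.size() >= in.size() + 2);  // escaping never shrinks the text
    return out;
}
```

Even this small helper has several cases that are easy to get wrong, which is the kind of detail Writer::String() takes care of.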
Template
Writer has a slight design difference from Reader. Writer is a template class, not a typedef. There is no GenericWriter. The declaration of Writer is as follows:
namespace merak::json {
template<typename OutputStream, typename SourceEncoding = UTF8<>, typename TargetEncoding = UTF8<>, typename Allocator = CrtAllocator<> >
class Writer {
public:
    Writer(OutputStream& os, Allocator* allocator = 0, size_t levelDepth = kDefaultLevelDepth);
    // ...
};
} // namespace merak::json
- The OutputStream template parameter is the type of the output stream. Its type cannot be inferred automatically and must be provided by the user.
- The SourceEncoding template parameter specifies the encoding of String(const Ch*, ...).
- The TargetEncoding template parameter specifies the encoding of the output stream.
- Allocator is the type of allocator used to allocate internal data structures (a stack).
writeFlags is a combination of the following bit flags:
| Write Bit Flag | Meaning |
|---|---|
| kWriteNoFlags | No flags. |
| kWriteDefaultFlags | Default write flags. Equivalent to the macro RAPIDJSON_WRITE_DEFAULT_FLAGS, which is defined as kWriteNoFlags. |
| kWriteValidateEncodingFlag | Validate the encoding of JSON strings. |
| kWriteNanAndInfFlag | Allow writing Infinity, -Infinity, and NaN. |
In addition, the constructor of Writer has a levelDepth parameter, which affects the initial memory allocation for storing level information.
PrettyWriter
The output of Writer is the most compact JSON without whitespace characters, suitable for network transmission or storage, but not for human reading.
Therefore, Merak provides a PrettyWriter that adds indentation and line breaks to the output.
The usage of PrettyWriter is almost the same as Writer, except that PrettyWriter provides a SetIndent(Ch indentChar, unsigned indentCharCount) function. The default indentation is 4 spaces.
Completeness and Reset
A Writer can only output a single JSON, whose root node can be of any JSON type. When a single root node event (such as String()) is processed, or the matching final EndObject() or EndArray() event is processed, the output JSON is well-formed and complete. The user can call Writer::IsComplete() to detect completeness.
When the JSON is complete, the Writer cannot accept new events; otherwise, its output will be invalid (e.g., having more than one root node). To reuse the Writer object, the user can call Writer::Reset(OutputStream& os) to reset all its internal states and set a new output stream.
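The completeness rule above can be modeled with two pieces of state. The following sketch is a hypothetical model, not Merak's implementation of IsComplete() or Reset(): CompletenessTracker and its member names are assumptions, it ignores keys, and it does not check that object and array ends match their starts. The assertions mirror the debug-mode assertion failures mentioned earlier.

```cpp
#include <cassert>

// Hypothetical model of the completeness rule; not Merak's implementation.
class CompletenessTracker {
public:
    // A scalar event such as String() or Uint().
    void Value() {
        assert(!IsComplete());  // a complete JSON accepts no further events
        hasRoot_ = true;
    }
    // Containers count as a value and open one more nesting level.
    void StartObject() { Value(); ++depth_; }
    void StartArray()  { Value(); ++depth_; }
    void EndObject()   { assert(depth_ > 0); --depth_; }
    void EndArray()    { assert(depth_ > 0); --depth_; }

    // Complete once a root value has been seen and every container is closed.
    bool IsComplete() const { return hasRoot_ && depth_ == 0; }

    // Forget all state so the tracker can follow a new JSON.
    void Reset() { hasRoot_ = false; depth_ = 0; }

private:
    bool hasRoot_ = false;
    unsigned depth_ = 0;
};
```

A single root scalar makes the tracker complete immediately, whereas an array only becomes complete at its matching EndArray().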
Techniques
Parsing JSON to Custom Data Structures
The parsing function of Document relies entirely on Reader. In fact, Document is a handler that receives events to build a DOM when parsing JSON.
Users can use Reader directly to build other data structures. This eliminates the step of building a DOM, thereby reducing memory overhead and improving performance.
In the messagereader example, ParseMessages() parses a JSON that should be an object containing key-value pairs:
#include "merak/json/reader.h"
#include "merak/json/error/en.h"
#include <iostream>
#include <string>
#include <map>
using namespace std;
using namespace merak::json;
typedef map<string, string> MessageMap;
struct MessageHandler
    : public BaseReaderHandler<UTF8<>, MessageHandler> {
    MessageHandler() : state_(kExpectObjectStart) {
    }
    bool StartObject() {
        switch (state_) {
        case kExpectObjectStart:
            state_ = kExpectNameOrObjectEnd;
            return true;
        default:
            return false;
        }
    }
    bool String(const char* str, SizeType length, bool) {
        switch (state_) {
        case kExpectNameOrObjectEnd:
            name_ = string(str, length);
            state_ = kExpectValue;
            return true;
        case kExpectValue:
            messages_.insert(MessageMap::value_type(name_, string(str, length)));
            state_ = kExpectNameOrObjectEnd;
            return true;
        default:
            return false;
        }
    }
    bool EndObject(SizeType) { return state_ == kExpectNameOrObjectEnd; }
    bool Default() { return false; } // All other events are invalid.
    MessageMap messages_;
    enum State {
        kExpectObjectStart,
        kExpectNameOrObjectEnd,
        kExpectValue,
    } state_;
    std::string name_;
};
void ParseMessages(const char* json, MessageMap& messages) {
    Reader reader;
    MessageHandler handler;
    StringStream ss(json);
    if (reader.Parse(ss, handler))
        messages.swap(handler.messages_); // Only change it if success.
    else {
        ParseErrorCode e = reader.GetParseErrorCode();
        size_t o = reader.GetErrorOffset();
        cout << "Error: " << GetParseError_En(e) << endl;
        cout << " at offset " << o << " near '" << string(json).substr(o, 10) << "...'" << endl;
    }
}
int main() {
    MessageMap messages;
    const char* json1 = "{ \"greeting\" : \"Hello!\", \"farewell\" : \"bye-bye!\" }";
    cout << json1 << endl;
    ParseMessages(json1, messages);
    for (MessageMap::const_iterator itr = messages.begin(); itr != messages.end(); ++itr)
        cout << itr->first << ": " << itr->second << endl;
    cout << endl << "Parse a JSON with invalid schema." << endl;
    const char* json2 = "{ \"greeting\" : \"Hello!\", \"farewell\" : \"bye-bye!\", \"foo\" : {} }";
    cout << json2 << endl;
    ParseMessages(json2, messages);
    return 0;
}
Output:
{ "greeting" : "Hello!", "farewell" : "bye-bye!" }
farewell: bye-bye!
greeting: Hello!
Parse a JSON with invalid schema.
{ "greeting" : "Hello!", "farewell" : "bye-bye!", "foo" : {} }
Error: Terminate parsing due to Handler error.
at offset 59 near '} }...'
The first JSON (json1) is successfully parsed into MessageMap. Since MessageMap is a std::map, the printing order is sorted by key, which is different from the order in the JSON.
In the second JSON (json2), the value of foo is an empty object. Since it is an object, MessageHandler::StartObject() will be called. However, with state_ = kExpectValue, this function will return false, causing the parsing process to terminate. The error code is kParseErrorTermination.
Filtering JSON
As mentioned earlier, Writer can process events emitted by Reader. The example/condense/condense.cpp example simply sets Writer as the handler of a Reader, thus removing all whitespace characters from the JSON. The example/pretty/pretty.cpp example uses the same relationship, only replacing Writer with PrettyWriter. Therefore, pretty can reformat the JSON by adding indentation and line breaks.
In fact, we can use the SAX-style API to add (multiple) intermediate layers to filter the content of JSON. For example, the capitalize example can convert all JSON strings to uppercase:
#include "merak/json/reader.h"
#include "merak/json/writer.h"
#include "merak/json/filereadstream.h"
#include "merak/json/filewritestream.h"
#include "merak/json/error/en.h"
#include <vector>
#include <cctype>
#include <cstdio>
using namespace merak::json;
template<typename OutputHandler>
struct CapitalizeFilter {
    CapitalizeFilter(OutputHandler& out) : out_(out), buffer_() {
    }
    bool Null() { return out_.Null(); }
    bool Bool(bool b) { return out_.Bool(b); }
    bool Int(int i) { return out_.Int(i); }
    bool Uint(unsigned u) { return out_.Uint(u); }
    bool Int64(int64_t i) { return out_.Int64(i); }
    bool Uint64(uint64_t u) { return out_.Uint64(u); }
    bool Double(double d) { return out_.Double(d); }
    bool RawNumber(const char* str, SizeType length, bool copy) { return out_.RawNumber(str, length, copy); }
    bool String(const char* str, SizeType length, bool) {
        buffer_.clear();
        for (SizeType i = 0; i < length; i++) {
            // std::toupper requires a value representable as unsigned char.
            buffer_.push_back(static_cast<char>(std::toupper(static_cast<unsigned char>(str[i]))));
        }
        buffer_.push_back('\0'); // keeps the buffer non-empty and null-terminated, even for ""
        return out_.String(&buffer_.front(), length, true); // true = output handler need to copy the string
    }
    bool StartObject() { return out_.StartObject(); }
    bool Key(const char* str, SizeType length, bool copy) { return String(str, length, copy); }
    bool EndObject(SizeType memberCount) { return out_.EndObject(memberCount); }
    bool StartArray() { return out_.StartArray(); }
    bool EndArray(SizeType elementCount) { return out_.EndArray(elementCount); }
    OutputHandler& out_;
    std::vector<char> buffer_;
};
int main(int, char*[]) {
    // Prepare JSON reader and input stream.
    Reader reader;
    char readBuffer[65536];
    FileReadStream is(stdin, readBuffer, sizeof(readBuffer));
    // Prepare JSON writer and output stream.
    char writeBuffer[65536];
    FileWriteStream os(stdout, writeBuffer, sizeof(writeBuffer));
    Writer<FileWriteStream> writer(os);
    // JSON reader parse from the input stream and let writer generate the output.
    CapitalizeFilter<Writer<FileWriteStream> > filter(writer);
    if (!reader.Parse(is, filter)) {
        fprintf(stderr, "\nError(%u): %s\n", (unsigned)reader.GetErrorOffset(), GetParseError_En(reader.GetParseErrorCode()));
        return 1;
    }
    return 0;
}
Note that you cannot simply convert JSON to uppercase as a string. For example:
["Hello\nWorld"]
Simply converting the entire JSON to uppercase will produce an incorrect escape character:
["HELLO\NWORLD"]
Whereas capitalize will produce the correct result:
["HELLO\nWORLD"]
We can also develop more complex filters. However, since the SAX-style API only provides information about a single event at a time, the user needs to record some contextual information themselves (e.g., the path from the root node, or other related values). In some cases, using the DOM is easier than using SAX.
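As an illustration of the contextual bookkeeping mentioned above, the following sketch records the key path from the root for every string value. PathTrackingHandler is a hypothetical handler fragment and unsigned stands in for SizeType; it handles only objects and strings (arrays and other value types are left out), so it is a starting point rather than a complete handler.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical handler fragment: tracks the key path from the root.
struct PathTrackingHandler {
    std::vector<std::string> path;   // current key of each open object
    std::vector<std::string> found;  // "path=value" for every string value

    bool StartObject() { path.push_back(""); return true; }
    bool Key(const char* str, unsigned length, bool /*copy*/) {
        assert(!path.empty());       // a key can only appear inside an object
        path.back().assign(str, length);
        return true;
    }
    bool String(const char* str, unsigned length, bool /*copy*/) {
        found.push_back(CurrentPath() + "=" + std::string(str, length));
        return true;
    }
    bool EndObject(unsigned /*memberCount*/) {
        assert(!path.empty());       // every end must match an earlier start
        path.pop_back();
        return true;
    }

    // Joins the open keys into a slash-separated path such as "/a/b".
    std::string CurrentPath() const {
        std::string joined;
        for (const std::string& key : path) joined += "/" + key;
        return joined;
    }
};
```

Feeding it the events for {"a":{"b":"x"},"c":"y"} records "/a/b=x" followed by "/c=y".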