In this manual you'll find the following sections:
Section 1: Basic elements to understand XML and Axl
Section 2: Manipulating and producing XML documents
Section 3: Doing validation on your documents
Section 4: Advanced topics
Appendix
The XML 1.0 definition allows you to build documents that can represent textual information, remote procedure invocations or dynamic user interfaces. It is based on very simple principles that developers can compose to create bigger abstractions, which are found almost everywhere in modern software design.
It is a "quite" human readable format, so you will find that it is not the best choice if you are looking for space efficiency. What XML 1.0 provides, on the other hand, is the ability to quickly prototype and produce working formats that encapsulate your data and, as your system evolves, XML 1.0 will evolve with it.
Among other things, XML 1.0 provides ways to validate your documents to ensure your code reads XML documents in the expected format, reducing the time and development cost of additional checks.
Before continuing, we will explain some concepts that are required to understand XML 1.0 and why the Axl API was built this way.
Here is a simple example of an XML 1.0 document:
<?xml version="1.0" ?>
<!-- This is a comment -->
<complex>
  <data>
    <simple>10</simple>
    <empty attr1="value1" />
  </data>
</complex>
The previous XML document represents a structure with a top level node, called complex, that has a single child called data, which in turn has two children. The first one, called simple, has content; the other one, called empty, is what is usually called an empty XML node.
The tree representation for the previous document is the following:
[Figure: Document representation]
Several issues must be considered while interpreting the previous diagram and how Axl Library parses and exposes those elements through the API to the client application:
Every XML document has a root node (axl_doc_get_root), without exception. In this case, the root node for our example is complex.
If a node has content, that content is not represented with another node. The content is associated with the node and can be retrieved using several functions (axl_node_get_content, axl_node_get_content_copy and axl_node_get_content_trans).
Alternatively, while using the MIXED API, you can traverse child items stored for a particular node, detecting those items that are ITEM_CONTENT or ITEM_CDATA (using axl_item_get_type).
Having a node (axlNode) with content doesn't mean having a node with children. The child notion only refers to having more XML nodes (axlNode) as children.
This is particularly important if you take into consideration that a node can have content (ITEM_CONTENT), comments (ITEM_COMMENT), application processing instructions (ITEM_PI) and CDATA content (uninterpreted content, ITEM_CDATA), all of them mixed with more XML nodes (ITEM_NODE).
A final node which is empty because it has neither content nor children is usually referred to as an EMPTY type node. A final node with content but no children is usually referred to as PCDATA. A node that has content mixed with references to more child XML nodes is referred to as MIXED.
At the empty node, you'll find that it has an attribute called attr1 with the value value1. A node can have any number of attributes, but each of them must have a different name. Again, an empty node remains empty even if it has attributes.
So, to summarize, we have a root node that can contain more nodes, which can in turn contain PCDATA (content), and those nodes can carry named attributes with values.
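The EMPTY, PCDATA and MIXED notions described above can be sketched in a few lines of plain C. This is only an illustration: the NodeKind enum and the classify_node function are hypothetical names that are not part of the Axl API, and KIND_CHILDREN is a label chosen here for a node that only holds child nodes.

```c
#include <stddef.h>

/* Hypothetical node kinds, used only for this illustration
 * (they are not Axl types). */
typedef enum {
    KIND_EMPTY,     /* no content, no child nodes        */
    KIND_PCDATA,    /* content only                      */
    KIND_CHILDREN,  /* child nodes only                  */
    KIND_MIXED      /* content mixed with child nodes    */
} NodeKind;

/* Classify a node from the number of child nodes it holds and
 * whether it carries content. Attributes are deliberately ignored:
 * an empty node remains empty even if it has attributes. */
NodeKind classify_node (size_t child_nodes, int has_content)
{
    if (child_nodes == 0)
        return has_content ? KIND_PCDATA : KIND_EMPTY;
    return has_content ? KIND_MIXED : KIND_CHILDREN;
}
```

Applied to the first example document, the empty node (no children, no content, one attribute) would be classified as KIND_EMPTY, simple as KIND_PCDATA and data as KIND_CHILDREN.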
XML 1.0 is used for a variety of purposes; some of them require the CHILDREN API and the rest the MIXED API. By "require" we mean that it fits better: you will get better results, your application will behave properly and you'll have to do less work.
The reason for having both APIs is simple. The XML 1.0 definition allows content to be mixed with more nodes, comments and many other elements placed as children of a particular node.
This definition, found in the standard, has led many XML implementations to support only one API that covers all these features: an interface that is complicated and overloaded, and that gives you power you don't need while making your development less efficient.
However, developers often only require a common form of XML, called CHILDREN, in which a node has either child nodes or content, but not both at the same time. This kind of XML is really useful: it is easy to parse, easy to describe with a DTD, more compact and extensible.
Let's see an example of both formats to clarify:
<?xml version='1.0' ?>
<document>
  <!-- Children XML format example: as you can see -->
  <!-- nodes only contains either nodes or node content -->
  <!-- but nothing mixed at the same level -->
  <node1>
    This is node1 content
  </node1>
  <node2>
    <node3>
      This is node3 content
    </node3>
    <node4 />
  </node2>
</document>
While a MIXED XML document could be:
<?xml version='1.0' ?>
<document>
  <!-- Children XML format example: as you can see -->
  <!-- nodes only contains either nodes or node content -->
  <!-- but nothing mixed at the same level -->
  <node1>
    This is node1 content
  </node1>
  Content mixed with xml nodes at the same level.
  <node2>
    More content....
    <node3>
      This is node3 content
    </node3>
    <node4 />
  </node2>
</document>
Both approaches, which are valid using the XML 1.0 standard, are appropriate for particular situations:
Having introduced the context of the problem, Axl Library takes no position: it provides an API that fits XML content following the CHILDREN description and another API for the MIXED description.
In this context, which API you use will only affect the way you traverse the document. The CHILDREN API is mainly provided by the Axl Node interface and the MIXED API is mainly provided by the Axl Item interface.
You don't need to do any special operation to activate either API; both are provided at the same time. Let's see an example:
Assuming the previous mixed example, the following code will get access to the <node2> reference:
// supposing "doc" reference contains the document loaded
axlNode * node;

// get the document root, that is <document>
node = axl_doc_get_root (doc);

// get the first child for the document root (<node1>)
node = axl_node_get_first_child (node);

// get the next child (brother of <node1>, that is <node2>)
node = axl_node_get_next (node);
However, with the MIXED API you can get every detail, every item found for a particular node. This is how:
// supposing "doc" reference contains the document loaded
axlNode * node;
axlItem * item;

// get the document root, that is <document>
node = axl_doc_get_root (doc);

// get the first item child for the document root that is the comment:
// "Children XML format example: as you can see".
item = axl_item_get_first_child (node);

// now skip the following two comments
item = axl_item_get_next (item);
item = axl_item_get_next (item);

// now the next item is holding the <node1>
item = axl_item_get_next (item);
node = axl_item_get_data (item);

// now get the content between the <node1> and <node2>
item = axl_item_get_next (item);

// and finally, get the next child (brother of <node1>, that is
// <node2>)
item = axl_item_get_next (item);
node = axl_item_get_data (item);
Obviously, the mixed example contains more code and is more fragile to XML document changes. The problem is that the MIXED API is more general than the CHILDREN API, which leads many XML libraries to provide only that API.
As a consequence:
We have seen what an XML document looks like. Now we are going to see how to parse those documents into data structures that can be used to inspect the content. All parsing functions are available at the Axl Doc interface.
Let's start with a very simple example:
#include <axl.h>
#include <stdio.h>

int main (int argc, char ** argv)
{
    // top level definitions
    axlError * error = NULL;
    axlDoc   * doc   = NULL;

    // initialize axl library
    if (! axl_init ()) {
        printf ("Unable to initialize Axl library\n");
        return -1;
    }

    // get current doc reference
    doc = axl_doc_parse_from_file ("large.xml", &error);
    if (doc == NULL) {
        // report the failure, release the error and the library
        printf ("Unable to parse document: %s\n", axl_error_get (error));
        axl_error_free (error);
        axl_end ();
        return -1;
    }

    // DO SOME WORK WITH THE DOCUMENT HERE

    // release the document
    axl_doc_free (doc);

    // cleanup axl library
    axl_end ();

    return 0;
}
Once the document is loaded you can use several functions to traverse it.
First you must use axl_doc_get_root to get the document root (axlNode), which contains all the information. Then, according to the interface you are using, you must call either axl_node_get_first_child or axl_item_get_first_child.
Once you have access to the first element, you can use the following set of functions to get more references to other nodes or items:
MIXED API:
CHILDREN API:
There are alternative APIs that allow you to iterate over the document by providing a callback: axl_doc_iterate.
Another approach is to use axl_doc_get and axl_doc_get_content_at to get fast access to a particular node using a really limited XPath syntax.
One feature that comes with Axl Library is the ability to modify the content, replacing it with other content and transferring a node to another place.
Check the following functions while operating with axlNode elements:
Check the following functions while operating with axlItem elements:
Axl Library comes with several functions to perform XML memory dump operations, allowing you to translate an XML representation (axlDoc or axlNode) into a string:
In case you want to produce XML content taking a particular node as reference, use:
Once you are familiar with the Axl API, or any other XML toolkit, it turns out that it is not good practice to write a lot of source code to check the expected node names or how they are nested. This makes your program fragile to changes and forces you to write more code that is not actual work but a simple environment check.
You may also need to check that a received XML document follows a defined XML structure, but doing so by hand is too complex.
For this purpose, XML 1.0 defines the DTD (Document Type Definition), which allows you to specify the document grammar: how nodes are nested, which attributes they can contain, and whether they are allowed to be empty nodes or must have child nodes.
Let's start with the DTD syntax used to configure restrictions on the node structure:
<!-- sequence specification -->
<!ELEMENT testA (test1, test2, test3)>
<!-- choice specification -->
<!ELEMENT testB (test1 | test2 | test3)>
The DTD <!ELEMENT declaration is modeled on top of two concepts which are later expanded with repetition patterns. We will explain them later. For now, these two top level concepts are: sequence and choice.
A sequence specification (elements separated by a comma), the one used to restrict the node testA, denotes that testA has as children test1, followed by test2 and ended by test3. The order specified must be followed and all instances must appear. This can be tweaked using repetition patterns.
On the other hand, a choice specification (elements separated by |, the pipe) specifies that the content of a node is built using nodes from the choice list. So, in this case, the testB node can have one instance of either test1, test2 or test3.
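The difference between both concepts can be sketched in plain C. This is a minimal illustration under stated assumptions, not the way the Axl validation engine is implemented: matches_sequence and matches_choice are hypothetical helpers that receive the child node names as an array of strings and ignore repetition patterns.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if the children appear exactly as listed in the sequence:
 * same order, with no missing or additional element. */
int matches_sequence (const char **children, size_t n_children,
                      const char **sequence, size_t n_sequence)
{
    size_t i;

    assert (children != NULL && sequence != NULL);

    if (n_children != n_sequence)
        return 0;
    for (i = 0; i < n_sequence; i++) {
        if (strcmp (children[i], sequence[i]) != 0)
            return 0;
    }
    return 1;
}

/* Return 1 if there is exactly one child and its name is one of the
 * options found in the choice list. */
int matches_choice (const char **children, size_t n_children,
                    const char **choice, size_t n_choice)
{
    size_t i;

    assert (children != NULL && choice != NULL);

    if (n_children != 1)
        return 0;
    for (i = 0; i < n_choice; i++) {
        if (strcmp (children[0], choice[i]) == 0)
            return 1;
    }
    return 0;
}
```

With the list test1, test2, test3 as specification, matches_sequence only accepts the three children in that order (the testA case), while matches_choice accepts a single child named after any of them (the testB case).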
Now that you know these two basic elements to model how children are organized for a node, what is needed is to keep adding more <!ELEMENT directives until all nodes are specified. You will end your DTD document with final nodes that are either empty or have PCDATA. At this moment MIXED nodes are not supported.
Suppose that all nodes that are inside testA and testB are final ones. Then this could be its DTD specification:
<!-- test1 is a node that only has content -->
<!ELEMENT test1 (#PCDATA)>
<!-- test2 is a node that is always empty -->
<!ELEMENT test2 EMPTY>
<!-- test3 is a node that could have either test1 or test2 -->
<!ELEMENT test3 (test1 | test2)>
Sequences and choices can be composed to create richer DTD expressions that combine sequences of choices and so on.
At this point all the elements required to model choices, sequences and final nodes have been explained, but we still have to talk about repetition patterns. They are symbols appended to the elements inside choices (or sequences), or to the list specifications themselves.
The patterns available are: +, ? and *. By default, if no pattern is applied to an element, it must be matched exactly once.
The + pattern is used to model elements that must be matched one or more times.
The * pattern is used to model elements that can be matched zero or more times.
The ? pattern is used to model elements that can be matched zero or one time.
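The meaning of each repetition pattern can be summarized as a check on the number of matches found. The following is a minimal sketch in plain C; repetition_allows is a hypothetical helper, not an Axl function, and it uses 0 to represent "no pattern applied".

```c
/* Return 1 if "count" matches satisfy the repetition pattern.
 * pattern is '+', '*', '?' or 0 when no pattern is applied. */
int repetition_allows (char pattern, unsigned int count)
{
    switch (pattern) {
    case '+':
        return count >= 1;  /* one or more times */
    case '*':
        return 1;           /* zero or more times */
    case '?':
        return count <= 1;  /* zero or one time */
    case 0:
        return count == 1;  /* exactly once (default) */
    default:
        return 0;           /* unknown pattern */
    }
}
```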
For the example initially explained, let's suppose we want the content inside testA to be a sequence repeated at least one time, that sequence being test1, test2 and test3. We only need to add a + repetition pattern as follows:
<!-- sequence specification -->
<!ELEMENT testA (test1, test2, test3)+>
So, we are telling our validation engine that the sequence inside testA can be found one or many times, but the entire sequence must be found every time.
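The "entire sequence, one or many times" rule can be illustrated with a short plain C sketch. It assumes the child node names are available as an array of strings; matches_sequence_plus is a hypothetical helper and does not describe how the Axl validation engine works internally.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if the children are the whole sequence repeated one or
 * more times, as in (test1, test2, test3)+. */
int matches_sequence_plus (const char **children, size_t n_children,
                           const char **sequence, size_t n_sequence)
{
    size_t i;

    assert (children != NULL && sequence != NULL && n_sequence > 0);

    /* at least one repetition, and no partial sequence at the end */
    if (n_children == 0 || n_children % n_sequence != 0)
        return 0;

    /* every child must match its position inside the repeated sequence */
    for (i = 0; i < n_children; i++) {
        if (strcmp (children[i], sequence[i % n_sequence]) != 0)
            return 0;
    }
    return 1;
}
```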
Here is a simple example that loads an XML document, then loads a DTD file, and then validates the XML document:
axl_bool test_12 (axlError ** error)
{
    axlDoc   * doc = NULL;
    axlDtd   * dtd = NULL;
    axl_bool   result;

    // parse gmovil file (an af-arch xml chunk)
    doc = axl_doc_parse_from_file ("channel.xml", error);
    if (doc == NULL)
        return axl_false;

    // parse af-arch DTD
    dtd = axl_dtd_parse_from_file ("channel.dtd", error);
    if (dtd == NULL) {
        // release the document before reporting the failure
        axl_doc_free (doc);
        return axl_false;
    }

    // perform DTD validation
    result = axl_dtd_validate (doc, dtd, error);

    // free doc reference
    axl_doc_free (doc);

    // free dtd reference
    axl_dtd_free (dtd);

    return result;
}
Until now, we have seen how to check the XML structure. But this does not cover XML node attributes. These are checked by using the <!ATTLIST> declaration.
In the case we have a node testA with two attributes, attr1 and attr2, the first one optional and the second one mandatory, we can declare something like:
<!-- attribute validation for node testA -->
<!ATTLIST testA
attr1 CDATA #IMPLIED
attr2 CDATA #REQUIRED>
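What the previous declaration asks for can be sketched in plain C. This is an illustration under stated assumptions: AttrRule and attrs_are_valid are hypothetical names that are not part of the Axl API, attribute values are ignored, and the check that every attribute found must be declared follows the XML 1.0 validity constraints rather than any documented Axl behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical attribute rule for this illustration (not an Axl type). */
typedef struct {
    const char *name;
    int         required;  /* 1 for #REQUIRED, 0 for #IMPLIED */
} AttrRule;

/* Return 1 if every attribute found is declared and every #REQUIRED
 * attribute is present. */
int attrs_are_valid (const char **attrs, size_t n_attrs,
                     const AttrRule *rules, size_t n_rules)
{
    size_t i, j;

    assert (rules != NULL);

    /* every attribute found on the node must be declared */
    for (i = 0; i < n_attrs; i++) {
        int declared = 0;
        for (j = 0; j < n_rules; j++) {
            if (strcmp (attrs[i], rules[j].name) == 0)
                declared = 1;
        }
        if (! declared)
            return 0;
    }

    /* every required attribute must be present on the node */
    for (j = 0; j < n_rules; j++) {
        int found = 0;
        if (! rules[j].required)
            continue;
        for (i = 0; i < n_attrs; i++) {
            if (strcmp (attrs[i], rules[j].name) == 0)
                found = 1;
        }
        if (! found)
            return 0;
    }
    return 1;
}
```

For the testA declaration, the rules would be attr1 with required set to 0 and attr2 with required set to 1, so a node carrying only attr2 passes while a node carrying only attr1 does not.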
The initial XML 1.0 design didn't consider situations where several software vendors introduce content inside the same XML document. Doing so has several benefits, but leaves one problem to solve: how to prevent XML node names (tags) from clashing with each other.
Think about using <table> as a tag for your document. Many XML applications use <table> as a valid tag for their XML language set. However, each of them gives it a different meaning and it must be handled by the proper XML software.
While developing applications with XML, and supposing such XML documents will be used by more applications than yours, you are likely to be interested in using XML Namespaces. In other words, many of the new XML standards that are appearing use XML Namespaces to define their XML node names, while allowing users and developers to use their own set of XML tags, under their own XML Namespaces, so that both can be used in the same document.
XML Namespaces support inside Axl Library is handled through a separate library, which requires the base library to function. Here are some instructions to get Axl Library Namespace installed.
This library provides functions that replace some of the functions used by XML applications that don't require XML Namespaces. In particular, some of them are:
See also API documentation for all functions that are provided to enable your application with XML Namespaces:
The default Axl Library implementation (libaxl) assumes it will receive and produce UTF-8 content.
The subset of characters required to properly parse XML content is located in the ASCII range, which is valid UTF-8 but is also valid in other encodings such as ISO 646, some parts of ISO 8859, Shift-JIS, EUC, or any other 7-bit, 8-bit, or mixed-width encoding which ensures that the characters of ASCII have their normal positions, width, and values (see section F, Autodetection of Character Encodings, at http://www.w3.org/TR/REC-xml/). Because of this, the library properly parses the content even if it is not UTF-8.
In many cases this is not important, since your application does not care about the content encoding (as with configuration files) or the documents are in UTF-8.
However, this could present problems if you are handling different documents with several encoding types. The idea is to have a unified way to handle such differently encoded documents, with a single run-time encoding: UTF-8.
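The observation about the ASCII range can be illustrated with a small plain C check. This is only a sketch: is_plain_ascii is a hypothetical helper, not an Axl function. A buffer for which it returns 1 is byte-for-byte valid UTF-8; a buffer for which it returns 0 contains bytes whose meaning depends on the document encoding.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if every byte in the buffer is in the 7-bit ASCII range
 * (0x00-0x7F), 0 otherwise. */
int is_plain_ascii (const unsigned char *buf, size_t len)
{
    size_t i;

    assert (buf != NULL);

    for (i = 0; i < len; i++) {
        if (buf[i] > 0x7F)
            return 0;
    }
    return 1;
}
```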
libaxl-babel provides support to read content in the supported encodings and translate it into UTF-8 at run-time (checking that the result is valid UTF-8):
Reading documents and handling them as if they were in UTF-8
The library works as an extension that configures a set of handlers, making the base library open XML documents and translate them into UTF-8 if required.
To activate the library, you must call axl_babel_init at the beginning of your application or library. Here is an example:
// optional axlError declaration
axlError * error;

// init axl babel
if (! axl_babel_init (&error)) {
    printf ("Failed to start axl babel: %s...\n", axl_error_get (error));
    axl_error_free (error);
    return axl_false;
}
Once done, every call to the base API (such as axl_doc_parse or axl_doc_parse_from_file) will open the document as usual. It is not required to perform any additional special operation.
It is not required to call axl_babel_finish on application exit. However, in case you want to deactivate libaxl-babel but still keep using the axl base library, you can use axl_babel_finish.
See axl_babel_init for currently supported formats.
Axl Library is implemented in a modular way to ensure you only link against those software elements that you really require. Additionally, the library allows the following to reduce its footprint to the minimum:
Remove log information:
Axl Library uses a console log mechanism to report what's happening during processing. See the Axl Log reporting module for more information. However, in production environments this console log isn't necessary, so you can safely remove it at compile time using --axl-log-disable as follows:
>> ./configure --axl-log-disable
According to our results, the library including the log to console information is about 366K. Without log to console information the library takes about 288K.
Remove debugging information from the library:
You can also remove debugging information from the library in production environments by running the following once the compilation process has finished:
>> make install-strip
According to our results, the library without log to console and debugging information takes about 100K.
The previous information applies to the Axl base library (libaxl.so/.dll); however, the same holds for the rest of the software components bundled with Axl.
You can also check API documentation for a complete detailed explanation about the library.
Please, if you find that something isn't properly documented or you think that something could be improved, contact us on the mailing list. We are building Axl Library with the aim of producing a high quality, commercial grade, open source XML development kit, so any help received will be welcome.
Remember you can always contact us on the mailing list for any question not properly answered by this documentation. See the Axl Library website documentation to get more information about the mailing list.