First of all, I have to admit that I love computer standards. If everyone followed industry standards, the Internet would be a better medium. Standardized data exchange formats make open, platform-independent computing models feasible. That is why I am an XML enthusiast.
Fortunately, my favorite scripting language not only supports XML, its support keeps getting stronger. PHP lets me quickly publish XML documents to the Internet, collect statistical information about XML documents, and convert XML documents into other formats. For example, I often use PHP's XML processing capabilities to manage the articles and books I write in XML.
In this article, I will discuss how to use PHP's built-in Expat parser to process XML documents. Through an example, I will demonstrate how Expat works. Along the way, the example shows you how to:
Create your own handler functions
Convert XML documents into your own PHP data structures
Introducing Expat
An XML parser, also called an XML processor, allows programs to access the structure and content of XML documents. Expat is the XML parser used by the PHP scripting language. It is also used in other projects, such as Mozilla, Apache and Perl.
What is an event-based parser?
There are two basic types of XML parsers:
Tree-based parsers: convert XML documents into tree structures. This type of parser parses the entire document and provides an API to access each element of the resulting tree. The common standard for it is DOM (Document Object Model).
Event-based parsers: treat XML documents as a series of events. When a specific event occurs, the parser calls the function provided by the developer to handle it.
An event-based parser has a data-focused view of the XML document, which means that it concentrates on the data portion of the document rather than on its structure. Such a parser processes the document from beginning to end and reports events (start of element, end of element, start of character data, and so on) to the application through callback functions. The following is an example XML document for "Hello-World":
<greeting>
Hello World
</greeting>
An event-based parser reports this document as three events:
Start element: greeting
Start of character data, with the value: Hello World
End element: greeting
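The three events above can be made visible with a small sketch. The helper name collect_events() is hypothetical, and the sketch makes two assumptions that are not part of the example document: case folding is switched off so the element name is reported as written, and character data is buffered until the element closes, because a parser may deliver it in several pieces (including the surrounding line breaks).

```php
<?php
// Sketch: record the events reported for the Hello-World document.
// collect_events() is a hypothetical helper, not a PHP function.
function collect_events($xml)
{
    $events = array();
    $buffer = '';

    $parser = xml_parser_create();
    // Assumption: report names as written (they are upper-cased by default).
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    xml_set_element_handler(
        $parser,
        function ($parser, $name, $attrs) use (&$events) {
            $events[] = "start element: $name";
        },
        function ($parser, $name) use (&$events, &$buffer) {
            // Report the buffered character data once, without the
            // surrounding whitespace, then report the end of the element.
            $text = trim($buffer);
            if ($text !== '') {
                $events[] = "character data: $text";
            }
            $buffer = '';
            $events[] = "end element: $name";
        }
    );
    xml_set_character_data_handler(
        $parser,
        function ($parser, $data) use (&$buffer) {
            $buffer .= $data;
        }
    );

    xml_parse($parser, $xml, true);
    return $events;
}

$events = collect_events("<greeting>\nHello World\n</greeting>");
print implode("\n", $events) . "\n";
```

Buffering keeps the output stable for this flat document; for nested elements, a real script would handle each piece of character data as it arrives.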
Unlike tree-based parsers, event-based parsers do not produce a structure that describes the document. When the character data event is reported, the parser does not tell you that the parent element is greeting.
However, an event-based parser provides lower-level access, which allows for better use of resources and faster access. There is no need to fit the entire document into memory; in fact, the document can even be larger than the available memory.
Expat is such an event-based parser. Of course, when you use Expat you can still build a complete native tree structure in PHP if necessary.
The Hello-World example above is well-formed XML. However, it is not valid, because it has neither an associated DTD (Document Type Definition) nor an embedded one.
For Expat, this makes no difference: Expat is a non-validating parser and therefore ignores any DTD associated with the document. Note, however, that the document must still be well-formed; otherwise Expat (like any other XML-compliant parser) stops with an error message.
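The difference between well-formed and valid can be checked with a short sketch. is_well_formed() is a hypothetical helper name, and the three test documents are made up for illustration; the only parser calls used are xml_parser_create() and xml_parse().

```php
<?php
// Sketch: the parser checks well-formedness, not validity.
// is_well_formed() is a hypothetical helper, not a PHP function.
function is_well_formed($xml)
{
    $parser = xml_parser_create();
    // xml_parse() returns 1 on success and 0 on failure.
    return xml_parse($parser, $xml, true) === 1;
}

// Well-formed, no DTD at all: accepted.
$no_dtd = is_well_formed("<greeting>Hello World</greeting>");

// Well-formed, but the content contradicts the embedded DTD:
// still accepted, because the DTD is not used for validation.
$breaks_dtd = is_well_formed(
    '<!DOCTYPE greeting [<!ELEMENT greeting (#PCDATA)>]>' .
    '<greeting><extra/></greeting>'
);

// Not well-formed (the end tag is missing): rejected.
$broken = is_well_formed("<greeting>Hello World");

var_dump($no_dtd, $breaks_dtd, $broken);
```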
As a non-validating parser, Expat is fast and lightweight, which makes it well suited to Internet applications.
Compiling Expat
Expat can be compiled into PHP 3.0.6 or later. Starting with Apache 1.3.9, Expat is included as part of Apache. On Unix systems, you compile it into PHP by configuring PHP with the --with-xml option.
If you compile PHP as an Apache module, Expat is included by default as part of Apache. On Windows, you must load the XML dynamic link library.
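Whichever way PHP was built, a script can check at run time whether the XML parser functions are present before relying on them. The following is a minimal sketch; the messages are made up for illustration.

```php
<?php
// Sketch: check whether the XML parser functions are available.
$has_xml = extension_loaded('xml') && function_exists('xml_parser_create');

print $has_xml
    ? "XML parser functions are available\n"
    : "XML support is missing; rebuild PHP or load the XML library\n";
```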
XML Example: XMLstats
One way to learn about Expat's functions is through examples. The example we are going to discuss is using Expat to collect statistics on XML documents.
For each element in the document, the following information will be output:
The number of times the element is used in the document
The amount of character data in the element
The element's parent elements
The element's child elements
Note: For the sake of demonstration, we use PHP to generate a structure to save the parent element and child element of the element.
Preparation
The function for generating an XML parser instance is xml_parser_create(). This instance is passed to all subsequent functions. The idea is very similar to the connection handle of the MySQL functions in PHP. Before parsing a document, event-based parsers usually require you to register callback functions, which are called when a specific event occurs. Expat is no exception. It defines the following seven possible events:
Object | XML parser function | Description
Element | xml_set_element_handler() | Start and end of an element
Character data | xml_set_character_data_handler() | Start of character data
External entity | xml_set_external_entity_ref_handler() | Occurrence of an external entity
Unparsed external entity | xml_set_unparsed_entity_decl_handler() | Occurrence of an unparsed external entity
Processing instruction | xml_set_processing_instruction_handler() | Occurrence of a processing instruction
Notation declaration | xml_set_notation_decl_handler() | Occurrence of a notation declaration
Default | xml_set_default_handler() | Other events without a specified handler function
All callback functions must take the parser instance as their first parameter (followed by further parameters that depend on the event).
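As an illustration of a handler other than the element and character data handlers, the following sketch registers a processing instruction handler. The document and the "render" target are made up for this example; the handler receives the parser first, then the target and the data of the instruction.

```php
<?php
// Sketch: collect processing instructions found in a document.
$seen = array();

$parser = xml_parser_create();
xml_set_processing_instruction_handler(
    $parser,
    // First parameter: the parser; then the PI target and its data.
    function ($parser, $target, $data) use (&$seen) {
        $seen[] = array($target, $data);
    }
);

// Hypothetical document containing one processing instruction.
xml_parse($parser, '<doc><?render mode="fast"?></doc>', true);

print_r($seen);
```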
Note that the sample script at the end of this article uses both element handler functions and a character data handler function. The element callback functions are registered through xml_set_element_handler().
This function takes three parameters:
an instance of the parser
The name of the callback function that handles the start element
The name of the callback function that handles the closing element
The callback functions must exist when parsing of the XML document begins. They must be defined consistently with the prototypes described in the PHP manual.
For example, Expat passes three arguments to the handler function for the start element. In the script example, it is defined as follows:
function start_element($parser, $name, $attrs)
The first parameter is the parser identifier, the second parameter is the name of the start element, and the third parameter is an array containing all attributes of the element and their values.
Once you start parsing the XML document, Expat calls your start_element() function and passes these parameters whenever it encounters a start element.
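To see what the three parameters contain, here is a small sketch that records the name and attribute array of every start element. The sample document and the variable names are assumptions for illustration, and case folding is switched off so that names arrive as written.

```php
<?php
// Sketch: record what the start element handler receives.
$calls = array();

$start_element = function ($parser, $name, $attrs) use (&$calls) {
    // $attrs maps attribute names to attribute values.
    $calls[] = array($name, $attrs);
};
$end_element = function ($parser, $name) {
    // Nothing to do for this demonstration.
};

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_set_element_handler($parser, $start_element, $end_element);

// Hypothetical sample document.
xml_parse($parser, '<book id="42" lang="en"><title/></book>', true);

print_r($calls);
```

An element without attributes, such as title here, is reported with an empty array.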
XML's case folding option
The script uses the xml_parser_set_option() function to turn off the case folding option. This option is on by default and causes element names passed to handler functions to be converted to uppercase automatically. But XML is case-sensitive, so case matters when collecting statistics on XML documents. For our example, case folding must therefore be turned off.
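The effect of the option is easy to observe. The sketch below parses the same made-up document twice, once with case folding on and once with it off; element_names() is a hypothetical helper.

```php
<?php
// Sketch: compare element names with case folding on and off.
// element_names() is a hypothetical helper, not a PHP function.
function element_names($xml, $case_folding)
{
    $names = array();
    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, $case_folding);
    xml_set_element_handler(
        $parser,
        function ($parser, $name, $attrs) use (&$names) {
            $names[] = $name;
        },
        function ($parser, $name) {
        }
    );
    xml_parse($parser, $xml, true);
    return $names;
}

$xml = '<Article><title/><Title/></Article>';
$folded = element_names($xml, 1); // default behaviour
$exact  = element_names($xml, 0); // as used by the statistics script

print implode(", ", $folded) . "\n";
print implode(", ", $exact) . "\n";
```

With folding on, title and Title become indistinguishable, which would distort the statistics.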
Parsing the document
After completing all the preparations, the script can finally parse the XML document:
xml_parse_from_file(), a custom function, opens the file specified in its parameter and parses it in 4 KB blocks.
xml_parse(), like xml_parse_from_file(), signals failure through its return value when an error occurs, that is, when the XML document is not well-formed.
You can use the xml_get_error_code() function to get the numeric code of the last error. Pass this numeric code to the xml_error_string() function to get the error message text.
xml_get_current_line_number() returns the current line number of the XML document, which makes debugging easier.
During parsing, the callback functions are called.
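The same pattern can be tried without a file. The following sketch feeds a string to the parser in small pieces and builds the error message from the functions just described. parse_in_chunks() is a hypothetical helper, and the chunk size of 8 bytes is chosen only to force several calls to xml_parse().

```php
<?php
// Sketch: parse a document piece by piece and report errors.
// parse_in_chunks() is a hypothetical helper, not a PHP function.
function parse_in_chunks($xml, $chunk_size)
{
    $parser = xml_parser_create();
    $chunks = str_split($xml, $chunk_size);
    $last = count($chunks) - 1;

    foreach ($chunks as $i => $chunk) {
        // The third argument tells the parser whether this is the final piece.
        if (!xml_parse($parser, $chunk, $i === $last)) {
            return sprintf(
                "XML error: %s at line %d",
                xml_error_string(xml_get_error_code($parser)),
                xml_get_current_line_number($parser)
            );
        }
    }
    return "ok";
}

$good = parse_in_chunks("<a>\n<b>text</b>\n</a>", 8);
$bad  = parse_in_chunks("<a>\n<b>text\n</a>", 8); // <b> is never closed

print $good . "\n";
print $bad . "\n";
```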
Describing the document structure
When parsing a document, the question that needs to be addressed with Expat is: how to maintain a basic description of the document structure?
As mentioned earlier, the event-based parser itself does not produce any structural information.
However, the tag structure is an important feature of XML. For example, the element sequence <book><title> means something different from <figure><title>. Any author will tell you that book titles and figure titles have nothing to do with each other, even though both use the term "title". Therefore, in order to process XML effectively with an event-based parser, you must use your own stacks or lists to maintain structural information about the document.
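A minimal sketch of this idea, assuming a made-up document and the hypothetical helper count_titles_by_parent(): a stack of open elements lets the start element handler tell a title inside book apart from a title inside figure.

```php
<?php
// Sketch: use a stack of open elements to see each title's parent.
// count_titles_by_parent() is a hypothetical helper, not a PHP function.
function count_titles_by_parent($xml)
{
    $stack = array();
    $counts = array();

    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
    xml_set_element_handler(
        $parser,
        function ($parser, $name, $attrs) use (&$stack, &$counts) {
            if ($name === 'title' && count($stack) > 0) {
                // The top of the stack is the parent of the current element.
                $parent = $stack[count($stack) - 1];
                if (!isset($counts[$parent])) {
                    $counts[$parent] = 0;
                }
                $counts[$parent]++;
            }
            array_push($stack, $name);
        },
        function ($parser, $name) use (&$stack) {
            array_pop($stack);
        }
    );
    xml_parse($parser, $xml, true);
    return $counts;
}

$counts = count_titles_by_parent(
    '<book><title/><figure><title/></figure><figure><title/></figure></book>'
);
print_r($counts);
```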
In order to mirror the document structure, the script needs to know at least the parent element of the current element. Expat's API does not provide this: it only reports events for the current element, without any contextual information. Therefore, you need to build your own stack structure.
The script example uses a last-in, first-out (LIFO) stack, implemented as an array that holds all currently open elements. The start element handler pushes the current element onto the top of the stack with the array_push() function. Correspondingly, the end element handler removes the top element with array_pop().
For the sequence <book><title></title></book>, the stack is populated as follows:
Start element book: assign "book" to the first element of the stack ($stack[0]).
Start element title: assign "title" to the top of the stack ($stack[1]).
End element title: remove the top element from the stack ($stack[1]).
End element book: remove the top element from the stack ($stack[0]).
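The four steps can be traced with a short sketch that records the contents of the stack after every event. The $snapshots variable and the " > " separator are assumptions made for display purposes.

```php
<?php
// Sketch: trace the stack while parsing <book><title></title></book>.
$stack = array();
$snapshots = array();

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_set_element_handler(
    $parser,
    function ($parser, $name, $attrs) use (&$stack, &$snapshots) {
        array_push($stack, $name);            // element opened
        $snapshots[] = implode(' > ', $stack);
    },
    function ($parser, $name) use (&$stack, &$snapshots) {
        array_pop($stack);                    // element closed
        $snapshots[] = implode(' > ', $stack);
    }
);
xml_parse($parser, '<book><title></title></book>', true);

print_r($snapshots);
```

The last snapshot is an empty string, because the stack is empty again once the root element has been closed.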
The PHP 3.0 version of the example controls the nesting of elements manually through a $depth variable, which makes the script look more complex. The PHP 4.0 version uses the array_pop() and array_push() functions, which makes the script more concise.
Collecting data
In order to collect information about each element, the script needs to remember the events for each element. It saves all the different elements of the document in a global array variable, $elements. The items of the array are instances of the element class, which has four properties (class variables):
$count - the number of times the element was found in the document
$chars - the number of bytes of character data in the element
$parents - parent element
$childs - child elements
As you can see, saving class instances in an array is a piece of cake.
Note: A feature of PHP is that you can traverse the entire class structure through a while(list() = each()) loop, just like you traverse the entire corresponding array. All class variables (and method names when you use PHP3.0) are output as strings.
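In current PHP versions the same traversal is written with foreach, since each() was removed in PHP 8. The sketch below is an assumption about how one might do it today, not part of the original script; element_demo is a hypothetical class that mirrors the four properties listed above.

```php
<?php
// Sketch: traverse the properties of an object like an array.
// element_demo is a hypothetical class mirroring the element class.
class element_demo
{
    public $count = 0;
    public $chars = 0;
    public $parents = array();
    public $childs = array();
}

$names = array();
foreach (new element_demo as $property => $value) {
    // Only properties visible from this scope are visited.
    $names[] = $property;
}

print implode(", ", $names) . "\n";
```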
When an element is found, we need to increment its counter to keep track of how many times it appears in the document: the count property of the corresponding $elements item is incremented by one.
We also need to let the parent element know that the current element is its child element. Therefore, the name of the current element will be added to the item in the $childs array of the parent element. Finally, the current element should remember who its parent is. Therefore, the parent element is added to the item in the $parents array of the current element.
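The bookkeeping described here can be tried without a parser by calling a function directly for each start element. This is a sketch under assumptions: element_stats and record_start() are hypothetical names, and the sequence of calls stands in for the events of a document with one book containing two title elements.

```php
<?php
// Sketch: the counting and parent/child bookkeeping in isolation.
// element_stats and record_start() are hypothetical names.
class element_stats
{
    public $count = 0;
    public $chars = 0;
    public $parents = array();
    public $childs = array();
}

function record_start(array &$elements, array &$stack, $name)
{
    if (!isset($elements[$name])) {
        $elements[$name] = new element_stats;
    }
    $elements[$name]->count++;

    if (count($stack) > 0) {
        $parent = $stack[count($stack) - 1];
        // The current element remembers its parent ...
        if (!isset($elements[$name]->parents[$parent])) {
            $elements[$name]->parents[$parent] = 0;
        }
        $elements[$name]->parents[$parent]++;
        // ... and the parent remembers its child.
        if (!isset($elements[$parent]->childs[$name])) {
            $elements[$parent]->childs[$name] = 0;
        }
        $elements[$parent]->childs[$name]++;
    }
    array_push($stack, $name);
}

$elements = array();
$stack = array();

// Events for <book><title/><title/></book>
record_start($elements, $stack, 'book');
record_start($elements, $stack, 'title');
array_pop($stack);
record_start($elements, $stack, 'title');
array_pop($stack);
array_pop($stack);

print_r($elements);
```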
Displaying statistics
The remaining code loops through the $elements array and its subarrays to display the statistics. This is the simplest kind of nested loop. Although it outputs the correct results, the code is neither elegant nor does it use any special tricks. It is just the kind of loop you may use every day to get your work done.
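Before the loop runs, the script sorts the elements by how often they occur. The following sketch shows that step on simplified data: plain counts instead of element objects, with made-up element names. uasort() keeps the element names as keys while reordering.

```php
<?php
// Sketch: sort by descending count, then print one line per element.
// Simplification: the values are plain counts, not element objects.
$counts = array('book' => 1, 'title' => 3, 'figure' => 2);

uasort($counts, function ($a, $b) {
    return $b - $a; // larger counts first
});

foreach ($counts as $name => $count) {
    printf("%20s:%15s\n", $name, $count);
}

$order = array_keys($counts);
```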
The script example is designed to be invoked from the command line through the CGI version of PHP. Therefore, the statistics are output as plain text. If you want to use the script on the Web, you need to modify the output functions to generate HTML.
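One possible shape for such an HTML output function is sketched below. html_line() is a hypothetical replacement for the script's print_line(), and the table markup is only an assumption about the desired layout. Element names come from the XML document, so they are escaped before being written into the page.

```php
<?php
// Sketch: an HTML variant of print_line().
// html_line() is a hypothetical name; the table layout is an assumption.
function html_line($title, $value)
{
    return sprintf(
        "<tr><td>%s:</td><td>%s</td></tr>\n",
        htmlspecialchars($title),
        htmlspecialchars((string) $value)
    );
}

print "<table>\n";
print html_line("Element count", 3);
print html_line("a<b", 1);
print "</table>\n";
```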
Summary
Expat is an XML parser for PHP. As an event-based parser, it does not produce a structural description of the document. But by providing low-level access, it allows for better use of resources and faster access.
As a parser that does not check for validity, Expat ignores DTDs attached to XML documents, but it will stop with an error message if the document is not well-formed.
When working with Expat, you need to:
Provide event handlers to process documents
Build your own structures, such as stacks and trees, to take advantage of XML's structured markup
New XML programs appear every day, and PHP's support for XML is constantly being strengthened (for example, support for the DOM-based XML parser LibXML was added).
With PHP and Expat, you can prepare for the coming standards that are valid, open, and platform-independent.
Example
<?php
/*****************************************************************************
 * Name: XML parsing example: XML document statistics
 * Description:
 *   This example uses PHP's Expat parser to collect statistics about an XML
 *   document (for example: the number of occurrences of each element, its
 *   parent elements and its child elements).
 *   The XML file is passed as a parameter: ./xmlstats_PHP4.php3 test.xml
 * Requires: Expat, PHP 4.0 compiled in CGI mode
 *****************************************************************************/

// The first parameter is the XML file
$file = $argv[1];

// Initialization of variables
$elements = $stack = array();
$total_elements = $total_chars = 0;

// Basic class of elements
class element
{
    var $count = 0;
    var $chars = 0;
    var $parents = array();
    var $childs = array();
}

// Function to parse XML files
function xml_parse_from_file($parser, $file)
{
    if(!file_exists($file))
    {
        die("Can't find file \"$file\".");
    }
    if(!($fp = @fopen($file, "r")))
    {
        die("Can't open file \"$file\".");
    }
    // Read and parse the file in 4 KB blocks
    while($data = fread($fp, 4096))
    {
        if(!xml_parse($parser, $data, feof($fp)))
        {
            return(false);
        }
    }
    fclose($fp);
    return(true);
}

// Output result function (box form)
function print_box($title, $value)
{
    printf("\n+%'-60s+\n", "");
    printf("|%20s", "$title:");
    printf("%14s", $value);
    printf("%26s|\n", "");
    printf("+%'-60s+\n", "");
}

// Output result function (line form)
function print_line($title, $value)
{
    printf("%20s", "$title:");
    printf("%15s\n", $value);
}

// Sorting function (descending by element count)
function my_sort($a, $b)
{
    return(is_object($a) && is_object($b) ? $b->count - $a->count : 0);
}

// Handler for start elements
function start_element($parser, $name, $attrs)
{
    global $elements, $stack;
    // Is the element already in the global $elements array?
    if(!isset($elements[$name]))
    {
        // No - add a class instance for the element
        $element = new element;
        $elements[$name] = $element;
    }
    // Increment the counter of this element by one
    $elements[$name]->count++;
    // Is there a parent element?
    if(isset($stack[count($stack)-1]))
    {
        // Yes - assign the parent element to $last_element
        $last_element = $stack[count($stack)-1];
        // If the current element has no counter for this parent yet, initialize it to 0
        if(!isset($elements[$name]->parents[$last_element]))
        {
            $elements[$name]->parents[$last_element] = 0;
        }
        // Increment the parent element counter of this element by one
        $elements[$name]->parents[$last_element]++;
        // If the parent has no counter for this child yet, initialize it to 0
        if(!isset($elements[$last_element]->childs[$name]))
        {
            $elements[$last_element]->childs[$name] = 0;
        }
        // Increment the child element counter of the parent element by one
        $elements[$last_element]->childs[$name]++;
    }
    // Add the current element to the stack
    array_push($stack, $name);
}

// Handler for end elements
function stop_element($parser, $name)
{
    global $stack;
    // Remove the top element from the stack
    array_pop($stack);
}

// Handler for character data
function char_data($parser, $data)
{
    global $elements, $stack;
    // Increase the number of characters of the current element
    $elements[$stack[count($stack)-1]]->chars += strlen(trim($data));
}

// Generate parser instance
$parser = xml_parser_create();

// Set handler functions and turn off case folding
xml_set_element_handler($parser, "start_element", "stop_element");
xml_set_character_data_handler($parser, "char_data");
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

// Parse the file
$ret = xml_parse_from_file($parser, $file);
if(!$ret)
{
    die(sprintf("XML error: %s at line %d",
        xml_error_string(xml_get_error_code($parser)),
        xml_get_current_line_number($parser)));
}

// Release the parser
xml_parser_free($parser);

// Free the helper elements
unset($elements["current_element"]);
unset($elements["last_element"]);

// Sort according to the number of elements
uasort($elements, "my_sort");

// Loop through $elements and output the information for each element
foreach($elements as $name => $element)
{
    print_box("Element name", $name);
    print_line("Element count", $element->count);
    print_line("Character count", $element->chars);
    printf("\n%20s\n", "* Parent elements");
    // Loop through the parents of this element and output the result
    foreach($element->parents as $key => $value)
    {
        print_line($key, $value);
    }
    if(count($element->parents) == 0)
    {
        printf("%35s\n", "[root element]");
    }
    // Loop through the children of this element and output the result
    printf("\n%20s\n", "* Child elements");
    foreach($element->childs as $key => $value)
    {
        print_line($key, $value);
    }
    if(count($element->childs) == 0)
    {
        printf("%35s\n", "[no children]");
    }
    $total_elements += $element->count;
    $total_chars += $element->chars;
}

// Final result
print_box("Total elements", $total_elements);
print_box("Total characters", $total_chars);
?>