Reading and writing XML DOM with PHP

Author: Eve Cole  Update Time: 2009-06-01 18:21:45

There are many techniques for reading and writing XML with PHP. This article provides three methods for reading XML: using a DOM library, using a SAX parser, and using regular expressions. Writing XML using DOM and PHP text templates is also covered.

Reading and writing Extensible Markup Language (XML) in PHP can seem a little scary. In fact, XML and all its related technologies can be scary, but reading and writing XML in PHP doesn't have to be a scary task. First, you need to learn a little bit about XML—what it is and what you can do with it. Then, you need to learn how to read and write XML in PHP, and there are many ways to do this.

This article provides a brief introduction to XML and then explains how to read and write XML with PHP.

What is XML?

XML is a data storage format. It does not define what data is saved, nor does it define the format of the data. XML simply defines tags and the attributes of those tags. Well-formed XML markup looks like this:

<name>Jack Herrington</name>

This <name> tag contains some text: Jack Herrington.

XML markup without text looks like this:

<powerUp/>

There's more than one way to write something in XML. For example, this tag forms the same output as the previous tag:

<powerUp></powerUp>

You can also add attributes to XML tags. For example, this <name> tag contains first and last attributes:

<name first="Jack" last="Herrington" />

Special characters can also be encoded in XML. For example, the & symbol can be encoded like this:

&amp;

An XML file containing tags and attributes is well-formed if it follows the rules shown in these examples: every opening tag has a matching closing tag, and special characters are encoded correctly. Listing 1 is an example of well-formed XML.

Listing 1. XML book list example

 
  <books>
    <book>
      <author>Jack Herrington</author>
      <title>PHP Hacks</title>
      <publisher>O'Reilly</publisher>
    </book>
    <book>
      <author>Jack Herrington</author>
      <title>Podcasting Hacks</title>
      <publisher>O'Reilly</publisher>
    </book>
  </books>

The XML in Listing 1 contains a list of books. The parent <books> tag contains a set of <book> tags, each of which contains <author>, <title>, and <publisher> tags.
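
A quick way to test whether a document like the one in Listing 1 is well-formed is to hand it to a parser and see whether it loads. The following is a minimal sketch, assuming the DOM extension is available; the function name isWellFormed and the two sample strings are made up for this illustration.

```php
<?php
// Hypothetical helper: returns true when $xml parses as well-formed XML.
function isWellFormed(string $xml): bool
{
    // Collect parser errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $ok = $doc->loadXML($xml);

    // Restore the previous error-handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $ok;
}

$good = '<books><book><title>PHP Hacks</title></book></books>';
// The closing tags are in the wrong order, so the elements overlap.
$bad  = '<books><book><title>PHP Hacks</book></title></books>';

var_dump( isWellFormed( $good ) );
var_dump( isWellFormed( $bad ) );
```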

An XML document is valid when its markup structure and content are verified against an external schema file. Schema files can be specified in different formats. For this article, all that is needed is well-formed XML.
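
To illustrate the difference between checking well-formedness and checking against a schema, the sketch below tests two book lists against a small XML Schema. The schema, the function name isValidBookList, and the sample data are assumptions written for this example; the article itself does not define a schema. It relies on the schemaValidateSource method of PHP's DOMDocument class.

```php
<?php
// Illustrative schema (an assumption): each <book> must contain
// <author>, <title>, and <publisher>, in that order.
$xsd = <<<'XSD'
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="books">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="book" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="author" type="xs:string"/>
              <xs:element name="title" type="xs:string"/>
              <xs:element name="publisher" type="xs:string"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
XSD;

// Hypothetical helper: true when $xml is well-formed and matches $xsd.
function isValidBookList(string $xml, string $xsd): bool
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $ok = $doc->loadXML($xml) && $doc->schemaValidateSource($xsd);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $ok;
}

$complete = "<books><book><author>Jack Herrington</author>"
          . "<title>PHP Hacks</title><publisher>O'Reilly</publisher></book></books>";
// Well-formed, but the <publisher> element is missing.
$incomplete = "<books><book><author>Jack Herrington</author>"
            . "<title>PHP Hacks</title></book></books>";

var_dump( isValidBookList( $complete, $xsd ) );
var_dump( isValidBookList( $incomplete, $xsd ) );
```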

If you think XML looks a lot like Hypertext Markup Language (HTML), you're right. XML and HTML are both markup-based languages and have many similarities. However, it is important to point out that while an XML document may be well-formed HTML, not all HTML documents are well-formed XML. The line break tag (br) is a good example of the difference between XML and HTML. This line break tag is well-formed HTML, but not well-formed XML:

<p>This is a paragraph<br>
With a line break</p>

This line break tag is well-formed XML and HTML:

<p>This is a paragraph<br />
With a line break</p>

If you want to write HTML as well-formed XML, follow the W3C's Extensible Hypertext Markup Language (XHTML) standard. All modern browsers can render XHTML. Furthermore, you can use XML tools to read XHTML and find the data in the document, which is much easier than parsing HTML.
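
As a rough sketch of what reading XHTML with XML tools can look like, the example below loads a small XHTML fragment with the DOM library and collects its links. The fragment and its example.com URLs are invented for illustration.

```php
<?php
// A small XHTML fragment (made up for this example). Because it is
// well-formed XML, the DOM parser can load it directly.
$xhtml = '<div>'
       . '<p>See <a href="http://example.com/one">the first page</a><br />'
       . 'and <a href="http://example.com/two">the second page</a>.</p>'
       . '</div>';

$doc = new DOMDocument();
$doc->loadXML( $xhtml );

$links = array();
foreach( $doc->getElementsByTagName( 'a' ) as $a )
{
    // Map each link's href attribute to its link text.
    $links[ $a->getAttribute( 'href' ) ] = $a->nodeValue;
}

print_r( $links );
```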

Read XML using DOM library

The easiest way to read well-formed XML files is to use the Document Object Model (DOM) library compiled into some PHP installations. The DOM library reads the entire XML document into memory and represents it as a node tree, as shown in Figure 1.

Figure 1. XML DOM tree for book XML

The books node at the top of the tree has two book child tags. In each book, there are several nodes: author, publisher and title. The author, publisher, and title nodes each have text child nodes that contain text.

The code that reads the book XML file and displays the content using the DOM is shown in Listing 2.

Listing 2. Reading book XML using DOM

 
  <?php
  $doc = new DOMDocument();
  $doc->load( 'books.xml' );

  // Visit every <book> element in the document
  $books = $doc->getElementsByTagName( "book" );
  foreach( $books as $book )
  {
    // For each field, take the text of the first matching child element
    $authors = $book->getElementsByTagName( "author" );
    $author = $authors->item(0)->nodeValue;

    $publishers = $book->getElementsByTagName( "publisher" );
    $publisher = $publishers->item(0)->nodeValue;

    $titles = $book->getElementsByTagName( "title" );
    $title = $titles->item(0)->nodeValue;

    echo "$title - $author - $publisher\n";
  }
  ?>

The script first creates a new DOMDocument object and loads the book XML into it using the load method. The script then uses the getElementsByTagName method to get a list of all elements with the specified name.

Inside the loop over the book nodes, the script uses the getElementsByTagName method to find the author, publisher, and title tags and reads the nodeValue of each. The nodeValue is the text in the node. The script then displays these values.
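
Listing 2 assumes that every book has an author, a publisher, and a title. When an element is missing, item(0) returns null, and reading nodeValue from null triggers a PHP notice or warning. A slightly more defensive variation is sketched below; the helper name childText, the 'unknown' default, and the inline sample data are assumptions for illustration.

```php
<?php
// Hypothetical helper: text of the first <$tag> inside $parent,
// or $default when no such element exists.
function childText(DOMElement $parent, string $tag, string $default = ''): string
{
    $node = $parent->getElementsByTagName($tag)->item(0);
    return $node === null ? $default : $node->nodeValue;
}

$doc = new DOMDocument();
$doc->loadXML(
    '<books>'
    . '<book><author>Jack Herrington</author><title>PHP Hacks</title>'
    . "<publisher>O'Reilly</publisher></book>"
    . '<book><title>Untitled Draft</title></book>' // no author or publisher
    . '</books>'
);

$lines = array();
foreach( $doc->getElementsByTagName( 'book' ) as $book )
{
    $lines[] = childText( $book, 'title' ) . ' - '
             . childText( $book, 'author', 'unknown' ) . ' - '
             . childText( $book, 'publisher', 'unknown' );
}

echo implode( "\n", $lines ), "\n";
```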

You can run PHP scripts on the command line like this:

% php e1.php
PHP Hacks - Jack Herrington - O'Reilly
Podcasting Hacks - Jack Herrington - O'Reilly
%

As you can see, each book block outputs one line. This is a good start. But what if you don't have access to the XML DOM library?

Read XML with SAX parser

Another way to read XML is to use a Simple API for XML (SAX) parser. Most installations of PHP include a SAX parser. The SAX parser runs on a callback model: each time a tag is opened or closed, or each time the parser sees text, it calls a user-defined function with information about the node or text.

The advantage of the SAX parser is that it is truly lightweight. The parser does not keep content in memory for long periods of time, so it can be used for very large files. The disadvantage is that writing SAX parser callbacks is very cumbersome. Listing 3 shows code that uses SAX to read a book XML file and display the content.

Listing 3. Reading book XML with SAX parser

 
  <?php
  $g_books = array();  // all books read so far
  $g_elem = null;      // name of the tag currently being processed

  // Called when a tag opens; names arrive in upper case by default
  function startElement( $parser, $name, $attrs )
  {
    global $g_books, $g_elem;
    if ( $name == 'BOOK' ) $g_books []= array();
    $g_elem = $name;
  }

  // Called when a tag closes
  function endElement( $parser, $name )
  {
    global $g_elem;
    $g_elem = null;
  }

  // Called for the text between an opening and a closing tag
  function textData( $parser, $text )
  {
    global $g_books, $g_elem;
    if ( $g_elem == 'AUTHOR' ||
         $g_elem == 'PUBLISHER' ||
         $g_elem == 'TITLE' )
    {
      $g_books[ count( $g_books ) - 1 ][ $g_elem ] = $text;
    }
  }

  $parser = xml_parser_create();

  xml_set_element_handler( $parser, "startElement", "endElement" );
  xml_set_character_data_handler( $parser, "textData" );

  $f = fopen( 'books.xml', 'r' );

  // Feed the file to the parser in 4 KB chunks
  while( $data = fread( $f, 4096 ) )
  {
    xml_parse( $parser, $data );
  }

  fclose( $f );
  xml_parser_free( $parser );

  foreach( $g_books as $book )
  {
    echo $book['TITLE']." - ".$book['AUTHOR']." - ";
    echo $book['PUBLISHER']."\n";
  }
  ?>

The script first sets up the g_books array, which holds all the books and book information in memory, and the g_elem variable, which holds the name of the tag the script is currently processing. The script then defines the callback functions. In this example, the callback functions are startElement, endElement, and textData. The startElement and endElement functions are called when a tag is opened and closed, respectively. The textData function is called on the text between the opening and closing tags.

In this example, the startElement function looks for the book tag to start a new element in the books array. The textData function then looks at the current element to see if it is a publisher, title, or author tag. If so, the function puts the current text into the current book.

To run the parser, the script creates one with the xml_parser_create function and then sets the callback handlers. Afterwards, the script reads the file and sends chunks of it to the parser. After the file is read, the xml_parser_free function releases the parser. The end of the script outputs the contents of the g_books array.
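
One detail worth knowing: the parser may deliver the text of a single element in several pieces (for example, around an entity such as &amp;), and Listing 3 keeps only the most recent piece. The sketch below appends each piece instead. The function name parseBooks and the sample data are assumptions for illustration, and closures stand in for the named callback functions.

```php
<?php
// Hypothetical helper: parse a book list with the SAX-style parser,
// concatenating text chunks so that split character data is not lost.
function parseBooks(string $xml): array
{
    $books = array();
    $current = null; // upper-cased name of the element being read
    $fields = array('AUTHOR', 'TITLE', 'PUBLISHER');

    $parser = xml_parser_create();

    xml_set_element_handler(
        $parser,
        function ($parser, $name, $attrs) use (&$books, &$current) {
            if ($name === 'BOOK') {
                $books[] = array();
            }
            $current = $name;
        },
        function ($parser, $name) use (&$current) {
            $current = null;
        }
    );

    xml_set_character_data_handler(
        $parser,
        function ($parser, $text) use (&$books, &$current, $fields) {
            if (in_array($current, $fields, true)) {
                $last = count($books) - 1;
                if (!isset($books[$last][$current])) {
                    $books[$last][$current] = '';
                }
                $books[$last][$current] .= $text; // append, do not overwrite
            }
        }
    );

    xml_parse($parser, $xml, true); // true: this is the final chunk
    return $books;
}

$books = parseBooks(
    '<books><book><title>PHP Hacks</title>'
    . '<publisher>O&apos;Reilly &amp; Associates</publisher></book></books>'
);

echo $books[0]['TITLE'], ' - ', $books[0]['PUBLISHER'], "\n";
```

Keeping the state in closure variables rather than globals is a design choice of this sketch, not something the SAX functions require.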

As you can see, this is much more difficult than writing the same functionality in the DOM. What if there is no DOM library and no SAX library? Are there any alternatives?

Parse XML with regular expressions

I'm sure some engineers will criticize me for even mentioning this method, but it is possible to parse XML with regular expressions. Listing 4 shows an example of using the preg_ functions to read a book file.

Listing 4. Reading XML with regular expressions

 
  <?php
  // Read the whole file into one string
  $xml = "";
  $f = fopen( 'books.xml', 'r' );
  while( $data = fread( $f, 4096 ) ) { $xml .= $data; }
  fclose( $f );

  // Capture the content of every <book> block; the slash in the
  // closing tag is escaped because "/" is the pattern delimiter
  preg_match_all( "/<book>(.*?)<\/book>/s",
    $xml, $bookblocks );

  foreach( $bookblocks[1] as $block )
  {
    preg_match_all( "/<author>(.*?)<\/author>/",
      $block, $author );
    preg_match_all( "/<title>(.*?)<\/title>/",
      $block, $title );
    preg_match_all( "/<publisher>(.*?)<\/publisher>/",
      $block, $publisher );
    echo( $title[1][0]." - ".$author[1][0]." - ".
      $publisher[1][0]."\n" );
  }
  ?>

Notice how short this code is. First, it reads the file into one large string. It then uses a regex function to find each book item. Finally, a foreach loop goes through each book block and extracts the author, title, and publisher.

So, where are the flaws? The problem with using regular expression code to read XML is that it doesn't check that the XML is well-formed, so malformed input goes undetected. Also, some well-formed XML may not match the regular expressions, so the expressions must be modified later.

I never recommend using regular expressions to read XML, but sometimes it's the best way for compatibility because the regular expression functions are always available. Do not use regular expressions to read XML directly from the user because you have no control over the format or structure of such XML. You should always use a DOM library or a SAX parser to read XML from the user.
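
To make the fragility concrete, the sketch below runs the book pattern from Listing 4 against well-formed XML whose book tags carry an attribute, and compares the result with the DOM library. The sample data, including the id attributes, is invented for this example.

```php
<?php
// Well-formed XML, but each <book> tag carries an attribute.
$xml = '<books>'
     . '<book id="1"><title>PHP Hacks</title></book>'
     . '<book id="2"><title>Podcasting Hacks</title></book>'
     . '</books>';

// The pattern expects a bare <book> tag, so it matches nothing here.
$regexCount = preg_match_all( '/<book>(.*?)<\/book>/s', $xml, $blocks );

// The DOM parser is unaffected by the extra attribute.
$doc = new DOMDocument();
$doc->loadXML( $xml );
$domCount = $doc->getElementsByTagName( 'book' )->length;

echo "regex matches: $regexCount, DOM matches: $domCount\n";
```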

Writing XML using DOM

Reading XML is only part of the equation. How do you write XML? The best way to write XML is to use the DOM. Listing 5 shows how the DOM builds the book XML file.

Listing 5. Writing book XML using DOM

 
  <?php
  $books = array();
  $books [] = array(
    'title' => 'PHP Hacks',
    'author' => 'Jack Herrington',
    'publisher' => "O'Reilly"
  );
  $books [] = array(
    'title' => 'Podcasting Hacks',
    'author' => 'Jack Herrington',
    'publisher' => "O'Reilly"
  );

  $doc = new DOMDocument();
  $doc->formatOutput = true;

  $r = $doc->createElement( "books" );
  $doc->appendChild( $r );

  foreach( $books as $book )
  {
    $b = $doc->createElement( "book" );

    $author = $doc->createElement( "author" );
    $author->appendChild(
      $doc->createTextNode( $book['author'] )
    );
    $b->appendChild( $author );

    $title = $doc->createElement( "title" );
    $title->appendChild(
      $doc->createTextNode( $book['title'] )
    );
    $b->appendChild( $title );

    $publisher = $doc->createElement( "publisher" );
    $publisher->appendChild(
      $doc->createTextNode( $book['publisher'] )
    );
    $b->appendChild( $publisher );

    $r->appendChild( $b );
  }

  echo $doc->saveXML();
  ?>

At the top of the script, the books array is loaded with some sample books. This data can come from the user or from the database.

After the sample books are loaded, the script creates a new DOMDocument and adds the root books node to it. The script then creates nodes for each book's author, title, and publisher, and adds a text node to each of them. The final step for each book node is to append it to the root books node.

At the end of the script, the saveXML method outputs the XML to the console. (You can also use the save method to create an XML file.) The output of the script is shown in Listing 6.
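
For the file-based variant, a minimal sketch might look like the following. The temporary file location is an arbitrary choice for this example; the save method returns the number of bytes written, or false on failure.

```php
<?php
$doc = new DOMDocument();
$doc->formatOutput = true;

$root = $doc->createElement( 'books' );
$doc->appendChild( $root );

$book = $doc->createElement( 'book' );
$titleNode = $doc->createElement( 'title' );
$titleNode->appendChild( $doc->createTextNode( 'PHP Hacks' ) );
$book->appendChild( $titleNode );
$root->appendChild( $book );

// Write the document to a temporary file instead of the console.
$path = tempnam( sys_get_temp_dir(), 'books' );
$bytes = $doc->save( $path );

// Read the file back to confirm that it holds the same document.
$reloaded = new DOMDocument();
$reloaded->load( $path );
$title = $reloaded->getElementsByTagName( 'title' )->item(0)->nodeValue;
echo $title, "\n";

unlink( $path ); // clean up the temporary file
```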

Listing 6. Output of the DOM build script

 
  % php e4.php
  <?xml version="1.0"?>
  <books>
    <book>
      <author>Jack Herrington</author>
      <title>PHP Hacks</title>
      <publisher>O'Reilly</publisher>
    </book>
    <book>
      <author>Jack Herrington</author>
      <title>Podcasting Hacks</title>
      <publisher>O'Reilly</publisher>
    </book>
  </books>
  %

The real value of using the DOM is that the XML it creates is always well-formed. But what if you can't create XML using the DOM?
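
One reason the DOM output stays well-formed is that text nodes are escaped during serialization. The sketch below shows this with a publisher name containing characters that are special in XML; the name is made up for the example.

```php
<?php
$raw = 'Smith & Sons <Press>'; // contains &, < and >

$doc = new DOMDocument();
$publisher = $doc->createElement( 'publisher' );
$publisher->appendChild( $doc->createTextNode( $raw ) );
$doc->appendChild( $publisher );

// Serialize just this element; the special characters come out encoded.
$xml = $doc->saveXML( $publisher );
echo $xml, "\n";

// Parsing the output again yields the original text.
$check = new DOMDocument();
$check->loadXML( $xml );
$roundTrip = $check->documentElement->nodeValue;
```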

Writing XML in PHP

If the DOM is not available, XML can be written using PHP's text templates. Listing 7 shows how PHP builds the book XML file.

Listing 7. Writing book XML in PHP

 
  <?php
  $books = array();
  $books [] = array(
    'title' => 'PHP Hacks',
    'author' => 'Jack Herrington',
    'publisher' => "O'Reilly"
  );
  $books [] = array(
    'title' => 'Podcasting Hacks',
    'author' => 'Jack Herrington',
    'publisher' => "O'Reilly"
  );
  ?>
  <books>
  <?php

  foreach( $books as $book )
  {
  ?>
  <book>
  <title><?php echo( $book['title'] ); ?></title>
  <author><?php echo( $book['author'] ); ?>
  </author>
  <publisher><?php echo( $book['publisher'] ); ?>
  </publisher>
  </book>
  <?php
  }
  ?>
  </books>

The top part of the script is the same as in the DOM script. The bottom of the script opens the books tag and then iterates through each book, creating the book tag and all the internal title, author, and publisher tags.

The problem with this approach is encoding the entities. To ensure that entities are encoded correctly, the htmlentities function must be called on each item, as shown in Listing 8.

Listing 8. Encoding entities using the htmlentities function

  
  <books>
  <?php

  foreach( $books as $book )
  {
    $title = htmlentities( $book['title'], ENT_QUOTES );
    $author = htmlentities( $book['author'], ENT_QUOTES );
    $publisher = htmlentities( $book['publisher'], ENT_QUOTES );
  ?>
  <book>
  <title><?php echo( $title ); ?></title>
  <author><?php echo( $author ); ?> </author>
  <publisher><?php echo( $publisher ); ?>
  </publisher>
  </book>
  <?php
  }
  ?>
  </books>

This is where writing XML in basic PHP becomes annoying. You think you've created perfect XML, but as soon as you try to use the data, you discover that some elements are encoded incorrectly.
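
One caveat, offered as a sketch rather than a replacement for the listings: htmlentities can emit HTML named entities such as &eacute;, which XML does not predefine. An alternative is htmlspecialchars with the ENT_XML1 flag (PHP 5.4 and later), which escapes only the characters that are special in XML. The helper name xmlEscape and the sample title are assumptions for illustration, and the text is assumed to be UTF-8.

```php
<?php
// Hypothetical helper: escape text for use inside an XML element or attribute.
function xmlEscape(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES | ENT_XML1, 'UTF-8');
}

// "\u{00E9}" is an accented e (PHP 7 and later syntax).
$title = "Caf\u{00E9} Tips & <Tricks>";

$viaEntities = '<title>' . htmlentities( $title, ENT_QUOTES, 'UTF-8' ) . '</title>';
$viaSpecial  = '<title>' . xmlEscape( $title ) . '</title>';

libxml_use_internal_errors( true ); // keep parser errors quiet

// The named entity produced by htmlentities is not defined in plain XML,
// so the parser is expected to reject this version.
$doc = new DOMDocument();
$entitiesOk = $doc->loadXML( $viaEntities );

// The htmlspecialchars version should load and give back the original text.
$doc = new DOMDocument();
$specialOk = $doc->loadXML( $viaSpecial );
$roundTrip = $specialOk ? $doc->documentElement->nodeValue : null;

var_dump( $entitiesOk, $specialOk, $roundTrip === $title );
```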

Conclusion

There's always a lot of exaggeration and confusion surrounding XML. However, it's not as difficult as you think - especially in a language as great as PHP. Once you understand and implement XML correctly, you'll find many powerful tools at your disposal. XPath and XSLT are two such tools worth studying.