SimpleXML processing in PHP

Author: Eve Cole  Update Time: 2009-06-01 18:19:42

Learn about the SimpleXML extension bundled with PHP version 5, which enables PHP pages to query, search, modify, and republish XML in PHP-friendly syntax.

PHP version 5 introduces SimpleXML, a new application programming interface (API) for reading and writing XML. In SimpleXML, the following expression:

$doc->channel->item->title

selects elements from the document. This expression is easy to write as long as you are familiar with the structure of the document. However, if it's not clear where the required elements appear (as in DocBook, HTML, and similar narrative documents), SimpleXML can use XPath expressions to find these elements.
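As a rough illustration of the XPath approach, the sketch below searches a small narrative-style document for every title element, wherever it is nested. The inline document and its element names are invented for this example; xpath() is the SimpleXMLElement method that evaluates the expression.

```php
<?php
// Hypothetical inline document; the element names are made up for illustration.
$xml = <<<XML
<article>
  <title>SimpleXML notes</title>
  <section>
    <title>Parsing</title>
    <para>Some text.</para>
  </section>
  <section>
    <section>
      <title>XPath</title>
    </section>
  </section>
</article>
XML;

$doc = simplexml_load_string($xml);

// xpath() returns an array of SimpleXMLElement objects for the matches.
$titles = $doc->xpath('//title');

$found = [];
foreach ($titles as $title) {
    // Casting to string yields the element's text content.
    $found[] = (string) $title;
}

echo implode(', ', $found), "\n";
```

The expression //title matches title elements at any depth, so the code does not need to know how deeply the sections are nested.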

Getting Started with SimpleXML

Suppose you need a PHP page that converts an RSS feed into HTML. RSS is a simple XML format for publishing syndicated content. The root element of the document is rss, which contains a channel element. The channel element contains metadata about the feed, such as its title, language, and URL. It also contains the individual stories, each wrapped in an item element. Each item has a link element, which contains a URL, and a title or a description (usually both), which contain plain text. No namespaces are used. There's more to RSS than that, but this is enough for this article. Listing 1 shows a typical example, which contains two news items.

Listing 1. RSS feed

<?xml version="1.0" encoding="UTF-8"?>
<rss version="0.92">
<channel>
<title>Mokka mit Schlag</title>
<link>http://www.elharo.com/blog</link>
<language>en</language>
<item>
<title>Penn Station: Gone but not Forgotten</title>
<description>
The old Penn Station in New York was torn down before I was born.
Looking at these pictures, that feels like a mistake. The current site is
functional, but no more; really just some office towers and underground
corridors of no particular interest or beauty. The new Madison Square...
</description>
<link>http://www.elharo.com/blog/new-york/2006/07/31/penn-station</link>
</item>
<item>
<title>Personal for Elliotte Harold</title>
<description>Some people use very obnoxious spam filters that require you
to type some random string in your subject such as E37T to get through.
Needless to say neither I nor most other people bother to communicate with
these paranoids. They are grossly overreacting to the spam problem.
Personally I won't...</description>
</item>
</channel>
</rss>

Let's develop a PHP page to format an RSS feed into HTML. Listing 2 shows the basic structure of this page.

Listing 2. Static structure of PHP code

<?php // Load and parse the XML document ?>
<html xml:lang="en" lang="en">
<head>
<title><?php // The title will be read from the RSS ?></title>
</head>
<body>

<?php
// Here we'll put a loop to include each item's title and description
?>

</body>
</html>

Parsing an XML document

The first step is to parse the XML document and save it into a variable. All it takes is one line of code, passing a URL to the simplexml_load_file() function:

$rss = simplexml_load_file('http://partners.userland.com/nytRss/nytHomepage.xml');

For this example, I've populated the page from Userland's New York Times feed (at http://partners.userland.com/nytRss/nytHomepage.xml). Any other RSS feed URL can be used instead.
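A load can fail, for example when the feed is unreachable or not well-formed, and in that case simplexml_load_file() returns false. The following is a minimal sketch of a defensive loader. The helper name load_feed() and the temporary stand-in feed are assumptions made for this example so that it runs without network access.

```php
<?php
// Hypothetical helper: returns the root element, or null if loading fails.
function load_feed($source)
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $rss = simplexml_load_file($source);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Compare strictly with false: an empty element also evaluates as falsy.
    return $rss === false ? null : $rss;
}

// Write a tiny stand-in feed to a temporary file so the sketch runs offline.
$path = tempnam(sys_get_temp_dir(), 'rss');
file_put_contents(
    $path,
    '<rss version="0.92"><channel><title>Demo</title></channel></rss>'
);

$ok = load_feed($path);
unlink($path);
$missing = load_feed($path); // The file no longer exists, so this load fails.

echo $ok === null ? "load failed\n" : "loaded\n";
echo $missing === null ? "load failed\n" : "loaded\n";
```

In a real page, the same helper could be given a feed URL in place of the temporary path.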

Note that despite its name, simplexml_load_file() is not limited to local files: here it parses an XML document at a remote HTTP URL. That's not the only surprising thing about this function. The return value (stored here in the $rss variable) does not represent the entire document, as you might expect if you have used other APIs such as the Document Object Model (DOM). Instead, it represents the root element of the document. The contents of the document prolog and epilog are not accessible from SimpleXML.
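The point about the return value can be checked with a short sketch. It parses a cut-down, inline version of the feed (an assumption for this example, so that no network access is needed) and asks the returned object for its own element name with getName().

```php
<?php
// Cut-down inline feed with a comment before the root element.
$xml = <<<XML
<!-- a comment in the prolog -->
<rss version="0.92">
  <channel>
    <title>Mokka mit Schlag</title>
  </channel>
</rss>
XML;

$rss = simplexml_load_string($xml);

$rootName = $rss->getName();           // name of the element $rss represents
$version  = (string) $rss['version'];  // attributes are read with array syntax
$title    = (string) $rss->channel->title;

echo "$rootName $version: $title\n";
```

Because $rss already is the rss element, its children are reached directly as $rss->channel, not as $rss->rss->channel.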

Find the feed title

The title of the entire feed (not the titles of the individual stories in the feed) is located in the title child of the channel child of the rss root element. You can find it as though the XML document were a serialized form of an object of class rss that has a channel field, which in turn has a title field. Using regular PHP object reference syntax, the statement to find the title is as follows:

$title = $rss->channel->title;

Once found, it can be added to the output HTML. To do this, just echo the $title variable:

<title><?php echo $title; ?></title>

This line outputs the string value of the element rather than the entire element. That is to say, the text content is written but the tags are not included.

You can even skip the intermediate variable $title completely:

<title><?php echo $rss->channel->title; ?></title>

Because the page reuses this value in multiple places, I find it more convenient to store it in a variable with a clear meaning.
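One detail worth keeping in mind when storing the value: $rss->channel->title is still a SimpleXMLElement, and it only becomes plain text when it is echoed or cast to a string. The sketch below contrasts the two and escapes the text before writing it into HTML. The inline feed and the choice of htmlspecialchars() for escaping are assumptions of this example, not something the feed requires.

```php
<?php
$rss = simplexml_load_string(
    '<rss version="0.92"><channel><title>News &amp; Notes</title></channel></rss>'
);

$element = $rss->channel->title;   // still a SimpleXMLElement
$text    = (string) $element;      // text content only: News & Notes
$markup  = $element->asXML();      // serialized element, tags included

// Escape the text before placing it in HTML output.
$safe = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
echo "<title>$safe</title>\n";
```

Casting once and reusing the resulting string avoids surprises in comparisons and array keys, where an object behaves differently from a string.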

…