Frequently asked questions about getting started with XML (4)

Author: Eve Cole  Update Time: 2009-07-07 16:08:15

How are whitespace characters handled in the XML object model?

Sometimes the XML object model exposes TEXT nodes that contain only whitespace characters. Because whitespace is otherwise stripped, this can cause some confusion. For example, consider the following XML:

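<?xml version="1.0"?>
<!DOCTYPE person [
  <!ELEMENT person (#PCDATA|lastname|firstname)*>
  <!ELEMENT lastname (#PCDATA)>
  <!ELEMENT firstname (#PCDATA)>
]>
<person>
  <lastname>Smith</lastname>
  <firstname>John</firstname>
</person>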

The following tree is generated:

Processing Instruction: xml
DocType: person
ELEMENT: person
    TEXT:
    ELEMENT: lastname
    TEXT:
    ELEMENT: firstname
    TEXT:

The first and last names are surrounded by TEXT nodes containing only whitespace characters because the content model of the "person" element is MIXED; it contains the #PCDATA keyword. The MIXED content model specifies that text can exist between elements. Therefore, the following is also correct:

<person>My last name is <lastname>Smith</lastname> and my first name is
<firstname>John</firstname>
</person>

The result is a tree similar to the following:

ELEMENT: person
    TEXT: My last name is
    ELEMENT: lastname
    TEXT: and my first name is
    ELEMENT: firstname
    TEXT:

Without the whitespace after the word "is" and before and after the word "and", the sentence would be unintelligible. Therefore, in the MIXED content model, any combination of text, whitespace characters, and elements is significant. This is not the case for non-MIXED content models.

To make the whitespace-only TEXT nodes disappear, remove the #PCDATA keyword from the "person" element declaration:

<!ELEMENT person (lastname, firstname)>

The result is the following cleaner tree:

Processing Instruction: xml
DocType: person
ELEMENT: person
    ELEMENT: lastname
    ELEMENT: firstname
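
The same effect can be observed outside MSXML. The following is a minimal sketch that uses the standard Java JAXP parser as a stand-in for the object model discussed here; the class name WhitespaceDemo and its method are hypothetical, and the sample document is assumed to match the "person" example above. It parses the document twice, once with a MIXED content model and once with an element-only content model, and counts the whitespace-only TEXT children of "person". Under JAXP, whitespace in element content is dropped only when validation is on and setIgnoringElementContentWhitespace(true) is set, so both are assumed here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class WhitespaceDemo {
    /**
     * Parses the sample document using the given content model for "person"
     * and counts the whitespace-only TEXT children of the root element.
     */
    static int countWhitespaceTextNodes(String personContentModel) {
        String xml =
              "<?xml version=\"1.0\"?>\n"
            + "<!DOCTYPE person [\n"
            + "  <!ELEMENT person " + personContentModel + ">\n"
            + "  <!ELEMENT lastname (#PCDATA)>\n"
            + "  <!ELEMENT firstname (#PCDATA)>\n"
            + "]>\n"
            + "<person>\n"
            + "  <lastname>Smith</lastname>\n"
            + "  <firstname>John</firstname>\n"
            + "</person>";
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // The parser can only tell which whitespace is ignorable
            // when it validates against the DTD.
            factory.setValidating(true);
            factory.setIgnoringElementContentWhitespace(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            NodeList children = doc.getDocumentElement().getChildNodes();
            int count = 0;
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() == Node.TEXT_NODE
                        && child.getNodeValue().trim().isEmpty()) {
                    count++;
                }
            }
            return count;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // MIXED content model: whitespace between the elements is kept.
        System.out.println(countWhitespaceTextNodes("(#PCDATA|lastname|firstname)*"));
        // Element-only content model: the whitespace is ignorable.
        System.out.println(countWhitespaceTextNodes("(lastname, firstname)"));
    }
}
```

With the MIXED model the count should match the three empty TEXT entries in the first tree; with the element-only model no whitespace-only TEXT nodes should remain.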

What does the XML declaration do?

The XML declaration must appear at the very top of the XML document:

<?xml version="1.0" encoding="UTF-8"?>

It specifies the following:

The document is an XML document. MIME detectors can use this to detect whether a file is of type text/xml when the MIME type is missing or has not been specified.
The document conforms to the XML 1.0 specification. This will be important in the future when there are other versions of XML.
The document's character encoding. The encoding attribute is optional; when it is omitted, the encoding defaults to UTF-8.
Note: The XML declaration must be the very first thing in the XML document, so a file in which anything precedes it, such as a comment on the first line:

<!-- a comment -->
<?xml version="1.0"?>

produces the following parsing error:

Invalid xml declaration.
Line 0000002:
Location 0000007: ------^
Note: The XML declaration is optional. If comments or processing instructions must come first in the file, omit the XML declaration. In that case the encoding defaults to UTF-8.
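
The placement rule can be checked with any conforming parser. The following is a minimal sketch that uses the standard Java JAXP parser rather than MSXML, so the error text will differ from the message shown above; the class name DeclarationPlacementDemo and the sample documents are hypothetical. It reports whether a document parses when the declaration comes first, when a comment precedes the declaration, and when a comment comes first and the declaration is omitted.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class DeclarationPlacementDemo {
    /** Returns true if the text parses as well-formed XML, false on a parse error. */
    static boolean parses(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the document (details go to stderr by default).
            return false;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Declaration at the very start: accepted.
        System.out.println(parses("<?xml version=\"1.0\"?>\n<a/>"));
        // Comment before the declaration: rejected.
        System.out.println(parses("<!-- a comment -->\n<?xml version=\"1.0\"?>\n<a/>"));
        // Comment first and no declaration at all: accepted.
        System.out.println(parses("<!-- a comment -->\n<a/>"));
    }
}
```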

How do I print my XML document in a readable format?

When you construct a document from scratch using the DOM and save it as an XML file, everything ends up on one line, with no whitespace between elements. This is the default behavior.

Internet Explorer 5 has a default XSL stylesheet built in that displays and prints XML documents in a readable format. For example, if IE5 is installed, try viewing the nospace.xml file. The following tree should appear in the browser:

-
-
XYZ
12.56

The browser shows this readable tree without any whitespace characters being inserted into the XML.

Pretty-printing XML is an interesting problem, especially when DTDs define different types of content models. For example, under a mixed content model (#PCDATA) you cannot insert whitespace, because doing so might change the meaning of the content. Consider the following XML:

<B>E</B>lephant

This is best not output as:

<B>E</B>
lephant

because the word boundaries would no longer be correct.

All of this makes automatic pretty-printing problematic. If you do need readable output, you can use the DOM to insert whitespace characters as text nodes at appropriate locations.
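
Inserting whitespace through the DOM can be sketched as follows. This is a minimal illustration that uses the standard Java DOM rather than MSXML. The class DomIndenter, its methods, and the element names in the sample (order, customer, price) are hypothetical. Because no DTD is consulted, the sketch approximates the content model by looking at the children: an element is indented only when it has child elements and no text children, so mixed content such as the "Elephant" case is left untouched.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomIndenter {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    /** True if the element has at least one child element and no text or CDATA children. */
    static boolean hasOnlyElementChildren(Element e) {
        boolean sawElement = false;
        for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
            short type = c.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                return false;
            }
            if (type == Node.ELEMENT_NODE) {
                sawElement = true;
            }
        }
        return sawElement;
    }

    /** A newline followed by two spaces per nesting level. */
    static String pad(int depth) {
        StringBuilder sb = new StringBuilder("\n");
        for (int i = 0; i < depth; i++) {
            sb.append("  ");
        }
        return sb.toString();
    }

    /** Inserts whitespace text nodes around child elements, skipping mixed content. */
    static void indent(Element e, int depth) {
        if (!hasOnlyElementChildren(e)) {
            return; // text-only or mixed content: inserting whitespace could change meaning
        }
        Document doc = e.getOwnerDocument();
        Node child = e.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling();
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                e.insertBefore(doc.createTextNode(pad(depth + 1)), child);
                indent((Element) child, depth + 1);
            }
            child = next;
        }
        // Put the closing tag on its own line at the parent's indentation.
        e.appendChild(doc.createTextNode(pad(depth)));
    }

    static String serialize(Document doc) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<order><customer>XYZ</customer><price>12.56</price></order>");
        indent(doc.getDocumentElement(), 0);
        System.out.println(serialize(doc));
    }
}
```

A version that consults the DTD could decide per element declaration instead of per instance, which would be closer to what the text describes.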

How do I use namespaces in a DTD?

To use a namespace in a DTD, declare it as an xmlns attribute in the ATTLIST declaration of the element that uses it.

The namespace attribute must be declared #FIXED. The same applies to namespaces used on attributes.

Namespaces and XML schemas

DTDs and XML schemas cannot be mixed. For example, the following attribute definition

xmlns:x CDATA #FIXED "x-schema:myschema.xml"

will not cause the schema defined in myschema.xml to be used. DTDs and XML schemas are mutually exclusive.

How do I use XMLDSO in Visual Basic?

Use the following XML as an example:

Mark Hanson
206 765 4583

Jane Smith
425 808 1111

You can bind to an ADO recordset as follows:

Create a new VB 6.0 project.

Add references to Microsoft ActiveX Data Objects 2.1 or later, Microsoft Data Adapter Library, and Microsoft XML version 2.0.

Use the following code to load XML data into the XML DSO control:

' Create the XML DSO control and load the sample file into its DOM document.
Dim dso As New XMLDSOControl
Dim doc As IXMLDOMDocument
Set doc = dso.XMLDocument
doc.Load ("d:test.xml")

Use the following code to map the DSO into a new Recordset object using the DataAdapter:

Dim da As New DataAdapter
Set da.Object = dso
Dim rs As New ADODB.Recordset
Set rs.DataSource = da

Access the data:

MsgBox rs.Fields("name").Value

This displays the string "Mark Hanson".
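
For comparison, the same lookup can be sketched without the XML DSO or ADO. The following is a rough illustration using the standard Java DOM, not the recordset binding described above. The class FirstRecordDemo, its method, and every element name in the sample except "name" (customers, customer, phone) are assumptions, since only the "name" field appears in the code above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class FirstRecordDemo {
    /** Returns the text of the first element with the given name, or null if there is none. */
    static String firstValue(String xml, String fieldName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList fields = doc.getElementsByTagName(fieldName);
            return fields.getLength() == 0 ? null : fields.item(0).getTextContent();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical markup around the sample values; only "name" comes from the text.
        String xml =
              "<customers>"
            + "<customer><name>Mark Hanson</name><phone>206 765 4583</phone></customer>"
            + "<customer><name>Jane Smith</name><phone>425 808 1111</phone></customer>"
            + "</customers>";
        System.out.println(firstValue(xml, "name"));
    }
}
```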

How do I use the XML DOM in Java?

The IE5 version of MSXML.DLL must be installed. In Visual J++ 6.0, select Add COM Wrapper from the Project menu, and then select "Microsoft XML 1.0" from the COM object list. This generates the required Java wrappers in a new package called "msxml". Pre-built Java wrappers are also available for download. The classes can be used as follows:

import com.ms.com.*;
import msxml.*;

public class Class1
{
    public static void main(String[] args)
    {
        // Create the document through the generated COM wrapper.
        DOMDocument doc = new DOMDocument();
        // load() takes a VARIANT, so the URL is wrapped in a Variant.
        doc.load(new Variant("file://d:/samples/ot.xml"));
        System.out.println("Loaded " + doc.getDocumentElement().getNodeName());
    }
}

The code example loads the 3.8MB test file "ot.xml" from Sun's religion samples. The Variant class wraps the Win32 VARIANT data type.

Because you get a new wrapper each time you retrieve a node, you cannot compare nodes by reference. Do not use code like the following:

IXMLDOMNode root1 = doc.getDocumentElement();
IXMLDOMNode root2 = doc.getDocumentElement();
if (root1 == root2)...

Instead use the following code:

if (ComLib.isEqualUnknown(root1, root2)) ....
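
The reason reference comparison fails can be illustrated without COM. The following is a minimal sketch in standard Java in which a hypothetical Wrapper class stands in for the generated COM wrappers: each retrieval creates a fresh wrapper object, so == compares the wrappers rather than the node they wrap. The class names and methods are assumptions for illustration, and the standard DOM method Node.isSameNode plays the role that ComLib.isEqualUnknown plays above.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

class NodeIdentityDemo {
    /** Hypothetical stand-in for a generated wrapper: a new object per retrieval. */
    static final class Wrapper {
        final Node target;

        Wrapper(Node target) {
            this.target = target;
        }
    }

    /** Mimics a getter that hands out a fresh wrapper on every call. */
    static Wrapper retrieveRoot(Document doc) {
        return new Wrapper(doc.getDocumentElement());
    }

    /** Compares the wrapped nodes instead of the wrapper objects. */
    static boolean sameNode(Wrapper a, Wrapper b) {
        return a.target.isSameNode(b.target);
    }

    static Document newDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            doc.appendChild(doc.createElement("root"));
            return doc;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = newDocument();
        Wrapper root1 = retrieveRoot(doc);
        Wrapper root2 = retrieveRoot(doc);
        System.out.println(root1 == root2);         // false: two distinct wrapper objects
        System.out.println(sameNode(root1, root2)); // true: both wrap the same node
    }
}
```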

The total size of the .class wrappers is approximately 160KB. However, for full compliance with the W3C specification, only the IXMLDOM* wrappers should be used. The following classes are the old IE 4.0 XML interfaces and can be removed from the msxml folder:

IXMLAttribute*
IXMLDocument*, XMLDocument*
IXMLElement*
IXMLError*
IXMLElementCollection*
tagXMLEMEM_TYPE*
_xml_error*

This reduces the size to 147KB. You can also delete the following items if you do not need what they provide:

DOMFreeThreadedDocument
    Accesses XML documents from multiple threads in a Java application.
XMLHttpRequest
    Uses the XML DAV HTTP extensions to communicate with the server.
IXTLRuntime
    Defines the XSL stylesheet script objects.
XMLDSOControl
    Binds to XML data in an HTML page.
XMLDOMDocumentEvents
    Returns callbacks during parsing.

This reduces the size to 116KB. To make it smaller still, note that the DOM itself has two layers. The core layer consists of:

DOMDocument, IXMLDOMDocument
IXMLDOMNode*
IXMLDOMNodeList*
IXMLDOMNamedNodeMap*
IXMLDOMDocumentFragment*
IXMLDOMImplementation
IXMLDOMParseError

In addition, there are the DTD information interfaces, which you may want to keep:

IXMLDOMDocumentType
IXMLDOMEntity
IXMLDOMNotation

Every node in an XML document is an IXMLDOMNode, which provides full functionality, but there are also higher-level wrappers for each node type. Therefore, if you modify the DOMDocument wrapper and change these specific types to use IXMLDOMNode, all of the following interfaces can be removed:

IXMLDOMAttribute
IXMLDOMCDATASection
IXMLDOMCharacterData
IXMLDOMComment
IXMLDOMElement
IXMLDOMProcessingInstruction
IXMLDOMEntityReference
IXMLDOMText

Removing these reduces the size to 61KB. However, the getAttribute and setAttribute methods of IXMLDOMElement are useful. Without them you need to use:

IXMLDOMNode.getAttributes().setNamedItem(...)
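
The trade-off can be sketched with the standard Java DOM, whose Node and Element interfaces have the same split as IXMLDOMNode and IXMLDOMElement. This is an illustration only, not the msxml wrapper API; the class AttributeAccessDemo and its helper methods are hypothetical. It sets and reads an attribute through the generic node interface and checks that the result agrees with the element-level convenience methods.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class AttributeAccessDemo {
    /** Sets an attribute using only the generic node interface. */
    static void setViaNode(Node node, String name, String value) {
        Attr attr = node.getOwnerDocument().createAttribute(name);
        attr.setValue(value);
        node.getAttributes().setNamedItem(attr);
    }

    /** Reads an attribute using only the generic node interface; null if absent. */
    static String getViaNode(Node node, String name) {
        Node attr = node.getAttributes().getNamedItem(name);
        return attr == null ? null : attr.getNodeValue();
    }

    /** Creates a new document whose root element has the given tag name. */
    static Element newElement(String tag) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element element = doc.createElement(tag);
            doc.appendChild(element);
            return element;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Element person = newElement("person");

        // The longer route through the node's attribute map.
        setViaNode(person, "id", "42");
        System.out.println(person.getAttribute("id"));

        // The element-level shortcut, read back through the node interface.
        person.setAttribute("lang", "en");
        System.out.println(getViaNode(person, "lang"));
    }
}
```

The node-level route takes three calls where the element-level method takes one, which is the convenience given up by removing the element wrapper.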