Author: AngelGavin Source: CSDN
How to load documents with foreign and special characters?
Documents can contain foreign characters, such as:
foreign characters (úóí?)
Foreign characters such as úóí load correctly only if the parser knows how they are encoded. They can be UTF-8 encoded, or the document can declare a different encoding in its XML declaration, as follows:
foreign characters (úóí?)
The XML is now loaded correctly.
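As a rough illustration of the encoding declaration, the following Python sketch uses the standard-library xml.dom.minidom parser rather than MSXML. The element name test and the choice of ISO-8859-1 are assumptions made for this example; the source names neither.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError

text = "foreign characters (úóí)"

# Latin-1 bytes with a matching encoding declaration: the parser can decode them.
declared = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><test>' + text + "</test>"
).encode("latin-1")
loaded = xml.dom.minidom.parseString(declared).documentElement.firstChild.data

# The same bytes without a declaration are read as UTF-8 and rejected.
undeclared = ("<test>" + text + "</test>").encode("latin-1")
try:
    xml.dom.minidom.parseString(undeclared)
    undeclared_loaded = True
except ExpatError:
    undeclared_loaded = False

print(ascii(loaded), undeclared_loaded)
```

The point of the sketch is only that the declared encoding must match the bytes actually stored in the file.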
Other characters are reserved in XML and need to be handled differently. Consider the following XML:
This & that
The following error occurs:
Whitespace is not allowed at this location.
Line 0000001: This & that
Location 0000012: ----------^
Here & is part of the XML syntax. If it is placed directly in the XML source, it is not interpreted as a literal ampersand. You need to replace it with a special character sequence called an "entity":
This &amp; that
The following characters require corresponding entities:
< <
& &
>>
" "
''
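The entity substitution can be sketched in Python with the standard library, used here as a stand-in for MSXML. The element name foo is a hypothetical placeholder. Note that xml.sax.saxutils.escape only handles &, < and > by default, so the two quote characters are passed in explicitly.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError
from xml.sax.saxutils import escape

raw = "This & that"

# A bare ampersand makes the document ill-formed.
try:
    xml.dom.minidom.parseString("<foo>" + raw + "</foo>")
    raw_loaded = True
except ExpatError:
    raw_loaded = False

# Escaping first yields a loadable document whose text is the original string.
escaped = escape(raw)
doc = xml.dom.minidom.parseString("<foo>" + escaped + "</foo>")
round_trip = doc.documentElement.firstChild.data

# All five characters from the table, with the quotes mapped explicitly.
all_five = escape("<&>\"'", {'"': "&quot;", "'": "&apos;"})

print(raw_loaded, escaped, round_trip, all_five)
```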
The quote character is used as a delimiter for attribute values in markup and therefore generally cannot be used inside attribute values. For example, the following will return an error:
The single quote here is used both as an attribute delimiter and within the attribute value itself. To correct this problem, you can change the attribute delimiter to double quotes:
or you can replace the single quote with the entity &apos;.
Both of the above methods return the attribute value John's Stuff through the getAttribute method of the XML object model. Likewise, for double quotes you can use the entity &quot;.
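The three attribute cases can be reproduced with a short Python sketch. It uses xml.dom.minidom instead of MSXML, and the element name foo and attribute name description are hypothetical, since the original markup did not survive in this copy.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError


def attribute_of(source, name):
    """Parse source and return the named attribute of the root element."""
    return xml.dom.minidom.parseString(source).documentElement.getAttribute(name)


# Single quote used both as delimiter and inside the value: parse error.
try:
    attribute_of("<foo description='John's Stuff'/>", "description")
    broken_loaded = True
except ExpatError:
    broken_loaded = False

# Fix 1: switch the delimiter to double quotes.
double_quoted = attribute_of('<foo description="John\'s Stuff"/>', "description")

# Fix 2: keep single-quote delimiters and escape the inner quote.
entity_escaped = attribute_of("<foo description='John&apos;s Stuff'/>", "description")

# The same idea for double quotes, using &quot;.
quot_escaped = attribute_of('<foo description="a &quot;quoted&quot; word"/>', "description")

print(broken_loaded, double_quoted, entity_escaped, quot_escaped)
```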
You can also handle special characters in element content by placing the text in a CDATA section. The following is correct:
<xml><![CDATA[This & that is just "text" content.]]></xml>
In this example, the XML object model displays the CDATA node as a child node of the xml node, which returns the string
This & that is just "text" content.
as nodeValue.
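A minimal Python sketch of the same behavior, assuming the standard-library xml.dom.minidom parser as a stand-in for the MSXML object model:

```python
import xml.dom.minidom
from xml.dom import Node

source = '<xml><![CDATA[This & that is just "text" content.]]></xml>'
doc = xml.dom.minidom.parseString(source)

# The CDATA section is exposed as a child node of the xml element.
cdata = doc.documentElement.firstChild
is_cdata = cdata.nodeType == Node.CDATA_SECTION_NODE
value = cdata.nodeValue

print(is_cdata, value)
```

No entity substitution is needed inside the CDATA section; the ampersand and the double quotes come back unchanged.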
How to use MSXML COM component in Visual Studio 6.0 C++?
The easiest way to use MSXML COM components in Visual C++ 6.0 is to use the #import directive:
#import "msxml.dll" named_guids no_namespace
This defines all IXML* interfaces and interface IDs so that they can be used in applications. The MSXML type library and header files are also available from the INETSDK, as is uuid.lib, which contains the class IIDs.
How to use HTML entities in XML?
The following XML contains HTML entities:
Copyright &copy; 2000, Microsoft Inc, All rights reserved.
It produces the following error:
Reference to undefined entity 'copy'.
Line: 1, Position: 23, Error code: 0xC00CE002
Copyright &copy; 2000, ...
-----------------------^
This is because XML has only five built-in entities. For more information about built-in entities, see "How to load documents with foreign and special characters?" above.
To use HTML entities, you need to define them with a DTD. For more information about DTDs, see the W3C XML Recommendation. To use such a DTD, reference it directly in the DOCTYPE tag, as follows:
Copyright &copy; 2000, Microsoft Inc, All rights reserved.
To load it, you need to turn off the validateOnParse property of the IXMLDOMDocument interface. Try pasting it into the Validator Test Page, turn off DTD validation, and click Validate. Notice that the document loads and the copyright character appears in the DOM tree at the end of the validator page.
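The effect of declaring the entity can be sketched in Python with xml.dom.minidom, which does not validate. As a simplifying assumption, this example declares only the copy entity in an internal DTD subset instead of referencing a full HTML entity DTD, and the element name copyright is hypothetical.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError

body = "<copyright>Copyright &copy; 2000</copyright>"

# Without a declaration, &copy; is an undefined entity.
try:
    xml.dom.minidom.parseString(body)
    undefined_error = None
except ExpatError as exc:
    undefined_error = str(exc)

# Declaring the entity in the DOCTYPE makes the same body loadable.
with_dtd = '<!DOCTYPE copyright [<!ENTITY copy "&#169;">]>' + body
root = xml.dom.minidom.parseString(with_dtd).documentElement
content = "".join(node.data for node in root.childNodes)

print(undefined_error, ascii(content))
```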
If DTD validation is also required, the HTML entities must be included in the existing DTD as a parameter entity, as follows:
%HTMLENT;
%HTMLENT;
It will define all HTML entities so that they can be used in XML documents.
How to deal with whitespace characters in element content?
The XML DOM has three ways of accessing the textual content of elements:
Property | Behavior
nodeValue | Returns the original textual content (including whitespace characters) of TEXT, CDATA, COMMENT, and PI nodes as specified in the original XML source. For ELEMENT nodes and the DOCUMENT itself, null is returned.
data | Same as nodeValue.
text | Recursively concatenates the TEXT and CDATA nodes in the specified subtree and returns the combined result.
Note: Whitespace characters include new lines, tabs, and spaces.
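The difference between these properties can be approximated in Python. The sketch below uses xml.dom.minidom, which has nodeValue and data but no text property, so subtree_text is a hypothetical helper written for this example as a rough analog of text. The sample document is also invented.

```python
import xml.dom.minidom
from xml.dom import Node


def subtree_text(node):
    """Recursively join TEXT and CDATA data below node (rough analog of text)."""
    if node.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
        return node.data
    return "".join(subtree_text(child) for child in node.childNodes)


doc = xml.dom.minidom.parseString("<item>one<![CDATA[ & two]]><!-- note --></item>")
item = doc.documentElement
text_node, cdata_node, comment_node = item.childNodes

# nodeValue is null (None) on ELEMENT and DOCUMENT nodes.
element_value = item.nodeValue
document_value = doc.nodeValue

# On TEXT, CDATA, and COMMENT nodes it returns the node content, as does data.
values = (text_node.nodeValue, cdata_node.nodeValue, comment_node.nodeValue)
same_as_data = text_node.data == text_node.nodeValue

# The recursive concatenation skips the comment.
combined = subtree_text(item)

print(element_value, document_value, values, same_as_data, combined)
```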
The nodeValue property usually returns the contents of the original document, regardless of how the document was loaded and the current xml:space scope.
The text property concatenates all text in the specified subtree and expands entities. The result depends on how the document was loaded, the current state of the preserveWhiteSpace switch, and the current xml:space scope, as shown below:
preserveWhiteSpace = true when the document is loaded
preserveWhiteSpace=true | preserveWhiteSpace=true | preserveWhiteSpace=false | preserveWhiteSpace=false |
xml:space=preserve | xml:space=default | xml:space=preserve | xml:space=default |
preserve | preserve | preserve | preserve and truncate |
preserveWhiteSpace = false when the document is loaded
preserveWhiteSpace=true | preserveWhiteSpace=true | preserveWhiteSpace=false | preserveWhiteSpace=false |
xml:space=preserve | xml:space=default | xml:space=preserve | xml:space=default |
half-preserve | half-preserve and truncate | half-preserve | half-preserve and truncate |
Here, preserve means the text is exactly the same as the original text content in the XML document; truncate means leading and trailing whitespace has been removed; and half-preserve means that "important" (significant) whitespace characters are preserved while "unimportant" (insignificant) whitespace characters are normalized. Important whitespace characters are those within text content. Unimportant whitespace characters are those between tags, as in this example:
\n
\t Jane\n
\tSmith \n
In this example, the newlines and tabs between tags are unimportant whitespace characters that can be ignored, while the spaces adjacent to Jane and Smith are important whitespace characters because they are part of the text content and therefore cannot be ignored. So in this example, the text property returns the following:
State | Returned text value
preserve | "\n\t Jane\n\tSmith \n"
preserve and truncate | "Jane\n\tSmith"
half-preserve | " Jane Smith "
half-preserve and truncate | "Jane Smith"
Note that half-preserve normalizes unimportant whitespace characters; for example, newline and tab characters are reduced to a single space. If you change the xml:space attribute or the preserveWhiteSpace switch, the text property returns correspondingly different values.
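The preserve and preserve and truncate cases can be reproduced with a small Python sketch. It relies on xml.dom.minidom, which always keeps whitespace nodes and has no preserveWhiteSpace switch, so the half-preserve cases are not modeled here. The element names name, first, and last and the helper subtree_text are assumptions made for the example.

```python
import xml.dom.minidom
from xml.dom import Node


def subtree_text(node):
    """Recursively join TEXT and CDATA data below node."""
    if node.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
        return node.data
    return "".join(subtree_text(child) for child in node.childNodes)


source = "<name>\n\t<first> Jane</first>\n\t<last>Smith </last>\n</name>"
name = xml.dom.minidom.parseString(source).documentElement

# Whitespace-only text nodes sitting between the tags (the unimportant ones).
between_tags = [n.data for n in name.childNodes if n.nodeType == Node.TEXT_NODE]

# "preserve": every character of text content, exactly as in the source.
preserved = subtree_text(name)

# "preserve and truncate": the same text with both ends trimmed.
truncated = preserved.strip()

print(repr(between_tags), repr(preserved), repr(truncated))
```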
CDATA and xml:space="preserve" subtree boundaries
The contents of CDATA nodes and of nodes in an xml:space="preserve" scope are concatenated as-is, because they do not participate in the normalization of unimportant whitespace characters. For example:
\n
\t Jane \n
\t<![CDATA[ Smith ]]>\n
In this case, whitespace characters inside the CDATA node are no longer merged with unimportant whitespace characters and are not truncated. So the half-preserve and truncate case returns the following:
"Jane Smith"
Here, the unimportant whitespace characters between the closing tag of the first element and the opening tag of the second are included, regardless of the contents of the CDATA node. The same result is returned if the CDATA section is replaced with the following:
Smith
Entities are special
Entities are loaded and parsed as part of the DTD and appear under the DOCTYPE node. They do not necessarily have any xml:space scope. For example:
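<!DOCTYPE foo [
<!ENTITY Jane "<employee>\n
\t<name>Jane</name>\n
\t<title>Software Design Engineer</title>\n
</employee>">
]>
<foo xml:space="preserve">&Jane;</foo>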
Assuming preserveWhiteSpace=false (within the scope of the DOCTYPE tag), unimportant whitespace characters are lost when the entity is parsed, so the entity has no whitespace nodes. The tree looks like this:
DOCTYPE foo
    ENTITY: Jane
        ELEMENT: employee
            ELEMENT: name
                TEXT: Jane
            ELEMENT: title
                TEXT: Software Design Engineer
ELEMENT: foo
    ATTRIBUTE: xml:space="preserve"
    ENTITYREF: Jane
Note that the DOM tree exposed under the ENTITY node inside the DOCTYPE does not contain any WHITESPACE nodes. This means that the child nodes of the ENTITYREF node also do not have WHITESPACE nodes, even if the entity reference is within the scope of xml:space="preserve".
Each instance of ENTITY referenced in a given document usually has the same tree.
If an entity must absolutely preserve whitespace characters, then it must specify its own xml:space attribute internally, or the document preserveWhiteSpace switch must be set to true.
How to deal with whitespace characters in attributes?
There are several ways to access an attribute value. The IXMLDOMAttribute interface has a value property, which is equivalent to nodeValue; the nodeTypedValue and text properties are Microsoft extensions. These properties return the following:
Property | Text returned
attrNode.nodeValue, attrNode.value, getAttribute("name") | Exactly the same content as in the original document (with entities expanded).
attrNode.nodeTypedValue | Null
attrNode.text | Same as nodeValue, except that leading and trailing whitespace characters are truncated.
The XML language specification defines the following behavior for XML applications:
Attribute type | Returned text
CDATA | Semi-normalization
ID, IDREF, IDREFS, ENTITY, ENTITIES, NOTATION, enumeration | Full normalization
Here semi-normalization means that newline and tab characters are converted to spaces, but multiple spaces are not collapsed into one space.
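Semi-normalization of a CDATA attribute can be observed with a short Python sketch. It uses xml.dom.minidom rather than MSXML, and the element name item and attribute name note are hypothetical. Since minidom has no text property on attributes, str.strip() stands in for the truncation that attrNode.text is described as performing.

```python
import xml.dom.minidom

# The attribute value contains a newline, a tab, and a run of three spaces.
source = '<item note=" first\n\tsecond   third "/>'
item = xml.dom.minidom.parseString(source).documentElement
attr = item.getAttributeNode("note")

# The newline and the tab each become one space; the run of spaces is kept.
value = attr.value
same_everywhere = attr.nodeValue == value == item.getAttribute("note")

# Rough analog of attrNode.text: leading and trailing whitespace removed.
trimmed = value.strip()

print(repr(value), same_everywhere, repr(trimmed))
```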