When programming, and especially when writing a web crawler, you often need to parse HTML to extract useful data. A good parsing tool saves a lot of effort. Many such tools are available, for example HtmlCleaner and HTMLParser.
Having used both, I find HtmlCleaner the better of the two, particularly because of its XPath support.
Let's work through an HtmlCleaner example. The requirements are: extract the title, the links whose name="my_href", and the content of every li under the div with class="d_1".
I. Using HtmlCleaner
1. About HtmlCleaner
HtmlCleaner is an open-source HTML parser written in Java. It reorders the individual elements of an HTML document and produces well-formed output. By default it follows rules similar to those most web browsers use to build the document model, but users can supply custom tags and rules for filtering and matching.
Home page: http://htmlcleaner.sourceforge.net/
Download: //www.vevb.com/softs/364983.html
2. A basic example, then scraping airport information from Wikipedia
html-clean-demo.html

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="zh-CN" dir="ltr">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=GBK" />
        <meta http-equiv="Content-Language" content="zh-CN" />
        <title>html clean demo</title>
    </head>
    <body>
    <div class="d_1">
        <ul>
            <li>bar</li>
            <li>foo</li>
            <li>gzz</li>
        </ul>
    </div>
    <div>
        <ul>
            <li><a name="my_href" href="1.html">Text-1</a></li>
            <li><a name="my_href" href="2.html">Text-2</a></li>
            <li><a name="my_href" href="3.html">Text-3</a></li>
            <li><a name="my_href" href="4.html">Text-4</a></li>
        </ul>
    </div>
    </body>
    </html>
HtmlCleanerDemo.java

    package com.chenlb;

    import java.io.File;

    import org.htmlcleaner.HtmlCleaner;
    import org.htmlcleaner.TagNode;

    public class HtmlCleanerDemo {

        public static void main(String[] args) throws Exception {
            HtmlCleaner cleaner = new HtmlCleaner();
            // Parse the demo file with the GBK charset declared in its <meta> tag.
            TagNode node = cleaner.clean(new File("html/html-clean-demo.html"), "GBK");

            // Look up elements by tag name (recursive search).
            Object[] ns = node.getElementsByName("title", true);
            if (ns.length > 0) {
                System.out.println("title=" + ((TagNode) ns[0]).getText());
            }

            System.out.println("ul/li:");
            // Select by XPath: every li under the div with class "d_1".
            ns = node.evaluateXPath("//div[@class='d_1']//li");
            for (Object on : ns) {
                TagNode n = (TagNode) on;
                System.out.println("\ttext=" + n.getText());
            }

            System.out.println("a:");
            // Look up elements by attribute value (recursive, case-sensitive).
            ns = node.getElementsByAttValue("name", "my_href", true, true);
            for (Object on : ns) {
                TagNode n = (TagNode) on;
                System.out.println("\thref=" + n.getAttributeByName("href") + ", text=" + n.getText());
            }
        }
    }
The argument to cleaner.clean() can be a file, a URL, or a string of HTML. The methods you will use most often are evaluateXPath, getElementsByAttValue, and getElementsByName. HtmlCleaner also copes well with malformed HTML.
Scraping airport information from Wikipedia:

    import java.io.UnsupportedEncodingException;

    import org.htmlcleaner.HtmlCleaner;
    import org.htmlcleaner.TagNode;
    import org.htmlcleaner.XPatherException;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    //import com.moore.index.BabyStory;
    import com.moore.util.HttpClientUtil;

    /**
     * Purpose: TODO
     *
     * @author bbdtek
     */
    public class ParserAirport {
        private static Logger log = LoggerFactory.getLogger(ParserAirport.class);

        /**
         * @param args
         * @throws UnsupportedEncodingException
         * @throws XPatherException
         */
        public static void main(String[] args) throws UnsupportedEncodingException, XPatherException {
            String url = "http://zh.wikipedia.org/wiki/%E4%B8%AD%E5%8D%8E%E4%BA%BA%E6%B0%91%E5%85%B1%E5%92%8C%E5%9B%BD%E6%9C%BA%E5%9C%BA%E5%88%97%E8%A1%A8";
            // HttpClientUtil is the author's own helper class (not shown here);
            // it presumably downloads the page and returns its body as a string.
            String contents = HttpClientUtil.getUtil().getCon(url);
            HtmlCleaner hc = new HtmlCleaner();
            TagNode tn = hc.clean(contents);
            // Rows of the airport table.
            String xpath = "//div[@class='mw-content-ltr']//table[@class='wikitable + sortable']//tbody//tr[@align='right']";
            Object[] objarr = null;
            objarr = tn.evaluateXPath(xpath);
            if (objarr != null && objarr.length > 0) {
                for (Object obj : objarr) {
                    TagNode tntr = (TagNode) obj;
                    // Links inside the left-aligned cells of the current row.
                    String xptr = "//td[@align='left']//a";
                    Object[] objarrtr = null;
                    objarrtr = tntr.evaluateXPath(xptr);
                    if (objarrtr != null && objarrtr.length > 0) {
                        for (Object obja : objarrtr) {
                            TagNode tna = (TagNode) obja;
                            String str = tna.getText().toString();
                            log.info(str);
                        }
                    }
                }
            }
        }
    }
II. A first look at XPath
1. Introduction to XPath
XPath is a language for finding information in XML documents. It can be used to navigate through the elements and attributes of an XML document.
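To make "navigating elements and attributes" concrete, here is a minimal sketch. It uses the JDK's built-in javax.xml.xpath package rather than HtmlCleaner's evaluateXPath, so it runs without extra libraries. The bookstore document and the class name XPathIntro are made up for illustration.

```java
import java.io.StringReader;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.xml.sax.InputSource;

class XPathIntro {
    // Hypothetical sample document, not taken from the article.
    static final String XML =
          "<bookstore>"
        + "<book><title lang=\"eng\">Harry Potter</title><price>29.99</price></book>"
        + "<book><title lang=\"eng\">Learning XML</title><price>39.95</price></book>"
        + "</bookstore>";

    /** Evaluates an XPath expression against XML and returns the result as a string. */
    static String eval(String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            // For a node set, the string result is the text of the first matching node.
            return xpath.evaluate(expression, new InputSource(new StringReader(XML)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(eval("/bookstore/book/title"));       // element text: Harry Potter
        System.out.println(eval("/bookstore/book/title/@lang")); // attribute value: eng
        System.out.println(eval("count(/bookstore/book)"));      // number of book elements: 2
    }
}
```

The same path syntax is what evaluateXPath accepts in HtmlCleaner, although HtmlCleaner may not support every XPath feature that the JDK implementation does.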
2. Selecting nodes with XPath
XPath uses path expressions to select nodes in an XML document. A node is selected by following a path, or a series of steps.
The most useful path expressions are listed below:
Expression | Description |
---|---|
nodename | Selects the child elements of the current node that are named nodename. |
/ | Selects from the root node. |
// | Selects matching nodes anywhere in the document below the current node, regardless of their position. |
. | Selects the current node. |
.. | Selects the parent of the current node. |
@ | Selects attributes. |
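The following sketch tries each expression from the table against a small DOM tree, again with the JDK's javax.xml.xpath rather than HtmlCleaner. The sample XML, the class name XPathSteps, and its helper methods are assumptions made for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathSteps {
    // Hypothetical sample document without whitespace between elements.
    static final String XML =
          "<bookstore>"
        + "<book><title lang=\"eng\">Harry Potter</title></book>"
        + "<book><title lang=\"fr\">Le Petit Prince</title></book>"
        + "</bookstore>";

    static final XPath XPATH = XPathFactory.newInstance().newXPath();

    /** Parses the sample XML into a DOM document. */
    static Document parse() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }

    /** Returns the names of all nodes that the expression selects, relative to context. */
    static List<String> names(String expression, Object context) {
        try {
            NodeList nodes = (NodeList) XPATH.evaluate(expression, context, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getNodeName());
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse();
        Node root = doc.getDocumentElement();                   // <bookstore>
        Node firstTitle = root.getFirstChild().getFirstChild(); // first <title>

        System.out.println(names("book", root));             // nodename: [book, book]
        System.out.println(names("/bookstore", firstTitle)); // from the root: [bookstore]
        System.out.println(names("//title", doc));           // anywhere: [title, title]
        System.out.println(names(".", firstTitle));          // current node: [title]
        System.out.println(names("..", firstTitle));         // parent: [book]
        System.out.println(names("@lang", firstTitle));      // attribute: [lang]
    }
}
```

Note that an expression starting with / or // is resolved from the document root even when a context node is passed in; only relative expressions such as book, ., .. and @lang depend on the context node.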
Some commonly used expressions
Path expression | Result |
---|---|
/bookstore/book[1] | Selects the first book element that is a child of the bookstore element. |
/bookstore/book[last()] | Selects the last book element that is a child of the bookstore element. |
/bookstore/book[last()-1] | Selects the second-to-last book element that is a child of the bookstore element. |
/bookstore/book[position()<3] | Selects the first two book elements that are children of the bookstore element. |
//title[@lang] | Selects all title elements that have an attribute named lang. |
//title[@lang='eng'] | Selects all title elements whose lang attribute has the value eng. |
/bookstore/book[price>35.00] | Selects all book elements of the bookstore element whose price element has a value greater than 35.00. |
/bookstore/book[price>35.00]/title | Selects all title elements of the book elements of the bookstore element whose price element has a value greater than 35.00. |
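The predicates in the table can be checked with a short sketch like the one below. It again relies on the JDK's javax.xml.xpath instead of HtmlCleaner, and the four-book sample document, the class name XPathPredicates, and the titles helper are invented for the example. Where the table selects book elements, the sketch appends /title so that the matches are easy to print.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathPredicates {
    // Hypothetical sample data: the third title deliberately has no lang attribute.
    static final String XML =
          "<bookstore>"
        + "<book><title lang=\"eng\">Everyday Italian</title><price>30.00</price></book>"
        + "<book><title lang=\"eng\">Harry Potter</title><price>29.99</price></book>"
        + "<book><title>XQuery Kick Start</title><price>49.99</price></book>"
        + "<book><title lang=\"de\">Learning XML</title><price>39.95</price></book>"
        + "</bookstore>";

    /** Returns the text content of every node the expression selects, in document order. */
    static List<String> titles(String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, new InputSource(new StringReader(XML)), XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Positional predicates (XPath positions start at 1, not 0).
        System.out.println(titles("/bookstore/book[1]/title"));
        System.out.println(titles("/bookstore/book[last()]/title"));
        System.out.println(titles("/bookstore/book[last()-1]/title"));
        System.out.println(titles("/bookstore/book[position()<3]/title"));
        // Attribute predicates.
        System.out.println(titles("//title[@lang]"));
        System.out.println(titles("//title[@lang='eng']"));
        // Value predicate on a child element.
        System.out.println(titles("/bookstore/book[price>35.00]/title"));
    }
}
```

With this sample data, the last expression yields the titles of the two books priced above 35.00 (XQuery Kick Start and Learning XML). Whether HtmlCleaner's evaluateXPath accepts each of these predicates should be verified against the HtmlCleaner version in use.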