Nutch: From search engine to the source of Hadoop
Nutch is an open source Apache project written in Java. Its history mirrors the history of big data technology itself, tracing the path from search engines to Hadoop.
Nutch's past and present
Nutch was born in August 2002 as a search engine project. Its founder, Doug Cutting, also created other well-known open source projects, including Lucene, Hadoop and Avro. The arrival of Nutch marked a new stage for search engine technology: a web search engine developed entirely in the open.
Starting with version 1.2, Nutch gradually shifted from being a full search engine to being a web crawler focused on fetching data from the Internet. By that time, the distributed storage and computation code originally written for Nutch had already been split out as Hadoop.
Over time, Nutch development split into two branches, 1.X and 2.X. The biggest difference is that 2.X abstracts the underlying data storage, so it can run on multiple storage backends, such as HBase and Cassandra.
Four Java open source projects grew out of Nutch as it evolved: Hadoop, Tika, Gora and Crawler Commons.
Hadoop: Hadoop is an open source big data processing framework that originated in Nutch and has become the de facto standard for large-scale data processing.
Tika: Tika builds on a variety of existing open source content parsing projects to extract metadata and structured text from files in many formats.
Gora: Gora persists big data to multiple storage backends, such as HBase and Cassandra.
Crawler Commons: Crawler Commons is a set of reusable components common to web crawlers, giving developers ready-made building blocks for crawler development.
Big Data and Nutch
An early use of the term big data is often traced back to Nutch. At the time, it described large data sets that had to be batch processed or analyzed together in order to update web search indexes.
Today the meaning of big data has broadened considerably, and the industry summarizes its characteristics as four "V"s:
1. Volume: The data volume is huge.
2. Variety: There are many data types.
3. Value: Low value density, but high commercial value overall.
4. Velocity: Fast processing speed.
Nutch and Hadoop are inseparable
Hadoop is one of the core technologies of big data. Nutch is both where Hadoop originated and a showcase of Hadoop in practice.
For learning Hadoop, Nutch is the best data source: What if you have no data to work with? Crawl it with Nutch!
For practicing Hadoop, Nutch provides a wealth of examples: After learning Hadoop's MapReduce and HDFS, what if you have no real-world cases to study? Study Nutch! Much of Nutch's code is written on top of MapReduce and HDFS. Where can you find a better example of a Hadoop application than Nutch?
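To give a feel for the MapReduce style mentioned above, here is a minimal sketch in plain Java, with no Hadoop dependency. It is not Nutch's actual code: the class name `HostCountSketch` and its methods are made up for illustration, and the "shuffle" step is just an in-memory grouping. It counts how many crawled URLs belong to each host, the kind of per-key aggregation a crawler typically needs.

```java
import java.net.URI;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of the map -> shuffle -> reduce pattern.
class HostCountSketch {

    // Map phase: emit a (host, 1) pair for one crawled URL.
    static Map.Entry<String, Integer> map(String url) {
        String host = URI.create(url).getHost();
        return new SimpleEntry<>(host, 1);
    }

    // Reduce phase: sum all counts that share the same host key.
    static int reduce(String host, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Driver: run map on every URL, group pairs by key, then reduce each group.
    static Map<String, Integer> countByHost(List<String> urls) {
        // "Shuffle": collect all values emitted for the same key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String url : urls) {
            Map.Entry<String, Integer> pair = map(url);
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
            "http://example.com/a",
            "http://example.com/b",
            "http://example.org/");
        // Prints {example.com=2, example.org=1}
        System.out.println(countByHost(urls));
    }
}
```

In a real Hadoop job the map and reduce functions would run on many machines and the grouping would be done by the framework, with input and output stored on HDFS; the sketch only shows the shape of the computation.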
By studying Nutch, you not only come to understand the history of big data technology but also gain practical Hadoop skills. From search engine to Hadoop, Nutch's journey shows the appeal of continuous technological evolution and offers valuable experience and resources for learning big data technology.