Apache Nutch web crawler v1.19

Nutch: From search engine to the source of Hadoop

Nutch is an open source Apache project implemented in Java. Its history mirrors the development of big data technology, tracing the path from search engines to Hadoop.

Nutch’s past and present

1. The origin of search engines

Nutch was started in August 2002 as an open source search engine project. Its founder, Doug Cutting, also created or co-created well-known open source projects such as Lucene, Hadoop and Avro. Nutch's emergence marked a new stage in the development of open source search engine technology.

2. From search engine to web crawler

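After version 1.2, Nutch gradually shifted from a complete search engine to a web crawler focused on fetching data from the Internet, leaving indexing and search to external systems such as Apache Solr. By that time Hadoop had already been split out of Nutch into its own project (2006), so the crawler-focused Nutch runs on top of Hadoop rather than preceding it.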

3. Two branches: 1.X and 2.X

As Nutch developed, it split into two branches, 1.x and 2.x. The biggest difference is that 2.x abstracts the underlying data storage through Apache Gora and so supports multiple storage backends, such as HBase and Cassandra, whereas 1.x keeps its crawl data in files on a Hadoop-supported file system such as HDFS.
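
The idea of abstracting the storage layer can be illustrated with a minimal sketch. The names below (`PageStore`, `InMemoryPageStore`, `recordFetch`) are hypothetical and chosen only for illustration; they are not the Nutch or Gora API. The point is only that crawler logic written against an interface does not need to know which backend holds the data.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class StorageSketch {

    // Hypothetical storage abstraction (not the Nutch or Gora API).
    interface PageStore {
        void put(String url, String content);
        Optional<String> get(String url);
    }

    // One possible backend: a plain in-memory map. Another class could
    // implement the same interface on top of a database or a file system.
    static class InMemoryPageStore implements PageStore {
        private final Map<String, String> pages = new HashMap<>();

        @Override
        public void put(String url, String content) {
            pages.put(url, content);
        }

        @Override
        public Optional<String> get(String url) {
            return Optional.ofNullable(pages.get(url));
        }
    }

    // Crawler-side logic depends only on the interface, not on the backend.
    static void recordFetch(PageStore store, String url, String content) {
        store.put(url, content);
    }

    public static void main(String[] args) {
        PageStore store = new InMemoryPageStore();
        recordFetch(store, "https://nutch.apache.org/", "<html>...</html>");
        System.out.println(store.get("https://nutch.apache.org/").isPresent());
    }
}
```

Swapping the backend then means passing a different `PageStore` implementation to `recordFetch`, with no change to the crawler-side code.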

4. Spawned four open source projects

As Nutch evolved, four Java open source projects were spun off from it: Hadoop, Tika, Gora and Crawler Commons.

Hadoop: Hadoop is an open source big data processing framework that grew out of Nutch and has become the de facto standard for large-scale data processing.

Tika: Tika uses a variety of existing open source content parsing projects to extract metadata and structured text from files in multiple formats.

Gora: Gora supports persistence of big data to multiple storage implementations, such as HBase and Cassandra.

Crawler Commons: Crawler Commons is a set of reusable Java components that implement functionality common to web crawlers, giving developers shared building blocks for crawler development.

Big Data and Nutch

One early use of the term "big data" is often traced back to Nutch. At the time, it described large data sets that had to be batch processed or analyzed together in order to update web search indexes.

Since then the meaning of big data has broadened considerably, and the industry commonly sums up its characteristics as four Vs:

1. Volume: The data volume is huge.

2. Variety: There are many data types.

3. Value: Low value density but high commercial value.

4. Velocity: Data arrives quickly and must be processed quickly.

Nutch and Hadoop are inseparable

Hadoop is one of the core technologies of big data. Nutch is both the origin of Hadoop and one of the most complete showcases of Hadoop in use.

For learning Hadoop, Nutch is a convenient data source: if you have no data to work with, crawl some with Nutch.

For practicing Hadoop, Nutch provides a wealth of examples: after learning Hadoop’s MapReduce and HDFS, what should you do if you have no practical cases to study? Learn Nutch. Much of Nutch's code is written on top of MapReduce and HDFS, which makes it one of the richest real-world Hadoop applications available for study.
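
To give a feel for the MapReduce style mentioned above, here is a minimal sketch in plain Java that counts crawled URLs per host. It only imitates the map and reduce phases in memory; it does not use the Hadoop API, and the class and method names (`HostCountSketch`, `map`, `reduce`) are assumptions made for illustration rather than code taken from Nutch.

```java
import java.net.URI;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class HostCountSketch {

    // Map phase: emit one (host, 1) pair per URL. Malformed URLs and
    // URLs without a host are skipped.
    static List<Map.Entry<String, Integer>> map(List<String> urls) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String url : urls) {
            try {
                String host = URI.create(url).getHost();
                if (host != null) {
                    pairs.add(new SimpleEntry<>(host, 1));
                }
            } catch (IllegalArgumentException e) {
                // Not a valid URI: ignore it in this sketch.
            }
        }
        return pairs;
    }

    // Reduce phase: group the pairs by key and sum the values per host.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
                "https://nutch.apache.org/",
                "https://nutch.apache.org/download/",
                "https://hadoop.apache.org/");
        // Prints {hadoop.apache.org=1, nutch.apache.org=2}
        System.out.println(reduce(map(urls)));
    }
}
```

In a real Hadoop job the map and reduce steps would run in parallel across many machines, with the input and output stored on HDFS; the in-memory lists here stand in for that machinery.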

By learning Nutch, you can both understand how big data technology developed and gain practical Hadoop skills. From search engine to Hadoop, Nutch's journey shows how technology keeps evolving, and it offers valuable experience and resources for learning big data technology.

Additional Information

Version: 1.19
Type: Java source code
Update time: 2024-10-30
Size: 6.85 MB
