advanced ir search_engine

README

A document search engine written from scratch in Python, based on concepts from Stanford's Introduction to Information Retrieval book.

Uses the TREC8Adhoc part of the TIPSTER collection for index building and evaluation. You'll have to obtain TREC8Adhoc.tar.bz2 from this collection (Disk 4 & 5) to reproduce the reported results.

To evaluate the results with trec_eval you will have to download the TREC-8 ad hoc qrels.

Features:

Inverted index construction methods: Simple, Single-pass in-memory indexing (SPIMI), and MapReduce
Document similarity metrics: TF-IDF, BM25, BM25VA, and TF-IDF Cosine Distance
Performance evaluation on TREC and result reporting in qrel format.
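As a rough illustration of one of the ranking functions listed above, here is a minimal sketch of Okapi BM25 scoring. It is not taken from this project: the function names, the toy documents, the defaults k1=1.2 and b=0.75, and the particular IDF variant are all assumptions chosen for illustration, and the project's own implementation may differ.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, doc_freq, num_docs,
               k1=1.2, b=0.75):
    """Score one document against a query with Okapi BM25 (illustrative)."""
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        df = doc_freq.get(term, 0)
        if tf == 0 or df == 0:
            continue  # term absent from this document or from the collection
        # One common IDF variant; it stays non-negative for frequent terms.
        idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
        # Length normalisation: longer-than-average documents are penalised.
        norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
        score += idf * tf * (k1 + 1) / norm
    return score


# Toy collection (hypothetical data, not from TREC).
docs = {
    "d1": "the quick brown fox".split(),
    "d2": "the lazy dog".split(),
    "d3": "the quick dog jumps over the lazy dog".split(),
}
doc_tfs = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
# Iterating a Counter yields each distinct term once, so this counts documents.
doc_freq = Counter(term for tf in doc_tfs.values() for term in tf)
avg_len = sum(len(tokens) for tokens in docs.values()) / len(docs)


def rank(query_terms):
    """Return (doc_id, score) pairs sorted by descending score."""
    scores = {
        doc_id: bm25_score(query_terms, doc_tfs[doc_id], len(docs[doc_id]),
                           avg_len, doc_freq, len(docs))
        for doc_id in docs
    }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


ranking = rank(["fox"])
```

The per-term document frequencies and per-document lengths used here are the kind of information the index and stats files described below provide.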

An early Rust port is located at ir-search-engine-rust.

Prerequisites

This project requires Python 3.6 or later.

Install Dependencies

pip install -r requirements.txt

Index Creation

Run

Run python cmd_index.py --help for instructions.

Example:

To create an index and document stats using the SPIMI method, run:

python cmd_index.py --document_folder=./data/TREC8all/Adhoc/ --index_file=spimi.index --stats_file=spimi.stats spimi
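For readers unfamiliar with SPIMI, the idea behind the method can be sketched as follows: postings are collected in an in-memory dictionary until a size limit is reached, the block is sorted and flushed, and all blocks are merged at the end. This is a simplified sketch, not the code in cmd_index.py. The function names and the max_postings_per_block parameter are hypothetical, and the blocks are kept in memory as lists here, whereas a real SPIMI implementation would write each block to disk.

```python
import heapq
from collections import defaultdict


def spimi_blocks(token_stream, max_postings_per_block):
    """Yield sorted blocks, each a list of (term, {doc_id: tf}) pairs."""
    dictionary = defaultdict(dict)
    postings_in_block = 0
    for term, doc_id in token_stream:
        postings = dictionary[term]
        if doc_id not in postings:
            postings[doc_id] = 0
            postings_in_block += 1
        postings[doc_id] += 1
        if postings_in_block >= max_postings_per_block:
            # "Memory full": sort the terms and flush the block.
            yield sorted(dictionary.items())
            dictionary = defaultdict(dict)
            postings_in_block = 0
    if dictionary:
        yield sorted(dictionary.items())


def merge_blocks(blocks):
    """Merge sorted blocks into a single {term: {doc_id: tf}} index."""
    merged = {}
    # key= restricts comparison to the term, so the dicts are never compared.
    for term, postings in heapq.merge(*blocks, key=lambda item: item[0]):
        target = merged.setdefault(term, {})
        for doc_id, tf in postings.items():
            # A document may straddle two blocks, so frequencies are summed.
            target[doc_id] = target.get(doc_id, 0) + tf
    return merged


# Toy documents (hypothetical data).
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}
tokens = ((term, doc_id) for doc_id, text in docs.items()
          for term in text.split())
blocks = list(spimi_blocks(tokens, max_postings_per_block=4))
index = merge_blocks(blocks)
```

Because a block can be flushed in the middle of a document, the merge step sums term frequencies for document IDs that appear in more than one block.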

Output

The script creates two output files:

index_file: Inverted index
stats_file: Document stats collected during index creation (document lengths and term counts)

Index Format

The index file format is text-based.

The first line states the number of unique documents in the index.

Each subsequent line represents one term and its related data:

<TERM> <DOCUMENT_FREQUENCY> <POSTINGS>

TERM - The term itself
DOCUMENT_FREQUENCY - Document Frequency (Number of documents the term appears in)
POSTINGS - A comma-separated list of entries, one per document the term appears in. Each entry is the document ID and the term frequency in that document, separated by a pipe: <DOCUMENT_ID>|<TERM_FREQUENCY>,<DOCUMENT_ID>|<TERM_FREQUENCY>,...
TERM_FREQUENCY - Number of times the term appears in the corresponding document
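The format above can be illustrated with a small writer and parser. This is a sketch under assumptions, not the project's own code: the function names are hypothetical, fields are assumed to be separated by a single space, terms are assumed to contain no spaces, and the sorting of terms and postings is a choice made here for readability. The document IDs in the sample are illustrative.

```python
def format_index(num_docs, index):
    """Serialise {term: {doc_id: tf}} into the text format described above."""
    lines = [str(num_docs)]  # first line: number of documents
    for term in sorted(index):
        postings = index[term]
        entries = ",".join(
            f"{doc_id}|{tf}" for doc_id, tf in sorted(postings.items())
        )
        # <TERM> <DOCUMENT_FREQUENCY> <POSTINGS>
        lines.append(f"{term} {len(postings)} {entries}")
    return "\n".join(lines) + "\n"


def parse_index(text):
    """Parse the text format back into (num_docs, {term: {doc_id: tf}})."""
    lines = text.splitlines()
    num_docs = int(lines[0])
    index = {}
    for line in lines[1:]:
        if not line.strip():
            continue  # tolerate blank lines
        term, df, postings_field = line.split(" ", 2)
        postings = {}
        for entry in postings_field.split(","):
            # Split on the last pipe in case a document ID contains one.
            doc_id, tf = entry.rsplit("|", 1)
            postings[doc_id] = int(tf)
        if int(df) != len(postings):
            raise ValueError(f"document frequency mismatch for term {term!r}")
        index[term] = postings
    return num_docs, index


sample = {
    "retrieval": {"FBIS3-1": 2, "LA010189-0001": 1},
    "index": {"FBIS3-1": 1},
}
text = format_index(2, sample)
parsed_num_docs, parsed_index = parse_index(text)
```

Document IDs are kept as strings because TREC document numbers are not purely numeric.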

Evaluation

Run

Run python cmd_search.py --help for instructions.

Example:

To evaluate a topic list against a previously created SPIMI index using BM25 ranking, run:

python cmd_search.py --output_file=out.txt --run_name=dev --topics_file=./data/TREC8all/topicsTREC8Adhoc.txt --index_file=spimi.index --stats_file=spimi.stats bm25

Output

The script creates an output file which can be used with trec_eval, for example:

trec_eval -q -m map -c ./data/TREC8all/qrels.trec8.adhoc.parts1-5 ./out.txt
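For reference, trec_eval reads results files with six whitespace-separated columns per line: query ID, an iteration field (conventionally Q0), document ID, rank, score, and run name. The sketch below writes lines in that layout. It is an illustration only: the function name, the sample document IDs and scores, and the score precision are assumptions, and the exact formatting produced by cmd_search.py may differ.

```python
def format_run(results, run_name):
    """Format {topic_id: [(doc_id, score), ...]} as trec_eval results lines."""
    lines = []
    for topic_id in sorted(results):
        # Highest score first; ranks are numbered from 1 here.
        ranked = sorted(results[topic_id], key=lambda pair: pair[1],
                        reverse=True)
        for rank, (doc_id, score) in enumerate(ranked, start=1):
            lines.append(
                f"{topic_id} Q0 {doc_id} {rank} {score:.4f} {run_name}"
            )
    return "\n".join(lines) + "\n"


# Hypothetical scores for one topic, with illustrative document IDs.
results = {401: [("FBIS3-10", 7.5), ("LA010189-0001", 9.25)]}
run_text = format_run(results, "dev")
```

The run name in the last column corresponds to the value passed via --run_name in the example command.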

Additional Information

Version 1.0.0
Type Other source code
Update Time 2024-12-20
Size 24.36KB
From GitHub
