A document search engine written from scratch in Python. Based on concepts from Stanford's Introduction to Information Retrieval book.
It uses the TREC8Adhoc part of the TIPSTER collection for index building and evaluation. To reproduce the reported results, you will have to obtain TREC8Adhoc.tar.bz2 from this collection (Disks 4 & 5).
To evaluate the results with trec_eval, you will also have to download the TREC-8 ad hoc qrels.
Features:
An early Rust port is located at ir-search-engine-rust.
This project requires Python 3.6 or later.
pip install -r requirements.txt
Run python cmd_index.py --help for instructions.
To create an index and document stats using the SPIMI method, run:
python cmd_index.py --document_folder=./data/TREC8all/Adhoc/ --index_file=spimi.index --stats_file=spimi.stats spimi
The script creates two output files:
index_file: Inverted index
stats_file: Document stats collected during index creation (document lengths and term counts)
The index file format is text-based.
The first line states the number of unique documents in the index.
Each subsequent line represents one term and its related data:
<TERM> <DOCUMENT_FREQUENCY> <POSTINGS>
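The line layout above can be read back with a few lines of Python. The following is a minimal sketch, not the project's own loader: it assumes the three fields are separated by whitespace, and because the encoding of the postings field is not described here, it keeps that field as an uninterpreted string. The names IndexEntry, parse_index_line and read_index, as well as the sample lines, are made up for illustration.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class IndexEntry(NamedTuple):
    term: str
    document_frequency: int
    raw_postings: str  # left unparsed: the postings encoding is not assumed here


def parse_index_line(line: str) -> IndexEntry:
    """Split one term line into <TERM>, <DOCUMENT_FREQUENCY> and the rest."""
    # Assumption: fields are whitespace-separated; everything after the
    # second field is treated as the postings payload.
    parts = line.strip().split(maxsplit=2)
    if len(parts) < 2:
        raise ValueError(f"malformed index line: {line!r}")
    raw_postings = parts[2] if len(parts) == 3 else ""
    return IndexEntry(parts[0], int(parts[1]), raw_postings)


def read_index(lines: Iterable[str]) -> Tuple[int, Dict[str, IndexEntry]]:
    """Read the document count from the first line, then one entry per term."""
    iterator = iter(lines)
    num_documents = int(next(iterator).strip())
    entries = {}
    for line in iterator:
        if line.strip():  # skip blank lines
            entry = parse_index_line(line)
            entries[entry.term] = entry
    return num_documents, entries


if __name__ == "__main__":
    # Hypothetical sample content; the postings shown are placeholders.
    sample = ["3\n", "retrieval 2 d1 d7\n", "system 1 d2\n"]
    num_documents, entries = read_index(sample)
    print(num_documents, entries["retrieval"].document_frequency)
```

For a real index file, passing the open file object to read_index should work the same way, since it only iterates over lines.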
Run python cmd_search.py --help for instructions.
To evaluate a topic list on a previously created SPIMI index using BM25 ranking, run:
python cmd_search.py --output_file=out.txt --run_name=dev --topics_file=./data/TREC8all/topicsTREC8Adhoc.txt --index_file=spimi.index --stats_file=spimi.stats bm25
The script creates an output file that can be used with trec_eval, for example:
trec_eval -q -m map -c ./data/TREC8all/qrels.trec8.adhoc.parts1-5 ./out.txt
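To illustrate what a BM25 run produces, here is a small self-contained sketch that scores a toy collection and formats the ranking in the standard six-column TREC run format (topic, the literal Q0, document ID, rank, score, run name) that trec_eval reads. It is not the code used by cmd_search.py. The parameters k1=1.2 and b=0.75, the "log(1 + ...)" IDF variant, the 1-based ranks, and the function names bm25_scores and trec_run_lines are assumptions made for this example, and the documents are invented.

```python
import math
from collections import Counter
from typing import Dict, List


def bm25_scores(query_terms: List[str], docs: Dict[str, List[str]],
                k1: float = 1.2, b: float = 0.75) -> Dict[str, float]:
    """Score every document that contains at least one query term."""
    n_docs = len(docs)
    doc_len = {doc_id: len(tokens) for doc_id, tokens in docs.items()}
    avg_len = sum(doc_len.values()) / n_docs
    term_counts = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for counts in term_counts.values() for term in counts)

    scores = {}
    for doc_id, counts in term_counts.items():
        score = 0.0
        for term in query_terms:
            tf = counts.get(term, 0)
            if tf == 0:
                continue
            # Assumed IDF variant; it stays positive even for frequent terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalisation: longer documents need more occurrences.
            norm = tf + k1 * (1 - b + b * doc_len[doc_id] / avg_len)
            score += idf * tf * (k1 + 1) / norm
        if score > 0:
            scores[doc_id] = score
    return scores


def trec_run_lines(topic_id: str, scores: Dict[str, float],
                   run_name: str) -> List[str]:
    """Format a ranking as '<topic> Q0 <doc> <rank> <score> <run_name>' lines."""
    # Sort by descending score, breaking ties by document ID.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [f"{topic_id} Q0 {doc_id} {rank} {score:.4f} {run_name}"
            for rank, (doc_id, score) in enumerate(ranked, start=1)]


if __name__ == "__main__":
    toy_docs = {  # invented documents, already tokenised
        "D1": ["information", "retrieval", "system"],
        "D2": ["database", "system"],
        "D3": ["information", "theory"],
    }
    scores = bm25_scores(["information", "retrieval"], toy_docs)
    print("\n".join(trec_run_lines("401", scores, "dev")))
```

In this sketch, documents that match no query term are left out of the run entirely, and a run name such as dev ends up in the last column of each line.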