The test data comes from the ClueWeb09B dataset; a sample of it is stored in the directory ./test_data
for testing purposes. When index building starts, the program reads in the compressed forward index, decompresses it, and generates positional postings. The positional postings are then counted to generate the non-positional postings.
The positional and non-positional postings are stored in two separate STL vectors. When a vector reaches a certain size, for example 500 postings, both vectors are compressed and emptied, and two files are created on disk under the disk_index
directory to store the compressed binary arrays. This process is repeated until fewer than 500 postings remain. The leftover postings are kept in memory as the dynamic index.
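A minimal sketch of this buffering step is shown below; the posting structs and the compress_and_flush helper are hypothetical stand-ins, assumed here only to illustrate the flush-at-500 behavior, and may differ from what indexer.cpp actually defines:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical posting records; the real structs in indexer.cpp may differ.
struct PositionalPosting    { unsigned term_id, doc_id, position; };
struct NonPositionalPosting { unsigned term_id, doc_id, frequency; };

// Assumed helper: compresses both batches and writes two files under disk_index/.
void compress_and_flush(const std::vector<PositionalPosting>&,
                        const std::vector<NonPositionalPosting>&);

static const std::size_t FLUSH_THRESHOLD = 500;   // batch size described above

std::vector<PositionalPosting>    positional;      // buffered positional postings
std::vector<NonPositionalPosting> non_positional;  // buffered non-positional postings

// Called once per generated posting pair; flushes both buffers to disk
// whenever the threshold is reached, then empties them.
void add_posting(const PositionalPosting& p, const NonPositionalPosting& np) {
    positional.push_back(p);
    non_positional.push_back(np);
    if (positional.size() >= FLUSH_THRESHOLD) {
        compress_and_flush(positional, non_positional);
        positional.clear();
        non_positional.clear();
    }
}
```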
As of now, only one encoding method is implemented. When method equals 1, the postings are compressed using variable-byte encoding. The variable-byte rule used in my index is illustrated by the sketch below.
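This is a minimal encoder/decoder pair assuming the common convention of 7 payload bits per byte, low-order bits first, with the high bit set on the terminating byte; the exact bit layout used in indexer.cpp may differ:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Encode one unsigned integer as a variable-byte sequence.
void vbyte_encode(uint32_t value, std::vector<uint8_t>& out) {
    while (value >= 128) {
        out.push_back(static_cast<uint8_t>(value & 0x7F));  // continuation byte
        value >>= 7;
    }
    out.push_back(static_cast<uint8_t>(value | 0x80));      // terminating byte
}

// Decode a single value starting at position i; advances i past the value.
uint32_t vbyte_decode(const std::vector<uint8_t>& in, std::size_t& i) {
    uint32_t value = 0;
    int shift = 0;
    while ((in[i] & 0x80) == 0) {
        value |= static_cast<uint32_t>(in[i++]) << shift;
        shift += 7;
    }
    value |= static_cast<uint32_t>(in[i++] & 0x7F) << shift;
    return value;
}
```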
The name of each file has two parts, a prefix and a number. Positional indexes are stored in files "Z0", "Z1", ..., "Zn" and "I0", "I1", ..., "In". After compressing, the indexer checks whether "Z0" exists on disk. If "Z0" does not exist, it is created and the index is written to it; if "Z0" already exists, the index is written to "I0" instead. For the non-positional index, "X0" is checked and written to if it does not exist; otherwise the non-positional index is written to "L0". The term IDs are stored in the metadata of the corresponding file, and each term's metadata records the filename together with the start and end positions.
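A sketch of how the file choice and per-term metadata could look, assuming std::filesystem and a disk_index/ path prefix; choose_output_file and TermMeta are illustrative names, not the actual declarations in indexer.cpp:

```cpp
#include <cstdint>
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Pick the output file for a freshly compressed batch: prefer the primary
// prefix ("Z" for positional, "X" for non-positional); if that file already
// exists, fall back to the secondary prefix ("I" or "L").
std::string choose_output_file(bool positional, int level = 0) {
    const std::string primary   = positional ? "Z" : "X";
    const std::string secondary = positional ? "I" : "L";
    std::string name = "disk_index/" + primary + std::to_string(level);
    if (fs::exists(name))
        name = "disk_index/" + secondary + std::to_string(level);
    return name;
}

// Hypothetical per-term metadata entry: which file holds the term's postings
// and where they start and end inside that file.
struct TermMeta {
    uint32_t    term_id;
    std::string filename;
    uint64_t    begin, end;   // byte offsets of the compressed postings
};
```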
Two files with the same index number can never coexist on disk permanently. Whenever a batch of 500 postings has been written to disk, the indexer checks whether two files share the same index number. If they do, the two files are combined into a new file whose index number is increased by 1; this process is known as logarithmic merging. For example, if both "Z0" and "I0" exist, they are combined and written into "Z1", or into "I1" if "Z1" already exists on disk. The process is repeated until there are no "I" files on disk, and the metadata is updated accordingly.
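A sketch of this merge loop, with merge_files as a hypothetical helper that combines two compressed files and updates the metadata; the cascading logic mirrors the description above:

```cpp
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Assumed helper: merges two compressed files into dst and updates the metadata.
void merge_files(const std::string& a, const std::string& b, const std::string& dst);

// After each 500-posting flush, repeatedly merge "Z"/"I" pairs that share an
// index number into the next level, until no level has both files on disk.
void logarithmic_merge() {
    int level = 0;
    std::string z = "disk_index/Z" + std::to_string(level);
    std::string i = "disk_index/I" + std::to_string(level);
    while (fs::exists(z) && fs::exists(i)) {
        const int next = level + 1;
        const std::string zn = "disk_index/Z" + std::to_string(next);
        const std::string in = "disk_index/I" + std::to_string(next);
        // Write to Z(next) unless it already exists, in which case use I(next).
        const std::string dst = fs::exists(zn) ? in : zn;
        merge_files(z, i, dst);
        fs::remove(z);
        fs::remove(i);
        level = next;
        z = "disk_index/Z" + std::to_string(level);
        i = "disk_index/I" + std::to_string(level);
    }
}
```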
New postings are received and stored in the dynamic index. Postings generated by the document analyzer are called external postings, and postings stored in memory are called internal postings. When there are 500 postings in memory, they are compressed using the same method as during index building.
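A rough sketch of this update path, with hypothetical ExternalPosting/InternalPosting types and a flush_to_disk helper standing in for the build-time compression routine:

```cpp
#include <vector>

// Hypothetical posting types; the real definitions live in indexer.cpp.
struct ExternalPosting { unsigned term_id, doc_id, position; };
struct InternalPosting { unsigned term_id, doc_id, position; };

// Assumed to reuse the same compression and file-writing path as index build.
void flush_to_disk(const std::vector<InternalPosting>&);

struct DynamicIndex {
    std::vector<InternalPosting> internal;   // in-memory (dynamic) index

    // Convert an external posting from the document analyzer into internal
    // form, buffer it, and flush once 500 postings have accumulated.
    void add(const ExternalPosting& ext) {
        internal.push_back(InternalPosting{ext.term_id, ext.doc_id, ext.position});
        if (internal.size() >= 500) {
            flush_to_disk(internal);
            internal.clear();
        }
    }
};
```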
indexer.cpp
needs to be corrected, and two functions in the same file can be optimized
./initialize.sh
creates the directory where the disk index is stored and makes sure the directory is empty
make
calls the makefile
./INDEX
starts building the index