The test data comes from the ClueWeb09B dataset; a sample of it is stored in the directory ./test_data
for testing purposes. When index building starts, the program reads in the compressed forward index, decompresses it, and generates positional postings. The positional postings are then counted to generate the non-positional postings.
The positional and non-positional postings are stored in two separate STL vectors. When a vector reaches a certain size, for example 500 postings, both vectors are compressed and emptied, and two files are created on disk under the disk_index
directory to store the compressed binary arrays. This process is repeated until fewer than 500 postings remain. The leftover postings are kept in memory as the dynamic index.
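A minimal sketch of this buffering step is shown below; the posting structs and the compress_and_flush helper are hypothetical stand-ins, assumed here only to illustrate the flush-at-500 behavior, and may differ from what indexer.cpp actually defines:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical posting records; the real structs in indexer.cpp may differ.
struct PositionalPosting    { unsigned term_id, doc_id, position; };
struct NonPositionalPosting { unsigned term_id, doc_id, frequency; };

// Assumed helper: compresses both batches and writes two files under disk_index/.
void compress_and_flush(const std::vector<PositionalPosting>&,
                        const std::vector<NonPositionalPosting>&);

static const std::size_t FLUSH_THRESHOLD = 500;   // batch size described above

std::vector<PositionalPosting>    positional;      // buffered positional postings
std::vector<NonPositionalPosting> non_positional;  // buffered non-positional postings

// Called once per generated posting pair; flushes both buffers to disk
// whenever the threshold is reached, then empties them.
void add_posting(const PositionalPosting& p, const NonPositionalPosting& np) {
    positional.push_back(p);
    non_positional.push_back(np);
    if (positional.size() >= FLUSH_THRESHOLD) {
        compress_and_flush(positional, non_positional);
        positional.clear();
        non_positional.clear();
    }
}
```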
As of now, only one encoding method is implemented. When method equals 1, the postings are compressed using variable-byte encoding. The variable-byte rule used in my index is illustrated by the sketch below.
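This is a minimal encoder/decoder pair assuming the common convention of 7 payload bits per byte, low-order bits first, with the high bit set on the terminating byte; the exact bit layout used in indexer.cpp may differ:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Encode one unsigned integer as a variable-byte sequence.
void vbyte_encode(uint32_t value, std::vector<uint8_t>& out) {
    while (value >= 128) {
        out.push_back(static_cast<uint8_t>(value & 0x7F));  // continuation byte
        value >>= 7;
    }
    out.push_back(static_cast<uint8_t>(value | 0x80));      // terminating byte
}

// Decode a single value starting at position i; advances i past the value.
uint32_t vbyte_decode(const std::vector<uint8_t>& in, std::size_t& i) {
    uint32_t value = 0;
    int shift = 0;
    while ((in[i] & 0x80) == 0) {
        value |= static_cast<uint32_t>(in[i++]) << shift;
        shift += 7;
    }
    value |= static_cast<uint32_t>(in[i++] & 0x7F) << shift;
    return value;
}
```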
The name of each file has two parts, a prefix and a number. Positional indexes are stored in files "Z0", "Z1", ..., "Zn" and "I0", "I1", ..., "In". After compressing, the indexer checks whether "Z0" exists on disk. If "Z0" does not exist, it is created and the index is written to it; if "Z0" already exists, the index is written to "I0" instead. For the non-positional index, "X0" is checked and written to if it does not exist; otherwise the non-positional index is written to "L0". The term IDs are stored in the metadata of the corresponding file, and each term's metadata records the filename together with the start and end positions.
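A sketch of how the file choice and per-term metadata could look, assuming std::filesystem and a disk_index/ path prefix; choose_output_file and TermMeta are illustrative names, not the actual declarations in indexer.cpp:

```cpp
#include <cstdint>
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Pick the output file for a freshly compressed batch: prefer the primary
// prefix ("Z" for positional, "X" for non-positional); if that file already
// exists, fall back to the secondary prefix ("I" or "L").
std::string choose_output_file(bool positional, int level = 0) {
    const std::string primary   = positional ? "Z" : "X";
    const std::string secondary = positional ? "I" : "L";
    std::string name = "disk_index/" + primary + std::to_string(level);
    if (fs::exists(name))
        name = "disk_index/" + secondary + std::to_string(level);
    return name;
}

// Hypothetical per-term metadata entry: which file holds the term's postings
// and where they start and end inside that file.
struct TermMeta {
    uint32_t    term_id;
    std::string filename;
    uint64_t    begin, end;   // byte offsets of the compressed postings
};
```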
Two files with the same index number can never coexist on disk permanently. Whenever a batch of 500 postings has been written to disk, the indexer checks whether two files share the same index number. If they do, the two files are combined into a new file whose index number is increased by 1; this process is known as logarithmic merging. For example, if both "Z0" and "I0" exist, they are combined and written into "Z1", or into "I1" if "Z1" already exists on disk. The process is repeated until there are no "I" files on disk, and the metadata is updated accordingly.
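A sketch of this merge loop, with merge_files as a hypothetical helper that combines two compressed files and updates the metadata; the cascading logic mirrors the description above:

```cpp
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Assumed helper: merges two compressed files into dst and updates the metadata.
void merge_files(const std::string& a, const std::string& b, const std::string& dst);

// After each 500-posting flush, repeatedly merge "Z"/"I" pairs that share an
// index number into the next level, until no level has both files on disk.
void logarithmic_merge() {
    int level = 0;
    std::string z = "disk_index/Z" + std::to_string(level);
    std::string i = "disk_index/I" + std::to_string(level);
    while (fs::exists(z) && fs::exists(i)) {
        const int next = level + 1;
        const std::string zn = "disk_index/Z" + std::to_string(next);
        const std::string in = "disk_index/I" + std::to_string(next);
        // Write to Z(next) unless it already exists, in which case use I(next).
        const std::string dst = fs::exists(zn) ? in : zn;
        merge_files(z, i, dst);
        fs::remove(z);
        fs::remove(i);
        level = next;
        z = "disk_index/Z" + std::to_string(level);
        i = "disk_index/I" + std::to_string(level);
    }
}
```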
New postings are received and stored in the dynamic index. Postings generated by the document analyzer are called external postings, and postings stored in memory are called internal postings. When there are 500 postings in memory, they are compressed using the same method as during index building.
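A rough sketch of this update path, with hypothetical ExternalPosting/InternalPosting types and a flush_to_disk helper standing in for the build-time compression routine:

```cpp
#include <vector>

// Hypothetical posting types; the real definitions live in indexer.cpp.
struct ExternalPosting { unsigned term_id, doc_id, position; };
struct InternalPosting { unsigned term_id, doc_id, position; };

// Assumed to reuse the same compression and file-writing path as index build.
void flush_to_disk(const std::vector<InternalPosting>&);

struct DynamicIndex {
    std::vector<InternalPosting> internal;   // in-memory (dynamic) index

    // Convert an external posting from the document analyzer into internal
    // form, buffer it, and flush once 500 postings have accumulated.
    void add(const ExternalPosting& ext) {
        internal.push_back(InternalPosting{ext.term_id, ext.doc_id, ext.position});
        if (internal.size() >= 500) {
            flush_to_disk(internal);
            internal.clear();
        }
    }
};
```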
indexer.cpp
needs to be corrected, and two functions in the same file can be optimized
./initialize.sh
creates the directory where the disk index is stored and makes sure the directory is empty
make
calls the makefile
./INDEX
starts building the index