The ugrep-indexer utility recursively indexes files to speed up recursive grepping.
The contents of archives and compressed files can also be indexed when enabled with a command-line option. This eliminates searching them when none of their contents can match the specified patterns.
ugrep is a fast grep-compatible file searcher that supports index-based searching. Index-based search can be significantly faster on slow file systems and when file system caching is ineffective: if the file system on the drive being searched is not cached in RAM, i.e. it is "cold", then indexing speeds up the search. Only those files that may match a specified regex pattern are searched, by consulting each file's index first. The index allows a quick check whether a potential match exists, so we avoid searching all files.
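The quick-check idea can be sketched as follows. This is a toy model, not ugrep's actual index format (which uses Bloom filters over several N-gram sizes, as described further below): each file's index records hashed 3-gram slots, and a file is searched only if its index may contain every 3-gram of a required pattern substring.

```python
def ngrams(text, n=3):
    """All n-character substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def build_index(content, bits=1024):
    """Toy per-file index: the set of hashed 3-gram slots (Bloom-filter-like)."""
    return {hash(g) % bits for g in ngrams(content)}

def may_match(index, literal, bits=1024):
    """False means the file certainly does not contain the literal;
    True means it might (a false positive is possible)."""
    return all(hash(g) % bits in index for g in ngrams(literal))

files = {
    "a.cpp": "using std::chrono::steady_clock;",
    "b.cpp": "int main() { return 0; }",
}
indexes = {name: build_index(text) for name, text in files.items()}
# only files whose index may contain all of the pattern's 3-grams are searched
to_search = [name for name in files if may_match(indexes[name], "std::chrono")]
```

Note the one-sided guarantee: a file that contains the literal is always in `to_search` (no false negatives), while a hash collision can only cause a harmless extra search.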
Index-based search with ugrep is safe and never skips updated files that may now match. If any files or directories are added or changed after indexing, then searching always covers these additions and changes by comparing file and directory time stamps to the indexing time stamp. When many files have been added or changed after indexing, you may want to re-index to bring the indexes up to date. Re-indexing is incremental, so it does not take as much time as the initial indexing process.
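The time stamp comparison that makes indexed search safe can be sketched as follows. This is a deliberate simplification of the rule described above, not ugrep's actual implementation:

```python
def must_search(file_mtime, index_time, index_says_match):
    """An indexed search scans a file if its index reports a potential
    match, or if the file changed after the index was built (so the
    index may be stale for it).  New and modified files are never skipped."""
    if file_mtime > index_time:
        return True            # modified after indexing: always search it
    return index_says_match    # up to date: trust the index

# a file touched after indexing is searched even if its old index says "no match"
assert must_search(file_mtime=200, index_time=100, index_says_match=False)
```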
A typical but small example of an index-based search, in this case on the ugrep v3.12.6 repository placed on a separate drive:

    $ cd drive/ugrep
    $ ugrep-indexer -I
    12247077 bytes scanned and indexed with 19% noise on average
    1317 files indexed in 28 directories
    28 new directories indexed
    1317 new files indexed
    0 modified files indexed
    0 deleted files removed from indexes
    128 binary files ignored with --ignore-binary
    0 symbolic links skipped
    0 devices skipped
    5605227 bytes indexing storage increase at 4256 bytes/file
Normal searching on a cold file system without indexing takes 1.02 seconds, after unmounting the drive and mounting it again to clear the file system cache in order to record the effect of indexing:

    $ ugrep -I -l 'std::chrono' --stats
    src/ugrep.cpp
    Searched 1317 files in 28 directories in 1.02 seconds with 8 threads: 1 matching (0.07593%)
Ripgrep 13.0.0 takes longer, 1.18 seconds, for the same cold search (ripgrep skips binary files by default, so option `-I` is not specified):

    $ time rg -l 'std::chrono'
    src/ugrep.cpp
    1.18 real 0.01 user 0.06 sys
By contrast, with indexing, searching a cold file system takes only 0.0487 seconds with ugrep, which is 21 times faster, again after unmounting the drive and mounting it again to clear the file system cache:

    $ ugrep --index -I -l 'std::chrono' --stats
    src/ugrep.cpp
    Searched 1317 files in 28 directories in 0.0487 seconds with 8 threads: 1 matching (0.07593%)
    Skipped 1316 of 1317 files with non-matching indexes
There is always some variance in the elapsed time; 0.0487 seconds was the best of four search runs, which ranged from 0.0487 seconds (21x speed-up) to 0.0983 seconds (10x speed-up). The speed-up may be significantly higher than in this small demo, depending on several factors: the size of the files indexed, the read speed of the file system, and the assumption that most files are cold.
The indexing algorithm I designed is provably monotonic: higher accuracy guarantees increased search performance by reducing the false positive rate, but also increases index storage overhead. Likewise, lower accuracy decreases search performance, but also reduces the index storage overhead. Hence I named it a monotonic indexer. If file storage space is at a premium, the index storage overhead can be dialed down by specifying a lower indexing accuracy.
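This monotone trade-off can be illustrated with the textbook Bloom filter false-positive estimate p = (1 - e^(-kn/m))^k, which strictly decreases as the filter size m grows. The figures below are illustrative only; ugrep-indexer's actual hash tables and its "noise" metric differ:

```python
import math

def bloom_fpr(m_bits, n_items, k_hashes=1):
    """Textbook Bloom filter false-positive estimate: (1 - e^(-kn/m))^k."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes

n_grams = 4000  # hypothetical number of distinct n-grams in one file
# hypothetical index sizes per file, in bytes, from small to large
rates = [bloom_fpr(nbytes * 8, n_grams) for nbytes in (490, 1024, 4256)]
# more index bits per file => strictly lower false-positive probability
```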
Indexing the example above with accuracy level 0 (option `-0`) reduces the indexing storage overhead by 8.6 times, from 4256 bytes per file to a measly 490 bytes per file:

    12247077 bytes scanned and indexed with 42% noise on average
    1317 files indexed in 28 directories
    0 new directories indexed
    1317 new files indexed
    0 modified files indexed
    0 deleted files removed from indexes
    128 binary files ignored with --ignore-binary
    0 symbolic links skipped
    0 devices skipped
    646123 bytes indexing storage increase at 490 bytes/file
Indexed search is still much faster than non-indexed search for this example (about 14x, i.e. 1.02 s versus 0.0722 s), with 16 files actually searched (15 false positives):

    Searched 1317 files in 28 directories in 0.0722 seconds with 8 threads: 1 matching (0.07593%)
    Skipped 1301 of 1317 files with non-matching indexes
Regex patterns that are more complex than this example may naturally have a higher false positive rate, i.e. the rate of files that are considered possibly matching when they are not. A higher false positive rate can reduce search speed when it is large enough to be impactful.
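Using the numbers from the accuracy-0 run above, the false positive rate as defined here works out as follows (a small worked example, treating the rate as the fraction of non-matching files that the index still flags as possibly matching):

```python
files = 1317
matching = 1
searched = 16                          # files actually opened by the indexed search
false_positives = searched - matching  # 15 files flagged but not matching
fp_rate = false_positives / (files - matching)
# about 1.1% of the non-matching files were searched needlessly
```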
The following table shows how indexing accuracy affects indexing storage and the average noise per file indexed. The rightmost columns show the search speed and false positive rate for `ugrep --index -I -l 'std::chrono'`:

| acc. | index storage (KB) | average noise | false positives | search time (s) |
|------|--------------------|---------------|-----------------|-----------------|
| `-0` | 631 | 42% | 15 | 0.0722 |
| `-1` | 1276 | 39% | 1 | 0.0506 |
| `-2` | 1576 | 36% | 0 | 0.0487 |
| `-3` | 2692 | 31% | 0 | unchanged |
| `-4` | 2966 | 28% | 0 | unchanged |
| `-5` | 4953 | 23% | 0 | unchanged |
| `-6` | 5474 | 19% | 0 | unchanged |
| `-7` | 9513 | 15% | 0 | unchanged |
| `-8` | 10889 | 11% | 0 | unchanged |
| `-9` | 13388 | 7% | 0 | unchanged |
If the specified regex matches many more possible patterns, for example with the search `ugrep --index -I -l '(todo|TODO)[: ]'`, then we may observe a higher rate of false positives among the 1317 files searched, resulting in slightly longer search times:
| acc. | false positives | search time (s) |
|------|-----------------|-----------------|
| `-0` | 189 | 0.292 |
| `-1` | 69 | 0.122 |
| `-2` | 43 | 0.103 |
| `-3` | 19 | 0.101 |
| `-4` | 16 | 0.097 |
| `-5` | 2 | 0.096 |
| `-6` | 1 | unchanged |
| `-7` | 0 | unchanged |
| `-8` | 0 | unchanged |
| `-9` | 0 | unchanged |
Accuracy `-4` is the default (changed from `-5` in older releases), which tends to work very well for searching with regex patterns of modest complexity.
One word of caution: there is always a small overhead to checking the indexes. If all files are already cached in RAM, because they were searched or read recently, then indexing will not necessarily speed up search; in that case a non-indexed search may be faster. Furthermore, an index-based search has a longer start-up time, which increases when Unicode character classes and wildcards must be converted to hash tables.
To summarize, index-based search is most effective when searching many cold files with regex patterns that do not match too broadly, i.e. we want to limit the use of unbounded repeats `*` and `+` and limit the use of Unicode character classes when possible. This reduces ugrep's start-up time and limits the rate of false positive pattern matches (see also the Q&A below).
Recursively and incrementally index all non-binary files, showing progress:

    ugrep-indexer -I -v

Recursively and incrementally index all non-binary files, including non-binary files stored in archives and in compressed files, showing progress:

    ugrep-indexer -z -I -v

Incrementally index all non-binary files, including archives and compressed files, show progress, follow symbolic links to files (but not to directories), but do not index files and directories matching the globs in .gitignore:

    ugrep-indexer -z -I -v -S -X

Force re-indexing of all non-binary files, including archives and compressed files, follow symbolic links to files (but not to directories), but do not index files and directories matching the globs in .gitignore:

    ugrep-indexer -f -z -I -v -S -X

Same, but minimize index file storage by decreasing the indexing accuracy to 0:

    ugrep-indexer -f -0 -z -I -v -S -X

Increase search performance by increasing the indexing accuracy to 7, at the cost of larger index files:

    ugrep-indexer -f7zIvSX

Recursively delete all hidden `._UG#_Store` index files to restore the directory tree to its non-indexed state:

    ugrep-indexer -d
Configure and compile with:

    ./build.sh

If desired (not required), install with:

    sudo make install
Add an option to create a single index file, e.g. specified explicitly to ugrep. This could further improve indexed search speed if the index file is located on a fast file system. Otherwise, do not expect much improvement, or even a possible slowdown, since a single index file cannot be searched concurrently, and more index entries will be checked even when directories are skipped (which would otherwise skip their indexes too). Experiments will tell. A critical caveat of this approach is that index-based search with `ugrep --index` is no longer safe: new and modified files that are not yet indexed will not be searched.
Each N-gram Bloom filter has its own "bit tier" in the hash table to avoid hash conflicts; for example, 2-grams do not share any bits with 3-grams. This ensures that we never have false positives from characters being falsely matched that are not actually part of the pattern. However, the 1-gram (single character) bit space is small (at most 256 bits), so some bits are wasted when hash tables are larger. A possible approach to reduce this waste is to combine 1-grams with 2-grams so they share the same bit space. This is easy to do if we consider a 1-gram to be equal to a 2-gram with the second character set to
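The bit-tier layout described above can be sketched like this (a simplification with hypothetical tier sizes; the real ugrep-indexer tables differ): each n-gram size hashes into its own disjoint slice of the bit array, so a 2-gram can never collide with a 3-gram.

```python
def tier_slot(gram, tiers):
    """Hash an n-gram into its own bit tier so different n-gram sizes
    never collide: tiers[n] = (bit offset, tier width in bits)."""
    offset, width = tiers[len(gram)]
    return offset + hash(gram) % width

# hypothetical layout: 1-grams need at most 256 bits; larger n-grams
# each get their own disjoint, larger range
tiers = {1: (0, 256), 2: (256, 4096), 3: (256 + 4096, 8192)}
```

For instance, `tier_slot("ab", tiers)` always lands in the 2-gram range and `tier_slot("abc", tiers)` in the 3-gram range, regardless of hash values.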