Esmy

Esmy is a library for full text search, written in Rust. It is inspired by Lucene, but aims to be more flexible.

Features

Text indexing with different analyzers
Text search, including phrases
Parallel indexing
Document deletions
Quite fast

Roadmap

Document scoring
~~Document deletions~~
Doc-values data structures (fast access to values of fields)
Improve merge concurrency
More query types (e.g. spans, more boolean logic)

Example

let schema = SegmentSchemaBuilder::new()
    .add_full_doc("full_doc_feature") //features have names
    .add_string_index(
        "text_string_index",
        "text",
        Box::new(UAX29Analyzer::new())) //Unicode tokenization
    .build();

let index = IndexBuilder::new().create("path/to/index", schema).unwrap();
let doc1 = Doc::new().string_field("text", "The quick brown fox jumps over the lazy dog");
index.add_doc(doc1).unwrap();
let doc2 = Doc::new().string_field("text", "Foxes are generally smaller than some other members of the family Canidae");
index.add_doc(doc2).unwrap();
index.commit().unwrap();

let query = TextQuery::new(
    "text",                         //field
    "brown fox",                    //value
    Box::new(UAX29Analyzer::new()), //Search with the same analyzer as we indexed
);
let mut collector = CountCollector::new();
let reader = index.open_reader().unwrap();
reader.search(&query, &mut collector).unwrap();
// Only the first document contains "brown fox", so exactly one document matches.
assert_eq!(1, collector.total_count());

Design

Esmy is an information retrieval system and takes a lot of inspiration from Lucene. The main idea is to have an inverted index, which lets you look up which documents contain a given term. However, additional data structures are often needed to visualize or process the data, e.g. to create histograms of result sets or to support geo-search. Esmy is therefore structured to accommodate new data structures.
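To make the idea of an inverted index concrete, here is a minimal in-memory sketch. It is not Esmy's implementation: the `InvertedIndex` type, its methods, and the simple lowercase tokenizer are all assumptions made for illustration.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Toy inverted index: maps each term to the sorted set of ids of the
/// documents that contain it. Hypothetical type, not part of Esmy.
#[derive(Default)]
pub struct InvertedIndex {
    postings: BTreeMap<String, BTreeSet<u64>>,
}

impl InvertedIndex {
    /// Split on non-alphanumeric characters and lowercase each term.
    /// This is a stand-in for a real analyzer, not UAX #29 segmentation.
    fn analyze(text: &str) -> impl Iterator<Item = String> + '_ {
        text.split(|c: char| !c.is_alphanumeric())
            .filter(|t| !t.is_empty())
            .map(|t| t.to_lowercase())
    }

    /// Record every term of `text` as occurring in document `doc_id`.
    pub fn add_doc(&mut self, doc_id: u64, text: &str) {
        for term in Self::analyze(text) {
            self.postings.entry(term).or_default().insert(doc_id);
        }
    }

    /// Ids of the documents that contain `term`, in ascending order.
    pub fn lookup(&self, term: &str) -> Vec<u64> {
        self.postings
            .get(&term.to_lowercase())
            .map(|ids| ids.iter().copied().collect())
            .unwrap_or_default()
    }
}
```

Looking up a term is a single map access that returns the matching document ids, which is what makes term queries cheap compared to scanning every document.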

Like Lucene, Esmy is structured around indexes and segments. A segment is a collection of on-disk data structures, and an index is a set of segments. Segments are immutable. Documents added to Esmy are at some point committed to disk, and each commit creates a new segment. Over time, this produces many small segments. To keep their number down, Esmy can merge segments into larger ones. The on-disk data structures of the segments can then be used to do something useful, e.g. searching for text.
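The commit and merge life cycle described above can be sketched in memory as follows. This is an illustration under assumptions, not Esmy's code: the `Segment` and `Index` types here are hypothetical, hold documents in a `Vec` instead of on disk, and use a simple size threshold as the merge policy.

```rust
/// An immutable batch of documents, standing in for a set of on-disk files.
#[derive(Debug, Clone, PartialEq)]
pub struct Segment {
    docs: Vec<String>,
}

/// An index is a set of segments plus documents not yet committed.
#[derive(Default)]
pub struct Index {
    segments: Vec<Segment>,
    pending: Vec<String>,
}

impl Index {
    /// Queue a document; it is not part of any segment until `commit`.
    pub fn add_doc(&mut self, doc: &str) {
        self.pending.push(doc.to_string());
    }

    /// Turn the pending documents into a new segment.
    /// Existing segments are never modified.
    pub fn commit(&mut self) {
        if !self.pending.is_empty() {
            let docs = std::mem::take(&mut self.pending);
            self.segments.push(Segment { docs });
        }
    }

    /// Replace all segments with fewer than `max_docs` documents
    /// with a single merged segment.
    pub fn merge_small(&mut self, max_docs: usize) {
        let (small, large): (Vec<Segment>, Vec<Segment>) =
            std::mem::take(&mut self.segments)
                .into_iter()
                .partition(|s| s.docs.len() < max_docs);
        self.segments = large;
        if !small.is_empty() {
            let docs = small.into_iter().flat_map(|s| s.docs).collect();
            self.segments.push(Segment { docs });
        }
    }

    /// Number of documents in each segment, in segment order.
    pub fn segment_sizes(&self) -> Vec<usize> {
        self.segments.iter().map(|s| s.docs.len()).collect()
    }
}
```

Because segments are never changed in place, a merge only builds a new segment and swaps it in for the ones it replaces.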

Apart from not being on the JVM, there are a few differences from Lucene.

One is that Lucene treats the inverted index as the core of the library. While it is an important feature of Esmy, it is only one kind of useful data structure. Esmy instead has the concept of a segment feature, and the inverted index is one such feature. A segment feature has two requirements: it must be possible to create one from a set of documents, and the feature must be able to merge the files it wrote into larger files.
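The two requirements on a segment feature map naturally onto a trait with a build step and a merge step. The sketch below is a guess at that shape, not Esmy's actual trait: `SegmentFeature`, `ValueCounts`, and the `Doc` alias are hypothetical names, and data is kept in memory where a real feature would write files.

```rust
use std::collections::BTreeMap;

/// A document as a list of (field, value) pairs. Hypothetical alias.
pub type Doc = Vec<(String, String)>;

/// Hypothetical trait capturing the two requirements; not Esmy's real API.
pub trait SegmentFeature {
    /// What the feature stores for one segment.
    type Data;
    /// Build the feature's data from a set of documents.
    fn build(&self, docs: &[Doc]) -> Self::Data;
    /// Merge data written for several segments into data for one larger segment.
    fn merge(&self, parts: Vec<Self::Data>) -> Self::Data;
}

/// Example feature: counts the occurrences of each value of one field,
/// the kind of structure a histogram could be read from.
pub struct ValueCounts {
    pub field: String,
}

impl SegmentFeature for ValueCounts {
    type Data = BTreeMap<String, usize>;

    fn build(&self, docs: &[Doc]) -> Self::Data {
        let mut counts = BTreeMap::new();
        for doc in docs {
            for (field, value) in doc {
                if *field == self.field {
                    *counts.entry(value.clone()).or_insert(0) += 1;
                }
            }
        }
        counts
    }

    fn merge(&self, parts: Vec<Self::Data>) -> Self::Data {
        let mut merged = BTreeMap::new();
        for part in parts {
            for (value, n) in part {
                *merged.entry(value).or_insert(0) += n;
            }
        }
        merged
    }
}
```

A useful property to aim for in such a design is that merging the data of two segments gives the same result as building the feature from all of their documents at once.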

Features are identified by names, and since they are decoupled from fields you can add more than one type of index for a particular field. This means that you, for example, can have a document indexed with different analyzers without having to have separate fields for them, as you would in Lucene.
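As an illustration of features being keyed by name instead of by field, the sketch below registers two indexes on the same field with different analyzers. It is a self-contained toy, not Esmy's schema builder: `Schema`, `Analyzer`, `whitespace`, and `lowercase_words` are hypothetical, and `add_string_index` only mirrors the argument order used in the example above.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// An analyzer turns text into terms. Plain function pointers keep the sketch small.
pub type Analyzer = fn(&str) -> Vec<String>;

/// Split on whitespace and keep the terms unchanged.
pub fn whitespace(text: &str) -> Vec<String> {
    text.split_whitespace().map(str::to_string).collect()
}

/// Split on non-alphanumeric characters and lowercase the terms.
pub fn lowercase_words(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(str::to_lowercase)
        .collect()
}

/// Named index features; several of them may read the same field.
#[derive(Default)]
pub struct Schema {
    // feature name -> (field, analyzer)
    features: BTreeMap<String, (String, Analyzer)>,
}

impl Schema {
    /// Register an index feature called `name` over `field`.
    pub fn add_string_index(mut self, name: &str, field: &str, analyzer: Analyzer) -> Self {
        self.features
            .insert(name.to_string(), (field.to_string(), analyzer));
        self
    }

    /// Terms each feature would index for a field value, keyed by feature name.
    pub fn index_terms(&self, field: &str, value: &str) -> BTreeMap<String, BTreeSet<String>> {
        self.features
            .iter()
            .filter(|(_, (f, _))| f.as_str() == field)
            .map(|(name, (_, analyze))| {
                let terms: BTreeSet<String> = analyze(value).into_iter().collect();
                (name.clone(), terms)
            })
            .collect()
    }
}
```

Since the map key is the feature name, adding a second analyzer for a field is just another entry and needs no duplicate field in the document.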

Another is that Esmy has a more opinionated (but open) view of what a document is. Lucene treats a document as a set of fields at input, but has no notion of a document when reading. This leads to e.g. Elasticsearch emulating one with a JSON structure, storing the JSON as a string field (_source). Since Lucene is not Elasticsearch, Lucene cannot use that _source field. Esmy instead has a notion of a document, with an on-disk data structure for it. This means that Esmy can use the document.

License

This repository is licensed under the Apache License, Version 2.0, with the exception of the data in the data directory, which comes from Wikipedia and is used only for testing purposes.

Additional Information

Version 1.0.0
Type Other source code
Update Time 2024-12-20
size 44.08KB
From Github
