liblevenshtein

Java

A library for generating Finite State Transducers based on Levenshtein Automata.

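Levenshtein transducers accept a query term and return all terms in a dictionary that are within n spelling errors of it. They constitute a highly efficient (in space and time) class of spelling correctors that work very well when you do not need context to make suggestions. Forget about performing a linear scan over your dictionary to find all terms that are sufficiently close to the user's query, using a quadratic implementation of the Levenshtein distance or Damerau-Levenshtein distance. These babies find all such terms in your dictionary in time linear in the length of the query term (not the size of the dictionary, the length of the query term).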
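For contrast, here is a minimal sketch of the linear-scan approach described above, the one the transducers are meant to replace. It is not part of liblevenshtein: the class name `LinearScanBaseline` and its methods are hypothetical, and it uses plain Levenshtein distance (no transpositions) for brevity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical baseline, not part of liblevenshtein: a linear scan over the
// dictionary using the classic quadratic-time Levenshtein distance.
class LinearScanBaseline {

  // Wagner-Fischer dynamic program: O(|a| * |b|) time, O(|b|) space.
  static int levenshtein(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) {
      prev[j] = j; // distance from the empty prefix of a
    }
    for (int i = 1; i <= a.length(); i++) {
      curr[0] = i; // distance to the empty prefix of b
      for (int j = 1; j <= b.length(); j++) {
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        curr[j] = Math.min(
            Math.min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
            prev[j - 1] + cost);                    // substitution or match
      }
      int[] tmp = prev; // swap rows so prev always holds the latest one
      prev = curr;
      curr = tmp;
    }
    return prev[b.length()];
  }

  // Visits every term, so the cost grows with the size of the dictionary.
  static List<String> scan(List<String> dictionary, String query, int maxDistance) {
    List<String> matches = new ArrayList<>();
    for (String term : dictionary) {
      if (levenshtein(query, term) <= maxDistance) {
        matches.add(term);
      }
    }
    return matches;
  }

  public static void main(String[] args) {
    List<String> dictionary =
        List.of("the", "be", "to", "of", "and", "for", "not", "you", "do", "on");
    System.out.println(scan(dictionary, "foo", 2));
  }
}
```

Every query pays for a full pass over the dictionary, with a quadratic distance computation per term. That per-dictionary cost is what the transducer avoids.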

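If you do need context, take the candidates generated by the transducer as a starting point and plug them into whatever model you use for context (for example, by selecting the sequence of terms with the greatest probability of appearing together).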
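As a rough illustration of that idea, the sketch below handles the simplest case: re-ranking candidates for one word given the single word before it. Everything in it is an assumption made for illustration: the class `ContextRanker`, the bigram-count table, and the counts themselves are hypothetical and are not part of liblevenshtein.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the bigram counts and all names here are illustrative
// assumptions, not part of liblevenshtein.
class ContextRanker {

  // Picks the candidate that most often follows previousTerm, according to a
  // table of bigram counts keyed by "previousTerm candidate".
  static String best(String previousTerm, List<String> candidates,
                     Map<String, Integer> bigramCounts) {
    String choice = null;
    int bestCount = -1;
    for (String candidate : candidates) {
      int count = bigramCounts.getOrDefault(previousTerm + " " + candidate, 0);
      if (count > bestCount) { // ties keep the earlier candidate
        choice = candidate;
        bestCount = count;
      }
    }
    return choice; // null when there are no candidates
  }

  public static void main(String[] args) {
    // Made-up counts standing in for a real language model.
    Map<String, Integer> bigramCounts = Map.of(
        "thank you", 120,
        "thank for", 3,
        "thank not", 1);
    // Candidates such as a transducer might return for a misspelled word.
    List<String> candidates = List.of("do", "of", "on", "to", "for", "not", "you");
    System.out.println(best("thank", candidates, bigramCounts));
  }
}
```

A real model would score whole sequences rather than a single pair, but the division of labour is the same: the transducer proposes candidates and the context model chooses among them.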

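For a quick demonstration, visit the project's GitHub Page. There is also a command-line interface, liblevenshtein-java-cli. See its README.md for how to obtain and use it.

The library is currently written in Java, CoffeeScript, and JavaScript, but I will be porting it to other languages soon. If there is a specific language you would like to see it in, or a package-management system you would like it deployed to, let me know.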

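Branches

Branch	Description
master	Latest, development source
release	Latest, release source
release-3.x	Latest, release source for version 3.x
release-2.x	Latest, release source for version 2.x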

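Project Management

Issues are managed on waffle.io, along with a graph of the rate at which I have been closing them.

Please visit Bountysource to pledge your support for ongoing issues.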

Documentation

When it comes to documentation, you have several options:

Wiki
Javadoc
Source Code

Basic Usage

Minimum Java Version

liblevenshtein is developed against Java ≥ 1.8. It will not work with earlier versions.

Installation

Maven

<dependency>
  <groupId>com.github.universal-automata</groupId>
  <artifactId>liblevenshtein</artifactId>
  <version>3.0.0</version>
</dependency>

Apache Buildr

'com.github.universal-automata:liblevenshtein:jar:3.0.0'

Apache Ivy

<dependency org="com.github.universal-automata" name="liblevenshtein" rev="3.0.0" />

Groovy Grape

@Grapes(
  @Grab(group='com.github.universal-automata', module='liblevenshtein', version='3.0.0')
)

Gradle/Grails

compile 'com.github.universal-automata:liblevenshtein:3.0.0'

Scala SBT

libraryDependencies += "com.github.universal-automata" % "liblevenshtein" % "3.0.0"

Leiningen

[com.github.universal-automata/liblevenshtein "3.0.0"]

git

% git clone --progress git@github.com:universal-automata/liblevenshtein-java.git
Cloning into 'liblevenshtein-java'...
remote: Counting objects: 8117, done.
remote: Compressing objects: 100% (472/472), done.
remote: Total 8117 (delta 352), reused 0 (delta 0), pack-reused 7619
Receiving objects: 100% (8117/8117), 5.52 MiB | 289.00 KiB/s, done.
Resolving deltas: 100% (5366/5366), done.
Checking connectivity... done.

% cd liblevenshtein-java
% git pull --progress
Already up-to-date.

% git fetch --progress --tags
% git checkout --progress 3.0.0
Note: checking out '3.0.0'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at 4f0f172... pushd and popd silently

% git submodule init
% git submodule update

Usage

Let's say you have the following content in a plain text file called top-20-most-common-english-words.txt (note that the file has one term per line):

the
be
to
of
and
a
in
that
have
I
it
for
not
on
with
he
as
you
do
at

The following shows one way to query its contents:

import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import com.github.liblevenshtein.collection.dictionary.SortedDawg;
import com.github.liblevenshtein.serialization.PlainTextSerializer;
import com.github.liblevenshtein.serialization.ProtobufSerializer;
import com.github.liblevenshtein.serialization.Serializer;
import com.github.liblevenshtein.transducer.Algorithm;
import com.github.liblevenshtein.transducer.Candidate;
import com.github.liblevenshtein.transducer.ITransducer;
import com.github.liblevenshtein.transducer.factory.TransducerBuilder;

// ...

final SortedDawg dictionary;
final Path dictionaryPath =
  Paths.get("/path/to/top-20-most-common-english-words.txt");
try (final InputStream stream = Files.newInputStream(dictionaryPath)) {
  // The PlainTextSerializer constructor accepts an optional boolean specifying
  // whether the dictionary is already sorted lexicographically, in ascending
  // order.  If it is sorted, then passing true will optimize the construction
  // of the dictionary; you may pass false whether or not the dictionary is
  // sorted (the safe choice if you do not know whether it is sorted).
  final Serializer serializer = new PlainTextSerializer(false);
  dictionary = serializer.deserialize(SortedDawg.class, stream);
}

final ITransducer<Candidate> transducer = new TransducerBuilder()
  .dictionary(dictionary)
  .algorithm(Algorithm.TRANSPOSITION)
  .defaultMaxDistance(2)
  .includeDistance(true)
  .build();

for (final String queryTerm : new String[] {"foo", "bar"}) {
  System.out.println(
    "+-------------------------------------------------------------------------------");
  System.out.printf("| Spelling Candidates for Query Term: \"%s\"%n", queryTerm);
  System.out.println(
    "+-------------------------------------------------------------------------------");
  for (final Candidate candidate : transducer.transduce(queryTerm)) {
    System.out.printf("| d(\"%s\", \"%s\") = [%d]%n",
      queryTerm,
      candidate.term(),
      candidate.distance());
  }
}

// +-------------------------------------------------------------------------------
// | Spelling Candidates for Query Term: "foo"
// +-------------------------------------------------------------------------------
// | d("foo", "do") = [2]
// | d("foo", "of") = [2]
// | d("foo", "on") = [2]
// | d("foo", "to") = [2]
// | d("foo", "for") = [1]
// | d("foo", "not") = [2]
// | d("foo", "you") = [2]
// +-------------------------------------------------------------------------------
// | Spelling Candidates for Query Term: "bar"
// +-------------------------------------------------------------------------------
// | d("bar", "a") = [2]
// | d("bar", "as") = [2]
// | d("bar", "at") = [2]
// | d("bar", "be") = [2]
// | d("bar", "for") = [2]
// ...
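The comment on the PlainTextSerializer constructor says to pass true only when the dictionary is already sorted. Below is a minimal sketch of how you might check that before choosing the flag. The helper `SortedCheck` is hypothetical and not part of liblevenshtein, and it assumes that the ordering given by `String.compareTo` is the lexicographic order the serializer expects, which the text above does not confirm.

```java
import java.util.List;

// Hypothetical helper, not part of liblevenshtein.
class SortedCheck {

  // True when no term sorts before its predecessor under String.compareTo.
  // Assumption: this is the ordering the serializer means by "sorted".
  static boolean isSortedAscending(List<String> terms) {
    for (int i = 1; i < terms.size(); i++) {
      if (terms.get(i - 1).compareTo(terms.get(i)) > 0) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // First four terms of the example file: "the" sorts after "be".
    System.out.println(isSortedAscending(List.of("the", "be", "to", "of")));
    // A sorted selection of terms from the same file.
    System.out.println(isSortedAscending(List.of("a", "and", "as", "at")));
  }
}
```

The example file is ordered by word frequency rather than alphabetically, so a check like this returns false for it, which is consistent with the listing passing false.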

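If you want to serialize your dictionary to a format that is easy to read back later, do the following:

final Path serializedDictionaryPath =
  Paths.get("/path/to/top-20-most-common-english-words.protobuf.bytes");
try (final OutputStream stream = Files.newOutputStream(serializedDictionaryPath)) {
  final Serializer serializer = new ProtobufSerializer();
  serializer.serialize(dictionary, stream);
}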

You can then read the dictionary back later, in much the same way as you read the plain text version:

final SortedDawg deserializedDictionary;
try (final InputStream stream = Files.newInputStream(serializedDictionaryPath)) {
  final Serializer serializer = new ProtobufSerializer();
  deserializedDictionary = serializer.deserialize(SortedDawg.class, stream);
}

Serialization is not restricted to dictionaries; you may also (de)serialize transducers.

Please see the wiki for more details.