PDFLayoutTextStripper

Java source code

v2.2.5

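PDF Layout Text Stripper

Converts a PDF file into a text file while preserving the layout of the original PDF. Useful for extracting content from a table or a form in a PDF file. PDFLayoutTextStripper is a subclass of the PDFTextStripper class (from the Apache PDFBox library).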

Use cases

Extracting data from a table in a PDF file

Extracting data from a form in a PDF file
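Because the layout is preserved, table columns in the extracted text end up separated by runs of spaces. The following is a minimal sketch of one way to turn such text into cells, assuming that two or more consecutive whitespace characters mark a column boundary. `LayoutColumns`, `splitCells`, and `splitRows` are hypothetical names used for illustration; they are not part of PDFLayoutTextStripper or PDFBox, and the sample text is made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFLayoutTextStripper or PDFBox.
final class LayoutColumns {

    // Splits one layout-preserved line into cells wherever two or more
    // consecutive whitespace characters appear; single spaces stay inside a cell.
    static List<String> splitCells(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s{2,}")));
    }

    // Splits every non-blank line of the extracted text into cells.
    static List<List<String>> splitRows(String text) {
        List<List<String>> rows = new ArrayList<>();
        for (String line : text.split("\\R")) {
            List<String> cells = splitCells(line);
            if (!cells.isEmpty()) {
                rows.add(cells);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample standing in for text returned by the stripper.
        String text =
                "Route      Departure    Arrival\n" +
                "\n" +
                "Bus 12     08:15        09:40\n";
        for (List<String> row : splitRows(text)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

This heuristic treats an empty table cell as part of the gap between its neighbours, so tables with blank cells would need a split based on character positions instead.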

How to install

Maven

<dependency>
  <groupId>io.github.jonathanlink</groupId>
  <artifactId>PDFLayoutTextStripper</artifactId>
  <version>2.2.3</version>
</dependency>

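Manual

Install Apache PDFBox manually (the commands below use v2.0.6), along with its two dependencies, commons-logging.jar and fontbox.

Warning: only PDFBox versions 2.0.0 and later are compatible with this version of PDFLayoutTextStripper.java.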

How to use on Linux/Mac

cd PDFLayoutTextStripper
javac -cp .:/pathto/pdfbox-2.0.6.jar:/pathto/commons-logging-1.2.jar:/pathto/PDFLayoutTextStripper/fontbox-2.0.6.jar *.java
java -cp .:/pathto/pdfbox-2.0.6.jar:/pathto/commons-logging-1.2.jar:/pathto/PDFLayoutTextStripper/fontbox-2.0.6.jar test

How to use on Windows

Same as on Linux (see above), but replace : with ; in the classpath.
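The `:` versus `;` difference is the platform's classpath separator, which Java exposes as `File.pathSeparator`. As a small sketch, the classpath from the commands above can be assembled in a way that works on either platform. `ClasspathBuilder` and `join` are hypothetical names for illustration, and the jar paths are the placeholder paths from the commands above.

```java
import java.io.File;

// Hypothetical helper for illustration only.
final class ClasspathBuilder {

    // Joins classpath entries with the given separator (":" or ";").
    static String join(String separator, String... entries) {
        return String.join(separator, entries);
    }

    public static void main(String[] args) {
        // File.pathSeparator is ":" on Linux/Mac and ";" on Windows.
        String classpath = join(File.pathSeparator,
                ".",
                "/pathto/pdfbox-2.0.6.jar",
                "/pathto/commons-logging-1.2.jar",
                "/pathto/PDFLayoutTextStripper/fontbox-2.0.6.jar");
        System.out.println(classpath);
    }
}
```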

Sample code

import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Written against the PDFBox 2.0.x API. The class is named "test" to match the run command above.
// PDFLayoutTextStripper is not imported: this assumes it is compiled alongside this class.
public class test {
    public static void main(String[] args) {
        String string = null;
        // try-with-resources closes the file handle even if parsing fails.
        try (RandomAccessFile source = new RandomAccessFile(new File("./samples/bus.pdf"), "r")) {
            PDFParser pdfParser = new PDFParser(source);
            pdfParser.parse();
            // Wrap the parsed COS document and close it once the text has been extracted.
            try (PDDocument pdDocument = new PDDocument(pdfParser.getDocument())) {
                PDFTextStripper pdfTextStripper = new PDFLayoutTextStripper();
                string = pdfTextStripper.getText(pdDocument);
            }
        } catch (IOException e) {
            // Also covers FileNotFoundException, which is a subclass of IOException.
            e.printStackTrace();
        }
        System.out.println(string);
    }
}
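The sample above prints the extracted text to standard output. To end up with an actual text file, the string can be written to disk with the standard library. This is a minimal sketch: `TextFileWriter` and `save` are hypothetical names, the UTF-8 encoding is an assumption, and the placeholder string stands in for the value returned by `getText`.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of PDFLayoutTextStripper or PDFBox.
final class TextFileWriter {

    // Writes the text to the target path as UTF-8 (an assumed encoding),
    // creating the file or replacing its previous content.
    static void save(String text, Path target) throws IOException {
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Placeholder for the string returned by pdfTextStripper.getText(...).
        String extracted = "Route      Departure    Arrival\n";
        Path target = Files.createTempFile("bus", ".txt");
        save(extracted, target);
        System.out.println("Wrote " + target);
    }
}
```

Writing the bytes in one call keeps the spacing exactly as returned, which matters because the column alignment is carried entirely by whitespace.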