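Processing text in PDF documents takes a lot of preparatory work: handling font encodings, decoding the raw text streams into more usable data and, for scanned PDFs, preprocessing them so that text can be extracted reliably. There is no direct way to extract text from a scanned PDF (a document whose pages are embedded images, also called a non-searchable PDF), especially in Java.
Here I show how to extract text from scanned PDF documents in Java using Apache Tika's OCR support and Tesseract OCR.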
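This works well for some scanned PDFs but fails badly for others, because the scans need to be preprocessed to get better results. Preprocessing a scanned PDF (noise removal, rotation, border removal, rescaling, or even thresholding to enhance the text) is no joke, although you can use OpenCV (JavaCV) for it.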
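To give a feel for two of the preprocessing steps mentioned above (rescaling and thresholding), here is a minimal sketch that uses only the JDK's `java.awt` classes rather than OpenCV. The class name `OcrPreprocess`, the method names, and the idea of a single fixed threshold are my assumptions for illustration; they are not part of this repository, and the sketch does not attempt noise removal, rotation, or border removal.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of the repository.
final class OcrPreprocess {

    /** Rescales the image by the given factor using bilinear interpolation. */
    static BufferedImage rescale(BufferedImage src, double factor) {
        int w = Math.max(1, (int) Math.round(src.getWidth() * factor));
        int h = Math.max(1, (int) Math.round(src.getHeight() * factor));
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = out.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BILINEAR);
        g.drawImage(src, 0, 0, w, h, null);
        g.dispose();
        return out;
    }

    /**
     * Global thresholding: pixels whose luminance is below the threshold
     * become black, all others white. The threshold (0-255) is an assumed,
     * hand-picked value, not something computed from the image.
     */
    static BufferedImage binarize(BufferedImage src, int threshold) {
        BufferedImage out = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of perceived brightness
                int luminance = (299 * r + 587 * g + 114 * b) / 1000;
                out.setRGB(x, y, luminance < threshold ? 0xFF000000 : 0xFFFFFFFF);
            }
        }
        return out;
    }
}
```

A call such as `OcrPreprocess.binarize(OcrPreprocess.rescale(image, 2.0), 128)` could be applied to each page image before OCR. The factor 2.0 and the threshold 128 are only starting points and will need tuning per document.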
My Apache Tika sample code does not preprocess the scanned PDF, so how well it works depends on the nature of the scanned/image-embedded PDF.
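import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// try-with-resources closes the stream even if parsing fails
try (InputStream stream = new FileInputStream(fileName)) {
    Parser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
    Metadata metadata = new Metadata();
    TesseractOCRConfig config = new TesseractOCRConfig();
    PDFParserConfig pdfConfig = new PDFParserConfig();
    pdfConfig.setExtractInlineImages(true); // hand the embedded page images to the OCR parser
    ParseContext parseContext = new ParseContext();
    parseContext.set(TesseractOCRConfig.class, config);
    parseContext.set(PDFParserConfig.class, pdfConfig);
    parseContext.set(Parser.class, parser); // for recursive parsing
    parser.parse(stream, handler, metadata, parseContext);
    // Print extracted text to console or make use of it as appropriate
    System.out.println(handler.toString());
} catch (IOException | SAXException | TikaException e) {
    System.out.println("Error extracting text: " + e.getMessage());
}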
See the ExtractTextFromScannedPdf.java class for details.
This prompted me to give Tesseract a try on its own.
-Extract the images from the scanned PDF using the PDFBox library
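import java.awt.image.BufferedImage;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

// PDDocument.load is the PDFBox 2.x API; try-with-resources closes the document
try (PDDocument doc = PDDocument.load(pdfFile)) {
    numberOfPages = doc.getNumberOfPages();
    for (PDPage page : doc.getPages()) {
        PDResources resources = page.getResources();
        if (resources == null) {
            continue; // a page without resources has no images
        }
        for (COSName xObjectName : resources.getXObjectNames()) {
            PDXObject xObject = resources.getXObject(xObjectName);
            if (xObject instanceof PDImageXObject) {
                // Decode the embedded image and add it to the list
                BufferedImage bufferedImage = ((PDImageXObject) xObject).getImage();
                bufferedImages.add(bufferedImage);
                images++;
            }
        }
    }
}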
I store the BufferedImages in a LinkedList so that they can be retrieved sequentially, in FIFO order (the order in which they were extracted).
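import net.sourceforge.tess4j.TesseractException;
import net.sourceforge.tess4j.util.ImageHelper;

// Run Tesseract OCR on a single image, converted to grayscale first
BufferedImage grayImage = ImageHelper.convertImageToGrayscale(image);
String ocrResults = null;
try {
    // Collapse runs of blank lines into a single newline
    ocrResults = tesseract.doOCR(grayImage).replaceAll("\n{2,}", "\n");
} catch (TesseractException e) {
    e.printStackTrace();
}
if (ocrResults == null || ocrResults.trim().length() == 0) {
    return null; // OCR failed or found no text
}
ocrResults = ocrResults.trim();
// TODO remove the trash that doesn't seem to be words
return ocrResults;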
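The TODO above (removing trash that does not look like words) is left open in the sample code. As a rough idea of what such a filter could look like, here is a line-level heuristic sketch. The class `OcrNoiseFilter`, its method names, and both thresholds (a run of at least 3 letters, and at least half of the non-space characters being letters or digits) are assumptions I picked for illustration, not something the repository implements.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the repository.
final class OcrNoiseFilter {

    /**
     * Heuristic: a line counts as text if it contains at least one run of
     * three or more consecutive letters and at least half of its
     * non-whitespace characters are letters or digits.
     */
    static boolean looksLikeText(String line) {
        int nonSpace = 0;
        int alphanumeric = 0;
        int run = 0;
        int longestRun = 0;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (Character.isWhitespace(c)) {
                run = 0;
                continue;
            }
            nonSpace++;
            if (Character.isLetterOrDigit(c)) {
                alphanumeric++;
            }
            if (Character.isLetter(c)) {
                run++;
                longestRun = Math.max(longestRun, run);
            } else {
                run = 0;
            }
        }
        return longestRun >= 3 && alphanumeric * 2 >= nonSpace;
    }

    /** Drops every line that fails the heuristic and rejoins the rest. */
    static String removeNoiseLines(String ocrText) {
        List<String> kept = new ArrayList<>();
        for (String line : ocrText.split("\n")) {
            if (looksLikeText(line)) {
                kept.add(line.trim());
            }
        }
        return String.join("\n", kept);
    }
}
```

A filter like this only catches lines made of stray symbols and isolated letters. Garbled lines that still contain long letter runs pass through, and very short legitimate lines would be dropped, so treat the thresholds as something to tune against your own documents.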
-Append the text extracted from each image of the scanned PDF to get the final result
StringBuilder extractedText = new StringBuilder();
// Images embedded in the scanned PDF, in extraction order
LinkedList<BufferedImage> bufferedImageList = checkScannedPdf(file);
for (BufferedImage image : bufferedImageList) {
    // Correct the skew of the image before running OCR on it
    BufferedImage deskewedImage = correctSkewness(image);
    String text = extractTextFromImage(deskewedImage);
    if (text != null) {
        extractedText.append(text);
    }
}
return extractedText.toString();
See the TesseractScannedPdfTextExtraction.java class for details.
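One thing to watch when appending per-image results: each result is trimmed and then appended with no separator, so the last line of one image runs straight into the first line of the next. A minimal sketch of one way to keep image boundaries visible is to join the texts with a blank line instead. The class `PageTextJoiner` and the method `joinPages` are hypothetical names used only for this illustration.

```java
import java.util.List;
import java.util.StringJoiner;

// Hypothetical helper, not part of the repository.
final class PageTextJoiner {

    /**
     * Joins the OCR text of each image with a blank line in between,
     * skipping images for which OCR returned null or only whitespace.
     */
    static String joinPages(List<String> pageTexts) {
        StringJoiner joiner = new StringJoiner("\n\n");
        for (String text : pageTexts) {
            if (text != null && !text.trim().isEmpty()) {
                joiner.add(text.trim());
            }
        }
        return joiner.toString();
    }
}
```

With this approach the per-image strings would be collected into a list first and joined once at the end, rather than appended to a StringBuilder inside the loop.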
Input PDF: https://github.com/fraponyo94/Text-Extraction-Scanned-Pdf/blob/master/sample-scanned-pdfs/PublicWaterMassMailing.pdf
Extracted text:
< S > £
& 2 ( e
Missouri Department of Health and Senior Services ( " ’ )
(0] 5 FEPB P.0. Box 570, Jefferson City, MO 65102-0570 Phone: 573-751-6400 FAX: 573-751-6010 S
R é‘g RELAY MISSOURI for Hearing and Speech Impaired 1-800-735-2966_VOICE 1-800-735-2466 M
& Peter Lyskowski Jeremiah W. (Jay) Ni
Rz Lo o B e
Missouri Public Water Systems
November 10, 2015
Dear Public Water System Owners/Operators:
The Missouri State Public Health Laboratory (MSPHL) is in the process of implementing a new
Laboratory Information Management System (LIMS) in its drinking water bacteriology testing
laboratory. The OpenELIS (OE) LIMS will provide the laboratory with improved sample management
capability, improved data integrity and reduced potential for human data entry error. In addition, the
system will provide improved reporting capabilities, including direct electronic data exchange with the
Missouri Department of Natural Resources’ (MDNR) Safe Drinking Water Information System
(SDWIS). SDWIS is the computer system MDNR uses to store regulatory water testing data and report
testing results to you and the U.S. Environmental Protection Agency. In addition, the new OE LIMS will
provide a web portal that MSPHL clients can use to access their own test results in real time.
As the MSPHL implements this new computer system, several changes will be made in the way you
collect and submit water samples for testing. This letter and information packet will provide you with
information to help educate you on these changes.
NEW SAMPLE BOTTLES:
Beginning in August 2015, the MSPHL began using a larger sample bottle for water bacterial testing.
This bottle has a shrink wrap seal and two lines to indicate the proper sample volume. Please read the
attached “SAMPLE COLLECTION INSTRUCTIONS?” for details on how to use these new bottles.
Sample volume MUST be within the two lines on the bottle (100 — 120 mL) to be acceptable for
testing. You may continue to use your old bottles until the MSPHL can ship you new ones. Once you
have received the new bottles, please discard or recycle the old bottles.
NEW SAMPLE INFORMATION FORMS:
The traditional sample information “card™ that has been used for more than twenty years is being
replaced by the Environmental Sample Collection Form. An example form is attached. Please read the
attached instructions for information on properly completing the new form.
Changes to the form include the following:
1. Form size is expanded to a single 8 4 " x 117 sheet of paper. The form is no longer in a triplicate
carbon copy format. You may choose to photocopy for your records if you prefer. Note : MDNR
does not require a public water system to retain copies of sample collection forms ; however , you
might utilize them for system inspections.
2 . The form is printed by the OE LIMS and will be pre - populated with your Public Water Supply
ID number , PWS name , address and county. Forms should not be shared with other supplies.
www.health.mo.gov
Healthy Missourians for life.
The Missouri Department of Health and Senior Services will be the leader in promoting , protecting and partnering for health ,
AN EQUAL OPPORTUNITY / AFFIRMATIVE ACTION EMPLOYER : Services provided on a nondiscriminatory basis.
sR
Contract operators will be provided with forms for all the supplies they operate. Blank forms will
be available for MDNR Regional Office staff use.
3 . The form requires all requested information to be printed by the collector. There are no longer
check boxes for Sample Type or Repeat Location.
4 . Facility ID , Sample Collection Point ID and Location for the sampling site MUST be
provided by the collector. This information is available from your MDNR approved PWS
sampling plan. MDNR will be providing all public water systems with a current copy of their
approved sampling plan. This information is required by SDWIS and is used by MDNR to
ensure regulatory compliance requirements have been met. Failure to complete this information
on the sample collection form may result in a non - compliance report from MDNR.
5 . A Collector Signature line has been added. The sample collector must sign the form to attest the
information provided is accurate to the best of their knowledge.
The MSPHL will begin shipping the new forms to public water systems in late November or early
December. Please begin using the new forms December 16 , 2015 . Discard all the old forms (“ cards ™)
at that time.
NEW SAMPLE INSTRUCTIONS :
Sample instructions have been revised to include changes to the bottle and sampling form. The
instructions include detailed information on how to collect the sample using the new bottle , how to
complete the new sample collection form , how to best ship samples to the MSPHL using the free
MSPHL courier system , and how to register for the new MSPHL web portal. A copy of these
instructions is attached.
NEW WEB PORTAL FOR RESULTS REPORTS
The OE LIMS provides a web portal that may be used by systems to view and print their test result
reports , check status of samples , download sample information into Excel , and receive automated emails
when samples are received at the laboratory , and when sample results are ready to be viewed. For
information on how to gain access to this portal , please contact Shondra Johnson , LIMS Administrator
at Shondra.Johnson @ health.mo.gov or at 573 - 751 - 3334 .
IMPLEMENTATION DATES :
The MSPHL intends to implement the OpenELIS LIMS on December 1 , 2015 . There will be a two
week testing period in which laboratory staff will run the new LIMS in conjunction with our current
manual , paper - based system to ensure the OE LIMS is operating properly. You may continue to submit
samples as you currently do , using the old sample information card , throughout this time.
On December 16 , 2015 , the MSPHL plans to “ go - live ” with the new OE LIMS. Samples submitted
after that date should be submitted on the new Environmental Sample Collection Form. At that time , the
MSPHL Test Results Web Portal will also be available to those systems that have been granted access.
The MSPHL and MDNR understand that there will be a lot of changes to a system that has been in place
for many years. The MSPHL is excited about the added benefits from this new system , and we ask for
your patience as we implement the OpenELIS LIMS at the Missouri State Public Health Laboratory.
LI
If you have any questions , please contact the MSPHL Environmental Bacteriology Unit at 573 - 751 -
3334 . You may also contact your MDNR Regional Office for additional information on sample
collection.
Once again , thank you for your patience and understanding as we implement these changes.
Pttt R Hoamses.
Patrick R. Shannon
Manager , Environmental Bacteriology Unit
Missouri Department of Health and Senior Services
State Public Health Laboratory
101 North Chestnut St.
P.O. Box 570
Jefferson City , MO 65102
Phone : 573 - 751 - 3334
Email : Pat.Shannon @ health.mo.gov
Web : www.health.mo.gov / LabOrder # : 984 [T REPORT TO: BILL TO:
Pages in Order : 1of 1 65 i 82 i
Containers in Order : 1 ADRIAN MO DEPARTMENT OF NATURAL RESOURCES
16 E 5TH ST 1101 RIVERSIDE DRIVE
ADRIAN , MO 64720 JEFFERSON CITY , MO 65102
Requested Analyses / Tests
PUBLIC DRINKING WATER BACTERIAL ANALYSIS
Total Coliform Bacteria and E. coli ( Present / Absent Test )
£ PRINT LEGIBLY. Instructions for completing form are supplied in the Collection Kit. For compliance monitoring questions , contact the
o Missouri Department of Natur