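Working with text in PDF documents takes a fair amount of groundwork: handling font encodings, decoding raw text data streams into something more useful, and, for scanned PDFs, preprocessing so that text can be extracted reliably. In Java in particular, there is no simple way to extract text from scanned PDF documents (documents whose pages are embedded images), often called non-searchable PDFs.
Here I show how to extract text from scanned PDF documents in Java using the Apache Tika OCR engine and Tesseract OCR.
This works well on some scanned PDFs but fails badly on others, because good results need a preprocessed scan. Preprocessing a PDF (noise removal, rotation, border removal, rescaling, or thresholding to sharpen the text) is no joke. OpenCV (JavaCV) can also be used for it.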
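To make one of the preprocessing steps above concrete, here is a minimal sketch of thresholding (binarization) with Otsu's method using only the JDK. It is not part of this repository: the class name `BinarizeSketch` and its methods are hypothetical, and a real pipeline would more likely use OpenCV/JavaCV and combine this with denoising and deskewing.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of this repository: binarizes a page image.
class BinarizeSketch {

    /** Luminance (0-255) of a packed RGB pixel, using the usual Rec. 601 weights. */
    static int luminance(int rgb) {
        int r = (rgb >> 16) & 0xFF;
        int g = (rgb >> 8) & 0xFF;
        int b = rgb & 0xFF;
        return (int) Math.round(0.299 * r + 0.587 * g + 0.114 * b);
    }

    /** Otsu's method: the threshold that maximizes the between-class variance. */
    static int otsuThreshold(BufferedImage img) {
        int[] hist = new int[256];
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                hist[luminance(img.getRGB(x, y))]++;
            }
        }
        long total = (long) img.getWidth() * img.getHeight();
        double sumAll = 0;
        for (int i = 0; i < 256; i++) {
            sumAll += (double) i * hist[i];
        }
        double sumDark = 0;
        long weightDark = 0;
        double bestVariance = -1;
        int best = 0;
        for (int t = 0; t < 256; t++) {
            weightDark += hist[t];
            sumDark += (double) t * hist[t];
            if (weightDark == 0) continue;      // no pixels at or below t yet
            long weightLight = total - weightDark;
            if (weightLight == 0) break;        // every pixel is at or below t
            double meanDark = sumDark / weightDark;
            double meanLight = (sumAll - sumDark) / weightLight;
            double diff = meanDark - meanLight;
            double variance = (double) weightDark * weightLight * diff * diff;
            if (variance > bestVariance) {
                bestVariance = variance;
                best = t;
            }
        }
        return best;
    }

    /** Pixels at or below the Otsu threshold become black, the rest white. */
    static BufferedImage binarize(BufferedImage img) {
        int threshold = otsuThreshold(img);
        BufferedImage out = new BufferedImage(
                img.getWidth(), img.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                boolean dark = luminance(img.getRGB(x, y)) <= threshold;
                out.setRGB(x, y, dark ? 0xFF000000 : 0xFFFFFFFF);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Synthetic two-tone image standing in for a scanned page.
        BufferedImage page = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < 4; y++) {
            for (int x = 0; x < 4; x++) {
                page.setRGB(x, y, x < 2 ? 0x323232 : 0xC8C8C8);
            }
        }
        System.out.println("Otsu threshold: " + otsuThreshold(page));
    }
}
```

A global threshold like this tends to help on evenly lit scans; pages with shadows or uneven lighting usually need an adaptive (local) threshold instead.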
The Apache Tika sample code below does not preprocess the scanned PDF, so whether it works depends on the characteristics of the scanned/image-based PDF.
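import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

// try-with-resources closes the stream even if parsing fails
try (InputStream stream = new FileInputStream(fileName)) {
    Parser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
    Metadata metadata = new Metadata();
    TesseractOCRConfig config = new TesseractOCRConfig();
    PDFParserConfig pdfConfig = new PDFParserConfig();
    pdfConfig.setExtractInlineImages(true); // extract embedded images so they can be OCRed
    ParseContext parseContext = new ParseContext();
    parseContext.set(TesseractOCRConfig.class, config);
    parseContext.set(PDFParserConfig.class, pdfConfig);
    parseContext.set(Parser.class, parser); // for recursive parsing
    parser.parse(stream, handler, metadata, parseContext);
    // Print extracted text to console or make use of it as appropriate
    System.out.println(handler.toString());
}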
See the ExtractTextFromScannedPdf.java class for details.
This led me to try Tesseract directly.
- Extract the images from the scanned PDF using the PDFBox library.
// PDFBox 2.x API; try-with-resources closes the document even if extraction fails
try (PDDocument doc = PDDocument.load(pdfFile)) {
    PDPageTree list = doc.getPages();
    numberOfPages = doc.getNumberOfPages();
    for (PDPage page : list) {
        PDResources resource = page.getResources();
        for (COSName xObjectName : resource.getXObjectNames()) {
            PDXObject xObject = resource.getXObject(xObjectName);
            // Only image XObjects are of interest; skip forms and other types
            if (xObject instanceof PDImageXObject) {
                PDImageXObject image = (PDImageXObject) xObject;
                BufferedImage bufferedImage = image.getImage();
                // Add bufferedImages to list
                bufferedImages.add(bufferedImage);
                images++;
            }
        }
    }
}
I use a LinkedList to store the BufferedImages so they can be retrieved sequentially in FIFO order.
- Run Tesseract OCR on each extracted image.
// Grayscale input generally gives Tesseract cleaner results than color
BufferedImage grayImage = ImageHelper.convertImageToGrayscale(image);
String ocrResults = null;
try {
    // Collapse runs of blank lines into a single newline
    ocrResults = tesseract.doOCR(grayImage).replaceAll("\\n{2,}", "\n");
} catch (TesseractException e) {
    e.printStackTrace();
}
if (ocrResults == null || ocrResults.trim().length() == 0) {
    return null;
}
ocrResults = ocrResults.trim();
// TODO remove the trash that doesn't seem to be words
return ocrResults;
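The TODO above is left open in the code. One possible way to approach it, sketched here under my own assumptions, is a line-level heuristic: keep a line only if most of its non-whitespace characters are letters or digits and it contains at least one reasonably long run of letters. The class `OcrNoiseFilter`, its method names, and the threshold values are hypothetical and would need tuning against real scans.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of this repository: drops OCR lines that look like noise.
class OcrNoiseFilter {

    /** Fraction of non-whitespace characters that are letters or digits. */
    static double alphanumericRatio(String line) {
        int total = 0;
        int alnum = 0;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (Character.isWhitespace(c)) continue;
            total++;
            if (Character.isLetterOrDigit(c)) alnum++;
        }
        return total == 0 ? 0.0 : (double) alnum / total;
    }

    /** True if the line contains a run of at least minLength consecutive letters. */
    static boolean hasWord(String line, int minLength) {
        int run = 0;
        for (int i = 0; i < line.length(); i++) {
            run = Character.isLetter(line.charAt(i)) ? run + 1 : 0;
            if (run >= minLength) return true;
        }
        return false;
    }

    /** Keeps only lines that pass both checks; kept lines are joined with '\n'. */
    static String dropNoisyLines(String text, double minRatio, int minWordLength) {
        List<String> kept = new ArrayList<>();
        for (String line : text.split("\\R")) {
            if (alphanumericRatio(line) >= minRatio && hasWord(line, minWordLength)) {
                kept.add(line);
            }
        }
        return String.join("\n", kept);
    }

    public static void main(String[] args) {
        String ocr = "& 2 ( e\nMissouri Public Water Systems\nRz Lo o B e";
        System.out.println(dropNoisyLines(ocr, 0.6, 3));
    }
}
```

A filter like this removes fragments produced by logos and seals, but it cannot catch garbled lines that still look word-like (a misread signature, for example), and it would also drop legitimate very short lines.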
- Append the text extracted from each image of the scanned PDF to build the final result.
StringBuilder extractedText = new StringBuilder();
// checkScannedPdf returns the page images in document order
LinkedList<BufferedImage> bufferedImageList = checkScannedPdf(file);
if (!bufferedImageList.isEmpty()) {
    for (BufferedImage image : bufferedImageList) {
        // Straighten the page before OCR
        BufferedImage deskewedImage = correctSkewness(image);
        String text = extractTextFromImage(deskewedImage);
        if (text != null) {
            extractedText.append(text);
        }
    }
}
return extractedText.toString();
See the TesseractScannedPdfTextExtraction.java class for details.
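The loop above calls correctSkewness, whose body is not shown here. As a rough illustration of the rotation half of such a step, the sketch below rotates an image about its center with plain Java 2D. It assumes the skew angle has already been estimated elsewhere; the class `DeskewSketch`, the method name, and the sign convention (a positive angle means the page appears rotated clockwise) are my own assumptions and not necessarily what the repository does.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of this repository: undoes a known skew angle.
class DeskewSketch {

    /**
     * Rotates src about its center by -skewAngleDegrees. The canvas size is
     * kept, and corners exposed by the rotation are filled with white.
     */
    static BufferedImage rotateToCorrectSkew(BufferedImage src, double skewAngleDegrees) {
        BufferedImage out = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D g = out.createGraphics();
        try {
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, out.getWidth(), out.getHeight());
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            // Rotate around the image center, in the direction opposite to the skew
            g.rotate(Math.toRadians(-skewAngleDegrees),
                    src.getWidth() / 2.0, src.getHeight() / 2.0);
            g.drawImage(src, 0, 0, null);
        } finally {
            g.dispose();
        }
        return out;
    }

    public static void main(String[] args) {
        // Synthetic all-black image standing in for a skewed page.
        BufferedImage page = new BufferedImage(100, 100, BufferedImage.TYPE_INT_RGB);
        BufferedImage fixed = rotateToCorrectSkew(page, 2.5);
        System.out.println(fixed.getWidth() + "x" + fixed.getHeight());
    }
}
```

Keeping the canvas size means content near the corners can be clipped at large angles; for the few degrees of skew typical of scans this is usually acceptable.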
Input PDF: https://github.com/fraponyo94/Text-Extraction-Scanned-Pdf/blob/master/sample-scanned-pdfs/PublicWaterMassMailing.pdf
Extracted text
< S > £
& 2 ( e
Missouri Department of Health and Senior Services ( " ’ )
(0] 5 FEPB P.0. Box 570, Jefferson City, MO 65102-0570 Phone: 573-751-6400 FAX: 573-751-6010 S
R é‘g RELAY MISSOURI for Hearing and Speech Impaired 1-800-735-2966_VOICE 1-800-735-2466 M
& Peter Lyskowski Jeremiah W. (Jay) Ni
Rz Lo o B e
Missouri Public Water Systems
November 10, 2015
Dear Public Water System Owners/Operators:
The Missouri State Public Health Laboratory (MSPHL) is in the process of implementing a new
Laboratory Information Management System (LIMS) in its drinking water bacteriology testing
laboratory. The OpenELIS (OE) LIMS will provide the laboratory with improved sample management
capability, improved data integrity and reduced potential for human data entry error. In addition, the
system will provide improved reporting capabilities, including direct electronic data exchange with the
Missouri Department of Natural Resources’ (MDNR) Safe Drinking Water Information System
(SDWIS). SDWIS is the computer system MDNR uses to store regulatory water testing data and report
testing results to you and the U.S. Environmental Protection Agency. In addition, the new OE LIMS will
provide a web portal that MSPHL clients can use to access their own test results in real time.
As the MSPHL implements this new computer system, several changes will be made in the way you
collect and submit water samples for testing. This letter and information packet will provide you with
information to help educate you on these changes.
NEW SAMPLE BOTTLES:
Beginning in August 2015, the MSPHL began using a larger sample bottle for water bacterial testing.
This bottle has a shrink wrap seal and two lines to indicate the proper sample volume. Please read the
attached “SAMPLE COLLECTION INSTRUCTIONS?” for details on how to use these new bottles.
Sample volume MUST be within the two lines on the bottle (100 — 120 mL) to be acceptable for
testing. You may continue to use your old bottles until the MSPHL can ship you new ones. Once you
have received the new bottles, please discard or recycle the old bottles.
NEW SAMPLE INFORMATION FORMS:
The traditional sample information “card™ that has been used for more than twenty years is being
replaced by the Environmental Sample Collection Form. An example form is attached. Please read the
attached instructions for information on properly completing the new form.
Changes to the form include the following:
1. Form size is expanded to a single 8 4 " x 117 sheet of paper. The form is no longer in a triplicate
carbon copy format. You may choose to photocopy for your records if you prefer. Note : MDNR
does not require a public water system to retain copies of sample collection forms ; however , you
might utilize them for system inspections.
2 . The form is printed by the OE LIMS and will be pre - populated with your Public Water Supply
ID number , PWS name , address and county. Forms should not be shared with other supplies.
www.health.mo.gov
Healthy Missourians for life.
The Missouri Department of Health and Senior Services will be the leader in promoting , protecting and partnering for health ,
AN EQUAL OPPORTUNITY / AFFIRMATIVE ACTION EMPLOYER : Services provided on a nondiscriminatory basis.
sR
Contract operators will be provided with forms for all the supplies they operate. Blank forms will
be available for MDNR Regional Office staff use.
3 . The form requires all requested information to be printed by the collector. There are no longer
check boxes for Sample Type or Repeat Location.
4 . Facility ID , Sample Collection Point ID and Location for the sampling site MUST be
provided by the collector. This information is available from your MDNR approved PWS
sampling plan. MDNR will be providing all public water systems with a current copy of their
approved sampling plan. This information is required by SDWIS and is used by MDNR to
ensure regulatory compliance requirements have been met. Failure to complete this information
on the sample collection form may result in a non - compliance report from MDNR.
5 . A Collector Signature line has been added. The sample collector must sign the form to attest the
information provided is accurate to the best of their knowledge.
The MSPHL will begin shipping the new forms to public water systems in late November or early
December. Please begin using the new forms December 16 , 2015 . Discard all the old forms (“ cards ™)
at that time.
NEW SAMPLE INSTRUCTIONS :
Sample instructions have been revised to include changes to the bottle and sampling form. The
instructions include detailed information on how to collect the sample using the new bottle , how to
complete the new sample collection form , how to best ship samples to the MSPHL using the free
MSPHL courier system , and how to register for the new MSPHL web portal. A copy of these
instructions is attached.
NEW WEB PORTAL FOR RESULTS REPORTS
The OE LIMS provides a web portal that may be used by systems to view and print their test result
reports , check status of samples , download sample information into Excel , and receive automated emails
when samples are received at the laboratory , and when sample results are ready to be viewed. For
information on how to gain access to this portal , please contact Shondra Johnson , LIMS Administrator
at Shondra.Johnson @ health.mo.gov or at 573 - 751 - 3334 .
IMPLEMENTATION DATES :
The MSPHL intends to implement the OpenELIS LIMS on December 1 , 2015 . There will be a two
week testing period in which laboratory staff will run the new LIMS in conjunction with our current
manual , paper - based system to ensure the OE LIMS is operating properly. You may continue to submit
samples as you currently do , using the old sample information card , throughout this time.
On December 16 , 2015 , the MSPHL plans to “ go - live ” with the new OE LIMS. Samples submitted
after that date should be submitted on the new Environmental Sample Collection Form. At that time , the
MSPHL Test Results Web Portal will also be available to those systems that have been granted access.
The MSPHL and MDNR understand that there will be a lot of changes to a system that has been in place
for many years. The MSPHL is excited about the added benefits from this new system , and we ask for
your patience as we implement the OpenELIS LIMS at the Missouri State Public Health Laboratory.
LI
If you have any questions , please contact the MSPHL Environmental Bacteriology Unit at 573 - 751 -
3334 . You may also contact your MDNR Regional Office for additional information on sample
collection.
Once again , thank you for your patience and understanding as we implement these changes.
Pttt R Hoamses.
Patrick R. Shannon
Manager , Environmental Bacteriology Unit
Missouri Department of Health and Senior Services
State Public Health Laboratory
101 North Chestnut St.
P.O. Box 570
Jefferson City , MO 65102
Phone : 573 - 751 - 3334
Email : Pat.Shannon @ health.mo.gov
Web : www.health.mo.gov / LabOrder # : 984 [T REPORT TO: BILL TO:
Pages in Order : 1of 1 65 i 82 i
Containers in Order : 1 ADRIAN MO DEPARTMENT OF NATURAL RESOURCES
16 E 5TH ST 1101 RIVERSIDE DRIVE
ADRIAN , MO 64720 JEFFERSON CITY , MO 65102
Requested Analyses / Tests
PUBLIC DRINKING WATER BACTERIAL ANALYSIS
Total Coliform Bacteria and E. coli ( Present / Absent Test )
£ PRINT LEGIBLY. Instructions for completing form are supplied in the Collection Kit. For compliance monitoring questions , contact the
o Missouri Department of Natur