Working with text in PDF documents takes a lot of preparatory work: handling font encodings, decoding raw text data streams into something more usable and, for scanned PDFs, preprocessing them so that text can be extracted reliably. There is no straightforward way to extract text from scanned PDF documents (images embedded in a document), otherwise called non-searchable PDFs, especially when working with Java.
Here I show how you can extract text from a scanned PDF document in Java using Apache Tika OCR and the Tesseract OCR engine.
This works well for some scanned PDFs and fails terribly on others, because it needs preprocessed scanned PDFs for best results. Preprocessing PDFs (noise removal, rotation, border removal, resizing, or even sharpening text edges) is no joke, although you can use OpenCV (JavaCV) for it.
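Two of the simpler preprocessing steps, resizing and converting to black and white, can be sketched with nothing but `java.awt`. This is only an illustration, not what this repository does: the class name `ScanPreprocessor`, the scale factor, and the fixed threshold are all assumptions, and real scans usually need an adaptive threshold plus the noise and skew handling that OpenCV provides.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of this repository.
final class ScanPreprocessor {

    /** Scales the image by the given factor using bicubic interpolation. */
    static BufferedImage upscale(BufferedImage src, double factor) {
        int w = Math.max(1, (int) Math.round(src.getWidth() * factor));
        int h = Math.max(1, (int) Math.round(src.getHeight() * factor));
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = out.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BICUBIC);
        g.drawImage(src, 0, 0, w, h, null);
        g.dispose();
        return out;
    }

    /** Converts to black and white: pixels darker than the threshold become black. */
    static BufferedImage binarize(BufferedImage src, int threshold) {
        int w = src.getWidth();
        int h = src.getHeight();
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of perceived brightness (0..255).
                int luma = (299 * r + 587 * g + 114 * b) / 1000;
                out.setRGB(x, y, luma < threshold ? 0xFF000000 : 0xFFFFFFFF);
            }
        }
        return out;
    }
}
```

A call such as `ScanPreprocessor.binarize(ScanPreprocessor.upscale(image, 2.0), 128)` could be applied to each page image before OCR; both the factor 2.0 and the threshold 128 are guesses that would need tuning per document.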
My sample Apache Tika code does not preprocess the scanned PDF. It works selectively, depending on the nature of your scanned PDF/embedded image.
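import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// try-with-resources closes the stream even if parsing fails
try (InputStream stream = new FileInputStream(fileName)) {
    Parser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
    Metadata metadata = new Metadata();
    TesseractOCRConfig config = new TesseractOCRConfig();
    PDFParserConfig pdfConfig = new PDFParserConfig();
    pdfConfig.setExtractInlineImages(true);
    ParseContext parseContext = new ParseContext();
    parseContext.set(TesseractOCRConfig.class, config);
    parseContext.set(PDFParserConfig.class, pdfConfig);
    parseContext.set(Parser.class, parser); // for recursive parsing
    parser.parse(stream, handler, metadata, parseContext);
    // Print extracted text to console or make use of it as appropriate
    System.out.println(handler.toString());
} catch (IOException | SAXException | TikaException e) {
    System.out.println("Error extracting text: " + e.getMessage());
}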
See the ExtractTextFromScannedPdf.java class for details.
This led me to experiment with Tesseract.
-Use the PDFBox library to extract images from the scanned PDF
// PDFBox 2.x API; numberOfPages, images and bufferedImages are fields of the enclosing class
PDDocument doc = PDDocument.load(pdfFile);
PDPageTree list = doc.getPages();
numberOfPages = doc.getNumberOfPages();
for (PDPage page : list) {
    PDResources resource = page.getResources();
    // Walk every XObject on the page and keep only the images
    for (COSName xObjectName : resource.getXObjectNames()) {
        PDXObject xObject = resource.getXObject(xObjectName);
        if (xObject instanceof PDImageXObject) {
            PDImageXObject image = (PDImageXObject) xObject;
            BufferedImage bufferedImage = image.getImage();
            // Add bufferedImages to list
            bufferedImages.add(bufferedImage);
            images++;
        }
    }
}
doc.close();
I use a LinkedList to store the BufferedImages so they can be retrieved sequentially in FIFO order.
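-Use Tesseract (via Tess4J) to extract text from each image
// Body of extractTextFromImage(BufferedImage image); tesseract is an ITesseract instance
BufferedImage grayImage = ImageHelper.convertImageToGrayscale(image);
String ocrResults = null;
try {
    // Collapse runs of blank lines into a single line break
    ocrResults = tesseract.doOCR(grayImage).replaceAll("\n{2,}", "\n");
} catch (TesseractException e) {
    e.printStackTrace();
}
if (ocrResults == null || ocrResults.trim().length() == 0) {
    return null;
}
ocrResults = ocrResults.trim();
// TODO remove the trash that doesn't seem to be words
return ocrResults;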
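The TODO in the OCR step is left open in the code above. One possible way to approach it, shown here purely as a sketch, is to drop any line that contains no run of at least three letters. The class `OcrCleanup`, its method names, and the three-letter cutoff are all assumptions made for illustration; the heuristic is crude and will still let through garbage that happens to look word-like.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of this repository.
final class OcrCleanup {

    // A "word" here is a run of at least three letters; the cutoff is an assumption.
    private static final Pattern WORD = Pattern.compile("\\p{L}{3,}");

    /** Returns true if the line contains at least one word-like run of letters. */
    static boolean looksLikeText(String line) {
        return WORD.matcher(line).find();
    }

    /** Keeps only the lines that look like text, trimmed and joined with line breaks. */
    static String dropNoiseLines(String ocrText) {
        List<String> kept = new ArrayList<>();
        for (String line : ocrText.split("\\R")) {
            String trimmed = line.trim();
            if (looksLikeText(trimmed)) {
                kept.add(trimmed);
            }
        }
        return String.join("\n", kept);
    }
}
```

Applied to OCR output, this would discard lines made up only of stray symbols and one- or two-letter fragments while keeping ordinary sentences. Short but legitimate lines (for example a bare number) would be lost too, so whether that trade-off is acceptable depends on the documents.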
-Append the text extracted from each image of the scanned PDF to get the final result
StringBuilder extractedText = new StringBuilder();
LinkedList<BufferedImage> bufferedImageList = checkScannedPdf(file);
if (!bufferedImageList.isEmpty()) {
    for (BufferedImage image : bufferedImageList) {
        BufferedImage deskewedImage = correctSkewness(image);
        String text = extractTextFromImage(deskewedImage);
        if (text != null) {
            extractedText.append(text);
        }
    }
}
return extractedText.toString();
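Because each image's text is trimmed and then appended directly, the last line of one page and the first line of the next end up on the same line. If that matters for your use case, a small variation is to join the per-image results with a separator. The sketch below assumes a hypothetical helper `PageTextJoiner` and a blank line as the separator; neither comes from this repository.

```java
import java.util.List;
import java.util.StringJoiner;

// Hypothetical helper, not part of this repository.
final class PageTextJoiner {

    /** Joins non-empty page texts with a blank line between pages. */
    static String joinPages(List<String> pageTexts) {
        StringJoiner joiner = new StringJoiner("\n\n");
        for (String text : pageTexts) {
            // Skip pages where OCR produced nothing (null or whitespace only)
            if (text != null && !text.trim().isEmpty()) {
                joiner.add(text.trim());
            }
        }
        return joiner.toString();
    }
}
```

To use it, collect the result of each `extractTextFromImage` call in a list instead of appending to the `StringBuilder`, then call `PageTextJoiner.joinPages` on that list.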
See the TesseractScannedPdfTextExtraction.java class for details.
Input PDF: https://github.com/fraponyo94/Text-Extraction-Scanned-Pdf/blob/master/sample-scanned-pdfs/PublicWaterMassMailing.pdf
Extracted text:
< S > £
& 2 ( e
Missouri Department of Health and Senior Services ( " ’ )
(0] 5 FEPB P.0. Box 570, Jefferson City, MO 65102-0570 Phone: 573-751-6400 FAX: 573-751-6010 S
R é‘g RELAY MISSOURI for Hearing and Speech Impaired 1-800-735-2966_VOICE 1-800-735-2466 M
& Peter Lyskowski Jeremiah W. (Jay) Ni
Rz Lo o B e
Missouri Public Water Systems
November 10, 2015
Dear Public Water System Owners/Operators:
The Missouri State Public Health Laboratory (MSPHL) is in the process of implementing a new
Laboratory Information Management System (LIMS) in its drinking water bacteriology testing
laboratory. The OpenELIS (OE) LIMS will provide the laboratory with improved sample management
capability, improved data integrity and reduced potential for human data entry error. In addition, the
system will provide improved reporting capabilities, including direct electronic data exchange with the
Missouri Department of Natural Resources’ (MDNR) Safe Drinking Water Information System
(SDWIS). SDWIS is the computer system MDNR uses to store regulatory water testing data and report
testing results to you and the U.S. Environmental Protection Agency. In addition, the new OE LIMS will
provide a web portal that MSPHL clients can use to access their own test results in real time.
As the MSPHL implements this new computer system, several changes will be made in the way you
collect and submit water samples for testing. This letter and information packet will provide you with
information to help educate you on these changes.
NEW SAMPLE BOTTLES:
Beginning in August 2015, the MSPHL began using a larger sample bottle for water bacterial testing.
This bottle has a shrink wrap seal and two lines to indicate the proper sample volume. Please read the
attached “SAMPLE COLLECTION INSTRUCTIONS?” for details on how to use these new bottles.
Sample volume MUST be within the two lines on the bottle (100 — 120 mL) to be acceptable for
testing. You may continue to use your old bottles until the MSPHL can ship you new ones. Once you
have received the new bottles, please discard or recycle the old bottles.
NEW SAMPLE INFORMATION FORMS:
The traditional sample information “card™ that has been used for more than twenty years is being
replaced by the Environmental Sample Collection Form. An example form is attached. Please read the
attached instructions for information on properly completing the new form.
Changes to the form include the following:
1. Form size is expanded to a single 8 4 " x 117 sheet of paper. The form is no longer in a triplicate
carbon copy format. You may choose to photocopy for your records if you prefer. Note : MDNR
does not require a public water system to retain copies of sample collection forms ; however , you
might utilize them for system inspections.
2 . The form is printed by the OE LIMS and will be pre - populated with your Public Water Supply
ID number , PWS name , address and county. Forms should not be shared with other supplies.
www.health.mo.gov
Healthy Missourians for life.
The Missouri Department of Health and Senior Services will be the leader in promoting , protecting and partnering for health ,
AN EQUAL OPPORTUNITY / AFFIRMATIVE ACTION EMPLOYER : Services provided on a nondiscriminatory basis.
sR
Contract operators will be provided with forms for all the supplies they operate. Blank forms will
be available for MDNR Regional Office staff use.
3 . The form requires all requested information to be printed by the collector. There are no longer
check boxes for Sample Type or Repeat Location.
4 . Facility ID , Sample Collection Point ID and Location for the sampling site MUST be
provided by the collector. This information is available from your MDNR approved PWS
sampling plan. MDNR will be providing all public water systems with a current copy of their
approved sampling plan. This information is required by SDWIS and is used by MDNR to
ensure regulatory compliance requirements have been met. Failure to complete this information
on the sample collection form may result in a non - compliance report from MDNR.
5 . A Collector Signature line has been added. The sample collector must sign the form to attest the
information provided is accurate to the best of their knowledge.
The MSPHL will begin shipping the new forms to public water systems in late November or early
December. Please begin using the new forms December 16 , 2015 . Discard all the old forms (“ cards ™)
at that time.
NEW SAMPLE INSTRUCTIONS :
Sample instructions have been revised to include changes to the bottle and sampling form. The
instructions include detailed information on how to collect the sample using the new bottle , how to
complete the new sample collection form , how to best ship samples to the MSPHL using the free
MSPHL courier system , and how to register for the new MSPHL web portal. A copy of these
instructions is attached.
NEW WEB PORTAL FOR RESULTS REPORTS
The OE LIMS provides a web portal that may be used by systems to view and print their test result
reports , check status of samples , download sample information into Excel , and receive automated emails
when samples are received at the laboratory , and when sample results are ready to be viewed. For
information on how to gain access to this portal , please contact Shondra Johnson , LIMS Administrator
at Shondra.Johnson @ health.mo.gov or at 573 - 751 - 3334 .
IMPLEMENTATION DATES :
The MSPHL intends to implement the OpenELIS LIMS on December 1 , 2015 . There will be a two
week testing period in which laboratory staff will run the new LIMS in conjunction with our current
manual , paper - based system to ensure the OE LIMS is operating properly. You may continue to submit
samples as you currently do , using the old sample information card , throughout this time.
On December 16 , 2015 , the MSPHL plans to “ go - live ” with the new OE LIMS. Samples submitted
after that date should be submitted on the new Environmental Sample Collection Form. At that time , the
MSPHL Test Results Web Portal will also be available to those systems that have been granted access.
The MSPHL and MDNR understand that there will be a lot of changes to a system that has been in place
for many years. The MSPHL is excited about the added benefits from this new system , and we ask for
your patience as we implement the OpenELIS LIMS at the Missouri State Public Health Laboratory.
LI
If you have any questions , please contact the MSPHL Environmental Bacteriology Unit at 573 - 751 -
3334 . You may also contact your MDNR Regional Office for additional information on sample
collection.
Once again , thank you for your patience and understanding as we implement these changes.
Pttt R Hoamses.
Patrick R. Shannon
Manager , Environmental Bacteriology Unit
Missouri Department of Health and Senior Services
State Public Health Laboratory
101 North Chestnut St.
P.O. Box 570
Jefferson City , MO 65102
Phone : 573 - 751 - 3334
Email : Pat.Shannon @ health.mo.gov
Web : www.health.mo.gov / LabOrder # : 984 [T REPORT TO: BILL TO:
Pages in Order : 1of 1 65 i 82 i
Containers in Order : 1 ADRIAN MO DEPARTMENT OF NATURAL RESOURCES
16 E 5TH ST 1101 RIVERSIDE DRIVE
ADRIAN , MO 64720 JEFFERSON CITY , MO 65102
Requested Analyses / Tests
PUBLIC DRINKING WATER BACTERIAL ANALYSIS
Total Coliform Bacteria and E. coli ( Present / Absent Test )
£ PRINT LEGIBLY. Instructions for completing form are supplied in the Collection Kit. For compliance monitoring questions , contact the
o Missouri Department of Natur