Wednesday, September 3, 2014

IOException and missing content when indexing PDFs






I'm trying to index PDF files into Solr 4.9.0 using SolrJ with a method like this:



private boolean commitFile(File file) throws IOException {
    FilenameAnalyzer analyzer = new FilenameAnalyzer();
    String contentType = "application/pdf";
    String id = FileHandler.getInstance().getFileId(file);
    System.out.println("indexing file " + file.getName());

    // Send the raw file to the extracting request handler (Solr Cell / Tika).
    ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
    up.addFile(file, contentType);
    up.setParam("literal.id", id);

    // Commit right away, waiting for the flush and for a new searcher.
    up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
    try {
        server.request(up);
        return true;
    } catch (SolrServerException | RuntimeException e) {
        // Report the failure instead of swallowing it silently.
        System.err.println("indexing failed for " + file.getName() + ": " + e);
        return false;
    }
}


However, in roughly 90% of cases Solr throws the following exception:



java.io.IOException: expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@1aa0464b
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:609)
    at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:605)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:194)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1238)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1203)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:111)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ...


The Solr log also shows the following warning:



9/3/2014, 10:09:15 AM WARN XrefTrailerResolver Did not found XRef object at specified startxref position 0
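
Both messages are consistent with the parser reaching the end of its input earlier than expected, so one thing worth ruling out is that the files are already truncated or empty on disk before they are sent. Below is a minimal sketch of such a check, assuming only that a well-formed PDF starts with "%PDF-" and has an "%%EOF" marker near its end. The class name PdfSanityCheck and the 1024-byte tail window are illustrative choices, not part of SolrJ, Tika, or PDFBox, and passing this check does not prove that a file is valid.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library mentioned in the question.
public class PdfSanityCheck {

    // Assumed window in which to look for the end-of-file marker.
    private static final int TAIL_WINDOW = 1024;

    /**
     * Returns true if the file starts with "%PDF-" and contains "%%EOF"
     * somewhere in its last TAIL_WINDOW bytes. This only detects files that
     * are obviously empty or cut off; it does not parse the PDF.
     */
    public static boolean looksComplete(File file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            long length = raf.length();
            if (length < 5) {
                return false; // too short to even hold the header
            }

            // Check the header bytes at the start of the file.
            byte[] head = new byte[5];
            raf.readFully(head);
            if (!"%PDF-".equals(new String(head, StandardCharsets.ISO_8859_1))) {
                return false;
            }

            // Look for the end-of-file marker in the tail of the file.
            int tailSize = (int) Math.min(TAIL_WINDOW, length);
            byte[] tail = new byte[tailSize];
            raf.seek(length - tailSize);
            raf.readFully(tail);
            return new String(tail, StandardCharsets.ISO_8859_1).contains("%%EOF");
        }
    }
}
```

Calling PdfSanityCheck.looksComplete(file) just before up.addFile(file, contentType) would show whether the failing 90% are already incomplete locally. If they all pass, the problem is more likely in the upload or in the parsing on the Solr side.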


When I query Solr for the file, it finds the document, but returns " \n \n \n \n \n \n \n \n " as the entire content. I've found a similar issue reported at http://ift.tt/1rla70F, but it is marked as resolved, and the fix appears to be included in the version of PDFBox that Solr 4.9.0 uses.


Is this the same issue? If so, is it possible that it was not fixed correctly? How can I resolve this when it breaks 90% of my PDF indexing?


