3D grphique: IOException and missing content when indexing PDFs

Wednesday, 3 September 2014

IOException and missing content when indexing PDFs

I'm trying to index files in Solr 4.9.0 using SolrJ with a method like this:


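import java.io.File;
import java.io.IOException;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

// Sends one PDF to Solr's extracting handler and commits immediately.
// FilenameAnalyzer and FileHandler are my own classes; server is a field of the enclosing class.
private boolean commitFile(File file) throws IOException {
    FilenameAnalyzer analyzer = new FilenameAnalyzer();
    String contentType = "application/pdf";
    String id = FileHandler.getInstance().getFileId(file);
    System.out.println("indexing file " + file.getName());
    ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
    up.addFile(file, contentType);
    // Use the file id as the document's unique key.
    up.setParam("literal.id", id);

    // Commit right away, waiting for the flush and for a new searcher.
    up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
    try {
        server.request(up);
        return true;
    } catch (SolrServerException | RuntimeException e) {
        // Report the failure instead of silently dropping it.
        System.err.println("indexing failed for " + file.getName() + ": " + e.getMessage());
        return false;
    }
}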

However, in roughly 90% of cases Solr throws the following exception:


java.io.IOException: expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@1aa0464b
at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:609)
at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:605)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:194)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1238)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1203)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:111)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
...

The log also shows the following warning:


9/3/2014, 10:09:15 AM WARN XrefTrailerResolver Did not found XRef object at specified startxref position 0
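
Both messages concern the PDF's internal structure (a stream without its 'endstream' keyword, and a startxref position of 0), so one thing worth ruling out is that the files on disk are themselves incomplete. The following is a minimal sketch using only plain Java I/O. PdfSanityCheck and its method names are hypothetical, and the 1024-byte tail window is an assumption. It checks only for the "%PDF-" header, the trailing "%%EOF" marker and a plausible startxref offset, so passing it does not guarantee that PDFBox can parse the file.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of Solr, Tika or PDFBox.
final class PdfSanityCheck {

    // Returns the byte offset written after the last "startxref" keyword,
    // or -1 if the header, the keyword or the "%%EOF" marker is missing.
    static long startXrefOffset(String path) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long length = raf.length();
            byte[] head = new byte[(int) Math.min(5, length)];
            raf.readFully(head);
            if (!new String(head, StandardCharsets.ISO_8859_1).equals("%PDF-")) {
                return -1;
            }
            // The trailer sits at the end of the file; 1024 bytes is an assumed window.
            int tailSize = (int) Math.min(1024, length);
            byte[] tail = new byte[tailSize];
            raf.seek(length - tailSize);
            raf.readFully(tail);
            String text = new String(tail, StandardCharsets.ISO_8859_1);
            int key = text.lastIndexOf("startxref");
            int eof = text.lastIndexOf("%%EOF");
            if (key < 0 || eof < key) {
                return -1;
            }
            String number = text.substring(key + "startxref".length(), eof).trim();
            try {
                return Long.parseLong(number);
            } catch (NumberFormatException e) {
                return -1;
            }
        }
    }

    // True when the announced xref offset is positive and lies inside the file.
    static boolean looksComplete(String path) throws IOException {
        long offset = startXrefOffset(path);
        return offset > 0 && offset < new File(path).length();
    }
}
```

Calling PdfSanityCheck.looksComplete(file.getPath()) before commitFile(file) would show whether the failing 90% share a truncated or damaged trailer, or whether the files look intact and the problem lies further along in the extraction.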

When I query Solr for the file, it finds the document but returns " \n \n \n \n \n \n \n \n " as the entire content. I've found a similar issue reported at http://ift.tt/1rla70F, but it is marked as resolved, and the fix appears to be included in the version of PDFBox that Solr 4.9.0 uses.
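
To count how many documents end up in this state, the returned content can be tested for being whitespace only. This is a small sketch with a hypothetical helper, ExtractedContent.isBlank. How the content string is fetched from Solr is left out, since it depends on the schema's field names.

```java
// Hypothetical helper; not part of SolrJ.
final class ExtractedContent {

    // True when the text is null or consists only of whitespace,
    // as in the " \n \n \n " result described above.
    static boolean isBlank(String content) {
        return content == null || content.trim().isEmpty();
    }
}
```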

Is this the same issue? If so, is it possible that it was not fixed correctly? How can I resolve this, given that it breaks about 90% of my PDF indexing?
