Monday, December 18, 2017

Extending Apache Tika Capabilities

Apache Tika is a toolkit for extracting metadata and textual content from various document formats. Tika itself implements parsers for some document formats, while it relies on external libraries (such as Apache PDFBox and Apache POI) for parsing many more.

Tika provides a uniform Java API for all of the supported document formats to make life easier for the user. Additionally, Tika provides functionality for detecting document type and content language.

In my earlier article, Content Extraction with Tika, I looked into using Tika either with Solr or in standalone mode. In this post I will go through some of the aspects involved in implementing support for new document formats. I will also provide a couple of example parsers and a full Maven project to get you up to speed quickly.

Extension mechanisms

The basic principle of adding support for more document formats in Tika is very simple: write Java classes that implement the Tika Parser interface and let Tika know about your extension. If you implemented a parser for one of the more than 900 file formats Tika knows by filename extension, or one of the roughly 300 formats that Tika can recognize from the leading bytes of the file content, this is all you need to do.
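
To make the two recognition routes mentioned above more concrete, here is a minimal sketch of type detection by leading bytes with a fallback to the filename extension. It is a toy model written against the plain JDK, not Tika's own detection code: the class name ToyTypeDetector, the tiny lookup tables, and the rule "content first, then extension" are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Toy model of type detection (hypothetical class, not part of Tika).
class ToyTypeDetector {
    static final String UNKNOWN = "application/octet-stream";

    // Leading-byte signatures for a few well-known formats.
    private static final Map<String, byte[]> MAGIC = new LinkedHashMap<>();
    // Filename extensions, consulted only when the content gives no answer.
    private static final Map<String, String> EXTENSIONS = new LinkedHashMap<>();

    static {
        MAGIC.put("application/pdf", "%PDF-".getBytes(StandardCharsets.US_ASCII));
        MAGIC.put("application/zip", new byte[] {0x50, 0x4B, 0x03, 0x04});
        MAGIC.put("image/png", new byte[] {(byte) 0x89, 0x50, 0x4E, 0x47});
        EXTENSIONS.put("txt", "text/plain");
        EXTENSIONS.put("pdf", "application/pdf");
        EXTENSIONS.put("png", "image/png");
    }

    // Returns a MIME type for the given file name and leading content bytes.
    static String detect(String fileName, byte[] head) {
        // 1. Compare the start of the content against each known signature.
        for (Map.Entry<String, byte[]> entry : MAGIC.entrySet()) {
            byte[] magic = entry.getValue();
            if (head != null && head.length >= magic.length
                    && Arrays.equals(Arrays.copyOf(head, magic.length), magic)) {
                return entry.getKey();
            }
        }
        // 2. Fall back to the filename extension, ignoring case.
        if (fileName != null) {
            int dot = fileName.lastIndexOf('.');
            if (dot >= 0 && dot < fileName.length() - 1) {
                String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
                String type = EXTENSIONS.get(ext);
                if (type != null) {
                    return type;
                }
            }
        }
        // 3. Nothing matched.
        return UNKNOWN;
    }

    public static void main(String[] args) {
        byte[] pdfHead = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect("report.bin", pdfHead));           // prints application/pdf
        System.out.println(detect("notes.TXT", new byte[0]));         // prints text/plain
        System.out.println(detect("mystery", new byte[] {1, 2, 3}));  // prints application/octet-stream
    }
}
```

If a format is already covered by one of these two routes in Tika, detection hands the right MIME type to the parser lookup and only the parser itself is left to write.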

There are at least three different ways to let Tika know about the new parser. The first way (and the most flexible one) is to wire it up with Java code. If you wanted to use Tika's AutoDetectParser, you'd call its setParsers method with the parsers of your choice. Wiring things up in Java code also lets you customize the detection logic easily, just by calling the setDetector method.

The next way to customize the set of available parsers is to use an external XML file and construct a TikaConfig object with that configuration file.

The last of the three methods explained here is to use the mechanism that was added in version 0.7 of Tika: the standard Java ServiceProvider API. In practice this means shipping a META-INF/services/org.apache.tika.parser.Parser file in your .jar that lists the fully qualified class names of your parsers. Extending the capabilities of Tika this way is very straightforward. There are, however, a couple of details you should pay attention to when using this mechanism.

If you need to replace a parser implementation that ships with Tika with your own, make sure that your .jar file is loaded after the tika-parsers.jar file. This is because, in the current implementation, the last parser registered for a given MIME type is the one used to parse content of that type.
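
The load-order rule is easier to see in a small model. The sketch below is not Tika's registry code; it only imitates the "last registration wins" behaviour described above with a plain map, and the class ToyParserRegistry and the parser names in it are made up for the example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of "last registration wins" (hypothetical, not Tika code).
class ToyParserRegistry {
    // MIME type -> name of the parser currently responsible for it.
    private final Map<String, String> parserByType = new LinkedHashMap<>();

    // Registers a parser for every MIME type it claims to support.
    // A later registration for the same type silently replaces the earlier one.
    void register(String parserName, List<String> mimeTypes) {
        for (String type : mimeTypes) {
            parserByType.put(type, parserName);
        }
    }

    // Returns the parser name for a MIME type, or null if none is registered.
    String parserFor(String mimeType) {
        return parserByType.get(mimeType);
    }

    public static void main(String[] args) {
        ToyParserRegistry registry = new ToyParserRegistry();
        // Registration order stands in for .jar load order.
        registry.register("BundledTextParser", Arrays.asList("text/plain"));
        registry.register("BundledPdfParser", Arrays.asList("application/pdf"));
        registry.register("EnhancedTextParser", Arrays.asList("text/plain"));

        System.out.println(registry.parserFor("text/plain"));       // prints EnhancedTextParser
        System.out.println(registry.parserFor("application/pdf"));  // prints BundledPdfParser
    }
}
```

Swapping the first and third register calls would leave the bundled parser in charge of text/plain, which is why the order in which the .jar files are loaded matters.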

To support completely new document types (ones that Tika knows nothing about), you need to customize the detection process of AutoDetectParser manually, because detection has no extension mechanism comparable to the one for adding new parsers. One way to do this is to use CompositeDetector to layer your own detection on top and rely on the default Detector for all other types.
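
The overlay idea can be sketched without any Tika classes. In the example below, a custom detector is consulted first and a stand-in "default" detector handles everything else. The TypeDetector interface, the overlay method, and the use of text/x-vcard for vCard content are assumptions for illustration; Tika's real Detector and CompositeDetector have different signatures and may combine results differently.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Simplified model of overlaying a custom detector on a default one
// (hypothetical types, not Tika's Detector/CompositeDetector).
class ToyOverlayDetection {
    static final String UNKNOWN = "application/octet-stream";

    interface TypeDetector {
        // Returns a MIME type, or UNKNOWN when this detector has no opinion.
        String detect(byte[] head, String fileName);
    }

    // Custom detector: vCard content starts with the line "BEGIN:VCARD".
    static final TypeDetector VCARD = (head, name) -> {
        int length = Math.min(head.length, 11);
        String start = new String(head, 0, length, StandardCharsets.US_ASCII);
        return start.equalsIgnoreCase("BEGIN:VCARD") ? "text/x-vcard" : UNKNOWN;
    };

    // Stand-in for the default detector: only knows the .txt extension.
    static final TypeDetector DEFAULT = (head, name) ->
            name != null && name.toLowerCase(Locale.ROOT).endsWith(".txt")
                    ? "text/plain" : UNKNOWN;

    // Tries the detectors in order and returns the first definite answer.
    static TypeDetector overlay(List<TypeDetector> detectors) {
        return (head, name) -> {
            for (TypeDetector detector : detectors) {
                String type = detector.detect(head, name);
                if (!UNKNOWN.equals(type)) {
                    return type;
                }
            }
            return UNKNOWN;
        };
    }

    public static void main(String[] args) {
        TypeDetector detector = overlay(Arrays.asList(VCARD, DEFAULT));
        byte[] card = "BEGIN:VCARD\nVERSION:3.0\n".getBytes(StandardCharsets.US_ASCII);
        byte[] text = "hello".getBytes(StandardCharsets.US_ASCII);

        System.out.println(detector.detect(card, "contact.txt")); // prints text/x-vcard
        System.out.println(detector.detect(text, "notes.txt"));   // prints text/plain
        System.out.println(detector.detect(text, "data.bin"));    // prints application/octet-stream
    }
}
```

Placing the custom detector first means it only has to recognize the new type; everything it does not claim falls through to the default detection unchanged.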

In this article I have demonstrated some ways to extend Tika's parsing and detection capabilities when your environment needs it. Process-wise, the best way to add new capabilities to Tika is to contribute your new parser integrations and enhancements back to the Tika project. This way the community as a whole will benefit from the results.

Running the provided example

1. Download project
2. Compile project
   mvn clean install
3. Copy dependencies to directory target/dependencies
   mvn dependency:copy-dependencies
4. Execute the default TikaGUI with our additions (enhanced .txt parser, vCard parser)

To learn more about search applications and enterprise search, check out the Lucid Imagination website.

Sami Siren is an Apache Nutch developer and a Lucene PMC member.