What is Open Discover™ SDK?
It is a .NET toolkit that developers use to identify and classify document file formats and to extract content such as text, metadata, attributes (e.g., the ‘WorksheetHasHiddenColumns’ document attribute), embedded objects, and attachments. Languages present in the extracted text, such as English or Chinese, are identified automatically. The SDK also calculates MD5/SHA1 hashes for all documents, plus additional, more sophisticated hashes for email and office document types. These hashes are useful for de-duplication, that is, for removing duplicate documents from a document set.
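As a rough illustration of hash-based de-duplication, the sketch below uses only the standard .NET `System.Security.Cryptography` classes. It is not the SDK's API: the class and method names (`HashDedupSketch`, `Sha1Hex`, `Deduplicate`) are hypothetical, and it only catches byte-for-byte identical documents.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper for illustration only; not part of the SDK.
public static class HashDedupSketch
{
    // Returns the SHA-1 digest of the stream's remaining bytes as uppercase hex.
    public static string Sha1Hex(Stream stream)
    {
        using (var sha1 = SHA1.Create())
        {
            return BitConverter.ToString(sha1.ComputeHash(stream)).Replace("-", "");
        }
    }

    // Keeps the name of the first document seen for each distinct hash, in input order.
    public static List<string> Deduplicate(IEnumerable<KeyValuePair<string, byte[]>> documents)
    {
        var seenHashes = new HashSet<string>();
        var uniqueNames = new List<string>();
        foreach (var doc in documents)
        {
            using (var content = new MemoryStream(doc.Value))
            {
                // HashSet.Add returns false when the hash was already present.
                if (seenHashes.Add(Sha1Hex(content)))
                {
                    uniqueNames.Add(doc.Key);
                }
            }
        }
        return uniqueNames;
    }
}
```

Two files with the same bytes but different names collapse to one entry; files that differ by even one byte are kept.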
Before indexing a document for full-text search or running machine learning classification on a set of documents, you first need the text and metadata of each document and of any attachments. For this reason, Open Discover™ SDK is a useful companion toolset for machine learning, full-text indexing, file storage document identification/classification, ECM systems, eDiscovery, and more.
The SDK can identify 1,300+ document file formats. Except in a few cases (described below), the SDK does not rely on file extensions to identify file formats; it uses binary or other unique internal signatures of a document to identify its file format.
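To give a feel for what signature-based identification means, here is a minimal sketch that checks the leading bytes of a stream against four widely documented signatures. It is not how the SDK is implemented and does not use its API; the class and method names (`SignatureSniffSketch`, `Sniff`) are hypothetical, and a real identifier looks at far more than a file's first few bytes.

```csharp
using System.IO;

// Hypothetical helper for illustration only; not part of the SDK.
public static class SignatureSniffSketch
{
    private static readonly string[] Names =
    {
        "PDF", "ZIP container", "OLE2 compound file", "PNG"
    };

    private static readonly byte[][] Magics =
    {
        new byte[] { 0x25, 0x50, 0x44, 0x46, 0x2D },                   // "%PDF-"
        new byte[] { 0x50, 0x4B, 0x03, 0x04 },                         // "PK\x03\x04"
        new byte[] { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 }, // OLE2 header
        new byte[] { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A }  // PNG header
    };

    // Reads up to 8 bytes from a seekable stream, restores the position,
    // and returns the name of the first matching signature.
    public static string Sniff(Stream stream)
    {
        long start = stream.Position;
        var header = new byte[8];
        int read = 0, n;
        while (read < header.Length &&
               (n = stream.Read(header, read, header.Length - read)) > 0)
        {
            read += n;
        }
        stream.Position = start;

        for (int i = 0; i < Magics.Length; i++)
        {
            if (StartsWith(header, read, Magics[i]))
            {
                return Names[i];
            }
        }
        return "Unknown";
    }

    private static bool StartsWith(byte[] data, int length, byte[] prefix)
    {
        if (length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

Note that a ZIP or OLE2 signature only identifies the container; telling a .docx from a plain .zip requires inspecting what is inside it.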
Identifying a document in C#/.NET code is as simple as:
- where method argument ‘_stream’ is an open .NET Stream object (e.g., FileStream or MemoryStream)
- where method argument ‘filename’ is the filename or full path of the file, including its extension (if it has one).
Passing the filename as an argument is not required, but it is strongly recommended. Some file formats, such as encrypted Microsoft Office 2007-2016 documents, share the same file format and internal signatures and cannot be identified with 100% certainty until the internal package hosting the real document is decrypted. In these and a few other special cases, the file extension is used in conjunction with the internal signatures to identify the document.
In the example code snippet above, the returned result ‘docIdResult’ is an IdResult object (see class diagram below) that specifies the identified file format (property "ID", of enumerated type Id, e.g., Id.OutlookMessage, Id.Excel2007Encrypted), the classification of the document format (property "Classification", of enumerated type ‘IdClassification’, e.g., IdClassification.Email, IdClassification.Spreadsheet, IdClassification.WordProcessing), the MIME type if known, the character set encoding for text-based document formats, a text description of the file format, and more.
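To illustrate why passing the filename helps with the encrypted Office case described earlier: an encrypted Office 2007+ file is wrapped in an OLE2 compound file, so its leading bytes alone do not say whether it is a Word, Excel, or PowerPoint document. The sketch below shows one way an extension could act as a tie-breaker. It is a simplified heuristic under that assumption, not the SDK's logic; `ExtensionHintSketch` and `Guess` are hypothetical names, and the returned strings are made up for the example.

```csharp
using System.IO;

// Hypothetical helper for illustration only; not part of the SDK.
public static class ExtensionHintSketch
{
    private static readonly byte[] Ole2Magic =
    {
        0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1
    };

    // header: the first bytes of the file; filename: optional, may be null.
    public static string Guess(byte[] header, string filename)
    {
        bool isOle2 = header != null && header.Length >= Ole2Magic.Length;
        for (int i = 0; isOle2 && i < Ole2Magic.Length; i++)
        {
            isOle2 = header[i] == Ole2Magic[i];
        }
        if (!isOle2) return "Not an OLE2 container";

        // Without a filename there is no extension to use as a hint.
        string ext = filename == null
            ? ""
            : Path.GetExtension(filename).ToLowerInvariant();

        switch (ext)
        {
            case ".docx":
            case ".xlsx":
            case ".pptx":
                // An Office 2007+ extension on an OLE2 container suggests encryption.
                return "Possibly an encrypted Office 2007+ document";
            default:
                return "OLE2 container, exact type undetermined";
        }
    }
}
```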
The SDK can extract content from 600+ document file formats, and the number is growing (counted by document file format ID). For document types that aren’t supported, a fast and accurate binary-to-text extractor is provided that extracts any useful text present in the binary in UTF-8, UTF-16, and code page 1252 encodings.
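The idea behind binary-to-text extraction can be sketched as scanning a byte buffer for runs of printable characters, much like the Unix `strings` utility. The example below handles printable ASCII only and is not the SDK's extractor, which the text above says also covers UTF-16 and code page 1252; `BinaryTextSketch` and `ExtractAsciiRuns` are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration only; not part of the SDK.
public static class BinaryTextSketch
{
    // Returns every run of at least minLength printable ASCII bytes (0x20-0x7E).
    public static List<string> ExtractAsciiRuns(byte[] data, int minLength = 4)
    {
        var runs = new List<string>();
        var current = new StringBuilder();
        foreach (byte b in data)
        {
            if (b >= 0x20 && b <= 0x7E)
            {
                current.Append((char)b);
            }
            else
            {
                Flush(current, minLength, runs);
            }
        }
        Flush(current, minLength, runs); // handle a run that ends the buffer
        return runs;
    }

    // Stores the pending run if it is long enough, then resets the builder.
    private static void Flush(StringBuilder current, int minLength, List<string> runs)
    {
        if (current.Length >= minLength)
        {
            runs.Add(current.ToString());
        }
        current.Clear();
    }
}
```

The minimum run length filters out short accidental matches, which are common in compressed or encrypted data.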
To extract content from a document, using C#/.NET code, we make a method call to the content extractor factory that makes use of the identified document format to return an appropriate extraction interface for that particular format:
- where method argument ‘_stream’ is an open Stream object (FileStream or MemoryStream) to the document;
- where argument ‘_docIdResult’ is the document identification result returned in the earlier code snippet;
- where ‘filename’ is the filename or full path of the file, including its extension (if it has one);
- where ‘_contentConfig’ is a ContentExtractionSettings object that has setting options for what is extracted (e.g., only extract metadata, or to extract text, metadata, and attachments/embedded objects) and options for hashing, language identification of extracted text, etc.
In the example code snippet above, the returned ‘docContentResult’ is a ContentExtractorResult object from which the user can get the appropriate interface to extract content for the document’s particular file format. Archives, mail stores, and office documents have their own distinct extraction interface types.
The code snippet below shows how to use the ContentExtractorType.Document content extractor. If the document is encrypted with a password and the SDK supports decrypting the document type, then a dialog prompting for the valid password is displayed in this example:
It is that simple. The returned '_docContent' object (of class DocumentContent) in the above code contains the extracted text, the languages present in the extracted text, document attributes, metadata, embedded objects, and attachments, all retrieved in one method call. The class diagram of the returned DocumentContent class looks like this:
The C# sample projects distributed with the SDK show how to use all the content extractor types, and include an example showing how to use the DocumentIdentifier class to identify files in parallel. There are also several examples showing how to use the .NET PlatformWorker class to process a set of documents as a task, process a large archive as a task, and process a mail store (.pst, .ost, .mbox) as a task.
See the 'API Reference' for detailed descriptions of all .NET SDK API classes.
The SDK can decrypt common office formats such as Microsoft Office 97-2003, Microsoft Office 2007-2016, OpenDocument formats (OpenOffice and LibreOffice), ZIP, 7Z, RAR, and PDF by cycling through a user-supplied list of known passwords.
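Cycling through a password list amounts to trying each candidate until one succeeds. The sketch below shows that loop in isolation. It does not use the SDK's API: `PasswordCycleSketch`, `FindPassword`, and the `tryOpen` callback are hypothetical stand-ins for whatever call actually attempts decryption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of the SDK.
public static class PasswordCycleSketch
{
    // tryOpen is a hypothetical callback that returns true when the
    // given password successfully opens the document.
    // Returns the first working password, or null if none works.
    public static string FindPassword(
        IEnumerable<string> knownPasswords,
        Func<string, bool> tryOpen)
    {
        foreach (var password in knownPasswords)
        {
            if (tryOpen(password))
            {
                return password;
            }
        }
        return null;
    }
}
```

A null result corresponds to the failure case where no supplied password is valid for the document or archive.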
The SDK identifies many encrypted document formats. Knowing if a document is encrypted is useful for many reasons, such as:
- To get document passwords from key people leaving a company
- IT Security: verify that employees are encrypting their documents and following security guidelines
- To identify why content extraction failed on a particular document, e.g., because no valid password was supplied for an encrypted document or archive.
The Open Discover™ SDK is an ideal technology for processing unstructured content for business applications such as:
- Corporate information governance
- Full-text search using the SDK with open-source Lucene.NET
- Text analytics/document concept clustering
- Enterprise search and content management
- Big Data, machine learning, AI, etc.
- Website crawling/full-text website search
The PlatformWorker .NET class is a highly parallel document processing engine built on the Open Discover SDK. PlatformWorker is designed to recursively extract content from documents, large archives (.7z, .zip, .rar, etc.), or large mail stores (.pst, .ost, .mbox) as a task. Ideally, a large document processing job is broken into smaller document subsets, say 1GB-5GB of documents per set, and distributed PlatformWorkers process the subsets. Many PlatformWorker instances can be distributed across desktop PCs or VMs, running as remote procedure call (RPC) console applications or hosted in a Windows Service, to allow high-throughput processing of hundreds of gigabytes to terabytes of document data.
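Breaking a large job into size-bounded subsets can be done with a simple greedy pass over the file list. The sketch below is one possible approach, not something taken from the SDK or its samples; `BatchSketch` and `Partition` are hypothetical names, and file sizes are supplied by the caller rather than read from disk.

```csharp
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of the SDK.
public static class BatchSketch
{
    // Groups files (name, size in bytes) into consecutive batches whose total
    // size stays at or below maxBatchBytes. A single file larger than the
    // limit is placed in a batch of its own.
    public static List<List<string>> Partition(
        IEnumerable<KeyValuePair<string, long>> files,
        long maxBatchBytes)
    {
        var batches = new List<List<string>>();
        var current = new List<string>();
        long currentBytes = 0;

        foreach (var file in files)
        {
            // Start a new batch when adding this file would exceed the limit.
            if (current.Count > 0 && currentBytes + file.Value > maxBatchBytes)
            {
                batches.Add(current);
                current = new List<string>();
                currentBytes = 0;
            }
            current.Add(file.Key);
            currentBytes += file.Value;
        }

        if (current.Count > 0)
        {
            batches.Add(current);
        }
        return batches;
    }
}
```

With a limit in the 1GB-5GB range mentioned above, each resulting batch could then be handed to a separate worker.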
PlatformWorker allows customers of Open Discover SDK to more easily build their own processing platforms for the cloud or on site.
The following code snippet is taken from a PlatformWorker C# console application example that is distributed with the SDK. It shows how easy it is to configure a PlatformWorker for a task: