Open Discover® SDK for .NET - A document file format identification and content extraction API

Open Discover SDK is a .NET application programming interface (API) that allows for:

  • Identifying file formats quickly and reliably from internal binary signatures rather than from unreliable file extensions
  • Extracting text from supported file formats (DOC, XLS, PPT, DOCX, XLSX, PPTX, ONENOTE, MSG, EML, EMLX, DXL, and many more) and optionally identifying the languages present in the extracted text
  • Extracting metadata from supported file formats (over 1,325 known metadata fields in total)
  • Extracting embedded items/attachments from supported document formats
  • Extracting archive container items (7ZIP, ZIP, RAR, TAR, etc.)
  • Extracting mail store container email objects (PST, OST, OST2013, MBOX, etc.)
  • Automatically detecting and extracting sensitive personally identifiable information (PII) such as social security numbers, credit card numbers, bank account/routing numbers, IBAN accounts, investment accounts, maiden names, phone numbers, addresses, IP addresses, cryptocurrency addresses, email addresses, and more
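
To illustrate the idea of identifying a file by its internal binary signature instead of its extension, here is a minimal C# sketch. It is not the Open Discover SDK API: the class name SignatureSniffer, its methods, and the short signature table are hypothetical and cover only a few widely documented leading byte sequences.

```csharp
using System;
using System.IO;

// Illustrative sketch only: this is NOT the Open Discover SDK API.
// All names here are hypothetical.
public static class SignatureSniffer
{
    // A handful of well-known leading byte signatures ("magic numbers").
    private static readonly (string Name, byte[] Magic)[] Signatures =
    {
        ("OLE2 compound file (DOC, XLS, PPT, MSG)",
            new byte[] { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 }),
        ("7ZIP", new byte[] { 0x37, 0x7A, 0xBC, 0xAF, 0x27, 0x1C }),
        ("RAR", new byte[] { 0x52, 0x61, 0x72, 0x21, 0x1A, 0x07 }),
        ("ZIP container (ZIP, DOCX, XLSX, PPTX)",
            new byte[] { 0x50, 0x4B, 0x03, 0x04 }),
        ("PST/OST mail store", new byte[] { 0x21, 0x42, 0x44, 0x4E }),
        ("PDF", new byte[] { 0x25, 0x50, 0x44, 0x46 }),
    };

    // Returns the name of the first signature that the header starts with,
    // or "unknown" when nothing in the table matches.
    public static string Identify(byte[] header, int length)
    {
        foreach (var (name, magic) in Signatures)
        {
            if (length < magic.Length)
            {
                continue; // header too short to contain this signature
            }

            bool match = true;
            for (int i = 0; i < magic.Length; i++)
            {
                if (header[i] != magic[i])
                {
                    match = false;
                    break;
                }
            }

            if (match)
            {
                return name;
            }
        }

        return "unknown";
    }

    // Reads only the first few bytes of the file; the extension is never consulted.
    public static string IdentifyFile(string path)
    {
        var buffer = new byte[8];
        using (var stream = File.OpenRead(path))
        {
            int read = stream.Read(buffer, 0, buffer.Length);
            return Identify(buffer, read);
        }
    }
}
```

Calling SignatureSniffer.IdentifyFile on a PDF that has been renamed to report.txt would still report PDF, because only the leading bytes are inspected. Note that DOCX, XLSX, and PPTX all share the ZIP signature, so telling them apart requires looking inside the container; a lookup table like this one is only the first step of real format identification.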

The Open Discover SDK API is intended for developers building higher-level document processing applications for:

  • Full text search using Lucene.NET
  • Machine learning using extracted text and metadata
  • Text analytics and document concept clustering
  • Information governance
  • Website crawling/full-text website search
  • Enterprise search and content management
  • IT departments - identifying, metadata scanning, and de-duplicating documents on file servers
  • eDiscovery applications
  • And more...
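
As one example of such a higher-level application, the file server de-duplication use case can be sketched by grouping files on a hash of their content. This is a common general approach and an assumption on our part, not a description of how the SDK itself detects duplicates; the class DuplicateFinder and its methods are hypothetical and use only standard .NET library calls.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

// Illustrative sketch only: this is NOT the Open Discover SDK API.
// All names here are hypothetical.
public static class DuplicateFinder
{
    // Computes the SHA-256 hash of a file's bytes as an uppercase hex string.
    public static string HashFile(string path)
    {
        using (var sha = SHA256.Create())
        using (var stream = File.OpenRead(path))
        {
            byte[] hash = sha.ComputeHash(stream);
            return BitConverter.ToString(hash).Replace("-", "");
        }
    }

    // Returns groups of paths under root whose contents are byte-for-byte identical.
    // Files with unique content are omitted from the result.
    public static List<List<string>> FindDuplicates(string root)
    {
        return Directory
            .EnumerateFiles(root, "*", SearchOption.AllDirectories)
            .GroupBy(path => HashFile(path))
            .Where(group => group.Count() > 1)
            .Select(group => group.ToList())
            .ToList();
    }
}
```

Hashing raw bytes only finds exact copies. Two documents with the same text but different formats or metadata would hash differently, which is where extracted text and metadata become useful as the comparison key instead.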

For examples, check out the Open Discover SDK repository on GitHub (coming soon):

Open Discover SDK GitHub Example Repository