Open Discover® SDK for .NET - A document file format identification and content extraction API

Open Discover SDK is a .NET application programming interface (API) that allows for:

  • Identifying file formats quickly and reliably from internal binary signatures rather than from unreliable file extensions
  • Extracting text from supported file formats (DOC, XLS, PPT, DOCX, XLSX, PPTX, ONENOTE, MSG, EML, EMLX, DXL, and many more) and optionally identifying the languages present in the extracted text
  • Extracting metadata from supported file formats (over 1,325 known metadata fields in total)
  • Extracting embedded items/attachments from supported document formats
  • Extracting archive container items (7ZIP, ZIP, RAR, TAR, etc.)
  • Extracting mail store container email objects (PST, OST, OST2013, Outlook for Mac OLM, MBOX, etc.)
  • Automatically detecting and extracting sensitive personally identifiable information (PII) such as social security numbers, credit card numbers, bank account/routing numbers, IBAN accounts, investment accounts, maiden names, phone numbers, addresses, IP addresses, cryptocurrency addresses, email addresses, and more
  • Detecting and extracting entities related to medical, health care, and insurance records (and more)
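
As an illustration of signature-based identification, the sketch below checks the leading bytes of a file against a handful of well-known signatures. It is a generic example and not the Open Discover SDK API: the SignatureSniffer class, its method names, and the short signature table are hypothetical, and a real identifier would cover far more formats and look deeper into container files.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration only; not part of the Open Discover SDK.
public static class SignatureSniffer
{
    // A few widely documented leading byte signatures ("magic numbers").
    private static readonly (string Name, byte[] Magic)[] Signatures =
    {
        ("PDF", new byte[] { 0x25, 0x50, 0x44, 0x46, 0x2D }),                       // "%PDF-"
        ("ZIP container", new byte[] { 0x50, 0x4B, 0x03, 0x04 }),                   // "PK\x03\x04"
        ("OLE2 compound file", new byte[] { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 }),
        ("7ZIP", new byte[] { 0x37, 0x7A, 0xBC, 0xAF, 0x27, 0x1C }),                // "7z" + 4 bytes
        ("RAR", new byte[] { 0x52, 0x61, 0x72, 0x21, 0x1A, 0x07 }),                 // "Rar!\x1A\x07"
    };

    // Returns the name of the first signature that matches the start of
    // 'header', considering only the first 'count' bytes as valid.
    public static string Identify(byte[] header, int count)
    {
        foreach (var (name, magic) in Signatures)
        {
            if (count < magic.Length)
            {
                continue; // not enough bytes to compare against this signature
            }

            bool match = true;
            for (int i = 0; i < magic.Length; i++)
            {
                if (header[i] != magic[i])
                {
                    match = false;
                    break;
                }
            }

            if (match)
            {
                return name;
            }
        }

        return "unknown";
    }

    // Reads the first bytes of a file and identifies it, ignoring the extension.
    public static string IdentifyFile(string path)
    {
        byte[] buffer = new byte[8];
        using (FileStream stream = File.OpenRead(path))
        {
            int read = stream.Read(buffer, 0, buffer.Length);
            return Identify(buffer, read);
        }
    }
}
```

A ZIP signature alone cannot distinguish a plain archive from DOCX, XLSX, or PPTX, and an OLE2 signature alone cannot distinguish DOC from XLS, PPT, or MSG, because those formats share the same outer container. Telling them apart requires inspecting the container contents, which this sketch does not attempt.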

The Open Discover SDK API is intended for developing higher-level document processing applications such as:

  • Full text search using Lucene.NET
  • Machine learning using extracted text and metadata
  • Text analytics and document concept clustering
  • Information governance
  • Website crawling/full-text website search
  • Enterprise search and content management
  • IT departments: identifying, metadata scanning, and de-duplicating documents on file servers
  • eDiscovery applications
  • And more...
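
To make the de-duplication use case concrete, the following is a minimal sketch that groups files by a SHA-256 hash of their raw bytes using only standard .NET classes. It does not use the Open Discover SDK; the DuplicateFinder class and its method names are hypothetical, and it only detects byte-for-byte identical files, not documents whose content matches but whose binary form differs.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper for illustration only; not part of the Open Discover SDK.
public static class DuplicateFinder
{
    // Computes the SHA-256 digest of a stream as an uppercase hex string.
    public static string HashStream(Stream stream)
    {
        using (SHA256 sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(stream);
            return BitConverter.ToString(digest).Replace("-", "");
        }
    }

    // Groups file paths by content hash; any group with more than one
    // path contains byte-for-byte identical files.
    public static Dictionary<string, List<string>> GroupByContent(IEnumerable<string> paths)
    {
        var groups = new Dictionary<string, List<string>>();

        foreach (string path in paths)
        {
            string key;
            using (FileStream stream = File.OpenRead(path))
            {
                key = HashStream(stream);
            }

            if (!groups.TryGetValue(key, out List<string> sameContent))
            {
                sameContent = new List<string>();
                groups[key] = sameContent;
            }

            sameContent.Add(path);
        }

        return groups;
    }
}
```

Calling DuplicateFinder.GroupByContent(Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories)), where root is a folder of your choosing, would yield one entry per distinct file content.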

For examples, check out the Open Discover SDK repository on GitHub (coming soon):

Open Discover SDK GitHub Example Repository