Open Discover® SDK for .NET
Open Discover SDK is a .NET application programming interface (API) that allows for:
- Identifying file formats quickly and reliably from internal binary signatures (rather than unreliable file extensions)
- Extracting text from supported file formats (DOC, XLS, PPT, DOCX, XLSX, PPTX, ONENOTE, MSG, EML, EMLX, DXL, and many more) and optionally identifying the languages present in the extracted text
- Extracting metadata from supported file formats (over 1,325 known metadata fields in total)
- Extracting embedded items/attachments from supported document formats
- Extracting archive container items (7ZIP, ZIP, RAR, TAR, etc.)
- Extracting mail store container email objects (PST, OST, OST2013, Outlook for Mac OLM, MBOX, etc.)
- Automatically detecting and extracting sensitive personally identifiable information (PII) such as social security numbers, credit card numbers, bank account/routing numbers, IBAN accounts, investment accounts, maiden names, phone numbers, addresses, IP addresses, cryptocurrency addresses, email addresses, and more
- Detecting and extracting entities related to medical, health care, and insurance records (and more)
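To illustrate why signature-based identification is more reliable than trusting a file extension, here is a minimal, language-neutral sketch in Python. It is not the Open Discover SDK API: the `SIGNATURES` table and the `identify_format` and `identify_file` helpers are hypothetical names invented for this example, and the table covers only a handful of well-known magic numbers.

```python
# Illustrative sketch only; NOT the Open Discover SDK API.
# Each entry pairs a well-known leading byte sequence with a format label.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"PK\x03\x04", "ZIP container (also used by DOCX/XLSX/PPTX)"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file (DOC/XLS/PPT/MSG)"),
    (b"7z\xbc\xaf\x27\x1c", "7ZIP"),
    (b"Rar!\x1a\x07", "RAR"),
]


def identify_format(header: bytes) -> str:
    """Return a format label based on the leading bytes, or 'unknown'."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def identify_file(path: str) -> str:
    """Read only the first bytes of a file and identify it by signature."""
    with open(path, "rb") as f:
        return identify_format(f.read(16))
```

With this approach, a PDF that has been renamed to `report.txt` is still reported as a PDF, because only the file's content is inspected. A real identifier needs far more than a prefix match (for example, looking inside a ZIP container to tell DOCX from XLSX), which is the kind of work a dedicated library takes on.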
The Open Discover SDK API is intended for developers building higher-level document processing applications, such as:
- Full text search using Lucene.NET
- Machine learning using extracted text and metadata
- Text analytics and document concept clustering
- Information governance
- Website crawling/full-text website search
- Enterprise search and content management
- IT departments: identifying, scanning the metadata of, and de-duplicating documents on file servers
- eDiscovery applications
- And more...
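As one concrete example of the de-duplication use case listed above, the sketch below groups files by a hash of their content. It is a generic Python illustration under stated assumptions, not code from the SDK: `file_digest` and `find_duplicates` are hypothetical helper names, and SHA-256 is an arbitrary choice of hash for this example.

```python
# Illustrative sketch only; NOT the Open Discover SDK API.
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Hash a file's content in chunks so large files are not loaded whole."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return lists of paths under root whose byte content is identical."""
    groups = defaultdict(list)
    for p in sorted(Path(root).rglob("*")):
        if p.is_file():
            groups[file_digest(p)].append(p)
    return [paths for paths in groups.values() if len(paths) > 1]
```

Hashing raw bytes only finds exact copies. Two documents with the same text but different embedded metadata would hash differently, which is one reason extracted text and metadata are useful inputs for de-duplication.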
For examples, check out the Open Discover SDK repository on GitHub (coming soon):
Open Discover SDK GitHub Example Repository