Open Source Tool for Text Extraction

AI startup Sorcero Inc. is releasing a free open source software ingestion framework it calls Ingestum.

It supports sourcing and transformation of a wide variety of data and document types into a uniform document format.

“Data and Analytics executives tell us that unstructured documents are full of data they need but can’t access. We want organizations to benefit from AI and ingestion is a significant barrier. We think open-sourcing Ingestum will democratise ingestion,” said Dipanwita Das, Sorcero CEO & Co-founder.

Ingestion of arbitrary and unstructured content formats - PDF files, Microsoft Office documents, email threads, and so on - presents a challenge in the AI industry.

Sorcero claims the ingestion market is extremely fragmented with many niche players, and most AI firms handle the ingestion of unstructured text in-house. The Ingestum framework aims to provide a methodical, reusable, extensible, and scalable framework for ingesting content, free and open to all.

Written in Python and built around reusable, programmable pipelines, Ingestum - from the Latin word meaning “to ingest” or “to toss in” - is largely agnostic to both source and output formats. It is designed to be extended through plugins, and it can be deployed as a command-line tool or as a web service. Ingestum integrates existing FOSS projects such as PDFMiner, Google’s Tesseract OCR engine, and Mozilla’s DeepSpeech speech-to-text engine.
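To give a rough idea of what a plugin-extensible ingestion pipeline can look like in Python, here is a minimal sketch. It is not Ingestum's actual API: the `Document` class, the `register` decorator, the `REGISTRY` dictionary, and `run_pipeline` are all hypothetical names chosen for illustration. The sketch only shows the general pattern of named, reusable steps that each turn one uniform document into another.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


# Hypothetical uniform document type (an assumption, not Ingestum's class).
@dataclass
class Document:
    content: str
    metadata: Dict[str, str] = field(default_factory=dict)


# A transformer takes one document and returns a new one.
Transformer = Callable[[Document], Document]

# Plugin registry: maps a step name to a transformer, so a pipeline
# can be described as a plain list of names.
REGISTRY: Dict[str, Transformer] = {}


def register(name: str) -> Callable[[Transformer], Transformer]:
    """Decorator that adds a transformer to the registry under `name`."""
    def decorator(func: Transformer) -> Transformer:
        REGISTRY[name] = func
        return func
    return decorator


@register("strip_whitespace")
def strip_whitespace(doc: Document) -> Document:
    # Collapse runs of whitespace into single spaces.
    return Document(" ".join(doc.content.split()), dict(doc.metadata))


@register("lowercase")
def lowercase(doc: Document) -> Document:
    return Document(doc.content.lower(), dict(doc.metadata))


def run_pipeline(steps: List[str], doc: Document) -> Document:
    """Apply the named transformers in order and return the final document."""
    for name in steps:
        doc = REGISTRY[name](doc)
    return doc


if __name__ == "__main__":
    raw = Document("  Hello   WORLD \n", {"source": "email"})
    result = run_pipeline(["strip_whitespace", "lowercase"], raw)
    print(result.content)  # prints "hello world"
```

In this pattern a new format or cleaning step is added by registering one more function, without touching the code that runs the pipeline.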

“Ingestum leverages many existing open source projects, so no one has to reinvent the wheel; it can easily integrate existing workflows, or incorporate existing software as plugins,” said Walter Bender, CTO and Co-founder of Sorcero.

Sorcero, recently featured at the LOINC and InsurTech NY conferences, invites IT directors, software engineers, and AI researchers to download and use Ingestum today (git clone https://gitlab.com/sorcero/community/ingestum.git).

 
