Unlocking insights from PDFs using a purpose-built annotation tool

Ravi Shankar
Dec 7, 2020


Today, many enterprises extract data from scanned documents, such as PDFs, tables and forms, either through manual data entry, which is slow, expensive and error-prone, or through simple OCR software whose manual configuration must be updated every time a form changes. To replace these manual processes, deep learning-based approaches have been developed that can read and process virtually any type of document, accurately extracting printed text, handwriting, forms, tables and other data without manual effort or custom code.

While many purpose-built third-party software products are available, cloud providers have democratized OCR capabilities. Popular cloud services include Amazon Textract, Google Cloud Vision and Microsoft Azure’s OCR service. Many enterprises have adopted these services to unlock data from PDFs and image documents, so we recommend that customers not spend cycles and valuable data science effort on building their own OCR systems.

When your organization processes a variety of documents, you sometimes need to extract entities from unstructured text in the documents. A contract document, for example, may list names and other contract terms within paragraphs of text rather than in a key/value or form structure. Amazon Comprehend is a natural language processing (NLP) service that can extract key phrases, places, names, organizations, events, sentiment and more from unstructured text. With custom entity recognition, you can identify new entity types that are not among the preset generic entity types. This allows you to extract business-specific entities to address your needs.
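As a rough illustration of the entity extraction described above, here is a minimal sketch that wraps the Comprehend `detect_entities` call and keeps only confident results. The helper name `extract_entities`, the `min_score` threshold and the output field names are assumptions made for this example; the endpoint ARN would come from your own custom entity recognizer.

```python
from typing import Any, Dict, List, Optional


def extract_entities(
    client: Any,
    text: str,
    min_score: float = 0.8,
    endpoint_arn: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Return entities whose confidence score is at least min_score.

    `client` is expected to behave like a boto3 Comprehend client.
    The helper name, threshold and output keys are illustrative only.
    """
    if endpoint_arn:
        # Custom entity recognizer served from a real-time endpoint.
        response = client.detect_entities(Text=text, EndpointArn=endpoint_arn)
    else:
        # Built-in, generic entity types.
        response = client.detect_entities(Text=text, LanguageCode="en")

    return [
        {
            "text": entity["Text"],
            "type": entity["Type"],
            "begin": entity["BeginOffset"],
            "end": entity["EndOffset"],
        }
        for entity in response["Entities"]
        if entity["Score"] >= min_score
    ]
```

With boto3 installed and AWS credentials configured, `client` would typically be `boto3.client("comprehend")`. Passing the client in as a parameter keeps the helper easy to try out with a stub before calling the real service.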

Custom entity recognition models require high-quality labeled data for training. Annotating blobs of text makes it very hard to understand the context of a document. Let’s say we need to extract key skills and experience from the document below. You can see that annotating the original PDF is much easier and more accurate than annotating the OCRed blob of text.

Original PDF Document

OCRed Bag of Text

Experience Assistant Branch ManagerOctober 2008 to Current University Federal Credit Union — Salt Lake City, UT Manage operations and production at 2 branches, including high volume university campus branch. Hire, train, develop, motivate, coach and discipline branch personnel. Achieve and maintain high level of member service through sales and team meetings. Run daily branch activities and ensure compliance with established credit union policies and procedures.Mentor employees in effective sales methods and techniques. Determine feasibility of loans by analyzing financial status, credit and property evaluation of applicants. Provide expert financial advice on mortgage, educational and personal loans Deliver informational sales presentations to potential business members and build new relationships Perform marketing by participating in and sponsoring community events. Quality Coach January 2004 to September 2008 Discover Financial Services — Salt Lake City, UT Leveraged expertise in resolving issues to satisfy customers for Retention Department…..
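To make the labeling problem concrete, the sketch below shows the general shape NER labels take on plain text: character-offset spans with an entity type. The helper `find_spans` and the label names `JOB_TITLE` and `TENURE` are hypothetical and only serve the illustration. Note how the job title and the date run together in the OCRed text ("ManagerOctober"), which is exactly the kind of boundary an annotator has to guess at without the page layout.

```python
from typing import Dict, List, Tuple


def find_spans(text: str, phrases: Dict[str, str]) -> List[Tuple[int, int, str]]:
    """Locate every occurrence of each phrase and return (begin, end, label) tuples."""
    spans = []
    for phrase, label in phrases.items():
        start = text.find(phrase)
        while start != -1:
            spans.append((start, start + len(phrase), label))
            start = text.find(phrase, start + len(phrase))
    # Sort by begin offset so the spans follow reading order.
    return sorted(spans)


# First part of the OCRed text shown above.
ocr_text = (
    "Experience Assistant Branch ManagerOctober 2008 to Current "
    "University Federal Credit Union"
)

# Hypothetical label names, chosen only for this example.
labels = {
    "Assistant Branch Manager": "JOB_TITLE",
    "October 2008 to Current": "TENURE",
}

for begin, end, label in find_spans(ocr_text, labels):
    print(begin, end, label, repr(ocr_text[begin:end]))
```

In the original PDF the title and the dates sit in visually distinct positions, so the same two spans are obvious at a glance.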

So, how do I go about achieving better PDF annotation? Before diving deep, a bit about labeling tools. While there are many labeling tools on the market, Amazon SageMaker Ground Truth offers a lot of flexibility to create custom annotation UIs and, unlike other tools, it is entirely pay-as-you-go, which means we can experiment extensively without lock-in.

We demonstrate how you can use the PDF annotation tool developed by Objectways to label PDF documents for Named Entity Recognition (NER). The annotation tool supports labeling entities, relationships among entities, overlapping entities and document classification, along with a custom notes field, all in a single annotation UI. The tool is easy to configure and compatible with SageMaker Ground Truth. It also supports multi-page annotation. The input to the annotation tool is a searchable PDF, which can be easily created using a freely available utility on GitHub. The utility uses Amazon Textract to OCR the document and then creates a searchable PDF as the output.
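For readers curious about what "searchable PDF" involves, here is a minimal sketch of one piece of the idea: turning the line-level bounding boxes in a Textract response into page coordinates where a hidden text layer could be placed. This is not the code of the GitHub utility mentioned above. The helper name `line_overlays` and the output keys are assumptions for illustration; the `Blocks`, `BlockType`, `Text` and `Geometry.BoundingBox` fields follow the Textract response format.

```python
from typing import Any, Dict, List


def line_overlays(
    response: Dict[str, Any], page_width: float, page_height: float
) -> List[Dict[str, Any]]:
    """Map Textract LINE blocks to text positions in PDF points.

    Illustrative helper, not part of any library.
    """
    overlays = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue
        box = block["Geometry"]["BoundingBox"]
        # Textract boxes are ratios of the page size with the origin at the
        # top-left; PDF user space has its origin at the bottom-left, so the
        # y axis is flipped here.
        overlays.append(
            {
                "text": block["Text"],
                "x": box["Left"] * page_width,
                "y": page_height - (box["Top"] + box["Height"]) * page_height,
                "width": box["Width"] * page_width,
                "height": box["Height"] * page_height,
            }
        )
    return overlays
```

A PDF library would then draw each `text` invisibly at `(x, y)` on top of the scanned page image, which is what makes the text selectable and searchable in the annotation UI.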

Here are simple steps to get started:

  1. Use the searchable PDF tool to create searchable PDFs
  2. Contact Objectways (sales@objectways.com) to set up a data labeling job in SageMaker Ground Truth
  3. Our expert annotators will label your data (we do multiple quality passes to ensure high-quality labels)
  4. Output labels are saved to your S3 bucket
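Once the labels land in S3, a typical next step is reading them back. Ground Truth writes its output manifest as JSON Lines, one record per input object, so a small parser is enough to get started. In the sketch below, the label attribute name `pdf-ner`, the bucket name and the shape of the label payload are all hypothetical; the actual structure depends on how the labeling job is configured.

```python
import json
from typing import Any, Dict, Iterable, List


def parse_manifest(lines: Iterable[str], label_attribute: str) -> List[Dict[str, Any]]:
    """Parse JSON Lines manifest records into (source, labels) pairs."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        records.append(
            {
                # "source-ref" points at the labeled object in S3.
                "source": record.get("source-ref"),
                # The labels sit under the job's label attribute name.
                "labels": record.get(label_attribute),
            }
        )
    return records


# Hypothetical manifest content, for illustration only.
sample_manifest = [
    '{"source-ref": "s3://example-bucket/resume-1.pdf", "pdf-ner": {"entities": []}}',
    '{"source-ref": "s3://example-bucket/resume-2.pdf"}',
]

for item in parse_manifest(sample_manifest, "pdf-ner"):
    print(item["source"], item["labels"])
```

Records without the label attribute come back with `labels` set to `None`, which makes it easy to spot documents that were not annotated.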

See the tool in action below.

Objectways PDF NLP Annotation Tool

Contact sales@objectways.com for more information

About Objectways:

We are a social impact data labeling services and machine learning company with a proven track record of providing our customers with high-quality labels. We have built and operated many complex workflows to solve customer problems in computer vision and NLP. Please contact us at sales@objectways.com for more information.



Ravi Shankar

Co-founder of objectways.com, passionate about Machine Learning and Data labeling
