🤖 ML Project

NLP Document Classification & Entity Extraction API

NLP microservice classifying PDFs and extracting key entities.

Project Overview

Client: Internal Project

Industry: NLP / Document Processing

Timeline: 2 months (2024)

My Role: ML Engineer

Built an NLP microservice that automatically classifies documents and extracts structured data from unstructured PDFs.

The Challenge

Manual document processing was a bottleneck:

✗ Hours spent manually categorizing documents

✗ Inconsistent classification across team members

✗ Key data buried in unstructured text

✗ No searchable document database

✗ Compliance risks from misfiled documents

The team needed automated document understanding.

My Solution

I built a document processing pipeline with four components:

Text Extraction

PDF parsing for digital documents, with OCR for scanned ones.
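
A minimal sketch of how this step could look with PyMuPDF and Tesseract, both listed in the tech stack. The per-page fallback rule, the `MIN_CHARS_PER_PAGE` threshold, and the function names are illustrative assumptions, not the project's actual code. It also assumes the `pytesseract` and `Pillow` packages and a local Tesseract install.

```python
from typing import List

# Assumed threshold: pages with fewer characters than this are treated as scanned.
MIN_CHARS_PER_PAGE = 20


def page_needs_ocr(embedded_text: str, min_chars: int = MIN_CHARS_PER_PAGE) -> bool:
    """Return True when a page's text layer is empty or nearly empty."""
    return len(embedded_text.strip()) < min_chars


def extract_pdf_text(path: str) -> str:
    """Return the text of every page, running OCR only on pages that need it."""
    # Imported lazily so page_needs_ocr stays usable without these packages.
    import io

    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    pages: List[str] = []
    with fitz.open(path) as doc:
        for page in doc:
            text = page.get_text()
            if page_needs_ocr(text):
                # Render the page to a PNG in memory and pass it to Tesseract.
                pix = page.get_pixmap(dpi=300)
                image = Image.open(io.BytesIO(pix.tobytes("png")))
                text = pytesseract.image_to_string(image)
            pages.append(text)
    return "\n".join(pages)
```

Trying the embedded text layer first keeps digital PDFs fast, since OCR is typically the slowest part of a pipeline like this.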

Classification Model

Fine-tuned transformer model for document type classification.
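
As an illustration, inference with a fine-tuned model could be wrapped as below using the Hugging Face `pipeline` helper. The model directory, the confidence threshold, and the `"unknown"` fallback label are hypothetical and not taken from the project.

```python
from typing import Any, Callable, Dict, List


def pick_label(scores: List[Dict[str, Any]],
               min_confidence: float = 0.5,
               fallback: str = "unknown") -> str:
    """Return the highest-scoring label, or the fallback if it is not confident enough."""
    best = max(scores, key=lambda s: s["score"])
    return best["label"] if best["score"] >= min_confidence else fallback


def load_classifier(model_dir: str) -> Callable[..., Any]:
    """Load a fine-tuned checkpoint once; model_dir is a hypothetical local path."""
    from transformers import pipeline  # lazy import: only needed for real inference

    return pipeline("text-classification", model=model_dir)


def classify(text: str, classifier: Callable[..., Any],
             min_confidence: float = 0.5) -> str:
    """Classify one document's text with an already loaded classifier."""
    # top_k=None requests a score for every label; truncation guards long documents.
    scores = classifier(text, truncation=True, top_k=None)
    # Some library versions wrap a single input's scores in an outer list.
    if scores and isinstance(scores[0], list):
        scores = scores[0]
    return pick_label(scores, min_confidence)
```

Loading the model once and reusing it across requests avoids paying the load cost per document. Routing low-confidence predictions to a fallback label is one way to flag documents for manual review.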

Entity Extraction

spaCy NER for extracting names, dates, amounts, and custom entities.
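
A rough sketch of the extraction step, assuming spaCy's stock English model `en_core_web_sm` and its built-in `PERSON`, `DATE`, and `MONEY` labels. The output field names and the helper functions are illustrative. Custom entities would need a trained or rule-based component, which is not shown here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed mapping from spaCy's built-in labels to output field names.
LABEL_MAP = {"PERSON": "names", "DATE": "dates", "MONEY": "amounts"}


def group_entities(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (text, label) pairs by output field, dropping duplicates and unmapped labels."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for text, label in pairs:
        field = LABEL_MAP.get(label)
        if field and text not in grouped[field]:
            grouped[field].append(text)
    return dict(grouped)


def extract_entities(text: str, model: str = "en_core_web_sm") -> Dict[str, List[str]]:
    """Run spaCy NER over the text and return the grouped entities."""
    import spacy  # lazy import: group_entities works without spaCy installed

    nlp = spacy.load(model)  # in a service, load this once at startup instead
    doc = nlp(text)
    return group_entities((ent.text, ent.label_) for ent in doc.ents)
```

Keeping the grouping logic separate from the model call makes it easy to test without loading a model.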

REST API

FastAPI endpoint for document upload and processing.
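
One possible shape for the upload endpoint is sketched below. The `/documents` path, the response fields, and the PDF check are assumptions made for illustration. File uploads in FastAPI also require the `python-multipart` package.

```python
from typing import Any, Dict


def is_pdf(filename: str, head: bytes) -> bool:
    """Cheap upload check: a .pdf extension plus the %PDF- magic bytes."""
    return filename.lower().endswith(".pdf") and head.startswith(b"%PDF-")


def create_app():
    """Build the FastAPI app; imported lazily so is_pdf has no dependencies."""
    from fastapi import FastAPI, File, HTTPException, UploadFile

    app = FastAPI(title="Document Processing API")

    @app.post("/documents")  # hypothetical route name
    async def process_document(file: UploadFile = File(...)) -> Dict[str, Any]:
        data = await file.read()
        if not is_pdf(file.filename or "", data[:5]):
            raise HTTPException(status_code=415, detail="Only PDF uploads are supported")
        # Placeholders: a real service would run text extraction,
        # classification, and entity extraction on `data` here.
        return {
            "filename": file.filename,
            "size_bytes": len(data),
            "document_type": None,
            "entities": {},
        }

    return app
```

If this were saved as `service.py`, it could be served with `uvicorn service:create_app --factory`.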

Key Features

📄 PDF Processing: Extract text from scanned and digital PDFs.

🏷️ Classification: Auto-categorize into 20+ document types.

🔍 Entity Extraction: Extract dates, names, amounts, and more.

🔗 API Ready: RESTful API for easy integration.

Tech Stack

NLP: spaCy, Transformers, NLTK

API: FastAPI, Python

Document: PyMuPDF, Tesseract OCR

Hugging Face, PyTorch

[Screenshots: Entity Extraction, API Documentation, Classification Results]

Results & Impact

95% Classification Accuracy

20+ Document Types

2 sec Processing per Document

10K+ Documents Processed

Key Achievements

✓ 95% classification accuracy across 20+ document types

✓ Processing time under 2 seconds per document

✓ Extracted 50+ entity types with high precision

✓ Processed 10,000+ documents in production
