Legal, NLP, Classification | 9 min read

Garza, Robles & Cantu Law

NLP-Powered Document Triage & Case Classification for a Personal Injury Firm

Python, BERT, spaCy, AWS Textract, SageMaker, Clio API

Intake-to-Assessment: -73% (time saved)
Classification Accuracy: 94.2% (automated)
Critical Detail Miss Rate: -81% (reduction)

01 The Challenge

Garza, Robles & Cantu is a personal injury law firm in McAllen, TX, with 4 attorneys and 6 paralegals handling 200+ active cases at any given time. Their practice spans auto accidents (65% of caseload), slip-and-fall, workplace injuries, and medical malpractice. The Rio Grande Valley's high traffic volume on US-83 and I-2 means a steady flow of new cases — 8-12 per week.

The bottleneck was document processing. Every new case generates a stack of documents: police reports, medical records, insurance correspondence, witness statements, and billing records. Paralegals were spending 6-8 hours per case just reading, categorizing, and summarizing before an attorney could make an initial assessment. Document triage consumed nearly 60% of paralegal capacity.

The second problem was consistency. Different paralegals flagged different things. Critical details — a pre-existing condition buried on page 47, a liability-shifting phrase in a police report, or a gap in treatment that insurance adjusters exploit — were sometimes missed entirely. The firm also needed bilingual capability: approximately 40% of their clients are Spanish-speaking.

Data Landscape

The project started from 2,400 historically labeled documents across 7 types and 800 documents annotated for custom NER training. It also had to handle a bilingual document flow (English and Spanish) and integrate with the firm's Clio case management system.

02 Our Approach

We built a five-stage NLP pipeline: OCR ingestion, document classification, entity extraction, passage ranking, and structured summary generation. The system handles bilingual documents natively and integrates with the firm's existing case management workflow.
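
As a rough illustration of how five stages like these can be chained, the sketch below passes one document object through four per-document stages and then builds a case-level summary. It is not the production code: the `CaseDocument` fields, the stage function names, and the toy logic inside each stage are all assumptions standing in for the real OCR, classifier, NER, and reranker components.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class CaseDocument:
    """One document moving through the pipeline (field names are illustrative)."""
    doc_id: str
    raw_bytes: bytes
    text: str = ""
    doc_type: str = ""
    entities: List[Dict[str, str]] = field(default_factory=list)
    key_passages: List[str] = field(default_factory=list)


Stage = Callable[[CaseDocument], CaseDocument]


def run_pipeline(doc: CaseDocument, stages: List[Stage]) -> CaseDocument:
    """Apply each per-document stage in order."""
    for stage in stages:
        doc = stage(doc)
    return doc


# Toy stand-ins for the real components (OCR, BERT, NER, cross-encoder).
def ocr_stage(doc: CaseDocument) -> CaseDocument:
    doc.text = doc.raw_bytes.decode("utf-8", errors="replace")
    return doc


def classify_stage(doc: CaseDocument) -> CaseDocument:
    doc.doc_type = "police_report" if "officer" in doc.text.lower() else "other"
    return doc


def entity_stage(doc: CaseDocument) -> CaseDocument:
    for match in re.finditer(r"\$\d+(?:,\d{3})*(?:\.\d{2})?", doc.text):
        doc.entities.append({"label": "AMOUNT", "text": match.group()})
    return doc


def rank_stage(doc: CaseDocument) -> CaseDocument:
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", doc.text) if s.strip()]
    # Keep only sentences that mention an extracted entity.
    doc.key_passages = [s for s in sentences if any(e["text"] in s for e in doc.entities)]
    return doc


def summarize_case(docs: List[CaseDocument]) -> str:
    """Fifth stage: combine per-document results into one brief."""
    lines = []
    for d in docs:
        lines.append(f"[{d.doc_type}] {d.doc_id}: {' | '.join(d.key_passages)}")
    return "\n".join(lines)


PER_DOCUMENT_STAGES = [ocr_stage, classify_stage, entity_stage, rank_stage]
```

Keeping every stage behind the same function signature means a placeholder can be swapped for a real model call without touching `run_pipeline`.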

Fine-tuned BERT — multi-class document classifier trained on 2,400 labeled documents — 94.2% accuracy across 7 document types
spaCy + Custom NER — domain-specific named entity recognition for injury types, ICD-10 codes, policy numbers, and liability language
AWS Textract — OCR with layout preservation for scanned police reports and medical records
Cross-encoder Reranker — passage-level relevance scoring to surface the 10-15 most case-critical sentences from full document stacks
Bilingual Pipeline — fastText language detection routing Spanish documents through a parallel spaCy transformer model
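
The language-routing step in the last item can be sketched as follows. The function takes any callable shaped like fastText's `model.predict(text)`, which returns a tuple of labels (such as `__label__es`) and probabilities. The confidence threshold, the English fallback, and the `pipelines` mapping are assumptions made for illustration; the case study does not say which fastText model or cutoff was used.

```python
from typing import Callable, Dict, Sequence, Tuple

# Same shape as fastText's `model.predict(text)`: (labels, probabilities).
PredictFn = Callable[[str], Tuple[Sequence[str], Sequence[float]]]

SUPPORTED_LANGUAGES = {"en", "es"}


def detect_language(
    text: str,
    predict: PredictFn,
    min_confidence: float = 0.6,  # hypothetical cutoff
    default: str = "en",          # hypothetical fallback
) -> str:
    """Return 'en' or 'es', falling back to `default` when unsure."""
    # fastText predicts one line at a time, so newlines are flattened first.
    labels, probs = predict(text.replace("\n", " "))
    if not labels:
        return default
    lang = labels[0].replace("__label__", "")
    if lang not in SUPPORTED_LANGUAGES or probs[0] < min_confidence:
        return default
    return lang


def route(text: str, predict: PredictFn, pipelines: Dict[str, Callable[[str], object]]):
    """Send the text to the pipeline registered for its detected language."""
    return pipelines[detect_language(text, predict)](text)
```

With the real library, `predict` could be the bound method of a loaded language-identification model, for example `fasttext.load_model("lid.176.bin").predict`, assuming that model file has been downloaded.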

PDF Ingestion: Textract OCR + direct parse
Classification: BERT, 7 doc types
Entity Extraction: spaCy custom NER
Passage Ranking: Cross-encoder scoring
Case Summary: Structured 1-2 page brief
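
The passage-ranking stage comes down to scoring (query, passage) pairs and keeping the highest-scoring sentences. A minimal sketch follows, assuming the scorer is any callable that takes a list of pairs and returns one score per pair. A cross-encoder such as `sentence_transformers.CrossEncoder(...).predict` has that shape; the word-overlap scorer here is only a toy stand-in so the example runs without a model, and the query wording is hypothetical.

```python
import re
from typing import Callable, List, Sequence, Tuple

# One relevance score per (query, passage) pair.
ScoreFn = Callable[[List[Tuple[str, str]]], Sequence[float]]


def top_passages(
    query: str,
    passages: List[str],
    score_pairs: ScoreFn,
    k: int = 15,
) -> List[Tuple[str, float]]:
    """Return the k passages with the highest relevance score, best first."""
    if not passages:
        return []
    scores = score_pairs([(query, p) for p in passages])
    ranked = sorted(zip(passages, scores), key=lambda item: item[1], reverse=True)
    return [(passage, float(score)) for passage, score in ranked[:k]]


def overlap_scorer(pairs: List[Tuple[str, str]]) -> List[float]:
    """Toy scorer: fraction of query words that also appear in the passage."""
    scores = []
    for query, passage in pairs:
        q = set(re.findall(r"[a-z0-9]+", query.lower()))
        p = set(re.findall(r"[a-z0-9]+", passage.lower()))
        scores.append(len(q & p) / len(q) if q else 0.0)
    return scores
```

Because the scorer is injected, the same `top_passages` function can be exercised with the toy scorer in tests and with a trained cross-encoder in production.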

03 Key Findings

Document Classification Confusion Matrix

[Figure: confusion matrix heatmap. Both axes list the 7 document types: Police Report, Medical Record, Billing Invoice, Insurance Corr., Legal Filing, Witness Stmt., Correspondence. Color scale runs from Low to High.]

Classification performance across 7 document types. High diagonal values indicate strong accuracy. The main confusion corridor is between insurance correspondence and legal filings — they share overlapping legal language and formatting.
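
For readers who want to derive the same kind of summary from a confusion matrix, the sketch below computes accuracy, macro-F1, and the most confused pair of classes. It assumes `cm[i][j]` counts documents whose true class is `i` and predicted class is `j`. The function names are illustrative, and any matrix passed in would be your own data; the firm's actual counts are not given in this case study.

```python
from typing import Dict, List, Tuple


def accuracy(cm: List[List[int]]) -> float:
    """Share of documents on the diagonal (true class == predicted class)."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total if total else 0.0


def per_class_f1(cm: List[List[int]], labels: List[str]) -> Dict[str, float]:
    """F1 per class, using F1 = 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for i, label in enumerate(labels):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp                  # true class i, predicted elsewhere
        fp = sum(row[i] for row in cm) - tp   # predicted i, true class elsewhere
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(cm: List[List[int]], labels: List[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(cm, labels)
    return sum(scores.values()) / len(scores) if scores else 0.0


def most_confused_pair(cm: List[List[int]], labels: List[str]) -> Tuple[str, str]:
    """Pair of classes with the most errors between them, in either direction."""
    best, best_count = (labels[0], labels[0]), -1
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            count = cm[i][j] + cm[j][i]
            if count > best_count:
                best, best_count = (labels[i], labels[j]), count
    return best
```

Macro-F1 weights every document type equally, so it is the number to watch when rarer types such as witness statements matter as much as common ones.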

Processing Time: Manual vs. Automated

Hours to complete document triage by case type. Complex cases (catastrophic injury, med-mal) show the largest absolute savings — from 11-14 hours down to 2.5-3 hours. Even simple auto accident cases see a 4x speedup.

Entity Extraction Coverage Over Time

Percentage of key case entities (dates, injury codes, providers, amounts) successfully extracted. Step improvements at Q2, Q3, and Q4 correspond to quarterly model retraining on paralegal-corrected outputs.
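
One plausible way to compute a coverage figure like this is to compare what the extractor found against the paralegal-corrected entities for each document. The sketch below makes that assumption explicit: entities are `(label, normalized value)` pairs and a match must be exact. The case study does not spell out its matching rule, so treat this as one reasonable definition with hypothetical function names.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Entity = Tuple[str, str]  # (label, normalized value)


def overall_coverage(expected: List[Set[Entity]], extracted: List[Set[Entity]]) -> float:
    """Fraction of expected entities that the extractor also produced."""
    total = sum(len(gold) for gold in expected)
    hits = sum(len(gold & pred) for gold, pred in zip(expected, extracted))
    return hits / total if total else 0.0


def coverage_by_label(
    expected: List[Set[Entity]], extracted: List[Set[Entity]]
) -> Dict[str, float]:
    """Same ratio, broken down by entity label."""
    wanted, found = Counter(), Counter()
    for gold, pred in zip(expected, extracted):
        for label, _value in gold:
            wanted[label] += 1
        for label, _value in gold & pred:
            found[label] += 1
    return {label: found[label] / wanted[label] for label in wanted}
```

The per-label breakdown shows which entity types would benefit most from the next round of retraining.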

04 Business Impact

Intake-to-Assessment: 6-8 hours -> 1.5-2 hours (-73%)
Classification Accuracy: manual review -> 94.2% (automated)
Critical Detail Miss Rate: baseline -> reduced (-81%)
Paralegal Time Freed: 60% on triage -> reallocated (~35 hrs/wk)

Projected Annual Value

~35 paralegal hours freed per week, reallocated to client communication and case strategy

The immediate impact was speed: triage that used to take a paralegal 6-8 hours per case now takes under two, with the system handling the reading, classifying, and flagging automatically. The structured case summary became the attorneys' preferred starting point for initial assessments. Several noted that the "red flags" section (pre-existing conditions, treatment gaps, inconsistent statements) caught details they would have spent hours finding manually.

The bilingual pipeline was a quiet win. Spanish-language police reports and medical records from facilities across the border are now processed with the same accuracy as English documents, with entity extraction working cross-lingually. The firm estimates this alone saved 4-5 hours per week that were previously spent on manual translation and cross-referencing.

05 Technical Details

Document Classifier (BERT)

Base model: bert-base-uncased, fine-tuned on 2,400 labeled documents
Classes: police report, medical record, billing invoice, insurance correspondence, legal filing, witness statement, other
Evaluation: 94.2% accuracy, macro-F1 = 0.91 (stratified 5-fold CV)
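
Stratified 5-fold cross-validation keeps each document type at roughly the same share in every fold, which matters when some types are rare. The sketch below is a dependency-free illustration of the splitting step only; in practice a library splitter such as scikit-learn's `StratifiedKFold` would be the usual choice. The function name and the fixed seed are assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, Iterator, List, Sequence, Tuple


def stratified_kfold(
    labels: Sequence[str], k: int = 5, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for k folds with balanced classes."""
    by_class: Dict[str, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal each class round-robin so per-fold counts differ by at most one.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, test
```

Each document lands in exactly one test fold, so averaging the five fold scores uses every labeled document for evaluation once.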

Named Entity Recognition (spaCy)

Custom entities: DATE_OF_INCIDENT, INJURY_TYPE (ICD-10), PROVIDER, CARRIER, POLICY_NUM, AMOUNT
Training: 800 annotated documents, Prodigy-assisted labeling
Bilingual: English (en_core_web_trf) + Spanish (es_dep_news_trf) models
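
To make the entity labels above concrete, here is a small rule-based baseline that tags three of them with regular expressions. This is not the trained spaCy model described above. It is only a sketch of the kind of output such a component produces. The ICD-10 pattern checks the code format only, and the policy-number pattern is a hypothetical format chosen for illustration.

```python
import re
from typing import Dict, List, Union

# Format-only patterns; none of these validate against real code lists.
PATTERNS = {
    # ICD-10-style code: letter, digit, alphanumeric, optional ".xxxx" suffix.
    "INJURY_TYPE": re.compile(r"\b[A-Z]\d[0-9A-Z](?:\.[0-9A-Z]{1,4})?\b"),
    # Dollar amount with optional thousands separators and cents.
    "AMOUNT": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?"),
    # Hypothetical policy format: "Policy No. <ID>" with an uppercase ID.
    "POLICY_NUM": re.compile(r"Policy\s*(?:No\.?|#|Number)?\s*:?\s*([A-Z0-9-]{6,})"),
}


def extract_entities(text: str) -> List[Dict[str, Union[str, int]]]:
    """Return labeled spans sorted by position, similar to NER output."""
    found = []
    for label, pattern in PATTERNS.items():
        # Use the capture group when the pattern defines one, else the full match.
        group = 1 if pattern.groups else 0
        for match in pattern.finditer(text):
            found.append(
                {
                    "label": label,
                    "text": match.group(group),
                    "start": match.start(group),
                    "end": match.end(group),
                }
            )
    return sorted(found, key=lambda entity: entity["start"])
```

A baseline like this is also a handy sanity check: disagreements between the rules and the trained model are good candidates for annotation review.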

Infrastructure (AWS)

OCR: AWS Textract with layout detection for multi-column medical records
Hosting: SageMaker endpoint for BERT classifier, Lambda for orchestration
Integration: Clio case management API for automatic case file attachment
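
A minimal sketch of how a Lambda handler could tie these pieces together: read the uploaded file's location from an S3 event, pull text lines out with Textract, and send the text to the SageMaker endpoint. Several details are assumptions: the S3 trigger itself, the endpoint name, the `{"inputs": ...}` request body, and the shape of the endpoint's JSON response. It uses the synchronous `detect_document_text` call for brevity rather than layout analysis, and the Clio upload step is left out because the case study does not describe it.

```python
import json
import os
from urllib.parse import unquote_plus

# Hypothetical endpoint name, overridable through an environment variable.
ENDPOINT_NAME = os.environ.get("CLASSIFIER_ENDPOINT", "doc-classifier")


def extract_text(bucket: str, key: str, textract_client) -> str:
    """Run synchronous Textract OCR on an S3 object and join its LINE blocks."""
    response = textract_client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return "\n".join(
        block["Text"] for block in response["Blocks"] if block["BlockType"] == "LINE"
    )


def classify_text(text: str, runtime_client, endpoint_name: str = ENDPOINT_NAME) -> dict:
    """Send text to the classifier endpoint; the payload format is an assumption."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps({"inputs": text}),
    )
    return json.loads(response["Body"].read())


def handler(event, context, textract_client=None, runtime_client=None):
    """Lambda entry point; clients can be injected for local testing."""
    if textract_client is None or runtime_client is None:
        import boto3  # imported lazily so stubs work without AWS credentials

        textract_client = textract_client or boto3.client("textract")
        runtime_client = runtime_client or boto3.client("sagemaker-runtime")

    s3_info = event["Records"][0]["s3"]
    bucket = s3_info["bucket"]["name"]
    key = unquote_plus(s3_info["object"]["key"])  # S3 event keys are URL-encoded

    text = extract_text(bucket, key, textract_client)
    prediction = classify_text(text, runtime_client)
    return {"bucket": bucket, "key": key, "prediction": prediction}
```

Multi-page PDFs would likely need Textract's asynchronous operations instead of the single synchronous call used here.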
