Legal · NLP · Classification · 9 min read

Garza, Robles & Cantu Law

NLP-Powered Document Triage & Case Classification for a Personal Injury Firm

Python · BERT · spaCy · AWS Textract · SageMaker · Clio API

  • Intake-to-Assessment: -73% (time saved)
  • Classification Accuracy: 94.2% (automated)
  • Critical Detail Miss Rate: -81% (reduction)

01 The Challenge

Garza, Robles & Cantu is a personal injury law firm in McAllen, TX, with 4 attorneys and 6 paralegals handling 200+ active cases at any given time. Their practice spans auto accidents (65% of caseload), slip-and-fall, workplace injuries, and medical malpractice. The Rio Grande Valley's high traffic volume on US-83 and I-2 means a steady flow of new cases — 8-12 per week.

The bottleneck was document processing. Every new case generates a stack of documents: police reports, medical records, insurance correspondence, witness statements, and billing records. Paralegals were spending 6-8 hours per case just reading, categorizing, and summarizing before an attorney could make an initial assessment. Document triage consumed nearly 60% of paralegal capacity.

The second problem was consistency. Different paralegals flagged different things. Critical details — a pre-existing condition buried on page 47, a liability-shifting phrase in a police report, or a gap in treatment that insurance adjusters exploit — were sometimes missed entirely. The firm also needed bilingual capability: approximately 40% of their clients are Spanish-speaking.

Data Landscape

The project started from 2,400 historically labeled documents across 7 types and 800 documents annotated for custom NER training. The document flow is bilingual (English and Spanish), and the output had to integrate with the firm's Clio case management system.

02 Our Approach

We built a five-stage NLP pipeline: OCR ingestion, document classification, entity extraction, passage ranking, and structured summary generation. The system handles bilingual documents natively and integrates with the firm's existing case management workflow.

  • Fine-tuned BERT: multi-class document classifier trained on 2,400 labeled documents, reaching 94.2% accuracy across 7 document types
  • spaCy + Custom NER: domain-specific named entity recognition for injury types, ICD-10 codes, policy numbers, and liability language
  • AWS Textract: OCR with layout preservation for scanned police reports and medical records
  • Cross-encoder Reranker: passage-level relevance scoring to surface the 10-15 most case-critical sentences from full document stacks
  • Bilingual Pipeline: fastText language detection routing Spanish documents through a parallel spaCy transformer model
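The passage-ranking step in the list above can be sketched roughly as follows. This is a minimal illustration, not the production code: `score_fn` is a hypothetical stand-in for the cross-encoder (for instance, the `predict` method of a sentence-transformers `CrossEncoder`, which accepts a list of (query, passage) pairs), and the query text, sample sentences, and toy keyword scorer are all made up for the example.

```python
from typing import Callable, List, Sequence, Tuple

# A scorer takes (query, passage) pairs and returns one relevance score per pair.
ScoreFn = Callable[[Sequence[Tuple[str, str]]], Sequence[float]]


def top_passages(query: str, passages: List[str], score_fn: ScoreFn,
                 k: int = 15) -> List[Tuple[float, str]]:
    """Return the k highest-scoring passages, best first."""
    if not passages:
        return []
    pairs = [(query, p) for p in passages]
    scores = score_fn(pairs)
    # Sort on the score only, so ties keep their original document order.
    ranked = sorted(zip(scores, passages), key=lambda sp: sp[0], reverse=True)
    return [(float(s), p) for s, p in ranked[:k]]


def keyword_scorer(pairs: Sequence[Tuple[str, str]]) -> List[float]:
    """Toy stand-in for a cross-encoder: share of query words found in the passage."""
    scores = []
    for query, passage in pairs:
        words = query.lower().split()
        text = passage.lower()
        scores.append(sum(w in text for w in words) / max(len(words), 1))
    return scores


# Hypothetical sentences, invented for illustration only.
sentences = [
    "Patient reports prior lumbar surgery in 2019.",
    "Invoice total due within 30 days.",
    "No treatment was recorded between March and June.",
]
flags = top_passages("prior surgery treatment gap", sentences, keyword_scorer, k=2)
for score, sentence in flags:
    print(f"{score:.2f}  {sentence}")
```

Keeping the scorer injectable means the same selection logic can be exercised with a cheap stub in tests and with the real model in production.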

  • PDF Ingestion: Textract OCR + direct parse
  • Classification: BERT, 7 doc types
  • Entity Extraction: spaCy custom NER
  • Passage Ranking: cross-encoder scoring
  • Case Summary: structured 1-2 page brief
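The five stages above can be read as a chain of functions that each enrich one shared record. The sketch below illustrates that shape only. The `CaseDocument` fields, stage function names, and the toy logic inside each stage are assumptions made for the example, standing in for Textract, the BERT classifier, spaCy NER, and the cross-encoder.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class CaseDocument:
    """State carried through the pipeline; fields are filled in stage by stage."""
    raw_text: str = ""
    doc_type: str = ""
    entities: Dict[str, List[str]] = field(default_factory=dict)
    key_passages: List[str] = field(default_factory=list)
    summary: str = ""


Stage = Callable[[CaseDocument], CaseDocument]


def run_pipeline(doc: CaseDocument, stages: List[Stage]) -> CaseDocument:
    """Apply each stage in order; every stage reads and extends the same record."""
    for stage in stages:
        doc = stage(doc)
    return doc


def ingest(doc: CaseDocument) -> CaseDocument:
    # Placeholder for OCR / direct PDF parsing.
    doc.raw_text = doc.raw_text.strip()
    return doc


def classify(doc: CaseDocument) -> CaseDocument:
    # Placeholder for the document classifier.
    doc.doc_type = "police_report" if "officer" in doc.raw_text.lower() else "other"
    return doc


def extract(doc: CaseDocument) -> CaseDocument:
    # Placeholder for NER: collect tokens that look like dollar amounts.
    amounts = [t.rstrip(".,") for t in doc.raw_text.split() if t.startswith("$")]
    doc.entities = {"AMOUNT": amounts}
    return doc


def rank(doc: CaseDocument) -> CaseDocument:
    # Placeholder for passage ranking: keep at most 15 sentences.
    doc.key_passages = [s.strip() for s in doc.raw_text.split(".") if s.strip()][:15]
    return doc


def summarize(doc: CaseDocument) -> CaseDocument:
    doc.summary = (f"{doc.doc_type}: {len(doc.key_passages)} key passages, "
                   f"amounts={doc.entities.get('AMOUNT', [])}")
    return doc


brief = run_pipeline(
    CaseDocument(raw_text=" Officer noted damage of $4,200. Driver cited. "),
    [ingest, classify, extract, rank, summarize],
)
print(brief.summary)
```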

03 Key Findings

Document Classification Confusion Matrix

                 Police     Medical    Billing    Insurance  Legal      Witness
                 Report     Record     Invoice    Corr.      Filing     Stmt.      Correspondence
Police Report    96         1          0          1          1          1          0
Medical Record   1          93         3          1          0          1          1
Billing Invoice  0          2          95         1          0          0          2
Insurance Corr.  1          1          1          86         7          2          2
Legal Filing     1          0          0          6          88         2          3
Witness Stmt.    2          1          0          1          2          92         2
Correspondence   0          1          2          3          2          1          91

(Each row sums to 100.)

Classification performance across 7 document types. High diagonal values indicate strong accuracy. The main confusion corridor is between insurance correspondence and legal filings — they share overlapping legal language and formatting.
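The confusion corridor mentioned in the caption can be checked directly from the matrix values. The sketch below assumes that rows are the actual document type, columns are the predicted type, and cells are percentages. The source does not label the axes, so that reading is an assumption, supported only by the fact that each row sums to 100.

```python
LABELS = ["Police Report", "Medical Record", "Billing Invoice", "Insurance Corr.",
          "Legal Filing", "Witness Stmt.", "Correspondence"]

# Values copied from the confusion matrix; rows assumed to be the actual class.
MATRIX = [
    [96, 1, 0, 1, 1, 1, 0],
    [1, 93, 3, 1, 0, 1, 1],
    [0, 2, 95, 1, 0, 0, 2],
    [1, 1, 1, 86, 7, 2, 2],
    [1, 0, 0, 6, 88, 2, 3],
    [2, 1, 0, 1, 2, 92, 2],
    [0, 1, 2, 3, 2, 1, 91],
]


def per_class_rate(matrix):
    """Diagonal cell divided by its row total, one value per class."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


def worst_confusions(matrix, labels, top=2):
    """Largest off-diagonal cells as (value, row label, column label)."""
    n = len(matrix)
    cells = [(matrix[i][j], labels[i], labels[j])
             for i in range(n) for j in range(n) if i != j]
    return sorted(cells, reverse=True)[:top]


rates = per_class_rate(MATRIX)
mean_rate = sum(rates) / len(rates)
confusions = worst_confusions(MATRIX, LABELS)

for label, rate in zip(LABELS, rates):
    print(f"{label:<16} {rate:.2f}")
print(f"unweighted mean of diagonal: {mean_rate:.3f}")
print(confusions)
```

Under that reading, the two largest off-diagonal cells are Insurance Corr. → Legal Filing (7) and Legal Filing → Insurance Corr. (6), which matches the corridor described in the caption.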

Processing Time: Manual vs. Automated

Hours to complete document triage by case type. Complex cases (catastrophic injury, med-mal) show the largest absolute savings — from 11-14 hours down to 2.5-3 hours. Even simple auto accident cases see a 4x speedup.

Entity Extraction Coverage Over Time

Percentage of key case entities (dates, injury codes, providers, amounts) successfully extracted. Step improvements at Q2, Q3, and Q4 correspond to quarterly model retraining on paralegal-corrected outputs.

04 Business Impact

  • Intake-to-Assessment: 6-8 hours → 1.5-2 hours (-73%)
  • Classification Accuracy: manual review → 94.2% (automated)
  • Critical Detail Miss Rate: baseline → reduced (-81%)
  • Paralegal Time Freed: 60% on triage → reallocated (~35 hrs/wk)

Projected Annual Value

~35 paralegal hours freed per week, reallocated to client communication and case strategy

The immediate impact was speed: triage that used to take a paralegal most of a working day (6-8 hours) now takes under two hours, with the system handling the reading, classifying, and flagging automatically. The structured case summary became the attorneys' preferred starting point for initial assessments. Several noted that the "red flags" section (pre-existing conditions, treatment gaps, inconsistent statements) caught details they would have spent hours finding manually.

The bilingual pipeline was a quiet win. Spanish-language police reports and medical records from facilities across the border are now processed with the same accuracy as English documents, with entity extraction working cross-lingually. The firm estimates this alone saved 4-5 hours per week that were previously spent on manual translation and cross-referencing.

05 Technical Details

Document Classifier (BERT)

  • Base model: bert-base-uncased, fine-tuned on 2,400 labeled documents
  • Classes: police report, medical record, billing, insurance corr., legal filing, witness stmt., other (shown as "Correspondence" in the confusion matrix)
  • Evaluation: 94.2% accuracy, macro-F1 = 0.91 (stratified 5-fold CV)
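Accuracy and macro-F1 are reported side by side because they answer different questions: accuracy weights every document equally, while macro-F1 weights every class equally, so a weak minority class pulls it down. The sketch below computes both in plain Python on a made-up toy label set. It is not the project's evaluation code, and in practice scikit-learn's `f1_score(..., average="macro")` together with `StratifiedKFold` would be the more usual route.

```python
from collections import Counter
from typing import List


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Share of documents whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of per-class F1 scores."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data, invented for illustration: 8 police reports, 2 legal filings,
# with one legal filing misclassified as a police report.
y_true = ["police"] * 8 + ["legal"] * 2
y_pred = ["police"] * 8 + ["police", "legal"]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.3f}  macro-F1={f1:.3f}")
```

On the toy data a single error in the small class leaves accuracy at 0.9 but drops macro-F1 to roughly 0.80, which is the kind of gap the two reported numbers are meant to expose.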

Named Entity Recognition (spaCy)

  • Custom entities: DATE_OF_INCIDENT, INJURY_TYPE (ICD-10), PROVIDER, CARRIER, POLICY_NUM, AMOUNT
  • Training: 800 annotated documents, Prodigy-assisted labeling
  • Bilingual: English (en_core_web_trf) + Spanish (es_dep_news_trf) models
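Routing between the two spaCy models could look something like the sketch below. The `detect` argument is a hypothetical wrapper that returns the top language label for a text. With fastText's language-identification model that would wrap `model.predict(text)`, which returns labels in the form `__label__es`. The fallback to English for unsupported languages and the toy detector are assumptions made for the example, not documented behavior of the system.

```python
from typing import Callable, Dict

# Model names as listed in the technical details above.
SPACY_MODELS: Dict[str, str] = {
    "en": "en_core_web_trf",
    "es": "es_dep_news_trf",
}


def parse_fasttext_label(label: str) -> str:
    """Turn '__label__es' into 'es'; leave other strings unchanged."""
    prefix = "__label__"
    return label[len(prefix):] if label.startswith(prefix) else label


def route_document(text: str, detect: Callable[[str], str], default: str = "en") -> str:
    """Pick the spaCy model name for a document based on its detected language."""
    lang = parse_fasttext_label(detect(text))
    # Assumption: unsupported languages fall back to the default model.
    return SPACY_MODELS.get(lang, SPACY_MODELS[default])


def toy_detector(text: str) -> str:
    """Stand-in for a real language-ID model; only for demonstration."""
    padded = f" {text.lower()} "
    return "__label__es" if " el " in padded else "__label__en"


model_es = route_document("El conductor no se detuvo", toy_detector)
model_en = route_document("The driver did not stop", toy_detector)
print(model_es, model_en)
```

The returned name would then be passed to `spacy.load(...)`, ideally once per model at startup rather than per document.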

Infrastructure (AWS)

  • OCR: AWS Textract with layout detection for multi-column medical records
  • Hosting: SageMaker endpoint for BERT classifier, Lambda for orchestration
  • Integration: Clio case management API for automatic case file attachment
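As an illustration of the OCR step, the sketch below pulls text lines out of a Textract response. It assumes a boto3 Textract client is passed in and uses the synchronous `analyze_document` call with the `LAYOUT` feature type. The bucket and key names are hypothetical, and `FakeTextract` exists only so the example runs without AWS credentials. Multi-page PDFs would need the asynchronous `start_document_analysis` and `get_document_analysis` calls instead, which are not shown here.

```python
from typing import Any, Dict, List


def lines_from_textract(response: Dict[str, Any]) -> List[str]:
    """Collect the text of every LINE block, in the order Textract returned them."""
    return [block["Text"]
            for block in response.get("Blocks", [])
            if block.get("BlockType") == "LINE"]


def ocr_single_page(client: Any, bucket: str, key: str) -> List[str]:
    """Run synchronous analysis on one single-page document stored in S3."""
    response = client.analyze_document(
        Document={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["LAYOUT"],
    )
    return lines_from_textract(response)


class FakeTextract:
    """Minimal stand-in for boto3.client('textract'), for demonstration only."""

    def analyze_document(self, Document: Dict[str, Any],
                         FeatureTypes: List[str]) -> Dict[str, Any]:
        return {"Blocks": [
            {"BlockType": "PAGE"},
            {"BlockType": "LINE", "Text": "CRASH REPORT"},
            {"BlockType": "WORD", "Text": "CRASH"},
            {"BlockType": "LINE", "Text": "Unit 1 failed to yield"},
        ]}


# Hypothetical bucket and key, used only for the demo call.
page_lines = ocr_single_page(FakeTextract(), "example-intake-bucket", "case-001/report.png")
print(page_lines)
```

In a real deployment the client would come from `boto3.client("textract")`, and the extracted lines would feed the classification stage.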
