Garza, Robles & Cantu Law
NLP-Powered Document Triage & Case Classification for a Personal Injury Firm
-73%: Intake-to-Assessment time saved
94.2%: Classification Accuracy (automated)
-81%: Critical Detail Miss Rate (reduction)
01 The Challenge
Garza, Robles & Cantu is a personal injury law firm in McAllen, TX, with 4 attorneys and 6 paralegals handling 200+ active cases at any given time. Their practice spans auto accidents (65% of caseload), slip-and-fall, workplace injuries, and medical malpractice. The Rio Grande Valley's high traffic volume on US-83 and I-2 means a steady flow of new cases — 8-12 per week.
The bottleneck was document processing. Every new case generates a stack of documents: police reports, medical records, insurance correspondence, witness statements, and billing records. Paralegals were spending 6-8 hours per case just reading, categorizing, and summarizing before an attorney could make an initial assessment. Document triage consumed nearly 60% of paralegal capacity.
The second problem was consistency. Different paralegals flagged different things. Critical details — a pre-existing condition buried on page 47, a liability-shifting phrase in a police report, or a gap in treatment that insurance adjusters exploit — were sometimes missed entirely. The firm also needed bilingual capability: approximately 40% of their clients are Spanish-speaking.
02 Our Approach
We built a five-stage NLP pipeline: OCR ingestion, document classification, entity extraction, passage ranking, and structured summary generation. The system handles bilingual documents natively and integrates with the firm's existing case management workflow.
- Fine-tuned BERT: multi-class document classifier trained on 2,400 labeled documents, reaching 94.2% accuracy across 7 document types
- spaCy + Custom NER: domain-specific named entity recognition for injury types, ICD-10 codes, policy numbers, and liability language
- AWS Textract: OCR with layout preservation for scanned police reports and medical records
- Cross-encoder Reranker: passage-level relevance scoring to surface the 10-15 most case-critical sentences from full document stacks
- Bilingual Pipeline: fastText language detection routing Spanish documents through a parallel spaCy transformer model
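The passage-ranking step in the list above can be illustrated with a minimal sketch. The production system scores passages with a cross-encoder; here a hypothetical keyword-overlap scorer stands in for that model so the example runs on the standard library alone. The function names, the query, and the sample sentences are illustrative assumptions, not the firm's code.

```python
import heapq
from typing import Callable, List, Sequence, Tuple


def overlap_score(query: str, passage: str) -> float:
    """Stand-in scorer (assumption): fraction of query terms present in the passage.

    A cross-encoder would replace this with a learned relevance score.
    """
    query_terms = set(query.lower().split())
    passage_terms = set(passage.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & passage_terms) / len(query_terms)


def top_passages(
    query: str,
    passages: Sequence[str],
    score: Callable[[str, str], float] = overlap_score,
    k: int = 10,
) -> List[Tuple[float, str]]:
    """Return the k highest-scoring (score, passage) pairs, best first."""
    scored = ((score(query, passage), passage) for passage in passages)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Hypothetical sentences pulled from a document stack.
passages = [
    "patient reports prior back injury in 2019",
    "invoice total due on receipt",
    "no gap in treatment noted",
]
best = top_passages("prior back injury", passages, k=1)
```

Keeping the scorer as a parameter means the ranking logic stays the same whether the score comes from a heuristic or from a trained model.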
- PDF Ingestion: Textract OCR + direct parse
- Classification: BERT, 7 doc types
- Entity Extraction: spaCy custom NER
- Passage Ranking: Cross-encoder scoring
- Case Summary: Structured 1-2 page brief
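As a rough illustration of how five stages like these chain together, the sketch below passes one document object through a sequence of stage functions. Every stage body is a deliberately simple placeholder (a keyword rule, a regex, a sentence split) standing in for Textract, BERT, spaCy, and the cross-encoder; the `Document` fields and function names are assumptions made for the example.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence


@dataclass
class Document:
    """One document moving through the pipeline (illustrative structure)."""
    raw: str
    text: str = ""
    doc_type: str = ""
    entities: Dict[str, str] = field(default_factory=dict)
    key_passages: List[str] = field(default_factory=list)
    summary: str = ""


def ingest(doc: Document) -> Document:
    # Placeholder for OCR / direct PDF parsing.
    doc.text = doc.raw.strip()
    return doc


def classify(doc: Document) -> Document:
    # Placeholder for the document classifier: a single keyword rule.
    doc.doc_type = "police report" if "officer" in doc.text.lower() else "other"
    return doc


def extract_entities(doc: Document) -> Document:
    # Placeholder for custom NER: pull the first dollar amount, if any.
    match = re.search(r"\$\d[\d,]*(?:\.\d{2})?", doc.text)
    if match:
        doc.entities["AMOUNT"] = match.group()
    return doc


def rank_passages(doc: Document) -> Document:
    # Placeholder for relevance ranking: keep the first 15 sentences.
    sentences = re.split(r"(?<=[.!?])\s+", doc.text)
    doc.key_passages = [s for s in sentences if s][:15]
    return doc


def summarize(doc: Document) -> Document:
    # Placeholder for the structured brief.
    doc.summary = f"{doc.doc_type}: {len(doc.key_passages)} key passages"
    return doc


PIPELINE: Sequence[Callable[[Document], Document]] = (
    ingest, classify, extract_entities, rank_passages, summarize,
)


def run_pipeline(doc: Document, stages=PIPELINE) -> Document:
    """Apply each stage in order, passing the document along."""
    for stage in stages:
        doc = stage(doc)
    return doc


result = run_pipeline(
    Document(raw="  Officer noted damage of $4,250.00 to the vehicle. Driver was cited.  ")
)
```

Because each stage takes and returns the same object, any one of them can be swapped for a real model without touching the others.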
03 Key Findings
Document Classification Confusion Matrix
Classification performance across 7 document types. High diagonal values indicate strong accuracy. The main confusion corridor is between insurance correspondence and legal filings — they share overlapping legal language and formatting.
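For readers who want to see how accuracy, macro-F1, and the "main confusion corridor" fall out of a confusion matrix, here is a small sketch. It assumes rows are true classes and columns are predicted classes. The 3x3 matrix and its labels are made-up toy numbers for illustration, not the firm's results.

```python
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[int]]


def accuracy(cm: Matrix) -> float:
    """Share of all documents that land on the diagonal."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def per_class_f1(cm: Matrix) -> List[float]:
    """F1 for each class, with rows as true labels and columns as predictions."""
    n = len(cm)
    scores = []
    for i in range(n):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp                        # true class i, predicted elsewhere
        fp = sum(cm[r][i] for r in range(n)) - tp   # other classes predicted as i
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def macro_f1(cm: Matrix) -> float:
    """Unweighted mean of per-class F1, so rare classes count equally."""
    scores = per_class_f1(cm)
    return sum(scores) / len(scores)


def largest_confusion(cm: Matrix, labels: Sequence[str]) -> Tuple[str, str, int]:
    """Return (true label, predicted label, count) for the biggest off-diagonal cell."""
    cells = [
        (cm[i][j], labels[i], labels[j])
        for i in range(len(cm))
        for j in range(len(cm))
        if i != j
    ]
    count, true_label, predicted_label = max(cells, key=lambda cell: cell[0])
    return true_label, predicted_label, count


# Toy counts only (assumption), three classes instead of seven.
labels = ["insurance corr.", "legal filing", "police report"]
toy_cm = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 10],
]
```

Macro-F1 sits below accuracy whenever the weaker classes are also the smaller ones, which is why the two numbers are usually reported together.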
Processing Time: Manual vs. Automated
Hours to complete document triage by case type. Complex cases (catastrophic injury, med-mal) show the largest absolute savings — from 11-14 hours down to 2.5-3 hours. Even simple auto accident cases see a 4x speedup.
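The relationship between a speedup factor and a percentage time reduction is easy to mix up, so a short worked example may help. It uses the hour ranges quoted above; pairing the low ends together and the high ends together is an assumption made for the arithmetic.

```python
def reduction(before_hours: float, after_hours: float) -> float:
    """Fraction of time saved, e.g. 0.75 means 75% less time."""
    return 1 - after_hours / before_hours


def speedup(before_hours: float, after_hours: float) -> float:
    """How many times faster the new process is."""
    return before_hours / after_hours


# Complex cases: 11-14 hours manual, 2.5-3 hours automated.
low_end = (speedup(11, 2.5), reduction(11, 2.5))   # 4.4x faster
high_end = (speedup(14, 3), reduction(14, 3))      # about 4.7x faster

# A 4x speedup always corresponds to a 75% reduction in time.
four_x = reduction(4, 1)
```

Under that pairing the complex cases land at roughly a 77-79% reduction, and a 4x speedup on simple cases works out to 75%.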
Entity Extraction Coverage Over Time
Percentage of key case entities (dates, injury codes, providers, amounts) successfully extracted. Step improvements at Q2, Q3, and Q4 correspond to quarterly model retraining on paralegal-corrected outputs.
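One plausible way to compute a coverage figure like this is sketched below: for each case, count how many of the key entity types came back with a non-empty value, then average across cases. Mapping "dates, injury codes, providers, amounts" onto four label names is an assumption, as are the sample cases.

```python
from typing import Dict, Iterable, Sequence

# Assumed label names for the four key entity types.
KEY_ENTITIES = ("DATE_OF_INCIDENT", "INJURY_TYPE", "PROVIDER", "AMOUNT")


def coverage(extracted: Dict[str, str], expected: Sequence[str] = KEY_ENTITIES) -> float:
    """Fraction of expected entity types with a non-empty extracted value."""
    found = sum(1 for label in expected if extracted.get(label))
    return found / len(expected)


def mean_coverage(cases: Iterable[Dict[str, str]]) -> float:
    """Average per-case coverage across a batch of cases."""
    scores = [coverage(case) for case in cases]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical extraction output for two cases.
case_a = {"DATE_OF_INCIDENT": "2023-01-05", "PROVIDER": "", "AMOUNT": "$100"}
case_b = {
    "DATE_OF_INCIDENT": "2023-02-11",
    "INJURY_TYPE": "S13.4",
    "PROVIDER": "Valley Clinic",
    "AMOUNT": "$2,300",
}
```

Tracking the same number after each retraining round is what would produce the step pattern described in the chart.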
04 Business Impact
Projected Annual Value
~35 paralegal hours freed per week, reallocated to client communication and case strategy
The immediate impact was speed: triage that used to take a paralegal 6-8 hours per case now takes under two hours, with the system handling the reading, classifying, and flagging automatically. The structured case summary became the attorneys' preferred starting point for initial assessments. Several noted that the "red flags" section (pre-existing conditions, treatment gaps, inconsistent statements) caught details they would have spent hours finding manually.
The bilingual pipeline was a quiet win. Spanish-language police reports and medical records from facilities across the border are now processed with the same accuracy as English documents, with entity extraction working cross-lingually. The firm estimates this alone saved 4-5 hours per week that were previously spent on manual translation and cross-referencing.
05 Technical Details
Document Classifier (BERT)
- Base model: bert-base-uncased, fine-tuned on 2,400 labeled documents
- Classes: police report, medical record, billing, insurance corr., legal filing, witness stmt., other
- Evaluation: 94.2% accuracy, macro-F1 = 0.91 (stratified 5-fold CV)
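Stratified 5-fold cross-validation means each fold keeps roughly the same mix of document types as the full set. The sketch below shows the splitting idea with the standard library only. In practice a library routine such as scikit-learn's `StratifiedKFold` is the usual choice; the function names and toy labels here are assumptions for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, Iterator, List, Sequence, Tuple


def stratified_folds(labels: Sequence[str], k: int = 5, seed: int = 0) -> List[List[int]]:
    """Deal each class's indices round-robin into k folds after shuffling."""
    by_class: Dict[str, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def train_test_splits(
    labels: Sequence[str], k: int = 5, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train indices, test indices), holding out one fold at a time."""
    folds = stratified_folds(labels, k, seed)
    for held_out in range(k):
        test = sorted(folds[held_out])
        train = sorted(
            index
            for fold_number, fold in enumerate(folds)
            if fold_number != held_out
            for index in fold
        )
        yield train, test


# Toy label set: 10 of one class, 5 of another.
toy_labels = ["medical record"] * 10 + ["billing"] * 5
folds = stratified_folds(toy_labels)
splits = list(train_test_splits(toy_labels))
```

The reported accuracy and macro-F1 would then be averaged over the five held-out folds.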
Named Entity Recognition (spaCy)
- Custom entities: DATE_OF_INCIDENT, INJURY_TYPE (ICD-10), PROVIDER, CARRIER, POLICY_NUM, AMOUNT
- Training: 800 annotated documents, Prodigy-assisted labeling
- Bilingual: English (en_core_web_trf) + Spanish (es_dep_news_trf) models
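The bilingual routing idea can be sketched as a language check followed by a model lookup. The production pipeline uses fastText for detection; the stopword count below is a much cruder stand-in, used only so the example runs without external dependencies. The stopword lists, the tie-breaking rule (default to English), and the function names are assumptions. The two model names are the ones listed above.

```python
import re

# Tiny illustrative stopword lists (assumption), not a real language detector.
EN_STOPWORDS = frozenset({"the", "and", "of", "to", "was", "in", "with", "that"})
ES_STOPWORDS = frozenset({"el", "la", "los", "las", "de", "que", "y", "en", "con", "del"})

MODEL_BY_LANGUAGE = {
    "en": "en_core_web_trf",
    "es": "es_dep_news_trf",
}


def detect_language(text: str) -> str:
    """Return 'es' if Spanish stopwords outnumber English ones, else 'en'."""
    tokens = re.findall(r"[a-záéíóúñü]+", text.lower())
    en_hits = sum(token in EN_STOPWORDS for token in tokens)
    es_hits = sum(token in ES_STOPWORDS for token in tokens)
    return "es" if es_hits > en_hits else "en"


def route(text: str) -> str:
    """Pick the name of the model that should process this text."""
    return MODEL_BY_LANGUAGE[detect_language(text)]


spanish_model = route("El conductor del vehículo no respetó la señal de alto")
english_model = route("The driver of the vehicle failed to stop")
```

A real detector would also return a confidence score, which is useful for flagging mixed-language documents for manual review.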
Infrastructure (AWS)
- OCR: AWS Textract with layout detection for multi-column medical records
- Hosting: SageMaker endpoint for BERT classifier, Lambda for orchestration
- Integration: Clio case management API for automatic case file attachment

