Domain-Specific Language Models

How Aare uses purpose-built transformer models to extract compliance-critical entities from unstructured text with high accuracy and minimal footprint.

What is a DSLM?

A Domain-Specific Language Model (DSLM) is a language model pre-trained or fine-tuned on domain-specific corpora. Unlike general-purpose LLMs trained on broad internet text, DSLMs learn the vocabulary, terminology, and linguistic patterns unique to a particular field: healthcare, finance, legal, insurance, and so on.

This domain focus enables DSLMs to achieve higher accuracy on specialized tasks than general models, while requiring far fewer parameters and computational resources.

Key distinction: DSLMs are defined by what they're trained on (domain-specific data), not by what task they perform. A DSLM can be further fine-tuned for NER, classification, or other downstream tasks while retaining its domain knowledge.

Why DSLMs for Compliance?

Compliance verification requires extracting structured data from unstructured text. Before you can verify that a loan decision follows fair lending rules, you need to extract the loan amount, credit score, debt-to-income (DTI) ratio, and decision from the LLM's natural-language output.

Traditional approaches have significant limitations:

Regex / Pattern Matching

  • Brittle: breaks on format variations
  • Can't handle OCR errors or typos
  • Misses semantic equivalents ("SSN" vs "Social Security Number")
  • Exponential maintenance burden
  • No context awareness

Domain-Specific Language Models

  • Robust to format variations
  • Handles OCR errors, typos, abbreviations
  • Understands semantic equivalents
  • Single model covers domain
  • Context-aware extraction

The Extraction Challenge

Consider extracting PHI (Protected Health Information) from clinical text. A regex approach requires patterns for every possible format:

Regex approach (brittle)
# Just SSN alone requires dozens of patterns
ssn_patterns = [
    r"\d{3}-\d{2}-\d{4}",           # 123-45-6789
    r"\d{3}\s\d{2}\s\d{4}",         # 123 45 6789
    r"\d{9}",                        # 123456789
    r"SSN[:\s]+\d{3}-\d{2}-\d{4}",  # SSN: 123-45-6789
    r"Social Security[:\s]+.*",     # Social Security Number...
    # OCR errors: l23-45-6789, 12E-45-6789...
    # And this is just ONE of 18 HIPAA categories
]
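To make the brittleness concrete, here is a minimal check running a subset of the patterns above against a clean value and an OCR-corrupted one (the `matches_any` helper is illustrative, not part of any Aare API):

```python
import re

# A subset of the SSN patterns from above.
ssn_patterns = [
    r"\d{3}-\d{2}-\d{4}",    # 123-45-6789
    r"\d{3}\s\d{2}\s\d{4}",  # 123 45 6789
    r"\d{9}",                # 123456789
]

def matches_any(text: str) -> bool:
    """Return True if any SSN pattern finds a match in the text."""
    return any(re.search(p, text) for p in ssn_patterns)

print(matches_any("123-45-6789"))  # True  - clean input is caught
print(matches_any("l23-45-6789"))  # False - OCR error (1 -> l) slips through
```

Every such miss requires yet another pattern, which is exactly the maintenance spiral the comment in the listing above hints at.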

A DSLM trained on healthcare text learns these patterns implicitly. It recognizes that "SS#", "SSN", "Social Security", and even OCR-corrupted variants all refer to the same entity type.

How Aare DSLMs Work

Aare DSLMs serve as the extraction layer in our verification pipeline. They convert unstructured text into typed entities that can be formally verified by Z3.

Unstructured Text → DSLM Extractor → Typed Entities → Z3 Verifier → Proof / Violation
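The hand-off between the extraction and verification stages can be sketched as follows. The `Entity` dataclass and its fields are illustrative assumptions, not Aare's actual API; the point is that each extracted span is typed and anchored to character offsets so it can be checked against the source before verification:

```python
from dataclasses import dataclass

@dataclass
class Entity:
    label: str   # e.g. "SSN", "NAME"
    text: str    # the surface span from the input
    start: int   # character offsets into the source text
    end: int

doc = "Patient John Smith has SSN 123-45-6789"
entities = [
    Entity("NAME", "John Smith", 8, 18),
    Entity("SSN", "123-45-6789", 27, 38),
]

# Each span must round-trip against the source text
# before it is handed to the verifier.
assert all(doc[e.start:e.end] == e.text for e in entities)
```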

Named Entity Recognition (NER)

Our DSLMs are fine-tuned for token classification using BIO tagging (Beginning, Inside, Outside). Each token in the input is classified as the start of an entity (B-TYPE), a continuation of an entity (I-TYPE), or outside any entity (O).

BIO tagging example
Input:  "Patient John Smith has SSN 123-45-6789"

Tokens: Patient  John   Smith  has  SSN  123-45-6789
Labels: O        B-NAME I-NAME O    O    B-SSN

Extracted entities:
  - NAME: "John Smith"
  - SSN: "123-45-6789"
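Decoding BIO labels back into entities is mechanical: each B- tag opens an entity, matching I- tags extend it, and anything else closes it. A minimal sketch (the `decode_bio` helper is illustrative):

```python
def decode_bio(tokens, labels):
    """Merge BIO-tagged tokens into (entity_type, text) pairs."""
    entities, current = [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if current:
                entities.append(current)
            current = (lab[2:], tok)               # open a new entity
        elif lab.startswith("I-") and current and lab[2:] == current[0]:
            current = (current[0], current[1] + " " + tok)  # extend it
        else:                                      # "O" or a stray I- tag
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return entities

tokens = ["Patient", "John", "Smith", "has", "SSN", "123-45-6789"]
labels = ["O", "B-NAME", "I-NAME", "O", "O", "B-SSN"]
print(decode_bio(tokens, labels))
# [('NAME', 'John Smith'), ('SSN', '123-45-6789')]
```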

Subword Tokenization

Modern transformers use subword tokenization, which splits words into smaller pieces. This is critical for handling domain-specific terminology and out-of-vocabulary terms. Our training pipeline properly aligns labels across subword boundaries.

Subword handling
Input:  "MRN: A12345678"

Subwords: MR  ##N  :  A      ##12   ##34   ##56   ##78
Labels:   O   O    O  B-MRN  I-MRN  I-MRN  I-MRN  I-MRN

# The model learns that all subwords of "A12345678"
# belong to the same entity
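Label alignment can be sketched as a pure function over a subword-to-word mapping, similar to the `word_ids()` mapping that common tokenizer libraries expose. The helper and variable names here are illustrative, not Aare's training code:

```python
def align_labels(word_labels, word_ids):
    """Propagate word-level BIO labels onto subword tokens.

    The first subword of a word keeps the word's label; later subwords
    of the same word get the matching I- label (or stay O).
    """
    out, prev = [], None
    for wid in word_ids:
        lab = word_labels[wid]
        if wid == prev and lab.startswith("B-"):
            lab = "I-" + lab[2:]   # continuation subword of the same word
        out.append(lab)
        prev = wid
    return out

# "MRN: A12345678" -> words ["MRN", ":", "A12345678"]
word_labels = ["O", "O", "B-MRN"]
# word_ids maps each subword back to its source word:
# subwords: MR ##N : A ##12 ##34 ##56 ##78
word_ids = [0, 0, 1, 2, 2, 2, 2, 2]
print(align_labels(word_labels, word_ids))
# ['O', 'O', 'O', 'B-MRN', 'I-MRN', 'I-MRN', 'I-MRN', 'I-MRN']
```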

DSLM vs General LLM for Extraction

You could use GPT-4 or Claude for entity extraction. Why use a specialized DSLM?

Factor              General LLM                 Aare DSLM
Model Size          100B+ parameters            67M parameters
Inference Latency   1-5 seconds (API call)      <50ms (on-device)
Data Privacy        Sent to cloud provider      Never leaves device
Output Format       Variable, requires parsing  Structured, deterministic
Offline Operation   Requires internet           Fully offline
Cost per Inference  $0.01-0.10+                 ~$0 (compute only)
Domain Accuracy     Good (general knowledge)    Excellent (specialized)

The privacy advantage: For HIPAA, PCI-DSS, and other privacy-sensitive domains, the ability to run extraction entirely on-device without transmitting sensitive data is often a compliance requirement, not just a preference.

Available Models

HIPAA PHI Detector

Our flagship DSLM detects all 18 HIPAA Safe Harbor PHI categories in clinical and healthcare text. It handles voice transcriptions, OCR output, and free-form medical notes.

Detected categories: NAME, ADDRESS, DATE, PHONE, FAX, EMAIL, SSN, MRN, HEALTH_PLAN, ACCOUNT, LICENSE, VEHICLE_ID, DEVICE_ID, URL, IP, BIOMETRIC, OTHER_ID
Specification               Value
Base Architecture           DistilBERT (6-layer transformer)
Parameters                  67M
CoreML Package Size         ~127 MB
Inference Time (iPhone 14)  <50ms
Max Sequence Length         512 tokens
Entity Categories           18 (35 BIO labels)
Output Format               CoreML (.mlpackage)


Deployment Options

Aare DSLMs are optimized for edge deployment across multiple platforms:

CoreML

Native iOS/macOS deployment with hardware acceleration on Apple Neural Engine.

ONNX

Cross-platform format for Android, Windows, Linux, and cloud deployment.

TensorFlow Lite

Optimized for mobile and embedded devices with quantization support.

Custom Training

Need entity extraction for a specialized domain? We train custom DSLMs on your data with your entity schema.

The Process

  1. Schema Definition: Define your entity types and their semantic boundaries
  2. Data Collection: Provide representative text samples from your domain
  3. Annotation: We work with you to label training data (or use your existing annotations)
  4. Training: Fine-tune a base model on your domain-specific corpus
  5. Validation: Evaluate on held-out test data, iterate until accuracy targets are met
  6. Deployment: Export to your target format (CoreML, ONNX, TFLite)
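As a sketch of step 1, an entity schema might look like the following. The field names and domain are hypothetical, not Aare's actual schema format; the point is that the BIO label set falls out of the schema mechanically:

```python
# Hypothetical entity schema for a custom lending-compliance model.
schema = {
    "domain": "commercial_lending",
    "entities": [
        {"name": "LOAN_AMOUNT",  "description": "Requested or approved principal, in USD"},
        {"name": "CREDIT_SCORE", "description": "FICO or equivalent score, 300-850"},
        {"name": "DTI_RATIO",    "description": "Debt-to-income ratio, 0.0-1.0"},
        {"name": "DECISION",     "description": "Approve / deny / refer outcome"},
    ],
}

# From the schema, the BIO label set is mechanical: B-/I- per entity plus O.
labels = ["O"] + [p + e["name"] for e in schema["entities"] for p in ("B-", "I-")]
print(len(labels))  # 9 labels for 4 entity types
```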

Data privacy: Custom training can be performed on-premises or in your cloud environment. Your training data never needs to leave your infrastructure.

Get Started with DSLMs

Explore our pre-built models or discuss custom training for your domain.
