Domain-Specific Language Models

How Aare uses purpose-built transformer models to extract compliance-critical entities from unstructured text with high accuracy and minimal footprint.

What is a DSLM?

A Domain-Specific Language Model (DSLM) is a language model pre-trained or fine-tuned on domain-specific corpora. Unlike general-purpose LLMs trained on broad internet text, DSLMs learn the vocabulary, terminology, and linguistic patterns unique to a particular field: healthcare, finance, legal, insurance, and so on.

This domain focus enables DSLMs to achieve higher accuracy on specialized tasks than general models, while requiring far fewer parameters and computational resources.

Key distinction: DSLMs are defined by what they're trained on (domain-specific data), not by what task they perform. A DSLM can be further fine-tuned for NER, classification, or other downstream tasks while retaining its domain knowledge.

Why DSLMs for Compliance?

Compliance verification requires extracting structured data from unstructured text. Before you can verify that a loan decision follows fair lending rules, you need to extract the loan amount, credit score, debt-to-income (DTI) ratio, and decision from the LLM's natural-language output.

Traditional approaches have significant limitations:

Regex / Pattern Matching

  • Brittle: breaks on format variations
  • Can't handle OCR errors or typos
  • Misses semantic equivalents ("SSN" vs "Social Security Number")
  • Exponential maintenance burden
  • No context awareness

Domain-Specific Language Models

  • Robust to format variations
  • Handles OCR errors, typos, abbreviations
  • Understands semantic equivalents
  • Single model covers domain
  • Context-aware extraction

The Extraction Challenge

Consider extracting PHI (Protected Health Information) from clinical text. A regex approach requires patterns for every possible format:

Regex approach (brittle)
# Just SSN alone requires dozens of patterns
ssn_patterns = [
    r"\d{3}-\d{2}-\d{4}",           # 123-45-6789
    r"\d{3}\s\d{2}\s\d{4}",         # 123 45 6789
    r"\d{9}",                        # 123456789
    r"SSN[:\s]+\d{3}-\d{2}-\d{4}",  # SSN: 123-45-6789
    r"Social Security[:\s]+.*",     # Social Security Number...
    # OCR errors: l23-45-6789, 12E-45-6789...
    # And this is just ONE of 18 HIPAA categories
]
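To make the brittleness concrete, here is a minimal check running a subset of the patterns above against a clean value and an OCR-corrupted one (the `matches_any` helper is illustrative, not part of any Aare API):

```python
import re

# A subset of the SSN patterns from above.
ssn_patterns = [
    r"\d{3}-\d{2}-\d{4}",    # 123-45-6789
    r"\d{3}\s\d{2}\s\d{4}",  # 123 45 6789
    r"\d{9}",                # 123456789
]

def matches_any(text: str) -> bool:
    """Return True if any SSN pattern finds a match in the text."""
    return any(re.search(p, text) for p in ssn_patterns)

print(matches_any("123-45-6789"))  # True  - clean input is caught
print(matches_any("l23-45-6789"))  # False - OCR error (1 -> l) slips through
```

Every such miss requires yet another pattern, which is exactly the maintenance spiral the comment in the listing above hints at.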

A DSLM trained on healthcare text learns these patterns implicitly. It recognizes that "SS#", "SSN", "Social Security", and even OCR-corrupted variants all refer to the same entity type.

How Aare DSLMs Work

Aare DSLMs serve as the extraction layer in our verification pipeline. They convert unstructured text into typed entities that can be formally verified by Z3.

Unstructured Text → DSLM Extractor → Typed Entities → Z3 Verifier → Proof / Violation
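The hand-off between the extraction and verification stages can be sketched as follows. The `Entity` dataclass and its fields are illustrative assumptions, not Aare's actual API; the point is that each extracted span is typed and anchored to character offsets so it can be checked against the source before verification:

```python
from dataclasses import dataclass

@dataclass
class Entity:
    label: str   # e.g. "SSN", "NAME"
    text: str    # the surface span from the input
    start: int   # character offsets into the source text
    end: int

doc = "Patient John Smith has SSN 123-45-6789"
entities = [
    Entity("NAME", "John Smith", 8, 18),
    Entity("SSN", "123-45-6789", 27, 38),
]

# Each span must round-trip against the source text
# before it is handed to the verifier.
assert all(doc[e.start:e.end] == e.text for e in entities)
```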

Named Entity Recognition (NER)

Our DSLMs are fine-tuned for token classification using BIO tagging (Beginning, Inside, Outside). Each token in the input is classified as the start of an entity (B-TYPE), a continuation of an entity (I-TYPE), or outside any entity (O).

BIO tagging example
Input:  "Patient John Smith has SSN 123-45-6789"

Tokens: Patient  John   Smith  has  SSN  123-45-6789
Labels: O        B-NAME I-NAME O    O    B-SSN

Extracted entities:
  - NAME: "John Smith"
  - SSN: "123-45-6789"
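Decoding BIO labels back into entities is mechanical: each B- tag opens an entity, matching I- tags extend it, and anything else closes it. A minimal sketch (the `decode_bio` helper is illustrative):

```python
def decode_bio(tokens, labels):
    """Merge BIO-tagged tokens into (entity_type, text) pairs."""
    entities, current = [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if current:
                entities.append(current)
            current = (lab[2:], tok)               # open a new entity
        elif lab.startswith("I-") and current and lab[2:] == current[0]:
            current = (current[0], current[1] + " " + tok)  # extend it
        else:                                      # "O" or a stray I- tag
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return entities

tokens = ["Patient", "John", "Smith", "has", "SSN", "123-45-6789"]
labels = ["O", "B-NAME", "I-NAME", "O", "O", "B-SSN"]
print(decode_bio(tokens, labels))
# [('NAME', 'John Smith'), ('SSN', '123-45-6789')]
```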

Subword Tokenization

Modern transformers use subword tokenization, which splits words into smaller pieces. This is critical for handling domain-specific terminology and out-of-vocabulary terms. Our training pipeline properly aligns labels across subword boundaries.

Subword handling
Input:  "MRN: A12345678"

Subwords: MR  ##N  :  A      ##12   ##34   ##56   ##78
Labels:   O   O    O  B-MRN  I-MRN  I-MRN  I-MRN  I-MRN

# The model learns that all subwords of "A12345678"
# belong to the same entity
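Label alignment can be sketched as a pure function over a subword-to-word mapping, similar to the `word_ids()` mapping that common tokenizer libraries expose. The helper and variable names here are illustrative, not Aare's training code:

```python
def align_labels(word_labels, word_ids):
    """Propagate word-level BIO labels onto subword tokens.

    The first subword of a word keeps the word's label; later subwords
    of the same word get the matching I- label (or stay O).
    """
    out, prev = [], None
    for wid in word_ids:
        lab = word_labels[wid]
        if wid == prev and lab.startswith("B-"):
            lab = "I-" + lab[2:]   # continuation subword of the same word
        out.append(lab)
        prev = wid
    return out

# "MRN: A12345678" -> words ["MRN", ":", "A12345678"]
word_labels = ["O", "O", "B-MRN"]
# word_ids maps each subword back to its source word:
# subwords: MR ##N : A ##12 ##34 ##56 ##78
word_ids = [0, 0, 1, 2, 2, 2, 2, 2]
print(align_labels(word_labels, word_ids))
# ['O', 'O', 'O', 'B-MRN', 'I-MRN', 'I-MRN', 'I-MRN', 'I-MRN']
```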

DSLM vs General LLM for Extraction

You could use GPT-4 or Claude for entity extraction. Why use a specialized DSLM?

Factor              General LLM                 Aare DSLM
Model Size          100B+ parameters            67M parameters
Inference Latency   1-5 seconds (API call)      <50ms (on-device)
Data Privacy        Sent to cloud provider      Never leaves device
Output Format       Variable, requires parsing  Structured, deterministic
Offline Operation   Requires internet           Fully offline
Cost per Inference  $0.01-0.10+                 ~$0 (compute only)
Domain Accuracy     Good (general knowledge)    Excellent (specialized)

The privacy advantage: For HIPAA, PCI-DSS, and other privacy-sensitive domains, the ability to run extraction entirely on-device without transmitting sensitive data is often a compliance requirement, not just a preference.

Available Models

HIPAA PHI Detector

Our flagship DSLM detects all 18 HIPAA Safe Harbor PHI categories in clinical and healthcare text. It handles voice transcriptions, OCR output, and free-form medical notes.

Detected categories: NAME, ADDRESS, DATE, PHONE, FAX, EMAIL, SSN, MRN, HEALTH_PLAN, ACCOUNT, LICENSE, VEHICLE_ID, DEVICE_ID, URL, IP, BIOMETRIC, OTHER_ID
Specification               Value
Base Architecture           DistilBERT (6-layer transformer)
Parameters                  67M
CoreML Package Size         ~127 MB
Inference Time (iPhone 14)  <50ms
Max Sequence Length         512 tokens
Entity Categories           18 (35 BIO labels)
Output Format               CoreML (.mlpackage)


Deployment Options

Aare DSLMs are optimized for edge deployment across multiple platforms:

CoreML

Native iOS/macOS deployment with hardware acceleration on Apple Neural Engine.

ONNX

Cross-platform format for Android, Windows, Linux, and cloud deployment.

TensorFlow Lite

Optimized for mobile and embedded devices with quantization support.

Custom Training

Need entity extraction for a specialized domain? We train custom DSLMs on your data with your entity schema.

The Process

  1. Schema Definition: Define your entity types and their semantic boundaries
  2. Data Collection: Provide representative text samples from your domain
  3. Annotation: We work with you to label training data (or use your existing annotations)
  4. Training: Fine-tune a base model on your domain-specific corpus
  5. Validation: Evaluate on held-out test data, iterate until accuracy targets are met
  6. Deployment: Export to your target format (CoreML, ONNX, TFLite)
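As a sketch of step 1, an entity schema might look like the following. The field names and domain are hypothetical, not Aare's actual schema format; the point is that the BIO label set falls out of the schema mechanically:

```python
# Hypothetical entity schema for a custom lending-compliance model.
schema = {
    "domain": "commercial_lending",
    "entities": [
        {"name": "LOAN_AMOUNT",  "description": "Requested or approved principal, in USD"},
        {"name": "CREDIT_SCORE", "description": "FICO or equivalent score, 300-850"},
        {"name": "DTI_RATIO",    "description": "Debt-to-income ratio, 0.0-1.0"},
        {"name": "DECISION",     "description": "Approve / deny / refer outcome"},
    ],
}

# From the schema, the BIO label set is mechanical: B-/I- per entity plus O.
labels = ["O"] + [p + e["name"] for e in schema["entities"] for p in ("B-", "I-")]
print(len(labels))  # 9 labels for 4 entity types
```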

Data privacy: Custom training can be performed on-premises or in your cloud environment. Your training data never needs to leave your infrastructure.

Get Started with DSLMs

Explore our pre-built models or discuss custom training for your domain.
