How Aare uses purpose-built transformer models to extract compliance-critical entities from unstructured text with high accuracy and minimal footprint.
A Domain-Specific Language Model (DSLM) is a language model pre-trained or fine-tuned on domain-specific corpora. Unlike general-purpose LLMs trained on broad internet text, DSLMs learn the vocabulary, terminology, and linguistic patterns unique to a particular field: healthcare, finance, legal, insurance, and so on.
This domain focus enables DSLMs to achieve higher accuracy on specialized tasks than general models, while requiring far fewer parameters and computational resources.
Key distinction: DSLMs are defined by what they're trained on (domain-specific data), not by what task they perform. A DSLM can be further fine-tuned for NER, classification, or other downstream tasks while retaining its domain knowledge.
Compliance verification requires extracting structured data from unstructured text. Before you can verify that a loan decision follows fair lending rules, you need to extract the loan amount, credit score, DTI ratio, and decision from the LLM's natural language output.
Traditional approaches, such as regex patterns and rule-based parsers, have significant limitations.
Consider extracting PHI (Protected Health Information) from clinical text. A regex approach requires patterns for every possible format:
```python
# Just SSN alone requires dozens of patterns
ssn_patterns = [
    r"\d{3}-\d{2}-\d{4}",            # 123-45-6789
    r"\d{3}\s\d{2}\s\d{4}",          # 123 45 6789
    r"\d{9}",                        # 123456789
    r"SSN[:\s]+\d{3}-\d{2}-\d{4}",   # SSN: 123-45-6789
    r"Social Security[:\s]+.*",      # Social Security Number...
    # OCR errors: l23-45-6789, 12E-45-6789...
    # And this is just ONE of 18 HIPAA categories
]
```
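To see the brittleness concretely, here is a quick, self-contained sketch using a subset of the patterns above with Python's `re` module. A single OCR error is enough to defeat every pattern:

```python
import re

# A subset of the SSN patterns from the list above
ssn_patterns = [
    r"\d{3}-\d{2}-\d{4}",
    r"\d{3}\s\d{2}\s\d{4}",
    r"\d{9}",
    r"SSN[:\s]+\d{3}-\d{2}-\d{4}",
]

def matches_any(text):
    """Return True if any SSN pattern fires on the text."""
    return any(re.search(p, text) for p in ssn_patterns)

print(matches_any("SSN 123-45-6789"))   # True  -- clean input is caught
print(matches_any("SSN l23-45-6789"))   # False -- OCR turned '1' into 'l'
```

Every new corruption mode means another hand-written pattern, multiplied across all 18 PHI categories.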
A DSLM trained on healthcare text learns these patterns implicitly. It recognizes that "SS#", "SSN", "Social Security", and even OCR-corrupted variants all refer to the same entity type.
Aare DSLMs serve as the extraction layer in our verification pipeline. They convert unstructured text into typed entities that can be formally verified by Z3.
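To illustrate that handoff, here is a minimal sketch in plain Python. The entity names, the fair-lending rule, and the thresholds are all hypothetical placeholders, and in the real pipeline the rule would be encoded as a Z3 constraint rather than an `if` statement:

```python
# Typed entities as the extraction layer might emit them
# (names and values are illustrative, not Aare's actual schema)
extracted = {
    "loan_amount": 250_000,
    "credit_score": 720,
    "dti_ratio": 0.31,
    "decision": "denied",
}

def check_denial_consistency(e):
    """Simplified stand-in for the formal Z3 check: flag denials
    where every extracted underwriting factor looks favorable
    (thresholds are illustrative only)."""
    favorable = e["credit_score"] >= 680 and e["dti_ratio"] <= 0.43
    return not (e["decision"] == "denied" and favorable)

print(check_denial_consistency(extracted))  # False -> flag for review
```

The point is the interface: once the DSLM has produced typed fields, the verification step operates on structured values instead of free text.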
Our DSLMs are fine-tuned for token classification using BIO tagging (Beginning, Inside, Outside). Each token in the input is classified as either the start of an entity (B-TYPE), a continuation of an entity (I-TYPE), or not an entity (O).
```text
Input:  "Patient John Smith has SSN 123-45-6789"

Tokens: Patient  John    Smith   has  SSN  123-45-6789
Labels: O        B-NAME  I-NAME  O    O    B-SSN
```
Extracted entities:
- NAME: "John Smith"
- SSN: "123-45-6789"
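Decoding BIO labels back into entities is mechanical. A minimal sketch (the function name is ours, not Aare's API):

```python
def decode_bio(tokens, labels):
    """Group (token, BIO-label) pairs into (entity_type, text) spans."""
    entities, current = [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):           # new entity begins
            if current:
                entities.append(current)
            current = (lab[2:], [tok])
        elif lab.startswith("I-") and current and current[0] == lab[2:]:
            current[1].append(tok)         # continuation of current entity
        else:                              # "O" (or stray I-) closes it
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(etype, " ".join(words)) for etype, words in entities]

tokens = ["Patient", "John", "Smith", "has", "SSN", "123-45-6789"]
labels = ["O", "B-NAME", "I-NAME", "O", "O", "B-SSN"]
print(decode_bio(tokens, labels))
# [('NAME', 'John Smith'), ('SSN', '123-45-6789')]
```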
Modern transformers use subword tokenization, which splits words into smaller pieces. This is critical for handling domain-specific terminology and out-of-vocabulary terms. Our training pipeline properly aligns labels across subword boundaries.
```text
Input:    "MRN: A12345678"

Subwords: MR  ##N  :  A  ##12  ##34  ##56  ##78
Labels:   O   O    O  B  I     I     I     I
```

The model learns that all subwords of "A12345678" belong to the same entity.
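The alignment step can be sketched with the subword-to-word index mapping that tokenizers such as Hugging Face's expose via `word_ids()`. The helper below is illustrative, not Aare's actual training code:

```python
def align_labels(word_labels, word_ids):
    """Expand word-level BIO labels to subword level: the first
    subword of a word keeps the word's label; continuation
    subwords of a B-/I- entity become I-<type>."""
    aligned, prev = [], None
    for wid in word_ids:
        label = word_labels[wid]
        if wid != prev or label == "O":
            aligned.append(label)
        else:
            aligned.append("I-" + label.split("-", 1)[1])
        prev = wid
    return aligned

# "MRN: A12345678" -> subwords: MR ##N : A ##12 ##34 ##56 ##78
word_ids = [0, 0, 1, 2, 2, 2, 2, 2]
print(align_labels(["O", "O", "B-MRN"], word_ids))
# ['O', 'O', 'O', 'B-MRN', 'I-MRN', 'I-MRN', 'I-MRN', 'I-MRN']
```

A common alternative is to assign continuation subwords the ignore index (-100) so they are excluded from the training loss entirely.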
You could use GPT-4 or Claude for entity extraction. Why use a specialized DSLM?
| Factor | General LLM | Aare DSLM |
|---|---|---|
| Model Size | 100B+ parameters | 67M parameters |
| Inference Latency | 1-5 seconds (API call) | <50ms (on-device) |
| Data Privacy | Sent to cloud provider | Never leaves device |
| Output Format | Variable, requires parsing | Structured, deterministic |
| Offline Operation | Requires internet | Fully offline |
| Cost per Inference | $0.01-0.10+ | ~$0 (compute only) |
| Domain Accuracy | Good (general knowledge) | Excellent (specialized) |
The privacy advantage: For HIPAA, PCI-DSS, and other privacy-sensitive domains, the ability to run extraction entirely on-device without transmitting sensitive data is often a compliance requirement, not just a preference.
Our flagship DSLM detects all 18 HIPAA Safe Harbor PHI categories in clinical and healthcare text. It handles voice transcriptions, OCR output, and free-form medical notes.
| Specification | Value |
|---|---|
| Base Architecture | DistilBERT (6-layer transformer) |
| Parameters | 67M |
| CoreML Package Size | ~127 MB |
| Inference Time (iPhone 14) | <50ms |
| Max Sequence Length | 512 tokens |
| Entity Categories | 18 (35 BIO labels) |
| Output Format | CoreML (.mlpackage) |
Aare DSLMs are optimized for edge deployment across multiple platforms:

- **CoreML** — native iOS/macOS deployment with hardware acceleration on the Apple Neural Engine.
- Cross-platform format for Android, Windows, Linux, and cloud deployment.
- Optimized for mobile and embedded devices, with quantization support.
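As a back-of-the-envelope illustration of the quantization mentioned above, symmetric per-tensor int8 quantization maps each float weight to an 8-bit integer with a shared scale. This is a toy sketch, not the deployment toolchain's actual implementation:

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization (toy version):
    scale is chosen so the largest |weight| maps to 127."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return [v * scale for v in q]

q, scale = quantize_int8([1.0, -0.3, 0.6])
print(q)  # [127, -38, 76]
```

Each weight shrinks from 4 bytes to 1, at the cost of a bounded rounding error of at most half the scale, which is why quantized DSLMs fit comfortably on mobile and embedded hardware.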
Need entity extraction for a specialized domain? We train custom DSLMs on your data with your entity schema.
Data privacy: Custom training can be performed on-premises or in your cloud environment. Your training data never needs to leave your infrastructure.
Explore our pre-built models or discuss custom training for your domain.