FineType
Semantic type detection for text data. 209 types, DuckDB integration, pure Rust.
FineType classifies text into 209 semantic types — dates, emails, IP addresses, coordinates, financial identifiers, and more.
Why it matters
You download a dataset. DuckDB reads it instantly, but every text column is VARCHAR. Is that column of numbers a postal code, a year, or a price? Are those dates US or European format?
FineType answers these questions:
$ finetype profile -f orders.csv
Column Type Confidence
──────────── ───────────────────────────── ──────────
order_date datetime.date.us_slash 0.97
amount representation.numeric.decimal 0.98
customer identity.person.full_name 0.93
country geography.location.country 0.95
ip_address technology.internet.ip_v4 0.99Every type maps to a DuckDB SQL expression. FineType says order_date is datetime.date.us_slash — that means strptime(order_date, '%m/%d/%Y') will succeed on every matching value. Profile first, then cast with confidence.
Installation
Homebrew (macOS)
brew install noon-org/tap/finetypeCargo
cargo install finetype-cliFrom Source
git clone https://github.com/noon-org/finetype
cd finetype
cargo build --releaseCLI Usage
# Classify a single value
finetype infer -i "bc89:60a9:23b8:c1e9:3924:56de:3eb1:3b90"
# Classify from file (one value per line), JSON output
finetype infer -f data.txt --output json
# Column-mode inference (distribution-based disambiguation)
finetype infer -f column_values.txt --mode column
# Profile a CSV file — detect column types
finetype profile -f data.csv
# Profile with data quality validation
finetype profile -f data.csv --validate
# Generate a DuckDB CREATE TABLE from a CSV
finetype schema-for -f data.csv
# Export a JSON Schema for a type
finetype schema datetime.timestamp.iso_8601
# Validate data quality against taxonomy schemas
finetype validate -f data.ndjson --strategy quarantine
# Generate synthetic training data
finetype generate --samples 1000 --output training.ndjson
# Validate generator ↔ taxonomy alignment
finetype check
# Show taxonomy (filter by domain, category, priority)
finetype taxonomy --domain datetimeColumn-Mode Inference
Single-value classification can be ambiguous: is 01/02/2024 a US date (Jan 2) or EU date (Feb 1)? Is 1995 a year, postal code, or plain number?
Column-mode analyses the distribution of values in a column and applies disambiguation rules:
- Date format — US vs EU slash dates, short vs long dates
- Year detection — 4-digit integers predominantly in 1900–2100 range
- Coordinate resolution — latitude vs longitude based on value ranges
- Numeric types — ports, increments, postal codes, street numbers
# CLI column-mode
finetype infer -f column_values.txt --mode column
# CSV profiling (uses column-mode automatically)
finetype profile -f data.csvSchema Export
Once you know your column types, FineType can generate a DuckDB CREATE TABLE statement — every column typed, every cast guaranteed to succeed:
$ finetype schema-for -f orders.csv
CREATE TABLE orders (
order_date DATE, -- strptime(order_date, '%m/%d/%Y')
amount DECIMAL(10,2), -- decimal
customer VARCHAR, -- full_name
country VARCHAR, -- country
ip_address VARCHAR -- ip_v4 (INET)
);Output formats include plain SQL, JSON (for programmatic use), and Arrow schema JSON.
Data Quality
Profile with --validate to get quality grades for each column. FineType checks every value against the type's schema and reports what doesn't match:
$ finetype profile -f data.csv --validate
Column Type Confidence Quality Invalid
──────────── ───────────────────── ────────── ─────── ───────
email identity.person.email 0.99 A 0/1000
order_date datetime.date.us_slash 0.97 B 12/1000
amount representation.numeric 0.94 A 2/1000For deeper validation, finetype validate supports quarantine, null-replacement, forward-fill, and backward-fill strategies for handling invalid values.
DuckDB Extension
-- Install and load
INSTALL finetype FROM community;
LOAD finetype;
-- Classify a single value
SELECT finetype('192.168.1.1');
-- → 'technology.internet.ip_v4'
-- Detailed output (type, confidence, DuckDB broad type)
SELECT finetype_detail(value) FROM my_table;
-- → '{"type":"datetime.date.us_slash","confidence":0.98,"broad_type":"DATE"}'
-- Normalize values for safe TRY_CAST
SELECT finetype_cast(value) FROM my_table;
-- Recursively classify JSON fields
SELECT finetype_unpack(json_col) FROM my_table;
-- Check extension version
SELECT finetype_version();The extension embeds model weights at compile time — no external files needed.
Taxonomy
FineType recognises 209 types across 7 domains:
| Domain | Types | Examples |
|---|---|---|
datetime | 84 | ISO 8601, RFC 2822, Unix timestamps, timezones, 27+ date format locales |
representation | 32 | Integers, floats, booleans, hex colours, base64, JSON |
finance | 28 | IBAN, SWIFT/BIC, currency amounts, credit card numbers, tax identifiers |
identity | 19 | Names, emails, phone numbers (46+ locales), passwords |
technology | 19 | IPv4, IPv6, MAC addresses, URLs, UUIDs, DOIs, hashes, user agents |
geography | 15 | Latitude, longitude, countries, cities, postal codes (65+ locales) |
container | 12 | JSON objects, CSV rows, query strings, key-value pairs |
Label format: {domain}.{category}.{type} — e.g., technology.internet.ip_v4. Locale-specific types append a suffix: identity.person.phone_number.EN_AU.
Performance
Accuracy
Evaluated on 21 real-world datasets (116 annotated columns):
| Metric | Result |
|---|---|
| Label accuracy | 97.4% |
| Domain accuracy | 98.3% |
| Actionability | 99.7% (DuckDB casts succeed on real data) |
The inference pipeline uses a two-stage architecture — Sense (broad classification) followed by Sharpen (fine-grained disambiguation) — with column-mode distribution analysis for ambiguous types like dates and coordinates.
Latency & Throughput
| Metric | Value |
|---|---|
| Model load | 66 ms cold, 25–30 ms warm |
| Single inference | p50 = 26 ms, p95 = 41 ms |
| Batch throughput | 600–750 values/sec |
| Memory footprint | 8.5 MB peak RSS |