Skip to content

FineType

FineType classifies strings into a rich taxonomy of 151 semantic types — each type is a transformation contract that guarantees a DuckDB cast expression will succeed.

$ finetype infer "192.168.1.1"
technology.internet.ip_v4
$ finetype infer "2024-01-15T10:30:00Z"
datetime.timestamp.iso_8601
$ finetype infer "[email protected]"
identity.person.email

Why FineType?

Go beyond primitive types. Detect the real meaning of your data.

151 Semantic Types

Classify across 6 domains: datetime, technology, identity, representation, geography, and container types.

Transformation Contracts

Every prediction maps to a DuckDB SQL expression guaranteed to parse successfully. Types you can act on.

Pure Rust, No Python

Built with Candle ML. 600+ classifications/sec, 8.5 MB memory, 66ms cold start. Ship without a runtime.

FineType recognizes 151 types across 6 domains:

DomainTypesExamples
datetime46ISO 8601, RFC 2822, Unix timestamps, timezones
technology34IPv4, IPv6, MAC addresses, URLs, UUIDs, hashes
identity25Names, emails, phone numbers, passwords
representation19Integers, floats, booleans, hex colors, base64, JSON
geography16Latitude, longitude, countries, cities, postal codes
container11JSON objects, CSV rows, query strings, key-value pairs

Label format: {domain}.{category}.{type} — e.g., technology.internet.ip_v4. Locale-specific types append a suffix: identity.person.phone_number.EN_AU.

FineType provides 9 commands covering the full ML pipeline:

Terminal window
# Classify a single value
finetype infer -i "bc89:60a9:23b8:c1e9:3924:56de:3eb1:3b90"
# Classify from file (one value per line), JSON output
finetype infer -f data.txt --output json
# Column-mode inference (distribution-based disambiguation)
finetype infer -f column_values.txt --mode column
# Profile a CSV file — detect column types
finetype profile -f data.csv
# Generate synthetic training data
finetype generate --samples 1000 --output training.ndjson
# Train a CharCNN model
finetype train --data data/train.ndjson --epochs 10 --batch-size 64
# Evaluate model accuracy
finetype eval --data data/test.ndjson --model models/char-cnn-v2
# Evaluate on GitTables benchmark
finetype eval-gittables --dir eval/gittables
# Validate data quality against taxonomy schemas
finetype validate -f data.ndjson --strategy quarantine

Single-value classification can be ambiguous: is 01/02/2024 a US date (Jan 2) or EU date (Feb 1)? Is 1995 a year, postal code, or plain number?

Column-mode analyzes the distribution of values in a column and applies disambiguation rules:

  • Date format — US vs EU slash dates, short vs long dates
  • Year detection — 4-digit integers predominantly in 1900-2100 range
  • Coordinate resolution — latitude vs longitude based on value ranges
  • Numeric types — ports, increments, postal codes, street numbers
Terminal window
# CLI column-mode
finetype infer -f column_values.txt --mode column
# CSV profiling (uses column-mode automatically)
finetype profile -f data.csv
-- Install and load
INSTALL finetype FROM community;
LOAD finetype;
-- Classify a single value
SELECT finetype('192.168.1.1');
-- → 'technology.internet.ip_v4'
-- Detailed output (type, confidence, DuckDB broad type)
SELECT finetype_detail(value) FROM my_table;
-- → '{"type":"datetime.date.us_slash","confidence":0.98,"broad_type":"DATE"}'
-- Normalize values for safe TRY_CAST
SELECT finetype_cast(value) FROM my_table;
-- Recursively classify JSON fields
SELECT finetype_unpack(json_col) FROM my_table;
-- Check extension version
SELECT finetype_version();

The extension embeds model weights at compile time — no external files needed.

ModelAccuracyTest Samples
Flat CharCNN v291.97%15,100

Post-processing rules improve Macro F1 from 87.9% to 90.8% (+2.9 points without retraining).

Evaluated against 2,363 annotated columns from 883 real-world CSV tables:

Type CategoryAccuracyExample Types
Timestamps100%datetime.timestamp.*
Country names100%geography.location.country
URLs89.7%technology.internet.url
Dates88.2%datetime.date.*
Person names80-85%identity.person.*

Column-mode inference improves accuracy for ambiguous types: geography +9.7%, datetime +4.8%.

MetricValue
Model load66 ms cold, 25-30 ms warm
Single inferencep50 = 26 ms, p95 = 41 ms
Batch throughput600-750 values/sec
Memory footprint8.5 MB peak RSS
Terminal window
brew install noon-org/tap/finetype
Terminal window
cargo install finetype-cli
Terminal window
git clone https://github.com/noon-org/finetype
cd finetype
cargo build --release