
FineType

Semantic type detection for text data. 209 types, DuckDB integration, pure Rust.

FineType classifies text into 209 semantic types — dates, emails, IP addresses, coordinates, financial identifiers, and more.

Why it matters

You download a dataset. DuckDB reads it instantly, but every text column is VARCHAR. Is that column of numbers a postal code, a year, or a price? Are those dates US or European format?

FineType answers these questions:

$ finetype profile -f orders.csv

Column        Type                           Confidence
────────────  ─────────────────────────────  ──────────
order_date    datetime.date.us_slash          0.97
amount        representation.numeric.decimal  0.98
customer      identity.person.full_name       0.93
country       geography.location.country      0.95
ip_address    technology.internet.ip_v4       0.99

Every type maps to a DuckDB SQL expression. FineType says order_date is datetime.date.us_slash — that means strptime(order_date, '%m/%d/%Y') will succeed on every matching value. Profile first, then cast with confidence.
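The profile-then-cast workflow can be sketched in plain Python. Here `CAST_FORMATS` and `cast_column` are illustrative names (not part of FineType's API); the real tool emits a DuckDB SQL expression per type, but the idea is the same: the detected label implies a parse format.

```python
from datetime import datetime, date

# Illustrative mapping from FineType labels to strptime formats;
# the real tool emits a DuckDB SQL expression per type.
CAST_FORMATS = {
    "datetime.date.us_slash": "%m/%d/%Y",
    "datetime.date.eu_slash": "%d/%m/%Y",
}

def cast_column(values, label):
    """Parse every value with the format implied by the detected label."""
    fmt = CAST_FORMATS[label]
    return [datetime.strptime(v, fmt).date() for v in values]

print(cast_column(["01/15/2024", "12/31/2023"], "datetime.date.us_slash"))
# → [datetime.date(2024, 1, 15), datetime.date(2023, 12, 31)]
```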

Installation

Homebrew (macOS)

brew install noon-org/tap/finetype

Cargo

cargo install finetype-cli

From Source

git clone https://github.com/noon-org/finetype
cd finetype
cargo build --release

CLI Usage

# Classify a single value
finetype infer -i "bc89:60a9:23b8:c1e9:3924:56de:3eb1:3b90"

# Classify from file (one value per line), JSON output
finetype infer -f data.txt --output json

# Column-mode inference (distribution-based disambiguation)
finetype infer -f column_values.txt --mode column

# Profile a CSV file — detect column types
finetype profile -f data.csv

# Profile with data quality validation
finetype profile -f data.csv --validate

# Generate a DuckDB CREATE TABLE from a CSV
finetype schema-for -f data.csv

# Export a JSON Schema for a type
finetype schema datetime.timestamp.iso_8601

# Validate data quality against taxonomy schemas
finetype validate -f data.ndjson --strategy quarantine

# Generate synthetic training data
finetype generate --samples 1000 --output training.ndjson

# Validate generator ↔ taxonomy alignment
finetype check

# Show taxonomy (filter by domain, category, priority)
finetype taxonomy --domain datetime

Column-Mode Inference

Single-value classification can be ambiguous: is 01/02/2024 a US date (Jan 2) or EU date (Feb 1)? Is 1995 a year, postal code, or plain number?

Column-mode analyses the distribution of values in a column and applies disambiguation rules:

  • Date format — US vs EU slash dates, short vs long dates
  • Year detection — 4-digit integers predominantly in 1900–2100 range
  • Coordinate resolution — latitude vs longitude based on value ranges
  • Numeric types — ports, increments, postal codes, street numbers

# CLI column-mode
finetype infer -f column_values.txt --mode column

# CSV profiling (uses column-mode automatically)
finetype profile -f data.csv
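As a rough illustration of distribution-based disambiguation (a toy rule, not FineType's actual logic): a single slash date with both fields ≤ 12 is ambiguous, but across a whole column one out-of-range field settles the format.

```python
def disambiguate_slash_dates(values):
    """Guess US vs EU slash-date format from a column's value distribution.

    Illustrative only: if any first field exceeds 12 it must be a day, so
    the column reads as EU (day/month); if any second field exceeds 12, it
    reads as US (month/day). Fully ambiguous columns stay undecided.
    """
    firsts = [int(v.split("/")[0]) for v in values]
    seconds = [int(v.split("/")[1]) for v in values]
    if any(f > 12 for f in firsts):
        return "datetime.date.eu_slash"
    if any(s > 12 for s in seconds):
        return "datetime.date.us_slash"
    return None  # every value has both fields <= 12

print(disambiguate_slash_dates(["01/15/2024", "02/03/2024"]))
# → datetime.date.us_slash
```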

Schema Export

Once you know your column types, FineType can generate a DuckDB CREATE TABLE statement — every column typed, each cast expression derived from the profiled format:

$ finetype schema-for -f orders.csv

CREATE TABLE orders (
    order_date    DATE,          -- strptime(order_date, '%m/%d/%Y')
    amount        DECIMAL(10,2), -- decimal
    customer      VARCHAR,       -- full_name
    country       VARCHAR,       -- country
    ip_address    VARCHAR        -- ip_v4 (INET)
);

Output formats include plain SQL, JSON (for programmatic use), and Arrow schema JSON.

Data Quality

Profile with --validate to get quality grades for each column. FineType checks every value against the type's schema and reports what doesn't match:

$ finetype profile -f data.csv --validate

Column        Type                    Confidence  Quality  Invalid
────────────  ──────────────────────  ──────────  ───────  ───────
email         identity.person.email   0.99        A        0/1000
order_date    datetime.date.us_slash  0.97        B        12/1000
amount        representation.numeric  0.94        A        2/1000

For deeper validation, finetype validate supports quarantine, null-replacement, forward-fill, and backward-fill strategies for handling invalid values.
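The four strategies can be sketched as follows. The semantics assumed here (quarantine splits invalid values into a separate list, null-replacement substitutes None, forward-fill copies the most recent valid value, backward-fill the next one) are the conventional meanings of those terms, not a transcription of FineType's implementation.

```python
def apply_strategy(values, is_valid, strategy):
    """Sketch of the invalid-value strategies named by finetype validate.

    Assumed semantics: quarantine splits invalid values into a separate
    list, null replaces them with None, forward-fill copies the most
    recent valid value, backward-fill copies the next valid value.
    """
    pairs = list(zip(values, is_valid))
    if strategy == "quarantine":
        kept = [v for v, ok in pairs if ok]
        quarantined = [v for v, ok in pairs if not ok]
        return kept, quarantined
    if strategy == "null":
        return [v if ok else None for v, ok in pairs]
    if strategy == "forward-fill":
        out, last = [], None
        for v, ok in pairs:
            if ok:
                last = v
            out.append(v if ok else last)
        return out
    if strategy == "backward-fill":
        rev = apply_strategy(values[::-1], is_valid[::-1], "forward-fill")
        return rev[::-1]
    raise ValueError(f"unknown strategy: {strategy}")

print(apply_strategy(["1", "x", "3"], [True, False, True], "forward-fill"))
# → ['1', '1', '3']
```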

DuckDB Extension

-- Install and load
INSTALL finetype FROM community;
LOAD finetype;

-- Classify a single value
SELECT finetype('192.168.1.1');
-- → 'technology.internet.ip_v4'

-- Detailed output (type, confidence, DuckDB broad type)
SELECT finetype_detail(value) FROM my_table;
-- → '{"type":"datetime.date.us_slash","confidence":0.98,"broad_type":"DATE"}'

-- Normalize values for safe TRY_CAST
SELECT finetype_cast(value) FROM my_table;

-- Recursively classify JSON fields
SELECT finetype_unpack(json_col) FROM my_table;

-- Check extension version
SELECT finetype_version();

The extension embeds model weights at compile time — no external files needed.

Taxonomy

FineType recognises 209 types across 7 domains:

Domain          Types  Examples
──────────────  ─────  ────────────────────────────────────────────────────────────────────────
datetime           84  ISO 8601, RFC 2822, Unix timestamps, timezones, 27+ date format locales
representation     32  Integers, floats, booleans, hex colours, base64, JSON
finance            28  IBAN, SWIFT/BIC, currency amounts, credit card numbers, tax identifiers
identity           19  Names, emails, phone numbers (46+ locales), passwords
technology         19  IPv4, IPv6, MAC addresses, URLs, UUIDs, DOIs, hashes, user agents
geography          15  Latitude, longitude, countries, cities, postal codes (65+ locales)
container          12  JSON objects, CSV rows, query strings, key-value pairs

Label format: {domain}.{category}.{type} — e.g., technology.internet.ip_v4. Locale-specific types append a suffix: identity.person.phone_number.EN_AU.
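Given that documented shape, splitting a label into its parts is a one-liner; this helper is hypothetical, not part of FineType.

```python
def parse_label(label):
    """Split a FineType label into (domain, category, type, locale).

    Assumes the documented shape {domain}.{category}.{type} with an
    optional locale suffix such as EN_AU. Hypothetical helper.
    """
    parts = label.split(".")
    domain, category, type_ = parts[0], parts[1], parts[2]
    locale = parts[3] if len(parts) > 3 else None
    return domain, category, type_, locale

print(parse_label("identity.person.phone_number.EN_AU"))
# → ('identity', 'person', 'phone_number', 'EN_AU')
```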

Performance

Accuracy

Evaluated on 21 real-world datasets (116 annotated columns):

Metric           Result
───────────────  ─────────────────────────────────────────
Label accuracy   97.4%
Domain accuracy  98.3%
Actionability    99.7% (DuckDB casts succeed on real data)

The inference pipeline uses a two-stage architecture — Sense (broad classification) followed by Sharpen (fine-grained disambiguation) — with column-mode distribution analysis for ambiguous types like dates and coordinates.
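A minimal sketch of the two-stage idea, with toy stand-ins for the real models: a cheap first pass picks a broad family, and a second pass resolves the fine-grained label within it.

```python
def sense(value):
    """Stage 1 (toy stand-in): broad family from cheap surface features."""
    if value.count("/") == 2 and value.replace("/", "").isdigit():
        return "datetime"
    return "representation"

def sharpen(family, value):
    """Stage 2 (toy stand-in): fine-grained label within the sensed family."""
    if family == "datetime":
        first = int(value.split("/")[0])
        return ("datetime.date.us_slash" if first <= 12
                else "datetime.date.eu_slash")
    return "representation.numeric.decimal"

def classify(value):
    # Sense narrows the search space; Sharpen disambiguates within it.
    return sharpen(sense(value), value)

print(classify("01/15/2024"))
# → datetime.date.us_slash
```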

Latency & Throughput

Metric            Value
────────────────  ─────────────────────────
Model load        66 ms cold, 25–30 ms warm
Single inference  p50 = 26 ms, p95 = 41 ms
Batch throughput  600–750 values/sec
Memory footprint  8.5 MB peak RSS
