SQL Server, from the ground up Lesson 4 / 40

Data types: the difference between VARCHAR(50) and a lawsuit

VARCHAR vs NVARCHAR, DATETIME vs DATETIME2, DECIMAL vs FLOAT, and the surprisingly expensive consequences of being sloppy about data types in SQL Server.

The first time I saw a production table where every single column was NVARCHAR(MAX), I thought it must be a test database somebody forgot to clean up. It wasn’t. It was the customer master. Someone years ago had decided that “MAX means biggest, and we want to be safe,” and the table had grown to four terabytes while returning one customer at a time on every query.

Data types are not boring. They are the single biggest thing you can get right on day one that will save your future self from weeping at 3am. They decide how much disk space your data takes, how fast queries run, how SQL Server estimates cardinality, whether your indexes work, whether you can sort correctly, whether your backups fit on the nightly tape. They decide whether a query plan picks an index seek or a table scan. They decide whether a JOIN uses a hash or a merge. They decide whether your boss fires you.

This lesson covers the types you’ll use 95% of the time, what to pick, what to avoid, and why. Print it. Tape it to your monitor. Read it again in a week.

The three categories

SQL Server has about 30 data types. They fall into three categories:

  1. Numeric — whole numbers, decimals, money.
  2. Character / string — single-byte strings, Unicode strings, short fixed-length strings, giant blobs.
  3. Date and time — dates, times, datetimes, timezone-aware datetimes.

Plus a handful of specials (BIT, UNIQUEIDENTIFIER, VARBINARY, XML, JSON in 2025+, spatial types, sql_variant) that you’ll use less often.

Let’s go through each category with opinions.

Numeric types

Integers: TINYINT, SMALLINT, INT, BIGINT

Four integer types, in size order:

TypeBytesRange
TINYINT10 to 255 (no negatives!)
SMALLINT2-32,768 to +32,767
INT4~-2.1 billion to +2.1 billion
BIGINT8~-9.2 quintillion to +9.2 quintillion

The mental model: use the smallest one that’ll never run out.

TINYINT is great for things like status codes, month numbers, or a small enumeration. If the values fit in 0 to 255, use it.

SMALLINT for year numbers, small counts, things that won’t exceed 30k.

INT is the workhorse. Auto-incrementing primary keys, counters, quantities, most everything. Two-billion-row range is more than enough for 95% of cases.

BIGINT for when you will exceed two billion rows or need big counts. Event tables, log tables, high-volume OLTP systems. Here’s the rule: any table where you expect more than 500 million rows over its lifetime should have a BIGINT primary key from day one. Converting an INT PK to BIGINT later is possible but painful, and if you’re referenced by 20 foreign keys you’ll change all of them.

Decimals: DECIMAL(p, s) / NUMERIC(p, s)

Same thing, different name. Use DECIMAL. p is precision (total digits), s is scale (digits after the decimal point). DECIMAL(18, 2) can store numbers like 1234567890123456.78.

Use DECIMAL for all money, all percentages, and anything you need to sum or compare exactly. It’s precise. No rounding errors. Slower than integers but worth it.

Floats: FLOAT and REAL

Approximate floating-point. FLOAT(53) is 8-byte double precision. REAL is 4-byte single precision. Fast, but rounding errors are real. SELECT 0.1 + 0.2 returns 0.3; SELECT CAST(0.1 AS FLOAT) + CAST(0.2 AS FLOAT) returns 0.30000000000000004.

Use FLOAT only for scientific values where precision doesn’t have to be exact: sensor readings, statistical measures, estimated latitudes. Never use FLOAT for money. I have seen a financial reconciliation job produce a €0.0000001 mismatch every month because of exactly this. The audit finding took three months to close.

Money: MONEY and SMALLMONEY

Don’t. MONEY looks like it was designed for currency but it has weird precision behaviour, bad type-conversion rules, and the save in storage versus DECIMAL(19, 4) is negligible. Industry consensus for fifteen years has been use DECIMAL(19, 4) for currency, not MONEY.

Booleans: BIT

SQL Server doesn’t have a proper BOOLEAN type. It has BIT, which stores 0, 1, or NULL. Up to 8 BIT columns on a row share a single byte, so they’re efficient.

Watch out: BIT columns aren’t usable in all the same places as integers. You can’t WHERE IsActive = TRUE — there is no TRUE keyword in SQL. Use WHERE IsActive = 1.

String types

This is where most people make the most mistakes. Read it carefully.

Fixed vs variable: CHAR vs VARCHAR, NCHAR vs NVARCHAR

  • CHAR(n) / NCHAR(n) — fixed length. Always n characters, padded with spaces. Wastes space if most values are shorter.
  • VARCHAR(n) / NVARCHAR(n) — variable length. Stores exactly the string you give it, plus 2 bytes of length metadata.

Almost always use VARCHAR or NVARCHAR. Fixed-length CHAR is only appropriate when the column truly is fixed (country code CHAR(2), currency code CHAR(3)), because padding hurts and variable-length is more flexible.

Single-byte vs Unicode: VARCHAR vs NVARCHAR

  • VARCHAR — 1 byte per character (for ASCII). Uses whatever collation the column has, which determines its character set and sort order. Can store accented characters in Western European collations at 1 byte per char using the extended 0-255 range. Can’t store Chinese, Arabic, emoji, or most non-Latin scripts.
  • NVARCHAR — 2 bytes per character (UTF-16). Can store any Unicode character. Twice the storage for ASCII-only data, but doesn’t care about collation drama.

In 2026, my rule is: default to NVARCHAR unless you know you don’t need Unicode. Yes it doubles the storage for ASCII data. Disk is cheap; bug reports from Japanese customers whose names your database corrupts are not. The one exception: internal codes you control (status codes, ISO country codes, SKUs) where you guarantee only ASCII. Those are fine as VARCHAR.

SQL Server 2019 added a UTF-8 collation that lets VARCHAR columns store UTF-8. If you’re on 2019+ and happy pinning a specific collation, VARCHAR with a UTF-8 collation gives you the best of both worlds: variable-width encoding that’s still single-byte for ASCII. That’s a pro move; if it confuses you, just use NVARCHAR and move on.

How big should n be?

A very common mistake: VARCHAR(50) for names, VARCHAR(255) for emails, VARCHAR(max) “just to be safe.”

Rule: size for real-world usage, not for extreme edge cases.

  • Names: NVARCHAR(100) is plenty. Almost no real name is longer than 100 characters.
  • Emails: NVARCHAR(254) — that’s the RFC limit.
  • Phone numbers: NVARCHAR(30) — leaves room for country codes and formatting.
  • URLs: NVARCHAR(2048) — browsers typically cap at 2000-ish.
  • Free-form notes: NVARCHAR(4000) or NVARCHAR(MAX) if you genuinely need big.

Why not just use NVARCHAR(MAX) everywhere? Because MAX is different under the hood. Values up to 8000 bytes are stored in-row. Larger values get stored in separate LOB pages, with a pointer in the row. The engine treats MAX columns more cautiously: they can’t be part of an index key, have restrictions in certain T-SQL features, and on-the-fly sorting and memory grants get weird. Default to sized types. Use MAX when you genuinely need unbounded.

TEXT, NTEXT, IMAGE

Don’t. These are deprecated, have been deprecated since SQL Server 2005, and will eventually be removed. If you see them in an old database, plan a migration to VARCHAR(MAX), NVARCHAR(MAX), or VARBINARY(MAX). I still occasionally find them in the wild, always in code that dates back to the Bush administration.

Date and time types

DATETIME — the old one

Older type. 8 bytes. Precision of 3.33 milliseconds (rounds to .000, .003, .007). Range from 1753 to 9999. This is what older databases use.

Don’t use it for new code. The precision is weird, the range is huge, and it’s replaced by better types.

DATETIME2(n) — the modern one

6 to 8 bytes. Precision of up to 100 nanoseconds. Range from 0001 to 9999.

DATETIME2(0) — no fractional seconds. Second-precision. 6 bytes. Great for most timestamps. DATETIME2(3) — millisecond precision. 7 bytes. Use this if you need to match sub-second events. DATETIME2(7) — full 100ns precision. 8 bytes. Use for logs and event data.

Default to DATETIME2(0) or DATETIME2(3). More precise than DATETIME, smaller, and the range is sensible.

DATE and TIME

Two more for when you want just the date or just the time.

  • DATE — 3 bytes. Range 0001 to 9999. Use for birthdays, event dates, anything where the time of day is irrelevant.
  • TIME(n) — 3 to 5 bytes. Use rarely; most of the time you want a full datetime.

DATETIMEOFFSET(n) — timezone-aware

10 bytes. Stores the datetime plus an offset from UTC.

This is the right choice if you need to preserve the timezone information of the original event. Storing “the user clicked at 14:30 local time in Milan, which is 13:30 UTC” is legitimately different from storing “13:30 UTC with no idea what they were thinking.” We’ll have a whole lesson on the timezone swamp (lesson 15). For now: if you genuinely need timezone awareness, use DATETIMEOFFSET. If you don’t, use DATETIME2 and store everything in UTC.

SMALLDATETIME — don’t

4 bytes, minute precision, range 1900 to 2079. Obscure, saves a few bytes, sees you in 2079. Skip.

The special types worth knowing

UNIQUEIDENTIFIER

16 bytes. Stores a GUID. Generate with NEWID() (random) or NEWSEQUENTIALID() (monotonic).

Useful for distributed systems, primary keys that need to be generated outside the database, and integration with systems that use GUIDs. Bad choice for a clustered index (see lesson 21). Use it when you genuinely need a globally unique ID; don’t use it because it feels modern.

VARBINARY(n) / VARBINARY(MAX)

For binary data — files, images, encrypted blobs. MAX for anything large. In most apps you should probably be storing big files in blob storage (S3, Azure Blob) and the URL in SQL Server, but VARBINARY(MAX) exists for when you really need in-database binary.

BIT

Booleans. Covered above.

XML and JSON

XML is a full-featured XML type with querying support. Rarely the right call in 2026 unless you’re already XML-heavy.

JSON was, for many years, “just store it in NVARCHAR(MAX) and SQL Server has functions to query it.” SQL Server 2025 introduces a real JSON data type with binary storage. If you’re on an older version, NVARCHAR(MAX) + JSON_VALUE() / JSON_QUERY() / OPENJSON() is the idiom.

Implicit conversions: the silent killer

Every data type choice has a knock-on effect: comparisons and joins do better when types match.

If you have CustomerId INT and you write:

SELECT * FROM Customer WHERE CustomerId = '42';

SQL Server does an implicit conversion on the column side. The string '42' gets converted to INT first (actually the engine converts every CustomerId to VARCHAR and then compares, by the built-in type-precedence rules). The result: your index on CustomerId can no longer be seeked. You just demoted a seek into a scan, and you won’t see any error, just a slower query.

Here’s the rule Microsoft calls “SARGability” (Search ARGument-able): a predicate is SARGable when SQL Server can use it to seek an index. Functions, implicit conversions, and arithmetic on the indexed column all break SARGability. Good:

WHERE CreatedAt >= '2025-01-01' AND CreatedAt < '2026-01-01'

Bad:

WHERE YEAR(CreatedAt) = 2025

Bad:

WHERE CAST(CreatedAt AS DATE) = '2025-12-25'

Good:

WHERE CreatedAt >= '2025-12-25' AND CreatedAt < '2025-12-26'

Match your types. Don’t put functions on the indexed column. Your indexes will thank you.

Collation: the plot twist

Every string column has a collation — a set of rules that determine how characters compare and sort. SQL_Latin1_General_CP1_CI_AS is a common default (case-insensitive, accent-sensitive Western European). Latin1_General_100_CS_AS_SC_UTF8 is a Unicode case-sensitive UTF-8 collation introduced in SQL Server 2019.

Collations affect:

  • Whether 'foo' = 'FOO' is true (case sensitivity)
  • Whether 'cafe' = 'café' is true (accent sensitivity)
  • Sorting order (alphabetic? culture-specific?)

Mixing collations in joins causes collation conflicts that produce the fun error: Cannot resolve the collation conflict between "X" and "Y" in the equal to operation. You fix it by explicitly casting one side: col1 COLLATE Latin1_General_CI_AS = col2.

Lesson: pick one collation for your database, stick with it across all string columns, and don’t import data with a different collation unless you’re ready for the fun.

Run this on your own machine

A little demo of why data types matter. Copy-paste, run, read the times.

USE tempdb;
GO

-- Two tables, one well-typed, one sloppy
CREATE TABLE dbo.WellTyped (
    CustomerId INT           NOT NULL,
    Name       NVARCHAR(100) NOT NULL,
    CreatedAt  DATETIME2(0)  NOT NULL
);

CREATE TABLE dbo.Sloppy (
    CustomerId NVARCHAR(50)  NOT NULL,  -- numeric stored as string
    Name       NVARCHAR(MAX) NOT NULL,  -- unbounded for no reason
    CreatedAt  DATETIME       NOT NULL  -- old type
);

-- Insert 1M rows into each
WITH Nums AS (
    SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY a.object_id) AS n
    FROM sys.all_objects a CROSS JOIN sys.all_objects b
)
INSERT INTO dbo.WellTyped (CustomerId, Name, CreatedAt)
SELECT n,
       CONCAT(N'Customer ', n),
       DATEADD(SECOND, n, '2020-01-01')
FROM Nums;

WITH Nums AS (
    SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY a.object_id) AS n
    FROM sys.all_objects a CROSS JOIN sys.all_objects b
)
INSERT INTO dbo.Sloppy (CustomerId, Name, CreatedAt)
SELECT CAST(n AS NVARCHAR(50)),
       CONCAT(N'Customer ', n),
       DATEADD(SECOND, n, '2020-01-01')
FROM Nums;

-- Compare sizes
SELECT OBJECT_NAME(ps.object_id) AS table_name,
       SUM(ps.reserved_page_count) * 8 / 1024 AS reserved_mb
FROM sys.dm_db_partition_stats ps
WHERE ps.object_id IN (OBJECT_ID('dbo.WellTyped'), OBJECT_ID('dbo.Sloppy'))
GROUP BY ps.object_id;

-- Query on the CustomerId column — mismatched type
SET STATISTICS IO, TIME ON;

SELECT Name FROM dbo.WellTyped WHERE CustomerId = 12345;
SELECT Name FROM dbo.Sloppy    WHERE CustomerId = '12345';  -- matching string to string

SET STATISTICS IO, TIME OFF;

-- Cleanup
DROP TABLE dbo.WellTyped;
DROP TABLE dbo.Sloppy;

Run it. Note the reserved space difference, the query times, the IO statistics. The sloppy table will be meaningfully bigger. In a real system with 500 million rows, the difference is the difference between fitting on your backup window and not.

Next lesson: CREATE, ALTER, DROP — making and breaking tables, with practical patterns and the “oh no I dropped the wrong thing” drill.

Search