collections + dataclass: the small data structures you need

There’s a tax you pay in any language for not knowing the standard library well enough. In Python it shows up most often in two places: people writing twenty lines of dict-juggling that Counter or defaultdict would have handled in two, and people writing four-line __init__ methods that @dataclass would have generated for them.

This lesson covers the small data structures. None of them are exciting. All of them save time you would otherwise spend on boilerplate.

Counter: counting hashable things

from collections import Counter

words: list[str] = ["apple", "banana", "apple", "cherry", "banana", "apple"]
counts: Counter[str] = Counter(words)
# Counter({'apple': 3, 'banana': 2, 'cherry': 1})

counts.most_common(2)  # [('apple', 3), ('banana', 2)]
counts["apple"]        # 3
counts["mango"]        # 0  — never raises KeyError

Counter is a dict subclass that defaults missing keys to 0. You feed it any iterable of hashable items and you get back the frequency of each. The most_common(n) method is the killer feature — sorted, descending, top n. No sorted(items, key=lambda x: -x[1])[:n] ever again.

Real-world example: figuring out which IPs trigger the most 500 responses:

from collections import Counter
from pathlib import Path

errors: Counter[str] = Counter()
for line in Path("access.log").read_text().splitlines():
    if " 500 " in line:
        ip: str = line.split()[0]
        errors[ip] += 1

print(errors.most_common(10))
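The same count can also be expressed as a single Counter call fed by a generator. This is a minimal sketch, assuming the same log layout as above (IP in the first field, status code surrounded by spaces); count_error_ips and the sample lines are hypothetical names made up for illustration:

```python
from collections import Counter
from collections.abc import Iterable


def count_error_ips(lines: Iterable[str]) -> Counter[str]:
    """Count the first field (assumed to be the client IP) of each line containing ' 500 '."""
    return Counter(line.split()[0] for line in lines if " 500 " in line)


# Hypothetical sample lines standing in for a real access log.
sample_lines: list[str] = [
    '10.0.0.1 - - "GET /a HTTP/1.1" 500 12',
    '10.0.0.2 - - "GET /b HTTP/1.1" 200 34',
    '10.0.0.1 - - "GET /c HTTP/1.1" 500 56',
    '10.0.0.3 - - "GET /d HTTP/1.1" 500 78',
]

error_counts = count_error_ips(sample_lines)
print(error_counts.most_common(2))  # [('10.0.0.1', 2), ('10.0.0.3', 1)]
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, which avoids reading the whole log into memory first.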

Counters also compose with arithmetic operators:

a: Counter[str] = Counter("aabbc")
b: Counter[str] = Counter("abbcc")

a + b   # Counter({'b': 4, 'a': 3, 'c': 3})  (counts added)
a - b   # Counter({'a': 1})  (only positive counts kept)
a & b   # Counter({'b': 2, 'a': 1, 'c': 1})  (min, i.e. intersection)
a | b   # Counter({'a': 2, 'b': 2, 'c': 2})  (max, i.e. union)

These are multiset semantics, useful for questions like “how many of each token appear in both documents.”
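As a rough illustration of the token question, here is a sketch with two made-up sentences standing in for documents (the names doc_a, doc_b, shared, and only_in_a are hypothetical):

```python
from collections import Counter

doc_a: Counter[str] = Counter("the cat sat on the mat".split())
doc_b: Counter[str] = Counter("the dog sat on the log by the door".split())

shared = doc_a & doc_b      # per-token minimum of the two counts
only_in_a = doc_a - doc_b   # what remains of doc_a after subtracting doc_b

print(dict(shared))      # {'the': 2, 'sat': 1, 'on': 1}
print(dict(only_in_a))   # {'cat': 1, 'mat': 1}
```

Note that "the" appears three times in doc_b but only twice in doc_a, so the intersection keeps the smaller count and the subtraction drops it entirely.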

defaultdict: auto-initializing keys

from collections import defaultdict

groups: defaultdict[str, list[int]] = defaultdict(list)

for user_id, tag in [(1, "a"), (2, "a"), (1, "b"), (3, "a")]:
    groups[tag].append(user_id)

# defaultdict(<class 'list'>, {'a': [1, 2, 3], 'b': [1]})

defaultdict(factory) creates the value by calling factory() the first time you index a missing key. With list, you get an empty list ready to .append() to. The “group these things by some key” pattern is so common that this is arguably the most-used member of the collections module.
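One detail worth seeing in action: only indexing triggers the factory. A small sketch (the visits name is hypothetical) showing that membership tests and .get() leave the dict untouched:

```python
from collections import defaultdict

visits: defaultdict[str, list[int]] = defaultdict(list)

# Neither `in` nor .get() goes through __getitem__, so no key is created.
print("home" in visits)     # False
print(visits.get("home"))   # None
print(len(visits))          # 0

# Indexing a missing key calls the factory and stores the result.
visits["home"]
print(len(visits))          # 1
print(dict(visits))         # {'home': []}
```

This matters when you read from a defaultdict in a hot path: a stray lookup with square brackets quietly grows the dict.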

Other factories:

defaultdict(int) — auto-zero, like a primitive Counter
defaultdict(set) — auto-empty set for deduped grouping
defaultdict(dict) — nested dicts; key once, get a dict, stuff things into it
defaultdict(lambda: "unknown") — any callable that returns the default value
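The factories listed above could be exercised like this; all variable names and sample values are made up for illustration:

```python
from collections import defaultdict

hits: defaultdict[str, int] = defaultdict(int)
for page in ["/", "/about", "/"]:
    hits[page] += 1                     # missing pages start at 0

tags_by_user: defaultdict[int, set[str]] = defaultdict(set)
for user_id, tag in [(1, "a"), (1, "a"), (2, "b")]:
    tags_by_user[user_id].add(tag)      # duplicates collapse in the set

settings: defaultdict[str, dict[str, str]] = defaultdict(dict)
settings["db"]["host"] = "localhost"    # inner dict created on first access

status: defaultdict[str, str] = defaultdict(lambda: "unknown")
status["api"] = "up"

print(dict(hits))           # {'/': 2, '/about': 1}
print(dict(tags_by_user))   # {1: {'a'}, 2: {'b'}}
print(dict(settings))       # {'db': {'host': 'localhost'}}
print(status["worker"])     # unknown
```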

The alternative without defaultdict:

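pairs: list[tuple[int, str]] = [(1, "a"), (2, "a"), (1, "b"), (3, "a")]
groups: dict[str, list[int]] = {}
for user_id, tag in pairs:
    if tag not in groups:
        groups[tag] = []    # first time we see this tag
    groups[tag].append(user_id)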

Or with setdefault:

groups.setdefault(tag, []).append(user_id)

Both work. defaultdict is cleaner once your loop body has more than one append.

AI assistance note. Modern AI coding assistants are reliably good at recognizing the “I’m grouping things” or “I’m counting things” patterns and suggesting defaultdict(list) or Counter once they see the loop. They’re worth the prompt. The opposite trap is more common: assistants tend to overproduce @dataclass for any structured data, including cases where a plain dict would have been simpler. If you find yourself with five dataclasses that each hold three fields and never get methods, consider whether plain dicts would have been the more honest choice.

deque: the list you want when you’re appending and popping

from collections import deque

queue: deque[int] = deque([1, 2, 3])

queue.append(4)        # right side: [1, 2, 3, 4]
queue.appendleft(0)    # left side:  [0, 1, 2, 3, 4]
queue.pop()            # 4, removed from right
queue.popleft()        # 0, removed from left

A list has fast appends and pops at the right end and slow ones at the left, because removing from the front shifts every other element. A deque (pronounced “deck”) has O(1) operations on both ends.

Use it when:

You need a FIFO queue (append + popleft).
You need a LIFO stack (though a plain list works fine for this).
You need a sliding window with a fixed maximum length:

stream: list[float] = [float(n) for n in range(250)]  # stand-in for a real data source
recent: deque[float] = deque(maxlen=100)
for value in stream:
    recent.append(value)  # once full, each append evicts the oldest value
    if len(recent) == 100:
        moving_avg: float = sum(recent) / 100

maxlen is the underrated feature. The deque silently drops the oldest entry when it would exceed the limit. Perfect for moving averages, rolling logs, “last N events.”
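A tiny window makes the eviction easy to see. This is only a sketch with a hypothetical last_three deque:

```python
from collections import deque

last_three: deque[int] = deque(maxlen=3)
for n in range(1, 6):
    last_three.append(n)

print(last_three)           # deque([3, 4, 5], maxlen=3)
print(last_three.maxlen)    # 3

last_three.appendleft(0)    # a full deque drops from the opposite end
print(list(last_three))     # [0, 3, 4]
```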

If you find yourself doing lst.pop(0) or lst.insert(0, x) on a long list, switch to a deque.
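A breadth-first traversal is a typical place where the FIFO pattern shows up. The sketch below assumes a small hand-written graph; graph and bfs_order are hypothetical names used only for illustration:

```python
from collections import deque

# Hypothetical adjacency list for illustration.
graph: dict[str, list[str]] = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": [],
}


def bfs_order(start: str) -> list[str]:
    """Return nodes in breadth-first order starting from `start`."""
    seen: set[str] = {start}
    order: list[str] = []
    pending: deque[str] = deque([start])
    while pending:
        node = pending.popleft()    # O(1); list.pop(0) would shift every element
        order.append(node)
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                pending.append(neighbour)
    return order


print(bfs_order("a"))  # ['a', 'b', 'c', 'd']
```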

namedtuple: a tuple with field names

from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])

p: Point = Point(3, 4)
p.x          # 3
p.y          # 4
p[0]         # 3 — also works as a tuple
x, y = p     # tuple unpacking still works

It’s a tuple — immutable, hashable, comparable, unpackable — but with named attribute access. namedtuple was the answer for years to “I want a small read-only struct.”
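Those properties can be checked directly. A short sketch, reusing the Point definition from above; _replace and _asdict are the standard namedtuple helpers:

```python
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])

p = Point(3, 4)
print(p == (3, 4))             # True: compares as a plain tuple
print({p: "seen"}[Point(3, 4)])  # seen: hashable, so usable as a dict key
print(p._replace(x=10))        # Point(x=10, y=4): returns a new tuple
print(p._asdict())             # {'x': 3, 'y': 4}

try:
    p.x = 99                   # attributes are read-only
except AttributeError as exc:
    print(type(exc).__name__)  # AttributeError
```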

In modern Python (3.7+), the answer is usually @dataclass(frozen=True) instead, unless you specifically need the tuple-like behavior:

Unpacking by position
== comparison by tuple value
Working with code that expects a sequence (a database row, an unpacking-based API)
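To make the difference concrete, here is a sketch that puts a namedtuple next to a frozen dataclass with the same fields. PointT and PointD are hypothetical names chosen for the comparison:

```python
from collections import namedtuple
from dataclasses import dataclass

PointT = namedtuple("PointT", ["x", "y"])


@dataclass(frozen=True)
class PointD:
    x: float
    y: float


x, y = PointT(1.0, 2.0)                 # positional unpacking works
print(PointT(1.0, 2.0) == (1.0, 2.0))   # True: equal to any matching tuple
print(PointD(1.0, 2.0) == (1.0, 2.0))   # False: only equal to its own class
print(len(PointT(1.0, 2.0)))            # 2: it is a real sequence

try:
    x, y = PointD(1.0, 2.0)             # dataclass instances are not iterable
except TypeError as exc:
    print(type(exc).__name__)           # TypeError
```

The tuple equality cuts both ways: it is convenient for database rows, but it also means two unrelated namedtuple types with the same values compare equal.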

For everything else, dataclass is more flexible. There’s also typing.NamedTuple for the same idea with type hints baked into the class syntax:

from typing import NamedTuple

class Point(NamedTuple):
    x: float
    y: float

Same runtime behavior as collections.namedtuple, nicer to write.
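The class syntax also leaves room for defaults and methods. A sketch under the assumption that you want a default for y and a helper method; norm is a hypothetical method added for illustration:

```python
from typing import NamedTuple


class Point(NamedTuple):
    x: float
    y: float = 0.0    # fields with defaults must come after those without

    def norm(self) -> float:
        """Euclidean distance from the origin."""
        return (self.x ** 2 + self.y ** 2) ** 0.5


p = Point(3.0, 4.0)
print(isinstance(p, tuple))   # True
print(p.norm())               # 5.0
print(Point(2.0))             # Point(x=2.0, y=0.0)
print(Point._fields)          # ('x', 'y')
```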

OrderedDict: mostly redundant since 3.7

Before Python 3.7, dict didn’t guarantee insertion order. OrderedDict did. Since 3.7, regular dict preserves insertion order as a language guarantee. So OrderedDict is mostly historical.

It still has two unique features:

move_to_end(key) and move_to_end(key, last=False) for repositioning entries, plus popitem(last=False) for popping from the front.
== comparison considers order. OrderedDict([("a", 1), ("b", 2)]) != OrderedDict([("b", 2), ("a", 1)]), but with regular dicts they’d be equal.
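Both behaviors are easy to verify. A minimal sketch with made-up keys:

```python
from collections import OrderedDict

od: OrderedDict[str, int] = OrderedDict([("a", 1), ("b", 2), ("c", 3)])

od.move_to_end("a")                 # a goes to the back
print(list(od))                     # ['b', 'c', 'a']
od.move_to_end("c", last=False)     # c goes to the front
print(list(od))                     # ['c', 'b', 'a']

first = OrderedDict([("a", 1), ("b", 2)])
second = OrderedDict([("b", 2), ("a", 1)])
print(first == second)              # False: order matters between OrderedDicts
print(dict(first) == dict(second))  # True: plain dicts ignore order
```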

For 99% of code: just use dict. OrderedDict exists when you need those two features.

@dataclass: the modern struct

This is the big one. @dataclass was added in Python 3.7 (PEP 557) and replaces a huge swathe of boilerplate.

Before:

class Order:
    def __init__(self, id: int, customer: str, total: float, paid: bool = False) -> None:
        self.id = id
        self.customer = customer
        self.total = total
        self.paid = paid

    def __repr__(self) -> str:
        return f"Order(id={self.id!r}, customer={self.customer!r}, total={self.total!r}, paid={self.paid!r})"

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, Order):
            return NotImplemented
        return (self.id, self.customer, self.total, self.paid) == \
               (other.id, other.customer, other.total, other.paid)

After:

from dataclasses import dataclass

@dataclass
class Order:
    id: int
    customer: str
    total: float
    paid: bool = False

The decorator inspects the class annotations and generates __init__, __repr__, and __eq__ automatically. You write what’s actually unique about the class — the fields, their types, their defaults — and skip the rest.

o = Order(id=1, customer="Marco", total=49.99)
print(o)           # Order(id=1, customer='Marco', total=49.99, paid=False)
o.paid = True
o == Order(1, "Marco", 49.99, True)  # True

The variants worth knowing

frozen=True — immutable, hashable. Use for value objects, dictionary keys, anything that shouldn’t mutate.

@dataclass(frozen=True)
class Coord:
    x: float
    y: float

c = Coord(1.0, 2.0)
c.x = 5.0  # FrozenInstanceError
points: set[Coord] = {Coord(1, 2), Coord(3, 4)}  # works because frozen is hashable
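Since a frozen instance cannot be edited, the usual way to "change" one is to build a modified copy. The sketch below uses dataclasses.replace from the standard library for that; the moved name is hypothetical:

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Coord:
    x: float
    y: float


c = Coord(1.0, 2.0)
try:
    c.x = 5.0
except FrozenInstanceError:
    print("cannot assign to a frozen instance")

moved = replace(c, x=5.0)   # builds a new Coord, leaves c untouched
print(c)                    # Coord(x=1.0, y=2.0)
print(moved)                # Coord(x=5.0, y=2.0)
print(hash(Coord(1.0, 2.0)) == hash(c))  # True: equal values hash equally
```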

slots=True (3.10+) — generates __slots__, saving memory and slightly speeding attribute access. Useful when you have millions of instances. The trade-off is no dynamic attribute assignment outside the declared fields.

@dataclass(slots=True)
class Tick:
    timestamp: float
    price: float
    volume: int

kw_only=True (3.10+) — forces all fields to be keyword-only at construction. Good when a class has many fields and positional construction would be unreadable.

@dataclass(kw_only=True)
class HttpRequest:
    url: str
    method: str = "GET"
    headers: dict[str, str] | None = None
    timeout: float = 30.0

# Must call as: HttpRequest(url="...", method="POST")

You can also mark individual fields kw-only with the KW_ONLY sentinel — see the docs when you need it.

Mutable defaults: the one trap

Don’t do this:

@dataclass
class Shopping:
    items: list[str] = []   # ValueError on class definition

The decorator rejects this at class-definition time, because otherwise every instance would share the same list. Use field(default_factory=list):

from dataclasses import dataclass, field

@dataclass
class Shopping:
    items: list[str] = field(default_factory=list)
    notes: dict[str, str] = field(default_factory=dict)

The factory is called fresh for each new instance.
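Both halves of the rule can be checked in a few lines. In this sketch the Broken class exists only to trigger the error:

```python
from dataclasses import dataclass, field


@dataclass
class Shopping:
    items: list[str] = field(default_factory=list)


first = Shopping()
second = Shopping()
first.items.append("milk")
print(first.items)    # ['milk']
print(second.items)   # []: each instance got its own list

try:
    @dataclass
    class Broken:
        items: list[str] = []       # rejected at class-definition time
except ValueError as exc:
    print(type(exc).__name__)       # ValueError
```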

Pydantic vs dataclass

A common question. Both let you declare fields with types. The difference:

@dataclass does no validation. If you say id: int, you’ll get whatever the caller passes. Pass "5" and you get a string with a type hint that lies.
Pydantic validates and coerces. Pass "5" to a Pydantic model expecting int, and you get 5. Pass "hello" and you get a ValidationError.
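The dataclass half of that comparison can be shown with the standard library alone. The sketch below also includes a hand-written __post_init__ check as one possible stopgap; it is an assumption about how you might add light validation, not a substitute for what Pydantic does (it rejects bad input but does not coerce). CheckedOrder is a hypothetical name:

```python
from dataclasses import dataclass


@dataclass
class Order:
    id: int
    customer: str


unchecked = Order(id="5", customer="Marco")   # accepted: hints are not enforced
print(type(unchecked.id).__name__)            # str


@dataclass
class CheckedOrder:
    id: int
    customer: str

    def __post_init__(self) -> None:
        # Hand-written check; runs right after the generated __init__.
        if not isinstance(self.id, int):
            raise TypeError(f"id must be int, got {type(self.id).__name__}")


try:
    CheckedOrder(id="5", customer="Marco")
except TypeError as exc:
    print(exc)                                # id must be int, got str
```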

Use Pydantic when:

You’re parsing untrusted input (HTTP body, JSON file, config file).
You’re at an API boundary and want guarantees.
You want JSON schema generation, serialization, validators on fields.

Use @dataclass when:

You’re modeling internal state and your own code is the only producer.
You don’t want a third-party dependency.
You don’t need validation; type hints are documentation here, not enforcement.

In a typical service: Pydantic for the request and response models, dataclasses for everything internal. attrs is the older third-party library that inspired dataclasses. It is still excellent and still in wide use, but if you’re starting fresh today the choice is usually @dataclass (stdlib) or Pydantic (validation).

Real-world example: log summary model

from dataclasses import dataclass, field
from datetime import datetime, timezone
from collections import Counter, defaultdict

@dataclass(frozen=True, slots=True)
class LogEntry:
    timestamp: datetime
    level: str
    service: str
    message: str

@dataclass
class LogSummary:
    period_start: datetime
    period_end: datetime
    total: int = 0
    by_level: Counter[str] = field(default_factory=Counter)
    by_service: defaultdict[str, list[str]] = field(
        default_factory=lambda: defaultdict(list)
    )

def summarize(entries: list[LogEntry]) -> LogSummary:
    if not entries:
        now: datetime = datetime.now(tz=timezone.utc)
        return LogSummary(period_start=now, period_end=now)

    summary = LogSummary(
        period_start=min(e.timestamp for e in entries),
        period_end=max(e.timestamp for e in entries),
        total=len(entries),
    )
    for e in entries:
        summary.by_level[e.level] += 1
        if e.level == "ERROR":
            summary.by_service[e.service].append(e.message)
    return summary

LogEntry is frozen and slotted because a log can hold millions of entries and they don’t change. LogSummary is mutable because we build it up. Counter handles level frequencies; defaultdict(list) groups error messages by service. Type hints throughout, no boilerplate, and the structure of the data is visible at a glance.

That’s the goal: less code, more meaning per line.

This concludes Module 2 — Standard library mastery. Module 3 picks up with iterators, generators, and the itertools toolkit that makes Python code feel less like a sequence of for-loops and more like a pipeline.

References: collections — Container datatypes, dataclasses — Data Classes, typing.NamedTuple, PEP 557 — Data Classes. Retrieved 2026-05-01.