There’s a tax you pay in any language for not knowing the standard library well enough. In Python it shows up most often in two places: people writing twenty lines of dict-juggling that Counter or defaultdict would have handled in two, and people writing four-line __init__ methods that @dataclass would have generated for them.
This lesson is the small data structures. None of them are exciting. All of them are time you stop spending on boilerplate.
Counter: counting hashable things
from collections import Counter
words: list[str] = ["apple", "banana", "apple", "cherry", "banana", "apple"]
counts: Counter[str] = Counter(words)
# Counter({'apple': 3, 'banana': 2, 'cherry': 1})
counts.most_common(2) # [('apple', 3), ('banana', 2)]
counts["apple"] # 3
counts["mango"] # 0 — never raises KeyError
Counter is a dict subclass that defaults missing keys to 0. You feed it any iterable of hashable items and you get back the frequency of each. The most_common(n) method is the killer feature — sorted, descending, top n. No sorted(items, key=lambda x: -x[1])[:n] ever again.
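To make the comparison concrete, here is a minimal sketch showing that most_common(n) returns the same result as the manual sort it replaces. The word list is illustrative sample data, and update() is included only to show how counts accumulate.

```python
from collections import Counter

# Illustrative sample data.
words = ["apple", "banana", "apple", "cherry", "banana", "apple"]
counts = Counter(words)

# The manual version: sort (item, count) pairs by descending count, keep the top n.
manual_top2 = sorted(counts.items(), key=lambda pair: -pair[1])[:2]

# The built-in version.
builtin_top2 = counts.most_common(2)

print(manual_top2 == builtin_top2)  # True

# update() adds counts from another iterable instead of replacing them.
counts.update(["cherry", "cherry", "cherry"])
print(counts.most_common(1))  # [('cherry', 4)]
```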
Real-world example: figuring out which IPs trigger the most 500 responses in an access log:
from collections import Counter
from pathlib import Path
errors: Counter[str] = Counter()
for line in Path("access.log").read_text().splitlines():
    if " 500 " in line:
        ip: str = line.split()[0]
        errors[ip] += 1
print(errors.most_common(10))
Counters also compose with arithmetic operators:
a: Counter[str] = Counter("aabbc")
b: Counter[str] = Counter("abbcc")
a + b # Counter({'b': 4, 'a': 3, 'c': 3})
a - b # Counter({'a': 1}), only positive counts kept
a & b # Counter({'b': 2, 'a': 1, 'c': 1}), min of each count (intersection)
a | b # Counter({'a': 2, 'b': 2, 'c': 2}), max of each count (union)
Multiset semantics. Useful for things like “how many of each token appear in both documents.”
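As a rough sketch of that token-overlap idea, the snippet below compares two made-up sentences. The documents and the whitespace tokenization are assumptions for illustration only; real text would need proper tokenizing.

```python
from collections import Counter

# Two hypothetical documents, tokenized naively on whitespace.
doc_a = "the cat sat on the mat"
doc_b = "the dog sat on the log with the cat"

tokens_a = Counter(doc_a.split())
tokens_b = Counter(doc_b.split())

# Intersection keeps the minimum count of each token present in both.
shared = tokens_a & tokens_b
print(shared["the"])  # 2
print(shared["mat"])  # 0, because "mat" appears only in doc_a

# Subtraction keeps only tokens that doc_a has more of than doc_b.
only_in_a = tokens_a - tokens_b
print(dict(only_in_a))  # {'mat': 1}
```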
defaultdict: auto-initializing keys
from collections import defaultdict
groups: defaultdict[str, list[int]] = defaultdict(list)
for user_id, tag in [(1, "a"), (2, "a"), (1, "b"), (3, "a")]:
    groups[tag].append(user_id)
# defaultdict(<class 'list'>, {'a': [1, 2, 3], 'b': [1]})
defaultdict(factory) creates the value by calling factory() the first time you look up a missing key with square brackets, then stores it under that key. With list, you get an empty list ready to .append() to. The “group these things by some key” pattern is so common that this is probably the most-used member of the collections module.
Other factories:
- defaultdict(int): auto-zero, like a primitive Counter
- defaultdict(set): auto-empty set for deduped grouping
- defaultdict(dict): nested dicts; key once, get a dict, stuff things into it
- defaultdict(lambda: "unknown"): any callable that returns the default value
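A short sketch of two of these factories in use. The visit events are invented sample data. The last few lines illustrate a behavior worth knowing: a square-bracket lookup on a missing key inserts it, while .get() does not.

```python
from collections import defaultdict

# Hypothetical (user, page) visit events, including one duplicate.
visits = [("ana", "/home"), ("ben", "/home"), ("ana", "/home"), ("ana", "/docs")]

# defaultdict(set): group users per page; duplicates collapse automatically.
users_by_page: defaultdict[str, set[str]] = defaultdict(set)
# defaultdict(int): count hits per page.
hits_by_page: defaultdict[str, int] = defaultdict(int)

for user, page in visits:
    users_by_page[page].add(user)
    hits_by_page[page] += 1

print(sorted(users_by_page["/home"]))  # ['ana', 'ben']
print(hits_by_page["/home"])           # 3

# .get() leaves the dict untouched; a [] lookup creates the key.
hits_by_page.get("/missing")
print("/missing" in hits_by_page)      # False
hits_by_page["/missing"]
print("/missing" in hits_by_page)      # True
```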
The alternative without defaultdict:
pairs: list[tuple[int, str]] = [(1, "a"), (2, "a"), (1, "b"), (3, "a")]
groups: dict[str, list[int]] = {}
for user_id, tag in pairs:
    if tag not in groups:
        groups[tag] = []
    groups[tag].append(user_id)
Or with setdefault:
groups.setdefault(tag, []).append(user_id)
Both work. defaultdict is cleaner once your loop body has more than one append.
AI assistance note. Modern AI coding assistants are reliably good at recognizing the “I’m grouping things” or “I’m counting things” patterns and suggesting defaultdict(list) or Counter once they see the loop. They’re worth the prompt. The opposite trap is more common: assistants tend to overproduce @dataclass for any structured data, including cases where a plain dict would have been simpler. If you find yourself with five dataclasses that each hold three fields and never get methods, consider whether a dict[str, Foo] would have been honest.
deque: the list you want when you’re appending and popping
from collections import deque
queue: deque[int] = deque([1, 2, 3])
queue.append(4) # right side: [1, 2, 3, 4]
queue.appendleft(0) # left side: [0, 1, 2, 3, 4]
queue.pop() # 4, removed from right
queue.popleft() # 0, removed from left
A list has fast appends and pops at the right end and slow ones at the left, because removing from the front shifts every other element. A deque (pronounced “deck”) has O(1) operations on both ends.
Use it when:
- You need a FIFO queue (append + popleft).
- You need a LIFO stack, although a list works fine for this.
- You need a sliding window with a fixed maximum length:
recent: deque[float] = deque(maxlen=100)
for value in stream:  # stream: any iterable of floats
    recent.append(value)  # once full, each append evicts the oldest value
    if len(recent) == 100:
        moving_avg: float = sum(recent) / 100
maxlen is the underrated feature. The deque silently drops the oldest entry when it would exceed the limit. Perfect for moving averages, rolling logs, “last N events.”
If you find yourself doing lst.pop(0) or lst.insert(0, x) on a long list, switch to a deque.
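Here is a small sketch of the FIFO pattern with popleft, followed by maxlen in isolation. The job names are hypothetical placeholders.

```python
from collections import deque

# Hypothetical job names waiting to be processed in arrival order.
pending: deque[str] = deque(["resize", "upload", "notify"])
processed: list[str] = []

while pending:
    job = pending.popleft()        # O(1), unlike list.pop(0)
    processed.append(job)
    if job == "upload":
        pending.append("verify")   # new work joins the back of the queue

print(processed)  # ['resize', 'upload', 'notify', 'verify']

# maxlen keeps only the most recent items.
last_three: deque[int] = deque(maxlen=3)
for n in range(6):
    last_three.append(n)
print(list(last_three))  # [3, 4, 5]
```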
namedtuple: a tuple with field names
from collections import namedtuple
Point = namedtuple("Point", ["x", "y"])
p: Point = Point(3, 4)
p.x # 3
p.y # 4
p[0] # 3 — also works as a tuple
x, y = p # tuple unpacking still works
It’s a tuple — immutable, hashable, comparable, unpackable — but with named attribute access. namedtuple was the answer for years to “I want a small read-only struct.”
In modern Python (3.7+), the answer is usually @dataclass(frozen=True) instead, unless you specifically need the tuple-like behavior:
- Unpacking by position
- == comparison by tuple value
- Working with code that expects a sequence (a database row, an unpacking-based API)
For everything else, dataclass is more flexible. There’s also typing.NamedTuple for the same idea with type hints baked into the class syntax:
from typing import NamedTuple
class Point(NamedTuple):
    x: float
    y: float
Same runtime behavior as collections.namedtuple, nicer to write.
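A brief sketch of what the class syntax gives you in practice. The default value on y is an addition for illustration; _replace and _asdict are the standard named-tuple helpers.

```python
from typing import NamedTuple

class Point(NamedTuple):
    x: float
    y: float = 0.0   # fields with defaults must come after fields without

p = Point(3.0)
print(p)                # Point(x=3.0, y=0.0)
print(p == (3.0, 0.0))  # True, it compares as a plain tuple

# Instances are immutable; _replace returns a modified copy.
moved = p._replace(y=4.0)
print(moved)            # Point(x=3.0, y=4.0)
print(moved._asdict())  # {'x': 3.0, 'y': 4.0}
```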
OrderedDict: mostly redundant since 3.7
Before Python 3.7, dict didn’t guarantee insertion order. OrderedDict did. Since 3.7, regular dict preserves insertion order as a language guarantee. So OrderedDict is mostly historical.
It still has two unique features:
- move_to_end(key) and move_to_end(key, last=False) for repositioning entries.
- == comparison considers order. OrderedDict([("a", 1), ("b", 2)]) != OrderedDict([("b", 2), ("a", 1)]), but with regular dicts they’d be equal.
For 99% of code: just use dict. OrderedDict exists when you need those two features.
@dataclass: the modern struct
This is the big one. @dataclass was added in Python 3.7 (PEP 557) and replaces a huge swathe of boilerplate.
Before:
class Order:
    def __init__(self, id: int, customer: str, total: float, paid: bool = False) -> None:
        self.id = id
        self.customer = customer
        self.total = total
        self.paid = paid
    def __repr__(self) -> str:
        return f"Order(id={self.id!r}, customer={self.customer!r}, total={self.total!r}, paid={self.paid!r})"
    def __eq__(self, other: object) -> bool:
        if not isinstance(other, Order):
            return NotImplemented
        return (self.id, self.customer, self.total, self.paid) == \
               (other.id, other.customer, other.total, other.paid)
After:
from dataclasses import dataclass
@dataclass
class Order:
    id: int
    customer: str
    total: float
    paid: bool = False
The decorator inspects the class annotations and generates __init__, __repr__, and __eq__ automatically. You write what’s actually unique about the class — the fields, their types, their defaults — and skip the rest.
o = Order(id=1, customer="Marco", total=49.99)
print(o) # Order(id=1, customer='Marco', total=49.99, paid=False)
o.paid = True
o == Order(1, "Marco", 49.99, True) # True
The variants worth knowing
frozen=True — immutable, hashable. Use for value objects, dictionary keys, anything that shouldn’t mutate.
@dataclass(frozen=True)
class Coord:
    x: float
    y: float
c = Coord(1.0, 2.0)
c.x = 5.0 # FrozenInstanceError
points: set[Coord] = {Coord(1, 2), Coord(3, 4)} # works because frozen is hashable
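Since a frozen instance cannot be changed in place, the usual way to get a "modified" value is dataclasses.replace, which builds a new instance. A minimal sketch, reusing the Coord class from above:

```python
from dataclasses import FrozenInstanceError, dataclass, replace

@dataclass(frozen=True)
class Coord:
    x: float
    y: float

c = Coord(1.0, 2.0)

try:
    c.x = 5.0
except FrozenInstanceError:
    print("cannot assign to a frozen instance")

# replace() builds a new instance with selected fields changed.
moved = replace(c, x=5.0)
print(moved)  # Coord(x=5.0, y=2.0)
print(c)      # Coord(x=1.0, y=2.0)

# Equal field values hash equally, so frozen instances work as dict keys.
labels = {Coord(1.0, 2.0): "start"}
print(labels[c])  # start
```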
slots=True (3.10+) — generates __slots__, saving memory and slightly speeding attribute access. Useful when you have millions of instances. The trade-off is no dynamic attribute assignment outside the declared fields.
@dataclass(slots=True)
class Tick:
    timestamp: float
    price: float
    volume: int
kw_only=True (3.10+) — forces all fields to be keyword-only at construction. Good when a class has many fields and positional construction would be unreadable.
@dataclass(kw_only=True)
class HttpRequest:
    url: str
    method: str = "GET"
    headers: dict[str, str] | None = None
    timeout: float = 30.0
# Must call as: HttpRequest(url="...", method="POST")
You can also mark individual fields kw-only with the KW_ONLY sentinel — see the docs when you need it.
Mutable defaults: the one trap
Don’t do this:
@dataclass
class Shopping:
    items: list[str] = [] # ValueError on class definition
Python catches this for you, because every instance would share the same list. Use field(default_factory=list):
from dataclasses import dataclass, field
@dataclass
class Shopping:
    items: list[str] = field(default_factory=list)
    notes: dict[str, str] = field(default_factory=dict)
The factory is called fresh for each new instance.
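To see the bug that the ValueError protects you from, compare a plain class that shares one list with a dataclass that uses default_factory. SharedBasket is a hypothetical class written only to demonstrate the sharing.

```python
from dataclasses import dataclass, field

class SharedBasket:
    # Plain class attribute: one list object shared by every instance.
    items: list[str] = []

a, b = SharedBasket(), SharedBasket()
a.items.append("milk")
print(b.items)  # ['milk']

@dataclass
class Shopping:
    items: list[str] = field(default_factory=list)

s1, s2 = Shopping(), Shopping()
s1.items.append("milk")
print(s2.items)              # []
print(s1.items is s2.items)  # False
```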
Pydantic vs dataclass
A common question. Both let you declare fields with types. The difference:
- @dataclass does no validation. If you say id: int, you’ll get whatever the caller passes. Pass "5" and you get a string with a type hint that lies.
- Pydantic validates and coerces. Pass "5" to a Pydantic model expecting int, and you get 5. Pass "hello" and you get a ValidationError.
Use Pydantic when:
- You’re parsing untrusted input (HTTP body, JSON file, config file).
- You’re at an API boundary and want guarantees.
- You want JSON schema generation, serialization, validators on fields.
Use @dataclass when:
- You’re modeling internal state and your own code is the only producer.
- You don’t want a third-party dependency.
- You don’t need validation; type hints are documentation here, not enforcement.
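The "no validation" point is easy to demonstrate with the standard library alone. In the sketch below, CheckedOrder is a hypothetical class showing one hand-written check in __post_init__. It is a stopgap for a single field, not a substitute for what Pydantic does.

```python
from dataclasses import dataclass

@dataclass
class Order:
    id: int
    customer: str

# The annotation is not enforced: a string slips through unchanged.
o = Order(id="5", customer="Marco")
print(type(o.id).__name__)  # str

@dataclass
class CheckedOrder:
    id: int
    customer: str

    def __post_init__(self) -> None:
        # Hand-written check; runs right after the generated __init__.
        if not isinstance(self.id, int):
            raise TypeError(f"id must be int, got {type(self.id).__name__}")

error_message = ""
try:
    CheckedOrder(id="5", customer="Marco")
except TypeError as exc:
    error_message = str(exc)
    print(error_message)  # id must be int, got str
```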
In a typical service: Pydantic for the request and response models, dataclasses for everything internal. attrs is the older third-party library that inspired @dataclass. It is still excellent and still in wide use, but if you’re starting fresh today the choice is @dataclass (stdlib) or Pydantic (validation).
Real-world example: log summary model
from dataclasses import dataclass, field
from datetime import datetime, timezone
from collections import Counter, defaultdict
@dataclass(frozen=True, slots=True)
class LogEntry:
    timestamp: datetime
    level: str
    service: str
    message: str
@dataclass
class LogSummary:
    period_start: datetime
    period_end: datetime
    total: int = 0
    by_level: Counter[str] = field(default_factory=Counter)
    by_service: defaultdict[str, list[str]] = field(
        default_factory=lambda: defaultdict(list)
    )
def summarize(entries: list[LogEntry]) -> LogSummary:
    if not entries:
        now: datetime = datetime.now(tz=timezone.utc)
        return LogSummary(period_start=now, period_end=now)
    summary = LogSummary(
        period_start=min(e.timestamp for e in entries),
        period_end=max(e.timestamp for e in entries),
        total=len(entries),
    )
    for e in entries:
        summary.by_level[e.level] += 1
        if e.level == "ERROR":
            summary.by_service[e.service].append(e.message)
    return summary
LogEntry is frozen and slotted because there are millions of them and they don’t change. LogSummary is mutable because we build it up. Counter for level frequencies. defaultdict(list) for grouping error messages by service. Type hints throughout, no boilerplate, the structure of the data is visible at a glance.
That’s the goal: less code, more meaning per line.
This concludes Module 2 — Standard library mastery. Module 3 picks up with iterators, generators, and the itertools toolkit that makes Python code feel less like a sequence of for-loops and more like a pipeline.
References: collections — Container datatypes, dataclasses — Data Classes, typing.NamedTuple, PEP 557 — Data Classes. Retrieved 2026-05-01.