The unit tests you write are the cases you thought of. The bugs in production are the cases you didn’t. Property-based testing closes that gap by letting the test framework generate the cases for you — hundreds of them per test run, biased toward the inputs most likely to break things.
In Python, the property-based testing library is hypothesis. It’s been mature for years, ships clean APIs, and integrates seamlessly with pytest. This lesson is the working knowledge: how to think about properties, how to write them, the patterns that find real bugs, and where property-based testing isn’t worth the cost.
Example-based versus property-based
A traditional pytest test is example-based: you pick concrete inputs and assert concrete outputs.
def test_round_price():
    assert round_price(1.005) == 1.00
    assert round_price(1.015) == 1.02
This works for the cases you wrote down. It says nothing about 1.005000000001, or -0.005, or 0.0, or float("nan"), unless you also wrote those down.
A property-based test asserts a property — something that should hold for all valid inputs — and lets the framework probe at it:
from hypothesis import given, strategies as st

@given(st.floats(min_value=0, max_value=10_000, allow_nan=False))
def test_round_price_is_idempotent(amount: float) -> None:
    once = round_price(amount)
    twice = round_price(once)
    assert once == twice
Hypothesis runs this with a hundred different floats by default, biased toward edge cases (0.0, very small numbers, numbers with awkward binary representations). If any input fails the property, hypothesis tells you which one — and shrinks it to the smallest counterexample it can find.
The shrinking is the magic. If a giant random float fails, hypothesis doesn’t just say “it failed with 0.7281928281828”; it whittles the input down until it finds the simplest input that still fails. Often you get back something like 1e-300 or 0.0, and the bug is suddenly obvious.
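To watch shrinking in isolation, here is a minimal sketch using hypothesis.find, which searches a strategy for a value that satisfies a predicate and returns the simplest such value it can shrink to. The predicate breaks_the_code is a made-up stand-in for "the test fails on this input", not something from a real codebase.

```python
from hypothesis import find, strategies as st


# Hypothetical failure condition: pretend the code under test breaks
# for any integer above 100.
def breaks_the_code(n: int) -> bool:
    return n > 100


# find() draws integers until one satisfies the predicate, then shrinks
# that value toward the simplest integer that still satisfies it.
smallest = find(st.integers(), breaks_the_code)
print(smallest)
```

The first hit is usually some large arbitrary integer. What comes back after shrinking should sit right at the boundary (101 here), which points straight at the threshold rather than at a random large number.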
The basic moves
Install and import:
pip install hypothesis
from hypothesis import given, strategies as st
A strategy is a description of an input space. Hypothesis comes with strategies for every common Python type:
st.integers() # any int
st.integers(min_value=0) # non-negative
st.floats(allow_nan=False, allow_infinity=False) # finite floats only
st.text() # any unicode string
st.text(alphabet="abc", max_size=5) # restricted
st.lists(st.integers()) # list of ints
st.lists(st.integers(), min_size=1) # non-empty
st.dictionaries(st.text(), st.integers())
st.dates() # datetime.date
st.datetimes(timezones=st.timezones())
st.from_regex(r"^\d{3}-\d{4}$", fullmatch=True)
Combining strategies is just composition:
st.tuples(st.text(), st.integers())
st.lists(st.tuples(st.text(), st.integers()), max_size=10)
st.one_of(st.integers(), st.text())
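A composed strategy plugs into @given like any built-in one. As a rough sketch, the test below feeds lists of (name, quantity) pairs into dict() and checks what happens to duplicate names. The variable and test names are illustrative choices, not part of hypothesis.

```python
from hypothesis import given, strategies as st

# Each generated value is a list of up to ten (name, quantity) pairs.
stock_entries = st.lists(st.tuples(st.text(), st.integers()), max_size=10)


@given(stock_entries)
def test_dict_collapses_duplicate_names(entries: list[tuple[str, int]]) -> None:
    merged = dict(entries)
    # Duplicate names collapse, so the dict is the same size or smaller.
    assert len(merged) <= len(entries)
    # No name is invented or lost.
    assert set(merged) == {name for name, _ in entries}
    # When a name repeats, the last pair in the list wins.
    if entries:
        last_name, last_quantity = entries[-1]
        assert merged[last_name] == last_quantity
```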
For your own types, @st.composite builds a strategy out of others:
from dataclasses import dataclass

from hypothesis import given, strategies as st

@dataclass(frozen=True)
class Order:
    id: int
    amount: float
    customer: str

@st.composite
def orders(draw: st.DrawFn) -> Order:
    return Order(
        id=draw(st.integers(min_value=1)),
        amount=draw(st.floats(min_value=0, allow_nan=False, allow_infinity=False)),
        customer=draw(st.text(min_size=1, max_size=50)),
    )

@given(orders())
def test_order_amount_non_negative(order: Order) -> None:
    assert order.amount >= 0
Once you have a strategy for your domain types, every test that needs them is a one-liner.
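To make the "one-liner" claim concrete, here is a sketch that reuses an orders() strategy inside st.lists. The batch_total function is a hypothetical function under test, and the max_value cap on amount is an assumption added so that sums stay finite.

```python
from dataclasses import dataclass

from hypothesis import given, strategies as st


@dataclass(frozen=True)
class Order:
    id: int
    amount: float
    customer: str


@st.composite
def orders(draw: st.DrawFn) -> Order:
    return Order(
        id=draw(st.integers(min_value=1)),
        # Assumption for this sketch: cap the amount so batch sums cannot overflow.
        amount=draw(st.floats(min_value=0, max_value=1_000_000, allow_nan=False)),
        customer=draw(st.text(min_size=1, max_size=50)),
    )


# Hypothetical function under test.
def batch_total(batch: list[Order]) -> float:
    return sum(order.amount for order in batch)


# The domain strategy composes like any other: a list of orders is one call.
@given(st.lists(orders(), min_size=1, max_size=20))
def test_batch_total_at_least_largest_order(batch: list[Order]) -> None:
    assert batch_total(batch) >= max(order.amount for order in batch)
```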
Properties worth testing
The skill of property-based testing is identifying properties. A few patterns that come up everywhere:
Round-trip
If your code encodes and decodes, encoding then decoding should give you back the original:
import json

@given(st.dictionaries(st.text(), st.integers()))
def test_json_round_trip(data: dict[str, int]) -> None:
    assert json.loads(json.dumps(data)) == data
This catches a surprising number of bugs in custom serializers — anything to do with quoting, encoding, special characters, empty containers.
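The same shape works for a hand-written codec. As an illustration only, here is a toy run-length encoder with its round-trip property. The rle_encode and rle_decode functions are invented for this sketch and stand in for whatever serializer you actually maintain.

```python
from itertools import groupby

from hypothesis import given, strategies as st


def rle_encode(text: str) -> list[tuple[str, int]]:
    # Collapse each run of identical characters into (character, run length).
    return [(char, len(list(run))) for char, run in groupby(text)]


def rle_decode(pairs: list[tuple[str, int]]) -> str:
    # Expand each (character, run length) pair back into the original run.
    return "".join(char * count for char, count in pairs)


@given(st.text())
def test_rle_round_trip(text: str) -> None:
    assert rle_decode(rle_encode(text)) == text
```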
Idempotence
Doing it twice equals doing it once:
@given(st.text())
def test_strip_is_idempotent(s: str) -> None:
    assert s.strip().strip() == s.strip()

@given(st.lists(st.integers()))
def test_sort_is_idempotent(xs: list[int]) -> None:
    assert sorted(sorted(xs)) == sorted(xs)
Idempotence is the natural property of any “normalising” function: rounding, sorting, deduplicating, canonicalising URLs, lowercasing.
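For a normaliser you wrote yourself, the test has the same two-line shape. A small sketch, assuming a hypothetical order-preserving dedupe helper:

```python
from hypothesis import given, strategies as st


def dedupe(items: list[int]) -> list[int]:
    # dict preserves insertion order, so the first occurrence of each value is kept.
    return list(dict.fromkeys(items))


@given(st.lists(st.integers()))
def test_dedupe_is_idempotent(items: list[int]) -> None:
    once = dedupe(items)
    # Deduplicating an already deduplicated list must change nothing.
    assert dedupe(once) == once
```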
Commutativity (where it should hold)
Some operations should give the same answer regardless of order:
@given(st.lists(st.integers()), st.lists(st.integers()))
def test_set_union_is_commutative(a: list[int], b: list[int]) -> None:
    assert set(a) | set(b) == set(b) | set(a)
Monotonicity
Sorted output is non-decreasing. Adding to a counter never decreases it. The output of an “average” never exceeds the maximum input:
@given(st.lists(st.integers(), min_size=1))
def test_sort_is_non_decreasing(xs: list[int]) -> None:
    s = sorted(xs)
    assert all(a <= b for a, b in zip(s, s[1:], strict=False))
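The "average never exceeds the maximum" claim can be written down the same way. A sketch using statistics.mean from the standard library. The bounds on the integers are an assumption made here to keep the values well inside exact float range.

```python
from statistics import mean

from hypothesis import given, strategies as st

# Assumption for this sketch: bounded integers, so min and max are exact as floats.
bounded_ints = st.integers(min_value=-1_000_000, max_value=1_000_000)


@given(st.lists(bounded_ints, min_size=1))
def test_mean_stays_within_input_range(xs: list[int]) -> None:
    # An average can never fall outside the range of the values it averages.
    assert min(xs) <= mean(xs) <= max(xs)
```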
Invariants under transformation
Length doesn’t change after a permutation. Total doesn’t change after rounding errors are summed back in. The set of customer IDs is preserved across an ETL step.
@given(st.lists(st.integers()))
def test_reverse_preserves_length(xs: list[int]) -> None:
    assert len(list(reversed(xs))) == len(xs)
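The "total doesn't change" invariant is the one I reach for most with money. Here is a sketch of it, assuming a hypothetical split_cents helper that divides an integer amount of cents into near-equal shares:

```python
from hypothesis import given, strategies as st


def split_cents(total_cents: int, parts: int) -> list[int]:
    base, remainder = divmod(total_cents, parts)
    # The first `remainder` shares absorb the leftover cents, one cent each.
    return [base + 1 if i < remainder else base for i in range(parts)]


@given(
    st.integers(min_value=0, max_value=10**9),
    st.integers(min_value=1, max_value=100),
)
def test_split_preserves_total(total_cents: int, parts: int) -> None:
    shares = split_cents(total_cents, parts)
    assert len(shares) == parts
    # Invariant: no cent is created or lost by the split.
    assert sum(shares) == total_cents
    # The shares differ by at most one cent.
    assert max(shares) - min(shares) <= 1
```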
A worked example: testing a price-rounding function
Here’s a function and the properties I’d write for it:
from decimal import Decimal, ROUND_HALF_EVEN

def round_price(amount: float) -> float:
    """Round to the nearest cent, banker's rounding."""
    return float(
        Decimal(str(amount)).quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN)
    )
The properties:
from hypothesis import given, strategies as st

prices = st.floats(min_value=0, max_value=1_000_000, allow_nan=False, allow_infinity=False)

@given(prices)
def test_round_price_is_idempotent(amount: float) -> None:
    once = round_price(amount)
    assert round_price(once) == once

@given(prices)
def test_round_price_close_to_input(amount: float) -> None:
    assert abs(round_price(amount) - amount) <= 0.005 + 1e-9

@given(prices)
def test_round_price_two_decimals(amount: float) -> None:
    rounded = round_price(amount)
    # Compare with a tolerance: rounded * 100 is often not an exact integer in
    # binary floating point (0.07 * 100 evaluates to 7.000000000000001).
    assert abs(rounded * 100 - round(rounded * 100)) < 1e-6
Three properties, hundreds of generated cases on every run, and any of them failing tells you something specific is wrong. When I first ran this kind of suite on a real money-handling codebase, hypothesis found a case where 1e-308 didn’t round to zero because of a precision oddity in the Decimal conversion. Not a case I would have written by hand.
Stateful testing
Some bugs only show up in sequences of operations: the first call works, the second one corrupts state. Hypothesis handles this with RuleBasedStateMachine:
from hypothesis import strategies as st
from hypothesis.stateful import RuleBasedStateMachine, rule, invariant

# Cart is the class under test; import it from your own code.
class CartMachine(RuleBasedStateMachine):
    def __init__(self) -> None:
        super().__init__()
        self.cart = Cart()
        self.expected_total = 0.0

    @rule(amount=st.floats(min_value=0, max_value=100))
    def add_item(self, amount: float) -> None:
        self.cart.add(amount)
        self.expected_total += amount

    @rule()
    def remove_last(self) -> None:
        if self.cart.items:
            removed = self.cart.remove_last()
            self.expected_total -= removed

    @invariant()
    def total_matches(self) -> None:
        assert abs(self.cart.total() - self.expected_total) < 1e-6

# Expose the machine as a unittest-style test case so pytest collects it.
TestCart = CartMachine.TestCase
Hypothesis generates random sequences of add_item and remove_last calls and checks the invariant after each one. If the cart’s total() ever drifts from your bookkeeping, hypothesis shrinks the sequence to the shortest reproduction. This catches state-machine bugs that example tests can’t reach.
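For something you can paste and run, here is a self-contained variant. The Cart class is a hypothetical stand-in that stores integer cents (so the invariant can use exact equality), and the guard on remove_last is expressed with hypothesis's precondition decorator instead of an if statement. Treat it as a sketch of the pattern, not a model of any real cart.

```python
from hypothesis import settings, strategies as st
from hypothesis.stateful import (
    RuleBasedStateMachine,
    invariant,
    precondition,
    rule,
    run_state_machine_as_test,
)


class Cart:
    """Hypothetical cart that stores item prices as integer cents."""

    def __init__(self) -> None:
        self.items: list[int] = []

    def add(self, cents: int) -> None:
        self.items.append(cents)

    def remove_last(self) -> int:
        return self.items.pop()

    def total(self) -> int:
        return sum(self.items)


class CartMachine(RuleBasedStateMachine):
    def __init__(self) -> None:
        super().__init__()
        self.cart = Cart()
        self.expected_total = 0  # independent bookkeeping to compare against

    @rule(cents=st.integers(min_value=0, max_value=10_000))
    def add_item(self, cents: int) -> None:
        self.cart.add(cents)
        self.expected_total += cents

    # Only offer this rule when there is something to remove.
    @precondition(lambda self: len(self.cart.items) > 0)
    @rule()
    def remove_last(self) -> None:
        self.expected_total -= self.cart.remove_last()

    @invariant()
    def total_matches(self) -> None:
        assert self.cart.total() == self.expected_total


# pytest collects this class; the machine can also be run directly.
TestCart = CartMachine.TestCase

if __name__ == "__main__":
    run_state_machine_as_test(
        CartMachine, settings=settings(max_examples=25, deadline=None)
    )
```

Using a precondition rather than an early return means hypothesis never spends a step on a no-op remove, which tends to make the generated sequences more useful.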
The cost, and how to control it
Property-based tests are slower than example tests. A test with a hundred runs takes more wall-time than a test with one input. For most codebases this is fine; the suite still finishes in seconds. For tests that hit a database, a network, or anything expensive, you tune with @settings:
from hypothesis import given, settings, strategies as st

@settings(max_examples=20, deadline=None)
@given(st.text())
def test_slow_thing(s: str) -> None:
    ...
max_examples reduces the number of generated cases. deadline=None disables hypothesis’s per-example time limit (200 ms by default), which otherwise trips on tests whose run time varies. There’s also @settings(database=None) to disable the local database that hypothesis uses to remember failing cases between runs: useful in CI containers, annoying in local development.
A pattern I use: a slow profile for CI that runs max_examples=500, and a default profile for local development that uses the standard hundred. The CI run is more thorough; the local run is fast enough to keep me iterating.
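A sketch of how that two-profile setup might look, typically placed in conftest.py. The profile names and the HYPOTHESIS_PROFILE environment variable are choices made for this example. The variable is read by hand here, not by hypothesis itself.

```python
import os

from hypothesis import settings

# Profile names are arbitrary labels chosen for this sketch.
settings.register_profile("ci", max_examples=500)
settings.register_profile("dev", max_examples=100)

# Assumed convention: CI exports HYPOTHESIS_PROFILE=ci; everyone else gets "dev".
settings.load_profile(os.getenv("HYPOTHESIS_PROFILE", "dev"))
```

With the pytest plugin you can also pick a registered profile on the command line with --hypothesis-profile=ci, which saves wiring up an environment variable at all.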
When to skip
Property-based testing isn’t always the right tool.
- Pure UI code. “What property should this React component have?” Usually none worth automating.
- Code with strong network or randomness dependencies. If the function’s output depends on the response from a third-party API, hypothesis can’t generate that.
- One-off scripts. The investment doesn’t pay back.
- When the property is harder to express than the implementation. If you’d write a parallel implementation just to assert the property, you’ve gained nothing.
The sweet spot: pure functions over rich input spaces. Parsers, serializers, data transformations, financial calculations, sorting algorithms, anything that operates on user-shaped data.
Real bugs property tests catch
A short list of bugs I’ve personally caught with hypothesis, none of which my example tests had:
- A CSV writer that broke on values containing both a comma and a quote.
- A timestamp parser that mishandled the boundary between standard and daylight saving time.
- A leap-second-related bug in a duration calculator.
- A sort routine that was unstable on None values where I’d assumed None would never appear.
- An integer overflow in a percentage calculation when the input was a 64-bit integer near 2^63.
- A Unicode normalization mismatch that caused two strings that “looked the same” to compare unequal.
- A dict merger that lost keys when the same key appeared with different cases in different inputs.
All boring. All shippable as production bugs. All caught by a property that took five minutes to write.
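To give a flavour of how small these properties are, here is a sketch in the spirit of the None-in-a-sort bug. The sort_with_nones_last function is invented for illustration and is not the routine from that codebase. The point is that mixing st.none() into the strategy is all it takes to stop assuming None never appears.

```python
from collections import Counter
from typing import Optional

from hypothesis import given, strategies as st


def sort_with_nones_last(values: list[Optional[int]]) -> list[Optional[int]]:
    # Key is (is_none, value): real numbers sort first, None values trail.
    return sorted(values, key=lambda v: (v is None, 0 if v is None else v))


@given(st.lists(st.one_of(st.none(), st.integers())))
def test_sort_tolerates_none(values: list[Optional[int]]) -> None:
    result = sort_with_nones_last(values)
    # Nothing is dropped or duplicated.
    assert Counter(result) == Counter(values)
    # The numbers come first, in non-decreasing order, with None values after.
    numbers = [v for v in result if v is not None]
    assert numbers == sorted(numbers)
    assert result[: len(numbers)] == numbers
```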
The minimum recommended use
If you only adopt one habit from this lesson: for every pure function in your codebase that takes a recognisable input shape (a list of numbers, a string with a known alphabet, a dataclass with documented fields), write one property test. Idempotence, round-trip, monotonicity — pick whichever fits. The bar is low, the payoff is high, and the first time hypothesis hands you back a one-line input that crashes your code, you’ll be glad you did.
For documentation: hypothesis’s own docs at https://hypothesis.readthedocs.io/ are the canonical reference, with a strategies catalogue worth bookmarking.