PYTHON BUGS

teardown // under the hood
bug taxonomy, detection landscape, real failure modes

SCOPE & THESIS

Python's flexibility is its footgun surface. Late binding, duck typing, mutable defaults, the GIL, async colored functions, two string types — every convenience has a failure mode. Most of those failure modes are silent: the program runs, returns a wrong answer, and you find out at 3am from a customer.

This teardown enumerates the bug families that actually hit production Python codebases. For each, the canonical bad code, the fix, the runtime mechanism, and whether existing tools catch it. The goal is a complete map: what tools catch, what needs human or LLM review, what hides forever.

// who this is for Engineers running static analyzers and wondering why the output is mostly noise. Skill authors building bug-finding agents. Anyone who has ever debugged a def f(x=[]): and wanted to understand why.

// the categories

family | what hits production | tool coverage
runtime | IndexError, KeyError, AttributeError on None, TypeError on duck-typing | partial: types catch some, runtime catches the rest
logic | mutable defaults, late binding, off-by-one, shadowed names | mixed: mutable defaults and late binding have opt-in ruff rules (B006, B023), off-by-one does not
types | str/bytes, shallow vs deep copy, falsy gotchas | type checkers, when configured
resource | leaked files / sockets / locks / db connections | partial: context manager hints exist
concurrency | races, deadlocks, async/sync mismatches, GIL assumptions | poor: most needs semantic review
security | injection (sql/shell/eval), pickle, hardcoded creds | bandit catches the obvious; subtle injection often missed
performance | O(n²) inner loops, string concat, list-as-set, materialization | perflint exists but rarely run
smells | god classes, deep nesting, magic numbers, eval/exec abuse | complexity tools catch metrics, not semantics
// the rule of three Most "tools find your bugs" claims rely on the user enabling every ruleset, every plugin, every type-checker stricture. In practice projects run defaults. The bugs that actually slip through are ones that need opt-in tooling or cross-function semantic reasoning. The latter is where LLMs add value over linters.

RUNTIME EXCEPTIONS

The bugs that crash visibly. Less dangerous than logic bugs (you find them fast) but still ubiquitous because Python's late binding hides them from static analysis.

// the none dereference

Almost every Python attribute or subscript bug eventually traces to a None sneaking in where an object was expected. Functions that can return None but usually don't are landmines.

bad
# dict.get returns None when the key is missing and no default is given
def user_age(users, name):
    return users.get(name).age  # AttributeError if name is absent
good
def user_age(users, name):
    user = users.get(name)
    if user is None:
        return None
    return user.age
// detection pyright flags this as reportOptionalMemberAccess and mypy reports a union-attr error, provided users carries a type annotation (for example dict[str, User] for some User class). ruff's default rules do not. The checkers see it because the type stub for dict.get() returns an optional value and the attribute access ignores the None case.

// bare except

Catches everything including SystemExit, KeyboardInterrupt, and the bug you needed to see. The most cited code smell that's also a real correctness bug.

bad
try:
    do_work()
except:  # swallows ctrl-c, swallows MemoryError, swallows the real bug
    log("oops")
good
try:
    do_work()
except (ValueError, KeyError) as e:
    log("recoverable: %s", e)
// detection ruff E722 in default rules. Easy catch. Bigger issue is except Exception:, which is also wrong in many contexts but is only flagged by the opt-in BLE001 (blind-except) rule.

// the index-and-pop race

Pattern that looks defensive but isn't:

bad
if queue:
    item = queue.pop(0)  # IndexError if another thread popped first

The if queue check is not atomic with the pop. Under threads this races. Under asyncio it races too, but only if an await sits between the check and the pop.

good
try:
    item = queue.pop(0)
except IndexError:
    item = None

// off-by-one in slicing

Python's half-open intervals are intuitive once internalized but trip up people thinking in inclusive C-style ranges.

bad
# take the last N items; off by one if you forget slice semantics
last_n = items[len(items) - n - 1:]  # takes n + 1 items
good
last_n = items[-n:]  # correct for n > 0; n == 0 returns the whole list
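
The n == 0 edge case is easy to miss, because items[-0:] is the same slice as items[0:]. A minimal sketch of a guard, assuming a hypothetical helper named last_n that is not part of the original code:

```python
def last_n(items, n):
    """Return the last n items of a sequence; an empty slice when n <= 0."""
    if n <= 0:
        # items[-0:] is items[0:], i.e. the whole sequence, so guard explicitly
        return items[:0]
    return items[-n:]


print(last_n([1, 2, 3, 4, 5], 2))  # [4, 5]
print(last_n([1, 2, 3, 4, 5], 0))  # []
print(last_n([1, 2, 3], 10))       # [1, 2, 3]
```

Asking for more items than exist returns the whole sequence instead of raising, which is usually the behavior callers expect.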

LOGIC & STATE

The most uniquely Pythonic bugs live here. Mutable defaults and late binding closures are interview questions because they keep biting production code. The fact that they're known doesn't stop them.

// THE classic: mutable default arguments

Default argument values are evaluated once, when the def statement executes, not on every call. A mutable default is shared across every invocation.

bad
def add_item(item, items=[]):
    items.append(item)
    return items

add_item(1)  # [1]
add_item(2)  # [1, 2]  -- not [2]!
add_item(3)  # [1, 2, 3]
good
def add_item(item, items=None):
    if items is None:
        items = []
    items.append(item)
    return items
// detection ruff B006 in the B ruleset catches this. The rule is opt-in (not in defaults) but extremely high signal. Every project that runs ruff should have B selected. Pylint catches it as W0102 dangerous-default-value.

// late binding in closures

Closures capture variable names, not values. When the closure runs, it sees the variable's current value, not the value at capture time. This breaks loops that build callables.

bad
handlers = []
for i in range(3):
    handlers.append(lambda: i)

[h() for h in handlers]  # [2, 2, 2]  -- all see final i
good
# bind i as a default arg, evaluated at lambda definition
handlers = [lambda i=i: i for i in range(3)]
[h() for h in handlers]  # [0, 1, 2]
// detection ruff B023 catches function definitions that don't bind loop variables. Default-off. Real bug, often missed.
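
The default-argument trick works, but it changes the callable's signature. Two other ways to bind the value at creation time are functools.partial and a small factory function. The sketch below is illustrative only; show and make_handler are hypothetical names.

```python
from functools import partial


def show(i):
    return i


def make_handler(i):
    # i is a local of make_handler, so each returned closure has its own copy
    def handler():
        return i
    return handler


via_partial = [partial(show, i) for i in range(3)]
via_factory = [make_handler(i) for i in range(3)]

print([h() for h in via_partial])  # [0, 1, 2]
print([h() for h in via_factory])  # [0, 1, 2]
```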

// is vs ==

is compares identity (same object in memory). == compares value. They coincidentally agree for small ints (-5 to 256 are interned) and short strings — until they don't.

bad
# two distinct objects with equal value
a = [1, 2, 3]
b = [1, 2, 3]
a == b   # True  -- same value
a is b   # False -- different objects

# small ints (-5..256) are cached, which masks the difference
x, y = 5, 5
x is y   # True  -- implementation detail, not a guarantee
// detection ruff E711 catches == None, ruff F632 catches is with a literal. The general "wrong identity check" needs semantic review.

// truthy/falsy traps

0, 0.0, "", [], {}, set(), None, and False are all falsy. A function returning 0 as a meaningful value is indistinguishable from one returning None by a truth check.

bad
count = users.get_count()  # returns an int (possibly 0) or None
if not count:
    handle_no_users()  # fires when count == 0 (correct) and when count is None (may be wrong)
good
if count is None:
    handle_missing()
elif count == 0:
    handle_empty()

// dict ordering assumptions

Dict insertion order has been guaranteed since Python 3.7. Iteration order matches insertion order. But code written against earlier Python or ported from other languages still sometimes assumes alphabetical, hash, or arbitrary order. The fix is usually to state the order you need: iterate over sorted(d) or sorted(d.items()), or use collections.OrderedDict when ordering is part of the contract.
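
A short illustration of the difference between insertion order and an explicitly requested order. The scores dict is made-up sample data.

```python
scores = {"carol": 3, "alice": 1, "bob": 2}

insertion_order = list(scores)   # keys in the order they were inserted
alphabetical = sorted(scores)    # keys sorted explicitly
by_value = sorted(scores.items(), key=lambda kv: kv[1])

print(insertion_order)  # ['carol', 'alice', 'bob']
print(alphabetical)     # ['alice', 'bob', 'carol']
print(by_value)         # [('alice', 1), ('bob', 2), ('carol', 3)]
```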

// shadowed builtins

list, dict, id, type, filter, map, sum: assigning to any of these in module scope shadows the builtin for the rest of the file. Subsequent code that expects the builtin breaks, often far from the assignment.

bad
# at module top level
list = get_users()
# 200 lines later
ids = list(map(get_id, items))  # TypeError: 'list' object is not callable
// detection ruff A001/A002/A003 from the flake8-builtins plugin. Opt-in. Catches variables, arguments, and class attributes that shadow a builtin.

DATA & TYPE ISSUES

Python's flexibility around types makes "it works in dev" mean "it might break in prod under different input." The bugs in this family are about silent behavior changes when types differ.

// str / bytes mixing

The most-fought-over Python 3 transition. str and bytes don't compare equal, don't concatenate, don't substitute into each other. A function that accepts either silently does the wrong thing.

bad
def starts_with_x(s):
    return s[0] == "x"

starts_with_x("xyz")       # True
starts_with_x(b"xyz")      # False -- s[0] is int 120, "x" is str
good
def starts_with_x(s: str | bytes):
    if isinstance(s, bytes):
        return s.startswith(b"x")
    return s.startswith("x")

// shallow vs deep copy

copy.copy() copies the outer container. Nested mutable objects are still shared. list[:] and dict.copy() have the same behavior. copy.deepcopy() recursively copies.

bad
import copy
defaults = {"perms": ["read"]}
user_config = copy.copy(defaults)
user_config["perms"].append("write")

defaults["perms"]  # ["read", "write"]  -- defaults mutated!
good
user_config = copy.deepcopy(defaults)
user_config["perms"].append("write")
defaults["perms"]  # ["read"]

// reference aliasing

Assignment in Python binds names to objects; it does not copy. Same for function arguments. Mutation through one name shows up through every other name pointing at the same object.

bad
def scale(values, factor):
    for i in range(len(values)):
        values[i] *= factor  # mutates caller's list

original = [1, 2, 3]
scale(original, 2)
original  # [2, 4, 6]

Returning a new list instead of mutating is almost always the right answer unless mutation is the documented contract.
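
A minimal sketch of the non-mutating variant. The name scaled is a hypothetical choice, picked to signal that a new list comes back.

```python
def scaled(values, factor):
    """Return a new list; the caller's list is left untouched."""
    return [v * factor for v in values]


original = [1, 2, 3]
doubled = scaled(original, 2)

print(original)  # [1, 2, 3]
print(doubled)   # [2, 4, 6]
```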

// encoding mismatches

Reading bytes with the wrong encoding doesn't always raise. It silently produces wrong-looking text (mojibake) that propagates through your system. latin-1 is the worst offender: it maps every byte to a character, so it never fails, and multi-byte UTF-8 sequences come out as runs of accented garbage.

bad
open("data.txt").read()  # platform-dependent default encoding
good
open("data.txt", encoding="utf-8").read()
// detection ruff PLW1514 catches missing encoding on open(). Opt-in via the PL (Pylint) rule set. In Python 3.10+, set PYTHONWARNDEFAULTENCODING=1 to emit an EncodingWarning at runtime.
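
The silent failure is easy to reproduce without any files. The sketch below uses a made-up sample string and decodes UTF-8 bytes as latin-1, which succeeds and yields mojibake.

```python
raw = "café".encode("utf-8")        # b'caf\xc3\xa9': é takes two bytes in UTF-8

wrong = raw.decode("latin-1")       # no exception, each byte becomes one character
right = raw.decode("utf-8")

print(wrong)  # cafÃ©
print(right)  # café

# The damage is reversible only while you still know which wrong codec was used
repaired = wrong.encode("latin-1").decode("utf-8")
print(repaired)  # café
```

Decoding the same bytes as ASCII would raise UnicodeDecodeError, which is the preferable failure: loud and immediate.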

RESOURCE LIFECYCLE

Files, sockets, db connections, locks. Each is a finite resource. Leaks compound: small per-request leak times millions of requests equals process restart.

// the bare open()

The language does not guarantee that a file is closed when its last reference goes out of scope. CPython usually closes it promptly via reference counting, but PyPy and other implementations don't, and exceptions can leave references dangling.

bad
data = open("big.json").read()  # file may stay open until GC
good
with open("big.json") as f:
    data = f.read()  # closed when the with block exits, even on exception
// detection ruff SIM115 catches open() outside a with. pylint R1732 same. ResourceWarning is emitted at runtime if a file is GC'd while open, but it is hidden by default; run with python -X dev or -W default to see it. Easy class of bug to find statically.

// db connections without context managers

Same pattern, higher stakes. A leaked connection can block the entire pool.

bad
conn = pool.acquire()
result = conn.execute(query)
if result.empty():
    return None  # leak! conn never released on this path
conn.release()
good
with pool.acquire() as conn:
    result = conn.execute(query)
    if result.empty():
        return None  # released on exit anyway

// lock release in error paths

Acquiring a threading.Lock manually requires releasing on every code path. One exception with no finally and you have a permanently held lock.

bad
lock.acquire()
do_work()  # if this raises, lock is never released
lock.release()
good
with lock:
    do_work()

// __del__ surprises

__del__ runs when the object is garbage collected, which is not guaranteed to be when the last reference drops. Reference cycles can leave __del__ never called (pre-3.4) or called in undefined order (post-3.4). Don't put critical cleanup there.

The correct primitive for "do X when this resource is no longer needed" is a context manager, an explicit close method, or weakref.finalize.
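
A minimal sketch of the weakref.finalize option combined with an explicit close method. TempResource and the log list are hypothetical stand-ins for a real resource and its cleanup action.

```python
import weakref


class TempResource:
    def __init__(self, name, log):
        self.name = name
        # The callback must not reference self, or the object could never be
        # collected. Here it only receives the log list and a plain string.
        self._finalizer = weakref.finalize(self, log.append, f"released {name}")

    def close(self):
        # Calling a finalizer runs the callback at most once; later calls are no-ops
        self._finalizer()


log = []
resource = TempResource("scratch", log)
resource.close()
resource.close()  # safe to call twice

print(log)  # ['released scratch']
```

If close() is never called, the same callback runs when the object is collected or at interpreter exit, so the cleanup is not skipped the way a fragile __del__ can be.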

// growing caches

Module-level dicts used as caches grow without bound. They survive every garbage collection. They survive request boundaries. A long-running process slowly leaks until OOM.

bad
_cache = {}

def expensive(key):
    if key not in _cache:
        _cache[key] = compute(key)
    return _cache[key]
good
from functools import lru_cache

# bounded, evicts least-recently-used
@lru_cache(maxsize=1024)
def expensive(key):
    return compute(key)

CONCURRENCY & ASYNC

The highest-LLM-value bug family. Tools rarely catch async/sync mismatches or shared-state races. Required reading: Glyph's posts, Brett Cannon on coroutines, and the asyncio-gotchas pattern catalog.

// blocking calls in async functions

An async def function that calls time.sleep(), requests.get(), or open().read() blocks the event loop. Every other coroutine waiting for the loop is stalled. Latency spikes for unrelated requests.

bad
import time
import requests

async def fetch(url):
    time.sleep(1)           # blocks the loop
    r = requests.get(url)   # blocks the loop
    return r.text
good
import asyncio
import aiohttp

async def fetch(url):
    await asyncio.sleep(1)
    async with aiohttp.ClientSession() as sess:
        async with sess.get(url) as r:
            return await r.text()
// detection ruff check --select ASYNC catches ASYNC251 (time.sleep) and ASYNC210 (blocking HTTP calls). Default ruff misses both. This rule set should be table-stakes for any async-heavy project.

// task not awaited

asyncio.create_task() returns a task. The event loop keeps only a weak reference to it, so if you don't keep a reference yourself, the garbage collector can collect the task while it's still running and the work silently disappears mid-flight. ruff RUF006 (asyncio-dangling-task) flags the pattern.

bad
async def main():
    asyncio.create_task(background_work())  # task may be GC'd
    await other_work()
good
_background = set()
async def main():
    t = asyncio.create_task(background_work())
    _background.add(t)
    t.add_done_callback(_background.discard)
    await other_work()

// shared mutable state across awaits

The deceptive bug. Pure-Python statements between two await points are atomic relative to other coroutines (the loop can't preempt them). But the moment you await, another coroutine can run, mutate shared state, and return.

bad
balance = 100

async def transfer(amount):
    global balance
    current = balance              # read
    await log_transfer(amount)     # yields! another coroutine can run
    balance = current - amount     # write based on stale read
good
balance_lock = asyncio.Lock()

async def transfer(amount):
    global balance
    async with balance_lock:
        current = balance
        await log_transfer(amount)
        balance = current - amount
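
The lost update can be reproduced in a few lines. This is a self-contained sketch, not the original code: asyncio.sleep(0) stands in for the hypothetical log_transfer call, and the balance lives in a local dict instead of a module global.

```python
import asyncio


async def run(use_lock):
    state = {"balance": 100}
    lock = asyncio.Lock()

    async def unsafe_transfer(amount):
        current = state["balance"]        # read
        await asyncio.sleep(0)            # yields, like the logging call would
        state["balance"] = current - amount  # write based on a possibly stale read

    async def safe_transfer(amount):
        async with lock:                  # read-await-write happens as one unit
            await unsafe_transfer(amount)

    transfer = safe_transfer if use_lock else unsafe_transfer
    await asyncio.gather(transfer(10), transfer(20))
    return state["balance"]


unsafe_balance = asyncio.run(run(use_lock=False))
safe_balance = asyncio.run(run(use_lock=True))

print(unsafe_balance)  # 80: the first transfer's write was overwritten
print(safe_balance)    # 70
```

When the await is not needed between the read and the write, moving it outside that window removes the race without a lock.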

// queue.Queue vs asyncio.Queue

Two queues, same class name. queue.Queue is built for OS threads, and its get() blocks the calling thread. asyncio.Queue is built for the event loop, and its get() is awaited. Using the wrong one in async code blocks the loop or deadlocks.

bad
from queue import Queue       # thread-safe, blocking

q = Queue()
async def consumer():
    while True:
        item = q.get()           # blocks the entire event loop
        process(item)
good
from asyncio import Queue     # yields to the loop

q = Queue()
async def consumer():
    while True:
        item = await q.get()
        process(item)

// GIL assumptions

The Global Interpreter Lock makes single-bytecode operations atomic. Most operations are not single-bytecode. counter += 1 on a global compiles to a load, an add, and a store as separate bytecodes. Under threading, races appear at bytecode boundaries even with the GIL.

"Python has the GIL so I don't need locks" is the most expensive misconception in Python concurrency.

counter += 1            // what looks atomic (counter is a shared global)
      |
      v
LOAD_GLOBAL   counter   // thread A reads 5
LOAD_CONST    1         // thread B preempts, also reads 5
BINARY_OP     +=        // both compute 6
STORE_GLOBAL  counter   // both write 6 -- one increment lost
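
The instruction sequence can be inspected with the dis module, and the fix is an explicit lock around the read-modify-write. A minimal sketch, with hypothetical function names bump and bump_locked; the exact opcode names in the output vary between CPython versions.

```python
import dis
import threading

counter = 0
counter_lock = threading.Lock()


def bump():
    global counter
    counter += 1  # compiles to separate load, add, and store instructions


def bump_locked(times):
    global counter
    for _ in range(times):
        with counter_lock:  # makes the whole read-modify-write one critical section
            counter += 1


# Show that the increment is several instructions, not one
ops = [ins.opname for ins in dis.get_instructions(bump)]
print(ops)

threads = [threading.Thread(target=bump_locked, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000 with the lock held around every increment
```

Without the lock the final count may or may not come up short on a given run, which is exactly what makes this class of bug hard to catch in testing.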

SECURITY VULNS

Python's flexibility extends to unsafe primitives. eval, exec, pickle, yaml.load, and string interpolation into shell or SQL are all ways to turn user input into code execution.

// pickle from untrusted sources

pickle.loads() is equivalent to executing arbitrary Python. A pickle payload can spawn shells, exfiltrate data, install backdoors. There is no way to make it safe.

bad
data = pickle.loads(request.body)  # RCE
good
data = json.loads(request.body)
# or use a typed schema with msgpack, protobuf, etc.
// detection bandit B301 (blacklist pickle), B403 (import pickle). Both default. Reliable catch for the obvious cases. Indirect pickle use through libraries (joblib, dill) often missed.

// sql injection via f-strings

bad
cursor.execute(f"SELECT * FROM users WHERE name = '{name}'")
# name = "x'; DROP TABLE users; --"
good
# placeholder syntax depends on the driver: %s for psycopg2 and MySQL drivers, ? for sqlite3
cursor.execute("SELECT * FROM users WHERE name = %s", (name,))
// detection bandit B608 flags SQL built with string formatting, including f-strings. SQLAlchemy's text() with bound params is the right primitive. Format-string SQL is a CWE-89 classic.
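
The difference is easy to see against an in-memory SQLite database. This is a sketch with a made-up users table and a made-up hostile input; note that sqlite3 uses ? placeholders, not %s.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "admin"), ("bob", "user")],
)

hostile = "nobody' OR '1'='1"

# Vulnerable: the input becomes part of the SQL text and rewrites the WHERE clause
leaked = conn.execute(
    f"SELECT name FROM users WHERE name = '{hostile}'"
).fetchall()

# Parameterized: the input is bound as a value and only ever compared as a string
safe = conn.execute(
    "SELECT name FROM users WHERE name = ?", (hostile,)
).fetchall()

print(sorted(leaked))  # [('alice',), ('bob',)]
print(safe)            # []
conn.close()
```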

// shell injection

bad
os.system(f"convert {filename} out.png")
# filename = "x.png; rm -rf /"
good
subprocess.run(["convert", filename, "out.png"], check=True)
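
The list form is safe because no shell ever parses the string. A harmless way to check this, assuming nothing beyond the standard library, is to pass a hostile-looking argument to a child Python process that just echoes what it received.

```python
import subprocess
import sys

hostile = "x.png; echo INJECTED"

# The child prints its first argument verbatim. With a list argv and no
# shell=True, the semicolon is ordinary data, not a command separator.
child = [sys.executable, "-c", "import sys; print(sys.argv[1])", hostile]
result = subprocess.run(child, capture_output=True, text=True, check=True)

print(result.stdout.strip())  # x.png; echo INJECTED
```

A filename that begins with a dash can still be read as an option by the target program, so argument lists remove shell injection but not every input-validation concern.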

// yaml.load

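yaml.load() with the full Loader (or yaml.unsafe_load()) constructs arbitrary Python objects from the input. Yes, including running code. PyYAML 5.1 started warning when no Loader is passed and 6.0 made the argument required, but older pins and explicit Loader=yaml.Loader calls remain exploitable. yaml.safe_load() is the safe variant: it only constructs basic types.

bad
config = yaml.load(open("config.yaml"), Loader=yaml.Loader)  # can execute code
good
with open("config.yaml") as f:
    config = yaml.safe_load(f)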

// eval / exec on user input

Self-explanatory. eval(user_input) is "please run my user's code as me."
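
When the real need is to parse a literal (a list, dict, number, or string) from text, ast.literal_eval is the usual replacement. A minimal sketch; parse_literal is a hypothetical wrapper name.

```python
import ast


def parse_literal(text):
    """Parse a Python literal from text; reject anything that is not a literal."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError) as exc:
        raise ValueError(f"not a plain literal: {text!r}") from exc


print(parse_literal("[1, 2, {'a': 3}]"))  # [1, 2, {'a': 3}]

try:
    parse_literal("__import__('os').system('id')")
except ValueError as err:
    print("rejected:", err)
```

literal_eval never calls functions or resolves names, but very large or deeply nested input can still exhaust memory or the recursion limit, so untrusted input should be size-limited first.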

// hardcoded credentials

bad
API_KEY = "sk-proj-AbCd1234..."

Beyond the obvious "don't commit secrets," API keys in code leak to logs, error reports, AI assistants, every clone of the repo. git filter-repo can scrub history but only if you notice.

// detection bandit B105 / B106 / B107 catches hardcoded password literals. TruffleHog, gitleaks, and GitHub's secret scanning catch more patterns. Bare API keys still slip through if they don't match known token patterns.

PERFORMANCE PATTERNS

Most Python performance problems are algorithmic or idiomatic, not the GIL. The GIL is the convenient excuse; the actual cost is usually quadratic loops or unnecessary materialization.

// O(n²) via list membership

A list's in operator scans linearly. Doing it inside a loop over another collection is O(n × m).

bad
def find_overlap(items, blacklist):
    return [x for x in items if x in blacklist]
# O(n * m) -- m is len(blacklist)
good
def find_overlap(items, blacklist):
    bl = set(blacklist)
    return [x for x in items if x in bl]
# O(n + m)
// detection No default ruff rule rewrites x in [a, b, c] to x in {a, b, c}. ruff PLR1714 is the closest, but only flags the related x == a or x == b pattern. The cross-function case (list passed in) needs LLM-level reasoning either way.

// string concatenation in loops

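Strings are immutable, so each concatenation can allocate a new string and copy the old one: quadratic behavior masquerading as a loop. CPython sometimes extends the string in place when nothing else references it, but that is an implementation detail, not a guarantee.

bad
result = ""
for chunk in chunks:
    result += chunk  # O(n²) total in the worst case
good
result = "".join(chunks)  # O(n)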

// list materialization where a generator would do

list(map(...)) when you only iterate once. [x for x in big] when you just need to check membership. Materializing means allocating, populating, and (often) tearing down a list you didn't need.

bad
# builds a full list, calling is_match on every line, before any() sees a value
if any([is_match(line) for line in open("huge.log")]):
    handle()
good
# generator expression: any() stops at the first match
with open("huge.log") as f:
    if any(is_match(line) for line in f):
        handle()

// repeated attribute lookup in hot loops

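Python resolves attributes by name at every access, so self.processor.handler.process inside a loop repeats three lookups per iteration. Hoisting the bound method into a local can save measurable time on hot inner loops, though the gain is smaller on CPython 3.11+, and the hoist is only safe if the attributes do not change during the loop.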

bad
for item in items:
    self.processor.handler.process(item)  # 3 lookups per iter
good
process = self.processor.handler.process
for item in items:
    process(item)

// loops where pandas/numpy vectorize

The largest delta. A Python loop over a numpy array can be 100-1000x slower than the vectorized equivalent.

bad
result = []
for i in range(len(arr)):
    result.append(arr[i] * 2 + 1)
good
result = arr * 2 + 1  # vectorized; runs in C

// loading whole files when streaming would do

JSON files of moderate size, CSVs that exceed RAM, log scans. f.read() is the default but rarely the right choice for files over a few MB.
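
For line-oriented data the streaming version is usually no longer than the f.read() version. A minimal sketch, with a hypothetical count_matches helper and a throwaway temp file as sample data:

```python
import os
import tempfile


def count_matches(path, needle):
    """Count lines containing needle, holding only one line in memory at a time."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line in f:  # file objects iterate lazily, line by line
            if needle in line:
                count += 1
    return count


# Demo on a small temporary file
with tempfile.NamedTemporaryFile(
    "w", suffix=".log", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("ok\nERROR disk full\nok\nERROR timeout\n")

demo_count = count_matches(tmp.name, "ERROR")
os.remove(tmp.name)

print(demo_count)  # 2
```

Memory use stays proportional to the longest line, not to the file size.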

SMELLS & MAINTAINABILITY

Not bugs in the "wrong answer" sense, but causes of future bugs. Worth flagging in a code review but not blocking.

// god functions

A 500-line function with 20 parameters and 8 levels of nesting. Every call site has to reason about every branch. Refactoring fear is high; bug rate is higher.

// detection radon cc measures cyclomatic complexity. Functions over CC=15 are warning territory; over 30 should be split. ruff C901 (from the C90 mccabe rule set) flags functions above a configurable complexity threshold.

// magic numbers and strings

if status == 7: is a bug magnet. Six months later nobody knows what 7 means. StatusCode.PROCESSED is self-documenting and refactorable.
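
A sketch of what the named version might look like with the standard enum module. Only PROCESSED = 7 comes from the text above; the PENDING member and the is_done helper are hypothetical.

```python
from enum import IntEnum


class StatusCode(IntEnum):
    PENDING = 1      # hypothetical value, for illustration only
    PROCESSED = 7


def is_done(status):
    # IntEnum members compare equal to plain ints, so stored values still work
    return status == StatusCode.PROCESSED


print(is_done(7))                     # True
print(is_done(StatusCode.PENDING))    # False
print(StatusCode(7).name)             # PROCESSED
```

IntEnum keeps compatibility with integers already stored in databases or wire formats; a plain Enum is stricter when that compatibility is not needed.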

// deep nesting

Each if or for level roughly doubles the number of paths a reader has to hold in mind. Early returns and guard clauses flatten the nesting and make every branch explicit.

bad
def handle(req):
    if req:
        if req.user:
            if req.user.active:
                if req.action:
                    return dispatch(req)
    return None
good
def handle(req):
    if not req or not req.user or not req.user.active:
        return None
    if not req.action:
        return None
    return dispatch(req)

// duplicated logic

The same five-line block appearing in three places. The bug fix that lands in one place and misses the others. The classic case for extraction.

// eval / exec / globals() abuse

Every now and then someone uses globals() as a dispatcher or exec to build a class dynamically. The performance is terrible, the bug surface is enormous, and the alternative is almost always a dict of callables.
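
A minimal sketch of the dict-of-callables alternative. The command names and handlers are hypothetical.

```python
def start():
    return "started"


def stop():
    return "stopped"


# The mapping is an explicit allowlist: only listed names can be invoked
COMMANDS = {"start": start, "stop": stop}


def dispatch(name):
    try:
        handler = COMMANDS[name]
    except KeyError:
        raise ValueError(f"unknown command: {name!r}") from None
    return handler()


print(dispatch("start"))  # started
print(dispatch("stop"))   # stopped
```

Unlike a globals() lookup, an unexpected name cannot reach arbitrary module-level objects, and the set of valid commands is visible in one place.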

DETECTION LANDSCAPE

What runs, what catches what, where the holes are.

// tool matrix

tool | strong at | weak at | config burden
ruff | style, imports, simple bugs, perf hints; fast | cross-function semantic, opt-in rulesets often skipped | low
pyright | types, control flow, optional access | dynamic code, untyped third-party libs | low (strict mode adds work)
mypy | types, strict null safety | slow, weaker flow analysis than pyright | medium
bandit | security CWE patterns | narrow scope, lots of FPs in default config | low
vulture | dead code, unused symbols | FPs on plugins, reflection, public API | low
perflint | performance idioms | narrow rule set, rarely run in CI | low
semgrep | custom syntactic patterns; powerful | requires rules; setup investment | medium
radon / xenon | complexity metrics | metrics, not bugs; needs human interpretation | low

// what no tool catches well

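Drawing on the families above: cross-function reasoning (a list passed in and used for membership tests, a None returned several calls away), shared state mutated across await points, wrong identity checks that involve no literal, indirect pickle use through libraries such as joblib or dill, and subtle injection paths. This is where LLM-assisted review earns its keep over tool-only output.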

// the prioritization gap Default tool output is too noisy to act on. A typical ruff check --select=ALL on a real codebase produces hundreds to thousands of findings. The hard problem isn't generating candidates; it's culling, ranking, and presenting the 10-15 that matter. That gap — between linter output and actionable report — is the python-analyzer skill's reason to exist.

// recommended baseline config

For a new project running x8r-style analysis discipline, start here:

pyproject.toml
# recent ruff releases read lint settings from [tool.ruff.lint]; older ones use [tool.ruff]
[tool.ruff.lint]
select = [
  "E", "F", "W",        # pyflakes + pycodestyle
  "B",                  # flake8-bugbear (mutable defaults, etc)
  "SIM",                # simplifications
  "ASYNC",              # async pitfalls
  "PERF",               # performance idioms
  "RUF",                # ruff-specific
  "S",                  # bandit-equivalent security
  "UP",                 # pyupgrade
]

[tool.pyright]
typeCheckingMode = "strict"

[tool.mypy]
strict = true

This alone catches ~70% of what the python-analyzer skill flags in the bug families above. The skill's job is the remaining ~30% that needs cross-function reasoning, plus the prioritization layer on top of the raw output.