Cryptography Foundations · Part 3 of 8·Anonymity Engineering·2026-05-02·16 min read·intermediate

Hash functions and message authentication

Cryptographic hashes from first principles: SHA-2, SHA-3, BLAKE3, what they each guarantee, why HMAC exists, and the length-extension trap that motivates careful MAC design.

A cryptographic hash function is one of the most-used primitives in modern computing. Every git commit, every TLS certificate, every digital signature, every blockchain transaction, every password-storage scheme involves one or more hashes. The properties they actually guarantee are subtle — collision resistance, preimage resistance, behavior as a pseudo-random function — and getting them confused leads to broken protocols. This module is the foundational pass: enough to read protocol specs that say "the verifier MUST compute SHA-256 over the transcript" and know what that buys you, why HMAC exists in addition to plain hashing, and where length-extension attacks still bite naive constructions.

Prerequisites

Module 2.1 — Symmetric encryption, block ciphers, AES. The keyed-permutation framing helps when contrasting with hash functions.
Module 2.2 — Stream ciphers and AEAD construction. MAC-inside-AEAD is one application; this module covers the standalone MACs that came first.

Learning objectives

By the end of this module you should be able to:

State precisely what a cryptographic hash function guarantees, and which guarantee matters for which application.
Distinguish SHA-2, SHA-3, and BLAKE3 at the design-family level without turning the module into a feature-comparison chart.
Explain why HMAC is the right way to build a keyed authenticator from a hash, and why a plain hash(key || message) is structurally unsafe.
Identify where length-extension attacks matter (Merkle-Damgard hashes used naively) and where they don't (HMAC, sponge-based hashes, properly-keyed constructions).
Recognize hash usage patterns in real protocols — certificate fingerprints, content addressing, transcript binding, key derivation — and reason about which property each pattern actually depends on.

What a cryptographic hash actually promises

A cryptographic hash function takes an arbitrary-length input and produces a fixed-length output (the digest, 256 bits for SHA-256, 512 bits for SHA-512, 256 or other lengths for SHA-3 variants). The output should be deterministic — the same input always produces the same digest — and should look pseudo-random.

Three formal properties define what "cryptographic" means:

Preimage resistance. Given a digest H, it should be computationally infeasible to find any input m such that hash(m) = H. Used when you're publishing the digest of something secret and don't want attackers to recover the secret. Approximate cost to break: 2^n operations for an n-bit hash output.

Second-preimage resistance. Given a specific input m1, it should be infeasible to find a different input m2 such that hash(m1) = hash(m2). Used when you've committed to m1's digest and don't want attackers to substitute a different m2 with the same digest. Approximate cost to break: 2^n operations.

Collision resistance. It should be infeasible to find any two distinct inputs m1 ≠ m2 such that hash(m1) = hash(m2). The attacker gets to choose both. Approximate cost to break: 2^(n/2) operations — the birthday bound. This is why SHA-256 is "256-bit hash with 128-bit collision resistance."

These three are formally distinct. Collision resistance is the strongest; preimage resistance is the weakest. A hash function that has collision resistance also (under standard assumptions) has the other two. But many uses of hash functions only need preimage resistance, and confusing them leads to over-engineering.

Two informal properties that matter just as much:

Pseudorandom function (PRF) behavior. The output should look indistinguishable from a uniformly random bitstring to anyone who hasn't computed it. This is what enables hashes to act as random-feeling identifiers, key-derivation building blocks, and transcript bindings. Cryptographic hashes approximate PRF behavior even though the formal proofs are more complex.

Domain separation. Different uses of the same hash function in the same protocol should never collide. The standard way to achieve this is to prefix every input with a context string identifying its purpose: hash("certificate-fingerprint:" || cert). A protocol that hashes both certificates and transcripts under the same hash without domain separation can sometimes confuse a digest of one for a digest of the other.

The point: "hash" is one word for a tool that does several distinct jobs. Knowing which job your protocol needs is the difference between a sound design and a paper-and-pencil break.

Three design families

Modern hash functions come from three lineages, distinguished by their internal construction.

Merkle-Damgard hashes. SHA-1, SHA-2 (SHA-224, SHA-256, SHA-384, SHA-512). Process the input as a sequence of fixed-size blocks, maintaining a chained internal state. Each block updates the state via a compression function; the final state is the digest. The construction is simple, well-studied, and has a structural quirk — length-extension — that we'll cover.

Sponge constructions. SHA-3 (Keccak family — SHA3-224, SHA3-256, SHA3-384, SHA3-512, SHAKE128, SHAKE256). Maintain a large internal state; "absorb" input blocks into the state via XOR plus permutation, then "squeeze" output blocks out. The sponge model naturally handles arbitrary-length output and resists length-extension by construction.

Tree-hash families. BLAKE3 and similar. Hash chunks of input independently, then combine the chunk hashes via a tree structure. This makes hashing parallelizable across CPU cores — large inputs hash much faster — and provides arbitrary-length output without separate constructions.

Each family has trade-offs:

SHA-2 is the workhorse. Universal hardware acceleration (SHA-NI on Intel, ARM Cryptography Extensions). Decades of cryptanalysis. Mandated by every standards body. Use it when you want "the boring secure default that everything supports."
SHA-3 is the hedge. Different math from SHA-2 (Keccak permutation vs SHA-2's compression function). If a future cryptographic break hits SHA-2, SHA-3 is the immediate fallback. Standardized by NIST in 2015. Slower in software than hardware-accelerated SHA-2 but faster than software SHA-2 on many platforms.
BLAKE3 is the performance pick. Fastest in software on multicore CPUs. Tree-based parallelism gives near-linear speedup. Not a NIST standard, which matters for compliance contexts. Specification is short and clear; implementations are simple.

For protocol design in 2026, the typical decision matrix:

Need to interoperate with existing standards. Use SHA-256 or SHA-384. They're mandatory in TLS, X.509, PKI, virtually everything.
Building a new internal protocol with no compatibility constraints. SHA-256 is a fine default; BLAKE3 is the right choice if you're hashing large amounts of data and care about throughput.
High-paranoia post-cryptanalytic hedge. Use SHA3-256 alongside or instead of SHA-256.
Avoid SHA-1 entirely. Collisions have been demonstrated; it's deprecated for cryptographic use. Existing SHA-1 uses (legacy git, some HMAC contexts) are being migrated as fast as practical.

SHA-1 is dead

A short paragraph because it keeps coming up:

SHA-1 had its first published collision in 2017 (Stevens, Bursztein, et al. — the "SHAttered" attack). The attack produced two PDF files with the same SHA-1 digest. Two years later, chosen-prefix collisions became practical: the attacker can choose what each document looks like up to a point, then craft suffixes that collide. By 2020, SHA-1 collisions were affordable on cloud GPU rentals.

SHA-1 is structurally Merkle-Damgard with a 160-bit output (so 80-bit collision resistance from the birthday bound). The known attacks brought collision cost from 2^80 to under 2^65. That gap matters: 2^80 is impractical for any adversary; 2^65 is a few weeks on a GPU farm.

What this means concretely:

TLS certificates with SHA-1 signatures: deprecated, removed from major browser trust stores by 2017.
Git: still uses SHA-1 by default for object addressing as of 2026. Git is moving to SHA-256, but the migration is slow because of the installed base. Treat git's SHA-1 as a content addressing scheme with weak collision resistance, not as a security primitive.
HMAC-SHA-1: still considered safe in some contexts because HMAC's security doesn't depend on collision resistance of the underlying hash. But it's a hedge; don't pick HMAC-SHA-1 for new designs.

The clean rule: SHA-1 is acceptable only when collisions are not part of the threat model. For everything else, SHA-256 or SHA-384.

Hashes are not MACs

A common naive construction:

authenticator = hash(key || message)

The sender computes this and transmits the message plus authenticator. The receiver, who also has the key, recomputes and verifies.

This almost works but isn't safe. The vulnerability is length-extension in Merkle-Damgard hashes (SHA-1, SHA-2). Given the digest of a message, an attacker can compute the digest of message || some_padding || extra_data without knowing the key, by treating the digest as the internal state and continuing the hash from there. The receiver verifies the authenticator on the longer message and accepts it.

Concrete example: an old web protocol used MD5(secret || params) to authenticate URL parameters. An attacker who saw params=foo and its authenticator could compute params=foo||padding||admin=true's authenticator without knowing the secret, and forge an authenticated request that elevates privileges.

This is the "hash is not a MAC" lesson. Hash functions are not designed to be authenticators in their raw form; protocols that use them as such have been broken repeatedly.

The fix is HMAC — keyed hashing with a structure that prevents length-extension by construction.

HMAC

HMAC (Hash-based Message Authentication Code) is a wrapper around any hash function that produces a keyed authenticator. The construction:

HMAC(key, message) = hash( (key XOR opad) || hash( (key XOR ipad) || message ) )

ipad (inner padding) is the byte 0x36 repeated to the hash's block size; opad (outer padding) is 0x5C repeated. The construction does an inner hash with one keyed prefix, then an outer hash with a different keyed prefix wrapping the inner result.

Why this is safe:

Length-extension is broken. The outer hash makes the inner result the content of a finished hash invocation, not the internal state. An attacker who sees the final digest can't extend it because they don't have the inner-hash internal state.
Provably secure under reasonable assumptions. The HMAC construction has formal security proofs: if the underlying hash's compression function is a pseudorandom function, HMAC is too. This means HMAC-SHA-256 is a PRF — usable as a key-derivation primitive, an authenticator, or a transcript binding.
Works with any hash. The construction is hash-agnostic. HMAC-SHA-256, HMAC-SHA-512, HMAC-SHA3-256 all work with the same code path; just swap the hash function.
Simple, fast, well-supported. Every cryptographic library has HMAC.

The cost is roughly two hash invocations per message (inner + outer). For SHA-256, that's negligible — a single core authenticates hundreds of MB/s.

HMAC has been the dominant standalone MAC for two decades. Modern AEAD constructions (GCM, Poly1305) replace it for joint encryption-and-authentication, but HMAC is still the right answer when:

You only need authentication (no encryption).
You need a key-derivation primitive (HKDF — Module 2.6).
You need a PRF for protocol design.

Length-extension attacks in detail

A Merkle-Damgard hash function processes input in blocks, maintaining a fixed-size internal state. The compression function is:

state[0] = IV   (fixed initial value)
state[i+1] = compress(state[i], block[i])
digest = state[final]

The published digest is the final state. Crucially, the digest IS the internal state at the end of the input. An attacker who has the digest and knows or guesses the input length can:

Take the digest as the new initial state.
Append any data they choose, padded correctly for the hash.
Continue hashing from there.

The result is a digest of original_input || padding || attacker_chosen_data. The attacker doesn't know the original input — but they don't need to, because the hash function's internal state is fully captured in the digest.

This is the structural weakness. SHA-256, SHA-512, SHA-1, MD5 all have it. The fixes:

HMAC. As described.
Truncate the digest. A SHA-512/256 (truncated SHA-512 to 256 bits) doesn't expose the full internal state. Length-extension is harder. SHA-2 standards include truncated variants for this reason.
Sponge / SHA-3. The sponge construction's output is squeezed from the state; the digest is not the state. Length-extension is impossible by construction.
BLAKE2 / BLAKE3. Tree-hash designs that don't fall into the Merkle-Damgard trap.

The pragmatic guidance: for new designs, use HMAC or a sponge-based hash. Don't ever use raw SHA-256 as a MAC.

Where hashes show up in real protocols

A whirlwind tour of the patterns:

Certificate fingerprints. A 32-byte SHA-256 of a certificate's DER bytes uniquely identifies the certificate (assuming collision resistance, which is fine for SHA-256). Used in pinning, transparency logs, and out-of-band cert distribution.

Content addressing. Identify a piece of content by its hash. Git does this for objects. IPFS does it for distributed file systems. Software supply chain systems (Sigstore, in-toto) do it for build artifacts. The hash is the address; if you want to verify the content you fetched is the content you asked for, recompute the hash and compare.

Password storage prelude. Passwords are not hashed with SHA-256 directly — they're hashed with a slow function (Argon2, bcrypt) specifically designed to make brute-force attacks expensive. Plain SHA-256 of passwords is too fast; an attacker with a GPU farm can try billions of guesses per second. Plain hashing is not enough for password storage.

Transcript binding. TLS 1.3, WireGuard, Noise: hash the entire handshake transcript. The Finished message includes a MAC over the transcript, ensuring both parties saw the same handshake. Any modification by a man-in-the-middle would change the transcript hash and invalidate the Finished. SHA-256 or SHA-384 (matching the cipher suite's hash) is what's used.

Digital signatures. Signing schemes typically sign the hash of the message, not the message itself. Speeds up signing (signing a 32-byte hash is fast; signing a 1 MB message is slow). Requires collision resistance: if collisions are findable, an attacker can get a signature on m1 and present it as a signature on m2 with the same hash.

Key derivation. HKDF (Module 2.6) is built on HMAC. Every key in modern protocols is derived via a hash-based KDF.

The unifying observation: hashes are everywhere, but they're not all doing the same job. A certificate fingerprint cares about second-preimage resistance. A content addressing scheme cares about collision resistance. A signature commitment cares about collision resistance. A transcript binding cares about PRF-ness. Pick the right hash and the right use of the hash for each.

Hands-on exercise

Exercise 1 — Hash the same file with multiple algorithms

# Create a 100 MB random test file
dd if=/dev/urandom of=/tmp/test.bin bs=1M count=100 2>/dev/null

echo "SHA-256:"
time openssl dgst -sha256 /tmp/test.bin

echo ""
echo "SHA3-256:"
time openssl dgst -sha3-256 /tmp/test.bin

# BLAKE3 if you have b3sum installed (cargo install blake3 / brew install b3sum)
if command -v b3sum >/dev/null; then
    echo ""
    echo "BLAKE3:"
    time b3sum /tmp/test.bin
fi

Compare the times. On a modern x86 box with SHA-NI, SHA-256 typically wins outright due to hardware acceleration. On older platforms or ARM laptops without SHA hardware, BLAKE3 often wins. SHA3-256 is consistently slower in software because Keccak isn't widely accelerated.

The output lengths are all 32 bytes (64 hex characters). The values differ across algorithms because they're different functions. Same input, different digests.

Exercise 2 — Build and verify an HMAC

"""HMAC of a message; verify; tamper; re-verify."""

import hmac
import hashlib

KEY = b"super-secret-key-32-bytes-long!!"
MESSAGE = b"transfer $100 from Alice to Bob"

# Compute HMAC-SHA-256
tag = hmac.new(KEY, MESSAGE, hashlib.sha256).digest()
print(f"HMAC-SHA-256 tag: {tag.hex()}")

# Verify (good message)
assert hmac.compare_digest(
    tag,
    hmac.new(KEY, MESSAGE, hashlib.sha256).digest()
)
print("verify good message: OK")

# Tamper with the message
TAMPERED = b"transfer $999 from Alice to Bob"
expected_tampered_tag = hmac.new(KEY, TAMPERED, hashlib.sha256).digest()
assert tag != expected_tampered_tag, "different messages must produce different tags"
print("verify tampered message: REJECTED (tag mismatch)")

# Try without the key
WRONG_KEY = b"another-32-byte-secret-key-but!!"
wrong_tag = hmac.new(WRONG_KEY, MESSAGE, hashlib.sha256).digest()
assert tag != wrong_tag, "different keys must produce different tags"
print("verify with wrong key: REJECTED")

Three things to notice:

The tag is 32 bytes (64 hex), the SHA-256 output size.
A single-byte change in the message produces a completely unrelated tag (avalanche effect).
Without the key, the attacker cannot produce a valid tag.

hmac.compare_digest is intentionally constant-time. Using == to compare tags can leak timing information — early-mismatch comparisons return faster — and has been used to extract authentication tags via timing attacks in the wild. Use compare_digest for tag verification.

Common misconceptions

"A hash proves who sent the message." A plain hash proves the message wasn't modified — but anyone can recompute it. To prove who sent the message, you need a keyed construction (HMAC, signature, AEAD with a known key). A hash without a key is integrity only, not authentication.

"Collision resistance is the only property that matters." Many protocol uses care about preimage resistance (storing digests of secrets), PRF behavior (key derivation), or domain separation (avoiding cross-context collisions). Picking the right property to depend on is part of correct protocol design.

"SHA-3 replaced SHA-2." SHA-3 was standardized in 2015 as an alternative to SHA-2, not a replacement. NIST explicitly maintained both. SHA-2 remains the dominant choice for performance and ubiquity reasons; SHA-3 is the cryptographic-diversity hedge.

"BLAKE3 is insecure because it's newer." Novelty isn't a security argument. BLAKE3 is based on BLAKE2 (2012), which was a SHA-3 competition finalist; both have substantial cryptanalysis. BLAKE3 has solid security proofs and a clean specification. The reason to prefer SHA-2/SHA-3 over BLAKE3 in compliance-bound contexts is standardization status, not security.

"sha256(key || message) is basically HMAC." It is not. The naive construction is vulnerable to length-extension attacks for Merkle-Damgard hashes (which SHA-256 is). HMAC's specific structure — inner and outer hashing with different keyed prefixes — eliminates this and provides formal security proofs that the naive construction lacks. Use HMAC or a sponge-based MAC; never the naive form.