Hashing and Merkle Trees

Section II: Cryptographic Primitives

Army Cyber Institute

April 9, 2026

Objectives and Roadmap

  • Define a cryptographic hash function

  • Explain key properties: preimage, second‑preimage, collision resistance, and avalanche.

  • Build and verify Merkle tree commitments and inclusion proofs; discuss variants and pitfalls.

  • Understand the relevance of hashing to commit information within blockchain technology.

What Is a Cryptographic Hash Function?

  • Definition: A deterministic algorithm that maps input of any length → fixed-length digest.
  • Formally: \(h : \{0,1\}^* \rightarrow \{0,1\}^n\)
  • Key properties:
    • Deterministic: same input → same output.
    • Fast to compute.
    • Infeasible to invert (one-way).
    • Collision-resistant (no two distinct inputs share same hash).
  • Output length typically 256 or 512 bits (e.g., SHA-256).

Hash Function Properties

Property Meaning Importance
Pre-image resistance Hard to find input for given hash. Protects passwords and commitments.
Second pre-image resistance Hard to find another input with same hash. Prevents forgery.
Collision resistance Hard to find any two inputs with same hash. Ensures uniqueness and integrity.
Avalanche effect Tiny input change → major output change. Detects even 1-bit alteration.
Uniform Distribution Outputs should be uniformly distributed no structural shortcuts should exist.

Pre-Image Resistance

  • Definition: Given a hash value \(y\), it should be computationally infeasible to find any input \(x\) such that \(h(x) = y\).
  • Intuition: Hashing is a one-way street — easy to go forward, practically impossible to reverse.
  • Example:
    • \(h(\text{"password123"})\) \(\rightarrow\) ef92b778bafe771e89245b89ecbc08a44a4e166c06659911881f383d4473e94f
    • \(h(\text{"correcthorsebatterystaple"})\) \(\rightarrow\) cbe6beb26479b568e5f15b50217c6c83c0ee051dc4e522b9840d8e291d6aaf46
  • Use Cases:
    • Password storage
    • Commitment schemes
    • Blockchain mining puzzles

Second Pre-image Resistance

  • Definition: Given one input \(x_1\), it should be infeasible to find another distinct input \(x_2 \neq x_1\) such that \(h(x_1) = h(x_2)\).
  • Intuition: You can’t find a “different message” that produces the same hash.
  • Example:
    • If \(h(\text{"Alice pays Bob \$10"})\) is known,
      an attacker shouldn’t be able to find another message like “Alice pays Mallory $100” with the same digest.
  • Use Case: Prevents document or transaction substitution attacks.

Collision Resistance

  • Definition: It should be infeasible to find any two distinct inputs \(x_1, x_2\) such that \(h(x_1) = h(x_2)\).
  • Intuition: No two different messages share the same fingerprint.
  • Example:
    • MD5 and SHA-1 are broken because practical collisions have been found.
    • SHA-256 and SHA-3 are currently collision resistant.
  • Use Case:
    • Essential for digital signatures, certificates, and blockchains.

Avalanche Effect

  • A hallmark of good hash and cipher design.
  • Definition: A small change in the input causes a large, unpredictable change in the output.
  • Ensures:
    • No visible correlation between input and output.
    • Robust diffusion — even minimal edits radically alter results.
  • Demonstration:
    • Hash(“HELLO”) → 3615f80c9d293ed7...
    • Hash(“HELLo”) → c6f4b0a7af12e91b...

Uniform Distribution of Output

  • Definition: Hash outputs should be uniformly distributed across the entire output space \(\{0,1\}^n\).
  • Intuition: Every possible digest value is equally likely — no patterns or biases.
  • Consequences:
    • Prevents clustering that could reveal input structure.
    • Ensures fair random-like behavior in protocols (e.g., load balancing, mining).
  • Verification:
    • Empirically tested with randomness and correlation tests.

Math Spotlight: Birthday Paradox and Work Factors

  • Birthday paradox: Collisions become likely far sooner than intuition suggests — only after about \(2^{n/2}\) hashes for an \(n\)-bit output.
    • Analogy: only ~23 people are needed for a 50% chance two share a birthday, not 183!
  • Work factors:
    • Collision search ≈ \(2^{n/2}\)
    • Pre-image / Second pre-image search ≈ \(2^n\)
    • Hence: 128-bit collision security ⇒ 256-bit hash output.
  • Expected collisions:
    For \(k \ll 2^{n/2}\) random hashes:
    \(\displaystyle E[\text{collisions}] \approx \frac{k^2}{2^{n+1}}\)

Choose digest lengths ensuring \(2^{n/2}\) work is infeasible — e.g., SHA-256 or SHA-3.

Micro lab: Hashing

Use the following microlab to demonstrate the principles of hashing using real algorithms

SHA Family Overview

Algorithm Output (bits) Status Notes
SHA-1 160 Deprecated Collisions found
SHA-2 224–512 Secure Bitcoin uses SHA-256
SHA-3 224–512 Secure sponge construction
  • SHA-3 complements, not replaces, SHA-2.
  • Both widely deployed for integrity, digital signatures, and blockchain linking.
  • Continuous public evaluation ensures resistance to emerging attacks.

Merkle–Damgård: How SHA-256 is Built

Core idea: compress each message block one at a time, chaining the output into the next round.

md IV IV f1 compress IV->f1 B1 Block 1 B1->f1 h1 h₁ f1->h1 f2 compress h1->f2 B2 Block 2 B2->f2 h2 h₂ f2->h2 fn compress h2->fn Bn Block n ‖ length Bn->fn H Digest fn->H

  • Each block is fed into a fixed-size compression function along with the previous output; the final output is the hash.
  • A length field is appended to the last block (Merkle–Damgård strengthening) to prevent trivial collisions.
  • Used by SHA-1, SHA-224, SHA-256, SHA-384, SHA-512.

Aside: Length-Extension Attack

Root cause: in Merkle–Damgård, the digest is the final internal state — so it can be used as a starting point to keep hashing.

  1. Attacker sees server broadcast token and "amount=100", where token = SHA256(secret ∥ "amount=100") — they don’t know secret.
  2. They load token directly into SHA-256’s internal state (it’s just 8 words of 32 bits each).
  3. They feed in pad ∥ "amount=1000000" — continuing the hash as if they were the original signer.
  4. They produce token2′ = SHA256(secret ∥ "amount=100" ∥ pad ∥ "amount=1000000").
  5. The server recomputes SHA256(secret ∥ forged_body) — and it matches token2′.

Any scheme of the form H(secret ∥ data) used as a MAC is broken against length extension. This affected Flickr’s API (2009), several Amazon S3 auth schemes, and many custom-rolled API signatures.

Aside: HMAC

Mitigation: Hash-based Message Authentication Code (HMAC)

\[\text{HMAC}(k, m) = H\bigl(k \oplus \text{opad} \;\|\; H(k \oplus \text{ipad} \;\|\; m)\bigr)\]

The outer hash takes the result of the inner hash as input — an attacker extending the inner computation gets a value that is immediately re-hashed with the key, producing garbage.

Sponge Construction: How SHA-3 is Built

Core idea: absorb message blocks into a large state, then squeeze output out — the hidden interior is never exposed.

sponge M1 M₁ xor1 ⊕ rate bits M1->xor1 perm1 permute (f) xor1->perm1 xor2 ⊕ rate bits perm1->xor2 state M2 M₂ M2->xor2 perm2 permute (f) xor2->perm2 out extract rate bits perm2->out rate only D Digest out->D

  • State = rate (r) + capacity (c). Each round XORs a message block into the rate portion only, then permutes the full state.
  • The capacity bits are never output — they are the security margin. Target: c = 2 × desired security level.
  • Used by SHA-3/Keccak. Ethereum uses Keccak-256 (pre-standardization padding — not identical to SHA3-256).

Merkle–Damgård vs. Sponge Construction

Merkle–Damgård (SHA-256) Sponge (SHA-3)
Structure Iterated compression Absorb → permute → squeeze
Digest = internal state? Yes No — capacity is hidden
Length-extension vulnerable? Yes No
Used in Bitcoin / Ethereum SHA-256 / Keccak-256 Keccak-256

The sponge’s hidden capacity makes length-extension impossible by design — you can’t reconstruct the full internal state from the digest alone.

Aside: Hash Salting

  • Definition: A salt is a random, unique value added to each input before hashing.
    • Formally: \(h = \text{Hash}(\text{salt} \,\|\, \text{password})\)
    • Salt is stored in cleartext alongside the hash.
  • Purpose:
    • Defeats precomputed rainbow table attacks.
    • Ensures identical inputs produce different hashes.
    • Forces attackers to recompute hashes for each user.
  • Properties:
    • Each salt should be unique (ideally random 128 bits).
    • Salts don’t need to be secret — just unpredictable.
    • Common in password databases, key derivation, and digital commitments.
  • Example:
    • \(h_1 = \text{SHA256}(\text{"A9F2"} \,\|\, \text{"password"})\)
    • \(h_2 = \text{SHA256}(\text{"B1C4"} \,\|\, \text{"password"})\) → different result.

Hashing as Commitment

  • A document can be represented by its hash.
  • The hash is unique to its contents.
  • Example:
    • You want to prove ownership of intellectual property (document, book, etc).
    • You publicly reveal the hash of the document (ie, tweet it out)
    • At a later date, you can prove the document existed by showing the hash matches your tweet.
  • Publishing the hash “commits” you to the document without revealing it.

Merkle Trees: Motivation

  • Problem: How can we represent and verify a large collection of items with a single short value?
  • Goal: Detect any tampering and prove membership of an item without revealing or re‑sending everything.
  • Key idea: Build a tree of hashes so that one small value (the root hash) commits to all leaves.
  • Payoff:
    • Fast verification: proofs size grows like \(O(\log n)\).
    • Tamper‑evident: any change propagates up to the root.
    • Storage‑friendly: keep one root instead of all items.

What Is a Merkle Tree?

  • A Merkle tree is a hash tree:
    • Leaves: \(\ell_i = H(x_i)\) where \(x_i\) is the \(i\)‑th data block.
    • Internal node: \(h = H(\text{left} \parallel \text{right})\).
    • Root: the top hash commits to all leaves below.
  • Concatenation \(\parallel\) is order‑sensitive: \(H(a\parallel b) \ne H(b\parallel a)\) in general.
  • The root hash is a commitment to the whole dataset.

MerkleExample4 R r = H(p1 ‖ p2) p1 p1 = H(a ‖ b) p1->R p2 p2 = H(c ‖ d) p2->R a a = H(A) a->p1 b b = H(B) b->p1 c c = H(C) c->p2 d d = H(D) d->p2

Building a Merkle Tree

  1. Split your dataset into items \(x_1,\ldots,x_n\).
  2. Compute leaf hashes: \(\ell_i = H(\text{LeafTag}\parallel x_i)\).
  3. Pair adjacent hashes and compute parents: \(p_j = H(\text{NodeTag}\parallel \ell_{2j-1} \parallel \ell_{2j})\).
  4. Repeat step 3 until a single root remains.
  5. If \(n\) is odd, define a rule (e.g., duplicate the last hash or carry it up) and apply it consistently.
  • Tags (a.k.a. domain separation): include a short constant like LeafTag vs NodeTag to avoid ambiguity.

Worked Example (4 Items)

  • Let items be \(A,B,C,D\).
  • Leaves: \(a=H(\text{Leaf}\parallel A)\), \(b=H(\text{Leaf}\parallel B)\), \(c=H(\text{Leaf}\parallel C)\), \(d=H(\text{Leaf}\parallel D)\).
  • Parents: \(p_1=H(\text{Node}\parallel a\parallel b)\), \(p_2=H(\text{Node}\parallel c\parallel d)\).
  • Root: \(r=H(\text{Node}\parallel p_1\parallel p_2)\).

MerkleExample4 R r = H(Node ‖ p1 ‖ p2) p1 p1 = H(Node ‖ a ‖ b) R->p1 p2 p2 = H(Node ‖ c ‖ d) R->p2 a a = H(Leaf ‖ A) p1->a b b = H(Leaf ‖ B) p1->b c c = H(Leaf ‖ C) p2->c d d = H(Leaf ‖ D) p2->d

Merkle Proofs (Membership)

  • To prove an item \(A\) is in the tree with root \(r\):
    1. Provide \(A\) and its authentication path: the list of sibling hashes from the leaf up.
    2. The verifier recomputes: \(a=H(\text{Leaf}\parallel A)\), then combines with each sibling in the correct left/right order to reconstruct \(r\).
  • If the recomputed root equals the claimed \(r\), membership is verified.
  • Proof size: about \(\lceil \log_2 n \rceil\) hashes.

Why Proofs Are Efficient

  • For \(n\) leaves, depth is \(\approx \log_2 n\) (binary tree).
  • Proof length: \(\lceil \log_2 n \rceil\) hashes; e.g.,
    • \(n=1{,}000\) → about 10 hashes.
    • \(n=1{,}000{,}000\) → about 20 hashes.
  • Verification time: \(O(\log n)\) hash calls.
  • Storage: You may store only the root (and a policy that defines how the tree is built).

Tamper Evidence & Security Assumptions

  • Security relies on properties of the hash function \(H\):
    • Collision resistance: infeasible to find \(x\ne x'\) with \(H(x)=H(x')\).
    • Second‑preimage resistance: given \(x\), hard to find \(x'\ne x\) with same hash.
  • Ordering matters: use left/right labels and explicit concatenation.
  • Domain separation: tag leaves vs internal nodes (e.g., 0x00 for leaves, 0x01 for nodes).

Design Choices & Variants

  • Branching factor: binary vs \(k\)‑ary (trades depth for node size).
  • Sorted vs original order: some systems sort leaves for determinism; others preserve insertion order.
  • Odd number of leaves: duplicate last leaf, or promote it—pick one rule and publish it.
  • Sparse trees: massive key spaces with mostly empty leaves (useful when indices matter: concatenate node value with index or node tag prior to hashing).

Where You’ll See Them (Non‑Exhaustive)

  • Content‑addressed data systems: version control stores (e.g., commit graphs as hash‑linked structures).
  • Transparency logs: public, append‑only logs use Merkle roots to audit inclusion and history.
  • Package & software updates: metadata signed via Merkle roots to ensure integrity.
  • Filesystems & databases: integrity trees over blocks/pages for fast corruption detection.
  • Blockchain: Block headers include a Merkle root committing to all transactions in the block.

Practice: Micro Labs

Review the following interactive labs:

Summary

  • A cryptographic hash provides binding commitments with strong one‑way and collision resistance.
  • A Merkle tree is a hash‑based commitment to a whole dataset.
  • Membership proofs use sibling hashes along a path to rebuild the root.
  • Efficiency: verification cost and proof size grow logarithmically.
  • Security: depends on a good hash function and unambiguous formatting rules.

References

[1]
J. Katz and Y. Lindell, Introduction to Modern Cryptography, 3rd ed. Chapman and Hall/CRC, 2020. doi: 10.1201/9781351133036.
[2]
A. J. Menezes, P. C. van Oorschot, and S. A. Vanstone, “Handbook of Applied Cryptography.” CRC Press, Aug. 07, 2011. Accessed: Oct. 23, 2025. [Online]. Available: https://cacr.uwaterloo.ca/hac/
[3]
National Institute of Standards and Technology (US), “Secure hash standard,” National Institute of Standards and Technology (U.S.), Washington, D.C., NIST FIPS 180-4, 2015. doi: 10.6028/NIST.FIPS.180-4.
[4]
D. Boneh and V. Shoup, “Hash Functions.” 2020. Available: https://intensecrypto.org/public/lec_07_hash_functions.html
[5]
M. Stevens, E. Bursztein, P. Karpman, A. Albertini, and Y. Markov, “The first collision for full SHA-1.” Accessed: Oct. 30, 2025. [Online]. Available: https://eprint.iacr.org/2017/190
[6]
D. R. Stinson and M. B. Paterson, Cryptography: Theory and practice, Fourth edition, first issued in paperback. Boca Raton, FL: Chapman & Hall/CRC Press, 2022.
[7]
G. Leurent and T. Peyrin, SHA-1 is a Shambles.” Accessed: Oct. 30, 2025. [Online]. Available: https://sha-mbles.github.io/
[8]
CWI Amsterdam and Google Research, SHAttered.” Accessed: Oct. 30, 2025. [Online]. Available: https://shattered.io/
[9]
G. Bertoni, J. Daemen, M. Peeters, and G. Van Assche, “The Keccak Reference version 3.0.” Team Keccak, Jan. 14, 2011. Accessed: Oct. 30, 2025. [Online]. Available: https://keccak.team/files/Keccak-reference-3.0.pdf
[10]
D. Temoshok, “Digital Identity Guidelines: Authentication and Authenticator Management,” National Institute of Standards and Technology, Gaithersburg, MD, NIST SP 800-63B-4, 2025. doi: 10.6028/NIST.SP.800-63B-4.
[11]
J. Bonneau and S. E. Schechter, “Towards reliable storage of 56-bit secrets in human memory,” in USENIX Security Symposiym, Aug. 2014. Accessed: Oct. 30, 2025. [Online]. Available: https://users.umiacs.umd.edu/~tudor/courses/ENEE757/Fall15/papers/Bonneau14.pdf
[12]
S. Haber and W. S. Stornetta, “How to time-stamp a digital document,” J. Cryptology, vol. 3, no. 2, pp. 99–111, Jan. 1991, doi: 10.1007/BF00196791.
[13]
Q. H. Dang, “Recommendation for applications using approved hash algorithms,” National Institute of Standards and Technology, Gaithersburg, MD, NIST SP 800-107r1, 2012. doi: 10.6028/NIST.SP.800-107r1.
[14]
D. Bayer, S. Haber, and W. S. Stornetta, “Improving the Efficiency and Reliability of Digital Time-Stamping,” in Sequences II, R. Capocelli, A. De Santis, and U. Vaccaro, Eds., New York, NY: Springer New York, 1993, pp. 329–334. doi: 10.1007/978-1-4613-9323-8_24.
[15]
H. Massias, X. S. Avila, and J.-J. Quisquater, “Design of a Secure Timestamping Service with Minimal Trust Requirements,” in 20th Symposium on Information Theory in the Benelux, Haasrode, Belgium, May 1999. Available: https://cdn.nakamotoinstitute.org/docs/secure-timestamping-service.pdf
[16]
S. Nakamoto, “Bitcoin: A Peer-to-Peer Electronic Cash System.” Satoshi Nakamoto Institute, Oct. 31, 2008. Accessed: Sep. 12, 2025. [Online]. Available: https://cdn.nakamotoinstitute.org/docs/bitcoin.pdf
[17]
Nakamoto Institute, “Bitcoin Library (Primary Sources).” 2014. Available: https://nakamotoinstitute.org/library/bitcoin/
[18]
A. Narayanan, J. Bonneau, E. Felten, A. Miller, and S. Goldfeder, Bitcoin and Cryptocurrency Technologies. Princeton University Press, 2016.