Blockchain Course – Hashing and Merkle Trees

Objectives and Roadmap

Define a cryptographic hash function
Explain key properties: preimage, second‑preimage, collision resistance, and avalanche.
Build and verify Merkle tree commitments and inclusion proofs; discuss variants and pitfalls.
Understand the relevance of hashing to commit information within blockchain technology.

What Is a Cryptographic Hash Function?

Definition: A deterministic algorithm that maps input of any length → fixed-length digest.
Formally: $h : \{0,1\}^* \rightarrow \{0,1\}^n$
Key properties:
- Deterministic: same input → same output.
- Fast to compute.
- Infeasible to invert (one-way).
- Collision-resistant (no two distinct inputs share same hash).
Output length typically 256 or 512 bits (e.g., SHA-256).

Hash Function Properties

Property	Meaning	Importance
Pre-image resistance	Hard to find input for given hash.	Protects passwords and commitments.
Second pre-image resistance	Hard to find another input with same hash.	Prevents forgery.
Collision resistance	Hard to find any two inputs with same hash.	Ensures uniqueness and integrity.
Avalanche effect	Tiny input change → major output change.	Detects even 1-bit alteration.
Uniform Distribution	Outputs should be uniformly distributed	no structural shortcuts should exist.

Pre-Image Resistance

Definition: Given a hash value $y$, it should be computationally infeasible to find any input $x$ such that $h(x) = y$.
Intuition: Hashing is a one-way street — easy to go forward, practically impossible to reverse.
Example:
- $h(\text{"password123"})$ $\rightarrow$ ef92b778bafe771e89245b89ecbc08a44a4e166c06659911881f383d4473e94f
- $h(\text{"correcthorsebatterystaple"})$ $\rightarrow$ cbe6beb26479b568e5f15b50217c6c83c0ee051dc4e522b9840d8e291d6aaf46
Use Cases:
- Password storage
- Commitment schemes
- Blockchain mining puzzles

Second Pre-image Resistance

Definition: Given one input $x_1$, it should be infeasible to find another distinct input $x_2 \neq x_1$ such that $h(x_1) = h(x_2)$.
Intuition: You can’t find a “different message” that produces the same hash.
Example:
- If $h(\text{"Alice pays Bob \$10"})$ is known,
  an attacker shouldn’t be able to find another message like “Alice pays Mallory $100” with the same digest.
Use Case: Prevents document or transaction substitution attacks.

Collision Resistance

Definition: It should be infeasible to find any two distinct inputs $x_1, x_2$ such that $h(x_1) = h(x_2)$.
Intuition: No two different messages share the same fingerprint.
Example:
- MD5 and SHA-1 are broken because practical collisions have been found.
- SHA-256 and SHA-3 are currently collision resistant.
Use Case:
- Essential for digital signatures, certificates, and blockchains.

Avalanche Effect

A hallmark of good hash and cipher design.
Definition: A small change in the input causes a large, unpredictable change in the output.
Ensures:
- No visible correlation between input and output.
- Robust diffusion — even minimal edits radically alter results.
Demonstration:
- Hash(“HELLO”) → 3615f80c9d293ed7...
- Hash(“HELLo”) → c6f4b0a7af12e91b...

Uniform Distribution of Output

Definition: Hash outputs should be uniformly distributed across the entire output space $\{0,1\}^n$.
Intuition: Every possible digest value is equally likely — no patterns or biases.
Consequences:
- Prevents clustering that could reveal input structure.
- Ensures fair random-like behavior in protocols (e.g., load balancing, mining).
Verification:
- Empirically tested with randomness and correlation tests.

Math Spotlight: Birthday Paradox and Work Factors

Birthday paradox: Collisions become likely far sooner than intuition suggests — only after about $2^{n/2}$ hashes for an $n$-bit output.
- Analogy: only ~23 people are needed for a 50% chance two share a birthday, not 183!
Work factors:
- Collision search ≈ $2^{n/2}$
- Pre-image / Second pre-image search ≈ $2^n$
- Hence: 128-bit collision security ⇒ 256-bit hash output.
Expected collisions:
For $k \ll 2^{n/2}$ random hashes:
$\displaystyle E[\text{collisions}] \approx \frac{k^2}{2^{n+1}}$

Choose digest lengths ensuring $2^{n/2}$ work is infeasible — e.g., SHA-256 or SHA-3.

Micro lab: Hashing

Use the following microlab to demonstrate the principles of hashing using real algorithms

Overall Slide Concept This lab slide establishes a hands-on checkpoint where students can see avalanche, collisions, and output distribution instead of only hearing about them abstractly. It matters here because the lesson has now built enough theory for students to test the ideas interactively. Emphasize that the lab is about observing how secure and insecure hash designs behave differently under the same experiments.

Key Points - Use the lab to compare secure hash behavior with deliberately weak or broken constructions. - Avalanche demos make diffusion visible by comparing nearly identical inputs. - Collision and distribution demos help students connect theory to concrete attack intuition.

Walkthrough 1. Run the avalanche demo on SHA-1 and SHA-256 and compare nearly identical inputs. 2. Switch to MD5 or the built-in collision demo to show how broken or weaker designs fail. 3. Run the uniform-distribution visualizations and compare strong hashes against the homemade example.

Sources [7]; [8]; [5]

SHA Family Overview

Algorithm	Output (bits)	Status	Notes
SHA-1	160	Deprecated	Collisions found
SHA-2	224–512	Secure	Bitcoin uses SHA-256
SHA-3	224–512	Secure	sponge construction

SHA-3 complements, not replaces, SHA-2.
Both widely deployed for integrity, digital signatures, and blockchain linking.
Continuous public evaluation ensures resistance to emerging attacks.

Merkle–Damgård: How SHA-256 is Built

Core idea: compress each message block one at a time, chaining the output into the next round.

Each block is fed into a fixed-size compression function along with the previous output; the final output is the hash.
A length field is appended to the last block (Merkle–Damgård strengthening) to prevent trivial collisions.
Used by SHA-1, SHA-224, SHA-256, SHA-384, SHA-512.

Overall Slide Concept This slide establishes Merkle–Damgård as the iterative compression structure behind SHA-256 and related hashes. It matters here because students are about to see how internal construction details create both strengths and pitfalls. Emphasize that the digest is the final chaining state after processing the padded blocks.

Key Points - Merkle–Damgård hashes process fixed-size message blocks one at a time with a chaining state. - Padding and the appended length field make the final encoded message unambiguous. - Compression function means a fixed-size function that mixes one block and the current state into the next state.

Walkthrough 1. Start from the initialization vector and feed in the first message block. 2. Reuse the resulting state as input to the compression function for each later block. 3. After the final padded block, treat the last state value as the digest.

Sources None

Aside: Length-Extension Attack

Root cause: in Merkle–Damgård, the digest is the final internal state — so it can be used as a starting point to keep hashing.

Attacker sees server broadcast token and "amount=100", where token = SHA256(secret ∥ "amount=100") — they don’t know secret.
They load token directly into SHA-256’s internal state (it’s just 8 words of 32 bits each).
They feed in pad ∥ "amount=1000000" — continuing the hash as if they were the original signer.
They produce token2′ = SHA256(secret ∥ "amount=100" ∥ pad ∥ "amount=1000000").
The server recomputes SHA256(secret ∥ forged_body) — and it matches token2′.

Any scheme of the form H(secret ∥ data) used as a MAC is broken against length extension. This affected Flickr’s API (2009), several Amazon S3 auth schemes, and many custom-rolled API signatures.

Overall Slide Concept This slide establishes length extension as a structural weakness of Merkle–Damgård hashes when they are misused as message authentication codes. It matters here because students need to see that a secure hash can still be unsafe in the wrong construction. Emphasize that the problem comes from exposing the final internal state as the digest.

Key Points - If a digest reveals the final chaining state, an attacker can continue hashing without knowing the original secret prefix. - Constructions like H(secret || data) are unsafe as MACs against this attack. - A MAC is a message authentication code used to prove both integrity and shared-secret knowledge.

Walkthrough 1. The attacker starts from a known token that equals the final internal state of the original hash. 2. They continue hashing additional padded data as though they were the original sender. 3. The verifier recomputes the same extended body and accepts the forged token if the scheme is badly designed.

Sources None

Aside: HMAC

Mitigation: Hash-based Message Authentication Code (HMAC)

\[\text{HMAC}(k, m) = H\bigl(k \oplus \text{opad} \;\|\; H(k \oplus \text{ipad} \;\|\; m)\bigr)\]

The outer hash takes the result of the inner hash as input — an attacker extending the inner computation gets a value that is immediately re-hashed with the key, producing garbage.

Sponge Construction: How SHA-3 is Built

Core idea: absorb message blocks into a large state, then squeeze output out — the hidden interior is never exposed.

State = rate (r) + capacity (c). Each round XORs a message block into the rate portion only, then permutes the full state.
The capacity bits are never output — they are the security margin. Target: c = 2 × desired security level.
Used by SHA-3/Keccak. Ethereum uses Keccak-256 (pre-standardization padding — not identical to SHA3-256).

Overall Slide Concept This slide establishes the sponge construction behind SHA-3 and contrasts it with Merkle–Damgård at the state-design level. It matters here because students need to see that construction choices can eliminate some attacks by design rather than only by mitigation. Emphasize that the hidden capacity portion is the key conceptual difference.

Key Points - Sponge hashes absorb message blocks into part of a larger internal state, then squeeze output back out. - The capacity portion stays hidden and serves as a security margin. - Ethereum uses Keccak-256, which is closely related to but not identical with standardized SHA3-256.

Walkthrough 1. XOR each input block into the public rate portion of the state. 2. Permute the full state after each absorption round. 3. Extract output from the rate portion while the hidden capacity remains unexposed.

Sources None

Merkle–Damgård vs. Sponge Construction

	Merkle–Damgård (SHA-256)	Sponge (SHA-3)
Structure	Iterated compression	Absorb → permute → squeeze
Digest = internal state?	Yes	No — capacity is hidden
Length-extension vulnerable?	Yes	No
Used in Bitcoin / Ethereum	SHA-256 / Keccak-256	Keccak-256

The sponge’s hidden capacity makes length-extension impossible by design — you can’t reconstruct the full internal state from the digest alone.

Aside: Hash Salting

Definition: A salt is a random, unique value added to each input before hashing.
- Formally: $h = \text{Hash}(\text{salt} \,\|\, \text{password})$
- Salt is stored in cleartext alongside the hash.
Purpose:
- Defeats precomputed rainbow table attacks.
- Ensures identical inputs produce different hashes.
- Forces attackers to recompute hashes for each user.
Properties:
- Each salt should be unique (ideally random 128 bits).
- Salts don’t need to be secret — just unpredictable.
- Common in password databases, key derivation, and digital commitments.
Example:
- $h_1 = \text{SHA256}(\text{"A9F2"} \,\|\, \text{"password"})$
- $h_2 = \text{SHA256}(\text{"B1C4"} \,\|\, \text{"password"})$ → different result.

Hashing as Commitment

A document can be represented by its hash.
The hash is unique to its contents.
Example:
- You want to prove ownership of intellectual property (document, book, etc).
- You publicly reveal the hash of the document (ie, tweet it out)
- At a later date, you can prove the document existed by showing the hash matches your tweet.
Publishing the hash “commits” you to the document without revealing it.

Merkle Trees: Motivation

Problem: How can we represent and verify a large collection of items with a single short value?
Goal: Detect any tampering and prove membership of an item without revealing or re‑sending everything.
Key idea: Build a tree of hashes so that one small value (the root hash) commits to all leaves.
Payoff:
- Fast verification: proofs size grows like $O(\log n)$.
- Tamper‑evident: any change propagates up to the root.
- Storage‑friendly: keep one root instead of all items.

What Is a Merkle Tree?

A Merkle tree is a hash tree:
- Leaves: $\ell_i = H(x_i)$ where $x_i$ is the $i$‑th data block.
- Internal node: $h = H(\text{left} \parallel \text{right})$.
- Root: the top hash commits to all leaves below.
Concatenation $\parallel$ is order‑sensitive: $H(a\parallel b) \ne H(b\parallel a)$ in general.
The root hash is a commitment to the whole dataset.

Building a Merkle Tree

Split your dataset into items $x_1,\ldots,x_n$.
Compute leaf hashes: $\ell_i = H(\text{LeafTag}\parallel x_i)$.
Pair adjacent hashes and compute parents: $p_j = H(\text{NodeTag}\parallel \ell_{2j-1} \parallel \ell_{2j})$.
Repeat step 3 until a single root remains.
If $n$ is odd, define a rule (e.g., duplicate the last hash or carry it up) and apply it consistently.

Tags (a.k.a. domain separation): include a short constant like LeafTag vs NodeTag to avoid ambiguity.

Overall Slide Concept This slide establishes the deterministic recipe for building a Merkle tree so that independent parties arrive at the same root. It matters here because many practical bugs come from vague rules about tagging, ordering, or odd leaf counts rather than from the hash algorithm itself. Emphasize that consistency rules are part of security, not mere implementation detail.

Key Points - Leaves and internal nodes should be encoded unambiguously, often with domain-separation tags. - Odd numbers of leaves require a published rule such as duplication or promotion. - Domain separation means marking different object types differently before hashing so they cannot be confused.

Walkthrough 1. Hash each input item into a leaf value using the chosen leaf encoding rule. 2. Pair adjacent hashes and hash each pair into parent nodes using the node rule. 3. Repeat until only one root remains, applying the odd-leaf rule consistently when needed.

Sources [1]; [2]

Worked Example (4 Items)

Let items be $A,B,C,D$.
Leaves: $a=H(\text{Leaf}\parallel A)$, $b=H(\text{Leaf}\parallel B)$, $c=H(\text{Leaf}\parallel C)$, $d=H(\text{Leaf}\parallel D)$.
Parents: $p_1=H(\text{Node}\parallel a\parallel b)$, $p_2=H(\text{Node}\parallel c\parallel d)$.
Root: $r=H(\text{Node}\parallel p_1\parallel p_2)$.

Overall Slide Concept This slide establishes the mechanical process of computing a small Merkle tree by hand so students can see tamper propagation directly. It matters here because the abstract tree definition becomes much easier to trust once students watch one concrete example build upward. Emphasize how one changed leaf forces every ancestor on its path to change.

Key Points - The root is computed entirely from repeated local hash combinations. - A change to one data item changes its leaf, then its parent, and eventually the root. - Worked examples make proof verification easier because students can trace the exact path manually.

Walkthrough 1. Hash A, B, C, and D into four leaf digests. 2. Combine the leaves pairwise to compute two parent hashes. 3. Hash the two parents together to obtain the final root commitment.

Sources [1]; [2]

Merkle Proofs (Membership)

To prove an item $A$ is in the tree with root $r$:
1. Provide $A$ and its authentication path: the list of sibling hashes from the leaf up.
2. The verifier recomputes: $a=H(\text{Leaf}\parallel A)$, then combines with each sibling in the correct left/right order to reconstruct $r$.
If the recomputed root equals the claimed $r$, membership is verified.
Proof size: about $\lceil \log_2 n \rceil$ hashes.

Overall Slide Concept This slide establishes the idea of a Merkle membership proof as a compact path from one item to the published root. It matters here because this is the payoff students care about in blockchains and transparency systems: proving inclusion without sending everything. Emphasize that the verifier needs the correct sibling values and the correct left/right positions.

Key Points - A membership proof consists of the target item plus the sibling hashes needed to rebuild the root. - Verification works by hashing upward in the correct order until the claimed root is reconstructed. - Authentication path is another name for the ordered list of sibling hashes used in the proof.

Walkthrough 1. Hash the claimed item into its leaf value. 2. Combine that leaf with each supplied sibling hash in the correct left/right order. 3. Accept the proof only if the final reconstructed value equals the published root.

Sources [1]; [2]

Why Proofs Are Efficient

For $n$ leaves, depth is $\approx \log_2 n$ (binary tree).
Proof length: $\lceil \log_2 n \rceil$ hashes; e.g.,
- $n=1{,}000$ → about 10 hashes.
- $n=1{,}000{,}000$ → about 20 hashes.
Verification time: $O(\log n)$ hash calls.
Storage: You may store only the root (and a policy that defines how the tree is built).

Tamper Evidence & Security Assumptions

Security relies on properties of the hash function $H$:
- Collision resistance: infeasible to find $x\ne x'$ with $H(x)=H(x')$.
- Second‑preimage resistance: given $x$, hard to find $x'\ne x$ with same hash.
Ordering matters: use left/right labels and explicit concatenation.
Domain separation: tag leaves vs internal nodes (e.g., 0x00 for leaves, 0x01 for nodes).

Design Choices & Variants

Branching factor: binary vs $k$‑ary (trades depth for node size).
Sorted vs original order: some systems sort leaves for determinism; others preserve insertion order.
Odd number of leaves: duplicate last leaf, or promote it—pick one rule and publish it.
Sparse trees: massive key spaces with mostly empty leaves (useful when indices matter: concatenate node value with index or node tag prior to hashing).

Where You’ll See Them (Non‑Exhaustive)

Content‑addressed data systems: version control stores (e.g., commit graphs as hash‑linked structures).
Transparency logs: public, append‑only logs use Merkle roots to audit inclusion and history.
Package & software updates: metadata signed via Merkle roots to ensure integrity.
Filesystems & databases: integrity trees over blocks/pages for fast corruption detection.
Blockchain: Block headers include a Merkle root committing to all transactions in the block.

Practice: Micro Labs

Review the following interactive labs:

First: Merkle-trees Lab
Second: Hashing Merkle Lab

Overall Slide Concept This slide establishes a practice checkpoint for building and breaking small Merkle examples interactively. It matters here because students now have enough theory to benefit from manipulating tree structure and proofs directly. Emphasize that the goal is to visualize proof construction, failure, and root recomputation.

Key Points - The labs let students build trees, alter values, and observe proof failure visually. - They reinforce how block size, sibling paths, and recomputation affect the final root. - Practical manipulation helps students internalize why specification details matter.

Walkthrough 1. Use the Merkle-trees lab to generate a tree from sample text and vary the block size. 2. Modify a value or hash and observe how the proof fails until the tree is recomputed. 3. Use the second lab to walk through proof generation and verification step by step.

Sources [1]; [2]

Summary

A cryptographic hash provides binding commitments with strong one‑way and collision resistance.
A Merkle tree is a hash‑based commitment to a whole dataset.
Membership proofs use sibling hashes along a path to rebuild the root.
Efficiency: verification cost and proof size grow logarithmically.
Security: depends on a good hash function and unambiguous formatting rules.

References

[1]

J. Katz and Y. Lindell, Introduction to Modern Cryptography, 3rd ed. Chapman and Hall/CRC, 2020. doi: 10.1201/9781351133036.

[2]

A. J. Menezes, P. C. van Oorschot, and S. A. Vanstone, “Handbook of Applied Cryptography.” CRC Press, Aug. 07, 2011. Accessed: Oct. 23, 2025. [Online]. Available: https://cacr.uwaterloo.ca/hac/

[3]

National Institute of Standards and Technology (US), “Secure hash standard,” National Institute of Standards and Technology (U.S.), Washington, D.C., NIST FIPS 180-4, 2015. doi: 10.6028/NIST.FIPS.180-4.

[4]

D. Boneh and V. Shoup, “Hash Functions.” 2020. Available: https://intensecrypto.org/public/lec_07_hash_functions.html

[5]

M. Stevens, E. Bursztein, P. Karpman, A. Albertini, and Y. Markov, “The first collision for full SHA-1.” Accessed: Oct. 30, 2025. [Online]. Available: https://eprint.iacr.org/2017/190

[6]

D. R. Stinson and M. B. Paterson, Cryptography: Theory and practice, Fourth edition, first issued in paperback. Boca Raton, FL: Chapman & Hall/CRC Press, 2022.

[7]

G. Leurent and T. Peyrin, “SHA-1 is a Shambles.” Accessed: Oct. 30, 2025. [Online]. Available: https://sha-mbles.github.io/

[8]

CWI Amsterdam and Google Research, “SHAttered.” Accessed: Oct. 30, 2025. [Online]. Available: https://shattered.io/

[9]

G. Bertoni, J. Daemen, M. Peeters, and G. Van Assche, “The Keccak Reference version 3.0.” Team Keccak, Jan. 14, 2011. Accessed: Oct. 30, 2025. [Online]. Available: https://keccak.team/files/Keccak-reference-3.0.pdf

[10]

D. Temoshok, “Digital Identity Guidelines: Authentication and Authenticator Management,” National Institute of Standards and Technology, Gaithersburg, MD, NIST SP 800-63B-4, 2025. doi: 10.6028/NIST.SP.800-63B-4.

[11]

J. Bonneau and S. E. Schechter, “Towards reliable storage of 56-bit secrets in human memory,” in USENIX Security Symposiym, Aug. 2014. Accessed: Oct. 30, 2025. [Online]. Available: https://users.umiacs.umd.edu/~tudor/courses/ENEE757/Fall15/papers/Bonneau14.pdf

[12]

S. Haber and W. S. Stornetta, “How to time-stamp a digital document,” J. Cryptology, vol. 3, no. 2, pp. 99–111, Jan. 1991, doi: 10.1007/BF00196791.

[13]

Q. H. Dang, “Recommendation for applications using approved hash algorithms,” National Institute of Standards and Technology, Gaithersburg, MD, NIST SP 800-107r1, 2012. doi: 10.6028/NIST.SP.800-107r1.

[14]

D. Bayer, S. Haber, and W. S. Stornetta, “Improving the Efficiency and Reliability of Digital Time-Stamping,” in Sequences II, R. Capocelli, A. De Santis, and U. Vaccaro, Eds., New York, NY: Springer New York, 1993, pp. 329–334. doi: 10.1007/978-1-4613-9323-8_24.

[15]

H. Massias, X. S. Avila, and J.-J. Quisquater, “Design of a Secure Timestamping Service with Minimal Trust Requirements,” in 20th Symposium on Information Theory in the Benelux, Haasrode, Belgium, May 1999. Available: https://cdn.nakamotoinstitute.org/docs/secure-timestamping-service.pdf

[16]

S. Nakamoto, “Bitcoin: A Peer-to-Peer Electronic Cash System.” Satoshi Nakamoto Institute, Oct. 31, 2008. Accessed: Sep. 12, 2025. [Online]. Available: https://cdn.nakamotoinstitute.org/docs/bitcoin.pdf

[17]

Nakamoto Institute, “Bitcoin Library (Primary Sources).” 2014. Available: https://nakamotoinstitute.org/library/bitcoin/

[18]

A. Narayanan, J. Bonneau, E. Felten, A. Miller, and S. Goldfeder, Bitcoin and Cryptocurrency Technologies. Princeton University Press, 2016.