MD5 Collision Attack

Ifeanyi Ukadike · October 11, 2023

crypto md5

MD5 is a type of one-way hash function.

A one-way hash function is a cryptographic function that takes an input and generates a fixed-size output. There are several properties of a one-way hash function. Some of these include:

getting the original input from the hash value must be computationally infeasible.
changing a single character in the input should result in a completely different hash value.
it should have a negligible probability of producing the same hash output for different inputs

When a one-way hash function produces the same hash output for different inputs, this is known as a collision. A good one-way hash function must be collision-resistant. The collision-resistance property ensures that it is computationally infeasible to find two inputs that produce the same hash output. If a hash function lacks collision resistance, it becomes vulnerable to collision attacks such as birthday attacks, where an attacker tries to find two inputs that produce the same hash value.

Collision attacks can be achieved through various techniques, such as brute force searching, exploiting weaknesses or vulnerabilities in the hash function, or using advanced mathematical algorithms.

One-way hash functions are used in password storage, data verification, digital signatures, and data integrity checks.

SeedLabs: MD5 Collision Attack Lab

MD5 Collision Attack Lab

Generating Two Different Files with the Same MD5 Hash

This lab section deals with generating two different files with the same MD5 hash values. To achieve this, the beginning parts of both files need to be the same; which means they have to share the same prefix. A prefix refers to a specific set of bytes that are added to the beginning of data before it is hashed.

The md5collgen program allows us to provide a prefix file with any arbitrary content.

#!/usr/bin/env python3

with open("prefix.txt", "w") as file:
    file.write("A" * 64)

$ md5collgen -p prefix.txt -o out1.bin out2.bin

task-1-a

When I check whether the out1.bin and out2.bin files are distinct using the diff command, the result is that out1.bin and out2.bin differ.

task-1-b

However, when I use the md5sum command to check the MD5 hash of out1.bin and out2.bin, the result is that the hash values of out1.bin and out2.bin are the same.

task-1-c

Since the output of md5collgen: out1.bin and out2.bin are binary files, we need to use a hex editor to view and edit them. When I view the binary files with a hex editor, I can see the characters I supplied in the prefix file.

Inspecting the binary files, I noticed the following:

if the length of your prefix file is not a multiple of 64, md5collgen makes up for the length by filling the block with zeros.

task-1-d

whereas if the prefix file is exactly 64 bytes, there is no zero padding.

task-1-e

the data generated by md5collgen is not completely different for the out1.bin and out2.bin output files?

Understanding MD5’s Property

This lab section deals with trying to understand some of the properties of the MD5 algorithm. These properties include:

MD5 divides the input data into blocks of 64 bytes
it then computes the hash iteratively on these blocks via a compression function. This compression function takes a 64-byte data block and the outcome of the previous iteration as input.
the output of the compression function is a 128-bit Intermediate Hash Value (IHV).
the IHV for the first iteration is a fixed predetermined value
the IHV for the last iteration becomes the final hash value of the data.

Based on the above, given two inputs M and N, if MD5(M) = MD5(N), then for any input T, MD5(M || T) = MD5(N || T), where || represents concatenation.

What this simply means is that if inputs M and N have the same hash, should I add the same suffix T to them, their outputs will have the same hash value. To demonstrate this, I will concatenate the ls binary to out1.bin and out2.bin respectively, and calculate the hash value of their outputs.

cat out1.bin /usr/bin/ls > out3.bin
cat out2.bin /usr/bin/ls > out4.bin
md5sum out3.bin out4.bin

from the results in the screenshot below, we can see that the outputs produce the same hash value.

task-2-a

Generating Two Executable Files with the Same MD5 Hash

This lab section deals with creating two different versions of a program given below, such that the contents of their “xyz” arrays are different, but the hash values of the executables are the same.

#include <stdio.h>

unsigned char xyz[200] = {
  /* The actual contents of this array are up to you */
};

int main()
{
  int i;
  for (i=0; i<200; i++){
    printf("%x", xyz[i]);
  }
  printf("\n");
}

I can go about this in two ways:

working at the source code level, i.e., generating two versions of the above C program, such that after compilation, their corresponding executable files have the same MD5 hash value.
working at the binary level, i.e., putting some arbitrary values in the “xyz” array, compiling the code to binary, and using a hex editor tool to modify the content of the “xyz” array.

In order to find where the contents of the array are stored in the binary, I fill the array with “A”s (0x41) so I can easily find them in the binary. I write a Python program that will generate 200 ‘A’s formatted for easy copy to the c code.

#!/usr/bin/env python3

#array = input("Type in the characters you want to transform: ")
array = "A" * 200

result = bytes.hex(array.encode())
a = 0; b = 0

for _ in result[::2]:
    print (f"0x{result[a:a+2]}, ", end="")
    a += 2
    b += 1
    # Make each output 10 characters per line
    if b % 10 == 0:
        print()    
print()

When I compile the c code and inspect the binary in a hex editor, I am able to identify the array.

task-3-a

From inside the array, I can divide the executable file into three parts:

a prefix (The length of the prefix needs to be multiple 0f 64 bytes)
a 128-byte region
a suffix.

the array starts from 12320 to 12519. since the array has to be a part of the prefix, I need to look for a number within the array that is a multiple of 64. I ended up with 12352.

$ head -c 12352 md5col > prefix

the 128-byte region will be generated by md5collgen

$ md5collgen -p prefix -o out1.bin out2.bin

the suffix will start from 12352 + 128 = 12480.

$ tail -c +12481 md5col > suffix

finally, I added the suffix to out1.bin and out2.bin

$ cat suffix >> out1.bin
$ cat suffix >> out2.bin

When I check the hash values of out1.bin and out2.bin, the values are the same. However, when I check out1.bin and out2.bin with the diff utility, the programs differ.

task-3-b

Making the Two Programs Behave Differently

This lab section deals with creating two independent programs that perform different functions but share the same hash value.

If an attacker can get his malicious software to share the same hash as a benign software, a certificate that is valid for the benign software is also valid for the malicious program. Therefore, the attacker has successfully obtained a valid certificate for his malicious program. If other people trust the certificate issued by the authority, they will download the malicious program.

One approach is to create two arrays X and Y and compare the contents of these two arrays; if they are the same, the benign code is executed; otherwise, the malicious code is executed.

#include <stdio.h>

unsigned char X[200] = {
    0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
    0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
    0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
    0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
    0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
    0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
    0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
    0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
    0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
    0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
    0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
    0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
    0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
    0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
    0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
    0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
    0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
    0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
    0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
    0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
};

unsigned char Y[200] = {
    0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
    0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
    0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
    0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
    0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
    0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
    0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
    0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
    0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
    0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
    0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
    0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
    0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
    0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
    0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
    0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
    0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
    0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
    0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
    0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
};

int compare()
{
    for(int i = 0; i < 200;i++){
        if (X[i] == Y[i]){
            continue;
        } else{
            return 1;
        }
    }
    return 0;
}


int main()
{
    if (compare() == 0){
        printf("Benign code executed..\n");
    } else{
        printf("Malicious code executed..\n");
    }
    printf("\n");
}

The image below illustrates what the two versions of the program will look like.

Due to the complexity of this task, I broke it down into a number of sub-tasks.

1. Identifying the Prefix

After compiling the program and opening it in a hex editor, I was able to determine that the prefix is 12352. The first array starts from 12320 to 12519. since the array has to be a part of the prefix, I need to look for a number within the array that is a multiple of 64.

task-4-a

2. Extracting the Prefix

Next is to extract the prefix using the head utility

$ head -c 12352 md5col2 > prefix

3. Generating Collision Data using md5collgen

Next is to generate two different programs that will generate the same hash

$ md5collgen -p prefix -o out1 out2

4. Extracting the 128 bytes Collision Data

When md5collgen is run, it generates 128 bytes of data that is added to the prefix. This 128 bytes of data is responsible for the collision. The next step is to extract these 128 bytes of data and save them for post-processing. These 128 bytes of data are at the end of the file and can be extracted using the tail utility.

$ tail -c 128 out1 > P
$ tail -c 128 out2 > Q

5. Make Two Copies of the Program

$ cp md5col2 malicious_code
$ cp md5col2 benign_code

6. Edit the Benign Program

Because of the complexity of the program, the method of concatenation that was used in a previous task is not feasible. The binary will have to be edited in a hex editor. Another option though is to use the dd utility.

As predetermined:

the first array starts at offset 12320; however, 12320 is not a multiple of 64, so I had to select 12352 which is a multiple of 64
the difference between 12352 and 12320 is 32
the second array starts at the offset 12544; however, since I am going to be comparing array X with array Y to determine the program flow of control, I need to ensure that where I will place the collision data in array X and array Y match
I had to move the offset of array X by 32 to make up for the block size; this means I also have to move array Y by 32
the sum of 12544 and 32 is 12576

$ dd if=P of=benign_code bs=1 count=128 seek=12352 conv=notrunc
$ dd if=P of=benign_code bs=1 count=128 seek=12576 conv=notrunc

The above code simply means: copy 128 bytes of data from the file “P” 1 byte at a time into the file “good_code” from the offset 12352 and 12576 respectively; do not truncate the output file.

task-4-b

6. Edit the Malicious Program

As with the above, the determined data is the same:

$ dd if=Q of=malicious_code bs=1 count=128 seek=12352 conv=notrunc
$ dd if=P of=malicious_code bs=1 count=128 seek=12576 conv=notrunc

task-4-c

7. Testing the Programs

All that is left is to test the programs to see if it works as expected.

The first test is to see if benign_code and malicious_code produce the same hash value:

$ md5sum benign_code malicious_code

From the screenshot below, they indeed have the same hash value

task-4-d

The second test is to see if they execute differently: From the screenshot below, they indeed do execute differently

task-4-e

Thanks for reading…

Share: Twitter, Facebook