UUID Algorithm Principles: Using Python to Show Why It Hardly Ever Repeats

In distributed systems, we often need to generate unique identifiers for data. Database primary keys, distributed cache keys, log tracing IDs, and message queue message IDs all depend heavily on identifier uniqueness. UUID (Universally Unique Identifier) is a widely used solution to this problem. In this article, we look at how UUIDs are generated, implement the ideas in Python, and answer a core question: why do UUIDs almost never repeat?

What is UUID?

UUID is a standardized unique identifier defined by the IETF (Internet Engineering Task Force), whose full name is Universally Unique Identifier. It is a 128-bit number, usually represented as a 36-character string in the format of five segments of hexadecimal numbers: 8-4-4-4-12. For example: f81d4fae-7dec-11d0-a765-00a0c91e6bf6.
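As a quick check of these numbers, the example string can be parsed with Python's standard `uuid` module. This is only an illustrative sketch, and the variable names are arbitrary:

```python
import uuid

# Parse the example string from the text
example = uuid.UUID("f81d4fae-7dec-11d0-a765-00a0c91e6bf6")

text = str(example)
segment_lengths = [len(part) for part in text.split("-")]

print(len(text))                # 36 characters, hyphens included
print(segment_lengths)          # [8, 4, 4, 4, 12]
print(len(example.bytes) * 8)   # 128 bits
print(example.version)          # 1
```

The 36 characters are 32 hexadecimal digits plus 4 hyphens, and the version number is stored inside the value itself.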

Behind this seemingly simple string lies an exquisite algorithm design. Its most prominent feature is that it can generate globally unique identifiers in a distributed system without the coordination of a central authority. This means that even without a network connection, the UUIDs generated by different devices are almost impossible to repeat.

The Five Versions of UUID

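There is not just one way to generate a UUID. The original specification (RFC 4122) defines 5 versions, each suited to different scenarios (its successor, RFC 9562, adds versions 6, 7, and 8, which are not covered here):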

Version 1 (based on timestamp and MAC address): Generated by combining the current timestamp, clock sequence, and node (usually a MAC address)
Version 2 (DCE security version): Introduces POSIX UID/GID on the basis of Version 1, and is rarely used
Version 3 (MD5 hash based on namespace): Calculated for the namespace and name through the MD5 hash algorithm
Version 4 (randomly generated): Completely relies on random numbers for generation
Version 5 (SHA-1 hash based on namespace): Similar to Version 3, but uses the SHA-1 hash algorithm

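In actual development, Versions 1 and 4 are the most commonly used. Version 1 suits scenarios where the generation time should be recoverable from the identifier, although its string form does not sort chronologically because the low bits of the timestamp come first. Version 4 suits scenarios that need unpredictable, purely random identifiers.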
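One caveat about Version 1 and time ordering is worth checking in code. The string form starts with the low 32 bits of the timestamp, so a later UUID can sort before an earlier one. The sketch below uses a hypothetical helper, `make_v1`, with a fixed clock sequence and node, to build two UUIDs one timestamp tick apart:

```python
import uuid

def make_v1(timestamp: int) -> uuid.UUID:
    """Build a Version 1 UUID from a 60-bit timestamp (fixed clock sequence and node)."""
    time_low = timestamp & 0xFFFFFFFF
    time_mid = (timestamp >> 32) & 0xFFFF
    time_hi_version = ((timestamp >> 48) & 0x0FFF) | 0x1000
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version, 0x80, 0x00, 0x00A0C91E6BF6))

earlier = make_v1(0x1D07DECFFFFFFFF)
later = make_v1(0x1D07DECFFFFFFFF + 1)

by_string = sorted([earlier, later], key=str)
by_time = sorted([earlier, later], key=lambda u: u.time)

print(earlier, later)
print(by_string[0] == later)    # True: the later UUID sorts first as a string
print(by_time[0] == earlier)    # True: the embedded timestamp gives the real order
```

Sorting by the `time` attribute recovers the chronological order when it is needed.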

Generating UUID with Python

The uuid module in Python's standard library provides functions for generating various versions of UUID. Let's take a look at how to use it:

import uuid

# Generate Version 1 UUID (based on timestamp and MAC address)
uuid1 = uuid.uuid1()
print(f"Version 1 UUID: {uuid1}")

# Generate Version 4 UUID (randomly generated)
uuid4 = uuid.uuid4()
print(f"Version 4 UUID: {uuid4}")

# Generate Version 3 UUID (MD5 hash based on namespace and name)
namespace = uuid.NAMESPACE_URL
name = "https://example.com"
uuid3 = uuid.uuid3(namespace, name)
print(f"Version 3 UUID: {uuid3}")

# Generate Version 5 UUID (SHA-1 hash based on namespace and name)
uuid5 = uuid.uuid5(namespace, name)
print(f"Version 5 UUID: {uuid5}")

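Running this code produces output in the following form. The Version 1 and Version 4 values change on every run, while Version 3 and Version 5 are fixed for a given namespace and name; the values shown here are illustrative: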

Version 1 UUID: f81d4fae-7dec-11d0-a765-00a0c91e6bf6
Version 4 UUID: 3b9a7b3a-9c7d-4e5f-8a9b-0c1d2e3f4a5b
Version 3 UUID: 1b9d6bcd-bbfd-366d-9b5d-ab8dfbbd4bed
Version 5 UUID: 886313e1-3b8a-5372-9b90-0c9aee199e5d

In-depth Analysis of UUID Algorithm Principles

Generation Principle of Version 1 UUID

Version 1 has the most intricate layout and best shows the care behind the UUID design. Its 128 bits consist of the following parts:

60-bit timestamp: Counted in 100-nanosecond intervals since 00:00:00 UTC on October 15, 1582
14-bit clock sequence: Guards against duplicates when the clock is set backwards
48-bit node identifier: Usually a MAC address, so that UUIDs generated by different devices differ
4-bit version and 2-bit variant: Fixed bits that mark the value as a Version 1 UUID in the standard layout

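A simplified Python sketch follows. The clock sequence is drawn at random here as a stand-in; a real implementation keeps it across calls:

import secrets
import time
import uuid

def generate_uuid1():
    # Current time in 100-nanosecond units since 1582-10-15
    # (the constant is the offset between that date and the Unix epoch)
    timestamp = time.time_ns() // 100 + 0x01B21DD213814000
    # Clock sequence (14 bits); random here, persisted in real implementations
    clock_seq = secrets.randbits(14)
    # Node identifier (usually a MAC address)
    node = uuid.getnode()

    # Split the 60-bit timestamp into the three fields of the layout
    time_low = timestamp & 0xFFFFFFFF
    time_mid = (timestamp >> 32) & 0xFFFF
    time_hi_version = ((timestamp >> 48) & 0x0FFF) | (1 << 12)  # version 1
    # The top two bits of this byte hold the variant (binary 10)
    clock_seq_hi_variant = ((clock_seq >> 8) & 0x3F) | 0x80
    clock_seq_low = clock_seq & 0xFF

    # Assemble the fields and convert to standard string format
    return str(uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                                 clock_seq_hi_variant, clock_seq_low, node)))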

There are several key points to note here:

Timestamp starting point: October 15, 1582 is chosen because it is the date on which the Gregorian calendar was introduced.
Clock sequence: When the system clock is set backwards, the clock sequence is incremented so that the generated UUID is still unique even though the timestamp has become smaller.
Node identifier: Using a MAC address helps ensure that UUIDs generated by different devices do not conflict, but it also exposes the identity of the generating machine. For this reason, some implementations use a randomly generated node identifier instead of the MAC address.
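Because these three parts are stored in the UUID, they can be read back out, which also illustrates the privacy concern. The following is a minimal sketch that relies on the `time`, `clock_seq`, and `node` attributes of Python's `uuid.UUID`; `decode_uuid1` and `GREGORIAN_EPOCH` are names made up for this example:

```python
import datetime
import uuid

# Zero point of Version 1 timestamps
GREGORIAN_EPOCH = datetime.datetime(1582, 10, 15, tzinfo=datetime.timezone.utc)

def decode_uuid1(value: uuid.UUID) -> dict:
    """Split a Version 1 UUID into its timestamp, clock sequence, and node."""
    # value.time counts 100-nanosecond intervals; 10 of them make one microsecond
    created = GREGORIAN_EPOCH + datetime.timedelta(microseconds=value.time // 10)
    return {
        "created": created,
        "clock_seq": value.clock_seq,
        "node": f"{value.node:012x}",
    }

info = decode_uuid1(uuid.uuid1())
print(info)
```

Anyone who receives a Version 1 UUID can run the same decoding, so the generation time and the node value should be treated as public.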

Generation Principle of Version 4 UUID

The generation of Version 4 UUID is relatively simple, and it mainly relies on random numbers:

122-bit random number: Provides the main guarantee of uniqueness
6 fixed bits: Used to identify the UUID version and variant

A simplified Python sketch follows:

import secrets
import uuid

def generate_uuid4():
    # Start from 128 random bits; 6 of them are overwritten below,
    # which leaves 122 random bits
    value = secrets.randbits(128)
    # Set the version field (bits 76-79) to 4
    value = (value & ~(0xF << 76)) | (4 << 76)
    # Set the variant field (bits 62-63) to binary 10
    value = (value & ~(0x3 << 62)) | (0x2 << 62)
    # Convert to standard string format
    return str(uuid.UUID(int=value))

The randomness of Version 4 UUID is the key to its uniqueness, and we will discuss its collision probability in detail later.
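The 6 fixed bits are easy to spot in any Version 4 UUID produced by the standard library. The short check below is illustrative only; the index positions follow from the 8-4-4-4-12 string layout:

```python
import uuid

sample = uuid.uuid4()
text = str(sample)

# The version is the first hex digit of the third group
version_digit = text[14]
# The variant occupies the top two bits of the first digit of the fourth group
variant_digit = text[19]

print(text)
print(version_digit)              # always "4" for Version 4
print(variant_digit in "89ab")    # True: the digit has the binary form 10xx
print(sample.version, sample.variant == uuid.RFC_4122)
```

Every other hex digit in the string comes from the random source.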

Generation Principles of Version 3 and Version 5 UUIDs

Versions 3 and 5 UUIDs are generated based on namespaces and names, and their principles are similar:

Convert the namespace UUID to a byte sequence
Convert the name to a UTF-8 encoded byte sequence
Concatenate the namespace bytes and name bytes
Perform hash calculation on the concatenated result (MD5 for Version 3, SHA-1 for Version 5)
Take the first 128 bits of the hash result and set the version bits and variant bits

The advantage of this method is that for the same namespace and name, the same UUID can always be generated, which is very useful in some scenarios.
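The steps above can be sketched with `hashlib` and compared against the standard library's own `uuid.uuid3` and `uuid.uuid5`. The helper name `name_based_uuid` is hypothetical, and the sketch assumes MD5 is available in the local Python build:

```python
import hashlib
import uuid

def name_based_uuid(namespace: uuid.UUID, name: str, version: int) -> uuid.UUID:
    """Derive a Version 3 (MD5) or Version 5 (SHA-1) UUID from a namespace and a name."""
    hasher = hashlib.md5 if version == 3 else hashlib.sha1
    # Hash the namespace bytes followed by the UTF-8 encoded name
    digest = hasher(namespace.bytes + name.encode("utf-8")).digest()
    # Keep the first 128 bits; the constructor sets the version and variant bits
    return uuid.UUID(bytes=digest[:16], version=version)

name = "https://example.com"
manual_v3 = name_based_uuid(uuid.NAMESPACE_URL, name, 3)
manual_v5 = name_based_uuid(uuid.NAMESPACE_URL, name, 5)

print(manual_v3 == uuid.uuid3(uuid.NAMESPACE_URL, name))   # True
print(manual_v5 == uuid.uuid5(uuid.NAMESPACE_URL, name))   # True
```

Calling the function again with the same inputs returns the same UUID, which is what makes these versions useful for deriving stable identifiers from existing names.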

Why Do UUIDs Almost Never Repeat?

This is the question that everyone is most concerned about. The uniqueness of UUID is guaranteed through the following aspects:

Huge space: A UUID is 128 bits, which means there are a total of 2^128 possible values, approximately 3.4×10^38. To give an intuitive sense of scale:
- The number of sand grains on Earth is approximately 7.5×10^18
- The number of stars in the observable universe is approximately 10^22
- 2^128 is approximately 3.4×10^16 times the number of stars in the observable universe
This means that even at 1 trillion UUIDs per second, it would take about 10^19 years to use up all possible values, which is far longer than the age of the universe (about 13.8 billion years).
Well-designed randomness: In a Version 4 UUID, 122 of the 128 bits are randomly generated, so there are 2^122 (about 5.3×10^36) possible values. By the birthday bound, the probability of at least one collision after generating n UUIDs is approximately P(n) ≈ n² / (2×2^122). When n = 10^12, the collision probability is approximately 10^24 / (2×5.3×10^36) ≈ 9.4×10^-14, which is negligible.
Uniqueness in time and space: For Version 1, the increasing timestamp separates UUIDs generated on the same device, while the node identifier (usually a MAC address) separates UUIDs generated on different devices. MAC addresses are intended to be globally unique, although cloned or virtual network interfaces can break that assumption, so uniqueness is covered along both the time and the space dimension.
Handling clock rollback: System clocks are sometimes set backwards, for example when NTP corrects the time. Version 1 handles this through the clock sequence: when a rollback is detected, the clock sequence is incremented so that the generated UUID is still unique even though the timestamp has become smaller.
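The birthday approximation is simple to evaluate directly. The sketch below assumes the 122 random bits of a Version 4 UUID as the identifier space (pass a different `random_bits` value to model another space); the function name is made up for this example:

```python
import math

def collision_probability(n: int, random_bits: int = 122) -> float:
    """Birthday-bound estimate of at least one collision among n random IDs."""
    space = 2 ** random_bits
    # 1 - exp(-n^2 / 2N); expm1 keeps precision when the result is tiny
    return -math.expm1(-(n * n) / (2 * space))

for n in (10**9, 10**12, 10**15):
    print(f"n = {n:.0e}: p ≈ {collision_probability(n):.2e}")
```

For small results the value is practically identical to n² / (2N), and it grows with the square of n, so ten times as many UUIDs means roughly a hundred times the collision risk.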

Probability of UUID Collision in Practical Applications

Although UUIDs may theoretically collide, in practical applications, this probability is so low that it can be ignored. Let's look at some specific figures:

Generating 1 billion (10^9) Version 4 UUIDs, the collision probability is approximately 10^-19
Generating 1 trillion (10^12) Version 4 UUIDs, the collision probability is approximately 10^-13
Reaching a 50% collision probability takes about 2.7×10^18 Version 4 UUIDs, which corresponds to generating 1 billion UUIDs per second for roughly 86 years
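The question can also be turned around: how many random UUIDs does it take to reach a given collision probability? The following sketch inverts the birthday approximation, again assuming 122 random bits; `ids_for_probability` is a hypothetical helper name:

```python
import math

def ids_for_probability(p: float, random_bits: int = 122) -> float:
    """Approximate number of random IDs needed to reach collision probability p."""
    space = 2 ** random_bits
    # Invert p = 1 - exp(-n^2 / 2N)  ->  n = sqrt(2N * ln(1 / (1 - p)))
    return math.sqrt(2 * space * -math.log1p(-p))

n_half = ids_for_probability(0.5)
years = n_half / 1e9 / (365.25 * 24 * 3600)
print(f"{n_half:.2e} UUIDs, about {years:.0f} years at 1 billion per second")
```

Under these assumptions, an even chance of a single collision requires on the order of 10^18 identifiers.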

To understand this probability more intuitively, there is a vivid metaphor: the probability of you being hit by a meteorite is higher than the probability of generating two identical UUIDs.

Application Scenarios of UUID

The characteristics of UUID make it very useful in many scenarios:

Distributed databases: In sharded databases, UUID can be used as a globally unique primary key to avoid ID conflicts between different shards.
Distributed cache: As a cache key, ensuring that cache keys generated by different nodes do not conflict.
Log tracking: In distributed systems, using UUID as a tracking ID can track the complete link of requests across services.
Message queues: As the unique identifier of messages, ensuring idempotent processing of messages.
File naming: In distributed file systems, using UUID as a file name can avoid naming conflicts.
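Session identification: As the unique identifier of user sessions, a Version 4 UUID drawn from a cryptographically secure random source is far harder to guess than an auto-incrementing ID. Version 1 UUIDs are predictable and are not suitable for this purpose.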
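As one illustration of the message queue scenario, a consumer can use the UUID attached to each message to skip duplicate deliveries. This is a minimal in-memory sketch; the `IdempotentConsumer` class and the message payload are hypothetical, and a production system would keep the seen IDs in durable storage:

```python
import uuid

class IdempotentConsumer:
    """Hypothetical consumer that ignores messages whose ID it has already seen."""

    def __init__(self):
        self.seen_ids = set()
        self.processed = []

    def handle(self, message_id: uuid.UUID, payload: str) -> bool:
        if message_id in self.seen_ids:
            return False  # duplicate delivery, skip it
        self.seen_ids.add(message_id)
        self.processed.append(payload)
        return True

consumer = IdempotentConsumer()
msg_id = uuid.uuid4()                          # the producer attaches this ID once
consumer.handle(msg_id, "charge order 1001")
consumer.handle(msg_id, "charge order 1001")   # redelivery of the same message
print(consumer.processed)                      # ['charge order 1001']
```

Because the producer can create the ID locally, no round trip to a central ID service is needed before the message is sent.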

Advantages and Disadvantages of UUID

Advantages

Global uniqueness: Uniqueness can be guaranteed in distributed systems without coordination.
No need for a central authority: The generation process does not depend on a central server.
Cross-platform compatibility: The UUID standard is widely supported, and almost all programming languages have corresponding implementations.
Rich information: Version 1 UUID contains time information, and the generation time can be roughly judged.

Disadvantages

Long length: The 36-character string form (16 bytes in binary) is much longer than an auto-incrementing ID, which may increase storage and transmission costs.
Disorder: Version 4 UUIDs are random and carry no ordering, so they are a poor fit for primary keys that need to be sorted or inserted sequentially.
Poor readability: Compared with auto-incrementing IDs, UUID has poor readability.
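The storage cost depends on the representation that is chosen. The short sketch below compares the three forms that Python's `uuid.UUID` exposes; the variable names are arbitrary:

```python
import uuid

value = uuid.uuid4()

as_text = str(value)      # canonical form with hyphens
as_hex = value.hex        # hex digits only
as_bytes = value.bytes    # raw binary form

print(len(as_text), len(as_hex), len(as_bytes))   # 36 32 16

# The binary form converts back to the same UUID
restored = uuid.UUID(bytes=as_bytes)
print(restored == value)                          # True
```

Storing the 16-byte binary form, where the database supports it, removes more than half of the size overhead of the text form.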

Summary

UUID is an exquisite unique identifier generation scheme. It ensures that identifiers generated in distributed systems almost never repeat through a huge space, well-designed randomness, and the combination of time and space.

Whether it is the generation method of Version 1 based on timestamp and MAC address or Version 4 based on random numbers, each has its applicable scenarios. In actual development, we can choose the appropriate UUID version according to specific needs.

UUID is not perfect: its length and lack of ordering can cause problems. Still, in most distributed systems its advantages far outweigh its disadvantages, which makes it a preferred choice for globally unique identifiers.

Next time when you use UUID in your code, you might as well think about the exquisite design behind these 36 characters and how it marks your data uniquely in the vast digital space.
