URLid Methodology

A universal identifier system for data reconciliation

What is URLid?

URLid is a methodology created by Álvaro Justen (Turicas) from Pythonic Café to facilitate the creation of universal identifiers for entities across different datasets. It was developed to address the challenges of cross-referencing public data from various sources in Brazil, but it can be applied to any data reconciliation scenario. This type of methodology helps systems interconnect more easily and enables the creation of powerful data structures such as graph databases, facilitating complex data analysis and relationships across diverse datasets.

The core idea of URLid is to generate a unique, consistent identifier for each object of an entity (such as a specific person, a particular company, or an individual candidacy) using a combination of available data. This method produces identifiers of the same data type (UUID) regardless of the entity type, without relying on a central database or expensive JOINs.

For a detailed explanation of the methodology (in Portuguese), you can watch the conference talk: Conciliação de datasets públicos com URLid (PGConfBrasil 2022).

How it Works

  1. Define a base URL as your namespace (e.g., https://id.brasil.io/)
  2. Create a slug for the entity type (e.g., person for individuals)
  3. Define a method to create a Raw ID from available data; this can be a combination of various fields available in the databases
  4. Generate a URL combining the namespace, entity type, a version tag for the Raw ID method (e.g., v1), and the Raw ID (e.g., https://id.brasil.io/person/v1/456789-JOHN-DOE/)
  5. Pass the URL through the UUID version 5 algorithm to generate the final identifier
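
The five steps above can be sketched as a single generic helper. This is a minimal illustration rather than a reference implementation: the function name `urlid` and its parameter names are assumptions chosen to mirror the steps, and the sample Raw ID is used only for demonstration.

```python
import uuid


def urlid(base_url: str, entity: str, version: str, raw_id: str) -> uuid.UUID:
    """Build the URL from its four parts and hash it with UUID version 5."""
    # Steps 1-4: namespace + entity slug + version + Raw ID, with a trailing slash
    url = f"{base_url.rstrip('/')}/{entity}/{version}/{raw_id}/"
    # Step 5: UUIDv5 (SHA-1 based) under the standard URL namespace
    return uuid.uuid5(uuid.NAMESPACE_URL, url)


person_id = urlid("https://id.brasil.io/", "person", "v1", "456789-JOHN-DOE")
print(person_id.version)  # 5
```

Stripping the trailing slash from the base URL is a design choice in this sketch: it lets callers pass the namespace with or without the final `/` and still obtain the same identifier.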

Understanding Raw IDs

A Raw ID is a unique string that identifies a specific object within an entity type. It is a crucial component of the URLid methodology and forms the basis for generating the final UUID.

Raw ID formats vary from one entity type to another. In the examples below, a person's Raw ID combines six digits of the CPF with a slug of the name, while a company's Raw ID is the first eight digits of the CNPJ. This diversity presents a challenge when trying to cross-reference different types of entities. That's where the URLid methodology comes in: it provides a universal method to create consistent identifiers across all entity types, regardless of their original Raw ID format.
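
Because the final UUID depends on every character of the Raw ID, the Raw ID has to be built identically wherever it is computed. The sketch below shows one possible normalization for a name-based component. The helper name `slugify_name` and its specific rules (stripping accents, collapsing whitespace, upper-casing) are assumptions for illustration, not rules prescribed by the methodology.

```python
import unicodedata


def slugify_name(name: str) -> str:
    """Normalize a personal name into an upper-case, hyphen-separated slug."""
    # Decompose accented characters, then drop the combining marks
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # split() with no arguments collapses any run of whitespace
    return "-".join(ascii_only.upper().split())


# Differently formatted inputs converge on the same Raw ID component
print(slugify_name("  José  da Silva "))  # JOSE-DA-SILVA
print(slugify_name("jose da silva"))      # JOSE-DA-SILVA
```

Any change to rules like these changes the resulting UUIDs, so such a change would be a natural occasion to bump the version segment of the URL (for example, from v1 to v2).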

Examples

Python Example


import uuid

def person_urlid(id_number, name):
    """Generates URLid based on partial Brazilian CPF identification number and name"""
    # Using the six middle digits (4th to 9th) of the identification number
    clean_id = id_number.replace(".", "").replace("-", "")
    assert len(clean_id) == 11
    id_part = clean_id[3:9]
    name_slug = name.upper().replace(" ", "-")
    raw_id = f"{id_part}-{name_slug}"
    url = f"https://id.brasil.io/person/v1/{raw_id}/"
    return uuid.uuid5(uuid.NAMESPACE_URL, url)

def company_urlid(registration_number):
    """Generates URLid based on core Brazilian CNPJ identification number"""
    # Using first 8 digits of the company registration number
    clean_id = registration_number.replace(".", "").replace("/", "").replace("-", "")
    assert len(clean_id) == 14
    raw_id = clean_id[:8]
    url = f"https://id.brasil.io/company/v1/{raw_id}/"
    return uuid.uuid5(uuid.NAMESPACE_URL, url)

# Usage
print(person_urlid("12345678901", "John Doe"))  # a575a1af-54f4-534a-b781-a78a0e2f48d7
print(company_urlid("12.345.678/0001-99"))  # d20b15d2-25e6-5908-8064-835f700e21a7
            

PostgreSQL Implementation


CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

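CREATE OR REPLACE FUNCTION urlid(base_url TEXT, entity TEXT, version TEXT, raw_id TEXT)
RETURNS UUID AS $$
BEGIN
  -- Build <base_url>/<entity>/<version>/<raw_id>/ and hash it with UUID v5
  -- under the standard URL namespace. rtrim() lets callers pass the base URL
  -- with or without a trailing slash and still get the same identifier.
  RETURN uuid_generate_v5(
    uuid_ns_url(),
    rtrim(base_url, '/') || '/' || entity || '/' || version || '/' || raw_id || '/'
  );
END;
$$ LANGUAGE plpgsql IMMUTABLE;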

-- Usage example:
SELECT urlid('https://id.brasil.io', 'person', 'v1', '456789-JOHN-DOE'); -- a575a1af-54f4-534a-b781-a78a0e2f48d7
SELECT urlid('https://id.brasil.io', 'company', 'v1', '12345678'); -- d20b15d2-25e6-5908-8064-835f700e21a7
            

Flexibility of URLid

It's important to note that URLid is a methodology, not a rigid system tied to specific entities or implementations. While we've provided examples based on entities used in Brasil.IO (such as persons, companies, and electoral candidates), these are merely illustrative. You're free to define and use any entities that suit your specific needs or domain.

A key aspect of URLid's flexibility lies in its use of URLs as a mechanism for generating unique identifiers. These URLs don't need to exist or resolve to web pages; they serve purely as a framework for identifier creation. Because URLs are hierarchical, two objects under the same base URL, entity type, and version get different URLs whenever their Raw IDs differ, so collisions come down to the Raw ID definition itself (UUID version 5 hashes the URL with SHA-1, which makes accidental hash collisions astronomically unlikely rather than strictly impossible).

While you can use any domain in your URLs, it's recommended to base them on domains you own or control. This practice keeps your identifiers from clashing with those of other publishers and also opens up future possibilities. For instance, these virtual URLs could later be turned into actual endpoints that serve entity information over the Web, leading to a distributed, interlinked system of entity data. Thinking ahead in this way addresses current data reconciliation challenges while laying the groundwork for more interconnected data systems.
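
The separation provided by the URL hierarchy can be checked directly: the same Raw ID yields a different UUID as soon as the base URL, the entity slug, or the version changes. In this sketch the helper name `urlid` is an assumption, and `https://id.example.org` is a hypothetical second domain used only for contrast.

```python
import uuid


def urlid(base_url: str, entity: str, version: str, raw_id: str) -> uuid.UUID:
    # Same URL layout as the document's examples: parts joined by "/", then UUIDv5
    url = f"{base_url.rstrip('/')}/{entity}/{version}/{raw_id}/"
    return uuid.uuid5(uuid.NAMESPACE_URL, url)


raw_id = "12345678"
ids = {
    urlid("https://id.brasil.io", "company", "v1", raw_id),
    urlid("https://id.brasil.io", "company", "v2", raw_id),    # different version
    urlid("https://id.brasil.io", "person", "v1", raw_id),     # different entity slug
    urlid("https://id.example.org", "company", "v1", raw_id),  # hypothetical other domain
}
print(len(ids))  # 4
```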

The power of URLid lies in its flexibility: as long as you follow the core principles of the methodology - defining a base URL, creating entity slugs, and generating raw IDs - you can apply URLid to any set of entities you choose. This makes URLid adaptable for use in various contexts, whether for internal systems within an organization or for data exchange between different parties. The key is consistency in applying the method, which ensures that the resulting UUIDs will be unique and reproducible across systems that agree on the same URLid implementation.
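
To illustrate reproducibility across systems, the sketch below reconciles two datasets that format the same company number differently. The datasets, field names, and values are hypothetical. The only thing both sides share is the identifier rule, which follows the company example (first eight digits of the CNPJ).

```python
import uuid


def company_urlid(registration_number: str) -> uuid.UUID:
    # Same rule as the company example: first 8 digits of the CNPJ
    digits = "".join(ch for ch in registration_number if ch.isdigit())
    url = f"https://id.brasil.io/company/v1/{digits[:8]}/"
    return uuid.uuid5(uuid.NAMESPACE_URL, url)


# Two hypothetical datasets that format the same CNPJ differently
contracts = [{"cnpj": "12.345.678/0001-99", "amount": 1000}]
donations = [
    {"cnpj": "12345678000199", "donor": "ACME"},
    {"cnpj": "98765432000100", "donor": "Other"},
]

# Each side computes its identifiers independently, with no shared lookup table
donations_by_id = {company_urlid(row["cnpj"]): row for row in donations}
matches = [
    (row, donations_by_id[company_urlid(row["cnpj"])])
    for row in contracts
    if company_urlid(row["cnpj"]) in donations_by_id
]
print(len(matches))  # 1
```

Since every entity type produces a value of the same type, a single UUID column or dictionary key is enough to hold identifiers for persons, companies, or any other entity.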

References