base-64-url-no-pad inconsistent with RFC 4648 and previous use #158

@IS4Code

The specification mandates the use of base-64-url-no-pad for the u Multibase header, as defined in the document. However, that definition seems to differ from base64url in RFC 4648, and even from how the u header was defined in https://github.com/multiformats/multibase. Is this difference intentional or erroneous?

Specifically, it refers to a particular algorithm, but that algorithm is written for both base58 and base64, whose common usages crucially differ in the direction from which the bits are grouped. As a result, the two algorithms agree only when the input length is a multiple of 3:

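const message = 'Hello world';
const bytes = new TextEncoder().encode(message);

// baseEncode is assumed to implement the algorithm the specification refers to.
// Prints: EhlbGxvIHdvcmxk
console.log(baseEncode(bytes, 64, "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"));

// Standard RFC 4648 base64 via the built-in btoa.
// Prints: SGVsbG8gd29ybGQ=
console.log(btoa(message));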

I would expect these two to be equal (sans the padding). I have also never seen this reverse-direction base64 used anywhere else, and it is somewhat more computationally expensive.
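To make the difference in grouping direction concrete, here is a minimal sketch of both approaches. The function names `encodeFromStart` and `encodeFromEnd` are hypothetical and are not taken from the specification; `encodeFromEnd` models the big-integer reading of the algorithm and, for brevity, does not treat leading 0x00 bytes specially.

```javascript
const ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

// RFC 4648 style: consume bits starting at the most-significant bit of the
// first byte and zero-fill the final 6-bit group on the right. No padding.
function encodeFromStart(bytes) {
  let out = "";
  let buffer = 0; // bits read but not yet emitted
  let bits = 0;   // how many pending bits buffer holds
  for (const byte of bytes) {
    buffer = (buffer << 8) | byte;
    bits += 8;
    while (bits >= 6) {
      bits -= 6;
      out += ALPHABET[(buffer >> bits) & 63];
    }
    buffer &= (1 << bits) - 1; // keep only the unconsumed bits
  }
  if (bits > 0) out += ALPHABET[(buffer << (6 - bits)) & 63];
  return out;
}

// Big-integer style (hypothetical model): treat the bytes as one number and
// emit its base-64 digits, which groups bits from the least-significant bit
// of the last byte. Leading 0x00 bytes are ignored in this sketch.
function encodeFromEnd(bytes) {
  let n = 0n;
  for (const byte of bytes) n = (n << 8n) | BigInt(byte);
  let out = "";
  while (n > 0n) {
    out = ALPHABET[Number(n & 63n)] + out;
    n >>= 6n;
  }
  return out;
}

const helloBytes = new TextEncoder().encode("Hello world");
console.log(encodeFromStart(helloBytes)); // SGVsbG8gd29ybGQ
console.log(encodeFromEnd(helloBytes));   // EhlbGxvIHdvcmxk
```

For a 3-byte input such as "abc", both functions return YWJj, because 24 bits divide evenly into four 6-bit groups whichever end the grouping starts from.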

Below is further elaboration I received on this from the Named Information designated expert:

Normal base64url encoding (RFC 4648) splits a sequence of 8-bit bytes into 6-bit segments from the start (most-significant bit of the first byte).
CID’s base-64-url-no-pad splits a sequence of 8-bit bytes into 6-bit segments from the end (least-significant bit of the last byte).

Consider the single byte: 11111111 = 0xFF
Base64url splits it from the start, then appends four 0s to fill the last segment: 111111 | 110000 = _w
Base-64-url-no-pad splits it from the end, then assumes leading 0s in the first segment: 000011 | 111111 = D_
The handling of leading 0x00 bytes also differs. Consider 0x00 0x00 0x00.
Base64url splits 3 8-bit bytes into 4 6-bit chars (3*8 = 24 = 4*6): AAAA
Base-64-url-no-pad replaces each of the 3 leading 0x00 bytes with 1 char: AAA
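The two worked examples above can be reproduced with a short sketch. `rfc4648UrlNoPad` builds on the standard `btoa`; `endAlignedNoPad` is a hypothetical model of the behaviour the expert describes (one character per leading 0x00 byte, then the big-integer digits) and is not code from the specification.

```javascript
const ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

// RFC 4648 base64url without padding, derived from the built-in btoa.
function rfc4648UrlNoPad(bytes) {
  return btoa(String.fromCharCode(...bytes))
    .replace(/\+/g, "-")
    .replace(/\//g, "_")
    .replace(/=+$/, "");
}

// Assumed reading of the end-aligned variant: emit ALPHABET[0] once per
// leading 0x00 byte, then the base-64 digits of the remaining bytes
// interpreted as a single big-endian number.
function endAlignedNoPad(bytes) {
  let zeros = 0;
  while (zeros < bytes.length && bytes[zeros] === 0) zeros++;
  let n = 0n;
  for (const byte of bytes.slice(zeros)) n = (n << 8n) | BigInt(byte);
  let digits = "";
  while (n > 0n) {
    digits = ALPHABET[Number(n & 63n)] + digits;
    n >>= 6n;
  }
  return ALPHABET[0].repeat(zeros) + digits;
}

const single = Uint8Array.of(0xff);
const zerosOnly = Uint8Array.of(0x00, 0x00, 0x00);
console.log(rfc4648UrlNoPad(single), endAlignedNoPad(single));       // _w D_
console.log(rfc4648UrlNoPad(zerosOnly), endAlignedNoPad(zerosOnly)); // AAAA AAA
```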

ASIDE: The encoding algorithm is defined in terms of log(256) / log(targetBase) * length + 1. It mixes floating-point and integer arithmetic without enough care. There is a step (5) to "skip leading zeros in the base-encoded result", which may correct any ambiguity from the floating-point step.
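As a rough illustration of the aside, the sketch below compares the floating-point size estimate with an integer-only digit count for base 64. Both function names are hypothetical, and truncating the estimate to an integer is an assumption of this sketch rather than something the quoted text states. Where the estimate exceeds the exact count, the surplus positions would hold zero digits, which is presumably what the "skip leading zeros" step removes.

```javascript
// Size estimate in the style of log(256) / log(targetBase) * length + 1.
// Truncation via Math.floor is an assumption made for this sketch.
function estimatedDigits(length, targetBase) {
  return Math.floor((Math.log(256) / Math.log(targetBase)) * length + 1);
}

// Integer-only count for power-of-two bases: ceil(8 * length / bitsPerDigit).
function exactDigitsPow2(length, bitsPerDigit) {
  return Math.floor((8 * length + bitsPerDigit - 1) / bitsPerDigit);
}

// Print input length, floating-point estimate, and exact digit count.
for (const length of [1, 2, 3, 11]) {
  console.log(length, estimatedDigits(length, 64), exactDigitsPow2(length, 6));
}
```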

If this discrepancy is an error, the best solution, it seems to me, would be to simply refer to RFC 4648's base64url encoding with no padding as the definition for u.
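For illustration, a u-prefixed value under that proposed definition could be produced and parsed roughly as follows. The helper names `encodeU` and `decodeU` are hypothetical; the sketch relies only on the standard `btoa` and `atob` functions and restores padding explicitly before decoding.

```javascript
// Encode bytes as Multibase "u": prefix plus RFC 4648 base64url, no padding.
function encodeU(bytes) {
  const b64 = btoa(String.fromCharCode(...bytes));
  return "u" + b64.replace(/\+/g, "-").replace(/\//g, "_").replace(/=+$/, "");
}

// Decode a "u"-prefixed string back to bytes.
function decodeU(text) {
  if (text[0] !== "u") throw new Error("not a base64url multibase string");
  const b64 = text.slice(1).replace(/-/g, "+").replace(/_/g, "/");
  // Re-add the padding that the no-pad form omits.
  const padded = b64 + "=".repeat((4 - (b64.length % 4)) % 4);
  return Uint8Array.from(atob(padded), (c) => c.charCodeAt(0));
}

const encoded = encodeU(new TextEncoder().encode("Hello world"));
console.log(encoded);                                     // uSGVsbG8gd29ybGQ
console.log(new TextDecoder().decode(decodeU(encoded)));  // Hello world
```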

Labels: Errata; class 3 (Other changes that do not add new features)
