A Base64 Surprise

Published Jan 23, 2022 - 11 min read - Text Only

Table of contents

A Base64 Surprise
Intro to Base 64
What's padding?
Consequence
Mitigations
- Update 2022-01-29

This weekend I decided to dip back into rust, something I've done on and off for a few days every year for.. seven or so years.

Thinking of trying out rust — Archived Tweet

This weekend project was intended for me to

Familiarize myself with rust again.
Familiarize myself with rocket again.
Try out ring - "Safe, fast, small crypto using Rust" now that I've experienced these primitives with mbedtls in C.
Try out Blake3 which is supposed to be really fast.
Try out Argon2 winner of the password hashing competition as a memory-compute hard hash function now often used for human passwords in storage.

As part of my manual testing, I often mangle the data in some way to ensure that whatever is signed no longer can verify, and whatever is encrypted will no longer decrypt.

Toy signing endpoint

Toy verification endpoint accepts

Toy verification endpoint rejects

Things went as expected until one invalidation test failed.

Surprise at base64 being manipulated at the end without consequence — Archived Tweet

By changing the last character of the signature, it still passed.

I did a double take and then started to remember when I wrote a base64 encoder / decoder in C.

So what's happening here?

Well the string you see does not have a == because the padding was not included in the output intentionally. But the end is in fact padded when decoded.

Intro to Base 64

Base 64 is a way to encode octets of data in an alphabet (there are multiple base 64 standard, base 64 web url, with or without padding) which can be exchanged in a context where the encoding is necessary such as in a url or a header.

You may have seen hex encoding, where octets or bytes are split into two characters each with an alphabet of 16 characters 0123456789ABCDEF. That makes for 4 bits of information per character. (See log2(16) = 4)

Likewise Base 64 uses 64 characters which are case sensitive with the alphabet ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/, the web url encoded version uses -_ instead of +/. Both use = as padding. This means that each base64 character contains 6 bits of information per character. (See log2(64) = 6)

What's padding?

While hex encoding can evenly fit a byte across two characters (which become two bytes when transmitted or stored), Base 64 packs more in at 6 bits per character. However in computing we operate on octets (8 bits or a byte) so a single character is not enough to convey a full octet. Instead at least two characters are needed to convey one octet.

$ echo -n "A" | base64
QQ==
$ echo "QQ==" | base64 -d | xxd
00000000: 41                                       A

echo -n means to pipe whatever you say without a newline n added at the end.
base64 is a command line tool that encodes or decodes -d what is given to it.
xxd is a tool that displays a hex representation of the data and an ascii rendition at the right side.

You may have noticed, there are four characters here to represent just the byte containing "A". That's because it takes 24 bits (or 4 * 6) to have a whole number of octets (in this case 3 octets). See Least common multiple for more on this.

This is where things get interesting. When encoded data does not round up to 3 octets, padding is added to signify where the input stops, even though four 6-bit characters are expected. You'll see "=" at the end of base64 for this reason, unless the padding is removed.

But there's more to padding than just the "=="! See, "QQ" there is 12 bits, but we're only getting 8 bits out.

Check this out.

$ echo "QP==" | base64 -d | xxd
00000000: 40                                       @
$ echo "QQ==" | base64 -d | xxd
00000000: 41                                       A
$ echo "QR==" | base64 -d | xxd
00000000: 41                                       A
$ echo "QS==" | base64 -d | xxd
00000000: 41                                       A
$ echo "Qf==" | base64 -d | xxd
00000000: 41                                       A
$ echo "Qg==" | base64 -d | xxd
00000000: 42                                       B

Now hold on, that means that QQ, QR, ... all the way to Qf all mean the letter "A".

Well it may appear that way to the program that decodes base64. The decoder is truncating the extra bits that do not fit into an octet.

Let me show what's actually happening if we didn't have any padding. I'll replace the "==" with "AA" since that is in the alphabet and has the value of 0.

$ echo "QPAA" | base64 -d | xxd
00000000: 40f0 00                                  @..
$ echo "QQAA" | base64 -d | xxd
00000000: 4100 00                                  A..
$ echo "QRAA" | base64 -d | xxd
00000000: 4110 00                                  A..
$ echo "QSAA" | base64 -d | xxd
00000000: 4120 00                                  A .
$ echo "QfAA" | base64 -d | xxd
00000000: 41f0 00                                  A..
$ echo "QgAA" | base64 -d | xxd
00000000: 4200 00                                  B..

See now how the last four bits show up in the next nibble (half a byte)?

That's what's being truncated. The higher order two bits of the second character are still decoded and used in the second nibble (the 1 in 41), but the remaining four bits are truncated and ignored.

Consequence

So why is this a big deal?

Let's consider the following system.

Bob can create a message that authorizes a transaction to Alice. It has a timestamp of validity and a cryptographic signature applied so it cannot be forged. It is encoded as base64 for ease of transmission and handling.
Alice receives the transaction authorization and hands it to the bank.
The bank derives a nonce from the payload and performs the transaction, this is how it deduplicates requests.
If Alice submitted it again unmodified, the transaction would be rejected because it has already been performed during the validity period as the nonce has already been recorded as used.

What if Alice wanted more of Bob's money?

If Bob's payload was conveniently padded by base64, then Alice can manipulate the payload to appear as a new transaction. Alice could perform a replay attack against this system with attacker-chosen payloads.

See, as far as the bank can tell, the transaction authorization appears unique on the outside, the signature passes on the inside, and the bank can ignorantly perform this transaction multiple times.

$ echo -n "QR==" | md5
f0625b437a9a15918f124a3ad74d2ff9
$ echo -n "QS==" | md5
fd3ff138ebdf4cfb90d736eec5025773

Don't use md5 for cryptographic purposes anymore, this is merely for demonstration.

Bob's signature only protected the content it was tied to, the amount and validity date. His signature was an ineffective protection for a fault in the protocol which allowed for truncated data to be manipulated.

Mitigations

Change the protocol of the transaction to have a nonce chosen by Bob, which the bank references so that the transaction cannot be retried.
Change the bank's nonce derivation to use the base64 decoded output instead of the input verbatim.
Switch to a base64 decoder which refuses inputs where the padding bits are not equal to zero, ensuring that only one valid representation is accepted. - Recommended by arodland

A consideration of adding a client nonce to the transaction protocol: using a nonce (number used only once) should be secured by the signature in the message. All clients issuing transaction authorization messages should carefully handle their nonces, never using them again.

One recommendation is to use a timestamp for a portion of the bits and sufficient randomness for the rest of the bits, hopefully a total of 128 bits. You may also consider using UUID's, or at least UUIDv4.

Aside, a draft for UUIDv7 is closer to the idea of time plus randomness.

Update 2022-01-29

Arodland recommended one mitigation is to use a base64 decoder which rejects tampered input. This has been incoporated into the mitigation list above. Thanks! I had not considered that.

@jedisct1 recommended I switched to another library in rust which @jedisct1 contributed to: namely ct-codecs. When I swapped to this implementation I found that my project now rejects tampered messages!

Rejecting tampered base64

Turns out, @jedisct1 shared this problem back in 2017 with an example tweet, @CiPHPerCoder then provided an interesting script which demonstrates this problem.

PHP base64 decoding is susceptible — Archived Tweet

CiPHPerCoder's Script — Archived Script Evaluation

There's a lot to learn out there and I am grateful for these recommendations.
Thank you!