URL Encoders are Inconsistent with Requirements

Sometimes a perfectly legitimate URL, one that can go into your browser’s address bar as it is, has to be rewritten to fit technical requirements. If you want to embed it in a mailto link or in some other contexts, you may need to encode it so that it is allowed in that context. Several websites do URL encoding and decoding automatically, but they make mistakes when encoding.

The specification that, most of the time, controls how to do URL encoding is RFC 3986, Section 2. It says essentially that certain characters must always be encoded, certain other characters (the unreserved ones) should never be encoded, and a third set (the reserved ones) must sometimes be encoded and must sometimes be left as they are. That last set is the problem area for automatic encoding.

That third set is not discretionary, either. Whether to encode depends on why you’re using the character: you might have to encode it or you might have to refrain from encoding it, but it’s not supposed to be a coin toss. Yet one automatic encoder might always encode these characters while another might never encode them. None of the encoders I’ve seen asks how you are using a character in the maybe set in order to decide whether to encode it. That is what an encoder should do, although most people would blame it for slowing down to ask and would switch to another. As a result, you have to edit some characters manually.
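To illustrate the disagreement, here is a minimal Python sketch using only the standard library’s urllib.parse module. The sample string is made up, and the three calls merely stand in for three different automatic encoders; they are not meant to represent any particular website.

```python
from urllib.parse import quote, quote_plus

sample = "a/b c&d=e"  # hypothetical input, chosen only for illustration

# Default settings: "/" is treated as safe and left alone.
default_result = quote(sample)          # 'a/b%20c%26d%3De'

# Nothing treated as safe: "/" is encoded as well.
strict_result = quote(sample, safe="")  # 'a%2Fb%20c%26d%3De'

# Form-style encoding: the space becomes "+" instead of %20.
form_result = quote_plus(sample)        # 'a%2Fb+c%26d%3De'

for label, result in [("default", default_result),
                      ("strict", strict_result),
                      ("form", form_result)]:
    print(f"{label:8} {result}")
```

All three outputs are defensible somewhere, which is the problem: none of the calls knows which context the string is headed for.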

These are the three kinds of characters:

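Always encode these when they are part of the text being embedded rather than part of the structure of the URL that carries it, some definitely always and some probably always, as annotated: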

(to %20) (space, soft space, breaking space, or ordinary space; a line can break or wrap at this kind of space, which is much more common than a hard or nonbreaking space)

" (to %22) (double straight quotation mark) (I think this character should always be URL-encoded)

# (to %23) (hash or pound sign) (always URL-encode)

% (to %25) (percent sign) (always URL-encode)

/ (to %2F) (forward slash, slash, solidus, or virgule) (always URL-encode)

: (to %3A) (colon) (always URL-encode)

? (to %3F) (question mark) (always URL-encode)

@ (to %40) (at-sign or commercial at sign) (always URL-encode)

[ (to %5B) (opening bracket, opening square bracket, left bracket, or left square bracket) (always URL-encode)

\ (to %5C) (backslash, reverse solidus, or reverse virgule) (always URL-encode)

] (to %5D) (closing bracket, closing square bracket, right bracket, or right square bracket) (always URL-encode)

^ (to %5E) (circumflex accent or exponentiating sign) (I think this character should always be URL-encoded)

` (to %60) (grave accent) (I think this character should always be URL-encoded)

{ (to %7B) (opening brace, left brace, opening curly bracket, or left curly bracket) (I think this character should always be URL-encoded)

| (to %7C) (vertical line, vertical bar, or pipe) (I think this character should always be URL-encoded)

} (to %7D) (closing brace, right brace, closing curly bracket, or right curly bracket) (I think this character should always be URL-encoded)
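The codes in this list can be checked mechanically. The following is a small sketch, assuming Python’s standard urllib.parse.quote as the reference encoder; the dictionary name is my own and simply copies the list above.

```python
from urllib.parse import quote

# The characters from the list above, paired with the codes given there.
always = {
    " ": "%20", '"': "%22", "#": "%23", "%": "%25", "/": "%2F", ":": "%3A",
    "?": "%3F", "@": "%40", "[": "%5B", "\\": "%5C", "]": "%5D", "^": "%5E",
    "`": "%60", "{": "%7B", "|": "%7C", "}": "%7D",
}

# safe="" tells quote() not to exempt any character from encoding.
mismatches = {
    ch: quote(ch, safe="")
    for ch, expected in always.items()
    if quote(ch, safe="") != expected
}

print(mismatches)  # an empty dict means every code in the list agrees
```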

Never encode these:

0–9 (10 digits)

A–Z (26 uppercase or capital letters)

a–z (26 lowercase letters)

- (hyphen or minus sign)

. (dot, period, or full stop)

_ (underscore, lowline, low line, or underline)

~ (tilde or approximation sign)
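As a quick check of this list, the sketch below passes all 66 of these characters through Python’s urllib.parse.quote and confirms that they come back untouched. It assumes Python 3.7 or later, where the tilde is also left alone; the variable names are only illustrative.

```python
import string
from urllib.parse import quote

# 52 letters, 10 digits, and the four punctuation marks from the list above.
never = string.ascii_letters + string.digits + "-._~"

# Even with nothing declared safe, these characters pass through unchanged.
encoded = quote(never, safe="")

print(encoded == never)
```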

Sometimes encode these, depending on the criteria that follow below, so you likely should encode selectively and manually:

! (to %21) (exclamation point or bang)

$ (to %24) (dollar sign)

& (to %26) (ampersand)

' (to %27) (straight apostrophe or single straight quotation mark)

( (to %28) (opening parenthesis or left parenthesis)

) (to %29) (closing parenthesis or right parenthesis)

* (to %2A) (asterisk)

+ (to %2B) (plus sign)

, (to %2C) (comma)

; (to %3B) (semicolon)

< (to %3C) (opening angle bracket, left angle bracket, or less-than sign) (never allowed unencoded inside a URL)

= (to %3D) (equals sign)

> (to %3E) (closing angle bracket, right angle bracket, or greater-than sign) (never allowed unencoded inside a URL)

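When to encode a character from the last set depends on its role. If it is a delimiter that software has to see in plaintext before the whole string is decoded (for example, the & between two query parameters), leave it as it is. If it is part of the data and should not appear until after decoding (for example, an & inside a parameter value), encode it. That is not always obvious, but I don’t have more specific advice.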
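The mailto case mentioned at the start makes a convenient example. This is only a sketch under stated assumptions: the address, the subject, and the inner URL are all invented, and Python’s urllib.parse stands in for whatever encoder you use.

```python
from urllib.parse import quote, parse_qs

inner = "https://example.com/search?q=a&b=c"  # hypothetical URL to embed

# Inside the body field, every reserved character of the inner URL is data,
# so all of them are encoded (safe="" exempts nothing).
mailto = "mailto:someone@example.com?subject=Link&body=" + quote(inner, safe="")

# The "?", "&" and "=" still in plaintext belong to the mailto link itself.
fields = parse_qs(mailto.split("?", 1)[1])

print(mailto)
print(fields["body"][0])
```

If the inner & had been left in plaintext, the receiving software would have read b=c as a separate field of the mailto link and cut the embedded URL short.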

The angle brackets < and > may be used to enclose a URL but may not be inside a URL. They separate a URL from any adjacent characters that might be confusing, such as punctuation in a sentence.
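Here is a small sketch of why the enclosing brackets help, using a made-up sentence and a deliberately simple regular expression (an assumption of mine, not a full URL parser). The comma after the URL stays out of the match because it sits outside the closing bracket.

```python
import re

text = "See <https://example.com/page.html>, which explains it."  # made-up sentence

# Capture whatever sits between "<" and ">" with no spaces or brackets inside.
match = re.search(r"<([^<>\s]+)>", text)
url = match.group(1) if match else None

print(url)
```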

Letters other than the 26, such as letters with accent marks or other diacritics (diacritical marks), and other characters, of which there are thousands, can be percent-encoded, but not with a single code. One percent sign and its two hex digits stand for one byte, which allows only 256 values. The usual method, recommended in RFC 3986, Section 2.5, is to convert the character to its UTF-8 bytes and then percent-encode each byte, so that é becomes %C3%A9. Under RFC 3986, such characters are not allowed in a URL unencoded.
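One method that does exist is to go through UTF-8: the character is turned into bytes and each byte gets its own percent code. The sketch below shows this with Python’s urllib.parse.quote, which uses UTF-8 by default; the sample word is arbitrary.

```python
from urllib.parse import quote, unquote

word = "caf\u00e9"                 # "café", written with an escape for the é

utf8_bytes = word.encode("utf-8")  # b'caf\xc3\xa9': the é takes two bytes
encoded = quote(word, safe="")     # 'caf%C3%A9': one percent code per byte
decoded = unquote(encoded)         # back to the original word

print(encoded)
```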

The numbers after the percent signs are always two-digit hexadecimal numbers, or base-16 numbers, because two hex digits represent exactly one byte of the binary base-2 numbers, the zeros and ones that practically run computers. Hex numbers use 0–9 and A–F as digits. Lower-case and capitals when used in hex numbers mean the same thing (for example, a = A) but, for consistency, capitals should be used for those numbers, even if (for any reason) what you are encoding is a lower-case letter.
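Both points can be shown in a few lines of Python. The helper name uppercase_escapes is hypothetical, and the regular expression is a simple sketch that touches only well-formed percent codes.

```python
import re
from urllib.parse import unquote

def uppercase_escapes(text: str) -> str:
    """Rewrite every %xx code with capital hex digits; leave the rest alone."""
    return re.sub(r"%[0-9A-Fa-f]{2}", lambda m: m.group(0).upper(), text)

# Lower-case and capital hex digits decode to the same character.
same = unquote("%2f") == unquote("%2F") == "/"

# Only the hex digits change case; the ordinary letters a, b and c do not.
normalized = uppercase_escapes("a%2fb%3ac")

print(same, normalized)
```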

These characters come from the character set often known as US-ASCII, without nonprinting characters such as the tab. Additional characters such as those in high ASCII or that are not in ASCII at all, even if they might be in Unicode, are not included in the lists above. Characters that are not in US-ASCII cannot appear in a URL as they are; they have to be percent-encoded byte by byte. That includes characters from non-Latin alphabets or scripts. For the domain-name part of a URL, a different ASCII-based indirect representation (Punycode) has been developed, for Internationalized Domain Names.
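For the domain-name case, Python ships an "idna" codec that produces the ASCII-based form. Treat this as a sketch: the host name is made up, and the built-in codec follows the older IDNA 2003 rules, so other software may differ for some names.

```python
host = "b\u00fccher.example"            # "bücher.example", a made-up host name

ascii_host = host.encode("idna")        # b'xn--bcher-kva.example'
round_trip = ascii_host.decode("idna")  # back to the original host name

print(ascii_host.decode("ascii"))
```

Note that this representation applies to host names only; elsewhere in a URL such characters are percent-encoded instead.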

I know, all this is more work. Like that’s what we need. And maybe I’m being too cautious with specs. But we may need this to avoid technical problems from noncompliance.