Beware of HTML5 email validation of double-byte domains

One of the welcome features of HTML5 is the addition of many more form input types and properties: email, tel, number, etc. Currently only the latest versions of Chrome, Opera and Safari actually validate fields and prevent form submission if something's not right, but I've recently discovered a problem: all three browsers incorrectly reject email addresses with double-byte internationalised domain names (IDNs):

Valid: <input type="email" name="email" value="kyle@example.com">
Invalid: <input type="email" name="email" value="kyle@日本.jp">

Personally I think IDNs are a stupid idea and will lead to the balkanization of the Internet but they should still pass email validation. I suspect IDN email validation isn't working in Chrome, Opera and Safari for a couple of reasons:

  • IDNs rely on client software to convert multibyte domains to ASCII (Punycode) before they even touch the Internet. Modern browsers do this when you type a double-byte URL, and Chrome, Opera and Safari seem to do it for the HTML5 url input but not for the email input, so double-byte email domains are passed raw for validation.
  • Double-byte characters are not recognised by RegEx as \w so fail the domain name search used by almost all RegEx email validators: ([\w-\.]+)@((?:[\w]+\.)+)([a-zA-Z]{2,4}). Chrome, Opera and Safari are probably using something like this to validate email so will mark any raw double-byte domain as invalid.
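
Both points can be checked outside a form with a few lines of JavaScript. This is only a sketch: it uses the generic validator regex quoted above (not any browser's actual code), rewrites its character class as [\w.-] (the same set of characters) and adds anchors so the whole string must match. The helper names isWordChar and hostToAscii are made up for illustration, and the conversion relies on the standard URL parser available in current browsers and Node.

```javascript
// The validator regex quoted above, anchored so the entire value must match.
// [\w.-] is the same character set as the original [\w-\.].
const legacyEmailRegex = /^([\w.-]+)@((?:\w+\.)+)([a-zA-Z]{2,4})$/;

// In JavaScript, \w is ASCII-only: [A-Za-z0-9_].
function isWordChar(ch) {
  return /^\w$/.test(ch);
}

// Convert a hostname to its ASCII (Punycode) form with the URL parser,
// the same kind of conversion a browser applies to a typed URL.
function hostToAscii(host) {
  return new URL("http://" + host).hostname;
}

console.log(isWordChar("a"));                            // true
console.log(isWordChar("日"));                           // false
console.log(legacyEmailRegex.test("kyle@example.com"));  // true
console.log(legacyEmailRegex.test("kyle@日本.jp"));       // false
console.log(hostToAscii("日本.jp"));                      // xn--wgv71a.jp
```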

The simple solution is for Chrome, Opera and Safari to apply the same double-byte to ASCII conversion they apply to URLs to email addresses as well. Until this happens, a simple workaround is to use the text input type with a looser HTML5 pattern property instead of the email input type:

<input type="text" name="email" value="kyle@日本.jp" pattern="[^ @]*@[^ @]*">
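
To see what this pattern accepts without loading a form, the check can be approximated in JavaScript. A pattern attribute has to match the entire value, so the sketch below wraps it in ^(?: ... )$. The function name matchesPattern is hypothetical, and this is an approximation of browser behaviour rather than the browsers' own code.

```javascript
// Approximates how a pattern attribute is applied: the expression must
// match the whole value, as if it were wrapped in ^(?: ... )$.
function matchesPattern(pattern, value) {
  return new RegExp("^(?:" + pattern + ")$", "u").test(value);
}

// The loose pattern from the workaround above.
const loosePattern = "[^ @]*@[^ @]*";

console.log(matchesPattern(loosePattern, "kyle@example.com")); // true
console.log(matchesPattern(loosePattern, "kyle@日本.jp"));      // true
console.log(matchesPattern(loosePattern, "kyle example.com")); // false (no @)
console.log(matchesPattern(loosePattern, "a@b@c"));            // false (two @)
console.log(matchesPattern(loosePattern, "@"));                // true
```

The last line shows how loose the pattern is: a lone @ passes, so server-side validation is still needed.
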
  1. I've modified the RegEx to handle the .travel TLD. Thanks to @pornelski for the heads up.

  2. And IDN TLDs. Thanks again to @pornelski for info on this.

  3. I find it surprising that the regexes didn't have \w matching Unicode letters. When you do that kind of thing in Java's implementation, for instance, you get that for free.

  4. After having a read of the HTML5 Specification, it appears that this is where the issue crops up.

    In the spec, in the section discussing valid email addresses, the domain portion is described as being "defined in RFC 1034 section 3.5."
    http://dev.w3.org/html5/spec/Overview.html#e-mail-state

    Because of this, implementers have only gone as far as validating an email-state text field as per the specification, the extra step of Punycode encoding the domain portion hasn't been done.

    I'm guessing the best place to bring this up is via the methods discussed in the HTML5 spec:
    "If you wish to make comments regarding this document, please send them to public-html-comments@w3.org (subscribe, archives) or whatwg@whatwg.org (subscribe, archives), or submit them using our public bug database. All feedback is welcome."

    As a side note, the url-state text field part of the specification links to a spec which makes allowances for IDNs.

  5. After looking through the WebKit source, the email pattern as of today is this (it is not using \w at all):

    static const char emailPattern[] =
    "[a-z0-9!#$%&'*+/=?^_`{|}~.-]+" // local part
    "@"
    "[a-z0-9-]+(\\.[a-z0-9-]+)+"; // domain part

    From a casual look, these seem to match the HTML5 spec at the moment.

  6. Thanks for clarifying this Alex. Problem's already been sent upstairs.
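
The WebKit pattern quoted in comment 5 can be combined with the missing Punycode step in a short JavaScript sketch. Assumptions: the pattern is matched case-insensitively and against the whole value (the snippet in the comment does not show how it is compiled), the domain is converted with the standard URL parser, and the function names toAsciiEmail and isValidEmail are invented for this example. It illustrates the idea only and is not what any browser ships.

```javascript
// WebKit's pattern as quoted in comment 5, anchored to the whole value.
// The "i" flag is an assumption about how the pattern is compiled.
const webkitEmailPattern = new RegExp(
  "^[a-z0-9!#$%&'*+/=?^_`{|}~.-]+" + // local part
  "@" +
  "[a-z0-9-]+(\\.[a-z0-9-]+)+$",     // domain part
  "i"
);

// Convert the domain part of an address to ASCII (Punycode).
// Returns null when the address cannot be split or converted.
function toAsciiEmail(address) {
  const at = address.lastIndexOf("@");
  if (at < 1) return null; // no @, or an empty local part
  const local = address.slice(0, at);
  const domain = address.slice(at + 1);
  // Refuse characters the URL parser would treat as path, port, query etc.
  if (/[\/\\?#:\s]/.test(domain)) return null;
  try {
    return local + "@" + new URL("http://" + domain).hostname;
  } catch {
    return null; // the URL parser rejected the domain
  }
}

// Validate after conversion, so IDN domains are checked in ASCII form.
function isValidEmail(address) {
  const ascii = toAsciiEmail(address);
  return ascii !== null && webkitEmailPattern.test(ascii);
}

console.log(webkitEmailPattern.test("kyle@日本.jp")); // false (raw IDN)
console.log(toAsciiEmail("kyle@日本.jp"));            // kyle@xn--wgv71a.jp
console.log(isValidEmail("kyle@日本.jp"));            // true
console.log(isValidEmail("kyle@example.com"));       // true
console.log(isValidEmail("kyle"));                   // false
```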
