Enable search and notifications for email addresses using the "+" syntax

A lot of people use a syntax such as troyhunt+foo@hotmail.com where foo is a unique identifier for the site. They do this so that if they begin getting spammed, they can identify the source their email came from.

At the moment, HIBP treats this as a totally unique email address, so if I search for the parent email address without the "+" syntax, the aliased address won't be found. The idea is to ensure that searches and notifications recognise the syntax and return addresses that are logically still the same account.

One thing HIBP would also need to do is specify which account alias was in the breach or paste. For example, I would want to know that it was troyhunt+bar@hotmail.com that was exposed in the XYZ breach.

Edit: Just to put the value of this into context, I've just run some stats on the Adobe breach. Of the 152,989,508 rows in the dump, only 49,905 email addresses have a "+" in the address, so that's 0.03% of entries. That number is also a bit high as it includes junk entries. I'm definitely not ruling this idea out - it's still planned - I just wanted to give a sense of how useful it would be.
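
As a quick check of the arithmetic above, the two Adobe figures can be turned into a percentage and a "1 in N" ratio. This is only a sketch; the variable names are illustrative and the inputs are the two numbers quoted in the edit.

```python
# Figures quoted above for the Adobe breach.
total_rows = 152_989_508
plus_addresses = 49_905

# Fraction of rows whose address contains a "+".
share = plus_addresses / total_rows

print(f"{share:.2%} of rows contain a '+'")
print(f"roughly 1 in {round(1 / share):,} rows")
```

The result is a little over three hundredths of a percent, or about one row in three thousand, which matches the 0.03% stated above.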

Edit: To add to this idea, Robert's comment about a period in the email is also very valid. I'd want to be very clear about the ubiquity of this practice across mail providers, but it's certainly a good suggestion and worth further investigation.

963 votes
    Troy Hunt (Admin, Have I been pwned?) shared this idea

    55 comments

      • seizedengine commented

        Adding my vote to this. It's completely understandable that it is a significant development effort for a very small percentage of people; however, that group would appreciate it greatly. I think the people who use plus aliasing are also strongly security aware and more likely to be subscribers to HIBP. In fact, I am using a plus-aliased account to post this comment.

      • Anonymous commented

        I think people who use + emails are both more likely to use haveibeenpwned and less likely to have their passwords compromised due to being more selective about the websites they use.

      • Troy Hunt (Admin, Have I been pwned?) commented

        It's not that simple, Paul. There's a lot of other downstream impact from now having more data in the database than was originally in the breach. There are other processes this feeds into, not to mention the way it changes the search for the reasons I've already mentioned.

        At this point in time, the fact remains that this pattern is used by almost nobody based on the data I'm seeing in the breaches. I'll keep assessing it and I *would* like to do this at some point, but it'd be a very bad ROI on the effort right now.

      • Paul commented

        I feel like you're making this too hard. If the input is troyhunt+bar@hotmail.com then just load it into your database as troyhunt@hotmail.com. Done. Easy peasy. Strip the + during search, too. If a user searches troyhunt+bar@hotmail.com, give them results for troyhunt@hotmail.com (which includes all results from all aliases).

        Or if for data integrity you really want to keep the + addresses in the database, just load the address twice; once as troyhunt@hotmail.com and once as troyhunt+bar@hotmail.com (for the rare instances where a + is actually part of the e-mail).

        You said these are in there very rarely, so you won't be duplicating much data. And for users who use +addresses, the website is somewhat useless. The whole point of +addresses is they're throwaway. I create one on the spot and forget I created it. It's not possible for me to search for every +address I've used. I'd rather have results that are overly cautious (troy.hunt@hotmail.com AND/OR an alias of troy.hunt@hotmail.com was pwned) than just not have any clue if an alias I used was pwned.

        ======

        Alternatively, handle this on the search end. You allow domain owners to search for multiple addresses based on domain. Maybe if I prove I own troyhunt@hotmail.com then I can search for permutations (based on rules you set up). For Gmail maybe you allow + and dot. For Hotmail just +. Etc. If this is a rarely used feature then optimization probably isn't very important. Results could just come as a spreadsheet/JSON like the domain results do.

        ======

        Re: the frequency.... If I were a spammer, whenever I saw a +address I would strip off the + and everything after it anyway. If any processing was done on these dumps, that could make the + seem even rarer... I do think this is a rarely used feature, but it could be a percentage point or two more frequent than it seems.
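
The "load it twice, strip the + on search" idea from the start of this comment could be sketched roughly as follows. This is a minimal illustration, not how HIBP stores data: the in-memory `index`, and the function names `strip_plus_tag`, `load` and `search`, are all hypothetical.

```python
from collections import defaultdict


def strip_plus_tag(address):
    """Drop a "+tag" from the local part: user+tag@host -> user@host."""
    local, sep, domain = address.rpartition("@")
    base = local.split("+", 1)[0]
    if not sep or not base:
        return address  # no "@", or nothing before the "+": leave untouched
    return f"{base}@{domain}"


# Hypothetical in-memory index: address -> set of breach names.
index = defaultdict(set)


def load(address, breach):
    """Record the address as given and again under its stripped form."""
    index[address].add(breach)
    index[strip_plus_tag(address)].add(breach)


def search(address):
    """Strip the tag at search time so any alias finds the parent's results."""
    return set(index.get(strip_plus_tag(address), set()))


load("troyhunt+bar@hotmail.com", "XYZ")
print(search("troyhunt@hotmail.com"))
print(search("troyhunt+foo@hotmail.com"))
```

Both searches return the XYZ breach, which is the "overly cautious" behaviour described above. The trade-off is that the stripped entry no longer says which alias was exposed, and the store holds more rows than the original breach did.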

      • Rob commented

        Hi Troy - first off thanks for the fantastic facility you provide via this website.

        Personally I use gmail accounts for most things. With regard to the placement of the ‘.’, I only use two variants - one without any, and one with them in to make the address more readable, so that's not a big deal for me; however, I make heavy use of ‘+ addressing’, not necessarily for major websites, but for things like forums, newsletters, sites I think may spam etc.

        I've been analyzing the haveibeenpwned report for our company domain. There are just over 1200 entries. Of those there are only 4 entries that are not included in the "Online Spambot" list, and those four are all genuine users.
        As a sample I've gone through all users starting with a "c", and at most 12 of 144 are potentially genuine. The vast majority of the rest look to be auto-generated, plus some invalid ones but based on real users' surnames.

        Based on those stats and also the unlikelihood of auto-generated spam email addresses being created with plus addressing, I therefore suspect any addresses that do contain plus addressing are very likely to be genuine accounts.

        Extracting and entering into the site all the plus variants of my email addresses from my password safe, and then on an ongoing basis adding new ones every time I sign up to a site could be quite onerous.

      • Stijn Crevits commented

        I too use the +string method to identify sites that leaked my email address to spam lists. So this of course results in a long list of variations of email.address+string@provider.com.
        It would of course be nice if I could receive HIBP notifications for all of them, without having to enter each of these email address permutations.

        But I can understand that this request results in some difficulties, as mentioned by others in the topic.

        To David's comments, it appears that the email provider of Brian Krebs DOES follow the spec to heart and differentiates between email addresses based on capitalization (see https://twitter.com/briankrebs/status/940362654434168833).
        However, the implementation of the spec differs between providers. As mentioned earlier, Google doesn't care about the periods, some providers (or sites) don't allow the use of + or other signs, ...

        The question for ignoring everything after a +-sign would be whether there are email providers who allow registering different email addresses based on the value behind the +-sign. I.e. could foobar+alice@provider.com be a different user than foobar+bob@provider.com?

      • Troy Hunt (Admin, Have I been pwned?) commented

        To David's comments, this shows how tricky the situation is; there's the spec, the practices of various mail providers and then the patterns people generally use. I'm very cautious about making assumptions on these as they may not always hold true under all circumstances, which then means ending up with a kludge of provider-specific hacks (e.g. always ignore the dot in Gmail addresses). I'm sure everyone can see the challenge and even if solved, there's still just that tiny percentage of people for whom it would make any difference at all.

      • David A Bacher commented

        Periods are significant in the local part of email addresses per RFC 5322 (https://tools.ietf.org/html/rfc5322#section-3.4.1).

        The local part of an email address is case sensitive, and period, plus and hyphen are significant characters.

        It's safe to assume that no sane system administrator is going to set up mailboxes that differ only in case, or where the local name is a bunch of quoted-printable emojis, etc. However, note that I used the word "sane" so there are probably thousands, maybe millions of systems doing it out there somewhere on the Internet. :P

        Also, hyphen is valid in the global DNS system, and so whatever you do -- don't just strip it from the whole address. That causes significant problems for users whose domains actually have a minus sign in them.

        But if you do this sort of normalization, easiest way is a set of regular expression substitutions based on the domain name. Since the local part is determined by the ISP in question, the rules have to vary and so you worry about the big guys.
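
A set of regular expression substitutions keyed on the domain name, as suggested above, might look something like this sketch. The rule table is hypothetical and the provider behaviour it encodes (Gmail ignoring dots and "+" tags, Hotmail ignoring "+" tags) is taken from other comments in this thread rather than verified. Rules touch only the local part, so hyphens in domain names are never altered.

```python
import re

# Hypothetical (pattern, replacement) rules, applied to the local part only.
PLUS_TAG = (re.compile(r"\+.*$"), "")  # drop "+" and everything after it
DOTS = (re.compile(r"\."), "")         # drop every period

# Assumed provider behaviour; unknown domains get no rules at all.
RULES = {
    "gmail.com": [PLUS_TAG, DOTS],
    "hotmail.com": [PLUS_TAG],
}


def normalize(address):
    """Apply the substitutions registered for the address's domain."""
    local, sep, domain = address.rpartition("@")
    if not sep:
        return address  # not an address we can split; leave untouched
    domain = domain.lower()  # domain names are case-insensitive
    new_local = local
    for pattern, replacement in RULES.get(domain, []):
        new_local = pattern.sub(replacement, new_local)
    # Never return an empty local part; fall back to the original.
    return f"{new_local or local}@{domain}"


print(normalize("troy.hunt+bar@gmail.com"))
print(normalize("troy.hunt+bar@hotmail.com"))
print(normalize("first-last+x@my-domain.example"))
```

Leaving unknown domains untouched is the conservative choice here: an address is only rewritten when a rule for its provider has been deliberately added.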

      • Mike commented

        Well, if anyone would know it'd be you. Thanks for your willingness to engage!

      • Troy Hunt (Admin, Have I been pwned?) commented

        Mike, you'd be surprised at how mainstream the HIBP user base is, largely because of how much press it gets in the general media. But even if I was off by a factor of 10 (which I'm almost certainly not), in an incident like River City Media, the percentage of people using this pattern rounds to 0% even with 2 decimal points of precision!

        I understand this is important to the people using it, but I need to look at the impact from the effort and at present, it remains near non-existent.

      • Mike commented

        Troy, I'd argue that your user base is not represented by the data in breaches. Obviously a very small percentage of people in the world use a + in their email (as evidenced by your research). But, I'd wager that a much larger percentage of people using HIBP do.

        I assume your hope is that even the most technologically illiterate users come to HIBP. However, I imagine that most users are already security conscious and don't fall into that group.

      • Troy Hunt (Admin, Have I been pwned?) commented

        Since Antonios has left a comment and I've also just loaded the largest data set ever into HIBP, I thought I'd add a current figure to the discussion here:

        0.0038% was the percentage of people with a + in their email address in the River City Media spam list. 1 in every 26k people is a hard ROI to justify when there's a fair bit of work to invest!

        I'll keep monitoring the use of this pattern, but as of now, it remains *exceptionally* rare.

      • Antonios Chariton commented

        From your blogs I think you are using a key/value data structure, which means that when a query comes in, your data store needs an exact *key* to find the value (whether it has been breached or not). That's probably the best data structure for HIBP since it can scale infinitely; however, it will not allow you to query troy+*@hunt.com. I guess the only way to address that is to either canonicalize the data as you add it, by removing everything after "+" (or ".", or "-"), which means this will only work with new data sets, or change the table schema / contents of "Value", which is very unlikely to happen. Another solution would be to create a new "table" with all e-mails containing "+", ".", or "-", and then query both when someone requests information, only this time you format the "Value" of those "Keys" accordingly. Although it may seem like a lot of work, the earlier it is done, the better, as it will include more datasets.
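
The second-table idea above could be sketched like this, using plain dictionaries as stand-ins for the two key/value tables. Everything here is hypothetical (the table names `exact` and `aliases`, and the functions `parent_of`, `add` and `query`), and only the "+" case is handled, not "." or "-".

```python
from collections import defaultdict

# Hypothetical stand-ins for the two key/value "tables".
exact = defaultdict(set)    # full address   -> breach names
aliases = defaultdict(set)  # parent address -> (tagged address, breach name)


def parent_of(address):
    """user+tag@host -> user@host; returns None when there is no tag."""
    local, sep, domain = address.rpartition("@")
    base, plus, _tag = local.partition("+")
    if not sep or not plus or not base:
        return None
    return f"{base}@{domain}"


def add(address, breach):
    """Store the address unchanged; also file tagged ones under their parent."""
    exact[address].add(breach)
    parent = parent_of(address)
    if parent is not None:
        aliases[parent].add((address, breach))


def query(address):
    """Two exact-key lookups, so no wildcard scan is needed."""
    parent = parent_of(address) or address
    hits = {(parent, breach) for breach in exact.get(parent, ())}
    hits |= aliases.get(parent, set())
    return sorted(hits)


add("troyhunt@hotmail.com", "Adobe")
add("troyhunt+bar@hotmail.com", "XYZ")
for found_address, breach in query("troyhunt@hotmail.com"):
    print(found_address, breach)
```

Because the alias table keeps the full tagged address in its value, the result can still say which alias appeared in which breach, and the original table is left exactly as it was loaded.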

      • Henrik commented

        Unfortunately there are some online services that don't accept email addresses with a + sign. I had problems with 2 services in the last month.

      • Troy Hunt (Admin, Have I been pwned?) commented

        To Kem's question, we're *always* talking tiny percentages. I just checked the last set of data I loaded which was a spam list and only 0.009% of emails used the + syntax.

        This is something I still want to add folks, but it'll be to the benefit of a tiny percentage of the community.

      • Kem Jones commented

        Troy,

        I'm in the 0.03% and have been for years. It's been a fantastic way to identify abusers of my email address. I'd love to see this feature implemented and would be glad to help any way I can.

        The 0.03% stats appear to be from November 2014. Do you have more recent stats? (Today is November 29, 2016.) I'm curious if more people have caught on to this technique these days...

        Thanks,
        Kem

      • Wout Mertens commented

        I would be perfectly happy if this were only implemented for new breaches and if it didn't tell me the exact tagged e-mail address.

        Under these conditions you would only need to canonicalize email addresses as they come in and the rest of the code would work as-is. Convert to lowercase, strip generic subfields, add a special case for gmail dots and yahoo hyphens, store it and that would be it.

      • Anonymous commented

        Would it not be possible to just add a "This email address has tags" checkbox to the search, with a small tooltip telling confused people what it is? That way the extra code to search for tagged addresses never gets executed for the 99% of people it's not relevant for.
