I feel like you're making this too hard. If the input is troyhunt+bar@hotmail.com, just load it into your database as troyhunt@hotmail.com. Done. Easy peasy. Strip the + during search, too: if a user searches for troyhunt+bar@hotmail.com, give them the results for troyhunt@hotmail.com (which would include all results from all aliases).
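To make that concrete, here's a rough sketch of the stripping step in Python. The function name is made up for illustration, and it assumes everything from the first + up to the @ is an alias tag, which isn't true for every mail provider.

```python
def strip_plus_alias(address: str) -> str:
    """Return the address with any '+tag' removed from the local part.

    Hypothetical helper: assumes '+' always starts an alias tag.
    """
    address = address.strip()
    # Split on the last '@' so the domain is isolated from the local part.
    local, sep, domain = address.rpartition("@")
    if not sep:
        # No '@' at all: nothing sensible to strip, return it unchanged.
        return address
    base = local.split("+", 1)[0]
    if not base:
        # Local part starts with '+': treat the '+' as part of the address.
        return address
    return f"{base}@{domain}"


# Apply the same function when loading a dump and when handling a search.
print(strip_plus_alias("troyhunt+bar@hotmail.com"))
```

Running the same function on both the load side and the search side is what makes a search for any alias land on the base address.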
Or, if for data integrity you really want to keep the + addresses in the database, just load the address twice: once as troyhunt@hotmail.com and once as troyhunt+bar@hotmail.com (for the rare instances where a + is actually part of the e-mail).
You said these are in there very rarely, so you won't be duplicating much data. And for users who use +addresses, the website is somewhat useless as it stands. The whole point of +addresses is that they're throwaway. I create one on the spot and forget I created it. It's not possible for me to search for every +address I've ever used. I'd rather have results that are overly cautious (troyhunt@hotmail.com AND/OR an alias of troyhunt@hotmail.com was pwned) than not have any clue whether an alias I used was pwned.
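Here's roughly what I mean by loading twice, sketched in Python with an in-memory dict standing in for the real database. The names (`base_address`, `load`, `index`) and the breach labels are all made up for illustration; I have no idea what the actual storage looks like.

```python
from collections import defaultdict


def base_address(address: str) -> str:
    """Drop a '+tag' from the local part (hypothetical helper)."""
    local, sep, domain = address.rpartition("@")
    base = local.split("+", 1)[0]
    if not sep or not base:
        # Not a recognisable address, or the local part starts with '+'.
        return address
    return f"{base}@{domain}"


def load(index: dict, address: str, breach: str) -> None:
    """Record the breach under the address as given AND under its base form."""
    index[address].add(breach)
    # For an address with no '+', both keys are the same, so nothing is duplicated.
    index[base_address(address)].add(breach)


index = defaultdict(set)  # address -> set of breach names
load(index, "troyhunt+bar@hotmail.com", "ExampleBreach")
load(index, "troyhunt@hotmail.com", "OtherBreach")

# Searching the base address now reports breaches from the alias as well.
print(sorted(index.get("troyhunt@hotmail.com", set())))
```

The exact +address is still searchable on its own, so nothing is lost if the + really is part of someone's address.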
======
Alternatively, handle this on the search end. You already allow domain owners to search for multiple addresses based on domain. Maybe if I prove I own troyhunt@hotmail.com, then I can search for permutations of it (based on rules you set up). For Gmail maybe you allow + and dot. For Hotmail just +. Etc. If this is a rarely used feature, then optimization probably isn't very important. Results could just come as a spreadsheet/JSON like the domain results do.
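A rough Python sketch of the per-provider rules idea, for what it's worth. The rule table, the function names and the sample addresses are all assumptions of mine; the real rules would be whatever you decide each provider actually does. It scans linearly, on the theory that speed doesn't matter much for a rarely used feature.

```python
# Hypothetical rule table: which alias tricks each provider is assumed to allow.
ALIAS_RULES = {
    "gmail.com": {"plus": True, "dots": True},
    "hotmail.com": {"plus": True, "dots": False},
}


def canonical(address: str) -> str:
    """Reduce an address to its base form under the provider's rules."""
    local, sep, domain = address.rpartition("@")
    if not sep:
        return address
    domain = domain.lower()
    rules = ALIAS_RULES.get(domain)
    if rules is None:
        # Unknown provider: assume no aliasing at all.
        return f"{local}@{domain}"
    if rules["plus"]:
        # Drop the '+tag'; keep the local part as-is if it starts with '+'.
        local = local.split("+", 1)[0] or local
    if rules["dots"]:
        local = local.replace(".", "")
    return f"{local}@{domain}"


def find_aliases(owned: str, stored: list) -> list:
    """Return every stored address that reduces to the same base as `owned`."""
    target = canonical(owned)
    return [a for a in stored if canonical(a) == target]


stored = [
    "troyhunt+bar@hotmail.com",
    "troy.hunt@hotmail.com",
    "troy.hunt+x@gmail.com",
    "troyhunt@gmail.com",
]
print(find_aliases("troyhunt@hotmail.com", stored))
print(find_aliases("troyhunt@gmail.com", stored))
```

Note that under these assumed rules troy.hunt@hotmail.com is not treated as an alias of troyhunt@hotmail.com, while the dotted Gmail address is matched to its undotted form.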
======
Re: the frequency... If I were a spammer, whenever I saw a +address I would strip off the + and everything after it anyway. If any processing like that was done on these dumps before you got them, it would make +addresses look even rarer than they really are. I do think this is a rarely used feature, but it could be a percentage point or two more frequent than it seems.