Yes, stay tuned to this thread: https://twitter.com/troyhunt/status/1164291579705610240
Can you expand on what you mean by "helpful for notifications"? Even with a filter, you'd still need to run the same number of queries and the data returned is small and compressed, plus you can pull the date of the incident from the API that lists the breaches and easily filter the returned records that way.
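As a rough illustration of that client-side filtering, here's a minimal Python sketch. It assumes you've already pulled the list of breaches and the breach names returned for an address. The sample data, the `breaches_since` helper and the cut-off date are all made up for illustration, and the `Name` and `BreachDate` field names are an assumption about what the breach list contains.

```python
from datetime import date

# Hypothetical sample standing in for the response from the API that lists the breaches.
ALL_BREACHES = [
    {"Name": "Adobe", "BreachDate": "2013-10-04"},
    {"Name": "ExampleCorp", "BreachDate": "2019-06-01"},
]

def breaches_since(account_breach_names, all_breaches, cutoff):
    """Keep only the account's breaches that occurred on or after the cut-off date."""
    dates = {b["Name"]: date.fromisoformat(b["BreachDate"]) for b in all_breaches}
    return [n for n in account_breach_names if n in dates and dates[n] >= cutoff]

# Breach names previously returned for one address (again, sample data only).
recent = breaches_since(["Adobe", "ExampleCorp"], ALL_BREACHES, date(2019, 1, 1))
print(recent)
```

The breach list only needs to be fetched once and can then be reused for every address you check.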
I'm keeping this idea open but also still under consideration. The main reason is that the present model of searching domains ensures the requestor still has control of the domain at the point of search. A persistent API key could still be used by someone who leaves the organisation and should no longer be authorised to access the data.
That's the background. There are a few things in the pipeline that *may* make this more feasible, but there's no timeline as of now.
No change, there's still a threshold beyond which I simply can't return the data within a reasonable amount of time without impacting the service.
I'll leave this idea sitting here as it's something I'd like to optimise in the future, but this is also only an issue when there's a huge result set due to either the number of domains or the number of email addresses on them. At some point, there has to be a cut-off on execution time. That said, I do think I can optimise the query by asyncing more of the Table Storage requests. Stay tuned.
No change to status as of now, complexity remains the same as does prevalence of use.
If everything after the + is stripped, that information is no longer available to the owner of the address. For example, if I load a spam list and someone used "+netflix" then they no longer know it came from Netflix. Yes, they'd have to explicitly check that address but many people also have domain-wide searches and this would screw that up.
In short, nothing yet has changed with this idea: the pattern is still at very close to 0% usage and the same barriers still exist.
Then you lose the information about where the breach likely came from which, in cases like the last breach, is very important to people. Plus, applying this to one sole email provider feels exceptionally dirty and misses the same pattern used by other providers (e.g. outlook.com).
It's not that simple Paul, there's a lot of other downstream impact from now having more data in the database than was originally in the breach. There are other processes this feeds into, not to mention the way it changes the search for the reasons I've already mentioned.
At this point in time, the fact remains that this pattern is used by almost nobody based on the data I'm seeing in the breaches. I'll keep assessing it and I *would* like to do this at some point, but it'd be a very bad ROI on the effort right now.
To David's comments, this shows how tricky the situation is; there's the spec, the practices of various mail providers and then the patterns people generally use. I'm very cautious about making assumptions on these as they may not always hold true under all circumstances, which then means ending up with a kludge of provider-specific hacks (e.g. always ignore the dot in Gmail addresses). I'm sure everyone can see the challenge and even if solved, there's still just that tiny percentage of people for whom it would make any difference at all.
Mike, you'd be surprised at how mainstream the HIBP user base is, largely because of how much press it gets in the general media. But even if I was off by a factor of 10 (which I'm almost certainly not), in an incident like River City Media, the percentage of people using this pattern rounds to 0% even with 2 decimal places of precision!
I understand this is important to the people using it, but I need to look at the impact from the effort and at present, it remains near non-existent.
Since Antonios has left a comment and I've also just loaded the largest data set ever into HIBP, I thought I'd add a current figure to the discussion here:
0.0038% was the percentage of people with a + in their email address in the River City Media spam list. 1 in every 26k people is a hard ROI to justify when there's a fair bit of work to invest!
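For anyone wanting to check the arithmetic, here's a quick Python sketch that converts the quoted percentage into a "1 in N" figure. The variable names are purely illustrative.

```python
# Percentage of addresses containing a + in the River City Media list, as quoted above.
share_percent = 0.0038

# A percentage p means p in every 100, so that's 1 in every (100 / p).
one_in = 100 / share_percent

print(f"1 in every {one_in:,.0f}")
```

That works out to a little over 26k, which is where the figure above comes from.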
I'll keep monitoring the use of this pattern, but as of now, it remains *exceptionally* rare.
To Kem's question, we're *always* talking tiny percentages. I just checked the last set of data I loaded which was a spam list and only 0.009% of emails used the + syntax.
This is something I still want to add folks, but it'll be to the benefit of a tiny percentage of the community.
Since there's a comment in here about "go read the RFC", here's the RFC that describes subaddressing: https://tools.ietf.org/html/rfc5233
Adding both the full address with the "+" and a normalised one without it would be an option, but it's difficult to do retrospectively and would mean enumerating back through hundreds of millions of records (I can't just query for everything containing a plus due to the data structure).
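To make "both the full address and a normalised one" concrete, here's a minimal Python sketch of what that could look like at load time. It assumes the "+" always marks a subaddress in the local part, which won't hold for every mail provider, and `normalise` and `addresses_to_store` are hypothetical helpers rather than anything HIBP actually runs.

```python
def normalise(email):
    """Drop everything from the first + in the local part up to the @."""
    local, at, domain = email.rpartition("@")
    base, _plus, _detail = local.partition("+")
    return base + at + domain

def addresses_to_store(email):
    """Return the full address, plus the normalised one if it differs."""
    normalised = normalise(email)
    return [email] if normalised == email else [email, normalised]

print(addresses_to_store("user+netflix@example.com"))
print(addresses_to_store("user@example.com"))
```

Keeping the full address means the "+netflix" part is still there for the owner to see, while the normalised copy is what a search without the plus would hit.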
I'll continue to monitor this and if it either becomes easier to implement or more popular with users (it's still *extremely* rare) then I'll reassess.
The problem with a domain alias is that it's even less predictable than the "+" pattern. There's no assurance whatsoever that all domains will continue to work consistently in an interchangeable fashion nor is there necessarily a canonical list of them for each email provider.
I'd also be surprised if many people actually used them in that way. Per the stats here, use of the "+" syntax is extremely rare and I can only imagine that domain substitution is even less so.
That's along the lines of what I've been considering Nate, the main challenge is that when someone searches for an address without a plus in it, I need to be able to pull back the address WITH the plus in it. This means that I need to store a reference in the table without the plus which the current data scheme doesn't support.
Right now, there's a key and then a comma delimited list of impacted breaches. I'd need to add the account with the plus alongside the breach and not only that, but there could be MANY instances of accounts with a plus on the same breach for the same user. For example, let's say that against email@example.com on Adobe I need to support firstname.lastname@example.org and email@example.com. Now we're talking about a collection of related addresses against the occurrence of the master address next to the breach.
I don't mind adding an extra table query for the rare instance where the plus symbol is used and I also don't mind iterating back through every existing row for a bulk update of some kind, the main thing is that for the 99%+ of searches where there's no plus, I don't want to add an overhead for those guys.
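To make the shape of the problem a bit more concrete, here's a rough Python sketch of one possible layout. It is not how HIBP stores data: the dictionaries merely stand in for tables, and `ROWS`, `PLUS_ALIASES`, the `has_plus_aliases` flag, the `search` function and all the addresses are hypothetical.

```python
# Stand-in for the main table: a key, a comma delimited list of breaches and
# a hypothetical flag saying whether related plus addresses exist.
ROWS = {
    "user@example.com": {"breaches": "Adobe", "has_plus_aliases": True},
    "user+foo@example.com": {"breaches": "Adobe", "has_plus_aliases": False},
    "user+bar@example.com": {"breaches": "Adobe,ExampleCorp", "has_plus_aliases": False},
    "other@example.com": {"breaches": "Adobe", "has_plus_aliases": False},
}

# Hypothetical extra table, only populated in the rare case a plus was seen.
PLUS_ALIASES = {
    "user@example.com": ["user+foo@example.com", "user+bar@example.com"],
}

def split_breaches(value):
    """Turn the comma delimited list into a Python list, ignoring empties."""
    return [name for name in value.split(",") if name]

def search(email):
    """Return {address: [breaches]} for the address and any related plus addresses."""
    results = {}
    row = ROWS.get(email)
    if row is None:
        return results
    if row["breaches"]:
        results[email] = split_breaches(row["breaches"])
    if row["has_plus_aliases"]:  # the extra query only happens in the rare case
        for alias in PLUS_ALIASES.get(email, []):
            alias_row = ROWS.get(alias)
            if alias_row and alias_row["breaches"]:
                results[alias] = split_breaches(alias_row["breaches"])
    return results

print(search("other@example.com"))
print(search("user@example.com"))
```

In this sketch a search for an address with no related plus addresses still costs a single lookup, and only a flagged row triggers the extra queries. The catch is that setting that flag on existing rows is exactly the kind of bulk update mentioned above.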
Totally right @anonymous, it's resolving the relationship between x+y@ and x@ that's the tricky bit. It'd be easy just to allow the breach to be found when searching for x@ (I'd just add it as a standalone record), but there's no construct at present to turn that around and advise the user that x@ was breached by virtue of x+y@ being breached.
Hi Scott, the V2 API is definitely rate limited as described here: https://haveibeenpwned.com/API/v2
Have I incorrectly stated it's not somewhere? I'll fix that if so.
Which API? If it's the one to pull back breaches for a single email address, don't you already know the email address as you've just sent it in the API request?
Thanks for the suggestion, I've renamed the title to reflect what you're requesting in the body.
This was completed a while ago but I neglected to update the idea here. See the API docs page: https://haveibeenpwned.com/API/v2#BreachesForAccount
Adding the ?truncateResponse=true query string returns just the name attribute of the breach, for example: https://firstname.lastname@example.org?truncateResponse=true
Are you getting notifications for when an email on the domain appears in an incident? This should save you running it periodically.
In terms of filtering, perhaps try the Excel export option then use the filters in there.
"who is running the website, where are the servers, what you do collect and do not collect."
All of this is already in the FAQs. If you're saying that your company can't use the service because the title of the page is "FAQs" and not "T&Cs" then no, this is not a "feature" I'll implement, it's a bureaucratic problem with your company!
If I've misunderstood and there's specific information missing from the site then please let me know what it is, but if it's merely "there is no page called T&Cs" then this may not be the right service for you.
What are you actually looking for in terms which is not already documented on the site? Give me some more detail and I'll see if I can fill the gaps.
Could you expand on this further please - what additional info would you like to see? It looks like the askmein.com address is just pulling the description I already publish.
Right, so the trick then is establishing the criteria for "likely fake". One way could be a high correlation with a previous paste based on the prevalence of the same emails in both pastes. This would mean taking the emails from the new paste and seeing if a certain percentage already exist in an existing paste. At present this would be quite laborious as I'd need to check them one by one and we're sometimes talking 10k emails in a paste. Either that or re-architect things to make searching like this easier.
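As a sketch of that correlation check, here's some minimal Python that works out what share of a new paste's emails already appear in an earlier paste. It assumes both pastes are already available as in-memory lists of addresses, which isn't how things are structured at present. The `overlap_ratio` and `likely_fake` helpers and the 90% threshold are made up for illustration.

```python
# Hypothetical cut-off: treat a paste as "likely fake" if at least this share
# of its emails already appeared in a single earlier paste.
LIKELY_FAKE_THRESHOLD = 0.9

def overlap_ratio(new_paste_emails, existing_paste_emails):
    """Fraction of the new paste's distinct emails found in the existing paste."""
    new = {e.strip().lower() for e in new_paste_emails}
    if not new:
        return 0.0
    existing = {e.strip().lower() for e in existing_paste_emails}
    return len(new & existing) / len(new)

def likely_fake(new_paste_emails, existing_pastes):
    """True if any earlier paste overlaps the new one above the threshold."""
    return any(
        overlap_ratio(new_paste_emails, old) >= LIKELY_FAKE_THRESHOLD
        for old in existing_pastes
    )

new_paste = ["a@example.com", "b@example.com", "c@example.com", "d@example.com"]
old_paste = ["A@example.com", "b@example.com", "x@example.com"]
print(overlap_ratio(new_paste, old_paste))
```

Using sets makes the comparison a single intersection rather than a one by one check, but it only helps if the emails from earlier pastes can be pulled back in bulk.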
Out of curiosity, how much is this happening? I mean how often do you get an email notification and then conclude it's probably redundant with another paste? I'm just trying to get a sense of the scale of the issue.
Originally I thought you might even be talking about identifying duplicate pastes (which happens a bit) and there are various angles to that. One thing I keep coming back to though is that even if a paste is duplicate or fake, people usually still want to know how their details are being used. In fact that's one of the other ideas currently in progress - notify people when their info appears on a paste I can't verify or may even be fake.
Would you prefer not to know when your email appears if it may be fake? Or know but be notified that it can't be verified and may be fake?
Hey Josh, thanks for the idea! Tell me more about how you think this feature would be used - would it identify that perhaps a paste is fake due to the high correlation with existing breaches? Is it to try and get more confidence around the legitimacy of a paste?