Add basic correlation logic to compare newly found pastes against current breaches...
Some sort of fuzzy matching & correlation with already posted breaches to see if the paste is just another re-post of the data from another known breach.
One way to do this is to look for emails that use the + syntax, which typically means the user has created a somewhat unique address for a particular service, company, etc.
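A rough sketch of the + syntax idea, in Python. The `plus_tags` helper and its regex are hypothetical and only illustrate the approach. It pulls the tag out of addresses like `joe+adobe@example.com` and counts how often each tag appears in a paste, so a paste dominated by one tag hints at which service the data may have come from.

```python
import re

# Hypothetical pattern: local part, a "+tag", then the domain.
PLUS_ADDRESS = re.compile(r"^([^@+\s]+)\+([^@\s]+)@([^@\s]+)$")


def plus_tags(emails):
    """Return a mapping of tag -> count for addresses that use the + syntax."""
    counts = {}
    for email in emails:
        match = PLUS_ADDRESS.match(email.strip().lower())
        if match:
            tag = match.group(2)
            counts[tag] = counts.get(tag, 0) + 1
    return counts


if __name__ == "__main__":
    sample = [
        "joe+adobe@example.com",
        "Ann+Adobe@example.com",
        "bob@example.com",
        "x+linkedin@example.com",
    ]
    print(plus_tags(sample))
```

The tags are free text chosen by each user, so this could only ever be a hint to weigh alongside other signals, not proof of where a paste originated.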
I've just seen a paste that appears the same as a previous breach. Very likely the same data set (not 100% current, but likely floating on some old server/backup).
Would be good to be able to see (in notifications) if this is a repeat or a first time...
Right, so the trick then is establishing the criteria for "likely fake". One way could be a high correlation with a previous paste based on the prevalence of the same emails in both pastes. This would mean taking the emails from the new paste and seeing whether a certain percentage of them already appear in an earlier paste. At present this would be quite laborious as I'd need to check them one by one and we're sometimes talking 10k emails in a paste. Either that or re-architect things to make searching like this easier.
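To make the percentage idea concrete, here is a minimal sketch in Python. It assumes both sets of emails can be loaded into memory, which is not how the site stores things today. The function names and the 0.8 threshold are illustrative assumptions, not anything the service actually uses.

```python
def normalize(email):
    """Lowercase and trim so trivially different spellings compare equal."""
    return email.strip().lower()


def overlap_ratio(new_paste_emails, known_emails):
    """Fraction of the new paste's unique emails that already appear in known data."""
    new_set = {normalize(e) for e in new_paste_emails}
    if not new_set:
        return 0.0
    known_set = {normalize(e) for e in known_emails}
    return len(new_set & known_set) / len(new_set)


def looks_like_repost(new_paste_emails, known_emails, threshold=0.8):
    """Flag the paste as a probable repost when the overlap meets the threshold.

    The default threshold is an arbitrary placeholder and would need tuning.
    """
    return overlap_ratio(new_paste_emails, known_emails) >= threshold


if __name__ == "__main__":
    new_paste = ["a@x.com", "b@x.com", "c@x.com", "d@x.com"]
    earlier = ["A@x.com", "b@x.com", "c@x.com", "z@x.com"]
    print(overlap_ratio(new_paste, earlier))
    print(looks_like_repost(new_paste, earlier))
```

With sets, a 10k-email paste is one intersection rather than 10k individual lookups, but it only works if the earlier emails are held somewhere that supports bulk membership checks.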
Out of curiosity, how much is this happening? I mean how often do you get an email notification and then conclude it's probably redundant with another paste? I'm just trying to get a sense of the scale of the issue.
Probably the 2nd choice:
--know but be notified that it can't be verified and may be fake?
If it appears to be a fake (or is confirmed to be a fake through automatic logic, etc.), that means less time we have to spend looking up the paste and analyzing it by hand....
This probably means more to me because I am monitoring multiple domains with many users vs. just my personal email.
Originally I thought you might even be talking about identifying duplicate pastes (which happens a bit) and there are various angles to that. One thing I keep coming back to though is that even if a paste is a duplicate or a fake, people usually still want to know how their details are being used. In fact that's one of the other ideas currently in progress: notify people when their info appears on a paste I can't verify or that may even be fake.
Would you prefer not to know when your email appears if it may be fake? Or know but be notified that it can't be verified and may be fake?
@Troy Yes, that is correct. I think we will continue to see a high amount of reposted data... Having some "simple" logic to give some direction around the legitimacy of the paste would be beneficial....
Hey Josh, thanks for the idea! Tell me more about how you think this feature would be used - would it identify that perhaps a paste is fake due to the high correlation with existing breaches? Is it to try and get more confidence around the legitimacy of a paste?