How does Torrentz.com find different trackers for same torrent? - bittorrent

I want to know the logic behind this comparison; the same thing is done by BtReannouncer.net.
Do the torrents share the same hash, or are they compared by size and name?
Most torrents on The Pirate Bay do not list trackers other than The Pirate Bay's own, but torrentz.com provides a complete list of all the trackers that are tracking the same torrent.

Each torrent file contains a hash that uniquely identifies it. As the torrent is passed around to more trackers, the hash stays the same. It's frequently known as the info hash.
febd9a2cb755ec82e6e7a015a8dc497fde9dd507 would be an example hash id.
If you Google it you'll notice that it shows up at many different trackers. Spidering search results and checking in on the major torrent sites lets you index which sites host which torrents, then correlate the intersecting info hashes (something like SELECT trackers FROM torrents WHERE info_hash = "xxx"). So you'll end up with torrents sharing the same info hash across The Pirate Bay, Mininova, etc.
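The correlation step described above can be sketched as a simple inverted index. This is a minimal illustration, not how torrentz.com actually works; the crawl data below is hypothetical, and only the first info hash is taken from the example in this answer.

```python
from collections import defaultdict

def build_index(scraped):
    """Map each info hash to the set of sites hosting a torrent with that hash."""
    index = defaultdict(set)
    for site, info_hashes in scraped.items():
        for info_hash in info_hashes:
            index[info_hash].add(site)
    return index

# Hypothetical crawl results: info hashes seen on each site.
scraped = {
    "thepiratebay.org": ["febd9a2cb755ec82e6e7a015a8dc497fde9dd507"],
    "mininova.org": ["febd9a2cb755ec82e6e7a015a8dc497fde9dd507",
                     "0123456789abcdef0123456789abcdef01234567"],
}

index = build_index(scraped)
# The equivalent of: SELECT trackers FROM torrents WHERE info_hash = "febd..."
sites = index["febd9a2cb755ec82e6e7a015a8dc497fde9dd507"]
```

Because the info hash is the SHA-1 of the torrent's bencoded "info" dictionary, the same content produces the same key on every site, which is what makes this join possible.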

Related

Comparing data without disclosing it

Two companies A and B want to compare their respective customer bases and figure out the overlap.
Obviously, they can't exchange their customer bases outright, so they need to come up with a process to compare their listings without disclosing any information besides the intersection itself (which is, after all, the whole point of the comparison).
Basically, if I'm a customer of A (identified by my email) and also a customer of B, both should be able to know it after the comparison. However, if I'm only a customer of A, B should NOT be able to identify me, and vice versa.
Moreover, neither A nor B has the least incentive to disclose any qualitative information about their datasets, such as how many customers they have, their respective rates of duplication, "incorrectness", etc. The ideal solution should convey information about the intersection, period.
The obvious technical solution seems to be hashing the identifiers before sending them to the other party. Both parties can then compare the hashes against their own, computed with the exact same method: they can find whether an identifier matches, but they won't be able to identify the others. However, both A and B would then know the exact size of each other's dataset. All of this assumes that the hashing is not reversible. What else could be done to solve the problem along that path?
The other solution being considered is to find a trusted third party that receives both datasets, either plain or hashed, does the comparison, and sends the intersection to both A and B. I don't know where to find such a service.
A trusted third party would be the way to go here.
The hashing solution is not feasible. To be able to compare the hashes, both A and B would have to use the same hashing algorithm. If both hash all of their customer email addresses and A then shares the hash of a shared user with B, B can map the hash back to the plaintext email address.
Salts and other such techniques also don't help since, once again, both parties would need to use the same salts to make the hashes comparable.
Lastly, even when A shares the hash of a customer that B does not have, it would be comparatively easy to reverse the hash. For example, B could hash a list of all potential customers and check against it. (This wouldn't reverse every address, but it would still be too large a business risk.)
If a fully trusted 3rd party can't be found, a hybrid approach might work best:
Hash all email addresses, send only the hashes to the 3rd party and have it check which ones overlap.
You could use the following approaches:
Add some fake entries; this pads the dataset size, and the padding can be filtered out again afterwards.
Transform the identifiers with a method agreed upon in advance. For example, if my email address is stackoverflow@example.com, each party could change it in a predefined way, like:
stackoverflow@example.com becomes s#a#k#v#r#l#w#e#a#p#e.c#m
In addition, you could append a checksum of "stackoverflow@example.com" computed with a predefined method, such as summing the ASCII values of the characters. There is still the possibility of a dictionary attack or a similar scenario recovering valid email addresses, but it provides some level of security, and you can make the logic as complex as your requirements demand.
Use a trusted third party, as you already mentioned.
Trusted third-party as you already mentioned.

Full text search on encrypted data

Suppose I have a server storing encrypted text (end-to-end: server never sees plain text).
I want to be able to do full text search on that text.
I know this is tricky, but my idea is to use the traditional full text design ("list" and "match" tables where words are stored and matched with ids from the content table). When users submit the encrypted text, they also send a salted MD5 of the words and respective matches. The salt used is unique for each user and is recovered from their password.
(in short: the only difference is that the "list" table will contain hashed words)
Now, how vulnerable would this system be?
Note that I said "how vulnerable" instead of "how safe", because I acknowledge that it can't be totally safe.
I DO understand the tradeoff between features (full text search) and security (disclosing some information from the word index). For example, I understand that an attacker able to get the list and match tables could get information about the original, encrypted text and possibly be able to decipher some words with statistical analysis (however, since the salt is unique for each user, this attack would have to be repeated for each user).
How serious would this threat be? And would there be any other serious threats?
DISCLAIMER
What I'm trying to build (with the help of a cryptographer for the actual implementation; right now I'm just trying to understand whether this will be possible) is a consumer-grade product which will deal with confidential, yet not totally secret, data.
My goal is just to provide something safe enough, so that it would be easier for an attacker to try stealing users' passwords (e.g. breaching into clients - they're consumers, eventually) rather than spending a huge amount of time and computing power trying to brute force the index or run complicated statistical analysis.
Comments in response to @Matthew
(may be relevant for anyone else answering)
As you noted, other solutions are not viable. Storing all the data inside the client means that users cannot access their data from other clients. Server-side encryption would work, but then we wouldn't be able to give users the added security of client-side encryption. The only "true alternative" is just to not implement search: while this is not a required feature, it's very important to me/us.
The salt will be protected in exactly the same way as the users' decryption key (the one used to decrypt stored texts). Thus, if someone were able to capture the salt, they would likely be able to capture the key as well, creating a much bigger issue. To be precise, the key and the salt will be stored encrypted on the server. They will be decrypted by the client locally with the user's password and kept in memory; the server never sees the decrypted key and salt. Users can then change passwords by re-encrypting just the key and the salt, not all stored texts. This is a pretty standard approach in the industry, to my knowledge.
Actually, the design of the database will be as follows (reporting relevant entries only). This design is like the one you proposed in your comment: it disallows proximity searches (not very relevant to us) and makes frequency analysis less accurate.
Table content, containing all encrypted texts. Columns are content.id and content.text.
Table words, containing the list of all hashes. Columns are words.id and words.hash.
Table match, that matches texts with hashes/words (in a one-to-many relationship). Columns are match.content_id and match.word_id.
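The three-table design above can be prototyped quickly. This is a minimal sketch using SQLite and SHA-256 as a stand-in for the client-side salted hash; the sample texts and the "user-salt" value are made up, and in the real design the content column would hold ciphertext and the hashing would happen on the client.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE content (id INTEGER PRIMARY KEY, text BLOB);   -- encrypted blobs
    CREATE TABLE words   (id INTEGER PRIMARY KEY, hash TEXT UNIQUE);
    CREATE TABLE match   (content_id INTEGER, word_id INTEGER);
""")

def word_hash(word, salt):
    # Stand-in for the client-side salted hash; the salt is per user.
    return hashlib.sha256((salt + word).encode()).hexdigest()

def index_text(content_id, words, salt):
    for w in set(words):               # unique words only: no frequency leakage
        hx = word_hash(w, salt)
        conn.execute("INSERT OR IGNORE INTO words (hash) VALUES (?)", (hx,))
        wid = conn.execute("SELECT id FROM words WHERE hash = ?",
                           (hx,)).fetchone()[0]
        conn.execute("INSERT INTO match VALUES (?, ?)", (content_id, wid))

def search(word, salt):
    # The server only ever sees the hash of the search term.
    hx = word_hash(word, salt)
    rows = conn.execute("""
        SELECT match.content_id FROM match
        JOIN words ON words.id = match.word_id
        WHERE words.hash = ?""", (hx,)).fetchall()
    return [r[0] for r in rows]

index_text(1, ["meet", "me", "at", "noon"], salt="user-salt")
index_text(2, ["meet", "tomorrow"], salt="user-salt")
```

A search for a term hashed with the wrong salt simply finds nothing, which is why each user's index can only be attacked separately.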
We would have to implement features like stopword removal ourselves. Sure, that is not a big issue (it will, of course, be done on the client). In any case, stopword lists have always been of limited utility for international (i.e. non-English-speaking) users.
We expect the lookup/insert ratio to be pretty high (i.e. many lookups, but rare inserts and mostly in bulk).
Decrypting the whole hash database is certainly possible, but requires a brute force attack.
Suppose the salt is kept safe (as per point 2 above). If the salt is long enough (you cited 32 bits... but why not 320, just as an example?) that would take A LOT of time.
To conclude... You confirmed my doubts about the possible risk of frequency analysis. However, I feel like this risk is not so high. Can you confirm that?
Indeed, first of all the salt would be unique per user. This means that users must be attacked one at a time.
Second, by reporting words only once per text (no matter how many times they appear), frequency analysis becomes less reliable.
Third... frequency analysis on hashed words doesn't seem as effective as frequency analysis on, say, a Caesar shift. There are 250,000 words in English alone (and, again, not all our users will be English speakers), and even if some words are more common than others, I believe the attack would still be hard to pull off.
PS: The data we'll be storing is messages, like instant messages. These are short, contain a lot of abbreviations, slang, etc. And every person has a different style in writing texts, further reducing the risk (in my opinion) of frequency attacks.
TL;DR: If this needs to be secure enough that it requires per-user end-to-end encryption: Don't do it.
Too long for a comment, so here goes - if I understand correctly:
You have encrypted data submitted by the user (client side encrypted, so not using the DB to handle).
You want this to be searchable to the user (without you knowing anything about this - so an encrypted block of text is useless).
Your proposed solution to this is to also store a list (or perhaps paragraph) of hashed words submitted from the client as well.
So the data record would look like:
Column 1: Encrypted data block
Column 2: space-delimited, ordered, individually hashed words from the above encrypted text
Then to search you just hash the search terms and treat the hashed terms as words to search the paragraph(s) of "text" in column 2. This will definitely work - just consider searching nonsense text with nonsense search terms. You would even still be able to do some proximity ranking of terms with this approach.
Concerns:
The column with the individually hashed words as text will be incredibly weak in comparison to the encrypted text. You are greatly weakening your solution: not only is there a limited set of words to work from, the resulting text will be susceptible to word-frequency analysis, etc.
If you do this, separately store a salt unrelated to the password. Given that a rainbow table will be easy to create if your salt is captured (dictionary words only), store the salt encrypted somewhere.
You will lose many benefits of FTS, like stopword removal (ignoring words such as 'the'); you will need to re-implement this functionality on your own if you want it (i.e. remove these terms on the client side before submitting the data / search terms).
Other approaches that you imply are not acceptable/workable:
Implement searching client side (all data has to exist on the client to search)
Centralized encryption leveraging the databases built in functionality
I understand the argument being that your approach provides the user with the only access to their data (i.e. you cannot see/decrypt it). I would argue that this hashed approach weakens the data sufficiently that you could reasonably work out a user's data (that is, you have lowered the effort required to the point that it is very plausible you could decrypt a user's information without any knowledge of their keys/salts). I wouldn't quite lower the bar to describe this as mere obfuscation, but you should really think through how significant this is.
If you are sure that weakening your system to implement searching like this makes sense, and that another approach is not sufficient, one thing that could help is to store the hashes of the words in the text as a list of uniquely occurring words only (i.e. no frequency or proximity information would be available). This would reduce the attack surface of your implementation a little, but would also lose the benefits you are implying you want by describing the approach as FTS. You could get very fast results this way, though, as the hashed words essentially become tags attached to all the records that include them. Search lookups could then become very fast (at the expense of your inserts).
Just to be clear: I would want to be REALLY sure my business needs demanded something like this before I implemented it...
EDIT:
Quick example of the issues: say I know you are using 32-bit salts and are hashing common words like "the". 2^32 possible salts = 4 billion possibilities (that is, not that many if you only need to hash a handful of words for the initial attack). Whether the salt is appended or prepended, that is still only 8 billion entries to pre-calculate. Even for less common words you do not need to create too many lists to ensure you will get hits (if this were not the case, your data would not be worth searching).
Then look up the highest-frequency hashes for a given block of text in each of our pre-calculated salt tables and use any match to see if it correctly decrypts other words in the text. Once you have a plausible candidate, generate the 250,000-word English-language rainbow table for that salt and decrypt the text.
I would guess you could decrypt the hashed data in the system in hours to days with access to the database.
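A toy-scale version of this salt search makes the point concrete. This sketch shrinks the salt to 16 bits (instead of the 32 discussed above) so the loop finishes instantly, and attacks an invented three-word index; the principle, and the cost asymmetry, are the same.

```python
import hashlib

def h(word, salt):
    """Salted word hash, as stored in the victim's search index."""
    return hashlib.sha256(salt + word.encode()).hexdigest()

# Victim's index entries, hashed with a salt unknown to the attacker.
secret_salt = (12345).to_bytes(2, "big")
index_entries = {h(w, secret_salt) for w in ["the", "meeting", "noon"]}

# Attacker: try every possible salt against one very common word.
# With a 16-bit salt this is 2**16 = 65,536 hashes; with 32 bits it is
# 4 billion, which is still very feasible offline.
recovered = None
for s in range(2 ** 16):
    salt = s.to_bytes(2, "big")
    if h("the", salt) in index_entries:
        recovered = salt          # plausible salt candidate found
        break
```

Once a candidate salt is recovered, building the full dictionary rainbow table for that one salt is cheap, which is the step the answer above describes.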
First, you have all of the normal vulnerabilities of password-based cryptography, which stem from users picking predictable passwords. It is common to crack more than 50% of passwords from real-world applications in offline attacks with less than two hours of desktop computing time.
I assume the full text encryption key is derived from the password, or is encrypted by a password-derived key. So an attacker can test guesses against a selection of hashed index keys, and once she finds the password, decrypt all of the documents.
But, even if a user picks a high-entropy password, frequency analysis on the index could potentially reveal a lot about the plain text. Although word order is lost in indexing (if you don't support proximity searches), you are essentially creating an electronic code book for each user. This index would be vulnerable to centuries of well-developed cryptanalytical techniques. Modern encryption protocols avoid ECB, and provide "ciphertext indistinguishability"—the same plain text yields different cipher text each time it's encrypted. But that doesn't work with indexes.
A less vulnerable approach would be to index and search on the client. The necessary tables would be bundled as a single message and encrypted on the client, then transported to the server for storage. The obvious tradeoff is the cost of transmission of that bundle on each session. Client-side caching of index fragments could mitigate this cost somewhat.
In the end, only you can weigh the security cost of a breach against the performance costs of client-side indexing. But the statistical analysis enabled by an index is a significant vulnerability.
MSSQL Enterprise TDE encrypts full-text indexes, along with the database's other indexes, when you enable whole-database encryption (since SQL Server 2008). In practice it works pretty well, without a huge performance penalty. I can't comment on how, because it's a proprietary algorithm, but here are the docs.
https://learn.microsoft.com/en-us/sql/relational-databases/security/encryption/transparent-data-encryption-tde
It doesn't cover any of your application stack besides your DB, but your FTS indexes will work like normal and won't exist in plain text as they do in MySQL or Postgres. MariaDB and, of course, Oracle have their own implementations as well, from what I remember. MySQL and PostgreSQL do not.
As for passwords, TDE in all these implementations uses AES keys, which can be rotated (though not always easily), so the password vulnerability falls on the DBAs.
The problem is that you need to pay for full Enterprise licensing for MSSQL TDE (i.e. the feature is not available in "Standard" or "Basic" cloud and on-premises editions), and probably for TDE in Oracle as well. But if what you need is a quick solution and you have the cash for Enterprise licensing (probably cheaper than developing your own implementation), the implementations are out there.

How to update entries in a DHT

I know how data is (in theory) stored in a DHT. However, I am uncertain as to how one might go about updating a piece of data associated with a key. Is this possible? Also, how are conflicts handled in a DHT?
A DHT simply defines put(key,value) and get(key) operations and the core of the various DHT algorithms revolve around how to locate the nodes responsible for a specific key.
What those nodes do on an incoming put request for a value already stored largely depends on the purpose and implementation of the DHT network, not on the algorithm itself.
E.g. a node might timestamp all incoming values and return lists of multiple separate timestamped entries. Or it might return lists that also include the source address for each value. Or it might just overwrite the stored value.
If you have some relation between the key and a signature within the value or the source ID or something like that you can put enough intelligence into the nodes to verify the data cryptographically and thus allow them to keep a single canonical value for each key by replacing the old data.
In the case of bittorrent's DHT you wouldn't want that. Many different bittorrent peers announce their presence to a single key from different source addresses. Therefore the nodes actually store unique <key,IP,port> tuples where <IP,port> can be considered the value. Which means it'll return lists of IPs and ports on each lookup. And since a DHT will have multiple nodes responsible for one key you will actually have K (bucket size) nodes responding with varying lists.
TL;DR: It's implementation-dependent
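The bittorrent-style behavior described above (accumulating <key,IP,port> tuples rather than overwriting) can be sketched for a single node. This is an illustrative toy, not a real DHT: there is no routing, no bucket of K replicas, and no expiry of stale announces, all of which a real implementation needs; the key and addresses below are made up.

```python
class ToyDHTNode:
    """One node storing bittorrent-style announcements for keys it is
    responsible for. A real DHT routes put/get to the K nodes whose IDs
    are closest to the key; here we pretend this node is one of them."""

    def __init__(self):
        self.store = {}          # key -> set of (ip, port) tuples

    def put(self, key, ip, port):
        # Re-announcing the same peer is idempotent; new peers accumulate,
        # so a put never "updates" anyone else's entry.
        self.store.setdefault(key, set()).add((ip, port))

    def get(self, key):
        # Lookups return the whole peer list, sorted for determinism.
        return sorted(self.store.get(key, set()))

node = ToyDHTNode()
node.put("febd9a2c...", "10.0.0.1", 6881)
node.put("febd9a2c...", "10.0.0.2", 6881)
node.put("febd9a2c...", "10.0.0.1", 6881)   # duplicate announce, no effect
```

Swapping the set for a dict keyed on (ip, port) with a timestamp value would give the timestamped variant mentioned above, where stale entries can be aged out.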
It is possible. I've researched Pastry's DHT. It is possible to alter data stored under a given key, but Pastry's developers advise against it, as it can have nasty side effects, mainly with replication of the altered piece of data that is stored on other nodes (see the FAQ on FreePastry's home page).
I'm not sure how it would affect other DHTs such as Chord or Tapestry, however.
With regard to conflicts, again I only have experience with Pastry. If you try to store data under a key that's already in use, an exception will be thrown.

Secure way of exchanging email addresses (hashing) to allow matching for overlap on another list, but not reveal those for which there is no overlap?

I'm with an organization (Company A) that has a large email list. I'm sending a 10,000 email subset of this list to another organization (Company B) to test for overlap (discover which email addresses are on both lists). I want to send the list in a way that is easy for Company B to test for overlap, but difficult (ideally impossible) for Company B to "decode" the email addresses which are NOT already on their list. Secondarily, I want to ensure that if the list I send winds up in the wrong hands (some 3rd party), it would be difficult for anyone else to learn the actual email addresses on the list.
My current solution is to simply pull the emails from our database as
SHA1(email + a_long_random_salt)
Using the same salt for each email address.
To do the match, I send the list of hashes and the salt (securely, separately) to Company B, and they simply search their database using
SELECT email FROM members WHERE SHA1(email + the_salt) IN(hash1, hash2, hash3....)
(Or they pre-compute the SHA1 hash for each address and store it in the DB with the email address so the hashing doesn't need to happen as the query is run)
A sufficiently long/random salt protects against the use of a precomputed rainbow table to crack the hashes. I assume it to be rather unlikely that anyone has a rainbow table of millions upon millions of plausible email addresses salted with whatever 100-character random string I use as my salt. As long as the salt is kept secret, no 3rd party is going to decode this list with a rainbow table or brute force. (Please, correct me if I'm somehow wrong here.)
The issue that I'm struggling with is there are obviously easily-obtained lists of millions upon millions of email addresses harvested from the web. It would be pretty easy for Company B to obtain one of these lists, compute the hashes using the salt I've provided, and recover some significant portion of emails on the list I've sent (certainly not all, but a significant portion).
Is there some strategy to accomplish this match that I'm failing to come up with? The only thing I can think of is to use a more complex hashing method (i.e. multiple iterations) to make it slower to match against a list of hundreds of millions of email addresses (that theoretical list scraped from the web). The key is that it would really only be slower -- not really even difficult. Also, I know that Company B's own email list is in the range of 1 million addresses, so I can't give them a hashing scheme that would take many seconds to compute for each address on that list of 1 million. Simply making it slower doesn't solve the issue -- I think I need a completely different approach.
Honestly, this particular case this is more of an academic exercise for me than a real security concern. I trust Company B is not going to try to do this (we work together often), and even if they did it would be no huge loss. All they could possibly learn is email addresses of 10,000 people on our mailing list -- we're not talking about passwords, credit card numbers, etc. If we were dealing with passwords or credit card numbers, I wouldn't even be considering developing some scheme of my own. And, yes, of course I realize that SHA-256 or some other newer algorithm might be a bit preferable to SHA1, but only to some very limited extent. It's not a brute force crack of the hash that I'm worried about here.
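The "only slower, not harder" point can be made concrete. This sketch uses iterated SHA-256 as a stand-in for a slow hash (real key stretching would use something like PBKDF2 with far more rounds); the addresses and salt are placeholders. The attacker's cost per harvested address is identical to Company B's legitimate cost per address, so stretching changes nothing structurally.

```python
import hashlib

def slow_hash(email, salt, rounds):
    """Toy key stretching: iterate the hash to make each guess cost more."""
    d = (salt + email).encode()
    for _ in range(rounds):
        d = hashlib.sha256(d).digest()
    return d.hex()

ROUNDS = 1000                     # toy value; real stretching uses far more
SALT = "100-char-random-salt"     # placeholder for the shared secret salt

# What Company A sends:
sent_hashes = {slow_hash(e, SALT, ROUNDS)
               for e in ["alice@example.com", "carol@example.com"]}

# Legitimate matching costs Company B (1M addresses) * ROUNDS hash calls.
# A dictionary attack with a harvested list costs exactly the same per
# address, so slowing the hash penalizes both sides equally:
harvested = ["carol@example.com", "mallory@example.com"]
cracked = [e for e in harvested if slow_hash(e, SALT, ROUNDS) in sent_hashes]
```

This is why the question concludes that a different approach (rather than a slower hash) is needed.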
You can conduct the exchange as a Secure Multi-Party Computation problem, with the goal of computing the intersection of the two lists.
Quoting Wikipedia:
Secure multi-party computation (also known as secure computation or multi-party computation (MPC)) is a subfield of cryptography. The goal of methods for secure multi-party computation is to enable parties to jointly compute a function over their inputs, while at the same time keeping these inputs private.
If you visit the page http://en.wikipedia.org/wiki/Secure_multiparty_computation, its "External links" section contains libraries and references to get you started.
One thing I can think of is a brute-force attack on known domains. Consider the following factors:
@hotmail.com, @gmail.com and @yahoo.com have a great share of the market
the list of last names is finite and not too long; the same goes for first names
Taking the name John and surname Doe, we can construct a set of addresses like JDoe@hotmail.com, DoeJ@yahoo.com, JohnDoe@hotmail.com, etc. The set won't be very extensive.
Depending on how important / beneficial such data mining is (i.e. how much B gains from knowing that John Doe is on your list), the attack I described can still be profitable. Yes, I remember the salt, but the number of name/domain combinations is still small enough to be within reach of a good parallel brute-force attack.
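The candidate-address construction described above is easy to sketch. The local-part patterns and domains below are illustrative assumptions, not an exhaustive list of real-world conventions.

```python
# Hypothetical candidate generator for the brute force described above.
DOMAINS = ["hotmail.com", "gmail.com", "yahoo.com"]

def candidates(first, last):
    """Build plausible addresses for one person across common domains."""
    local_parts = {
        first[0] + last,        # JDoe
        last + first[0],        # DoeJ
        first + last,           # JohnDoe
        first + "." + last,     # John.Doe
    }
    return {f"{lp}@{d}".lower() for lp in local_parts for d in DOMAINS}
```

With a few thousand first names, a few thousand surnames, and a handful of patterns and domains, the total candidate space stays small enough to hash (even salted) on commodity hardware.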
It appears to me that your problem can be restated as:
Company B has access to a list of 1 million email addresses, List A. They also have access to a different list of several million email addresses, List B. I would like Company B to be able to run an algorithm to determine which of the email addresses in List A are also on our list, but not be able to run that algorithm against List B.
Re-stated like that, it appears to be a logical impossibility: there is really no difference between their customer database and a list of email addresses they may have downloaded elsewhere.

Need a secure way to publicly display hash values

I am building a Windows application to store backups of sensitive files. The purpose of my application is to store a copy of a file together with its hash. The program or user will then display the hash publicly, in case the user needs to prove they had the backup of the sensitive file at a certain time.
Motivation:
Some situations where this might be useful are:
Someone has a job at a company where they think they might be accused of doing something illegal. If they were accused of changing some data over time, it would be convenient to have copies of sensitive files related to their case over a period of time.
A politician might take notes about things they did each day, many of them about classified or sensitive subjects, and then want to be able to disclose their files at a later date if they are accused of something (for instance, if the CIA said they were briefed on torture…). It's not absolute proof, but it would be hard to create fake backup files for every potential scenario, especially several years into the future.
Just to be clear, this application is mostly just an excuse for me to practice my coding skills. I don’t recommend using any type of cryptographic software that hasn’t been scrutinized by several professionals.
Possible Solutions:
For my application, I need to find a good place to publicly store the hash values. Here are my ideas so far:
Send the hash values to a group of people through email. (disadvantage: could annoy people, but would create a traceable record)
Publish the hash values on a public blog (disadvantage: if I ever got in serious legal trouble someone with resources could try to attack the free service I used and erase my data)
Publish the hash values using some online security service that stores documents but does not allow you to delete them. (I am not sure something like this exists.)
What is the most secure and convenient way to publicly display my hash values?
Hash your set of hashes so that you have only one hash to record. Then publish this hash in the classifieds of a widely archived newspaper.
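Combining the hashes is a one-liner. This is a minimal sketch: sorting before hashing is an added assumption (so the published digest doesn't depend on file order), and the two file hashes below are examples, the first reused from earlier in this document.

```python
import hashlib

def combined_digest(file_hashes):
    """Hash the sorted list of per-file hashes into one digest to publish."""
    joined = "\n".join(sorted(file_hashes)).encode()
    return hashlib.sha256(joined).hexdigest()

hashes = [
    "febd9a2cb755ec82e6e7a015a8dc497fde9dd507",
    "0123456789abcdef0123456789abcdef01234567",
]
digest = combined_digest(hashes)
# Publishing only `digest` commits you to the whole set: change or drop
# any file hash and the combined digest no longer matches.
```

To prove one file later, you reveal its hash plus the full hash list; anyone can recompute the combined digest and compare it with the published one.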
Truly secure? Print out the hashes on a piece of paper along with legal text to the effect of, "On this day XX/XX/XXXX I affirm these hashes to accurately identify these files with these dates" (I'm not a lawyer; get one to verify this), then have it notarized. Then save that piece of paper in a secure location.