How does Instagram name uploaded files? - instagram

I have recently been doing some research on Instagram and its API and have come across its strange file naming. Here is an example:
https://scontent-lhr3-1.cdninstagram.com/t51.2885-15/11357983_574786385995155_503550105_n.jpg
(The image is supposed to just be black...)
I understand that they used to name files like this, but I cannot find out how they are named now. The names seem random, but I would like to know whether there is any pattern in how Instagram names its files. I would appreciate any information regarding this.

They explain it all in this blog post here:
https://engineering.instagram.com/sharding-ids-at-instagram-1cf5a71e5a5c
They create unique IDs following a custom numbering scheme they developed, such that each ID is 64 bits long and guaranteed to be unique:
Each of our IDs consists of:
41 bits for time in milliseconds (gives us 41 years of IDs with a custom epoch)
13 bits that represent the logical shard ID
10 bits that represent an auto-incrementing sequence, modulus 1024. This means we can generate 1024 IDs, per shard, per millisecond
If you want an implementation example, the blog post explains it pretty well, but since the original question is about where the file name comes from rather than how to generate one, I feel the above quote is sufficient.
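The bit layout quoted above can be sketched as follows. This is only an illustration of packing 41 + 13 + 10 bits into one 64-bit integer; the epoch value and the function names are assumptions made for the example, not Instagram's actual code, and nothing here shows how such an ID ends up in a file name.

```python
# Hypothetical custom epoch in milliseconds (an assumption for illustration).
CUSTOM_EPOCH_MS = 1_300_000_000_000


def make_id(now_ms: int, shard_id: int, sequence: int) -> int:
    """Pack time (41 bits), shard (13 bits) and sequence (10 bits) into 64 bits."""
    elapsed = now_ms - CUSTOM_EPOCH_MS
    if not 0 <= elapsed < (1 << 41):
        raise ValueError("timestamp does not fit in 41 bits")
    if not 0 <= shard_id < (1 << 13):
        raise ValueError("shard id does not fit in 13 bits")
    # The sequence wraps around modulo 1024, as in the quoted description.
    return (elapsed << 23) | (shard_id << 10) | (sequence % 1024)


def split_id(value: int) -> tuple[int, int, int]:
    """Recover (timestamp in ms, shard id, sequence) from a packed ID."""
    return (value >> 23) + CUSTOM_EPOCH_MS, (value >> 10) & 0x1FFF, value & 0x3FF
```

Because the time component occupies the most significant bits, IDs built this way sort roughly by creation time.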

My best guess is that they use random numbers. It's JavaScript, so they probably generate a random set of numbers to avoid data corruption.

Related

Share Link Generation Security

I have always wondered how websites generate "share with others" links.
Some websites allow you to share a piece of data through a link, so that the people you send the link to can see the data or even edit it.
For example Google Drive, OneDrive, etc. They give you a (pretty short) link, but what guarantees that someone cannot find this link "by luck" and access my data?
What if an attacker tried all the possible links, https://link.share.me/xxxxxxx, until he finds some that work?
Is there a certain length which almost guarantees that no one will find a link this way? For example, if a site has generated 1000 links and each link ends in 10 characters drawn from [A-Za-z0-9] (~8e17 possibilities), can we just assume that it is secure enough? If yes, at what probability, or ratio between links and possibilities, do we consider this kind of system secure?
Is there a cryptographic or mathematical way of generating those links which ensures that a link cannot be found by anyone?
Thank you very much.
Probably the most important thing (besides entropy, which we will come back to in a second) is where you get your randomness from. For this purpose you should use a cryptographically secure pseudo-random number generator (crypto PRNG). (As a side note, you could also use true randomness, but a true random source is very hard to come by, and if you generate many links you will likely run out of available random bits. A crypto PRNG is probably good enough for your purpose; few applications actually need true random numbers.) Most languages and frameworks have a facility for this: in Ruby it is SecureRandom, in Java it's java.security.SecureRandom, in Python it could be os.urandom, and so on.
Ok so how long should it be. It somewhat depends on your other non-security requirements as well, for example sometimes these need to be easy to say over the phone, easy to type or something similar. Apart from these, what you should consider is entropy. Your idea of counting the number of all possible codes is a great start, let's just say that the entropy in the code is log2 (base 2 logarithm) of that number. So for a case sensitive, alphanumeric code that is 10 characters long, the entropy is log2((26+26+10)^10) = 59.5 bits. You can compute the entropy for any other length and character set the same way.
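As a minimal sketch of the two points above, the following uses Python's standard secrets module (a crypto PRNG) to draw a code and computes the entropy the same way as in the text. The names ALPHABET, entropy_bits and new_share_code are made up for this example.

```python
import math
import secrets
import string

# Case-sensitive alphanumeric alphabet: 26 + 26 + 10 = 62 symbols.
ALPHABET = string.ascii_letters + string.digits


def entropy_bits(alphabet_size: int, length: int) -> float:
    """log2 of the number of possible codes, i.e. length * log2(alphabet_size)."""
    return length * math.log2(alphabet_size)


def new_share_code(length: int = 10) -> str:
    """Draw each character independently from a cryptographic PRNG."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))
```

With these definitions, entropy_bits(62, 10) comes out at about 59.5, and entropy_bits(62, 22) is the first length that exceeds 128 bits.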
That might well be enough, what you should consider is your attacker. Will they be able to perform online attacks only (a lot slower), or offline too (can be very-very fast, especially with specialized hardware)? Also what is the impact if they find one, is it like financial data, or just a random funny picture, or the personal data of somebody, for which you are legally responsible in multiple jurisdictions (see GDPR in EU, or the California privacy laws)?
In general, you could say that 64 bits of entropy is probably good enough for many purposes, and 128 bits is a lot (except maybe for cryptographic keys and very high security applications). As the 59.5 bits above is, well, almost 64, it could be a reasonable tradeoff for better usability in lower security apps.
So in short, there is no definitive answer, it depends on how you want to model this, and what security requirements you want to meet.
Two more things to consider are the validity of these codes, and how many will be issued (how dense will the space be).
I think the usual variables here are the character set for the code, and its length. Validity is more like a business requirement, and the density of codes will depend on your usage and also the length (which defines the size of your code space).
As an example, let's say you have 64 bits of entropy, you have issued 10 million codes already, and your attacker can only perform online attacks by sending requests to your server, at a rate of say 100/second. These figures are likely large overestimates in the attacker's favor, so the result errs on the safe side.
At that rate the attacker makes about 3.15 billion guesses in a year, and each guess hits one of the 10 million valid codes with probability 10^7 / 2^64, so there is roughly a 0.17% chance that somebody finds a valid code in a year. But would your attacker put so much effort into finding one single (random) valid code? Whether that's acceptable depends only on your specific case; only you can tell. If not, you can increase the length of the code, for example.
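The estimate above can be reproduced with a few lines. This is a sketch under the same assumptions as the text (uniformly random codes, independent guesses); hit_probability is a name invented for the example.

```python
import math


def hit_probability(entropy_bits: int, issued_codes: int, guesses: int) -> float:
    """Chance that at least one of `guesses` random tries hits a valid code."""
    p_single = issued_codes / 2**entropy_bits
    # 1 - (1 - p)^n, computed with log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(guesses * math.log1p(-p_single))


# 100 requests per second for one year.
GUESSES_PER_YEAR = 100 * 60 * 60 * 24 * 365

if __name__ == "__main__":
    p = hit_probability(64, 10_000_000, GUESSES_PER_YEAR)
    print(f"{p:.2%}")
```

Changing the first argument shows how quickly the risk falls as the code gets longer: every extra bit of entropy halves the probability.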
I do not use OneDrive, but I can say from Google Drive that:
The links are not that short. I have just counted one and its length is 32.
More than for security, they probably made the links long so as not to run out of combinations, as thousands of Drive files are shared each day. For security, Drive allows you to choose the users that can access a file. If you select "Everyone", then you should be sure that you have no problem with anyone seeing the content behind the link. Even if the link cannot be found "by chance", there is still the possibility that someone else obtains the link from your friend and then shares it, or that it is caught by proxies. Long links should just be complementary to other security measures.
Answering your questions:
Links of any length can be found, but longer links will take more time to find. If you use all alphanumeric characters, 30 is probably enough, but as I said, they should not be the only security measure in your system.
Just make them random and long, and draw the characters from a wide range.

Using "seed" based math to recreate application instances

Okay, so I was thinking today about Minecraft, a game which I'm sure many of you are familiar with. While my question isn't directly related to the game, I find it much simpler to describe my question using the game as an example.
My question is, is there any way a type of "seed" or string of characters can be used to recreate an instance of a program (not in the literal programming sense) by storing a code which when re-entered into this program as a string at run-time, could recreate the data it once held again, in fields, text boxes, canvases, for example, exactly as it was.
As I understand it, Minecraft takes the string of ASCII characters you enter, all of which are really numbers, and performs a series of operations on it which evaluate to some type of hash or finite number. This number (again, as I understand it) is the representation of the string you entered. So it makes sense that a string parsed by this algorithm will always evaluate to the same hash: 1 + 1 will always equal 2, so a seed's value must always equal that seed's value in the end. In doing so you have the ability to replicate worlds exactly, by entering this sort of key, which is evaluated the same way on every machine.
Now, if we can exactly replicate worlds like this, is it possible to bring it into a more abstract concept like the following?
Say you have an application like Microsoft Word. Word saves the data you have entered as a file on your hard drive; it holds formatting data, the strings you've entered, the format of the file, all of that in a physical file. Now imagine that when you have entered your essay into Word, instead of saving it and bringing your laptop to school, you click on "parse" and, instead of a file being created, you are given a hash code. Now you go to school, and you know you have to print it. So you log onto the computer and open Word. Instead of "open" there is now an option called "evaluate"; you click it and enter the hash your other computer produced, and it creates the exact essay you have written.
Is this possible, and if so, are there obvious implementations of this that I simply am not thinking of, or that are so much a part of everyday life that I don't recognize them? And finally, if it is possible, what methods and algorithms would go into such a thing?
[EDIT]
I had to do some research on the anatomy of a seed and I think this explains it well
"The limit is 32 characters or, for a numeric seed, 19 digits plus the minus sign. Numeric seeds can range from -9223372036854775808 to 9223372036854775807, which is a total of 18446744073709551616. Text strings entered will be "hashed" to one of the numeric seeds in the above range. The "Seed for the World Generator" window only allows 32 characters to be entered and will not show or use any more than that."
BUT looking back on it, lossless compression IS EXACTLY what I was describing. After re-reading the wiki page, I remembered that (you are very correct) the seed only takes part in the generation; the final data is stored as a "physical" file on the HDD, which (again, you are correct) is raw uncompressed data in a file.
So in retrospect, I believe I was describing lossless compression, trying in my mind to figure out how the seed was able to replicate the exact same world, forgetting the seed was only responsible for generating the code, not the saving or compression of it.
So thank you for your help guys! It's really appreciated I believe we can call this one solved!
There are several possibilities to achieve this "string" that recovers your data. However they're not all applicable depending on the context.
An actual seed, which initializes for example a pseudo-random number generator, and then lets you recreate the same sequence of pseudo-random numbers (see this question).
This is possibly similar to what Minecraft relies on, because the whole process of how to create a world based on some choices (possibly pseudo-random choices) is known in advance. Even if we pretend that we have random numbers, computers are actually deterministic, which makes this possible.
If your document were generated randomly then this would be applicable: with the same seed, the same gibberish comes out.
Some key-value dictionary, or hash map. The values then have to be accessible to both sides, and the string is the key that lets you retrieve the value.
Think for example of storing your word file on an online server, then your key is the URL linking to your file.
Compressing all the information that is in your data into the string. This is much harder, and there are strong limits due to the entropy of the data. See Shannon's source coding theorem for example.
You would be better off (as in, it would be easier) to just compress your file with a usual algorithm (zip or 7z or something else), rather than reimplementing it yourself, especially as soon as your document starts having fancy things (different styles, tables, pictures, unusual characters...)
With the simple hypothesis of 27 possible characters (26 letters and the space), Shannon himself shows in Prediction and Entropy of Printed English (Bell System Technical Journal, 30: 1. January 1951 pp 50-64, online version) that there is about 2.14 bits of entropy per letter in English. At 8 bits per character, your 32 character string holds 256 bits, which is enough for only about 120 characters of English text.
While this is significantly better than the 8 bits we use for each ASCII character, it also shows it is very likely to be impossible to encode a document in English in less than a fourth of its size. Then you'd still have to add punctuation, and all the rest of the fuss.
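The first and third options above can be illustrated with a short sketch. It assumes Python's random module as a stand-in for a game's world generator and zlib as the "usual algorithm"; generate_world is a hypothetical name and the "world" is just a list of numbers.

```python
import random
import zlib


def generate_world(seed: str, size: int = 8) -> list[int]:
    """Derive content deterministically from a seed: same seed, same output."""
    rng = random.Random(seed)
    return [rng.randrange(4) for _ in range(size)]


def pack(document: bytes) -> bytes:
    """Lossless compression: the packed bytes alone are enough to restore the input."""
    return zlib.compress(document)


def unpack(packed: bytes) -> bytes:
    return zlib.decompress(packed)
```

The difference between the two is where the information lives. A seed only selects one output of a generator that both sides already have, while a compressed file has to carry all of the document's information, so it cannot shrink below the entropy of the data.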

Dynamics CRM 2011 Import Data Duplication Rules

I have a requirement in which I need to import data from excel (CSV) to Dynamics CRM regularly.
Instead of using some simple Data Duplication Rules, I need to implement a point system to determine whether a record is considered a duplicate or not.
Let me give an example. For example these are the particular rules for Import:
First Name, exact match, 10 pts
Last Name, exact match, 15 pts
Email, exact match, 20 pts
Mobile Phone, exact match, 5 pts
And then the Threshold value => 19 pts
Now, if a record has a First Name and Last Name that match an existing record in the entity, it scores 25 pts, which is higher than the threshold (19 pts), so the record is considered a Duplicate.
If, for example, the record only has the same First Name and Mobile Phone, it scores 15 pts, which is lower than the threshold, so it is considered a Non-Duplicate.
What is the best approach to achieve this requirement? Is it possible to use the default Import Data functionality in MS CRM? Is there any 3rd party add-on that meets the requirement above?
Thank you for all the help.
Updated
Hi Konrad, thank you for your suggestions, let me elaborate here:
Excel. You could filter out the data using Excel and then, once you've obtained a unique list, import it.
Nice one, but I don't think it is really workable in my case: the data will be coming regularly from the client in moderate numbers (hundreds to thousands of records). Typically the client won't check the data for duplicates.
Workflow. Run a process removing any instance calculated as a duplicate.
Workflow is a good idea. However, since it is processed asynchronously, my concern is that in some cases the user may already have updated the inserted data before the workflow finishes, creating data inconsistency or, at the very least, a confusing user experience.
Plugin. On every creation of a new record, you'd check if it's to be regarded as duplicate-ish and cancel its creation (or mark for removal).
I like this approach. So I just import as usual (for example, to the contact entity), but I already have a plugin in place that is triggered every time a record is created. The plugin checks whether the record is duplicate-ish or not and takes the necessary action.
I haven't been fiddling a lot with duplicate detection but looking at your criteria you might be able to make rules that match those, pretty much three rules to cover your cases, full name match, last name and mobile phone match and email match.
If you want to do the points system I haven't seen any out of the box components that solve this, however CRM Extensions have a product called Import Manager that might have that kind of duplicate detection. They claim to have customized duplicate checking. Might be worth asking them about this.
Otherwise it's custom coding that will solve this problem.
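Whatever hosts the custom code, the scoring logic itself is small. Here is a language-neutral sketch in Python using the weights and threshold from the question; the field names, the treatment of blank fields as non-matching, and the reading of the threshold as "19 points or more" are assumptions. An actual CRM plugin would be written against the CRM SDK instead.

```python
# Points per exactly matching field, taken from the question.
RULES = {"first_name": 10, "last_name": 15, "email": 20, "mobile_phone": 5}
THRESHOLD = 19  # assumed to mean "19 points or more"


def score(new: dict, existing: dict) -> int:
    """Sum the points of every field that is filled in and matches exactly."""
    return sum(
        points
        for field, points in RULES.items()
        if new.get(field) and new.get(field) == existing.get(field)
    )


def is_duplicate(new: dict, existing: dict) -> bool:
    return score(new, existing) >= THRESHOLD
```

An import routine would call is_duplicate for the incoming record against each candidate record and reject or flag it on the first match.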
I can think of the following approaches to the task (depending on the number of records, repetitiveness of the import, automatization requirement etc.) they may be all good somehow. Would you care to elaborate on the current conditions?
Excel. You could filter out the data using Excel and then, once you've obtained a unique list, import it.
Plugin. On every creation of a new record, you'd check if it's to be regarded as duplicate-ish and cancel its creation (or mark for removal).
Workflow. Run a process removing any instance calculated as a duplicate.
You also need to consider the implication of such elimination of data. There's a mathematical issue. Suppose that the uniqueness' radius (i.e. the threshold in this 1D case) is 3. Consider the following set of numbers (it's listed twice, just in different order).
1 3 5 7 -> 1 _ 5 _
3 1 5 7 -> _ 3 _ 7
Are you sure that's the intended result? Under some circumstances, you can even end up with sets of records of different sizes (only depending on the order). I'm a bit curious on why and how the setup came up.
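The order dependence shown above is easy to reproduce. The sketch below assumes a greedy rule: a value is dropped when it lies closer than the radius to any value already kept. greedy_dedupe is a name made up for the example.

```python
def greedy_dedupe(values: list[int], radius: int = 3) -> list[int]:
    """Keep a value only if it is at least `radius` away from every kept value."""
    kept: list[int] = []
    for value in values:
        if all(abs(value - other) >= radius for other in kept):
            kept.append(value)
    return kept


if __name__ == "__main__":
    print(greedy_dedupe([1, 3, 5, 7]))  # keeps 1 and 5
    print(greedy_dedupe([3, 1, 5, 7]))  # keeps 3 and 7
    print(greedy_dedupe([3, 1, 5]))     # keeps only 3, while [1, 3, 5] keeps two
```

The same input set therefore yields different survivors, and sometimes a different number of survivors, depending purely on the order in which the records arrive.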
Personally, I'd go with the plugin, if the above is OK by you. If you need to make sure that some of the unique-ish elements never get omitted, you'd probably be best off applying a test algorithm to a backup of the data. However, that may defeat its purpose.
In fact, it sounds so interesting that I might create the solution for you (just to show it can be done) and blog about it. What's the deadline?

Get random site links in bash [duplicate]

Possible Duplicate:
Get random site names in bash
I'm making a program for university that has to find the occurrences of words on the web. I need an algorithm that finds sites, counts the words used, then records them and sorts them by how many times they are used. Therefore the more sites my program checks, the better. At first I was thinking of generating random IPs, but the problem is that the process takes far too long (I left the computer searching the whole night and it found only 15 sites). I guess this is because sites' IPs aren't distributed evenly on the web and most IPs belong to users or other services. Now I have a new approach in mind and I wanted to know what you guys think:
what if I make random searches using some sort of a dictionary through google? The dictionary would start empty at the beginning and each time I perform a search, I check one site and add to the dictionary only the words that occur once, so that this won't send me to that site again, by corrupting the occurrences.
Is this easy?
The first thing I want to do is to also search random pages of the Google results and not only the first one. How can this be done? I can't figure out how to calculate the max number of pages for a search or how to go directly to a specific page.
thanks
While I don't think you could (or should) do this in bash alone, take a look at the Google Custom Search API and this question. It lets you query Google search programmatically.
As for what queries to use, you could resort to picking words randomly from a dictionary file, though that would not give you a uniform distribution, as words like 'cat' are more popular than 'epichorial', say. If you require something which takes those differences into account, you can use a word frequency dictionary, although producing one seems to be the point of your research in itself, so perhaps that would not be appropriate.
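As a rough sketch of how a random word and a specific results page could be combined into one request, the following builds a query URL for the Custom Search JSON API without sending it. The parameter names (key, cx, q, start, num) reflect my understanding of that API and should be checked against its documentation; the credentials are placeholders, and build_search_url and random_query are hypothetical helpers.

```python
import random
from urllib.parse import urlencode

API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"
RESULTS_PER_PAGE = 10


def build_search_url(query: str, page: int, api_key: str, engine_id: str) -> str:
    """Build the request URL for a given 1-based results page."""
    params = {
        "key": api_key,    # placeholder credential
        "cx": engine_id,   # placeholder search engine id
        "q": query,
        "start": (page - 1) * RESULTS_PER_PAGE + 1,  # index of the first result
        "num": RESULTS_PER_PAGE,
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


def random_query(words: list[str], rng: random.Random) -> str:
    """Pick one dictionary word uniformly at random."""
    return rng.choice(words)
```

The API is, as far as I know, rate limited and capped in how deep into the results it will go, so the page number cannot be chosen arbitrarily large.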

How do I hash an integer into a very small string?

I need a function that, given a salt integer and a value integer will return a small hash string. Calling the function with 1 and 56 might return "1AF3". Calling it with 2 and 56 might return "C2FA".
Background info:
I have a web app (written in C# if that matters) that stores employee Id values as integers. Users need to be able to see a consistent representation of that Id, but no user should see the actual Id, or the same representation of that Id as seen by another user.
For example, suppose there is an Employee with the Id of 56.
When User 1 logs in, wherever he sees that employee, he sees "1AF3" or something. He might see this employee on different pages in the app, and its Id should always be 1AF3 so he knows it's the same guy.
When User 2 logs in, should he encounter that same employee, he would always see "C2FA", or something. Same goes for User 2: wherever he is in the system, he would see that one employee represented by that same string.
Should User 2 look over the shoulder of User 1 while User 1 is logged in, User 2 should not be able to recognize any of his employees on User 1's screen, because this hash should be irreversible.
Does this make sense?
One additional requirement is that since the users will be discussing these employees in email, on the phone, and in faxes, the hash would need to be of a minimum size and not contain non-alphanumeric characters. 10 characters or fewer would be ideal.
Maybe there is a way to "collapse" a SHA-256 result into fewer characters since the whole alphabet could be used? I have no idea.
Update: Another walk-through
Thanks everyone for giving this a shot but it seems like I am doing a bad job explaining it or something.
Let's pretend you and me are both users of this system. You're Fred and I'm Chris. Your UserId is 2 and my UserId is 1. Let's also assume there are 5 Employees in the system. Employees are not users. You can think of them as products, or whatever you want. I'm just talking about 5 generic entities that you, Fred, and I, Chris, each deal with.
Fred, every time you log in, you need to be able to uniquely identify each employee. Every time I, Chris, log in, I also need to work with employees and I too will need to be able to identify them uniquely. But should I ever look over your shoulder while you are managing employees, I should not be able to figure out which ones you are managing.
So, while in the database the employee IDs are 1, 2, 3, 4, and 5, you and I do not see them that way in our interface. I might see A, B, C, D, and E, and you might see F, G, H, I, and J. So while E and J both represent the same employee, I can't look at your screen while you are working with your Employee "J" and know that you are working with Employee 5, because for me that employee is called Employee "E".
So, Fred and Chris can each work with the same set of employees, but if they were to see each other's work, or discussion in an email, they would not be able to know what employees the other guy was talking about.
I was thinking I could achieve this "real-time user-dependent EmployeeID" by taking the real employee ID and hashing it using the user ID as the salt.
Since Fred and Chris each need to discuss employees over email and the telephone with their clients and customers, I'd like the IDs that they use in these discussions to be as simple as we can get them.
Conceptually, here is what you want:
You have a set of employee IDs which you can represent as elements of a given space S. You also have some users, and you want each user to see a permutation of space S which is specific to that user, and such that the details of that permutation cannot be guessed by any other user.
This calls for symmetric encryption. Namely, each employee ID is a numerical value (e.g. a 32-bit integer), and a user 'A' sees employee x as E_k(x), where k is a secret key which is specific to 'A' and which 'B' cannot guess. So you need two things:
a block cipher which can work with short values (e.g. 32-bit words);
a method which turns user ID into the user-specific key.
For the block cipher, the trouble is that short blocks are a security issue for the normal usage of a block cipher (i.e. to encrypt long messages). So all published, secure block ciphers use large blocks (64 bits or more). 64 bits can be represented over 11 characters by using uppercase and lowercase letters, and digits (62^11 is somewhat greater than 2^64). If that's good enough for you, then use 3DES. If you want something smaller, you will have to design your own cipher, something which is not recommended at all. You may want to try KeeLoq: see this paper for pointers (KeeLoq is cryptographically "broken" but not too much, given your context). There is a generic method for building block ciphers with arbitrary block sizes, given a seekable stream cipher, but this is mostly theoretical (implementation requires waddling through high-precision floating point values, which can be done but is very slow).
For the user-specific key: you want something that the Web application can compute, but not users. This means that the Web application knows a secret key K; then, the user-specific encryption key can be the result of HMAC (with a good hash function, such as SHA-256) applied over the user ID, and using key K. You then truncate the HMAC output to the length you need for the user-specific key (for instance, 3DES needs a 24-byte key).
C# has TripleDES and HMAC/SHA-256 implementations (in System.Security.Cryptography namespace).
(There is no generally accepted secure standard for a block cipher with 32-bit blocks. This is still an open research area.)
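The two pieces around the cipher can be sketched with Python's standard library: deriving a per-user key with HMAC/SHA-256 truncated to 24 bytes, and writing a 64-bit block as 11 alphanumeric characters. The block cipher step itself (3DES in the answer) is left out here, and user_key, to_base62 and the encoding of the user ID as decimal text are assumptions made for the example.

```python
import hashlib
import hmac

# 10 digits + 26 uppercase + 26 lowercase = 62 symbols.
BASE62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"


def user_key(master_key: bytes, user_id: int, length: int = 24) -> bytes:
    """HMAC-SHA256 of the user ID under the server secret, truncated to `length`."""
    message = str(user_id).encode("ascii")
    return hmac.new(master_key, message, hashlib.sha256).digest()[:length]


def to_base62(block: int, width: int = 11) -> str:
    """Write a 64-bit block as a fixed-width base-62 string."""
    if not 0 <= block < 2**64:
        raise ValueError("block must fit in 64 bits")
    digits = []
    for _ in range(width):
        block, remainder = divmod(block, 62)
        digits.append(BASE62[remainder])
    return "".join(reversed(digits))
```

The string shown to user 'A' for employee x would then be to_base62 applied to the 64-bit ciphertext of x under user_key(K, A), where K never leaves the server.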
There might be problems with this approach but you could do it like this:
Make an array holding all your symbols (say a 25 element array)
Hash your string using whatever hash function
Pick a number of octets out of the resulting hash (4 octets if you want 4 symbols in our resulting string) from predefined positions
For each octet compute index = octet % array_size. The index gives the position for each of your symbols
Again, I have almost zero experience with cryptography, hash functions and the like so you may want to take this with a grain of salt.
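The steps above could look roughly like this. The sketch assumes SHA-256 as the hash, the first four octets as the "predefined positions", and a 32-symbol alphabet instead of 25 so that the modulo step divides 256 evenly; short_code and SYMBOLS are names invented for the example.

```python
import hashlib

# 32 symbols: uppercase letters without I and O, plus the digits 2 to 9.
SYMBOLS = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"


def short_code(salt: int, value: int, length: int = 4) -> str:
    """Hash salt and value together, then map the first octets to symbols."""
    digest = hashlib.sha256(f"{salt}:{value}".encode("ascii")).digest()
    return "".join(SYMBOLS[octet % len(SYMBOLS)] for octet in digest[:length])
```

Two caveats apply. With only 32^4 (about a million) possible codes, two employees can end up with the same code for one user, and since no secret is involved, anyone who knows the scheme and a user's ID can recompute that user's codes.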
There are many ways to "de-anonymize" information. It would help if you could be more specific about the context and what "assets" you are really trying to protect here, against who. See our faq.
E.g., might one user know the number of another user? They could probably find it out quickly if they discovered thru other means the correspondence between 1AF3 and C2FA.
But specifically for your narrower question, a good hash will already be well-mixed, so I'd think you could just truncate, e.g., a SHA-256 hash value. But Thomas will probably know the definitive answer there.
Here are my thoughts getting to the point of it (I figure if you talked out your question, I'll talk out my answer. I'm guessing you'll find that helpful):
All hail Thomas, because he has clearly established his dominance.
0-9, A-F is a representation of the data. You can make it A-Z, 0-9, exclude some uncommon letters to leave 32 symbols, and represent five bits per character.
You can basically say that all hashes have collisions. If you approach saturation, you'll end up with two people who have the same hash. Hashes are also one-way. You would need a mapping that allows reversal. If you have a reverse mapping, why not fill it with random strings which don't collide?
You are obfuscating a limited set of data. With a large and secret salt, you can prevent reversal. That said, you're trading one ID for another. The ID is still unique and constant, so I wonder how this enhances security.
I have some clients where if I were to see something like this, I'd put money that the employee ID was a SSN. I hope you're not doing that.
Employee ID and Employee Alternate ID are what you are coming up with. Since they have to be reversible to you but not the public, you need to store that in a two way pairing and keep it secret. Since there's risk of collision with a hash and you have to have a reverse map anyway, the alternate id might as well be a random string. An ID should be arbitrary anyway, and I would really like to know the perceived security benefit of your approach with two ids for one employee; it makes me think of Mission Impossible and the NOC list.
Just an idea for an approach based on the extra information you have added. The security of this idea is very, very light and I would not recommend it if you think people are going to attempt to crack it, but it's worth throwing in the pot.
You could create a personal hash by bit-shifting the employee Id based on your own employee Id. Then by adding whatever extra obfuscation code you need to the resulting number, such as converting it to hex. E.g.
string hashedEmployeeId = (employeeIdToHash << myEmployeeId).ToString("X");
This will generate hashed employee Ids based on your own Id, but you may run into problems when the employee Ids get large (especially your own!), because bits shifted past the top of the integer are lost.
Just to reiterate, this on its own isn't really very secure, but it might help you on your way.
Using 4 characters you would have a total of 36^4 = 1679616 possible keys.
You could permute all the possible combinations of employees together.
If you calculate the square root you get 1296.
You could then generate an ordered table with all the possible keys in the first column, and then randomly distribute ids from 1 to 1296 into the other columns. You would get something like this:
key a b
AAAA 386 67
AAAB 86 945
...
With this solution you would have a lookup table that scales up to 1296 employees. However, if you consider adding an extra character to your key, you would get a lot more possibilities: (36^5)^0.5 = 7776.
With this solution, guessing a key would give you one chance in 1296 or 7776 of seeing information about an employee.
Performance may be an issue, but I think you can manage this using a cache, or maybe even by keeping all the data loaded in memory and using a kind of tree map to find the corresponding key for two given ids.
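A small sketch of this lookup table, assuming the two id columns stand for a user id and an employee id, and using an in-memory dictionary as the "tree map". build_table is a hypothetical name; for real use the shuffle should come from random.SystemRandom() rather than a seeded generator.

```python
import itertools
import math
import random
import string

ALPHABET = string.ascii_uppercase + string.digits  # 36 symbols


def build_table(alphabet: str, key_length: int, rng: random.Random) -> dict:
    """Assign every (id_a, id_b) pair its own randomly chosen key."""
    keys = ["".join(p) for p in itertools.product(alphabet, repeat=key_length)]
    side = math.isqrt(len(keys))  # 1296 for 36 symbols and 4 characters
    rng.shuffle(keys)
    pairs = itertools.product(range(1, side + 1), repeat=2)
    # zip stops once every pair has a key; leftover keys stay unused.
    return dict(zip(pairs, keys))
```

Calling build_table(ALPHABET, 4, random.SystemRandom()) creates about 1.7 million entries, which is slow to build but easy to keep in memory; the table has to be stored, because it cannot be regenerated from anything else.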

Resources