How do sites like tinyurl generate urls?


I looked at tinyurl, tinypic, imgur and youtube. I thought they would use a text-safe representation of an index and use it as the primary ID in their DB. However, passing the keys to Convert.FromBase64String("key") yields no result and throws an exception, so these sites don't appear to use Base64. What are they using? What might I want to use if I were building a YouTube-like site or a tinyurl clone?

I'm guessing they have developed their own encoding, which is simply an alphanumeric equivalent of the IDs in their database. I'm sure they don't generate random strings, simply because this would cause catastrophic overflows at a certain point.
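An "alphanumeric equivalent" of a database ID can be sketched as a base-62 conversion. This is only an illustration of the idea, not what any of the sites named above are known to do; the alphabet order and the function names `encode_id` and `decode_key` are assumptions made for the example.

```python
import string

# Assumed alphabet: digits, then lowercase, then uppercase (62 symbols).
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
BASE = len(ALPHABET)


def encode_id(n: int) -> str:
    """Turn a non-negative integer ID into a short alphanumeric key."""
    if n < 0:
        raise ValueError("ID must be non-negative")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n > 0:
        n, rem = divmod(n, BASE)
        chars.append(ALPHABET[rem])
    # Digits were produced least-significant first, so reverse them.
    return "".join(reversed(chars))


def decode_key(key: str) -> int:
    """Recover the integer ID from a key produced by encode_id."""
    n = 0
    for ch in key:
        n = n * BASE + ALPHABET.index(ch)
    return n
```

Because the mapping is reversible, the key never has to be stored: the server decodes it back to the integer primary key on each request.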

I don't know about TinyURL, Tinypic, etc. but shorl.com uses something called koremutake.
If I were to develop such a system, I guess some sort of short hash or plain random strings could be possible choices.

My guess is that they simply generate a random string and use that as the primary key. I don't really see a reason to do anything else.
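A minimal sketch of the random-string approach, assuming a 6-character key and using an in-memory set as a stand-in for the database's primary-key column. The name `new_random_key` and the retry limit are hypothetical.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits


def new_random_key(existing: set, length: int = 6, max_tries: int = 10) -> str:
    """Draw random keys until one is not already taken, then reserve it."""
    for _ in range(max_tries):
        key = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if key not in existing:
            existing.add(key)
            return key
    # Repeated collisions suggest the key space is filling up.
    raise RuntimeError("could not find a free key; consider a longer length")
```

In a real system the uniqueness check would more likely be a unique constraint in the database, with the insert retried on conflict.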

Related

Automatically convert uuids to url safe strings using node-pg

I use uuid for just about every ID in my REST backend powered by node and postgres. I also plan to use validate.js to make sure the queries are formatted correctly.
In order to shorten the URLS for my application, I would like to convert all UUIDS used by my backend into URL safe strings when exposed to the REST consumer.
The problem is that, as far as I can tell, there is no such setting within node-pg. And node-pg usually returns the query results as JSON objects using either strings or numbers. That makes it hard to automatically convert them.
I could of course just go through every single rest endpoint and add code that automatically converts all the types where I know a UUID would be. But that would violate DRY and also be a hotbed for bugs.
I could also try to automatically detect strings that look like UUIDs and then just convert them, but that also seems like it may introduce lots of bugs.
One ideal solution would be some sort of custom code injection into node-pg that automatically converts uuids. Or maybe just some pg function I could use to automatically convert the uuids within the pg-queries themselves (although that would be a bit tedious).
Another ideal solution might be some way to use validate.js to convert the outputs and inputs during validation. But I don't know how I could do this.
So basically, what would be a good way to automatically convert UUIDs in node-pg to URL-safe (shorter) strings without having to add a bit of code to every single endpoint?

I think this is what I want: https://github.com/brianc/node-pg-types
It lets me set custom converters for each datatype. The input will probably still have to be converted manually
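The conversion itself, independent of where it is hooked in, can be sketched as follows. This is written in Python purely to show the encoding; it does not use node-pg or node-pg-types, and the function names `uuid_to_short` and `short_to_uuid` are hypothetical. It assumes URL-safe Base64 with the padding stripped, which turns the 36-character text form of a UUID into 22 characters.

```python
import base64
import uuid


def uuid_to_short(u: uuid.UUID) -> str:
    """Encode the 16 raw bytes of a UUID as unpadded URL-safe Base64."""
    return base64.urlsafe_b64encode(u.bytes).rstrip(b"=").decode("ascii")


def short_to_uuid(s: str) -> uuid.UUID:
    """Restore the padding and decode back to the original UUID."""
    padded = s + "=" * (-len(s) % 4)
    return uuid.UUID(bytes=base64.urlsafe_b64decode(padded))
```

The same pair of functions would be applied on output (database to client) and on input (client to database).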

Compare 2 identical excel/ppt/csv files in nodejs

I have a requirement where I want to compare 2 identical excel/ppt/csv files which may have exact same content but may be created at different point in time.
I want to compare only the file contents in whatever manner possible using any nodejs package.
But I couldn't figure out an easy way to do it; neither stream comparison nor buffer comparison helped.
I've done more research but not much success and I'm just wondering how it would be possible to ignore certain things such as time stamp and any other metadata while doing comparison and only consider contents to match up.
I've tried stream-compare, stream-equal, file-compare, buff1.equals(buff2) and a few others, but none of them seem to have worked for my requirement.
But I didn't find any node package on the web which does what I am looking for.
Any insights or any suggestions as how it can be achieved?
Thanks in advance any help would be appreciated.

Search for a package that computes a hash of the document, for example crypto. Calculate a hash (SHA-256) for each of the two documents and compare them. If the hashes match, the document contents are the same (there is still a chance of a hash collision, but that depends on the hash algorithm you are using; SHA-256 gives you decent confidence that the documents are identical). Check this thread for more details: Obtaining the hash of a file using the stream capabilities of crypto module (ie: without hash.update and hash.digest)
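The hash comparison described above can be sketched like this, in Python rather than Node.js and with hypothetical function names. Note that it hashes the raw bytes of each file, so two files whose embedded metadata differs (for example, timestamps stored inside an .xlsx or .pptx archive) will produce different hashes even when the visible content is the same.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(path_a: str, path_b: str) -> bool:
    """True when both files have identical byte content."""
    return sha256_of_file(path_a) == sha256_of_file(path_b)
```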

What is the code/algorithm for generating a shortened url?

I've searched around for a while now on how to generate a shortened url (e.g. how bit.ly or goo.gl work) but have been unsuccessful.
I presumed it would be something like:
baseN(hash(long_url))
But I always end up with a very long digest instead of something short like 6 characters.
Is it safe to just truncate the digest before encoding it (is encoding it even necessary - I believe it is for making it URL 'safe' but wanted to ask) and is there not a possibility of collisions when only dealing with six characters?
It seems like (warning: I don't know maths) a factorial of 6! (e.g. 6*5*4*3*2*1) would result in only 720 combinations.
I also remember reading somewhere that with a hash table of 100k items, that a rough calculation for the number of collisions could yield ~17% chance of collision. That feels like a pretty large percentage to me?
The following Python code is based off my understanding of how I might do this type of url shortening:
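import hashlib, base64
message = hashlib.sha512()
# hashlib needs bytes, not str, in Python 3
message.update("https://www.python.org/dev/peps/pep-0537/".encode("utf-8"))
# Base64-encode the hex digest, then keep only the first six characters
short = base64.urlsafe_b64encode(
    message.hexdigest().encode("utf-8")
)[:6].decode("utf-8")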
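On the counting question: 6! counts the orderings of six distinct items, whereas the number of six-character keys over an alphabet of k symbols is k to the power 6. A rough sketch of both the key-space size and the usual birthday-bound approximation follows; the 62-symbol alphabet is an assumption, and the probability is an approximation rather than an exact figure.

```python
import math


def key_space(alphabet_size: int, length: int) -> int:
    """Number of distinct keys of a given length (repetition allowed)."""
    return alphabet_size ** length


def collision_probability(n_items: int, space: int) -> float:
    """Birthday-bound approximation: 1 - exp(-n(n-1) / (2N))."""
    return 1.0 - math.exp(-n_items * (n_items - 1) / (2 * space))


space = key_space(62, 6)  # 56_800_235_584 possible keys, not 6! = 720
p = collision_probability(100_000, space)  # chance of any collision among 100k keys
```

Whatever the exact probability, a collision only matters if it goes undetected, which is why generated keys are normally checked against the stored ones before use.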

There is no effective function to do this. You need to:
Store the URL in a database
Generate a unique ID (or if you already have the url, reuse the id)

You may be looking for a bidirectional function, as mentioned in How to code a URL shortener?
But I also recommend that you not over-complicate this unless it is really a requirement for your scenario.
A much simpler approach would be to just keep a record of what you've mapped:
... there is no compression algorithm, but there is a lookup and generation algorithm. When a URL shortener gets a new URL, it has to create a new short URL that is not yet taken and return this. It will then store the short URL and the long URL in a key-value store and use this at lookup time.
https://www.quora.com/What-are-the-http-bit-ly-and-t-co-shortening-algorithms
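The store-and-look-up approach from the answers above can be sketched with an in-memory SQLite table. The table name `links` and the functions `open_store`, `shorten` and `resolve` are hypothetical, and the integer ID is returned as-is; turning it into a shorter alphanumeric string would be a separate step.

```python
import sqlite3
from typing import Optional


def open_store() -> sqlite3.Connection:
    """Create an in-memory store mapping auto-assigned IDs to URLs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE links ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "url TEXT UNIQUE NOT NULL)"
    )
    return conn


def shorten(conn: sqlite3.Connection, url: str) -> int:
    """Return the existing ID for a known URL, or insert it and return the new ID."""
    row = conn.execute("SELECT id FROM links WHERE url = ?", (url,)).fetchone()
    if row is not None:
        return row[0]
    cur = conn.execute("INSERT INTO links (url) VALUES (?)", (url,))
    conn.commit()
    return cur.lastrowid


def resolve(conn: sqlite3.Connection, link_id: int) -> Optional[str]:
    """Look up the long URL for an ID; None when the ID is unknown."""
    row = conn.execute("SELECT url FROM links WHERE id = ?", (link_id,)).fetchone()
    return row[0] if row else None
```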

I have a simple database of content. Should I hash the "id" so that people don't look over it in the URL?

Is it recommended to create a column (unique key) that is a hash.
When people view my URL, it is currently like this:
url.com/?id=2134
But, people can look over this and data-mine all the content, right?
Is it RECOMMENDED to go 1 extra step to make this through hash?
url.com?id=3fjsdFNHDNSL
Thanks!

The first and most important step is to use some form of role-based security to ensure that no user can see data they aren't supposed to see. So, for example, if a user should only see their own information, then you should check that the id belongs to the logged-in user before you display it.
As a second level of protection, it's not a bad idea to have a unique key that doesn't let you predict other keys (a hash, as you suggest, or a UUID). However, that still means that, for example, a malicious user who obtained someone else's URL (e.g. by sniffing a network, by reading a log file, by viewing the history in someone's browser) could see that user's information. You need authentication and authorization, not simply obfuscating the IDs.

It sort of depends on your situation, but off hand I think if you think you need to hash you need to hash. If someone could data mine by, say, iterating through:
...
url.com?id=2134
url.com?id=2135
url.com?id=2136
...
Then using a hash for the id is necessary to avoid this, since it will be much harder to figure out the next one. Keep in mind, though, that you don't want to make the hash too obvious, so that a determined attacker would easily figure it out, e.g. just taking the MD5 of 2134 or whatever number you had.
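One way to avoid an easily guessed hash such as a plain MD5 of the number is a keyed hash (HMAC), where the result cannot be reproduced without a server-side secret. This is a sketch under assumptions: `SECRET_KEY` is a placeholder, `public_token` is a hypothetical name, and the truncation length is arbitrary. It does not replace the access checks discussed in the other answers.

```python
import hashlib
import hmac

# Placeholder only: a real deployment would load a long random secret
# from configuration and never hard-code it.
SECRET_KEY = b"replace-with-a-real-secret"


def public_token(record_id: int, length: int = 16) -> str:
    """Derive a hard-to-guess token for a numeric ID using HMAC-SHA256."""
    mac = hmac.new(SECRET_KEY, str(record_id).encode("ascii"), hashlib.sha256)
    return mac.hexdigest()[:length]
```

Since the token cannot be reversed, it would be stored in its own unique column and used for the lookup, for example url.com?id=<token>.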

Well, the problem here is that an actual Hash is technically one way. So if you hash the data you won't be able to recover it on the receiving side. Without knowing what technology you are using to create your web page it's hard to make any concrete suggestions, but if you must have sensitive information in your query string then I would recommend that you at least use a symmetric encryption algorithm on it to keep people from simply reading off the values and reverse engineering things.
Of course if you have the option - it's probably better to not have that information in the query string at all.

How can I shorten a string and later get back the original contents?

I have a really long string that I need to pass in a URL, say 10,000 characters. Anyone know a good way to shorten it to under 2,000 chars, and then on a server somehow get the original back?
This is Objective-C talking to Ruby, but that shouldn't matter.

Can you post the data?
If you use GET, the practical maximum length of a URL is around 4,000 characters, though the exact limit depends on the browser and server. If you POST it, you have no such constraint (except timeouts, memory, etc.).
This article talks about doing a post from objective-c

Are you sure you have to pass it in the URL? Maybe POST data or a session would be more appropriate. Otherwise you could store the string in a database and return the key of the inserted record as a URL parameter. If this is a security concern (as people can just change the number if it is an integer key), you could create a UUID as the key.

Store it in a database and then just pass the id of the string in the url.

You can try running it through Base64. If the string is guaranteed to have only a subset of possible characters -- for example, [a-zA-Z0-9] -- it can be shortened even more by converting these to unique ordinals and using a higher base encoding.
But it would probably be easier to just use POST.

Well, compress it and Base64 encode the result. If the string has a very specific format, a custom encoding could even yield a better compression. Can you give an example?
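A minimal sketch of "compress, then Base64", shown in Python rather than Objective-C or Ruby, with hypothetical function names `pack` and `unpack`. How much the string shrinks depends entirely on the data: repetitive text compresses well, while text that already looks random may not get shorter at all once the Base64 overhead is added.

```python
import base64
import zlib


def pack(text: str) -> str:
    """Compress the text and encode it with the URL-safe Base64 alphabet."""
    compressed = zlib.compress(text.encode("utf-8"), 9)
    return base64.urlsafe_b64encode(compressed).decode("ascii")


def unpack(token: str) -> str:
    """Reverse pack(): Base64-decode, then decompress."""
    return zlib.decompress(base64.urlsafe_b64decode(token)).decode("utf-8")
```

The receiving server would apply the equivalent of `unpack` using its own zlib and Base64 libraries.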

I would persist this information to a database (or any other persistence source) and then pass a reference to it in the URL.
Both source and destination will require access to the database, but that isn't an issue most of the time.
