How to code a string to a unique number and decode it

Is there any way to code a long string to a unique number (integer) and then decode this number to original string? (I mean to reduce size of long string)

The simple answer is no.
The complex answer is maybe.
What you are looking for is compression. Compression can reduce the size of the string, but there is no guarantee as to how small it can make it. In particular, you can never guarantee that the result will fit into an integer of a given fixed size.
There are concepts like "hashing" which may help you do what you want depending on exactly what you are trying to do with this number.
Alternatively if you use the same string in a lot of different places then you can store it once and pass references/pointers to that single instance of the String around.
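To make the size point concrete, here is a minimal sketch of a reversible string-to-integer mapping. It relies on Ruby's arbitrary-precision Integer; the helper names string_to_int and int_to_string are made up for this illustration, and the sketch assumes the string does not start with a NUL byte. The integer needs about as many bits as the string has, so nothing gets smaller.

```ruby
# Reversible string <-> integer mapping (illustrative helper names).
# Assumption: the string does not start with a NUL byte, because leading
# zero bytes would vanish in the integer form.

def string_to_int(str)
  hex = str.unpack1('H*')        # the string's bytes as hex digits
  hex.empty? ? 0 : hex.to_i(16)  # interpret them as one big number
end

def int_to_string(num)
  return '' if num.zero?
  hex = num.to_s(16)
  hex = "0#{hex}" if hex.length.odd?  # restore a dropped leading zero nibble
  [hex].pack('H*').force_encoding(Encoding::UTF_8)
end

n = string_to_int('hello')
puts n
puts n.bit_length          # roughly 8 bits per byte of input
puts int_to_string(n)
```

The number is unique per string and decodable, but it only fits a fixed-size integer type when the string is very short.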

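First you need to hash it to a string, e.g. with MD5. Then you convert the characters of the hash string into numbers according to their position in the alphabet. Note that a hash is one-way, so this gives you a number for the string but it cannot be decoded back to the original.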

Related

How to randomize a string

I have a random binary string s of length l bits. How can I change it in-place to another random string of the same length, such that I can retrieve the original string?
A. A trivial example would be adding +1 modulo 2^l
B. Another example could be: for each bit b in the string, replace it with (b+position(b))%2 where position(b) is the position of the bit (0, 1, 2, 3, ...).
However with both these methods, for every input the output is very similar to the input. For example using method A I'll get '010101' => '010110'. Is there any way to "increase the randomness" of the output somehow? In short, can I randomize a string, and retrieve the original (without adding extra bits to the original string)?
Are you trying to make your own encryption system? If so, typical advice would be to use an existing encryption system.
However, one way to do what you ask would be to generate a value from the string itself (for example, by taking the length of the string), use that as a seed for a random number generator, and then use the random number generator to alter each character in some reversible way.
That way, your string will be the same length, will not look like the original, and will be decodable. It's not very strong encryption though, just a variable cipher which could be broken by a decent decryption attempt.
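A rough sketch of that idea, under some assumptions: the data is handled as bytes rather than individual bits, the seed is simply the length, and the reversible alteration is XOR with the generator's output. The name scramble is invented for this example.

```ruby
# Reversible scrambling: XOR every byte with a pseudo-random byte stream.
# The seed is the length, which scrambling does not change, so the same
# stream can be regenerated later. This is obfuscation, not real encryption.

def scramble(bytes)
  rng = Random.new(bytes.bytesize)  # seed derived from the data itself
  bytes.each_byte.map { |b| b ^ rng.rand(256) }.pack('C*')
end

original = '010101'.b
mixed    = scramble(original)
restored = scramble(mixed)          # XOR with the same stream undoes itself

puts mixed.unpack1('H*')
puts restored
```

Because the seed depends only on the length, every string of a given length is XORed with the same stream, which is one reason this is easy to break.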

Fastest way to determine if a string contains a character

I have a string which consists of unicode characters. The same character can occur only once.
The length of the string is between 1 and ~50.
What is the fastest way to check if a particular character is in the string or not?
Iterating over the string is not a good choice, is it? Is there an efficient algorithm for this purpose?
My first idea was to keep the characters in the string alphabetically sorted. It could then be searched quickly, but sorting and comparing Unicode characters is not trivial (using the right collation) and has a big cost, probably bigger than iterating over the whole string.
Maybe some hashing? Maybe the iteration is the fastest way?
Any idea?
If there's no preprocessing, the simplest and fastest way is to iterate through the characters.
If there's preprocessing, the previous approach might still be the best, or you could try a small hash table that stores which characters the string contains. Storing the hash table takes extra space, but it could be friendlier to the memory cache (given few hash collisions and assuming you don't have to access the actual string). Make sure you measure the performance.
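Both options could look roughly like this. The method names are hypothetical, and Ruby's Set stands in for the "small hash table"; whether it beats a plain scan on strings of about 50 characters is something only a measurement can tell.

```ruby
require 'set'

# No preprocessing: scan the characters, O(n) per lookup.
def contains_by_scan?(str, ch)
  str.each_char.any? { |c| c == ch }
end

# Preprocessing: build a hash-based set once, then each lookup is O(1) on average.
def build_char_set(str)
  Set.new(str.each_char)
end

text  = "a\u00e9\u6f22z"
chars = build_char_set(text)

puts contains_by_scan?(text, "\u00e9")
puts chars.include?("\u00e9")
puts chars.include?('q')
```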
I have a feeling you're trying to over-engineer a really simple task. Have you verified that this is a bottleneck in your application?
A linear search through the string is O(n) with each operation being very simple. Sorting the string is O(n log n) with more complicated operations. It's pretty clear that the linear search will be faster in all cases.
If the characters are stored in UTF-8 or UTF-16 encoding then there's a possibility that you'll need to search for more than one contiguous element. There are ways to speed that up, such as Boyer-Moore or Knuth-Morris-Pratt. It's unclear whether there would be an actual speedup with such short search strings.
Is it a repeated operation on the same string or a one-time task? If it is a one-time task, then you can't do better than going through the string, since you have to look at all the characters anyway: O(n).
If it is a repeated operation, then you can do some preprocessing of the strings to make the subsequent operations faster. The most space-efficient and fastest option would be to build Bloom filters for the characters in each string. Once the filter is built, which is fast too, you can tell that a character is not present in O(1), and do a binary search of the sorted string only if the Bloom filter says yes.
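A simplified sketch of that scheme follows. It uses a single 64-bit mask with one hash function (code point modulo 64), which is the most degenerate form of a Bloom filter, and sorts by code point so that no collation is needed. The class name CharIndex and the constant BITS are assumptions made for this example.

```ruby
# Bloom-style prefilter plus binary search over sorted code points.
class CharIndex
  BITS = 64

  def initialize(str)
    @sorted = str.codepoints.sort
    # One bit per (code point mod 64); different characters may share a bit.
    @mask = @sorted.reduce(0) { |m, cp| m | (1 << (cp % BITS)) }
  end

  def include?(ch)
    cp = ch.ord
    # Bit not set: the character is definitely absent.
    return false if (@mask & (1 << (cp % BITS))).zero?
    # Bit set: might be a false positive, so confirm with a binary search.
    !@sorted.bsearch { |x| cp <=> x }.nil?
  end
end

index = CharIndex.new("za\u00e9")
puts index.include?('a')
puts index.include?('b')
puts index.include?('!')   # shares a bit with 'a', rejected by the binary search
```

With at most 50 characters spread over 64 bits, many absent characters will still pass the filter, so a wider mask or more hash functions may be worth testing.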

String Encoding Algorithm

Is there a way to reduce the length of a string using a string encoding algorithm?
Unfortunately, "Huffman coding" is not a solution for my case. I am searching for a coding algorithm which takes a string and generates a string which is shorter than the original (input) string.
There is no way to shorten an arbitrary string, just as there is no general compression method that works in all cases. So what you need to do is pick a compression method that applies to your expected inputs and use that. Then you just need to convert the results back to a string.
In case you were merely wondering how to convert the results back to a string, there are again any number of ways. Base64 is easy and works well enough. However, its output is about 33% larger than the raw binary data (4 characters for every 3 bytes).
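As an illustration of the compress-then-encode pipeline, here is a small sketch that uses DEFLATE from Ruby's zlib library as a stand-in compressor (the answer does not prescribe one) and the core 'm0' pack directive for Base64. The names pack_text and unpack_text are invented for the example.

```ruby
require 'zlib'

# Compress, then turn the binary result back into printable text.
def pack_text(text)
  [Zlib::Deflate.deflate(text)].pack('m0')   # 'm0' = Base64 without line breaks
end

# Reverse both steps.
def unpack_text(encoded)
  Zlib::Inflate.inflate(encoded.unpack1('m0')).force_encoding(Encoding::UTF_8)
end

long  = 'the quick brown fox ' * 50
short = 'hi'

puts "#{long.length} -> #{pack_text(long).length}"    # repetitive input shrinks
puts "#{short.length} -> #{pack_text(short).length}"  # tiny input grows
puts unpack_text(pack_text(short))
```

The second line of output shows why no method can shorten every input: the compressor's fixed overhead plus the Base64 expansion makes very short strings longer.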
Hopefully this answers your intended question. There is a library, smaz, which compresses short English strings efficiently. Perhaps luckily for you, it actually encodes the string. If your strings aren't English, the general method used by smaz (a static dictionary) can be used with other compressors.
See "English Text compression test". In the article you will find almost all possible algorithms to compress English text. Maybe some of them will satisfy your requirements.

Removing repeated characters in string without using recursion

You are given a string. Develop a function to remove duplicate characters from that string. The string could be of any length. Your algorithm must work in place. If you wish, you can use constant-size extra space that does not depend in any way on the string size. Your algorithm must have O(n) complexity.
My idea was to define an integer array of size 26, where index 0 corresponds to the letter a and index 25 to the letter z, and initialize all the elements to 0.
We then traverse the entire string once and increment the value at the corresponding index whenever we encounter a letter.
Then we traverse the string once again, and if the value at the corresponding index is 1 we print out the letter; otherwise we do not.
In this way the time complexity is O(n) and the space used is constant irrespective of the length of the string!
If anyone can come up with ideas for better efficiency, it will be very helpful!
Your solution definitely fits the criteria of O(n) time. Instead of an array, which would be very, very large if the allowed alphabet is large (Unicode has over a million characters), you could use a plain hash. Here is your algorithm in (unoptimized!) Ruby:
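def undup(s)
  # First pass: count how often each character occurs.
  seen = Hash.new(0)
  s.each_char { |c| seen[c] += 1 }
  # Second pass: keep only the characters that occur exactly once.
  result = ""
  s.each_char { |c| result << c if seen[c] == 1 }
  result
end

puts(undup(""))
puts(undup("abc"))
puts(undup("Olé"))
puts(undup("asdasjhdfasjhdfasbfdasdfaghsfdahgsdfahgsdfhgt"))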
It makes two passes through the string, and since hash lookup is O(1) on average, the whole thing is O(n), so you're good.
You can say the Hashtable (like your array) uses constant space, albeit large, because it is bounded above by the size of the alphabet. Even if the size of the alphabet is larger than that of the string, it still counts as constant space.
There are many variations to this problem, many of which are fun. To do it truly in place, you can sort first; this gives O(n log n). There are variations on merge sort where you ignore dups during the merge. In fact, this "no external hashtable" restriction appears in Algorithm: efficient way to remove duplicate integers from an array (also tagged interview question).
Another common interview question starts with a simple string, then they say, okay now a million character string, okay now a string with 100 billion characters, and so on. Things get very interesting when you start considering Big Data.
Anyway, your idea is pretty good. It can generally be tweaked as follows: Use a set, not a dictionary. Go through the string. For each character, if it is not in the set, add it. If it is, delete that occurrence from the string. Sets take up less space, don't need counters, and can be implemented as bitsets if the alphabet is small, and this algorithm does not need two passes.
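A minimal sketch of the set-based variant, assuming the goal is to keep the first occurrence of every character (unlike the counting version above, which drops every character that repeats). For simplicity it builds a new string instead of editing in place, and the name dedupe is made up here.

```ruby
require 'set'

# Single pass: keep a character only the first time it is seen.
def dedupe(str)
  seen = Set.new
  # Set#add? returns nil when the element is already present.
  str.each_char.select { |c| seen.add?(c) }.join
end

puts dedupe('banana')
puts dedupe("Ol\u00e9")
```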
Python implementation: http://code.activestate.com/recipes/52560-remove-duplicates-from-a-sequence/
You can also use a bitset instead of the additional array to keep track of found chars. Depending on which characters (a-z or more) are allowed you size the bitset accordingly. This requires less space than an integer array.
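For the a-z case, the bitset can be a single integer with one bit per letter. The following is only a sketch under that assumption; the name dedupe_lowercase is hypothetical, and anything outside a-z is rejected rather than handled.

```ruby
# Dedupe using a 26-bit mask as the "seen" set (lowercase a-z only).
def dedupe_lowercase(str)
  seen = 0
  out = +''                       # unfrozen, mutable result string
  str.each_char do |c|
    idx = c.ord - 'a'.ord
    raise ArgumentError, "expected a-z, got #{c.inspect}" unless idx.between?(0, 25)
    bit = 1 << idx
    next unless (seen & bit).zero?  # already seen: skip this occurrence
    seen |= bit
    out << c
  end
  out
end

puts dedupe_lowercase('asdasjhdf')
```

The extra space is a constant 26 bits regardless of the string length; a larger alphabet needs a correspondingly wider bitset.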

How to have a bigint hash for a string

We have an alphanumeric string (up to 32 characters) and we want to transform it to an integer (bigint). Now we're looking for an algorithm to do that. Collisions aren't bad (that is why we use a bigint, to reduce them a little bit); the important thing is that the calculated integers are uniformly distributed over the bigint range and that the calculated integer is always the same for a given string.
This page has a few. You'll need to port them to 64-bit, but that should be trivial. A C# port of the SDBM hash is here. Another page of hash functions is here.
Most programming languages come with a built-in construct or a standard library call to do this. Without knowing the language, I don't think anyone can help you.
Yes, a "hash" should be the right description for my problem. I know that there is CRC32, but it only provides a 32-bit int (in PHP), and these 32-bit integers are at most 10 digits long, so a huge range of integer numbers is unused!?
Mostly, we have a short string like "PX38IEK" or a 36-character UUID like "24868d36-a150-11df-8882-d8d385ffc39c", so the strings are arbitrary, yes.
It doesn't have to be reversible (so collisions aren't bad). It also doesn't matter which int a string is converted to; my only wish is that the full bigint range is used as well as possible.
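One possible way to get such a value, sketched in Ruby rather than PHP: take the first 8 bytes of a SHA-256 digest and read them as a signed 64-bit integer. This is a swapped-in technique, not one of the hash functions linked above, and the name bigint_hash is invented for the example. It assumes "bigint" means a signed 64-bit column.

```ruby
require 'digest/sha2'

# Deterministic 64-bit hash: first 8 digest bytes, read as a signed
# big-endian integer, so results span -2**63 .. 2**63 - 1.
def bigint_hash(str)
  Digest::SHA256.digest(str).unpack1('q>')
end

puts bigint_hash('PX38IEK')
puts bigint_hash('24868d36-a150-11df-8882-d8d385ffc39c')
```

A cryptographic digest is slower than simple string hashes, but its output bits are well mixed, so truncating it should spread values across the whole range.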

Resources