Fastest way to determine if a string contains a character - string

I have a string which consists of unicode characters. The same character can occur only once.
The length of the string is between 1 and ~50.
What is the fastest way to check if a particular character is in the string or not?
Iterating the string is not a good choice, isn't it? Is there any efficient algorithm for this purpose?
My first idea was to keep the characters in the string alphabetically sorted. It could be searched quickly, but the sorting and the comparison of unicode characters are not so trivial (using the right collation) and it has a big cost, probably bigger then iterating the whole string.
Maybe some hashing? Maybe the iteration is the fastest way?
Any idea?

If there's no preprocessing, the simplest and fastest way is to iterate through the characters.
If there's preprocessing, the previous approach might still the best, or you could try a small hashtable which stores whether a string contains that character. Storing the hash will take extra space, but could be better for memory cache (with low hash collision & assuming you don't have to access the actual string). Make sure you measure the peformance.
I have a feeling you're trying to over-engineer a really simple task. Have you verified that this is a bottleneck in your application?

A linear search through the string is O(n) with each operation being very simple. Sorting the string is O(n log n) with more complicated operations. It's pretty clear that the linear search will be faster in all cases.
If the characters are stored in UTF-8 or UTF-16 encoding then there's a possibility that you'll need to search for more than one contiguous element. There are ways to speed that up, such as Boyer-Moore or Knuth-Morris-Pratt. It's unclear whether there would be an actual speedup with such short search strings.

Is it a repeated operation on the same string or 1 time task ? If it is a 1 time task, then you can't do better than going through the string after all you have to look at all characters. O(n)
If it is repeated operation then you can do some preprocessing of the strings to make the subsequent operations faster. The most space efficient and fastest would be to build bloom filters for the characters in each string. Once built which is is fast too, you can say if a character is not present in 0(1) and only do a binary search of the sorted string only if bloom filter says yes.

Related

Fast substring in scala

According to Time complexity of Java's substring(), java's substring takes linear time.
Is there a faster way (may be in some cases)?
I may suggest iterator, but suspect that it also takes O(n).
val s1: String = s.iterator.drop(5).mkString
But several operations on an iterator would be faster than same operations on string, right?
If you need to edit very long string, consider using data structure called Rope.
Scalaz library has Cord class which is implementation of modified version of Rope.
A Cord is a purely functional data structure for efficiently
storing and manipulating Strings that are potentially very long.
Very similar to Rope[Char], but with better constant factors and a
simpler interface since it's specialized for Strings.
As Strings are - according to the linked question - always backed by a unique character array, substring can't be faster than O(n). You need to copy the character data.
As for alternatives: there will at least be one operation which is O(n). In your example, that's mkString which collects the characters in the iterator and builds a string from them.
However, I wouldn't worry about that too much. The fact that you're using a high level language means (should mean) that developer time is more valuable than CPU time for your particular task. substring is also the canonical way to ... take a substring, so using it makes your program more readable.
EDIT: I also like this sentence (from this answer) a lot: O(n) is O(1) if n does not grow large. What I take away from this is: you shouldn't write inefficient code, but asymptotical efficiency is not the same as real-world efficiency.

Important algorithm involving random access to a string?

I am implementing a different string representation where accessing a string in non-sequential manner is very costly. To avoid this I try to implement certain position caches or character blocks so one can jump to certain locations and scan from there.
In order to do so, I need a list of algorithms where scanning a string from right to left or random access of its characters is required, so I have a set of test cases to do some actual benchmarking and to create a model I can use to find a local/global optimum for my efforts.
Basically I know of:
String.charAt
String.lastIndexOf
String.endsWith
One scenario where one needs right to left access of strings is extracting the file extension and the file name (item) of paths.
For random access i find no algorithm at all unless one has prefix tables and access the string more randomly checking all those positions for longer than prefix strings.
Does anyone know other algorithms with either right to left or random access of string characters is required?
[Update]
The calculation of the hash-code of a String is calculated using every character and accessed from left to right along the value is stored in a local primary variable. So this is not something for random access.
Also the MD5 or CRC algorithm also all process the complete string. So I do not find any random access examples at all.
One interesting algorithm is Boyer-Moore searching, which involves both skipping forward by a variable number of characters and comparing backwards. If those two operations are not O(1), then KMP searching becomes more attractive, but BM searching is much faster for long search patterns (except in rare cases where the search pattern contains lots of repetitions of its own prefix). For example, BM shines for patterns which must be matched at word-boundaries.
BM can be implemented for certain variable-length encodings. In particular, it works fine with UTF-8 because misaligned false positives are impossible. With a larger class of variable-length encodings, you might still be able to implement a variant of BM which allows forward skips.
There are a number of algorithms which require the ability to reset the string pointer to a previously encountered point; one example is word-wrapping an input to a specific line length. Those won't be impeded by your encoding provided your API allows for saving a copy of an iterator.

String datastructure supporting append, prepend and search operations

I need to build a text editor as my mini project, and I need to design a data structure or algorithm that supports following operation:
Append : Append a character at the end of the String.
Prepend : Prepend a character at the beginning of the string.
Search : Given a search string s, find all the occurrences of the string.
Each operation in O(log n) time or less. Search and replace operations will be appreciable but not necessary. The maximum length of string is constant. Any ideas how to achieve this?
Thanks!
A common data structure for this kind of application is a Rope, where Append and Prepend are O(1), although that depends a bit on whether the tree is balanced. However, as noted by Толя, Search would be linear.
There are certainly data structures that can make the search faster, such as a Suffix Tree, but they are probably not appropriate for a text editor application.
I would propose you adapt a Trie. On an append operation add all the suffixes of the string ending at the new character with length up to the maximum length of the string in the datastructure. On prepend add all the prefixes of the string starting at the new char with length up to the fixed length of the string. Asymptotically both operations are constant - they take O(k^2) where k is the fixed length of the string. For each node in the structure keep track of all the strings ending at that node(possibly a list).
A search operation will again be constant: iterate over the string and output all indexes stored in the ending node(if you have not "dropped out the tree").
A drawback of my approach is the memory overhead(at most times the maximum length of a word), but if the maximum string length allowed is reasonable and you only insert real words(from English dictionary for instance), this should not be a big problem.

Removing repeated characters in string without using recursion

You are given a string. Develop a function to remove duplicate characters from that string. String could be of any length. Your algorithm must be in space. If you wish you can use constant size extra space which is not dependent any how on string size. Your algorithm must be of complexity of O(n).
My idea was to define an integer array of size of 26 where 0th index would correspond to the letter a and the 25th index for the letter z and initialize all the elements to 0.
Thus we will travel the entire string once and and would increment the value at the desired index as and when we encounter a letter.
and then we will travel the string once again and if the value at the desired index is 1 we print out the letter otherwise we do not.
In this way the time complexity is O(n) and the space used is constant irrespective of the length of the string!!
if anyone can come up with ideas of better efficiency,it will be very helpful!!
Your solution definitely fits the criteria of O(n) time. Instead of an array, which would be very, very large if the allowed alphabet is large (Unicode has over a million characters), you could use a plain hash. Here is your algorithm in (unoptimized!) Ruby:
def undup(s)
seen = Hash.new(0)
s.each_char {|c| seen[c] += 1}
result = ""
s.each_char {|c| result << c if seen[c] == 1}
result
end
puts(undup "")
puts(undup "abc")
puts(undup "Olé")
puts(undup "asdasjhdfasjhdfasbfdasdfaghsfdahgsdfahgsdfhgt")
It makes two passes through the string, and since hash lookup is less than linear, you're good.
You can say the Hashtable (like your array) uses constant space, albeit large, because it is bounded above by the size of the alphabet. Even if the size of the alphabet is larger than that of the string, it still counts as constant space.
There are many variations to this problem, many of which are fun. To do it truly in place, you can sort first; this gives O(n log n). There are variations on merge sort where you ignore dups during the merge. In fact, this "no external hashtable" restriction appears in Algorithm: efficient way to remove duplicate integers from an array (also tagged interview question).
Another common interview question starts with a simple string, then they say, okay now a million character string, okay now a string with 100 billion characters, and so on. Things get very interesting when you start considering Big Data.
Anyway, your idea is pretty good. It can generally be tweaked as follows: Use a set, not a dictionary. Go trough the string. For each character, if it is not in the set, add it. If it is, delete it. Sets take up less space, don't need counters, and can be implemented as bitsets if the alphabet is small, and this algorithm does not need two passes.
Python implementation: http://code.activestate.com/recipes/52560-remove-duplicates-from-a-sequence/
You can also use a bitset instead of the additional array to keep track of found chars. Depending on which characters (a-z or more) are allowed you size the bitset accordingly. This requires less space than an integer array.

String recurring subsequences and compression

I'd like to do some kind of "search and replace" algorithm which will, in an efficient manner if possible, identify a substring of a string which occurs more than once and replace all occurrences of that substring with a token.
For example, given a string "AbcAdAefgAbijkAblmnAbAb", notice that "A" recurs, so reduce in pass one to "#1bc#1d#1efg#1bijk#1blmn#1b#1b" where #_ is an indexed pattern (we note the patterns in an indexed table), then notice that "#1b" recurs so reduce to "#2c#1d#1efg#2ijk#2lmn#2#2". No more patterns occur in the string so we're done.
I have found some information on "longest common subsequences" and compression algorithms, but nothing that seems to do this. They either are for comparing two string or for getting some kind of storage-optimal result.
My objective, on the other hand, is to reduce the genome to its "words" instead of "letters". ie, instead of gatcatcgatc I want to see 2c1c2c. I could do some regex afterwards to find things like "#42*#42"; it would be cool to see recurring brackets in dna.
If I could just find that online I would skip doing it myself but I can't see this question answered before in terms I could uncover. To anyone who can point me in the right direction many thanks.
The byte pair encoding does something pretty close to what you want.
Rather than searching directly for the longest repeated string (top-down),
each pass of byte pair encoding searches for repeated byte pairs (bottom-up).
But eventually it discovers the longest repeated string(*).
gatcatcgatc
1=at g1c1cg1c
2=atc g22g2
3=gatc 2=atc 323
As you can see, it has found the longest repeated string "gatc".
(*) byte pair encoding either eventually finds the longest repeated string,
or else it stops early after making (2^8 - uniquechars(source) ) substitutions.
I suspect it may be possible to tweak byte pair encoding so that the early-stop condition is relaxed a little -- perhaps (2^9 - uniquechars(source) ) or 2^12 or 2^16.
Even if that hurts compression performance, perhaps it will give interesting results for applications like yours.
Wikipedia: byte pair encoding
Stack Overflow: optimizing byte-pair encoding

Resources