Fast substring in Scala

According to "Time complexity of Java's substring()", Java's substring takes linear time.
Is there a faster way (maybe only in some cases)?
I could suggest an iterator, but I suspect that it also takes O(n):
val s1: String = s.iterator.drop(5).mkString
But several operations on an iterator would be faster than the same operations on a string, right?

If you need to edit a very long string, consider using a data structure called a Rope.
The Scalaz library has a Cord class, which is an implementation of a modified version of a Rope. From its documentation:
A Cord is a purely functional data structure for efficiently
storing and manipulating Strings that are potentially very long.
Very similar to Rope[Char], but with better constant factors and a
simpler interface since it's specialized for Strings.
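To make the rope idea concrete, here is a minimal sketch in Python. It is not Scalaz's Cord and makes no attempt at balancing; the names Leaf, Node, concat, char_at and to_string are hypothetical and chosen only for illustration. It shows why concatenation can avoid copying characters: a new node just points at the two existing pieces.

```python
class Leaf:
    """A rope leaf holding an ordinary string."""
    def __init__(self, text):
        self.text = text
        self.length = len(text)


class Node:
    """An inner rope node; it only stores references to its children."""
    def __init__(self, left, right):
        self.left = left
        self.right = right
        self.length = left.length + right.length


def concat(a, b):
    # O(1): no character data is copied, both inputs are shared.
    return Node(a, b)


def char_at(rope, i):
    # Walk down from the root; the cost is the depth of the tree.
    while isinstance(rope, Node):
        if i < rope.left.length:
            rope = rope.left
        else:
            i -= rope.left.length
            rope = rope.right
    return rope.text[i]


def to_string(rope):
    # Flattening is the one unavoidable O(n) step.
    parts, stack = [], [rope]
    while stack:
        r = stack.pop()
        if isinstance(r, Node):
            stack.append(r.right)  # pushed first, so popped after left
            stack.append(r.left)
        else:
            parts.append(r.text)
    return "".join(parts)
```

A real rope also keeps the tree balanced so that char_at stays logarithmic; this sketch leaves that out.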

As Strings are - according to the linked question - always backed by a unique character array, substring can't be faster than O(n). You need to copy the character data.
As for alternatives: there will at least be one operation which is O(n). In your example, that's mkString which collects the characters in the iterator and builds a string from them.
However, I wouldn't worry about that too much. The fact that you're using a high level language means (should mean) that developer time is more valuable than CPU time for your particular task. substring is also the canonical way to ... take a substring, so using it makes your program more readable.
EDIT: I also like this sentence (from this answer) a lot: "O(n) is O(1) if n does not grow large." What I take away from this is: you shouldn't write inefficient code, but asymptotic efficiency is not the same as real-world efficiency.

Related

Concurrently check if string is in slice?

Generally, to check if a string is in a slice, I write a function with a for loop and an if statement. But it's really inefficient for large slices of strings or struct types. Is it possible to make this check concurrent?
Concurrent search over sequential data is usually not a great idea, simply because binary search already scales really well, even to billions of records. All you have to do to use it is build an index on top of the slice you are searching. To build the most trivial index, save the keys into another slice along with the index of the data they point to. Once you have that slice, sort it by key, and the index is done.
Then perform the binary search on the index you just created. This gives you a lookup complexity of O(log N).
Another, much simpler option is to create a map[string]int and insert all keys along with their indexes, then look the index up in the map. A lookup is O(1) on average.
The important thing to note is that if you only perform one search on a given slice, this is not worth it: building the index is a lot heavier than a single linear search.
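The two indexing options above can be sketched as follows. The sketch is in Python rather than Go, and the function names (build_sorted_index, find_sorted, build_map_index) are hypothetical; it assumes that for duplicate keys you want the first position in the original data.

```python
from bisect import bisect_left


def build_sorted_index(items):
    """Return (key, original_position) pairs sorted by key: O(n log n)."""
    return sorted((key, pos) for pos, key in enumerate(items))


def find_sorted(index, key):
    """Binary-search the sorted index: O(log n). Returns -1 if absent."""
    # (key, -1) sorts before every real entry with the same key,
    # so bisect_left lands on the first occurrence of that key.
    i = bisect_left(index, (key, -1))
    if i < len(index) and index[i][0] == key:
        return index[i][1]
    return -1


def build_map_index(items):
    """Map each key to the position of its first occurrence: O(n)."""
    positions = {}
    for pos, key in enumerate(items):
        positions.setdefault(key, pos)
    return positions
```

Both builders touch every element at least once, which is why they only pay off when the same data is searched repeatedly.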

When is it good to use the KMP algorithm?

I understand that the KMP algorithm depends on a helper array that records where prefixes of the pattern are also suffixes.
It won't be efficient when that condition is not fulfilled, that is, when the helper array contains all zeroes.
Would the runtime be O(m + n)?
If I am right, what is a better substring algorithm in this case?
To understand when KMP is a good algorithm to use, it's often helpful to ask the question "what's the alternative?"
KMP has the nice advantage that it is guaranteed worst-case efficient. The preprocessing time is always O(n), and the searching time is always O(m). There are no worst-case inputs, no probability of getting unlucky, etc. In cases where you are searching for very long strings (large n) inside of really huge strings (large m), this may be highly desirable compared to other algorithms like the naive one (which can take time Θ(mn) in bad cases), Rabin-Karp (pathological inputs can take time Θ(mn)), or Boyer-Moore (worst-case can be Θ(mn)). You're right that KMP might not be all that necessary in the case where there aren't many overlapping parts of the string, but the fact that you never need to worry about whether there's a bad case is definitely a nice thing to have!
KMP also has the nice property that the processing can be done a single time. If you know you're going to search for the same substring lots and lots of times, you can do the O(n) preprocessing work once and then have the ability to search in any length-m string you'd like in time O(m).
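A minimal Python sketch of this split between preprocessing and searching, using the common failure-table formulation of KMP. The names build_failure_table and kmp_search are hypothetical. The table is built once from the pattern and can then be reused for any number of texts.

```python
def build_failure_table(pattern):
    """failure[i] is the length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it. Runs in O(len(pattern))."""
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure


def kmp_search(text, pattern, failure):
    """Return the start index of every occurrence of pattern in text.
    Runs in O(len(text)); the text is scanned once, left to right."""
    if not pattern:
        return []
    matches = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = failure[k - 1]  # keep going, allowing overlaps
    return matches
```

For a pattern such as "abc" the table is all zeroes, and the search still runs in linear time: every mismatch simply restarts matching at the start of the pattern without moving backwards in the text.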

Looking for an efficient array-like structure that supports "replace-one-member" and "append"

As an exercise I wrote an implementation of the longest increasing subsequence algorithm, initially in Python but I would like to translate this to Haskell. In a nutshell, the algorithm involves a fold over a list of integers, where the result of each iteration is an array of integers that is the result of either changing one element of or appending one element to the previous result.
Of course in Python you can just change one element of the array. In Haskell, you could rebuild the array while replacing one element at each iteration - but that seems wasteful (copying most of the array at each iteration).
In summary what I'm looking for is an efficient Haskell data structure that is an ordered collection of 'n' objects and supports the operations: lookup i, replace i foo, and append foo (where i is in [0..n-1]). Suggestions?
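For reference, the kind of fold described above can be sketched in Python like this. The asker's own code is not shown, so this assumes the usual formulation in which the array holds the smallest possible tail of an increasing subsequence of each length; lis_tails is a hypothetical name. Each step performs lookups (inside the binary search) and then exactly one "replace i foo" or one "append foo".

```python
from bisect import bisect_left


def lis_tails(xs):
    """tails[k] is the smallest possible last element of a strictly
    increasing subsequence of length k + 1 seen so far."""
    tails = []
    for x in xs:
        i = bisect_left(tails, x)  # binary search: O(log n) lookups
        if i == len(tails):
            tails.append(x)        # "append foo"
        else:
            tails[i] = x           # "replace i foo"
    return tails
```

The length of the result is the length of the longest strictly increasing subsequence; the result itself is not necessarily one of those subsequences.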
Perhaps the standard Seq type from Data.Sequence. It's not quite O(1) for everything, but it's pretty good:
index (your lookup) and adjust (your replace) are O(log(min(index, length - index)))
(|>) (your append of a single element) is O(1); (><), which concatenates two sequences, is O(log(min(length1, length2)))
It's based on a tree structure (specifically, a 2-3 finger tree), so it should have good sharing properties (meaning that it won't copy the entire sequence for incremental modifications, and will perform them faster too). Note that Seqs are strict in their spine (and therefore finite), unlike lists.
I would try to just use mutable arrays in this case, preferably in the ST monad.
The main advantages would be making the translation more straightforward and making things simple and efficient.
The disadvantage, of course, is losing purity and composability. However, I think this should not be such a big deal, since I don't think there are many cases where you would want to keep intermediate algorithm states around.

Haskell arrays vs lists

I'm playing with Haskell and Project Euler's 23rd problem. After solving it with lists I went here where I saw some array work. This solution was much faster than mine.
So here's the question. When should I use arrays in Haskell? Is their performance better than lists' and in which cases?
The most obvious difference is the same as in other languages: arrays have O(1) lookup and lists have O(n). Attaching something to the head of a list (:) takes O(1); appending (++) takes O(n).
Arrays have some other potential advantages as well. You can have unboxed arrays, which means the entries are just stored contiguously in memory without having a pointer to each one (if I understand the concept correctly). This has several benefits--it takes less memory and could improve cache performance.
Modifying immutable arrays is difficult--you'd have to copy the entire array which takes O(n). If you're using mutable arrays, you can modify them in O(1) time, but you'd have to give up the advantages of having a purely functional solution.
Finally, lists are just much easier to work with if performance doesn't matter. For small amounts of data, I would always use a list.
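The list costs quoted above follow directly from the shape of a singly linked list. Here is a small Python model of one; Cons, prepend, nth and append are hypothetical names, with None standing in for the empty list. It models the data structure only, not Haskell's laziness.

```python
class Cons:
    """One list cell: a value plus a reference to the rest of the list."""
    __slots__ = ("head", "tail")

    def __init__(self, head, tail=None):
        self.head = head
        self.tail = tail


def prepend(x, lst):
    # Like (:), O(1): one new cell, the old list is shared untouched.
    return Cons(x, lst)


def nth(lst, i):
    # Like (!!), O(i): i links have to be followed one by one.
    node = lst
    for _ in range(i):
        node = node.tail
    return node.head


def append(xs, ys):
    # Like (++), O(length of xs): every cell of xs is rebuilt, ys is shared.
    if xs is None:
        return ys
    return Cons(xs.head, append(xs.tail, ys))
```

An array, by contrast, computes the address of element i directly, which is where the O(1) lookup comes from.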
And if you're doing a lot of indexing as well as a lot of updating, you can
use Maps (or IntMaps): O(log size) indexing and update, good enough for most uses, and the code is easy to get right;
or, if Maps are too slow, use mutable (unboxed) arrays (STUArray from Data.Array.ST, or STVectors from the vector package): O(1) indexing and update, but the code is easier to get wrong and generally not as nice.
For specific uses, functions like accumArray give very good performance too (uses mutable arrays under the hood).
Arrays have O(1) indexing (this used to be part of the Haskell definition), whereas lists have O(n) indexing. On the other hand, any kind of modification of arrays is O(n) since it has to copy the array.
So if you're going to do a lot of indexing but little updating, use arrays.

Efficient String Implementation in Haskell

I'm currently teaching myself Haskell, and I'm wondering what the best practices are when working with strings in Haskell.
The default string implementation in Haskell is a list of Char. This is inefficient for file input-output, according to Real World Haskell, since each character is separately allocated (I assume that this means that a String is basically a linked list in Haskell, but I'm not sure.)
But if the default string implementation is inefficient for file i/o, is it also inefficient for working with Strings in memory? Why or why not? C uses an array of char to represent a String, and I assumed that this would be the default way of doing things in most languages.
As I see it, the list implementation of String will take up more memory, since each character will require overhead, and also more time to iterate over, because a pointer dereferencing will be required to get to the next char. But I've liked playing with Haskell so far, so I want to believe that the default implementation is efficient.
Apart from String/ByteString there is now the Text library, which combines the best of both worlds: it works with Unicode while storing its data in a packed array internally, much as ByteString does, so you get fast, correct strings.
Best practices for working with strings performantly in Haskell are basically: Use Data.ByteString/Data.ByteString.Lazy.
http://hackage.haskell.org/packages/archive/bytestring/latest/doc/html/
As far as the efficiency of the default string implementation in Haskell goes, it's not efficient. Each Char represents a Unicode code point, which means it needs at least 21 bits per Char.
Since a String is just [Char], that is, a linked list of Char, Strings have poor locality of reference and are fairly large in memory: at a minimum N * (21 bits + M bits), where N is the length of the string and M is the size of a pointer (32, 64, what have you). And unlike many other places where Haskell uses lists where other languages might use different structures (I'm thinking specifically of control flow here), Strings are much less likely to be optimized into loops, etc. by the compiler.
And while a Char corresponds to a code point, the Haskell 98 report doesn't specify anything about the encoding used when doing file IO: not a default, much less a way to change it. In practice GHC provides extensions to do e.g. binary IO, but you're going outside the standard at that point anyway.
Even with operations like prepending to front of the string it's unlikely that a String will beat a ByteString in practice.
The answer is a bit more complex than just "use lazy bytestrings".
Byte strings only store 8 bits per value, whereas String holds real Unicode characters. So if you want to work with Unicode then you have to convert to and from UTF-8 or UTF-16 all the time, which is more expensive than just using strings. Don't make the mistake of assuming that your program will only need ASCII. Unless it's just throwaway code, one day someone will need to put in a Euro symbol (U+20AC) or accented characters, and your nice fast bytestring implementation will be irretrievably broken.
Byte strings make some things, like prepending to the start of a string, more expensive.
That said, if you need performance and you can represent your data purely in bytestrings, then do so.
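The Euro-sign problem is easy to see with concrete bytes. The following Python sketch is only an illustration of the encoding arithmetic, not of any Haskell API; truncate_to_8_bits is a hypothetical helper that models what happens when each character is squeezed into a single byte.

```python
euro = "\u20ac"                  # the Euro sign, code point U+20AC
encoded = euro.encode("utf-8")   # one character becomes three bytes

text = "price: 5" + euro
raw = text.encode("utf-8")       # byte length no longer equals char length


def truncate_to_8_bits(s):
    """Keep only the low 8 bits of each code point (lossy on purpose)."""
    return bytes(ord(c) & 0xFF for c in s)


mangled = truncate_to_8_bits(euro)           # 0x20AC becomes 0xAC
round_trip = raw.decode("utf-8") == text     # proper encoding is lossless
```

Here encoded is the three bytes E2 82 AC, text has 9 characters while raw has 11 bytes, and mangled is the single byte AC, from which the original character cannot be recovered.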
The basic answer given, use ByteString, is correct. That said, all of the three answers before mine have inaccuracies.
Regarding UTF-8: whether this will be an issue or not depends entirely on what sort of processing you do with your strings. If you're simply treating them as single chunks of data (which includes operations such as concatenation, though not splitting), or doing certain limited byte-based operations (e.g., finding the length of the string in bytes, rather than the length in characters), you won't have any issues. If you are using I18N, there are enough other issues that simply using String rather than ByteString will start to fix only a very few of the problems you'll encounter.
Prepending single bytes to the front of a ByteString is probably more expensive than doing the same for a String. However, if you're doing a lot of this, it's probably possible to find ways of dealing with your particular problem that are cheaper.
But the end result would be, for the poster of the original question: yes, Strings are inefficient in Haskell, though rather handy. If you're worried about efficiency, use ByteStrings, and view them as either arrays of Char8 or Word8, depending on your purpose (ASCII/ISO-8859-1 vs Unicode of some sort, or just arbitrary binary data). Generally, use Lazy ByteStrings (where prepending to the start of a string is actually a very fast operation) unless you know why you want non-lazy ones (which is usually wrapped up in an appreciation of the performance aspects of lazy evaluation).
For what it's worth, I am building an automated trading system entirely in Haskell, and one of the things we need to do is very quickly parse a market data feed we receive over a network connection. I can handle reading and parsing 300 messages per second with a negligible amount of CPU; as far as handling this data goes, GHC-compiled Haskell performs close enough to C that it's nowhere near entering my list of notable issues.

Resources