What are the performance and pitfalls of String.sub - string

I'm considering using String.sub for a task on a hot path that inserts multiple elements inside a large string at arbitrary positions.
Knowing that this kind of function always has gotchas in other languages, I'd like to know what those are in the standard OCaml implementation.

String.sub (like most of the string manipulating functions) allocates a new string and copies the contents of the original string. So, it might be pretty slow if it is in a hot path.

Related

How to implement efficient string interning in f#?

What is to implement a custom string type in f# for interning strings. i have to read large csv files into memory. Given most of the columns are categorical, values are repeating and it makes sense to create new string first time it is encountered and only refer to it on subsequent occurrences to save memory.
In c# I do this by creating a global intern pool (concurrent dict) and before setting a value, lookup the dictionary if it already exists. if it exists, just point to the string already in the dictionary. if not, add it to the dictionary and set the value to the string just added to dictionary.
New to f# and wondering what is the best way to do this in f#. will be using the new string type in records named tuples etc and it will have to work with concurrent processes.
Edit:
String.Intern uses the Intern Pool. My understanding is, it is not very efficient for large pools and is not garbage collected i.e. any/all interned strings will remain in intern pool for lifetime of the app. Imagine a an application where you read a file, perform some operations and write data. Using Intern Pool solution will probably work. Now imagine you have to do the same 100 times and the strings in each file have little in common. If the memory is allocated on heap, after processing each file, we can force garbage collector to clear unnecessary strings.
I should have mentioned I could not really figure out how to do the C# approach in F# (other than implementing a C# type and using it in F#)
Memorisation pattern is slightly different from what I am looking for? We are not caching calculated results - we are ensuring each string object is created no more than once and all subsequent creations of same string are just references to the original. Using a dictionary to do this is a one way and using String.Intern is other.
sorry if is am missing something obvious here.
I have a few things to say, so I'll post them as an answer.
First, I guess String.Intern works just as well in F# as in C#.
let x = "abc"
let y = StringBuilder("a").Append("bc").ToString()
printfn "1 : %A" (LanguagePrimitives.PhysicalEquality x y) // false
let y2 = String.Intern y
printfn "2 : %A" (LanguagePrimitives.PhysicalEquality x y2) // true
Second, are you using a dictionary in combination with String.Intern in your C# solution? If so, why not just do s = String.Intern(s); after the string is ready following input from file?
To create a type for use in your business domain to handle string deduplication in general is a very bad idea. You don't want your business domain polluted by that kind of low level stuff.
As for rolling your own. I did that some years ago, probably to avoid that problem you mentioned with the strings not being garbage collected, but I never tested if that actually was a problem.
It might be a good idea to use a dictionary (or something) for each column (or type of column) where the same values are likely to repeat in great numbers. (This is pretty much what you said already.)
It makes sense to only keep these dictionaries live while you read the information from file, and stuff it into internal data structures. You might be thinking that you need the dictionaries for subsequent reads, but I am not so sure about that.
The important thing is to deduplicate the great majority of strings, and not necessarily every single duplicate. Because of this you can greatly simplify the solution as indicated. You most probably have nothing to gain by overcomplicating your solution to squeeze out the last fraction of memory savings.
Releasing the dictionaries after the file is read and structures filled, will have the advantage of not holding on to strings when they are no longer really needed. And of course you save memory by not holding onto the dictionaries.
I see no need to handle concurrency issues in the implementation here. String.Intern must necessarily be immune to concurrency issues. If you roll your own with the design suggested, you would not use it concurrently. Each file being read would have its own set of dictionaries for its columns.

Data.Text vs String

While the general opinion of the Haskell community seems to be that it's always better to use Text instead of String, the fact that still the APIs of most of maintained libraries are String-oriented confuses the hell out of me. On the other hand, there are notable projects, which consider String as a mistake altogether and provide a Prelude with all instances of String-oriented functions replaced with their Text-counterparts.
So are there any reasons for people to keep writing String-oriented APIs except backwards- and standard Prelude-compatibility and the "switch-making inertia"?
Are there possibly any other drawbacks to Text as compared to String?
Particularly, I'm interested in this because I'm designing a library and trying to decide which type to use to express error messages.
My unqualified guess is that most library writers don't want to add more dependencies than necessary. Since strings are part of literally every Haskell distribution (it's part of the language standard!), it is a lot easier to get adopted if you use strings and don't require your users to sort out Text distributions from hackage.
It's one of those "design mistakes" that you just have to live with unless you can convince most of the community to switch over night. Just look at how long it has taken to get Applicative to be a superclass of Monad – a relatively minor but much wanted change – and imagine how long it would take to replace all the String things with Text.
To answer your more specific question: I would go with String unless you get noticeable performance benefits by using Text. Error messages are usually rather small one-off things so it shouldn't be a big problem to use String.
On the other hand, if you are the kind of ideological purist that eschews pragmatism for idealism, go with Text.
* I put design mistakes in scare quotes because strings as a list-of-chars is a neat property that makes them easy to reason about and integrate with other existing list-operating functions.
If your API is targeted at processing large amounts of character oriented data and/or various encodings, then your API should use Text.
If your API is primarily for dealing with small one-off strings, then using the built-in String type should be fine.
Using String for large amounts of text will make applications using your API consume significantly more memory. Using it with foreign encodings could seriously complicate usage depending on how your API works.
String is quite expensive (at least 5N words where N is the number of Char in the String). A word is same number of bits as the processor architecture (ex. 32 bits or 64 bits):
http://blog.johantibell.com/2011/06/memory-footprints-of-some-common-data.html
There are at least three reasons to use [Char] in small projects.
[Char] does not rely on any arcane staff, like foreign pointers, raw memory, raw arrays, etc that may work differently on different platforms or even be unavailable altogether
[Char] is the lingua franka in haskell. There are at least three 'efficient' ways to handle unicode data in haskell: utf8-bytestring, Data.Text.Text and Data.Vector.Unboxed.Vector Char, each requiring dealing with extra package.
by using [Char] one gains access to all power of [] monad, including many specific functions (alternative string packages do try to help with it, but still)
Personally, I consider utf16-based Data.Text one of the most questionable desicions of the haskell community, since utf16 combines flaws of both utf8 and utf32 encoding while having none of their benefits.
I wonder if Data.Text is always more efficient than Data.String???
"cons" for instance is O(1) for Strings and O(n) for Text. Append is O(n) for Strings and O(n+m) for strict Text's. Likewise,
let foo = "foo" ++ bigchunk
bar = "bar" ++ bigchunk
is more space efficient for Strings than for strict Texts.
Other issue not related to efficiency is pattern matching (perspicuous code) and lazyness (predictably per-character in Strings, somehow implementation dependent in lazy Text).
Text's are obviously good for static character sequences and for in-place modification. For other forms of structural editing, Data.String might have advantages.
I do not think there is a single technical reason for String to remain.
And I can see several ones for it to go.
Overall I would first argue that in the Text/String case there is only one best solution :
String performances are bad, everyone agrees on that
Text is not difficult to use. All functions commonly used on String are available on Text, plus some useful more in the context of strings (substitution, padding, encoding)
having two solutions creates unnecessary complexity unless all base functions are made polymorphic. Proof : there are SO questions on the subject of automatic conversions. So this is a problem.
So one solution is less complex than two, and the shortcomings of String will make it disappear eventually. The sooner the better !

Most efficient method of constructing and printing very large strings

I have a program which constructs very large strings. Currently I am using lazy ByteStrings. Here are the problem parameters summarized:
The current implementation works up to about 500k characters, simply running out of memory afterwards (~600MB). I would like this (amount of characters) to run in under 50MB.
The string isn't accessed while being built. This probably leads to a lot of thunks and hence the memory issue. I am using Builder to make the ByteStrings, but it seems that there is no strict version of Builder (or at least I can't find it).
The string cannot be put in the file while being built. The entire build operation has to happen before the string is placed in a file.
I don't need unicode support. Even 7 bit ascii would do. I believe that ByteString doesn't waste memory to encode unicode characters though.
Things I have tried:
Calling seq on the ByteStrings as they are being built. This seems to work for 50-100k characters but after that the effect is the same.
Using strict ByteStrings. I couldn't figure out how to use Builder with them, so I ended up using lists and concat.
Using UArray Int Char. This means either knowing the size of the string in advance and allocating the entire array, or having a ton of intermediate data structures.

Delphi String Sharing Question

I have a large amout of objects that all have a filename stored inside. All file names are within a given base directory (let's call it C:\BaseDir\). I am now considering two alternatives:
Store absolute paths in the objects
Store relative paths in the object and store the base path additionally
If I understand Delphi strings correctly the second approach will need much less memory because the base path string is shared - given that I pass the same string field to all the objects like this:
TDataObject.Create (FBasePath, RelFileName);
Is that assumption true? Will there be only one string instance of the base path in memory?
If anybody knows a better way to handle situations like this, feel free to comment on that as well.
Thanks!
You are correct. When you write s1 := s2 with two string variables, there is one string in memory with (at least two) references to it.
You also ask whether trying to reduce the number of strings in memory is a good idea. That depends on how many strings you have in comparison to other memory consuming objects. Only you can really answer that.
As David said, the common string would be shared (unless you use ie UniqueString()).
Having said that, it looks like premature optimisation. If you actually need to work with full paths and never need the dir and filename part separately then you should think about splitting them up only when you really run into memory problems.
Constantly concatenating the base and filename parts could significantly slow down your program and cause memory fragmentation.

Efficient String Implementation in Haskell

I'm currently teaching myself Haskell, and I'm wondering what the best practices are when working with strings in Haskell.
The default string implementation in Haskell is a list of Char. This is inefficient for file input-output, according to Real World Haskell, since each character is separately allocated (I assume that this means that a String is basically a linked list in Haskell, but I'm not sure.)
But if the default string implementation is inefficient for file i/o, is it also inefficient for working with Strings in memory? Why or why not? C uses an array of char to represent a String, and I assumed that this would be the default way of doing things in most languages.
As I see it, the list implementation of String will take up more memory, since each character will require overhead, and also more time to iterate over, because a pointer dereferencing will be required to get to the next char. But I've liked playing with Haskell so far, so I want to believe that the default implementation is efficient.
Apart from String/ByteString there is now the Text library which combines the best of both worlds—it works with Unicode while being ByteString-based internally, so you get fast, correct strings.
Best practices for working with strings performantly in Haskell are basically: Use Data.ByteString/Data.ByteString.Lazy.
http://hackage.haskell.org/packages/archive/bytestring/latest/doc/html/
As far as the efficiency of the default string implementation goes in Haskell, it's not. Each Char represents a Unicode codepoint which means it needs at least 21bits per Char.
Since a String is just [Char], that is a linked list of Char, it means Strings have poor locality of reference, and again means that Strings are fairly large in memory, at a minimum it's N * (21bits + Mbits) where N is the length of the string and M is the size of a pointer (32, 64, what have you) and unlike many other places where Haskell uses lists where other languages might use different structures (I'm thinking specifically of control flow here), Strings are much less likely to be able to be optimized to loops, etc. by the compiler.
And while a Char corresponds to a codepoint, the Haskell 98 report doesn't specify anything about the encoding used when doing file IO, not even a default much less a way to change it. In practice GHC provides an extensions to do e.g. binary IO, but you're going off the reservation at that point anyway.
Even with operations like prepending to front of the string it's unlikely that a String will beat a ByteString in practice.
The answer is a bit more complex than just "use lazy bytestrings".
Byte strings only store 8 bits per value, whereas String holds real Unicode characters. So if you want to work with Unicode then you have to convert to and from UTF-8 or UTF-16 all the time, which is more expensive than just using strings. Don't make the mistake of assuming that your program will only need ASCII. Unless its just throwaway code then one day someone will need to put in a Euro symbol (U+20AC) or accented characters, and your nice fast bytestring implementation will be irretrievably broken.
Byte strings make some things, like prepending to the start of a string, more expensive.
That said, if you need performance and you can represent your data purely in bytestrings, then do so.
The basic answer given, use ByteString, is correct. That said, all of the three answers before mine have inaccuracies.
Regarding UTF-8: whether this will be an issue or not depends entirely on what sort of processing you do with your strings. If you're simply treating them as single chunks of data (which includes operations such as concatenation, though not splitting), or doing certain limited byte-based operations (e.g., finding the length of the string in bytes, rather than the length in characters), you won't have any issues. If you are using I18N, there are enough other issues that simply using String rather than ByteString will start to fix only a very few of the problems you'll encounter.
Prepending single bytes to the front of a ByteString is probably more expensive than doing the same for a String. However, if you're doing a lot of this, it's probably possible to find ways of dealing with your particular problem that are cheaper.
But the end result would be, for the poster of the original question: yes, Strings are inefficient in Haskell, though rather handy. If you're worried about efficiency, use ByteStrings, and view them as either arrays of Char8 or Word8, depending on your purpose (ASCII/ISO-8859-1 vs Unicode of some sort, or just arbitrary binary data). Generally, use Lazy ByteStrings (where prepending to the start of a string is actually a very fast operation) unless you know why you want non-lazy ones (which is usually wrapped up in an appreciation of the performance aspects of lazy evaluation).
For what it's worth, I am building an automated trading system entirely in Haskell, and one of the things we need to do is very quickly parse a market data feed we receive over a network connection. I can handle reading and parsing 300 messages per second with a negligable amount of CPU; as far as handling this data goes, GHC-compiled Haskell performs close enough to C that it's nowhere near entering my list of notable issues.

Resources