Most efficient method of constructing and printing very large strings

I have a program which constructs very large strings. Currently I am using lazy ByteStrings. Here are the problem parameters summarized:
The current implementation works up to about 500k characters, simply running out of memory afterwards (~600MB). I would like this (amount of characters) to run in under 50MB.
The string isn't accessed while being built. This probably leads to a lot of thunks and hence the memory issue. I am using Builder to make the ByteStrings, but it seems that there is no strict version of Builder (or at least I can't find it).
The string cannot be put in the file while being built. The entire build operation has to happen before the string is placed in a file.
I don't need unicode support. Even 7 bit ascii would do. I believe that ByteString doesn't waste memory to encode unicode characters though.
Things I have tried:
Calling seq on the ByteStrings as they are being built. This seems to work for 50-100k characters but after that the effect is the same.
Using strict ByteStrings. I couldn't figure out how to use Builder with them, so I ended up using lists and concat.
Using UArray Int Char. This means either knowing the size of the string in advance and allocating the entire array, or having a ton of intermediate data structures.
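For reference, the Builder-based approach described above can be sketched roughly as follows. This is a minimal sketch assuming the bytestring package's Data.ByteString.Builder module; renderLine, buildAll and render are hypothetical names, and the one-integer-per-line format is only an illustration. Whether running the Builder straight into a handle is compatible with the "build everything first" constraint is not something the question settles.

```haskell
import qualified Data.ByteString.Builder as B
import qualified Data.ByteString.Lazy as BL
import System.IO (stdout)

-- Hypothetical piece of output: one 7-bit ASCII line per integer.
renderLine :: Int -> B.Builder
renderLine n = B.intDec n <> B.char7 '\n'

-- Combine all pieces into one Builder. No bytes are produced until the
-- Builder is run by toLazyByteString or hPutBuilder.
buildAll :: [Int] -> B.Builder
buildAll = foldMap renderLine

-- Run the Builder into a lazy ByteString (chunks are produced on demand).
render :: [Int] -> BL.ByteString
render = B.toLazyByteString . buildAll

main :: IO ()
main = B.hPutBuilder stdout (buildAll [1 .. 100000])
```

hPutBuilder writes into the handle's buffer, so no complete copy of the output has to exist as a single ByteString.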

Related

What are the performance and pitfalls of String.sub

I'm considering using String.sub for a task on a hot path that inserts multiple elements inside a large string at arbitrary positions.
Knowing that this kind of function always has gotchas in other languages, I'd like to know what those are in the standard OCaml implementation.

String.sub (like most of the string manipulating functions) allocates a new string and copies the contents of the original string. So, it might be pretty slow if it is in a hot path.

Space leaks with Haskell's cereal library?

As a hobby project called 'beercan', I'm reverse-engineering the resource files of the Torchlight games. Using an okay-ish hex editor, I try to guess the structure of the files, and then I model my ideas, use cereal to write Getters (and later some Putters), and try to decode every file in an application of the library.
I've just started on Torchlight's compiled layout files (*.LAYOUT in TL1, *.LAYOUT.cmp in TL2). The format turns out to be a little trickier than the dat files, but I think I figured out the basic structure, and how they are encoded in the TL2 files, so I'm trying to make a map of file versions, tag numbers, and guessed data types.
To do so, I wrote an application that flattens the data structure, leaving only the guessed type of the values of the leaves, each annotated with the file version and the node and leaf tag numbers. I turn this into a map from the file version and tag numbers to a set of the guessed types. For every file, I'd expect this Map to maybe take twice the file size in memory. (Not sure, though.) Then, I merge these maps, and I print the map.
For some reason, even if I only take 20MB worth of files (100 files), memory usage increases linearly to about 200MB, then decreases to the final size of the resulting map, and then deflates rapidly as I print it.
I wouldn't expect this memory usage. Does anyone know how I could fix it? I've tried to force values after decoding them (using deepseq), I've tried adding bangs to data types, but this hasn't really helped. I've tried copying all bytestrings I keep in the file structure, which brought down the memory usage a bit, but it's still unacceptably high, especially when I want to analyze the entire dataset (200MB+ of original files).
-edit- I've pushed a (not very S)SCCE to demonstrate the performance issue, (accidentally) along with my profiling results.
Clone the repository.
cabal configure, with flags to enable profiling (is it normal to need --enable-library-profiling --enable-executable-profiling --ghc-options="-rtsopts -prof"?)
cabal build
cd test, and run StressTest.sh.
This script tries to load a regular TL2 layout file 100 times. On my machine, top says it takes about 500MB of memory, and the profiling results are consistent with my description above.

I totally agree with @petrpudlak: we would need actual code to make any meaningful comments on the question "why does my code use so much memory?" :) (sorry, you did offer code). However, some of the patterns you describe are pretty typical in Haskell, and some generic discussion is possible.
First of all, note that native Haskell types use a lot more memory than you might guess. Take a look at the GHC memory footprint page at http://www.haskell.org/haskellwiki/GHC/Memory_Footprint. Note that even a simple Char will take a full 16 bytes of memory! Add to that the pointers for linked list items in a String, and you will easily use more than an order of magnitude more memory than you might have guessed. If memory is important, you should use another data type, like Data.Text or Data.ByteString, which store strings internally more like C would (as a block of bytes in memory, with 1-4 bytes per char, depending on the encoding and which char is used). If data other than Strings is the problem, you can use unboxed arrays for arbitrary data types.
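As a back-of-the-envelope illustration of the paragraph above, here is a rough estimate assuming a 64-bit GHC, where a list cons cell takes three words and a boxed Char two. The function names are hypothetical, and GHC can share boxed Chars in some cases, so treat the String figure as an upper-bound estimate rather than a measurement.

```haskell
-- Rough per-character cost of a String on 64-bit GHC (assumed word counts).
wordSize :: Int
wordSize = 8

consCellWords, boxedCharWords :: Int
consCellWords = 3   -- header + pointer to the element + pointer to the tail
boxedCharWords = 2  -- header + the character itself

-- Estimated bytes for a String of n characters.
stringBytes :: Int -> Int
stringBytes n = n * (consCellWords + boxedCharWords) * wordSize

-- Estimated payload bytes for a ByteString of n 8-bit characters,
-- ignoring its small fixed-size header.
byteStringBytes :: Int -> Int
byteStringBytes n = n

main :: IO ()
main = do
  print (stringBytes 1000000)      -- 40000000
  print (byteStringBytes 1000000)  -- 1000000
```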
Second of all, if possible, you can cut down memory usage by processing items in series (where the memory will be garbage collected right away). Haskell laziness often does this for you automatically, for instance, try to run the following program
import Data.Char
main = interact $ map toUpper
As you type, the output will appear continuously (your OS, not Haskell, may buffer full lines, so you may need to hit 'enter' before seeing anything, but you will see output update for each 'enter'). Rather than loading the whole input into memory and then processing all at once, Char memory is being created and garbage collected Char by Char.
Of course this isn't always possible (i.e., if you have to process the data in a very nonlocal way), but most of the time at least parts of the code can be refactored this way to cut down total memory usage.
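The same streaming idea works with lazy ByteStrings, which read and process the input chunk by chunk. A minimal sketch, assuming the bytestring package (upperAll is a hypothetical name):

```haskell
import qualified Data.ByteString.Lazy.Char8 as BL8
import Data.Char (toUpper)

-- Upper-case every 8-bit character. On a lazy ByteString this works one
-- chunk at a time, so earlier chunks can be collected once consumed.
upperAll :: BL8.ByteString -> BL8.ByteString
upperAll = BL8.map toUpper

main :: IO ()
main = BL8.interact upperAll
```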
Edit- Sorry, I just realized that you did post a link to the code, and you are using ByteString..... So some of what I wrote isn't valid. But I do still see boxed lists and unpacking of the ByteString, so I will leave the answer as it is.

The memory usage pattern sounds like your application is building up a lot of unnecessary thunks and then memory consumption starts going down when those thunks get evaluated. I only glanced at your code quickly but one simple change you could try is to replace all imports of Data.Map with Data.Map.Strict. This is especially important if you are doing a lot of updates on the values inside a Map without forcing evaluation in between.
Another thing you should be aware of is that replicateM is quite inefficient with larger numbers in a strict monad (see e.g. this answer). I'm not sure what kinds of counts you are usually dealing with in your application, but it's good to keep in mind.
It might also help to use strict fields in simple container data types like your LeafValue type and compile with -funbox-strict-fields (and -O2 of course).
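A minimal sketch of these suggestions, assuming the containers package. Leaf is a hypothetical stand-in for a small container type (the real LeafValue definition is not shown in this thread), and countTags and toLeaves are made-up names.

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as Map

-- Hypothetical container type with strict fields: the fields are forced
-- when a Leaf is constructed, and -funbox-strict-fields lets GHC unbox them.
data Leaf = Leaf
  { leafTag   :: !Int
  , leafCount :: !Int
  } deriving (Eq, Show)

-- Count occurrences of each tag. Data.Map.Strict evaluates each value to
-- weak head normal form on insert, so no chain of (+) thunks builds up.
countTags :: [Int] -> Map.Map Int Int
countTags = foldl' (\m t -> Map.insertWith (+) t 1 m) Map.empty

-- Turn the counts into a list of strict records.
toLeaves :: Map.Map Int Int -> [Leaf]
toLeaves = map (uncurry Leaf) . Map.toList

main :: IO ()
main = mapM_ print (toLeaves (countTags [1, 2, 1, 3, 1]))
```

With the lazy Data.Map, the same insertWith call would store an unevaluated sum for each key until something demands it.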

Erlang binary strings by default

I am writing an erlang module that has to deal a bit with strings, not too much, however, I do some tcp recv and then some parsing over the data.
While matching data and manipulating strings, I am using binary module all the time like binary:split(Data,<<":">>) and basically using <<"StringLiteral">> all the time.
So far I have not encountered difficulties or missing functions compared to the alternative (using lists), and everything is coming out quite naturally, except maybe for having to add the <<>>. But I was wondering if this way of dealing with strings might have drawbacks I am not aware of.
Any hint?

As long as you and your team remember that your strings are binaries and not lists, there are no inherent problems with this approach. In fact, Couch DB took this approach as an optimization which apparently paid nice dividends.

You do need to be very aware of how your string is encoded in your binaries. When you do <<"StringLiteral">> in your code, you have to be aware that this is simply a binary serialization of the list of code points. Your Erlang compiler reads your code as ISO-8859-1 characters, so as long as you only use Latin-1 characters and do this consistently, you should be fine. But this isn't very friendly to internationalization.
Most application software these days should prefer a Unicode encoding. UTF-8 is compatible with your <<"StringLiteral">> for the first 128 code points, but not for the second 128, so be careful. You might be surprised by what you see in your UTF-8 encoded web applications if you use <<"StrïngLïteral">> in your code.
There was an EEP proposal for binary support in the form of <<"StrïngLïteral"/utf8>>, but I don't think this is finalized.
Also be aware that your binary:split/2 function may have unexpected results in UTF-8 if there is a multi-byte character that contains the ISO-8859-1 byte that you are splitting on.
Some would argue that UTF-16 is a better encoding to use because it can be parsed more efficiently and can be more easily split by index, if you assume or verify that there are no 32-bit characters.
The unicode module should be used, but tread carefully when you use literals.

The only thing to be aware of is that a binary is a slice of bytes, whereas a list is a list of unicode codepoints. In other words, the latter is naturally unicode whereas the former requires you to do some sort of encoding, usually UTF-8.
To my knowledge, there are no drawbacks to your method.

Binaries are very efficient structures for storing strings. If they are longer than 64 bytes they are also stored outside the process heap, so they are not subject to the per-process GC (they are still reclaimed by reference counting when the last reference is lost). Don't forget to use iolists when concatenating them, to avoid copying when performance matters.

VB6 - Is there any performance benefit gained by using fixed-width strings in VB6?

In pre-.NET Visual Basic, a programmer could declare a string to be a certain width. For example, I know that a social-security number (in the US) is always eleven characters. So, I can declare a string that would store social-security numbers as an eleven-character string like this:
Dim SSN As String * 11
My question is: does this create any type of performance benefit that would either make the code run faster or perhaps use less memory? Also, would a fixed-length string be allocated in memory differently (i.e.: on the stack as opposed to in the heap)?

No, there is no performance benefit.
BUT even if there were, unless you were calling many (say millions) times in a loop, any performance benefit would be negligible.
Also, fixed-length strings occupy more memory than variable-length ones if you are not using the entire length (unless very short fixed length strings).
As always, you should carefully benchmark before making the code harder to maintain.
Fixed-length strings were usually seen when interacting with some COM APIs, or when modelling domain constraints (such as the example you gave of an SSN).

The only time in VB6 or earlier that I had to use fixed-length strings was when working with API calls. Not passing a fixed-length string would cause unexplained errors at times when the length was longer than expected, and sometimes even when it was shorter than expected.
If you are going through and planning to change that in the application make sure there is no passing of the strings to an API or external DLL, and that the program does not require fixed length fields to be output, such as with many AS/400 import programs.
I personally never got to see a performance difference as I was running loops of 300k+ records, but had no choice but to provide and work with fixed lengths when I did. However VB likes to use undefined lengths by default so I would imagine the performance would be lower for fixed length.
Try writing a test app to perform a basic concatenation of two strings, and have it loop over the function like 50k times. Time the difference between the two of having one undefined length and the other fixed.

Efficient String Implementation in Haskell

I'm currently teaching myself Haskell, and I'm wondering what the best practices are when working with strings in Haskell.
The default string implementation in Haskell is a list of Char. This is inefficient for file input-output, according to Real World Haskell, since each character is separately allocated (I assume that this means that a String is basically a linked list in Haskell, but I'm not sure.)
But if the default string implementation is inefficient for file i/o, is it also inefficient for working with Strings in memory? Why or why not? C uses an array of char to represent a String, and I assumed that this would be the default way of doing things in most languages.
As I see it, the list implementation of String will take up more memory, since each character will require overhead, and also more time to iterate over, because a pointer dereferencing will be required to get to the next char. But I've liked playing with Haskell so far, so I want to believe that the default implementation is efficient.

Apart from String/ByteString there is now the Text library which combines the best of both worlds—it works with Unicode while being ByteString-based internally, so you get fast, correct strings.

Best practices for working with strings performantly in Haskell are basically: Use Data.ByteString/Data.ByteString.Lazy.
http://hackage.haskell.org/packages/archive/bytestring/latest/doc/html/
As far as the efficiency of the default string implementation in Haskell goes, it's not efficient. Each Char represents a Unicode code point, which means it needs at least 21 bits per Char.
Since a String is just [Char], that is, a linked list of Char, Strings have poor locality of reference and are fairly large in memory: at a minimum N * (21 bits + M bits), where N is the length of the string and M is the size of a pointer (32, 64, what have you). And unlike many other places where Haskell uses lists where other languages might use different structures (I'm thinking specifically of control flow here), Strings are much less likely to be optimized into loops, etc. by the compiler.
And while a Char corresponds to a code point, the Haskell 98 report doesn't specify anything about the encoding used when doing file IO, not even a default, much less a way to change it. In practice GHC provides extensions to do e.g. binary IO, but you're going off the reservation at that point anyway.
Even with operations like prepending to front of the string it's unlikely that a String will beat a ByteString in practice.

The answer is a bit more complex than just "use lazy bytestrings".
Byte strings only store 8 bits per value, whereas String holds real Unicode characters. So if you want to work with Unicode then you have to convert to and from UTF-8 or UTF-16 all the time, which is more expensive than just using strings. Don't make the mistake of assuming that your program will only need ASCII. Unless it's just throwaway code, one day someone will need to put in a Euro symbol (U+20AC) or accented characters, and your nice fast bytestring implementation will be irretrievably broken.
Byte strings make some things, like prepending to the start of a string, more expensive.
That said, if you need performance and you can represent your data purely in bytestrings, then do so.
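To make the Euro symbol pitfall concrete, here is a small sketch assuming the bytestring and text packages (the names euro, truncatedBytes and utf8Bytes are made up for illustration). Data.ByteString.Char8.pack keeps only the low 8 bits of each Char, while encoding through Data.Text produces the full UTF-8 byte sequence.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as B8
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Word (Word8)

-- The Euro sign, U+20AC.
euro :: String
euro = "\x20AC"

-- Char8.pack truncates every Char to 8 bits, silently losing information.
truncatedBytes :: [Word8]
truncatedBytes = B.unpack (B8.pack euro)

-- Going through Text and encodeUtf8 keeps the character intact.
utf8Bytes :: [Word8]
utf8Bytes = B.unpack (TE.encodeUtf8 (T.pack euro))

main :: IO ()
main = do
  print truncatedBytes  -- [172]
  print utf8Bytes       -- [226,130,172]
```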

The basic answer given, use ByteString, is correct. That said, all of the three answers before mine have inaccuracies.
Regarding UTF-8: whether this will be an issue or not depends entirely on what sort of processing you do with your strings. If you're simply treating them as single chunks of data (which includes operations such as concatenation, though not splitting), or doing certain limited byte-based operations (e.g., finding the length of the string in bytes, rather than the length in characters), you won't have any issues. If you are using I18N, there are enough other issues that simply using String rather than ByteString will start to fix only a very few of the problems you'll encounter.
Prepending single bytes to the front of a ByteString is probably more expensive than doing the same for a String. However, if you're doing a lot of this, it's probably possible to find ways of dealing with your particular problem that are cheaper.
But the end result would be, for the poster of the original question: yes, Strings are inefficient in Haskell, though rather handy. If you're worried about efficiency, use ByteStrings, and view them as either arrays of Char8 or Word8, depending on your purpose (ASCII/ISO-8859-1 vs Unicode of some sort, or just arbitrary binary data). Generally, use Lazy ByteStrings (where prepending to the start of a string is actually a very fast operation) unless you know why you want non-lazy ones (which is usually wrapped up in an appreciation of the performance aspects of lazy evaluation).
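The two views of a ByteString mentioned above, and prepending to a lazy one, can be sketched like this (assuming the bytestring package; asBytes, asChars and prependLazy are hypothetical helper names):

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as B8
import qualified Data.ByteString.Lazy.Char8 as BL8
import Data.Word (Word8)

-- View the contents as raw bytes.
asBytes :: B.ByteString -> [Word8]
asBytes = B.unpack

-- View the same contents as 8-bit characters (ASCII / ISO-8859-1).
asChars :: B.ByteString -> String
asChars = B8.unpack

-- Prepending to a lazy ByteString puts a new chunk in front instead of
-- copying the existing data.
prependLazy :: Char -> BL8.ByteString -> BL8.ByteString
prependLazy = BL8.cons

main :: IO ()
main = do
  let s = B8.pack "GHC"
  print (asBytes s)    -- [71,72,67]
  putStrLn (asChars s) -- GHC
  putStrLn (BL8.unpack (prependLazy '>' (BL8.pack "GHC")))  -- >GHC
```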
For what it's worth, I am building an automated trading system entirely in Haskell, and one of the things we need to do is very quickly parse a market data feed we receive over a network connection. I can handle reading and parsing 300 messages per second with a negligible amount of CPU; as far as handling this data goes, GHC-compiled Haskell performs close enough to C that it's nowhere near entering my list of notable issues.
