I encountered the following problem in TCL. In my application, I read very large text files (some hundreds of MB) into TCl list. The list is then returned by the function to the main context, and then checked for emptiness. Here is the code snapshot:
set merged_trace_list [merge_trace_files $exclude_trace_file $trace_filenames ]
if {$merged_trace_list == ""} {
...
And I get crash at the "if" line. The crash seems to be related to memory overflow. I thought that the comparison to "" forces TCL to convert list to the string, and since the string is too long, this causes crash. I then replaced above "if" line by another one:
if {[lempty $merged_trace_list]} {
and crash indeed disappeared. In the light of the above, I have several questions:
What is the maximum allowed string length in TCL?
What is difference between string and list in TCL in terms of memory allocation? Why I can have very long list, but not corresponding string?
When the list first returned by the function into the main scope (the first line) , is it not converted to the string first? And if yes, why I don't have crash in that line?
Thanks,
I hope the descriptions and the questions are clear.
Konstantin
The current maximum size of individual memory object (e.g., string) is 2GB. This is a known bug (of long standing) on 64-bit platforms, but fixing it requires a significant ABI and API breaking change, so it won't appear until Tcl 9.0.
The difference between strings and lists is that strings are stored in a single block of memory, whereas lists are stored in an array of pointers to elements. You can probably get 256k elements in a list no problem, but after that you might run into problems as the array reaches the 2GB limit.
Tcl's value objects may be simultaneously both lists and strings; the dictum about Tcl that “everything is a string” is not actually true, it's just that everything may be serialized to a string. The returning of a list does not force it to be converted to string — that's actually a fairly slow operation — but comparing the value for equality with a string does force the generation of the string. The lempty command must be instead getting the length of the string (you can use llength to do the same thing) and comparing that to zero.
Can you adjust your program to not need to hold all that data in memory at once? It's living a little dangerously given the bug mentioned above.
This is not really an answer, but it's slightly too much for a comment.
If you want to check if a list is empty, the best option is llength. If the list length is 0, your list has no content. The low-level lookup for this is very cheap.
If you still want to determine if a list is empty by comparing it to the empty string you will have to face the cost of resolving the string representation of the list. In this case, $myLongList eq {} is preferable to $myLongList == {}, since the latter comparison also forces the interpreter to check if the operands are numeric (at least it used to be like that, it might have changed).
Related
I have a program which constructs very large strings. Currently I am using lazy ByteStrings. Here are the problem parameters summarized:
The current implementation works up to about 500k characters, simply running out of memory afterwards (~600MB). I would like this (amount of characters) to run in under 50MB.
The string isn't accessed while being built. This probably leads to a lot of thunks and hence the memory issue. I am using Builder to make the ByteStrings, but it seems that there is no strict version of Builder (or at least I can't find it).
The string cannot be put in the file while being built. The entire build operation has to happen before the string is placed in a file.
I don't need unicode support. Even 7 bit ascii would do. I believe that ByteString doesn't waste memory to encode unicode characters though.
Things I have tried:
Calling seq on the ByteStrings as they are being built. This seems to work for 50-100k characters but after that the effect is the same.
Using strict ByteStrings. I couldn't figure out how to use Builder with them, so I ended up using lists and concat.
Using UArray Int Char. This means either knowing the size of the string in advance and allocating the entire array, or having a ton of intermediate data structures.
I have this function for reversing strings in ocaml however it says that I have my types wrong. I am unsure as to why or what I can do :(
Any tips on debugging would also be greatly appreciated!
28 let reverse s =
29 let rec helper i =
30 if i >= String.length s then "" else (helper (i+1))^(s.[i])
31 in
32 helper 0
Error: This expression has type char but an expression was expected of type
string
Thank you
Your implementation does not have the expected (linear) time and space complexity: it is quadratic in both time and space, so it is hardly a correct implementation of the requested feature.
String concatenation sa^sb allocates a new string of size length sa + length sb, and fills it with the two strings; this means that both its time and space complexity are linear in the sum of the lengths. When you iterate this operation once per character, you get an algorithm of quadratic complexity (the total size of memory allocated, and total number of copies, will be 1+2+3+....+n).
To correctly implement this algorithm, you could either:
allocate a string of the expected size, and mutate it in place with the content of the input string, reversed
create a string list made of reversed size-one strings, then use String.concat to concatenate all of them at once (which allocates the result and copies the strings only once)
use the Buffer module which is meant to accumulate characters or strings iteratively without exhibiting a quadratic behavior (it uses a dynamic resizing policy that makes addition of a char amortized constant time)
The first approach is both the simplest and the fastest, but the other two will get more interesting in more complex application where you want to concatenate strings, but it's less straightforward to know in one step what the final result will be.
The error message is pretty clear, I think. The expression s.[i] represents a character (the ith character of the string). But the ^ operator requires strings as its arguments.
To get past the problem you can use String.make 1 s.[i]. This expression gives a 1-character string containing the single character s.[i].
Handling strings recursively in OCaml isn't as nice as it could be, because there's no nice way to destructure a string (break it into parts). The equivalent code to reverse a list looks a lot prettier. For what it's worth :-)
You can also use 3rd party libraries to do so. http://batteries.forge.ocamlcore.org/ already implements a function for reversing strings
Some times you could want to avoid/minimize the garbage collector, so I want to be sure about how to do it.
I think that the next one is correct:
Declare variables at the beginning of the function.
To use array instead of slice.
Any more?
To minimize garbage collection in Go, you must minimize heap allocations. To minimize heap allocations, you must understand when allocations happen.
The following things always cause allocations (at least in the gc compiler as of Go 1):
Using the new built-in function
Using the make built-in function (except in a few unlikely corner cases)
Composite literals when the value type is a slice, map, or a struct with the & operator
Putting a value larger than a machine word into an interface. (For example, strings, slices, and some structs are larger than a machine word.)
Converting between string, []byte, and []rune
As of Go 1.3, the compiler special cases this expression to not allocate: m[string(b)], where m is a map and b is a []byte
Converting a non-constant integer value to a string
defer statements
go statements
Function literals that capture local variables
The following things can cause allocations, depending on the details:
Taking the address of a variable. Note that addresses can be taken implicitly. For example a.b() might take the address of a if a isn't a pointer and the b method has a pointer receiver type.
Using the append built-in function
Calling a variadic function or method
Slicing an array
Adding an element to a map
The list is intended to be complete and I'm reasonably confident in it, but am happy to consider additions or corrections.
If you're uncertain of where your allocations are happening, you can always profile as others suggested or look at the assembly produced by the compiler.
Avoiding garbage is relatively straight forward. You need to understand where the allocations are being made and see if you can avoid the allocation.
First, declaring variables at the beginning of a function will NOT help. The compiler does not know the difference. However, human's will know the difference and it will annoy them.
Use of an array instead of a slice will work, but that is because arrays (unless dereferenced) are put on the stack. Arrays have other issues such as the fact that they are passed by value (copied) between functions. Anything on the stack is "not garbage" since it will be freed when the function returns. Any pointer or slice that may escape the function is put on the heap which the garbage collector must deal with at some point.
The best thing you can do is avoid allocation. When you are done with large bits of data which you don't need, reuse them. This is the method used in the profiling tutorial on the Go blog. I suggest reading it.
Another example besides the one in the profiling tutorial: Lets say you have an slice of type []int named xs. You continually append to the []int until you reach a condition and then you reset it so you can start over. If you do xs = nil, you are now declaring the underlying array of the slice as garbage to be collected. Append will then reallocate xs the next time you use it. If instead you do xs = xs[:0], you are still resetting it but keeping the old array.
For the most part, trying to avoid creating garbage is premature optimization. For most of your code it does not matter. But you may find every once in a while a function which is called a great many times that allocates a lot each time it is run. Or a loop where you reallocate instead of reusing. I would wait until you see the bottle neck before going overboard.
In pre-.NET Visual Basic, a programmer could declare a string to be a certain width. For example, I know that a social-security number (in the US) is always eleven characters. So, I can declare a string that would store social-security numbers as an eleven-character string like this:
Dim SSN As String * 11
My question is: does this create any type of performance benefit that would either make the code run faster or perhaps use less memory? Also, would a fixed-length string be allocated in memory differently (i.e.: on the stack as opposed to in the heap)?
No, there is no performance benefit.
BUT even if there were, unless you were calling many (say millions) times in a loop, any performance benefit would be negligible.
Also, fixed-length strings occupy more memory than variable-length ones if you are not using the entire length (unless very short fixed length strings).
As always, you should carefully benchmark before making the code harder to maintain.
Fixed length strings were usually seen when interacting with some COM API's, or when modelling to domain constraints (such as the example you gave of a SSN)
The only time in VB6 or earlier that I had to use fixed length strings was with working with API calls. Not passing a fixed length string would cause unexplained errors at times when the length was longer than expected, and even sometimes when shorter than expected.
If you are going through and planning to change that in the application make sure there is no passing of the strings to an API or external DLL, and that the program does not require fixed length fields to be output, such as with many AS/400 import programs.
I personally never got to see a performance difference as I was running loops of 300k+ records, but had no choice but to provide and work with fixed lengths when I did. However VB likes to use undefined lengths by default so I would imagine the performance would be lower for fixed length.
Try writing a test app to perform a basic concatenation of two strings, and have it loop over the function like 50k times. Time the difference between the two of having one undefined length and the other fixed.
I'm currently teaching myself Haskell, and I'm wondering what the best practices are when working with strings in Haskell.
The default string implementation in Haskell is a list of Char. This is inefficient for file input-output, according to Real World Haskell, since each character is separately allocated (I assume that this means that a String is basically a linked list in Haskell, but I'm not sure.)
But if the default string implementation is inefficient for file i/o, is it also inefficient for working with Strings in memory? Why or why not? C uses an array of char to represent a String, and I assumed that this would be the default way of doing things in most languages.
As I see it, the list implementation of String will take up more memory, since each character will require overhead, and also more time to iterate over, because a pointer dereferencing will be required to get to the next char. But I've liked playing with Haskell so far, so I want to believe that the default implementation is efficient.
Apart from String/ByteString there is now the Text library which combines the best of both worlds—it works with Unicode while being ByteString-based internally, so you get fast, correct strings.
Best practices for working with strings performantly in Haskell are basically: Use Data.ByteString/Data.ByteString.Lazy.
http://hackage.haskell.org/packages/archive/bytestring/latest/doc/html/
As far as the efficiency of the default string implementation goes in Haskell, it's not. Each Char represents a Unicode codepoint which means it needs at least 21bits per Char.
Since a String is just [Char], that is a linked list of Char, it means Strings have poor locality of reference, and again means that Strings are fairly large in memory, at a minimum it's N * (21bits + Mbits) where N is the length of the string and M is the size of a pointer (32, 64, what have you) and unlike many other places where Haskell uses lists where other languages might use different structures (I'm thinking specifically of control flow here), Strings are much less likely to be able to be optimized to loops, etc. by the compiler.
And while a Char corresponds to a codepoint, the Haskell 98 report doesn't specify anything about the encoding used when doing file IO, not even a default much less a way to change it. In practice GHC provides an extensions to do e.g. binary IO, but you're going off the reservation at that point anyway.
Even with operations like prepending to front of the string it's unlikely that a String will beat a ByteString in practice.
The answer is a bit more complex than just "use lazy bytestrings".
Byte strings only store 8 bits per value, whereas String holds real Unicode characters. So if you want to work with Unicode then you have to convert to and from UTF-8 or UTF-16 all the time, which is more expensive than just using strings. Don't make the mistake of assuming that your program will only need ASCII. Unless its just throwaway code then one day someone will need to put in a Euro symbol (U+20AC) or accented characters, and your nice fast bytestring implementation will be irretrievably broken.
Byte strings make some things, like prepending to the start of a string, more expensive.
That said, if you need performance and you can represent your data purely in bytestrings, then do so.
The basic answer given, use ByteString, is correct. That said, all of the three answers before mine have inaccuracies.
Regarding UTF-8: whether this will be an issue or not depends entirely on what sort of processing you do with your strings. If you're simply treating them as single chunks of data (which includes operations such as concatenation, though not splitting), or doing certain limited byte-based operations (e.g., finding the length of the string in bytes, rather than the length in characters), you won't have any issues. If you are using I18N, there are enough other issues that simply using String rather than ByteString will start to fix only a very few of the problems you'll encounter.
Prepending single bytes to the front of a ByteString is probably more expensive than doing the same for a String. However, if you're doing a lot of this, it's probably possible to find ways of dealing with your particular problem that are cheaper.
But the end result would be, for the poster of the original question: yes, Strings are inefficient in Haskell, though rather handy. If you're worried about efficiency, use ByteStrings, and view them as either arrays of Char8 or Word8, depending on your purpose (ASCII/ISO-8859-1 vs Unicode of some sort, or just arbitrary binary data). Generally, use Lazy ByteStrings (where prepending to the start of a string is actually a very fast operation) unless you know why you want non-lazy ones (which is usually wrapped up in an appreciation of the performance aspects of lazy evaluation).
For what it's worth, I am building an automated trading system entirely in Haskell, and one of the things we need to do is very quickly parse a market data feed we receive over a network connection. I can handle reading and parsing 300 messages per second with a negligable amount of CPU; as far as handling this data goes, GHC-compiled Haskell performs close enough to C that it's nowhere near entering my list of notable issues.