Data structure for storing strings? - string

I'm looking for a data structure to store strings in. I require a function in the interface that takes a string as its only parameter and returns a reference/iterator/pointer/handle that can be used to retrieve the string for the rest of the lifetime of the data structure. Set membership, entry deletion, etc. are not required.
I'm more concerned with memory usage than speed.

One highly efficient data structure for storing strings is the Trie. This saves both memory and time by storing strings with common prefixes using the same memory.
You could use as the pointer returned the final marker of the string in the Trie, which uniquely identifies the string, and could be used to recreate the string by traversing the Trie upwards.
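A minimal sketch of that idea in Java, assuming each node keeps a link to its parent so the final node can act as the handle. The names StringTrie, Node, add and get are made up for illustration; they are not from any library.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: a trie whose nodes keep a parent link, so a node doubles as a string handle. */
final class StringTrie {
    /** One trie node; the node reached by the last character of a string is that string's handle. */
    static final class Node {
        final Node parent;   // null for the root
        final char ch;       // character on the edge from the parent to this node
        final Map<Character, Node> children = new HashMap<>();

        Node(Node parent, char ch) {
            this.parent = parent;
            this.ch = ch;
        }
    }

    private final Node root = new Node(null, '\0');

    /** Stores s, reusing nodes for any prefix already present, and returns its handle. */
    Node add(String s) {
        Node node = root;
        for (int i = 0; i < s.length(); i++) {
            final Node parent = node;
            node = parent.children.computeIfAbsent(s.charAt(i), c -> new Node(parent, c));
        }
        return node;
    }

    /** Rebuilds the string by walking from the handle up to the root. */
    static String get(Node handle) {
        int length = 0;
        for (Node n = handle; n.parent != null; n = n.parent) {
            length++;
        }
        char[] chars = new char[length];
        for (Node n = handle; n.parent != null; n = n.parent) {
            chars[--length] = n.ch;   // fill from the end, since we walk backwards
        }
        return new String(chars);
    }
}
```

Note that a HashMap per node is used here only to keep the sketch short. It carries a lot of per-node overhead, so a memory-conscious version would need a more compact child representation before the prefix sharing pays off.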

I think the keyword here is string interning, where you store only one copy of each distinct string. In Java, this is accomplished by String.intern():
String ref1 = "hello world".intern();
String ref2 = "HELLO WORLD".toLowerCase().intern();
assert ref1 == ref2;

I think the best bet here would be an ArrayList. The common implementations have some overhead from allocating extra space in the array for new elements, but if memory is such a requirement, you can allocate manually for each new element. It will be slower, but will only use the necessary memory for the string.
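As a rough sketch of this suggestion, assuming the handle is simply the index into the list (the class name HandleStore and its methods are invented for this example):

```java
import java.util.ArrayList;

/** Hypothetical sketch: an append-only store whose handles are list indices. */
final class HandleStore {
    private final ArrayList<String> strings = new ArrayList<>();

    /** Appends s and returns its index, which stays valid because nothing is ever removed. */
    int add(String s) {
        strings.add(s);
        return strings.size() - 1;
    }

    /** Retrieves the string for a handle returned by add. */
    String get(int handle) {
        return strings.get(handle);
    }

    /** Releases the spare capacity the backing array may have accumulated. */
    void compact() {
        strings.trimToSize();
    }
}
```

Calling compact() after a bulk load trades a one-off array copy for getting rid of the unused slots, which is a cheaper alternative to resizing on every insertion.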

There are three ways to store strings:
Fixed length (array type structure)
Variable length, with the maximum size fixed at run time (pointer type structure)
Linked list structure

Related

Any vs Nested Concrete type in Kotlin Collections

Does using Any as the collection type consume less memory than using a concrete type?
Suppose:
val list1 = listOf<Any>("ABC", "DEF", "GHI", "JKL", "MNO")
val list2 = listOf<String>("ABC", "DEF", "GHI", "JKL", "MNO")
I wonder if list1 consumes less memory than list2, since the String type allocates memory to store its properties (e.g. size).
So, is it better to use list1 if I don't use any String-specific functions?
EDIT
What if I want to use other types in the collection?
list = listOf<Any>("ABC", 123, 12.34)
is it more efficient than
list = listOf<String>("ABC", "123", "12.34")
EDIT 2
thanks to #João Dias and #gidds
as #gidds says :
the list does not directly contain String objects, or Any objects — it contains references.
And a String reference is exactly the same size as an Any reference or a reference of any other type. 
So List<String> and List<Any> are exactly the same at runtime because of type erasure (as pointed out by #João Dias); they differ only in their compile-time type.
But does that mean that
val list1 = listOf<Any>("ABC", "DEF", "GHI")
and
val list2 = listOf<String>("ABC", "DEF", "GHI")
consume the same memory as
val list3 = listOf<List<List<List<String>>>>(
    listOf(listOf(listOf("ABC"))),
    listOf(listOf(listOf("DEF"))),
    listOf(listOf(listOf("GHI")))
)
AFAIK, a String is basically a collection of Char: a String contains a reference to its Chars. And since everything is an object in Kotlin, every Char inside the String should contain a reference to a value on the heap. Am I correct up to here?
If that's the case, doesn't it make sense that List<String> consumes more memory than List<Any>, because List<String> holds more than one reference?
A point not addressed so far is that the list does not directly contain String objects, or Any objects — it contains references.
And a String reference is exactly the same size as an Any reference or a reference of any other type. (That size depends on the internals of the JVM running the code; it might be 4 or 8 bytes.)
Of course, the objects being referred to will also take up their own space in the heap; but that will be the same in both cases.
EDITED TO ADD:
The internal details of how List and String are implemented are irrelevant to the original question. (Which is good, because they vary between implementations.) JVM languages (such as Kotlin) have only two kinds of value: primitives (Int, Short, Long, Byte, Char, Double, Float, Boolean), and references (to an object or an array).
So any collection, if it's not a collection of primitives, is a collection of references. That applies to all List implementations. So your list1 and list2 objects will be exactly the same size, depending only on the number of references they hold (or can hold), not on what's in those references.
If you want a deeper picture, list1 is a reference, pointing to an object which implements the List interface. There are many different implementations, and I don't know off-hand which one Kotlin will pick (and again, that might change between versions), but say for example it's an ArrayList. That has at least two properties: a size (which is probably an Int), and a reference to an array which holds the references to the items in the list. (The array will usually be bigger than the current size of the list, so that you can add some more items without having to re-allocate the array each time; the current size of the array is known as the list's capacity.) If those items are Strings, then the exact internal representation depends on the JVM version, but it might be an object with at least three properties: an array of Char, an Int giving the start index of the string within the array, and another Int giving the length.
But as I said, the details change over time and between JVM versions. What doesn't change is that List is a collection of references, and the size of a reference doesn't depend on its type. So a list of String references will (all other things being equal) take exactly the same space as a list of Any references to those same strings.
(And, as has been mentioned elsewhere, due to type erasure at runtime the JVM has no concept of type parameters, and so the objects will in fact be identical.)
Of course, the ‘deep size’ (the overall heap space taken up by the list and the objects it contains) will depend upon the size of those objects — but in the case we're discussing, those are the exact same String objects, so there's no difference in size there either.
It does not make any difference in memory consumption. You are building the exact same list with exactly the same content. Additionally, there is a thing called type erasure that goes something like this:
Type erasure can be explained as the process of enforcing type
constraints at compile-time and discarding the element type
information at runtime.
This means that at runtime there is no List<String> or List<Any>, just List, so it makes no difference to memory consumption whether you use the first or the second. For code readability, maintainability and robustness you should definitely go with List<String>. In Kotlin, lists exposed as List are read-only by default. (Thanks to #Alex.T and #Tenfour04 for the hints: a List can only be considered immutable if its elements are also immutable and the concrete implementation of List is indeed immutable. For example, a list: List<String> property backed by a mutable ArrayList<String> is not completely immutable, since a simple cast allows you to add or remove elements: (list as ArrayList<String>).add("new-element").) Given that, List<Any> has basically only disadvantages: when you iterate over it or use any of its elements, all you know is that each one is Any, which is much harder to work with than a specific type such as String.
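The effect of type erasure can be checked with a small experiment. The sketch below is written in Java rather than Kotlin (both compile to the same JVM generics model), and the class and method names are invented for the example:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: the type parameter of a list leaves no trace at runtime. */
final class ErasureDemo {
    /** True if a List<String> and a List<Object> are instances of the very same runtime class. */
    static boolean sameRuntimeClass() {
        List<String> strings = new ArrayList<>();
        List<Object> objects = new ArrayList<>();
        return strings.getClass() == objects.getClass();
    }

    /** True if both lists end up holding a reference to the same String object. */
    static boolean sameElementObject() {
        String s = "ABC";
        List<String> strings = new ArrayList<>();
        List<Object> objects = new ArrayList<>();
        strings.add(s);
        objects.add(s);
        return strings.get(0) == objects.get(0);
    }
}
```

Both methods return true: the two lists differ only in what the compiler lets you do with them, not in what is stored on the heap.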

Does a string use a list to store char's?

Curious if a string data type uses a list to store its characters? And if so, what is the list implementation type?
Other than being immutable, a String has elements and index positions like a list so wondering if a String is actually a list data structure.
In most modern programming languages, a string is a wrapper around an array. In .NET, for example, the backing store is an array of 16-bit integers, each of which holds a single UTF-16 code unit. Whatever the backing store, member functions provide indexing and such.
In C, there isn't even a string type. What we call strings are really just arrays of characters. There are certain rules for the format of those arrays (like a null byte, \0, terminates the string), and "string functions" that operate on those arrays as though they are strings.
There are string handling libraries that use ropes to more efficiently manipulate long strings.
Note also that not all languages specify that strings are immutable. In some languages, you can modify individual characters within the string, and do some other things without actually creating a new string object.

Conversion of list to string - TCL

I encountered the following problem in TCL. In my application, I read very large text files (some hundreds of MB) into a TCL list. The list is then returned by the function to the main context and checked for emptiness. Here is the code snippet:
set merged_trace_list [merge_trace_files $exclude_trace_file $trace_filenames ]
if {$merged_trace_list == ""} {
...
And I get a crash at the "if" line. The crash seems to be related to memory overflow. I thought that the comparison to "" forces TCL to convert the list to a string, and since the string is too long, this causes the crash. I then replaced the above "if" line with another one:
if {[lempty $merged_trace_list]} {
and the crash indeed disappeared. In light of the above, I have several questions:
What is the maximum allowed string length in TCL?
What is the difference between a string and a list in TCL in terms of memory allocation? Why can I have a very long list, but not the corresponding string?
When the list is first returned by the function into the main scope (the first line), is it not converted to a string first? And if so, why don't I get a crash on that line?
Thanks,
I hope the descriptions and the questions are clear.
Konstantin
The current maximum size of an individual memory object (e.g., a string) is 2GB. This is a known bug (of long standing) on 64-bit platforms, but fixing it requires a significant ABI and API breaking change, so it won't appear until Tcl 9.0.
The difference between strings and lists is that strings are stored in a single block of memory, whereas lists are stored in an array of pointers to elements. You can probably get 256M elements in a list no problem (2GB divided by 8-byte pointers), but after that you might run into problems as the array reaches the 2GB limit.
Tcl's value objects may be simultaneously both lists and strings; the dictum about Tcl that “everything is a string” is not actually true, it's just that everything may be serialized to a string. Returning a list does not force it to be converted to a string (that's actually a fairly slow operation), but comparing the value for equality with a string does force the generation of the string. The lempty command must instead be getting the length of the list (you can use llength to do the same thing) and comparing that to zero.
Can you adjust your program to not need to hold all that data in memory at once? It's living a little dangerously given the bug mentioned above.
This is not really an answer, but it's slightly too much for a comment.
If you want to check if a list is empty, the best option is llength. If the list length is 0, your list has no content. The low-level lookup for this is very cheap.
If you still want to determine if a list is empty by comparing it to the empty string you will have to face the cost of resolving the string representation of the list. In this case, $myLongList eq {} is preferable to $myLongList == {}, since the latter comparison also forces the interpreter to check if the operands are numeric (at least it used to be like that, it might have changed).

Mapping arbitrary objects to indices

Let's assume that we have some objects (strings, for example). It is well known that working with indices (i.e. with numbers 1,2,3...) is much more convenient than with arbitrary objects.
Is there any common way of assigning an index for each object? One can create a hash_map and store an index in the value, but that will be memory-expensive when the number of objects is too high to be placed into the memory.
Thanks.
You can store the string objects in a sorted file.
This way, you are not storing the objects in memory.
Your mapping function can search for the required object in the sorted file.
You can create a hash map to optimize the search.
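A small sketch of the sorted-lookup idea, with the caveat that it uses an in-memory sorted array as a stand-in for the sorted file, so it only illustrates the mapping and not the disk layout. The class SortedIndex and its methods are hypothetical:

```java
import java.util.Arrays;

/** Hypothetical sketch: an object's index is its position in a sorted, duplicate-free sequence. */
final class SortedIndex {
    private final String[] sorted;

    SortedIndex(String[] objects) {
        // Sort a de-duplicated copy; the position in this array becomes the index.
        sorted = Arrays.stream(objects).distinct().sorted().toArray(String[]::new);
    }

    /** Returns the index assigned to s, or -1 if s is not in the collection. */
    int indexOf(String s) {
        int pos = Arrays.binarySearch(sorted, s);
        return pos >= 0 ? pos : -1;
    }

    /** Returns the object that was assigned the given index. */
    String objectAt(int index) {
        return sorted[index];
    }
}
```

No index is stored per object: it is implied by the sort order, and a lookup costs a binary search. With a file, the same search would run over record offsets instead of array slots.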

What's the advantage of a String being Immutable?

I once read that an advantage of strings being immutable has something to do with improving memory performance.
Can anybody explain this to me? I can't find it on the Internet.
Immutability (for strings or other types) can have numerous advantages:
It makes it easier to reason about the code, since you can make assumptions about variables and arguments that you can't otherwise make.
It simplifies multithreaded programming since reading from a type that cannot change is always safe to do concurrently.
It allows for a reduction of memory usage by allowing identical values to be combined together and referenced from multiple locations. Both Java and C# perform string interning to reduce the memory cost of literal strings embedded in code.
It simplifies the design and implementation of certain algorithms (such as those employing backtracking or value-space partitioning) because previously computed state can be reused later.
Immutability is a foundational principle in many functional programming languages - it allows code to be viewed as a series of transformations from one representation to another, rather than a sequence of mutations.
Immutable strings also help avoid the temptation of using strings as buffers. Many defects in C/C++ programs relate to buffer overrun problems resulting from using naked character arrays to compose or modify string values. Treating strings as immutable types encourages using types better suited for buffer manipulation (see StringBuilder in .NET or Java).
Consider the alternative. Java has no const qualifier. If String objects were mutable, then any method to which you pass a reference to a string could have the side-effect of modifying the string. Immutable strings eliminate the need for defensive copies, and reduce the risk of program error.
Immutable strings are cheap to copy, because you don't need to copy all the data - just copy a reference or pointer to the data.
Immutable classes of any kind are easier to work with in multiple threads: the only synchronization needed is for destruction.
Perhaps my answer is outdated, but someone may still find some new information here.
Why Java String is immutable and why it is good:
you can share a string between threads and be sure none of them will change the string and confuse another thread
you don't need a lock: several threads can work with an immutable string without conflicts
if you have just received a string, you can be sure no one will change its value afterwards
you can have many references to equal strings all pointing to a single instance, to just one copy, which saves memory (RAM)
you can take a substring without copying, by creating a pointer to an element of the existing string; this made substring very fast in older Java implementations (since Java 7u6, String.substring copies the characters instead)
immutable strings (objects) are much better suited for use as keys in hash tables
a) Imagine a string pool without immutable strings. It would not be possible at all, because in a string pool one string object/literal, e.g. "Test", is referenced by many reference variables, so if any one of them changed the value, the others would automatically be affected. Let's say
String A = "Test" and String B = "Test"
Now suppose B called toUpperCase() and that changed the shared object into "TEST": A would also be "TEST", which is not desirable.
b) Another reason why String is immutable in Java is to allow String to cache its hash code. Being immutable, a Java String caches its hash code and does not recalculate it every time hashCode() is called, which makes it very fast as a HashMap key.
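Point a) can be tried out directly. The following is a minimal sketch (the class name PoolDemo and its methods are invented for the example) showing that two identical literals share one pooled object, and that toUpperCase returns a new string rather than altering the shared one:

```java
import java.util.Locale;

/** Hypothetical sketch: pooled literals stay safe to share because String is immutable. */
final class PoolDemo {
    /** True if two identical literals refer to the same pooled object. */
    static boolean literalsShareOneInstance() {
        String a = "Test";
        String b = "Test";
        return a == b;   // reference comparison on purpose
    }

    /** True if upper-casing b yields a new object and leaves a and b untouched. */
    static boolean upperCaseLeavesOriginalIntact() {
        String a = "Test";
        String b = "Test";
        String upper = b.toUpperCase(Locale.ROOT);
        return upper.equals("TEST")
                && upper != b
                && a.equals("Test")
                && b.equals("Test");
    }
}
```

Both methods return true. The hash code caching from point b) is an internal detail that cannot be observed this way, so it is not part of the sketch.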
Think of various strings sitting in a common pool. String variables then point to locations in the pool. If you copy a string variable, both the original and the copy share the same characters. The efficiency of sharing outweighs the inefficiency of string editing by extracting substrings and concatenating.
Fundamentally, if one object or method wishes to pass information to another, there are a few ways it can do it:
1. It may give a reference to a mutable object which contains the information, and which the recipient promises never to modify.
2. It may give a reference to an object which contains the data, but whose content it doesn't care about.
3. It may store the information into a mutable object the intended data recipient knows about (generally one supplied by that data recipient).
4. It may return a reference to an immutable object containing the information.
Of these methods, #4 is by far the easiest. In many cases, mutable objects are easier to work with than immutable ones, but there's no easy way to share with "untrusted" code the information that's in a mutable object without having to first copy the information to something else. By contrast, information held in an immutable object to which one holds a reference may easily be shared by simply sharing a copy of that reference.

Resources