Let's assume that we have some objects (strings, for example). It is well known that working with indices (i.e. with the numbers 1, 2, 3, ...) is much more convenient than working with arbitrary objects.
Is there any common way of assigning an index to each object? One can create a hash_map and store the index in the value, but that will be memory-expensive when the number of objects is too high for them to fit in memory.
Thanks.
You can store the string objects in a sorted file.
This way, you are not storing the objects in memory.
Your mapping function can search for the required object in the sorted file.
You can create a hash map to optimize the search.
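A minimal Python sketch of that lookup, assuming fixed-width, space-padded records so file positions can be computed (the record width and function name are illustrative): the index assigned to an object is simply its rank in the sorted file, found by binary search over seeks rather than an in-memory table.

import os

RECORD_WIDTH = 128  # illustrative: every record is padded to this width

def index_of(path, target):
    # Binary search over a sorted file of fixed-width records.
    # Returns the record's rank, which doubles as the object's index.
    lo, hi = 0, os.path.getsize(path) // RECORD_WIDTH
    with open(path, 'rb') as f:
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid * RECORD_WIDTH)
            record = f.read(RECORD_WIDTH).rstrip(b' ').decode('utf-8')
            if record < target:
                lo = mid + 1
            elif record > target:
                hi = mid
            else:
                return mid
    raise KeyError(target)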
I want to map a timestamp t and an identifier id to a certain state of an object. I can do so by mapping a tuple (t,id) -> state_of_id_in_t. I can use this mapping to access one specific (t,id) combination.
However, sometimes I want to know all states (with matching timestamps t) of a specific id (i.e. id -> a set of (t, state_of_id_in_t)) and sometimes all states (with matching identifiers id) of a specific timestamp t (i.e. t -> a set of (id, state_of_id_in_t)). The problem is that I can't just put all of these in a single large matrix and do a linear search based on what I want. The number of (t, id) tuples for which I have states is very large (1M+) and very sparse (some timestamps have many states, others none, etc.). How can I make such a dict, one that can deal with accessing its contents by partial keys?
I created two distinct dicts, dict_by_time and dict_by_id, which are dicts of dicts. dict_by_time maps a timestamp t to a dict of ids, which each point to a state. Similarly, dict_by_id maps an id to a dict of timestamps, which each point to a state. This way I can access a state or a set of states however I like. Notice that the 'leaves' of both dicts (dict_by_time and dict_by_id) point to the same objects, so it's just the way I access the states that's different; the states themselves are the same Python objects.
dict_by_time = {'t_1': {'id_1': 'some_state_object_1',
                        'id_2': 'some_state_object_2'},
                't_2': {'id_1': 'some_state_object_3',
                        'id_2': 'some_state_object_4'}}
dict_by_id = {'id_1': {'t_1': 'some_state_object_1',
                       't_2': 'some_state_object_3'},
              'id_2': {'t_1': 'some_state_object_2',
                       't_2': 'some_state_object_4'}}
Again, notice the leaves are shared across both dicts.
I don't think it is good to do it using two dicts, simply because maintaining both of them when adding new timestamps or identifiers results in double work and could easily lead to inconsistencies when I do something wrong. Is there a better way to solve this? Complexity is very important, which is why I can't just do manual searching and need some sort of HashMap magic.
You can always trade insertion complexity for lookup complexity. Instead of using a single dict, you can create a class with an add method and a lookup method. Internally, you can keep track of the data using three different dictionaries: one uses the (t, id) tuple as its key, one uses t as the key, and one uses id as the key. Depending on the arguments given to lookup, you can return the result from one of the dictionaries.
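A minimal Python sketch of that class (the name StateStore and its method signatures are my own, not from the answer):

class StateStore:
    # Three synchronized indexes: every lookup pattern is O(1) on average.
    def __init__(self):
        self.by_key = {}    # (t, id) -> state
        self.by_time = {}   # t -> {id: state}
        self.by_id = {}     # id -> {t: state}

    def add(self, t, id_, state):
        # A single add updates all three dicts, so they cannot drift apart.
        self.by_key[(t, id_)] = state
        self.by_time.setdefault(t, {})[id_] = state
        self.by_id.setdefault(id_, {})[t] = state

    def lookup(self, t=None, id_=None):
        if t is not None and id_ is not None:
            return self.by_key[(t, id_)]
        if t is not None:
            return self.by_time.get(t, {})
        return self.by_id.get(id_, {})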
As I am currently playing with a huge number of strings (have a look at another question: VBA memory size of Arrays and Arraylist), I used a Scripting.Dictionary just for the keyed access it offers.
Everything was looking fine except that it was somehow slow in loading the strings and that it uses a lot of memory. For an example of 100,000 strings of 128 characters in length, the Task Manager showed approximately 295 MB at the end of the sub, and after setting the Dictionary to Nothing, a poor 12 MB remained in Excel. Even considering the internal Unicode conversion of strings, 128 * 2 * 100,000 gives only 25.6 MB! Can someone explain this big difference?
Here is all the info I could find on the Scripting.Dictionary:
According to Eric Lippert, who wrote the Scripting.Dictionary, "the actual implementation of the generic dictionary is an extensible-hashing-with-chaining algorithm that re-hashes when the table gets too full." (It is clear from the context that he is referring to the Scripting.Dictionary.) Wikipedia's article on Hash Tables is a pretty good introduction to the concepts involved. (Here is a search of Eric's blog for the Scripting.Dictionary; he occasionally mentions it.)
Basically, you can think of a Hash Table as a large array in memory. Instead of storing your strings directly by an index, you must provide a key (usually a string). The key gets "hashed", that is, a consistent set of algorithmic steps is applied to the key to crunch it down into a number between 0 and the current maximum index of the Hash Table. That number is used as the index at which to store your string in the hash table. Since the same set of steps is applied each time the key is hashed, it results in the same index each time, meaning that if you are looking up a string by its key, there is no need to search through the array as you normally would.
The hash function (which is what converts a key to an index into the table) is designed to be as random as possible, but every once in a while two keys can crunch down to the same index - this is called a collision. This is handled by "chaining" the strings together in a linked list (or possibly a more searchable structure). So suppose you tried to look a string up in the Hash Table with a key. The key is hashed, and you get an index. Looking in the array at that index, it could be an empty slot if no string with that key was ever added, or it could be a linked list that contains one or more strings whose keys mapped to that index in the array.
The entire reason for going through the details above is to point out that a Hash Table must be larger than the number of things it will store to make it efficient (with some exceptions, see Perfect Hash Function). So much of the overhead you would see in a Hash Table is the empty parts of the array that have to be there to make the hash table efficient.
Additionally, resizing the Hash Table is an expensive operation because all the existing strings have to be rehashed to new locations, so when the load factor of the Hash Table exceeds the predefined threshold and it gets resized, it might be doubled in size to avoid having to do so again soon.
The implementation of the structure that holds the chain of strings at each array position can also have a large impact on the overhead.
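To make the chaining and resizing mechanics concrete, here is a small Python sketch of a chained hash table (the class and its constants are illustrative, not the actual Scripting.Dictionary internals):

class Node:
    def __init__(self, key, value, nxt=None):
        self.key, self.value, self.next = key, value, nxt

class ChainedHashTable:
    def __init__(self, capacity=8):
        self.buckets = [None] * capacity
        self.count = 0

    def _index(self, key):
        # Crunch the key down to a slot between 0 and capacity - 1.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        i = self._index(key)
        node = self.buckets[i]
        while node:                  # walk the chain for an existing key
            if node.key == key:
                node.value = value
                return
            node = node.next
        # Prepend at the head of the chain: O(1), no traversal needed.
        self.buckets[i] = Node(key, value, self.buckets[i])
        self.count += 1
        if self.count / len(self.buckets) > 0.75:  # load factor threshold
            self._resize()

    def get(self, key):
        node = self.buckets[self._index(key)]
        while node:
            if node.key == key:
                return node.value
            node = node.next
        raise KeyError(key)

    def _resize(self):
        # Doubling is expensive: every existing key must be rehashed.
        old = self.buckets
        self.buckets = [None] * (len(old) * 2)
        self.count = 0
        for head in old:
            while head:
                self.put(head.key, head.value)
                head = head.next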
If I find anything else out, I'll add it here...
I understand the workings of HashMap - collisions and everything - but I'm trying to understand the deeper mechanics behind the choice of a linked list for the entry bucket, and not, say, an array (making it a two-dimensional matrix), since searching through both is an O(n) operation. I assumed one factor that might make the case for a linked list is its O(1) insertion. Is that a correct assumption?
If it is Java, then the linked list is the bucket: each entry in the table is a linked list of entry elements, in which each node knows the next node in the list, until you reach the end, where the next reference is null.
Also check:
How does Java's Hashmap work internally?
An array has a predetermined size. So if you used an array, the hash table would have a predetermined size for each of its elements; every possible bucket would be allocated up front, and your hash table would become big. This makes sense if you have a very large amount of memory, but if not, use a linked list and walk through the list to find the match.
It's because you do not add at the end of the linked list but at the head, which makes insertion O(1).
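A minimal Python sketch of why head insertion is O(1) (the names are illustrative):

class Node:
    def __init__(self, key, value, nxt=None):
        self.key, self.value, self.next = key, value, nxt

def bucket_prepend(head, key, value):
    # The new node points at the old head; there is no traversal to the
    # tail, so the cost is constant regardless of how long the chain is.
    return Node(key, value, head)

bucket = None
bucket = bucket_prepend(bucket, 'k1', 'v1')
bucket = bucket_prepend(bucket, 'k2', 'v2')  # still O(1)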
[Answering the below question - no, I'm not looking for a solution :) - I just wanted to understand why the choice was made for buckets to be an internal implementation of a linked list and not an array-based list (I didn't mean a fixed-size array). Linked lists offer a node-based implementation that certainly helps when there are a large number of inserts into the data structure - but when you are adding to the end of the list, shouldn't any growable array-based implementation for the bucket suffice?]
I want to make an array containing three wide character arrays such that one of them is the key.
"LPWCH,LPWCH,LPWCH" was not able to use the greater than/lesser than symbols since it thinks it is a tag
hash_map only lets me use a pair: wKey and the element associated with it. Is there another data structure that lets me do this?
This set will be updated by different threads almost simultaneously, and that's the reason why I don't want to use a class or another struct to define the remaining two wide character arrays.
You can use LPWCH as a key and std::pair<LPWCH, LPWCH> as an element.
Using any of the LP- typedefs is not good: you would only be comparing the pointers, not the strings.
LPWCH is nothing but a WCHAR*, which can be drilled down to void*. When you compare two pointers, you are comparing where they point, not what they point to.
You either need another comparer attached to your map/hash_map, or you need to use an actual string datatype (like std::string or CString).
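The pointer-versus-contents distinction is the same one Python draws between identity and equality; a minimal illustration (CPython behavior, for analogy only):

a = "hello world"
b = "HELLO WORLD".lower()  # equal contents, but a distinct object

print(a == b)  # True: compares the characters (what is pointed to)
print(a is b)  # False: compares object identity (where they point)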
I'm looking for a data structure to store strings in. I require a function in the interface that takes a string as its only parameter and returns a reference/iterator/pointer/handle that can be used to retrieve the string for the rest of the lifetime of the data structure. Set membership, entry deletion, etc. are not required.
I'm more concerned with memory usage than speed.
One highly efficient data structure for storing strings is the Trie. This saves both memory and time by storing strings with common prefixes using the same memory.
You could return, as the pointer, the node marking the end of the string in the Trie; it uniquely identifies the string and can be used to recreate the string by traversing the Trie upwards.
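A minimal Python sketch of this scheme (the node layout and method names are my own): each node keeps a parent link, so the terminal node works as the returned handle and the string can be rebuilt by walking upwards.

class TrieNode:
    def __init__(self, char='', parent=None):
        self.char = char        # edge label leading into this node
        self.parent = parent    # parent link enables upward traversal
        self.children = {}

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, s):
        # Store s and return its terminal node as a handle.
        node = self.root
        for ch in s:
            if ch not in node.children:
                node.children[ch] = TrieNode(ch, node)
            node = node.children[ch]
        return node

    @staticmethod
    def recover(handle):
        # Rebuild the original string by walking parent links upwards.
        chars = []
        node = handle
        while node.parent is not None:
            chars.append(node.char)
            node = node.parent
        return ''.join(reversed(chars))

t = Trie()
h = t.insert("hello")
assert Trie.recover(h) == "hello"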
I think the keyword here is string interning, where you store only one copy of each distinct string. In Java, this is accomplished by String.intern():
String ref1 = "hello world".intern();
String ref2 = "HELLO WORLD".toLowerCase().intern();
assert ref1 == ref2;
I think the best bet here would be an ArrayList. The common implementations have some overhead from allocating extra space in the array for new elements, but if memory is such a requirement, you can allocate manually for each new element. It will be slower, but will only use the necessary memory for the string.
There are three ways to store strings:
Fixed length (array-type structure)
Variable length, but the maximum size is fixed at run time (pointer-type structure)
Linked-list structure