Why can't an object have an array position associated with it instead of using a hashmap?

I am trying to understand HashMap concepts. I understand that it can be useful to find a certain object by hashing it to find its location in memory.
However, why can't we have a property of an object that corresponds to its position in memory, so we could refer to that when searching for the object? As for insertion, we could have a counter store the number of objects, so that insertion could be O(1).
Why isn't this feasible?

I understand that it can be useful to find a certain object by hashing an object to find its location in memory.
Hashing does not do that! What it does is compute a number from the value of the object. Even in the case where Object::hashCode is not overridden, the hashcode is still not the object's address in memory.
One problem with addresses (and why they can't be used) is that an object's address changes when the GC moves it. So each time the GC ran you would need to rebuild all of the hash tables. Identity hashcodes don't have this problem / cost. An object's identity hashcode never changes.
A second problem is that if you used addresses (only) as the hashcodes, you wouldn't be able to handle the (more common) case of hashing based on the value of the key.
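To make the distinction concrete, here is a minimal Java sketch of a value-based hashCode versus the identity hashcode. The Point class is a made-up example, not anything from the question:

```java
import java.util.Objects;

// Hypothetical value class: hashCode is computed from the object's *value*,
// so two distinct objects with equal contents hash the same way.
final class Point {
    final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }

    @Override public boolean equals(Object o) {
        return o instanceof Point && ((Point) o).x == x && ((Point) o).y == y;
    }
    @Override public int hashCode() {
        return Objects.hash(x, y);   // derived from the value, not from an address
    }
}

public class HashDemo {
    public static void main(String[] args) {
        Point a = new Point(1, 2);
        Point b = new Point(1, 2);   // different object, same value

        System.out.println(a.hashCode() == b.hashCode());          // true: value-based
        System.out.println(System.identityHashCode(a)
                           == System.identityHashCode(b));         // false: per-object identity
        // The identity hashcode stays the same for an object's lifetime,
        // even if the GC relocates the object in memory.
    }
}
```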

Related

JS DOM elements’ content-string properties: memory management and computation in current engines

In the DOM, the Node and Element types have many dynamic properties, e.g., Node.textContent and Element.innerHTML, that return string representations of the nodes’ contents.
The DOM specifications state only that these properties "return" DOMStrings, i.e., strings; they do not, of course, specify when and how these strings are allocated and computed. For a text node, an implementation can clearly just return its plain-text content directly. But Elements must create new strings that contain the tags and contents of any descendant elements.
As far as I can tell, an implementation may allocate memory for these strings and compute their values at four points:
Immediately, when the Element is first created;
Lazily, scheduled sometime after the Element’s creation;
Lazily, on demand at the point when getting a string property from the Element;
Or lazily, on demand whenever the string is first used (i.e., the returned string uses a special, externally indistinguishable implementation that is immediately returned by the Element’s string properties but which dynamically generates its value when needed).
These methods would also determine when the strings’ memory is freed. The first two techniques essentially cache the string in the Element object—they create a reference to the string from the Element, and the string will not be deallocated until the Element is. The third and fourth techniques might but need not create such a reference from the Element; the string’s memory may be deallocated long before the Element is, as long as its other references are freed. In addition, an implementation might use two, three, or all of these possibilities based on some heuristic.
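The on-demand strategies are at least easy to sketch in isolation. Below is a purely hypothetical illustration of strategy 3 combined with caching, written in Java with invented names such as FakeElement; it says nothing about how any actual browser engine behaves.

```java
// Hypothetical sketch of strategy 3: compute the serialized string only when
// first requested, then cache it until the element's contents change.
// This is NOT engine code; it only illustrates eager vs. on-demand computation.
import java.util.ArrayList;
import java.util.List;

class FakeElement {
    private final String tag;
    private final List<FakeElement> children = new ArrayList<>();
    private String text = "";
    private String cachedOuterHtml;          // null until first requested

    FakeElement(String tag) { this.tag = tag; }

    void appendChild(FakeElement child) {
        children.add(child);
        cachedOuterHtml = null;              // invalidate the cache on mutation
    }

    void setText(String text) {
        this.text = text;
        cachedOuterHtml = null;
    }

    String outerHtml() {
        if (cachedOuterHtml == null) {       // computed lazily, on demand
            StringBuilder sb = new StringBuilder("<").append(tag).append(">").append(text);
            for (FakeElement c : children) sb.append(c.outerHtml());
            cachedOuterHtml = sb.append("</").append(tag).append(">").toString();
        }
        return cachedOuterHtml;
    }
}
```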
Which techniques do modern browser engines use? This question has clear performance ramifications, both in computation time and memory allocation, for any sort of DOM manipulation and text retrieval. But as far as I can tell these properties’ performance characteristics are unknown or not documented.

V8 JavaScript Object vs Binary Tree

Is there a faster way to search data in JavaScript (specifically on V8 via node.js, but without c/c++ modules) than using the JavaScript Object?
This may be outdated, but it suggests that a new class is dynamically generated for every single property, which made me wonder whether a binary tree implementation might be faster. However, this does not appear to be the case.
The binary tree implementation isn't well balanced so it might get better with balancing (only the first 26 values are roughly balanced by hand.)
Does anyone have an idea why, or how it might be improved? On another note: does the dynamic-class notion mean there are actually ~260,000 properties (in the jsperf benchmark test of the second link) and, consequently, chains of dynamic class definitions held in memory?
V8 uses the concept of 'maps', which describe the layout of the data in an object.
These maps can be "fast maps", which specify a fixed offset from the start of the object at which a particular property can be found, or "dictionary maps", which use a hashtable to provide a lookup mechanism.
Each object has a pointer to the map that describes it.
Generally, objects start off with a fast map. When a property is added to an object with a fast map, the map is transitioned to a new one which describes the location of the new property within the object. The object is re-allocated with enough space for the new data item if necessary, and the object's map pointer is set to the new map.
The old map keeps a record of the transitions from it, including a pointer to the new map and a description of the property whose addition caused the map transition.
If another object which has the old map gets the same property added (which is very common, since objects of the same type tend to get used in the same way), that object will just use the new map - V8 doesn't create a new map in this case.
However, once the number of properties goes over a certain threshold (in fact, the current metric has to do with the storage space used, not the actual number of properties), the object is changed to use a dictionary map. At this point the object is re-written using a hashtable. In general, it won't undergo any more map transitions - any further properties that are added will just go in the hashtable.
Fast maps allow V8 to generate optimized code (using Crankshaft) where the offset of a property within an object is hard-coded into the machine code. This makes it very fast for cases where it can do this - it avoids the need for doing any lookup.
Obviously, the generated machine code is then dependent on the map - if the object's data layout changes, the code has to be discarded and re-optimized when necessary. V8 has a type profiling mechanism which collects information about what the types of various objects are during execution of unoptimized code. It doesn't trigger optimization of the code until certain stability constraints are met - one of these is that the maps of objects used in the function aren't changing frequently.
Here's a more detailed description of how this stuff works.
Here's a video where one of the lead developers of V8 describes stuff like map transitions and lots more.
For your particular test case, I would think that it goes through a few hundred map transitions while properties are being added in the preparation loop, then it will eventually transition to a dictionary based object. It certainly won't go through 260,000 of them.
Regarding your question about binary trees: a properly sized hashtable (with a sensible hash function and a significant number of objects in it) will always outperform a binary tree for a use-case where you're just searching, as your test code seems to do (all of the insertion is done in the setup phase).
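As a rough, language-agnostic illustration of that last point (sketched here in Java rather than JavaScript, so it says nothing about V8 itself): a hash table lookup is O(1) on average, while a balanced binary tree lookup is O(log n).

```java
// HashMap vs. TreeMap (a red-black tree) for pure lookups.
// Illustrative only; this is not a V8 benchmark.
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class LookupComparison {
    public static void main(String[] args) {
        int n = 260_000;                        // roughly the scale from the question
        Map<String, Integer> hash = new HashMap<>();
        Map<String, Integer> tree = new TreeMap<>();
        for (int i = 0; i < n; i++) {
            hash.put("key" + i, i);
            tree.put("key" + i, i);
        }

        long t0 = System.nanoTime();
        long sum1 = 0;
        for (int i = 0; i < n; i++) sum1 += hash.get("key" + i);   // O(1) average per lookup
        long t1 = System.nanoTime();
        long sum2 = 0;
        for (int i = 0; i < n; i++) sum2 += tree.get("key" + i);   // O(log n) per lookup
        long t2 = System.nanoTime();

        System.out.printf("HashMap: %d ms, TreeMap: %d ms (sums %d/%d)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, sum1, sum2);
    }
}
```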

How are Strings stored in a VBA Dictionary structure?

As I am currently playing with a huge number of strings (have a look at another question: VBA memory size of Arrays and Arraylist), I used a Scripting.Dictionary just for the keyed access feature it provides.
Everything was looking fine except that it was somehow slow in loading the strings and that it used a lot of memory. For example, with 100,000 strings of 128 characters in length, the Task Manager showed approximately 295 MB at the end of the sub, and after setting Dictionary = Nothing a poor 12 MB remained in Excel. Even allowing for internal Unicode conversion of the strings, 128 * 2 * 100,000 gives 25.6 MB! Can someone explain this big difference?
Here is all the info I could find on the Scripting.Dictionary:
According to Eric Lippert, who wrote the Scripting.Dictionary, "the actual implementation of the generic dictionary is an extensible-hashing-with-chaining algorithm that re-hashes when the table gets too full." (It is clear from the context that he is referring to the Scripting.Dictionary.) Wikipedia's article on Hash Tables is a pretty good introduction to the concepts involved. (Here is a search of Eric's blog for the Scripting.Dictionary; he occasionally mentions it.)
Basically, you can think of a Hash Table as a large array in memory. Instead of storing your strings directly by an index, you must provide a key (usually a string). The key gets "hashed", that is, a consistent set of algorithmic steps is applied to the key to crunch it down into a number between 0 and the current maximum index in the Hash Table. That number is used as the index at which to store your string in the hash table. Since the same set of steps is applied each time the key is hashed, it results in the same index each time, meaning that if you are looking up a string by its key, there is no need to search through the array as you normally would.
The hash function (which is what converts a key to an index into the table) is designed to be as random as possible, but every once in a while two keys can crunch down to the same index - this is called a collision. This is handled by "chaining" the strings together in a linked list (or possibly a more searchable structure). So suppose you tried to look a string up in the Hash Table with a key. The key is hashed, and you get an index. Looking in the array at that index, it could be an empty slot if no string with that key was ever added, or it could be a linked list that contains one or more strings whose keys mapped to that index in the array.
The entire reason for going through the details above is to point out that a Hash Table must be larger than the number of things it will store to make it efficient (with some exceptions, see Perfect Hash Function). Much of the overhead you see in a Hash Table comes from the empty parts of the array that have to be there to make the hash table efficient.
Additionally, resizing the Hash Table is an expensive operation because all the existing strings have to be rehashed to new locations, so when the load factor of the Hash Table exceeds the predefined threshold and it gets resized, it might be doubled in size to avoid having to do so again soon.
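To make the hashing, chaining and resizing description above concrete, here is a deliberately minimal sketch of such a structure, written in Java for brevity. It is illustrative only and is not the actual Scripting.Dictionary implementation, whose internals are not public.

```java
// Minimal hash table with separate chaining, as described above.
import java.util.LinkedList;

class ChainedHashTable {
    private static class Entry {
        final String key;
        String value;
        Entry(String key, String value) { this.key = key; this.value = value; }
    }

    private LinkedList<Entry>[] buckets;
    private int size = 0;

    @SuppressWarnings("unchecked")
    ChainedHashTable(int capacity) { buckets = new LinkedList[capacity]; }

    // "Crunch" the key down to a bucket index between 0 and capacity - 1.
    private int indexFor(String key) {
        return (key.hashCode() & 0x7fffffff) % buckets.length;
    }

    void put(String key, String value) {
        if (size > buckets.length * 0.75) resize();       // keep the load factor low
        int i = indexFor(key);
        if (buckets[i] == null) buckets[i] = new LinkedList<>();
        for (Entry e : buckets[i]) {
            if (e.key.equals(key)) { e.value = value; return; }
        }
        buckets[i].add(new Entry(key, value));            // chain on collision
        size++;
    }

    String get(String key) {
        LinkedList<Entry> chain = buckets[indexFor(key)];
        if (chain == null) return null;                   // empty slot: key never added
        for (Entry e : chain) {                           // walk the chain on collision
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }

    @SuppressWarnings("unchecked")
    private void resize() {
        // Expensive: every existing key must be rehashed into the larger array.
        LinkedList<Entry>[] old = buckets;
        buckets = new LinkedList[old.length * 2];
        size = 0;
        for (LinkedList<Entry> chain : old) {
            if (chain == null) continue;
            for (Entry e : chain) put(e.key, e.value);
        }
    }
}
```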
The implementation of the structure that holds the chain of strings at each array position can also have a large impact on the overhead.
If I find anything else out, I'll add it here...

Weak Tables in lua - What are the practical uses?

I understand what weak tables are.
But I'd like to know where weak tables can be used practically?
The docs say
Weak tables are often used in situations where you wish to annotate
values without altering them.
I don't understand that. What does that mean?
Posted as an answer from comments...
Since Lua doesn't know what you consider garbage, it won't collect anything it isn't sure to be garbage. In some situations (one of which could be debugging) you want to attach a value to a variable without causing it to be considered "not trash" by Lua. From my understanding, weak tables allow you to do what you'd normally do with variables/objects/etc., but if they're only weakly referenced (i.e., held in a weak table), they can still be considered garbage by Lua and collected when garbage collection runs.
Example: suppose you wanted to use an associative array, with key/value pairs in two separate private tables. If you only needed the key table for one specific use, then once you were done with it, it would still be locked into existence in Lua. If you were to use a weak table, however, it could be collected as garbage as soon as you were done using it, freeing up the resources it was using.
To explain that one cryptic sentence about annotating, when you "alter" a variable, you lock it into existence and Lua no longer considers it to be garbage. To "annotate" a variable means to give it a name, number, or some other value. So, it means that you're allowed to give a variable a name/value without locking it into existence (so then Lua can garbage collect it).
Translation:
Weak tables are often used in situations where you wish to give a name to a value without locking the value into existence, which takes up memory.
Normally, storing a reference to an object will prevent that object from being reclaimed, even once nothing else uses it. Weak references do not prevent garbage collection.
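The question is about Lua, but the same "annotate without keeping alive" pattern exists in other languages. As a rough analogy only, here is a Java sketch using WeakHashMap, the Java counterpart of a table with weak keys:

```java
// Rough Java analogy to a Lua table with weak keys: the map associates extra
// data ("annotations") with objects without preventing their collection.
import java.util.Map;
import java.util.WeakHashMap;

public class WeakAnnotationDemo {
    public static void main(String[] args) throws InterruptedException {
        Map<Object, String> annotations = new WeakHashMap<>();

        Object subject = new Object();
        annotations.put(subject, "debug note about this object");
        System.out.println(annotations.size());    // 1

        // Drop the only strong reference; the weak map alone does not keep
        // the object alive, so the entry may disappear after a GC cycle.
        subject = null;
        System.gc();
        Thread.sleep(100);                          // give the GC a moment (best effort)
        System.out.println(annotations.size());    // likely 0 (not guaranteed)
    }
}
```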

What's the advantage of a String being Immutable?

I once studied the advantage of a string being immutable, something about improving performance through memory usage.
Can anybody explain this to me? I can't find it on the Internet.
Immutability (for strings or other types) can have numerous advantages:
It makes it easier to reason about the code, since you can make assumptions about variables and arguments that you can't otherwise make.
It simplifies multithreaded programming since reading from a type that cannot change is always safe to do concurrently.
It allows for a reduction of memory usage by allowing identical values to be combined together and referenced from multiple locations. Both Java and C# perform string interning to reduce the memory cost of literal strings embedded in code.
It simplifies the design and implementation of certain algorithms (such as those employing backtracking or value-space partitioning) because previously computed state can be reused later.
Immutability is a foundational principle in many functional programming languages - it allows code to be viewed as a series of transformations from one representation to another, rather than a sequence of mutations.
Immutable strings also help avoid the temptation of using strings as buffers. Many defects in C/C++ programs relate to buffer overrun problems resulting from using naked character arrays to compose or modify string values. Treating strings as an immutable type encourages using types better suited for buffer manipulation (see StringBuilder in .NET or Java).
Consider the alternative. Java has no const qualifier. If String objects were mutable, then any method to which you pass a reference to a string could have the side-effect of modifying the string. Immutable strings eliminate the need for defensive copies, and reduce the risk of program error.
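A short hypothetical Java illustration of the defensive-copy point (the class and field names are invented): a mutable char[] has to be copied before it can be stored safely, while an immutable String can be stored as-is.

```java
import java.util.Arrays;

// Hypothetical illustration: a mutable char[] forces a defensive copy,
// while an immutable String can be kept directly without risk.
class UserRecord {
    private final char[] nameChars;   // mutable type: must be copied defensively
    private final String name;        // immutable: safe to store the reference

    UserRecord(char[] nameChars, String name) {
        this.nameChars = Arrays.copyOf(nameChars, nameChars.length); // defensive copy
        this.name = name;                                            // no copy needed
    }

    String nameFromChars() { return new String(nameChars); }
    String nameFromString() { return name; }
}

public class DefensiveCopyDemo {
    public static void main(String[] args) {
        char[] raw = {'A', 'd', 'a'};
        UserRecord r = new UserRecord(raw, "Ada");

        raw[0] = 'X';                             // caller mutates its array afterwards
        System.out.println(r.nameFromChars());    // still "Ada", thanks to the copy
        System.out.println(r.nameFromString());   // "Ada" -- nothing could have changed it
    }
}
```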
Immutable strings are cheap to copy, because you don't need to copy all the data - just copy a reference or pointer to the data.
Immutable classes of any kind are easier to work with in multiple threads; the only synchronization needed is for destruction.
Perhaps my answer is outdated, but probably someone will find new information here.
Why Java String is immutable and why it is good:
you can share a string between threads and be sure none of them will change the string and confuse another thread
you don't need a lock; several threads can work with an immutable string without conflict
if you just received a string, you can be sure no one will change its value after that
you can have many string duplicates – they will all point to a single instance, just one copy. This saves computer memory (RAM)
you can take a substring without copying, by creating a pointer into an existing string's characters; this is why Java's substring implementation used to be so fast (since Java 7u6, substring copies the characters instead)
immutable strings (objects) are much better suited for use as keys in hash tables
a) Imagine the string pool facility without strings being immutable; it is not possible at all, because in the string pool one string object/literal, e.g. "Test", is referenced by many reference variables, so if any one of them changed the value the others would automatically be affected as well. Let's say
String A = "Test" and String B = "Test"
Now suppose B called "Test".toUpperCase(), changing the same object into "TEST"; then A would also be "TEST", which is not desirable.
b) Another reason why String is immutable in Java is to allow String to cache its hashcode. Being immutable, a String in Java caches its hash code and does not recalculate it every time we call String's hashCode method, which makes it very fast as a HashMap key.
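A small Java demonstration of points (a) and (b): the pooled literal is shared, toUpperCase() returns a new object rather than modifying it, and the hash code is stable and cheap to re-query.

```java
// String operations return new objects and never modify the shared, pooled literal.
public class ImmutableStringDemo {
    public static void main(String[] args) {
        String a = "Test";
        String b = "Test";

        System.out.println(a == b);            // true: both refer to the same pooled instance

        String upper = b.toUpperCase();        // returns a NEW string...
        System.out.println(upper);             // TEST
        System.out.println(a);                 // Test -- the shared literal is untouched

        // (b) Because the value can never change, the hash code is computed once
        // and reused on later calls (String stores it in a private field).
        System.out.println(a.hashCode() == "Test".hashCode());   // true, and cheap to repeat
    }
}
```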
Think of various strings sitting in a common pool. String variables then point to locations in the pool. If you copy a string variable, both the original and the copy share the same characters. This efficiency of sharing outweighs the inefficiency of string editing by extracting substrings and concatenating.
Fundamentally, if one object or method wishes to pass information to another, there are a few ways it can do it:
1. It may give a reference to a mutable object which contains the information, and which the recipient promises never to modify.
2. It may give a reference to an object which contains the data, but whose content it doesn't care about.
3. It may store the information into a mutable object the intended data recipient knows about (generally one supplied by that data recipient).
4. It may return a reference to an immutable object containing the information.
Of these methods, #4 is by far the easiest. In many cases, mutable objects are easier to work with than immutable ones, but there's no easy way to share with "untrusted" code the information that's in a mutable object without having to first copy the information to something else. By contrast, information held in an immutable object to which one holds a reference may easily be shared by simply sharing a copy of that reference.
