I am using the procedure hash-set to set a value in a hash (hash?). It seems to require the hash to be immutable (immutable?). So far I could not find a better way to transform a mutable hash into an immutable hash than the following:
(make-immutable-hash (hash->list myhash))
The hash is some yaml, which I am reading from a file and the yaml module gives me a mutable hash. For example I have the following code:
(hash-set yaml-hash
"content"
(make-immutable-hash
(hash->list
(my-hash-map content-hash
(lambda (key value)
(cons key
(markdown-to-html value)))))))
Is there a better way to transform a mutable hash into an immutable one, for the purpose of updating it? Or should I go a different way?
If the hash is mutable to begin with, you can modify it directly using hash-set!:
(hash-set! yaml-hash <key> <new-value>)
The above will change the value of the hash in-place, whereas hash-set will return a new hash, which you'll have to store or reassign somewhere else.
I could not find a better way to transform a mutable hash into an immutable hash than the following:
(make-immutable-hash (hash->list myhash))
You can use a for/hash comprehension and in-hash sequence to iterate through the entries of the mutable hash one at a time instead of first building a list of them all:
;;; Contract: (-> hash? (and/c hash? immutable?))
(define (hash->immutable-hash table)
(if (immutable? table)
table ;; If hash is already immutable, just return it
(for/hash ([(k v) (in-mutable-hash table)]) (values k v))))
I am trying to understand the Python hash function under the hood. I created a custom class where all instances return the same hash value.
class C:
    def __hash__(self):
        return 42
I just assumed that only one instance of the above class can be in a dict at any time, but in fact a dict can have multiple elements with the same hash.
c, d = C(), C()
x = {c: 'c', d: 'd'}
print(x)
# {<__main__.C object at 0x7f0824087b80>: 'c', <__main__.C object at 0x7f0823ae2d60>: 'd'}
# note that the dict has 2 elements
I experimented a little more and found that if I override the __eq__ method such that all the instances of the class compare equal, then the dict only allows one instance.
class D:
    def __hash__(self):
        return 42
    def __eq__(self, other):
        return True
p, q = D(), D()
y = {p: 'p', q: 'q'}
print(y)
# {<__main__.D object at 0x7f0823a9af40>: 'q'}
# note that the dict only has 1 element
So I am curious to know how a dict can have multiple elements with the same hash.
Here is everything about Python dicts that I was able to put together (probably more than anyone would like to know; but the answer is comprehensive). A shout out to Duncan for pointing out that Python dicts use slots and leading me down this rabbit hole.
Python dictionaries are implemented as hash tables.
Hash tables must allow for hash collisions i.e. even if two keys have same hash value, the implementation of the table must have a strategy to insert and retrieve the key and value pairs unambiguously.
Python dict uses open addressing to resolve hash collisions (explained below) (see dictobject.c:296-297).
Python hash table is just a contiguous block of memory (sort of like an array, so you can do O(1) lookup by index).
Each slot in the table can store one and only one entry. This is important.
Each entry in the table is actually a combination of three values: <hash, key, value>. This is implemented as a C struct (see dictobject.h:51-56)
The figure below is a logical representation of a python hash table. In the figure below, 0, 1, ..., i, ... on the left are indices of the slots in the hash table (they are just for illustrative purposes and are not stored along with the table obviously!).
# Logical model of Python Hash table
-+-----------------+
0| <hash|key|value>|
-+-----------------+
1| ... |
-+-----------------+
.| ... |
-+-----------------+
i| ... |
-+-----------------+
.| ... |
-+-----------------+
n| ... |
-+-----------------+
When a new dict is initialized it starts with 8 slots. (see dictobject.h:49)
When adding entries to the table, we start with some slot, i, that is based on the hash of the key. (CPython uses initial i = hash(key) & mask, where mask = PyDict_MINSIZE - 1, but that's not really important.) Just note that the initial slot, i, that is checked depends on the hash of the key.
If that slot is empty, the entry is added to the slot (by entry, I mean, <hash|key|value>). But what if that slot is occupied!? Most likely because another entry has the same hash (hash collision!)
If the slot is occupied, CPython (and even PyPy) compares the hash AND the key (by compare I mean == comparison, not the is comparison) of the entry in the slot against the hash and key of the current entry to be inserted (dictobject.c:337,344-345). If both match, then it concludes that the entry already exists, so it updates the value stored there instead of adding a second entry, and moves on to the next entry to be inserted. If either the hash or the key doesn't match, it starts probing.
Probing just means it searches slot by slot to find an empty slot. Technically we could just go one by one, i+1, i+2, ... and use the first available one (that's linear probing). But for reasons explained beautifully in the comments (see dictobject.c:33-126), CPython uses random probing. In random probing, the next slot is picked in a pseudo-random order. The entry is added to the first empty slot. For this discussion, the actual algorithm used to pick the next slot is not really important (see dictobject.c:33-126 for the algorithm for probing). What is important is that the slots are probed until the first empty slot is found.
The same thing happens for lookups: it starts with the initial slot i (where i depends on the hash of the key). If the hash and the key don't both match the entry in the slot, it starts probing until it finds a slot with a match. If it reaches an empty slot first, it reports a failure.
BTW, the dict will be resized if it is two-thirds full. This avoids slowing down lookups. (see dictobject.h:64-65)
There you go! The Python implementation of dict checks for both hash equality of two keys and the normal equality (==) of the keys when inserting items. So in summary, if there are two keys, a and b and hash(a)==hash(b), but a!=b, then both can exist harmoniously in a Python dict. But if hash(a)==hash(b) and a==b, then they cannot both be in the same dict.
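The insert and lookup rules described above can be sketched as a toy table. This is only an illustration under stated assumptions: ToyDict, SameHash and SameHashEqual are hypothetical names, the table never resizes, and plain linear probing stands in for CPython's actual probing scheme.

```python
class ToyDict:
    """Toy open-addressing table; each slot holds one (hash, key, value) entry."""

    def __init__(self, size=8):
        self.slots = [None] * size  # size must be a power of two

    def _probe(self, key):
        h = hash(key)
        mask = len(self.slots) - 1
        i = h & mask  # initial slot depends only on the hash
        for _ in range(len(self.slots)):
            entry = self.slots[i]
            # Stop at an empty slot, or at an entry whose hash AND key match.
            if entry is None or (entry[0] == h and entry[1] == key):
                return i, h
            i = (i + 1) & mask  # simplification: CPython does not probe linearly
        raise RuntimeError("table full (this sketch never resizes)")

    def __setitem__(self, key, value):
        i, h = self._probe(key)
        if self.slots[i] is None:
            self.slots[i] = (h, key, value)               # new entry
        else:
            self.slots[i] = (h, self.slots[i][1], value)  # same key: replace value

    def __getitem__(self, key):
        i, _ = self._probe(key)
        if self.slots[i] is None:
            raise KeyError(key)
        return self.slots[i][2]

    def __len__(self):
        return sum(entry is not None for entry in self.slots)


class SameHash:
    def __hash__(self):
        return 42


class SameHashEqual:
    def __hash__(self):
        return 42

    def __eq__(self, other):
        return True


t = ToyDict()
t[SameHash()] = "c"
t[SameHash()] = "d"
print(len(t))  # 2: equal hashes, unequal keys, so two slots are used

u = ToyDict()
u[SameHashEqual()] = "p"
u[SameHashEqual()] = "q"
print(len(u))  # 1: equal hashes and equal keys, so the value is replaced
```

The two printed lengths mirror the behaviour of classes C and D from the question.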
Because we have to probe after every hash collision, one side effect of too many hash collisions is that the lookups and insertions will become very slow (as Duncan points out in the comments).
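One way to see this cost without timing anything is to count equality checks. The sketch below is a rough illustration, not a benchmark: CountingKey and eq_calls_for are hypothetical names, and the counts rely on CPython comparing stored hashes before it calls __eq__.

```python
class CountingKey:
    eq_calls = 0  # class-wide counter of __eq__ invocations

    def __init__(self, ident, hash_value):
        self.ident = ident
        self.hash_value = hash_value

    def __hash__(self):
        return self.hash_value

    def __eq__(self, other):
        CountingKey.eq_calls += 1
        return isinstance(other, CountingKey) and self.ident == other.ident


def eq_calls_for(keys):
    """Insert the keys into a dict and report how many __eq__ calls that took."""
    CountingKey.eq_calls = 0
    table = {}
    for key in keys:
        table[key] = None
    return CountingKey.eq_calls


n = 200
colliding = eq_calls_for([CountingKey(i, 42) for i in range(n)])  # all hashes equal
spread = eq_calls_for([CountingKey(i, i) for i in range(n)])      # all hashes differ
print("colliding:", colliding, "spread:", spread)
```

With distinct hashes no key comparison is needed at all, while the colliding keys force a comparison against every earlier key met along the probe sequence.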
I guess the short answer to my question is, "Because that's how it's implemented in the source code ;)"
While this is good to know (for geek points?), I am not sure how it can be used in real life. Because unless you are trying to explicitly break something, why would two objects that are not equal, have same hash?
For a detailed description of how Python's hashing works see my answer to Why is early return slower than else?
Basically it uses the hash to pick a slot in the table. If there is a value in the slot and the hash matches, it compares the items to see if they are equal.
If the hash matches but the items aren't equal, then it tries another slot. There's a formula to pick this (which I describe in the referenced answer), and it gradually pulls in unused parts of the hash value; but once it has used them all up, it will eventually work its way through all slots in the hash table. That guarantees eventually we either find a matching item or an empty slot. When the search finds an empty slot, it inserts the value or gives up (depending whether we are adding or getting a value).
The important thing to note is that there are no lists or buckets: there is just a hash table with a particular number of slots, and each hash is used to generate a sequence of candidate slots.
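The "sequence of candidate slots" can be sketched as a generator. This follows the recurrence documented in the comments of CPython's dictobject.c, but the exact order of the shift and the update has differed between versions, so treat it as an approximation. probe_sequence is a hypothetical helper name, and the hash is assumed to be non-negative.

```python
from itertools import islice

PERTURB_SHIFT = 5  # constant used in CPython's dictobject.c


def probe_sequence(hash_value, table_size):
    """Yield candidate slot indices; table_size must be a power of two."""
    mask = table_size - 1
    perturb = hash_value
    i = hash_value & mask
    while True:
        yield i
        perturb >>= PERTURB_SHIFT          # pull in higher bits of the hash
        i = (i * 5 + perturb + 1) & mask   # once perturb is 0, this cycles through all slots


first = list(islice(probe_sequence(42, 8), 9))
print(first)  # [2, 4, 5, 2, 3, 0, 1, 6, 7]
```

Once perturb has been shifted down to zero, the i*5 + 1 step visits every slot of a power-of-two table, which is why the search always ends at either a match or an empty slot.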
Edit: the answer below is one of the possible ways to deal with hash collisions; it is, however, not how Python does it. Python's wiki referenced below is also incorrect. The best source, given by @Duncan below, is the implementation itself: https://github.com/python/cpython/blob/master/Objects/dictobject.c I apologize for the mix-up.
It stores a list (or bucket) of elements at the hash then iterates through that list until it finds the actual key in that list. A picture says more than a thousand words:
Here you see John Smith and Sandra Dee both hash to 152. Bucket 152 contains both of them. When looking up Sandra Dee it first finds the list in bucket 152, then loops through that list until Sandra Dee is found and returns 521-6955.
The following is wrong; it's only here for context: On Python's wiki you can find (pseudo?) code for how Python performs the lookup.
There's actually several possible solutions to this problem, check out the wikipedia article for a nice overview: http://en.wikipedia.org/wiki/Hash_table#Collision_resolution
Hash tables, in general have to allow for hash collisions! You will get unlucky and two things will eventually hash to the same thing. Underneath, there is a set of objects in a list of items that has that same hash key. Usually, there is only one thing in that list, but in this case, it'll keep stacking them into the same one. The only way it knows they are different is through the equals operator.
When this happens, your performance will degrade over time, which is why you want your hash function to be as "random as possible".
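For reference, the bucket-list strategy described here (separate chaining) can be sketched as follows. As the edit note earlier in the thread says, this is a general technique and not what CPython's dict does. ChainedTable is a hypothetical name, and the phone number given for John Smith is a placeholder, not taken from the thread.

```python
class ChainedTable:
    """Separate chaining: each bucket is a list of (key, value) pairs."""

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for idx, (k, _) in enumerate(bucket):
            if k == key:                 # equality decides whether it is the same key
                bucket[idx] = (k, value)
                return
        bucket.append((key, value))      # collision: just grow the bucket's list

    def get(self, key):
        for k, v in self._bucket(key):   # linear scan within one bucket
            if k == key:
                return v
        raise KeyError(key)


phones = ChainedTable(n_buckets=1)       # a single bucket forces every key to collide
phones.put("John Smith", "000-0000")     # placeholder number
phones.put("Sandra Dee", "521-6955")
print(phones.get("Sandra Dee"))          # 521-6955
```

The longer a bucket's list grows, the more of each lookup is a linear scan, which is the performance degradation mentioned above.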
In the thread I did not see what exactly Python does with instances of user-defined classes when we put them into a dictionary as keys. Let's read some documentation: it declares that only hashable objects can be used as keys. All immutable built-in classes and all user-defined classes are hashable.
User-defined classes have __cmp__() and
__hash__() methods by default; with them, all objects
compare unequal (except with themselves) and
x.__hash__() returns a result derived from id(x).
So if you have a constant __hash__ in your class, but do not provide any __cmp__ or __eq__ method, then all your instances are unequal as far as the dictionary is concerned.
On the other hand, if you provide a __cmp__ or __eq__ method, but do not provide __hash__, your instances are still unequal as far as the dictionary is concerned.
class A(object):
    def __hash__(self):
        return 42
class B(object):
    def __eq__(self, other):
        return True
class C(A, B):
    pass
dict_a = {A(): 1, A(): 2, A(): 3}
dict_b = {B(): 1, B(): 2, B(): 3}
dict_c = {C(): 1, C(): 2, C(): 3}
print(dict_a)
print(dict_b)
print(dict_c)
Output
{<__main__.A object at 0x7f9672f04850>: 1, <__main__.A object at 0x7f9672f04910>: 3, <__main__.A object at 0x7f9672f048d0>: 2}
{<__main__.B object at 0x7f9672f04990>: 2, <__main__.B object at 0x7f9672f04950>: 1, <__main__.B object at 0x7f9672f049d0>: 3}
{<__main__.C object at 0x7f9672f04a10>: 3}
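The quoted documentation and the output above reflect Python 2 (note the reference to __cmp__). Under Python 3 the middle case behaves differently: a class that defines __eq__ without defining __hash__ has its __hash__ set to None, so its instances cannot be dict keys at all. A small check, reusing the same class names:

```python
class A:
    def __hash__(self):
        return 42


class B:
    def __eq__(self, other):
        return True


class C(A, B):  # __hash__ comes from A, __eq__ from B
    pass


dict_a = {A(): 1, A(): 2, A(): 3}
print(len(dict_a))  # 3: same hash, but instances compare unequal

try:
    dict_b = {B(): 1}
except TypeError as exc:
    print("B cannot be a key:", exc)

dict_c = {C(): 1, C(): 2, C(): 3}
print(len(dict_c))  # 1: same hash and instances compare equal
```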
I've narrowed a problem down to one particular function call to one of my library routines that looks like (pop-hstack current-hstack), which pops an element from a stack structure. It is causing data corruption (an inconsistency, see below) in the stack structure, but only when multiple threads are running. I've tried wrapping the call in a lock like so (bt:with-lock-held (*lock*) (pop-hstack current-hstack)), but current-hstack is still becoming inconsistent somewhere during execution when there are two or more threads active. The arguments to pop-hstack (e.g., current-hstack) in each thread are dynamically bound special variables, and so are not shared between threads. It's unclear whether the inconsistency is being introduced by multi-threading (there is no inconsistency when running single-threaded), or perhaps by a contingent programming bug in the structure definition or pop-hstack function.
(defstruct hstack
"An hstack (hash stack) is an expanded stack representation containing an
adjustable one-dimensional array of elements, plus a hash table for quickly
determining if an element is in the stack. Keyfn is applied to elements to
access the hash table. New elements are pushed at the fill-pointer, and
popped at the fill-pointer minus 1."
(vector (make-array 0 :adjustable t :fill-pointer t) :type (array * (*)))
(table (make-hash-table) :type hash-table) ;can take a custom hash table
(keyfn #'identity :type function)) ;fn to get hash table key for an element
(defun pop-hstack (hstk)
"Pops an element from hstack's vector. Also removes the element's index from
the element's hash table entry--and the entry itself if it's the last index."
(let* ((vec (hstack-vector hstk))
(fptr-1 (1- (fill-pointer vec)))
(tbl (hstack-table hstk))
(key (funcall (hstack-keyfn hstk) (aref vec fptr-1))))
(when (null (setf (gethash key tbl) (delete fptr-1 (gethash key tbl))))
(remhash key tbl))
(vector-pop vec)))
Normally, hstack's stack vector and hash table are in sync, containing the same number of entries: (length (hstack-vector x)) = (hash-table-count (hstack-table x)). Only when there are duplicate elements in hstack, will the number of entries differ. (Because then a single hash table entry will contain multiple vector indices for duplicate elements appearing in the vector.) However, the inconsistency between the number of entries in the vector and the hash table still shows up when there are no duplicate elements. Typically, there are one or two extra elements in the hash table, indicating that these extra elements were not properly removed during a pop-hstack operation. The stack vector always seems to have the correct elements.
EDIT(5/2/19): Corrected a coding error in pop-hstack: Replace (delete fptr-1 (gethash key tbl)) with (setf (gethash key tbl) (delete fptr-1 (gethash key tbl))).
The form (delete fptr-1 (gethash key tbl)) might be the cause: it destructively modifies the list structure, so concurrent access might see a corrupt list.
What's the definition of the push operation?
Does corruption also occur if all push and all pop operations are wrapped in with-lock-held (using the same lock)?
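To make the last suggestion concrete, here is a rough Python analogue of the hstack in which every push and every pop takes the same lock. It is not a translation of the Lisp code: HStack, its method names, and the worker function are hypothetical, and the push logic is a guess, since the original push operation is not shown.

```python
import threading


class HStack:
    """Stack plus a table mapping each element's key to its indices in the stack."""

    def __init__(self, keyfn=lambda x: x):
        self.vector = []
        self.table = {}               # key -> list of indices into self.vector
        self.keyfn = keyfn
        self.lock = threading.Lock()  # one lock shared by push and pop

    def push(self, elem):
        with self.lock:
            self.table.setdefault(self.keyfn(elem), []).append(len(self.vector))
            self.vector.append(elem)

    def pop(self):
        with self.lock:
            idx = len(self.vector) - 1
            key = self.keyfn(self.vector[idx])
            indices = self.table[key]
            indices.remove(idx)
            if not indices:           # last index for this key: drop the entry
                del self.table[key]
            return self.vector.pop()


stack = HStack()


def worker(tid):
    for i in range(1000):
        stack.push((tid, i))          # unique elements, so no duplicate keys
    for _ in range(500):
        stack.pop()


threads = [threading.Thread(target=worker, args=(t,)) for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(stack.vector), len(stack.table))  # both counts should agree
```

If the counts only disagree once the lock is removed from one of the two operations, that points at unsynchronized access instead of a logic bug in pop.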
I need to retrieve the key whose value contains the string "TRY"
:CAB "NAB/TRY/FIGHT.jar"
so in this case the output should be :CAB .
I am new to Clojure. I tried a few things like .contains etc., but I could not form the exact function for the above problem. It's easier in a few other languages like Python, but I don't know how to do it in Clojure.
Is there a way to retrieve the name of the key ?
for can also filter with :when. E.g.
(for [[k v] {:FOO "TRY" :BAR "BAZ"}
:when (.contains v "TRY")]
k)
Using .contains is not recommended: first, you are using the internals of the underlying language (Java or JavaScript) without need, and second, it forces Clojure to use reflection, as it cannot be sure that the argument is a string.
It's better to use clojure.string/includes? instead.
Several working solutions have been already proposed here for extracting a key depending on the value, here is one more, that uses the keep function:
(require '[clojure.string :as cs])
(keep (fn [[k v]] (when (cs/includes? v "TRY") k))
{:CAB "NAB/TRY/FIGHT.jar" :BLAH "NOWAY.jar"}) ; => (:CAB)
The easiest way is to use the contains method from java.lang.String. I'd use that to map valid keys, and then filter to remove all nil values:
(filter some?
(map (fn [[k v]] (when (.contains v "TRY") k))
{:CAB "NAB/TRY/FIGHT.jar" :BLAH "NOWAY.jar"}))
=> (:CAB)
If you think there is at most one such matching k/v pair in the map, then you can just call first on that to get the relevant key.
You can also use a regular expression instead of .contains, e.g.
(fn [[k v]] (when (re-find #"TRY" v) k))
You can use some on your collection. some applies a given function to each entry of your map in turn until the function returns a non-nil value.
We're going to use the function
(fn [[key value]] (when (.contains value "TRY") key))
when returns nil unless the condition is matched, so it will work perfectly for our use case. We're using destructuring in the arguments of the function to get the key and value. When used by some, your map will be converted to a sequence of key/value pairs, which will look like
'((:BAR "NAB/TRY/FIGHT.jar"))
If your map is named coll, the following code will do the trick
(some
(fn [[key value]] (when (.contains value "TRY") key))
coll)
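Since the question mentions that this is easier in Python, here is a sketch of the equivalent filter there, for comparison with the Clojure versions above. keys_containing is a hypothetical helper name, and plain strings stand in for Clojure keywords.

```python
def keys_containing(mapping, needle):
    """Return the keys whose string value contains needle."""
    return [key for key, value in mapping.items() if needle in value]


jars = {"CAB": "NAB/TRY/FIGHT.jar", "BLAH": "NOWAY.jar"}

print(keys_containing(jars, "TRY"))                    # ['CAB']
print(next(iter(keys_containing(jars, "TRY")), None))  # CAB (first match only, like some)
```

The list comprehension plays the role of for with :when, and taking the first element corresponds to using some.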
In the example of defining a custom hash function on page 114 of Nim in Action, the !$ operator is used to "finalize the computed hash".
import tables, hashes
type
Dog = object
name: string
proc hash(x: Dog): Hash =
result = x.name.hash
result = !$result
var dogOwners = initTable[Dog, string]()
dogOwners[Dog(name: "Charlie")] = "John"
And in the paragraph below:
The !$ operator finalizes the computed hash, which is necessary when writing a custom hash procedure. The use of the !$ operator ensures that the computed hash is unique.
I am having trouble understanding this. What does it mean to "finalize" something? And what does it mean to ensure that something is unique in this context?
Your questions might be answered if, instead of reading only the description of the !$ operator, you take a look at the beginning of the hashes module documentation. As you can see there, primitive data types have a hash() proc which returns their own hash. But if you have a complex object with many variables, you might want to create a single hash for the object itself, and how do you do that? Without going into hash theory, and treating hashes like black boxes, you need to use two kinds of procs to produce a valid hash: the addition/concatenation operator and the finalization operator. So you end up using !& to keep adding (or mixing) individual hashes into a temporary value, and then use !$ to finalize that temporary value into a final hash. The Nim in Action example might have been easier to understand if the Dog object had more than a single variable, thus requiring the use of both operators:
import tables, hashes, sequtils
type
Dog = object
name: string
age: int
proc hash(x: Dog): Hash =
result = x.name.hash !& x.age.hash
result = !$result
var dogOwners = initTable[Dog, string]()
dogOwners[Dog(name: "Charlie", age: 2)] = "John"
dogOwners[Dog(name: "Charlie", age: 5)] = "Martha"
echo toSeq(dogOwners.keys)
for key, value in dogOwners:
echo "Key ", key.hash, " for ", key, " points at ", value
As for why hash values are temporarily concatenated and then finalized, that depends very much on which algorithms the Nim developers have chosen to use for hashing. You can see from the source code that hash concatenation and finalization is mostly bit shifting. Unfortunately the source code doesn't explain or point at any other reference to understand why that is done and why this specific hashing algorithm was selected over others. You could try asking on the Nim forums about that, and maybe improve the documentation/source code with your findings.
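To give a feel for what "mix, then finalize" means, here is a Python sketch modeled on Jenkins's one-at-a-time hash, which the shifts in older versions of Nim's hashes module resemble. The exact constants, the 64-bit width, and the names mix, finalize and dog_hash are assumptions for illustration, not Nim's API.

```python
MASK = (1 << 64) - 1  # emulate wrap-around 64-bit arithmetic


def mix(h, val):
    """Analogue of !&: fold one more value into the running hash."""
    h = (h + val) & MASK
    h = (h + (h << 10)) & MASK
    return h ^ (h >> 6)


def finalize(h):
    """Analogue of !$: scramble the running value into the final hash."""
    h = (h + (h << 3)) & MASK
    h ^= h >> 11
    return (h + (h << 15)) & MASK


def dog_hash(name, age):
    h = 0
    for byte in name.encode("utf-8"):  # mix in each field, piece by piece
        h = mix(h, byte)
    h = mix(h, age)
    return finalize(h)                 # finalize once, at the very end


print(dog_hash("Charlie", 2) != dog_hash("Charlie", 5))  # True
```

Finalizing spreads the influence of the last values mixed in across all bits of the result. It does not make hashes unique in any strict sense: two different objects can still collide, which is why tables also compare keys for equality.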
We're in the process of converting our imperative brains to a mostly-functional paradigm. This function is giving me trouble. I want to construct an array that EITHER contains two pairs or three pairs, depending on a condition (whether refreshToken is null). How can I do this cleanly using a FP paradigm? Of course with imperative code and mutation, I would just conditionally .push() the extra value onto the end which looks quite clean.
Is this an example of the "local mutation is ok" FP caveat?
(We're using ReadonlyArray in TypeScript to enforce immutability, which makes this somewhat more ugly.)
const itemsToSet = [
[JWT_KEY, jwt],
[JWT_EXPIRES_KEY, tokenExpireDate.toString()],
[REFRESH_TOKEN_KEY, refreshToken /*could be null*/]]
.filter(item => item[1] != null) as ReadonlyArray<ReadonlyArray<string>>;
AsyncStorage.multiSet(itemsToSet.map(roArray => [...roArray]));
What's wrong with itemsToSet as given in the OP? It looks functional to me, but it may be because of my lack of knowledge of TypeScript.
In Haskell, there's no null, but if we use Maybe for the second element, I think that itemsToSet could be translated to this:
itemsToSet :: [(String, String)]
itemsToSet = foldr folder [] values
  where
    values = [
      (jwt_key, jwt),
      (jwt_expires_key, tokenExpireDate),
      (refresh_token_key, refreshToken)]
    folder (key, Just value) acc = (key, value) : acc
    folder _ acc = acc
Here, jwt, tokenExpireDate, and refreshToken are all of the type Maybe String.
itemsToSet performs a right fold over values, pattern-matching the Maybe String elements against Just and (implicitly) Nothing. If it's a Just value, it conses the (key, value) pair onto the accumulator acc. If not, folder just returns acc.
foldr traverses the values list from right to left, building up the accumulator as it visits each element. The initial accumulator value is the empty list [].
You don't need 'local mutation' in functional programming. In general, you can refactor from 'local mutation' to proper functional style by using recursion and introducing an accumulator value.
While foldr is a built-in function, you could implement it yourself using recursion.
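As a rough illustration of that last point, here is a recursive right fold written out by hand, in Python instead of Haskell, with None playing the role of Nothing. foldr, keep_present and items_to_set are hypothetical names, and the key strings mirror the constants from the question.

```python
def foldr(f, acc, xs):
    """Right fold: f(x0, f(x1, ... f(xn, acc)))."""
    if not xs:
        return acc
    return f(xs[0], foldr(f, acc, xs[1:]))


def keep_present(pair, acc):
    """Prepend the pair to the accumulator unless its value is missing."""
    key, value = pair
    return acc if value is None else ((key, value),) + acc


def items_to_set(jwt, expires, refresh_token):
    values = (
        ("JWT_KEY", jwt),
        ("JWT_EXPIRES_KEY", expires),
        ("REFRESH_TOKEN_KEY", refresh_token),  # may be None
    )
    return foldr(keep_present, (), values)


print(items_to_set("abc", "2030-01-01", None))
# (('JWT_KEY', 'abc'), ('JWT_EXPIRES_KEY', '2030-01-01'))
```

Nothing is mutated: each step returns a new tuple built from the previous accumulator.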
In Haskell, I'd just create an array with three elements and, depending on the condition, pass it on either as-is or pass on just a slice of two elements. Thanks to laziness, no computation effort will be spent on the third element unless it's actually needed. In TypeScript, you probably will get the cost of computing the third element even if it's not needed, but perhaps that doesn't matter.
Alternatively, if you don't need the structure to be an actual array (for String elements, performance probably isn't that critical, and the O(n) direct-access cost isn't an issue if the length is limited to three elements), I'd use a singly-linked list instead. Create the list with two elements and, depending on the condition, append the third. This does not require any mutation: the 3-element list simply contains the unchanged 2-element list as a substructure.
Based on the description, I don't think arrays are the best solution simply because you know ahead of time that they contain either 2 values or 3 values depending on some condition. As such, I would model the problem as follows:
type alias Pair = (String, String)
type TokenState
    = WithoutRefresh (Pair, Pair)
    | WithRefresh (Pair, Pair, Pair)
itemsToTokenState : String -> Date -> Maybe String -> TokenState
itemsToTokenState jwtKey jwtExpiry maybeRefreshToken =
    case maybeRefreshToken of
        Just refreshToken ->
            WithRefresh (("JWT_KEY", jwtKey), ("JWT_EXPIRES_KEY", toString jwtExpiry), ("REFRESH_TOKEN_KEY", refreshToken))
        Nothing ->
            WithoutRefresh (("JWT_KEY", jwtKey), ("JWT_EXPIRES_KEY", toString jwtExpiry))
This way you are leveraging the type system more effectively, and the design could be improved further by returning something more ergonomic than tuples.