Alternative Hash Functions for Dicts - python-3.x

Currently I am using hash(frozenset(my_dict.items())) to hash my dicts.
However, it seems that once in a while I run into issues with it (also explained here: https://stackoverflow.com/a/5884123/2516892), and I would like an alternative approach that is considered more stable/safe.
What would be a good recommendation?
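One commonly suggested alternative is to hash a canonical serialization of the dict rather than relying on hash() and frozenset. A minimal sketch of that idea (the helper name is made up, and it assumes the keys and values are JSON-serializable):

    import hashlib
    import json

    def stable_dict_hash(d):
        # Build a canonical, order-independent text form of the dict and hash
        # that. Unlike the built-in hash(), hashlib digests are stable across
        # interpreter runs (no hash randomization) and do not require the
        # values themselves to be hashable, only JSON-serializable.
        canonical = json.dumps(d, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    print(stable_dict_hash({"a": 1, "b": [2, 3]}))
    print(stable_dict_hash({"b": [2, 3], "a": 1}))  # same digest, different insertion order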

Related

What is an efficient way of storing snapshots of an in-memory key-value store in Java?

I am trying to design an in-memory key-value store that maps strings to strings of variable length. I also want to give it the ability to take snapshots of its key-value data sets for any particular moment in time. Moreover, modifications to the key-value store should not affect past snapshots. I am currently using a HashMap for this, and for snapshots I maintain a mapping of timestamps to deep-copies of the respective HashMap's entry sets (with simple String compression). Are there any other more effective methods of doing this in-memory?
Since I am working with strings of characters, I am wondering whether it would be more memory-efficient to use tries instead?
Interesting. A little research shows that a Ctrie might be what you are looking for. Wiki: https://en.wikipedia.org/wiki/Ctrie
Ctrie: "Concurrent Tries with Efficient Non-Blocking Snapshots"
It looks like there is code available in multiple languages: Java, Haskell, Python, C++.
A related question: Creating a ConcurrentHashMap that supports "snapshots"
Also try searching Stack Overflow: https://stackoverflow.com/search?q=ctrie
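For illustration (in Python, to stay consistent with the rest of this page), here is a rough sketch of the copy-then-freeze approach the question describes. It is not a Ctrie, and the class and method names are made up; because the values are immutable strings, a shallow copy is enough to isolate a snapshot from later writes:

    from types import MappingProxyType

    class SnapshotStore:
        def __init__(self):
            self._data = {}
            self._snapshots = {}  # timestamp -> read-only frozen copy

        def put(self, key, value):
            self._data[key] = value

        def get(self, key):
            return self._data[key]

        def snapshot(self, timestamp):
            # dict(...) copies the top-level mapping (O(n) in the number of
            # keys); a real Ctrie makes this O(1) by sharing structure
            # between the live trie and the snapshot.
            self._snapshots[timestamp] = MappingProxyType(dict(self._data))

        def get_at(self, timestamp, key):
            return self._snapshots[timestamp][key]

    store = SnapshotStore()
    store.put("k", "v1")
    store.snapshot("t0")
    store.put("k", "v2")
    print(store.get_at("t0", "k"), store.get("k"))  # v1 v2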

Keep a list of hashes

I am working on a small project to keep my skills from completely rusting.
I am generating a lot of hashes (MD5 in this case) and I need to check if I've seen a hash before, so I want to keep them in some kind of collection.
What's the best way to store them so that I can check whether a hash already exists prior to doing the calculations?
The hash itself is already a key of sorts. Your best bet is a hash table. In a properly implemented hash table, you can check for the existence of a key in constant time. Common hash table implementations with this feature are C# Dictionaries, Python's dict type, PHP arrays (which are actually maps, not arrays), Perl's hashes (%) and Ruby's Hash. If you include details of the language you're working in, an example wouldn't be too hard to look up.
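Since the question doesn't name a language, here is a minimal Python sketch of that idea, using a set (a hash-table-backed container) to record digests already seen:

    import hashlib

    seen = set()  # a Python set gives average O(1) membership checks

    def process(data: bytes):
        digest = hashlib.md5(data).hexdigest()
        if digest in seen:
            return          # hash already encountered, skip the work
        seen.add(digest)
        # ... expensive calculation goes here ...

    process(b"hello")
    process(b"hello")  # second call returns early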

Haskell data structure with efficient inexact lookup by key?

I have data keyed by Data.Time.Calendar.Day and need to efficiently look it up. Some dates are missing, when I try to look up by a missing key, I want to get data attached to the closest existing key, somewhat like std::map::lower_bound.
Any suggestions for existing libraries that can do this? I searched around for a while and only found maps supporting exact key lookups.
Thanks.
Did you check Data.Map.Lazy? In particular, I guess you could use the functions lookupLE and lookupGT, or similar. The complexity of these functions is O(log n), and similar functions exist in Data.Map.Strict.
A suitable combination of Data.Map's splitLookup and findMin/findMax will do the trick.
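To stay consistent with the rest of this page, here is the same lower-bound idea sketched in Python with the standard bisect module; the Haskell answers above would use Data.Map's lookupLE/lookupGT directly, and the dates below are made-up sample data:

    import bisect
    from datetime import date

    data = {
        date(2023, 1, 1): "a",
        date(2023, 1, 5): "b",
        date(2023, 1, 9): "c",
    }
    keys = sorted(data)  # sorted key list kept alongside the map

    def lookup_le(day):
        # Greatest key <= day, or None if every key is larger.
        i = bisect.bisect_right(keys, day)
        return data[keys[i - 1]] if i else None

    print(lookup_le(date(2023, 1, 6)))  # -> "b", the closest earlier key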

Finding possibly matching strings in a large dataset

I'm in the middle of a project where I have to process text documents and enhance them with Wikipedia links. Preprocessing a document includes locating all the possible target articles, so I extract all ngrams and do a comparison against a database containing all the article names. The current algorithm is a simple caseless string comparison preceded by simple trimming. However, I'd like it to be more flexible and tolerant to errors or little text modifications like prefixes etc. Besides, the database is pretty huge and I have a feeling that string comparison in such a large database is not the best idea...
What I thought of is a hashing function which would assign a unique hash (I'd rather avoid collisions) to any article or ngram, so that I could compare hashes instead of strings. The difference between two hashes would let me know whether the words are similar, so that I could gather all the possible target articles.
Theoretically, I could use cosine similarity to calculate the similarity between words, but this doesn't seem right to me because comparing the characters multiple times sounds like a performance issue.
Is there any recommended way to do it? Is it a good idea at all? Maybe string comparison with proper indexing isn't that bad, and hashing won't help me here?
I have looked into hashing functions and text similarity algorithms, but I haven't found a solution yet...
Consider using the Apache Lucene API. It provides functionality for searching, stemming, tokenization, indexing, and document similarity scoring. It's an open-source implementation of basic best practices in Information Retrieval.
The functionality that seems most useful to you from Lucene is its MoreLikeThis feature, which uses Latent Semantic Analysis to locate similar documents.
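The "string comparison with proper indexing" idea mentioned in the question can already go a long way on its own. Here is a minimal Python sketch (made-up data, no Lucene) of an exact-match index over normalized titles; Lucene generalizes this with analyzers, stemming, and relevance scoring:

    from collections import defaultdict

    def normalize(text):
        # Caseless comparison plus whitespace trimming, as in the question.
        return " ".join(text.lower().split())

    def build_index(article_titles):
        # Map each normalized title to the original titles that produce it,
        # so an n-gram lookup is a single dict access instead of a full scan.
        index = defaultdict(list)
        for title in article_titles:
            index[normalize(title)].append(title)
        return index

    index = build_index(["New York City", "new york  city ", "Python (programming language)"])

    def candidates(ngram):
        return index.get(normalize(ngram), [])

    print(candidates("NEW YORK CITY"))  # both stored variants of the title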

Parsing bulk text with Hadoop: best practices for generating keys

I have a 'large' set of line delimited full sentences that I'm processing with Hadoop. I've developed a mapper that applies some of my favorite NLP techniques to it. There are several different techniques that I'm mapping over the original set of sentences, and my goal during the reducing phase is to collect these results into groups such that all members in a group share the same original sentence.
I feel that using the entire sentence as a key is a bad idea. I also felt that generating some hash value of the sentence might not work because of a limited number of keys (an unjustified belief).
Can anyone recommend the best idea/practice for generating unique keys for each sentence? Ideally, I would like to preserve order. However, this isn't a main requirement.
Goodbye,
Standard hashing should work fine. Most hash algorithms have a value space far greater than the number of sentences you're likely to be working with, and thus the likelihood of a collision will still be extremely low.
Despite the answer that I've already given you about what a proper hash function might be, I would really suggest you just use the sentences themselves as the keys unless you have a specific reason why this is problematic.
Though you might want to avoid simple hash functions (for example, any half-baked idea that you could think up quickly) because they might not mix up the sentence data enough to avoid collisions in the first place, one of the standard cryptographic hash functions would probably be quite suitable, for example MD5, SHA-1, or SHA-256.
You can use MD5 for this, even though collisions have been found and the algorithm is considered unsafe for security-intensive purposes. This isn't a security-critical application, and the collisions that have been found arose from carefully constructed data and probably won't arise randomly in your own NLP sentence data. (See, for example, Johannes Schindelin's explanation of why it's probably unnecessary to change git to use SHA-256 hashes, so that you can appreciate the reasoning behind this.)
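As a concrete sketch of what the answers suggest, written in Python to match the rest of this page: derive a fixed-length key from each sentence with a standard cryptographic hash.

    import hashlib

    def sentence_key(sentence):
        # 128-bit MD5 digest as a fixed-length key; swap in hashlib.sha256
        # if accidental collisions worry you, at the cost of longer keys.
        return hashlib.md5(sentence.encode("utf-8")).hexdigest()

    print(sentence_key("The quick brown fox jumps over the lazy dog."))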
