Efficient way to store bigrams with count - hashmap

I have several text files and want to create bigrams along with a count of the number of times each bigram occurs in my files. I was thinking I could store these in a hash map, with the bigram as key and the count as value. However, I know that hash maps use quite a bit more memory than a list, and I was thinking I might be able to do the same thing with a list of triples (w1, w2, count).
So, in code, I am doing this as of now:
(defparameter mymap (make-hash-table :test 'equal))

(if (gethash "w1 w2" mymap)
    (setf (gethash "w1 w2" mymap) (+ 1 (gethash "w1 w2" mymap)))
    (setf (gethash "w1 w2" mymap) 1))
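;; (equivalently and more idiomatically: (incf (gethash "w1 w2" mymap 0)),
;; using GETHASH's optional default-value argument)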

Unless you have a small number of keys, a list is probably not what you want. A hash table is probably a good choice. It may use a little more memory, but probably not enough to worry about, and you can tweak the :rehash-size and :rehash-threshold arguments of make-hash-table to control the memory/performance trade-off for your particular application. (For example, a small rehash-size and a large rehash-threshold would use less memory but make lookups slower.)
Another alternative is a balanced binary search tree such as an AVL tree or a red-black tree. These aren't part of standard Common Lisp, but several Lisp libraries provide implementations, including FSet, lisp-interface-library, and cl-containers.

Related

Are sequences faster than vectors for searching in Haskell?

I am kind of new to using data structures in Haskell besides lists. My goal is to choose one container among Data.Vector, Data.Sequence, Data.List, etc. My problem is the following:
I have to create a sequence (mathematically speaking). The sequence starts at 0. In each iteration two new elements are generated, but only one should be appended, based on whether the first element is already in the sequence. So in each iteration there is a call to the elem function (see the pseudo-code below).
appendNewItem :: [Integer] -> [Integer]
appendNewItem acc =
  let firstElem  = someFunc
      secondElem = someOtherFunc
      newElem    = if firstElem `elem` acc
                     then secondElem
                     else firstElem
  in  acc `append` newElem

sequenceUptoN :: Int -> [Integer]
sequenceUptoN n = iterate appendNewItem [0] !! n
Where append and iterate vary depending on which collection you use (I am using lists in the type signature for simplicity).
The question is: which data structure should I use? Is Data.Sequence faster for this task because of its finger-tree inner structure?
Thanks a lot!!
No, sequences are not faster for searching. A Vector is just a flat chunk of memory, which generally gives the best lookup performance. If you want to optimise searching, use Data.Vector.Unboxed. (The normal, "boxed" variant is also pretty good, but it actually contains only references to the elements in the flat memory chunk, so it's not quite as fast for lookups.)
However, because of the flat memory layout, Vectors are not good for (purely functional) appending: basically, whenever you add a new element, the whole array must be copied so as not to invalidate the old one (which somebody else might still be using). If you need to append, Seq is a pretty good choice, although it's not as fast as destructive appending: for maximum performance, you'll want to pre-allocate an uninitialized Data.Vector.Unboxed.Mutable.MVector of the required size, populate it using the ST monad, and freeze the result. But this is much more fiddly than the purely functional alternatives, so unless you need to squeeze out every bit of performance, Data.Sequence is the way to go. If you only want to append, but not look up elements, then a plain old list in reverse order would also do the trick.
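For illustration, here is a minimal sketch of that allocate/populate/freeze pattern (the squaring workload is just a placeholder):

import Control.Monad.ST (runST)
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as MV

-- Fill an unboxed mutable vector inside ST, then freeze it into a pure Vector.
squares :: Int -> V.Vector Int
squares n = runST $ do
  mv <- MV.new n                            -- pre-allocate n uninitialized slots
  mapM_ (\i -> MV.write mv i (i * i)) [0 .. n - 1]
  V.freeze mv                               -- copy out as an immutable Vector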
I suggest using Data.Sequence in conjunction with Data.Set: the Sequence to hold the sequence of values and the Set to track which values are present.
Sequence, List, and Vector are all structures for working with values where the position in the structure has primary importance when it comes to indexing. In lists we can manipulate elements at the front efficiently, in sequences we can manipulate elements in time logarithmic in the distance to the closest end, and in vectors we can access any element in constant time. Vectors, however, are not that useful if the length keeps changing, so that rules out their use here.
However, you also need to look up whether a certain value is already present, which none of these structures help with: you have to search the whole list/sequence/vector to be certain that a new value isn't there. Data.Map and Data.Set are structures that index values by their Ord instance and let you look up/insert in O(log n). So, at the cost of some extra memory, you can test for the presence of firstElem in your Set in O(log n) time and then add newElem to the end of the sequence in constant time. Just make sure to keep the two structures in sync when adding or removing elements, as in the sketch below.
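A hedged sketch of that pairing, reshaping the question's appendNewItem so the two generated elements come in as parameters (standing in for someFunc and someOtherFunc):

import Data.Sequence (Seq, (|>))
import qualified Data.Set as Set

-- Carry the Seq (ordering) and the Set (membership index) side by side.
appendNewItem :: Integer -> Integer
              -> (Seq Integer, Set.Set Integer)
              -> (Seq Integer, Set.Set Integer)
appendNewItem firstElem secondElem (acc, seen) =
  let newElem = if firstElem `Set.member` seen   -- O(log n) membership test
                  then secondElem
                  else firstElem
  in (acc |> newElem, Set.insert newElem seen)   -- O(1) append, O(log n) insert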

Data.Map: how do I tell if I "need value-strict maps"?

When choosing between Data.Map.Lazy and Data.Map.Strict, the docs tell us for the former:
API of this module is strict in the keys, but lazy in the values. If you need value-strict maps, use Data.Map.Strict instead.
and for the latter likewise:
API of this module is strict in both the keys and the values. If you need value-lazy maps, use Data.Map.Lazy instead.
How do more seasoned Haskellers than me tend to intuit this "need"? Use case in point, in a run-and-done (i.e. not daemon-like/long-running) command-line tool: readFileing a simple line-based custom config file where many (not all) lines define key:value pairs to be collected into a Map. Once done, we rewrite many values in it depending on other values in it that were read later (thanks to immutability, in this process we create a new Map and discard the initial incarnation).
(Although in practice this file likely won't often, or ever, reach even a thousand lines, let's just assume for the sake of learning that for some users it will before long.)
Any given run of the tool will perhaps look up some 20-100% of the (rewritten-on-load, although with lazy evaluation I'm never quite sure "when really") key:value pairs, anywhere between once and dozens of times.
How do I reason about the differences between "value-strict" and "value-lazy" Data.Maps here? What happens "under the hood", in terms of mainstream computing if you will?
Fundamentally, such maps are of course about "storing once, looking up many times"; but then, what in computing isn't, "fundamentally"? And furthermore, the whole concept of lazy evaluation's thunks seems to boil down to this very principle, so why not always stay value-lazy?
How do I reason about the differences between "value-strict" and "value-lazy" Data.Maps here?
Value-lazy is the norm in Haskell. This means that not the values themselves but thunks (i.e. recipes for how to compute the value) are stored. For example, let's say you extract the value from a line like this:
tail (dropWhile (/= ':') line)
Then a value-strict map would actually run the extraction upon insert, while a lazy one would happily just remember how to do it. That thunk is then also what you would get back on a lookup.
Here are some pros and cons:
- Lazy values may need more memory, not only for the thunk itself but also for the data it references (here, line).
- Strict values may need more memory. In our case this could happen when the string gets interpreted to yield some memory-hungry structure like a list, JSON, or XML.
- Using lazy values may need less CPU if your code doesn't need every value.
- Too deeply nested thunks may cause stack overflows when the value is finally needed.
- There is also a semantic difference: in lazy mode, you may get away with extraction code that would fail (like the code above, which fails if there isn't a ':' on the line) as long as you only check whether the key is present; in strict mode, your program crashes upon insert. See the sketch after this list.
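A small sketch of that last point, reusing the extraction code from above:

import qualified Data.Map.Lazy as M

-- The extraction bottoms out on a line without a ':', but the lazy map
-- never forces it as long as we only test for the key's presence.
main :: IO ()
main = do
  let line = "no colon on this line"
      m    = M.insert "key" (tail (dropWhile (/= ':') line)) M.empty
  print (M.member "key" m)  -- True; the broken value is never forced
  -- With Data.Map.Strict, the insert itself raises the tail-of-[] error.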
As always, there are no fixed measures like: "If your evaluated value needs less than 20 bytes and takes less than 30µs to compute, use strict, else use lazy."
Normally, you just go with one and when you notice extreme runtimes/memory usage you try the other.
Here's a small experiment that shows a difference between Data.Map.Lazy and Data.Map.Strict. This code exhausts the heap:
import Data.Foldable
import qualified Data.Map.Lazy as M

main :: IO ()
main = print $ foldl' (\kv i -> M.adjust (+ i) 'a' kv)
               (M.fromList [('a', 0)])
               (cycle [0])
(Better to compile with a small maximum heap, like ghc Main.hs -with-rtsopts="-M20m".)
The foldl' keeps the map in WHNF as we iterate over the infinite list of zeros. However, thunks accumulate in the modified value until the heap is exhausted.
The same code with Data.Map.Strict simply loops forever. In the strict variant, the values are in WHNF whenever the map is in WHNF.

Why is a string not a list of characters in scheme/racket?

What I'm used to is that a string is just a list or array of characters, as in most C-like languages. However, in the Scheme implementations that I use, including Chicken and Racket, a string is not the same as a list of characters; something like (car "mystring") just won't fly. Instead there are functions to convert from and to lists. Why was this choice made? In my opinion Haskell does it the best way: there is literally no difference in any way between a list of chars and a string. I like this the most because it conveys the meaning of a string in the clearest, simplest way. I'm not completely sure, but I'd guess that in the 'background' strings are lists or arrays of chars in almost any language. I'd especially expect a language like Scheme, with its focus on simplicity, to handle strings this way, or at least make it so you can do with strings what you can do with lists, like take the car or cdr. What am I missing?
It looks like what you are really asking is, Why aren't there generic operations that work on both strings and lists?
Those do exist, in libraries like the generic collections library.
#lang racket/base
(require data/collection)
(first '(my list)) ; 'my
(first "mystring") ; #\m
Also, operations like map from this library can work with multiple different types of collections together.
#lang racket/base
(require data/collection)
(define (digit->char n)
  (first (string->immutable-string (number->string n))))
(first (map string (map digit->char '(1 2 3)) "abc"))
; "1a"
This doesn't mean that strings are lists, but it does mean that both strings and lists are sequences, and operations on sequences can work on both kinds of data types together.
According to the Racket documentation, strings are arrays of characters:
4.3 Strings
A string is a fixed-length array of characters.
An array, as the term is usually used in programming languages, and especially in C and C++, is a contiguous block of memory with the important property that it supports efficient random access. E.g., you can access the first element (x[0]) just as quickly as the nth (x[n-1]). Linked lists, the lists you encounter by default in Lisps, don't support efficient random access.
So, since strings are arrays in Racket, you'd expect there to be some counterpart to the x[i] notation (which isn't very Lispy). In Racket, you use string-ref and string-set!, which are documented on that same page. E.g.:
(string-ref "mystring" 1) ;=> #\y
(Now, there are also vector-ref and vector-set! procedures for more generalized vectors. In some Lisps, strings are also vectors, so you can use general vector and array manipulation functions on strings. I'm not much of a Racket user, so I'm not sure whether that applies in Racket as well.)
Technically, an "array" is usually a contiguous piece of memory, while a "list" is usually understood to be a singly or doubly linked list of independently allocated objects. In most common programming languages, including C and all Lisp and Scheme dialects that I know of, string data is stored in an array, i.e. in a contiguous piece of memory, for performance reasons.
The confusion is that such strings might still colloquially be referred to as lists, which is not correct when "list" is understood in its precise technical sense of "linked list".
Were a string truly stored as a list, in the way Scheme and Lisp generally store lists, every single character would carry the overhead of a cons cell: the character itself plus at least one pointer, to the next character.
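Incidentally, the Haskell comparison in the question cuts both ways: Haskell's String really is a linked list with exactly this per-character overhead, which is why performance-sensitive Haskell code tends to use the packed Data.Text type (from the text package) instead. A tiny illustration:

import qualified Data.Text as T

greeting :: String          -- String is just a synonym for [Char]
greeting = 'h' : "ello"     -- so list operations work on it directly

packed :: T.Text
packed = T.pack greeting    -- contiguous, array-like representation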

Is there a functional n:1 bimap?

I have a map where multiple keys can map to the same value. I'd like to do reverse lookups, such that given a value, I get a list of all keys that map to this value.
Note that unlike Data.Bimap, my map is not 1:1 but n:1.
Also, the reverse lookup should not take O(n), as running through all map entries would; it should be O(log n) or better, as with a reverse index. The map will contain many tens of thousands of entries, under a heavy load of add/remove/lookup operations.
Is such a data structure available in functional form (Haskell or Frege preferred)?
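No standard library type is named in the question, but the reverse index it describes can be sketched as two maps kept in sync; all names below are illustrative, not an existing package:

import qualified Data.Map.Strict as M
import qualified Data.Set as S

-- n:1 forward map plus a value -> set-of-keys reverse index,
-- giving O(log n) lookups in both directions.
data NMap k v = NMap
  { forward  :: M.Map k v
  , backward :: M.Map v (S.Set k)
  }

insertN :: (Ord k, Ord v) => k -> v -> NMap k v -> NMap k v
insertN k v (NMap f r) =
  NMap (M.insert k v f)
       (M.insertWith S.union v (S.singleton k) r')
  where
    -- if k was already mapped, drop it from the old value's key set
    -- (a production version would also delete key sets that become empty)
    r' = case M.lookup k f of
           Just old -> M.adjust (S.delete k) old r
           Nothing  -> r

lookupKeys :: (Ord k, Ord v) => v -> NMap k v -> S.Set k
lookupKeys v (NMap _ r) = M.findWithDefault S.empty v r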

Looking for an efficient array-like structure that supports "replace-one-member" and "append"

As an exercise I wrote an implementation of the longest increasing subsequence algorithm, initially in Python but I would like to translate this to Haskell. In a nutshell, the algorithm involves a fold over a list of integers, where the result of each iteration is an array of integers that is the result of either changing one element of or appending one element to the previous result.
Of course in Python you can just change one element of the array. In Haskell, you could rebuild the array while replacing one element at each iteration - but that seems wasteful (copying most of the array at each iteration).
In summary what I'm looking for is an efficient Haskell data structure that is an ordered collection of 'n' objects and supports the operations: lookup i, replace i foo, and append foo (where i is in [0..n-1]). Suggestions?
Perhaps the standard Seq type from Data.Sequence. It's not quite O(1), but it's pretty good:
- index (your lookup) and adjust (your replace) are O(log(min(index, length - index)))
- (><) (your append) is O(log(min(length1, length2)))
It's based on a tree structure (specifically, a 2-3 finger tree), so it should have good sharing properties (meaning that it won't copy the entire sequence for incremental modifications, and will perform them faster too). Note that Seqs are strict, unlike lists.
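For concreteness, a small demo of the three requested operations on Seq; note that for appending a single element there is also (|>), which is amortized O(1):

import qualified Data.Sequence as Seq
import Data.Sequence (Seq, (|>))

demo :: (Int, Seq Int, Seq Int)
demo =
  let s = Seq.fromList [3, 1, 4]
  in ( Seq.index s 1       -- lookup i        => 1
     , Seq.update 1 9 s    -- replace i foo   => fromList [3,9,4]
     , s |> 5 )            -- append foo      => fromList [3,1,4,5]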
I would try to just use mutable arrays in this case, preferably in the ST monad.
The main advantages would be making the translation more straightforward and making things simple and efficient.
The disadvantage, of course, is giving up purity and composability. However, I think this should not be a big deal here, since I don't think there are many cases where you would want to keep intermediate algorithm states around.
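A minimal sketch of that approach with the standard array package, assuming Int elements (the function name is just for illustration):

import Data.Array.ST (newListArray, runSTUArray, writeArray)
import Data.Array.Unboxed (UArray, elems)

-- Update one slot destructively inside ST, then return to the pure world.
replaceAt :: Int -> Int -> [Int] -> [Int]
replaceAt i x xs = elems frozen
  where
    frozen :: UArray Int Int
    frozen = runSTUArray $ do
      arr <- newListArray (0, length xs - 1) xs
      writeArray arr i x    -- O(1) in-place update
      return arr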
