In the Ruby programming language, myList.shuffle.first is slower than myList.sample, since it shuffles the entire list and then picks the first element. If something similar (shuffle and take the first element) is done in Haskell, will that be as fast as the latter (sampling the array)? I am assuming that the list will be shuffled lazily, so picking the first element and taking a sample will be virtually the same.
It can be written to behave that way, using the decorate-sort-undecorate pattern: first label each element in your list with a random number; sort by the labels; throw away the labels; and take the first element. Because the standard implementation of sort is lazy, taking the first element this way is an O(n) operation, just as sampling would be. I am not sure whether any packages offer this out of the box, and of course a hand-written version of the algorithm may have better constants.
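A minimal sketch of that idea (the function name sampleOne is mine, not from any package):

    import Data.List (sortOn)
    import System.Random (randomRIO)

    -- Label each element with a random Double, sort on the labels, and
    -- take the head. GHC's sort is a lazy merge sort, so forcing only
    -- the head costs O(n) rather than O(n log n).
    sampleOne :: [a] -> IO a
    sampleOne xs = do
      labels <- mapM (\_ -> randomRIO (0, 1 :: Double)) xs
      return (snd (head (sortOn fst (zip labels xs))))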
I was looking for something similar to the lower_bound() function for sets in Python, as we have in C++.
The task is to have a data structure that inserts elements in sorted order, stores only a single instance of each distinct value, and returns the left neighbor of a given value, with both operations in O(log n) worst-case time in Python.
In Python, something similar to the bisect module for lists, but with efficient insertion, might work.
Python's built-in sets are unordered, and the standard library does not offer tree structures.
Maybe you could look at Sorted Containers (a third-party library): http://www.grantjenks.com/docs/sortedcontainers/ It might offer a good approach to your problem.
I am looking for a Haskell data structure that stores an ordered list of elements and that is time-efficient at swapping pairs of elements at arbitrary locations within the list. It's not [a], obviously. It's not Vector because swapping creates new vectors. Which data structure is efficient at this?
The most efficient implementations of persistent data structures, which exhibit effectively constant-time updates (as well as appending, prepending, counting and slicing), are based on the Array Mapped Trie algorithm. The Vector data structures of Clojure and Scala are based on it, for instance. The only Haskell implementation of that data structure that I know of is provided by the "persistent-vector" package.
This algorithm is quite young; it was first presented in the year 2000, which might be why not many people have heard of it. But it turned out to be such a universal solution that it was soon adapted for hash tables. The adapted version of the algorithm is called the Hash Array Mapped Trie. It is likewise used in Clojure and Scala to implement their Set and Map data structures, and it is also well established in Haskell, with packages like "unordered-containers" and "stm-containers" built around it.
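For a feel of the API, here is a minimal usage sketch of the HAMT-backed HashMap from "unordered-containers" (the keys and values are chosen arbitrarily):

    import qualified Data.HashMap.Strict as HM

    -- insert and lookup on a HAMT are effectively constant time
    -- (technically O(log32 n)).
    example :: Maybe Int
    example = HM.lookup "two" (HM.insert "two" 2 (HM.singleton "one" 1))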
To learn more about the algorithm I recommend the following links:
http://blog.higher-order.net/2009/02/01/understanding-clojures-persistentvector-implementation.html
http://lampwww.epfl.ch/papers/idealhashtrees.pdf
Data.Sequence from the containers package would likely be a not-terrible data structure to start with for this use case.
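As a rough sketch of how swapping could look with Seq (swapSeq is an illustrative name, not a library function):

    import qualified Data.Sequence as Seq
    import Data.Sequence (Seq)

    -- Read both positions, then write them back exchanged. index and
    -- update are O(log(min(k, n - k))), so no full copy is made; the
    -- new Seq shares almost all of its structure with the old one.
    swapSeq :: Int -> Int -> Seq a -> Seq a
    swapSeq i j s = Seq.update j x (Seq.update i y s)
      where
        x = Seq.index s i
        y = Seq.index s j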
Haskell is a (nearly) pure functional language, so any data structure you update has to be rebuilt rather than mutated in place, and re-using the unchanged parts is close to the best you can do. Also, the new list is lazily evaluated, and typically only the spine needs to be created until you need the data. If the number of updates is small compared to the number of elements, you could make a difference structure that checks a sparse set of updates first, and only then looks in the original vector.
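A minimal sketch of that difference-structure idea, assuming a Map for the sparse updates (the type and names are illustrative):

    import qualified Data.Map as Map
    import qualified Data.Vector as V

    -- Keep updates in a Map keyed by index; fall back to the original
    -- vector for untouched positions.
    data Overlay a = Overlay (Map.Map Int a) (V.Vector a)

    lookupO :: Int -> Overlay a -> a
    lookupO i (Overlay m v) = Map.findWithDefault (v V.! i) i m

    updateO :: Int -> a -> Overlay a -> Overlay a
    updateO i x (Overlay m v) = Overlay (Map.insert i x m) v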
In the Numeric Haskell Repa Tutorial Wiki, there is a passage that reads (for context):
10.1 Fusion, and why you need it
Repa depends critically on array fusion to achieve fast code. Fusion is a fancy name for the combination of inlining and code transformations performed by GHC when it compiles your program. The fusion process merges the array filling loops defined in the Repa library with the "worker" functions that you write in your own module. If the fusion process fails, then the resulting program will be much slower than it needs to be, often 10x slower than an equivalent program using plain Haskell lists. On the other hand, provided fusion works, the resulting code will run as fast as an equivalent cleanly written C program. Making fusion work is not hard once you understand what's going on.
The part that I don't understand is this:
"If the fusion process fails, then the
resulting program will be much slower than it needs to be, often 10x
slower an equivalent program using plain Haskell lists."
I understand why it would run slower if stream fusion fails, but why does it run that much slower than lists?
Thanks!
Typically, because lists are lazy, and Repa arrays are strict.
If you fail to fuse a lazy list traversal, e.g.
map f . map g
you pay O(1) cost per value for leaving the intermediate (lazy) cons cell there.
If you fail to fuse the same traversal over a strict sequence, you pay at least O(n) for allocating an entire strict intermediate array before the next stage can consume it.
Also, since fusion rewrites your code into an unrecognizable Stream data type to improve analysis, a failed fusion can leave you with code that has far too many constructors and other overheads.
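To make the contrast concrete, here is a sketch using unboxed vectors as a stand-in for Repa arrays (an assumption made for brevity):

    import qualified Data.Vector.Unboxed as U

    -- Lazy lists: even unfused, the intermediate list from the inner
    -- map is consumed cell by cell, so only O(1) extra cells are live
    -- at any moment.
    viaLists :: [Int] -> [Int]
    viaLists = map (+ 1) . map (* 2)

    -- Strict arrays: if the two maps fail to fuse, the inner map must
    -- materialize an entire O(n) intermediate vector before the outer
    -- map can start.
    viaVectors :: U.Vector Int -> U.Vector Int
    viaVectors = U.map (+ 1) . U.map (* 2)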
Edit: This is not correct--see Don Nelson's comment (and his answer--he knows a lot more about the library than I do).
Immutable arrays cannot share components; disregarding fusion, any modification to an immutable array must reallocate the entire array. By contrast, while list operations are non-destructive, they can share parts: f i (h:t) = i:t, for example, replaces the head of a list in constant time by creating a new list in which the first cell points to the second cell of the original list. Moreover, because lists can be built incrementally, such functions as generators that build a list by repeated calls to a function can still run in O(n) time, while the equivalent function on an immutable array without fusion would need to reallocate the array with every call to the function, taking O(n^2) time.
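The head-replacement example from the paragraph above, spelled out (replaceHead and replaceFirst are illustrative names):

    import Data.Array

    -- O(1): the new list's tail is the old list's tail, shared intact.
    replaceHead :: a -> [a] -> [a]
    replaceHead x (_:t) = x : t
    replaceHead x []    = [x]

    -- O(n): (//) copies the whole array to change one slot.
    replaceFirst :: a -> Array Int a -> Array Int a
    replaceFirst x arr = arr // [(fst (bounds arr), x)]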
I think the best way to form this question is with an example... so, the actual reason I decided to ask about this is Problem 55 on Project Euler. The problem asks you to find the number of Lychrel numbers below 10,000. In an imperative language, I would collect the list of numbers leading up to the final palindrome and push those numbers to a list outside of my function. I would then check each incoming number to see if it was part of that list, and if so, simply stop the test and conclude that the number is NOT a Lychrel number. I would do the same thing with non-Lychrel numbers and their preceding numbers.
I've done this before and it has worked out nicely. However, it seems like a big hassle to actually implement this in Haskell without adding a bunch of extra arguments to my functions to hold the predecessors, and an absolute parent function to hold all of the numbers that I need to store.
I'm just wondering if there is some kind of tool that I'm missing here, or if there are any standard ways to do this? I've read that Haskell kind of "naturally caches" (for example, if I wanted to define the odd numbers as odds = filter odd [1..], I could refer to that whenever I wanted to), but it seems to get complicated when I need to dynamically add elements to a list.
Any suggestions on how to tackle this?
Thanks.
PS: I'm not asking for an answer to the Project Euler problem, I just want to get to know Haskell a bit better!
I believe you're looking for memoization. There are a number of ways to do this. One fairly simple way is with the MemoTrie package. Alternatively, if you know your input domain is a bounded set of numbers (e.g. [0, 10000)), you can create an Array where the values are the results of your computation, and then you can simply index into the array with your input. The Array approach won't work for you here, though, because even though your input numbers are below 10,000, subsequent iterations can trivially grow larger than 10,000.
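For problems where the bounded-domain approach does apply, here is a minimal sketch of the Array technique (memoBounded is an illustrative name):

    import Data.Array

    -- Precompute f over [0, bound) into a lazy boxed array and index
    -- into it; each element is computed at most once, on first demand.
    memoBounded :: Int -> (Int -> a) -> (Int -> a)
    memoBounded bound f = (table !)
      where
        table = listArray (0, bound - 1) (map f [0 .. bound - 1])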
That said, when I solved Problem 55 in Haskell, I didn't bother doing any memoization whatsoever. It turned out to just be fast enough to run (up to) 50 iterations on all input numbers. In fact, running that right now takes 0.2s to complete on my machine.
As an exercise I wrote an implementation of the longest increasing subsequence algorithm, initially in Python but I would like to translate this to Haskell. In a nutshell, the algorithm involves a fold over a list of integers, where the result of each iteration is an array of integers that is the result of either changing one element of or appending one element to the previous result.
Of course in Python you can just change one element of the array. In Haskell, you could rebuild the array while replacing one element at each iteration - but that seems wasteful (copying most of the array at each iteration).
In summary what I'm looking for is an efficient Haskell data structure that is an ordered collection of 'n' objects and supports the operations: lookup i, replace i foo, and append foo (where i is in [0..n-1]). Suggestions?
Perhaps the standard Seq type from Data.Sequence. It's not quite O(1), but it's pretty good:
index (your lookup) and adjust (your replace) are O(log(min(index, length - index)))
(><) (your append) is O(log(min(length1, length2)))
It's based on a tree structure (specifically, a 2-3 finger tree), so it should have good sharing properties (meaning that it won't copy the entire sequence for incremental modifications, and will perform them faster too). Note that Seqs are strict, unlike lists.
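Mapped onto your three operations, a minimal sketch (the function names are mine):

    import qualified Data.Sequence as Seq
    import Data.Sequence (Seq, (|>))

    lookupAt :: Int -> Seq a -> a
    lookupAt i s = Seq.index s i       -- O(log(min(i, n - i)))

    replaceAt :: Int -> a -> Seq a -> Seq a
    replaceAt i x s = Seq.update i x s -- O(log(min(i, n - i)))

    appendOne :: a -> Seq a -> Seq a
    appendOne x s = s |> x             -- O(1), single-element append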
I would try to just use mutable arrays in this case, preferably in the ST monad.
The main advantages would be making the translation more straightforward and making things simple and efficient.
The disadvantage, of course, is losing out on purity and composability. However, I think this should not be such a big deal, since I don't think there are many cases where you would want to keep intermediate algorithm states around.
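A minimal sketch of the ST pattern (swapInPlace is an illustrative stand-in for one step of the algorithm):

    import Control.Monad.ST (ST, runST)
    import Data.Array.ST (STUArray, newListArray, readArray, writeArray, getElems)

    -- Build a mutable array, mutate it in place, and return a pure
    -- result; runST guarantees the mutation cannot leak out.
    swapInPlace :: Int -> Int -> [Int] -> [Int]
    swapInPlace i j xs = runST $ do
      arr <- newListArray (0, length xs - 1) xs :: ST s (STUArray s Int Int)
      x <- readArray arr i
      y <- readArray arr j
      writeArray arr i y
      writeArray arr j x
      getElems arr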