Efficient boolean reductions `any`, `all` for PySpark RDD? - apache-spark

PySpark supports common reductions like sum, min, count, ... Does it support boolean reductions like all and any?
I can always fold over or_ and and_ but this seems inefficient.

This is very late, but all on a set of boolean values z is the same as min(z) == True, and any is the same as max(z) == True.
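A minimal PySpark sketch of that trick (the SparkContext sc and the example data are assumed, not part of the original question); it works because Python booleans compare as False < True:

rdd = sc.parallelize([True, True, False])   # assumed example data

all_true = rdd.min() == True   # False here: at least one element is False
any_true = rdd.max() == True   # True here: at least one element is True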

No, the underlying Scala API doesn't have it, so the Python one definitely won't. I don't think they will add it either, as it's very easy to define in terms of filter.
Yes, using fold would be inefficient because it won't parallelize. Do something like .filter(!condition).take(1).isEmpty to mean .forall(condition) and .filter(condition).take(1).nonEmpty to mean .exists(condition).
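A PySpark sketch of the same filter/take idea (the helper names forall and exists are mine; rdd and condition are assumed to exist); take(1) lets Spark stop as soon as one qualifying element is found:

def forall(rdd, condition):
    # an empty result from the negated filter means every element satisfies condition
    return len(rdd.filter(lambda x: not condition(x)).take(1)) == 0

def exists(rdd, condition):
    # a single surviving element is enough to prove existence
    return len(rdd.filter(condition).take(1)) > 0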
(General suggestion: the underlying Scala API is generally more flexible than the Python API; I suggest you move to it. It also makes debugging much easier, as you have fewer layers to dig through. Scala means Scalable Language - it's much better for scalable applications and more robust than dynamically typed languages.)

Related

Way to use bisect module for sets in python

I was looking for something similar to the lower_bound() function for sets in Python, as we have in C++.
The task is to have a data structure which inserts elements in sorted order, stores only a single instance of each distinct value, and returns the left neighbor of a given value, with both operations in O(log n) worst-case time in Python.
Something similar to Python's bisect module for lists, but with efficient insertion, may work.
Sets are unordered, and the standard library does not offer tree structures.
Maybe you could look at sortedcontainers (a third-party library): http://www.grantjenks.com/docs/sortedcontainers/ - it might offer a good approach to your problem.
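A small sketch of that idea with sortedcontainers (assumed installed via pip install sortedcontainers; the helper name is mine): SortedSet keeps one copy of each value in sorted order, and bisect_left gives the left neighbor:

from sortedcontainers import SortedSet

s = SortedSet([5, 1, 9, 5])   # duplicates collapse: SortedSet([1, 5, 9])
s.add(7)                      # insertion keeps the set sorted

def left_neighbor(sset, value):
    # largest element strictly less than value, or None if there is none
    idx = sset.bisect_left(value)
    return sset[idx - 1] if idx > 0 else None

print(left_neighbor(s, 7))    # 5
print(left_neighbor(s, 1))    # None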

Python built-in functions time/space complexity

For Python built-in functions such as:
sorted()
min()
max()
what are the time/space complexities, and what algorithms are used?
Is it always advisable to use Python's built-in functions?
As mentioned in the comments, sorted uses Timsort (see this post), which is O(n log n) and a stable sort. max and min run in Θ(n). But if you want to find both of them in the same solution, you can do it with about 3n/2 comparisons instead of 2n (although in general it is still O(n)). To learn more about the method, see this post.
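A sketch of that pairwise method (the helper name is mine, not a built-in): compare the two elements of each pair against each other first, then only the smaller one against the running minimum and the larger one against the running maximum, for roughly 3n/2 comparisons in total:

def min_and_max(seq):
    it = iter(seq)
    lo = hi = next(it)                            # assumes a non-empty sequence
    for a in it:
        b = next(it, a)                           # reuse a if one element is left over
        small, big = (a, b) if a <= b else (b, a) # 1 comparison per pair
        if small < lo:                            # 1 comparison
            lo = small
        if big > hi:                              # 1 comparison
            hi = big
    return lo, hi

print(min_and_max([3, 1, 4, 1, 5, 9, 2, 6]))   # (1, 9)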

Why does collect_list in Spark not use partial aggregation

I recently played around with UDAFs and looked into the source code of the built-in aggregation function collect_list. I was surprised to see that collect_list does not have a merge method implemented, although I think this is really straightforward (just concatenate two arrays). Code taken from org.apache.spark.sql.catalyst.expressions.aggregate.collect.Collect:
override def merge(buffer: InternalRow, input: InternalRow): Unit = {
  sys.error("Collect cannot be used in partial aggregations.")
}
This is no longer the case, as of SPARK-1893, but I'd assume that the initial design had mostly collect_list in mind.
Because collect_list is logically equivalent to groupByKey, the motivation would be exactly the same: to avoid long GC pauses. In particular, map-side combine in groupByKey has been disabled with SPARK-772:
Map side combine in group by key case does not reduce the amount of data shuffled. Instead, it forces a lot more objects to go into old gen, and leads to worse GC.
So, to address your comment:
I think this is really straightforward (just concatenate two arrays).
It might be simple, but it doesn't add much value (unless there is another reducing operation on top of it), and sequence concatenation is expensive.

Efficient algorithm for grouping array of strings by prefixes

I wonder what the best way is to group an array of strings according to a list of prefixes (of arbitrary length).
For example, if we have this:
prefixes = ['GENERAL', 'COMMON', 'HY-PHE-NATED', 'UNDERSCORED_']
Then
tasks = ['COMMONA', 'COMMONB', 'GENERALA', 'HY-PHE-NATEDA', 'UNDERSCORED_A', 'HY-PHE-NATEDB']
Should be grouped this way:
[['GENERALA'], ['COMMONA', 'COMMONB'], ['HY-PHE-NATEDA', 'HY-PHE-NATEDB'], ['UNDERSCORED_A']]
The naïve approach is to loop through all the tasks with an inner loop through the prefixes (or vice versa) and test each task against each prefix.
Can anyone give me a hint on how to do this in a more efficient way?
It depends a bit on the size of your problem, of course, but your naive approach should be okay if you sort both your prefixes and your tasks and then build your sub-arrays by traversing both sorted lists only forwards.
There are a few options, but you might be interested in looking into the trie data structure.
http://en.wikipedia.org/wiki/Trie
The trie data structure is easy to understand and implement and works well for this type of problem. If you find that this works for your situation, you can also look at Patricia tries, which achieve similar performance characteristics but typically have better memory utilization. They are a little more involved to implement but not overly complex.
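A sketch of the trie idea in Python (the helper names are mine): insert every prefix into a trie, then walk each task down the trie and group it under the deepest prefix that terminates along that path, so each task costs time proportional to its own length regardless of how many prefixes there are:

def build_trie(prefixes):
    root = {}
    for p in prefixes:
        node = root
        for ch in p:
            node = node.setdefault(ch, {})
        node['$'] = p                 # '$' marks the end of a prefix
    return root

def group_by_prefix(prefixes, tasks):
    trie = build_trie(prefixes)
    groups = {p: [] for p in prefixes}
    ungrouped = []
    for t in tasks:
        node, match = trie, None
        for ch in t:
            if ch not in node:
                break
            node = node[ch]
            if '$' in node:           # remember the longest matching prefix so far
                match = node['$']
        (groups[match] if match is not None else ungrouped).append(t)
    return groups, ungrouped

prefixes = ['GENERAL', 'COMMON', 'HY-PHE-NATED', 'UNDERSCORED_']
tasks = ['COMMONA', 'COMMONB', 'GENERALA', 'HY-PHE-NATEDA', 'UNDERSCORED_A', 'HY-PHE-NATEDB']
print(group_by_prefix(prefixes, tasks)[0])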

efficiently compute the edit distance between 1 string and a large set of other strings?

The use case is auto-complete options, where I want to rank a large set of other strings by how similar they are to a fixed string.
Is there any bastardization of something like a DFA regex that can do a better job than the start-over-on-each-option solution?
The guy who asked this question seems to know of a solution but doesn't list any sources.
(P.S. "Read this link"-type answers are welcome.)
I did something like this recently. Unfortunately it's closed source.
The solution is to write a Levenshtein automaton. Spoiler: it's an NFA.
Although many people will try to convince you that simulating NFAs is exponential, it isn't. Creating a DFA from an NFA is exponential; simulating one is just polynomial. Many regex engines are written with sub-optimal algorithms based on this misconception.
NFA simulation is O(n*m) for an n-sized string and m states, or O(n) amortized if you convert it to a DFA lazily (and cache it).
I'm afraid you'll either have to deal with complex automata libraries or have to write a lot of code (which is what I did).
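A Python sketch of the NFA simulation described above (not the answerer's closed-source code): each state is (characters of the query matched, edits used), and a candidate is accepted if some state reaches the end of the query within the edit budget:

def levenshtein_match(query, candidate, max_edits):
    # True if edit distance(query, candidate) <= max_edits
    n = len(query)

    def closure(states):
        # epsilon moves: skip (delete) query characters at a cost of one edit each
        stack, seen = list(states), set(states)
        while stack:
            i, e = stack.pop()
            if i < n and e < max_edits and (i + 1, e + 1) not in seen:
                seen.add((i + 1, e + 1))
                stack.append((i + 1, e + 1))
        return seen

    states = closure({(0, 0)})
    for c in candidate:
        nxt = set()
        for i, e in states:
            if i < n and query[i] == c:
                nxt.add((i + 1, e))          # match: consume c and advance, no edit
            if e < max_edits:
                nxt.add((i, e + 1))          # insertion: consume c without advancing
                if i < n:
                    nxt.add((i + 1, e + 1))  # substitution: consume c and advance
        states = closure(nxt)
        if not states:                       # edit budget exhausted, prune early
            return False
    return any(i == n for i, _ in states)

print(levenshtein_match("spark", "sprak", 2))   # True: within 2 edits

Candidates can then be ranked by the smallest edit budget that accepts them, reusing the same automaton for every candidate.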
