Find first element in RDD satisfying the given predicate - apache-spark

How do I find the first element in a normal RDD that satisfies a given predicate? (In a PairRDD we can use the lookup(key) API.) Once the first matching element is found, the RDD traversal should stop.
I'm looking for a solution that doesn't use legacy for loops.

How about
rdd.filter(p).top(1)
or if you don't have an order on the RDD
rdd.filter(p).take(1)

The solutions stated above are correct. Here is another method to achieve the same goal:
rdd.filter(p).first
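For illustration, a minimal PySpark sketch of the filter + take(1) approach (the sample RDD and predicate are made up). take(1) only evaluates as many partitions as it needs to return one element, so Spark does not traverse the whole RDD once a match turns up early:
# Minimal sketch, assuming a PySpark context; take(1) returns a list with at
# most one element, so an empty list means nothing satisfied the predicate.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.parallelize(range(1, 1000000))

p = lambda x: x % 7919 == 0              # example predicate
first_match = rdd.filter(p).take(1)      # [] if nothing matches, else [element]
print(first_match[0] if first_match else None)   # 7919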

Related

Elegant application of attribute set/get or method to several objects of same type

I'd like to achieve the following without using loops or comprehensions (at least not explicitly), as I believe this is more elegant, especially given how deeply nested the attributes I am using are.
Say I have an object nested like so: a.b.c.d.e, where e is of type NodeList. NodeList holds a list of Node objects internally. You can use a.b.c.d.e.all() to get this list of all objects.
Node has a member val that I would like to set on all nodes in e.
I want syntax like this to work: a.b.c.d.e.all().val = <val>. Is there a way I can implement the all() method such that this is possible?
Thanks.
I've decided to achieve this by implementing:
a.b.c.d.e.all(lambda x: setattr(x, 'val', <val>))
where all applies the given callable to each Node (an assignment such as x.val = <val> is not allowed inside a lambda, so setattr is used instead).
I think this solution is quite nice, though not ideal. If you have any other ideas, please let me know. Thanks.
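Under these assumptions, here is a minimal sketch of how such an all() method could be implemented (Node and NodeList are reconstructed purely for illustration; they are not the poster's actual classes):
class Node:
    def __init__(self, val=None):
        self.val = val

class NodeList:
    def __init__(self, nodes):
        self._nodes = list(nodes)

    def all(self, fn=None):
        if fn is None:
            return list(self._nodes)     # original behaviour: return every Node
        for node in self._nodes:
            fn(node)                     # apply the callable to each Node
        return self._nodes

# Usage: a lambda cannot contain an assignment, so setattr does the update.
nodes = NodeList([Node(1), Node(2)])
nodes.all(lambda x: setattr(x, "val", 42))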

Optimal way to add an element in k-th position of list of size n if k<n

I know it's possible to add an element inside a list (not as the first element nor as the last) by redefining the list as a concatenation of three lists:
# I want to add 5 into [1,2,3,4,6,7,8,9,0] between the 4 and the 6
A=[1,2,3,4,6,7,8,9,0]
A=[1,2,3,4]+[5]+[6,7,8,9,0]
but I think this isn't optimal, since I'm creating three lists and re-defining a variable. Could someone show me the best way to do this?
You can use the list's insert method.
L = [1,2,3,4,6,7,8,9,0]
L.insert(4,5)
This is the idiomatic way in Python; insert is still O(n) because the elements after the insertion point are shifted, so if you need a faster insertion operation, consider another data structure depending on your needs.
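For illustration, a quick comparison of the two approaches; list.insert modifies the list in place, while the slicing version builds a new list:
A = [1, 2, 3, 4, 6, 7, 8, 9, 0]
B = A[:4] + [5] + A[4:]          # slicing: builds a new list
A.insert(4, 5)                   # insert: modifies A in place
print(A == B)                    # True -> both give [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]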

Python: Sort 2-tuple sometimes according to first and sometimes according to second element

I have a list of integer tuples, e.g. [(2,10), [...] (4,11),(3,9)].
Tuples are added to and deleted from the list regularly. It will contain up to ~5000 elements.
In my code I sometimes need this list sorted by the first and sometimes by the second tuple element, so the ordering of the list changes drastically, and re-sorting might take place at any time.
Python's Timsort is only fast when a list is already largely sorted, so this general approach of frequent re-sorting might be inefficient. A better approach would be to use two naturally sorted data structures such as SortedList, but then I would need two lists (one for the first tuple element and one for the second) as well as a dictionary to map between the tuples.
What is the pythonic way to solve this?
In Java I would do it like this:
TreeSet<Integer> leftTupleEntry = new TreeSet<Integer>();
TreeSet<Integer> rightTupleEntry = new TreeSet<Integer>();
HashMap<Integer, Integer> tupleMap = new HashMap<Integer, Integer>();
This gives both sorting strategies the best runtime complexity class, as well as the necessary connection between the two numbers.
When I need the list sorted by the first element, I have to access the whole list (as I need to calculate a cumulative sum, among other operations).
When I need it sorted by the second element, I'm only interested in the smallest elements, which are then usually deleted.
Typically, after any insertion, a new sort by the first element is requested.
first_element_list = sorted([i[0] for i in list_tuple])
second_element_list = sorted([i[1] for i in list_tuple])
What I did:
I used a SortedKeyList keyed by the first tuple element. Inserting into this list is O(log(n)), and reading from it is O(log(n)) too.
from operator import itemgetter
from sortedcontainers import SortedKeyList
self.list = SortedKeyList(key=itemgetter(0))
self.list.add((1,4))
self.list.add((2,6))
When I need the argmin according to the second tuple element I used
np.argmin(self.list, axis=0)[1]
which is O(n) and not optimal.
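As a sketch of the two-structure idea mentioned in the question (the class and method names are made up): keeping the same tuples in two SortedKeyLists, one keyed by the first element and one by the second, keeps additions and removals at O(log(n)) in both orders, and the tuple with the smallest second element is simply the first entry of the second list.
from operator import itemgetter
from sortedcontainers import SortedKeyList

class TuplePool:
    """Illustrative sketch: the same tuples kept in two sort orders."""
    def __init__(self):
        self.by_first = SortedKeyList(key=itemgetter(0))
        self.by_second = SortedKeyList(key=itemgetter(1))

    def add(self, t):                    # O(log n) in both structures
        self.by_first.add(t)
        self.by_second.add(t)

    def remove(self, t):                 # O(log n) in both structures
        self.by_first.remove(t)
        self.by_second.remove(t)

    def min_by_second(self):             # tuple with the smallest second element
        return self.by_second[0]

    def in_first_order(self):            # e.g. for the cumulative sum
        return iter(self.by_first)

pool = TuplePool()
pool.add((2, 10))
pool.add((4, 11))
pool.add((3, 9))
print(pool.min_by_second())              # (3, 9)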

How to use values (as Column) in function (from functions object) where Scala non-SQL types are expected?

I'd like to understand how I can dynamically add a number of days to a given timestamp. I tried something similar to the example shown below. The issue is that the second argument is expected to be of type Int, but in my case it is of type Column. How do I unbox this / get the actual value? (The code examples below might not be 100% correct, as I'm writing this from the top of my head; I don't have the actual code with me currently.)
myDataset.withColumn("finalDate",date_add(col("date"),col("no_of_days")))
I tried casting:
myDataset.withColumn("finalDate",date_add(col("date"),col("no_of_days").cast(IntegerType)))
But this did not help either. So how is it possible to solve this?
I did find a workaround by using selectExpr:
myDataset.selectExpr("date_add(date,no_of_days) as finalDate")
While this works, I still would like to understand how to get the same result with withColumn.
withColumn("finalDate", expr("date_add(date,no_of_days)"))
The above syntax should work; expr is available from org.apache.spark.sql.functions.
I think it's not possible otherwise, as you'd have to bridge two separate, similar-looking type systems: Scala's and Spark SQL's.
What you call a workaround using selectExpr is probably the way to go, because you stay within a single type system (Spark SQL's); since the parameters are all defined in Spark SQL's "realm", that's the only possible way.
myDataset.selectExpr("date_add(date,no_of_days) as finalDate")
BTW, you've just shown me another area where SQL support differs from the Dataset Query DSL: the source of the parameters to functions -- only from structured data sources, only from Scala, or a mixture thereof (as in UDFs and UDAFs). Thanks!
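For what it's worth, the same expr-based idea in PySpark (just an illustration with made-up sample data; the question itself is about the Scala API):
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2017-01-01", 3)], ["date", "no_of_days"])

# expr() keeps the whole expression inside Spark SQL's type system,
# so no_of_days never has to be unboxed into a plain Int.
df.withColumn("finalDate", expr("date_add(date, no_of_days)")).show()
# finalDate -> 2017-01-04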

groupByKey and distinct VS combineByKey

I'm learning Spark and playing around with some "exercises".
At some point, I have a JavaPairRDD<Integer, Integer> which contains duplicates. My goal is to group by keys and remove the duplicates.
I see (at least) two possibilities:
Use distinct() and groupByKey()
Use combineByKey() and make sure that we only add elements we didn't have before (so that the result is, for example, a JavaPairRDD<Integer, Set<Integer>>)
I used the second approach (to avoid the extra shuffle caused by distinct()), but then saw that the original "solution" was the first one.
Which solution is the most efficient in your opinion?
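For illustration only, a PySpark sketch of the two approaches described above (the question uses the Java API, but the RDD operations carry over; the sample data is made up):
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
pairs = sc.parallelize([(1, 10), (1, 10), (1, 20), (2, 30), (2, 30)])

# Approach 1: drop duplicate pairs first, then group by key
grouped_1 = pairs.distinct().groupByKey().mapValues(set)

# Approach 2: build a set per key while combining, so duplicates vanish during the merge
grouped_2 = pairs.combineByKey(
    lambda v: {v},                 # createCombiner: start a set from the first value
    lambda acc, v: acc | {v},      # mergeValue: add a value within a partition
    lambda a, b: a | b)            # mergeCombiners: union sets across partitions

print(sorted(grouped_1.collect()))   # [(1, {10, 20}), (2, {30})]
print(sorted(grouped_2.collect()))   # [(1, {10, 20}), (2, {30})]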
