Can we apply an Action on another Action in Spark? - apache-spark

I am trying to run some basic Spark applications.
Can we apply an Action on the result of another Action,
or
can an Action be applied only on a transformed RDD?
val numbersRDD = sc.parallelize(Array(1,2,3,4,5));
val topnumbersRDD = numbersRDD.take(2)
scala> topnumbersRDD.count
<console>:17: error: missing arguments for method count in trait TraversableOnce;
follow this method with `_' if you want to treat it as a partially applied function
topnumbersRDD.count
^
I would like to know why I am getting the above error.
Also, what can I do if I want to find the count of the first 2 numbers? I need the output to be 2.

Actions can be applied on an RDD or a DataFrame. The take method is itself an action and returns a plain Array, so topnumbersRDD here is an Array[Int], not an RDD; Scala's Array.count expects a predicate argument, which is why you get the "missing arguments" error. To count the elements of the array, use its length (or size).
If you want to select data with a condition, use filter instead, which is a transformation and returns a new RDD on which you can then call count.
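A minimal sketch of both approaches, reusing the numbersRDD from the question (the filter predicate below just happens to pick the first two values of this particular data):
val numbersRDD = sc.parallelize(Array(1, 2, 3, 4, 5))
// take is an action: it returns a local Array, not an RDD
val topNumbers = numbersRDD.take(2)   // Array(1, 2)
topNumbers.length                     // 2, counted with the Scala Array API
// Alternatively, stay with RDDs: filter is a transformation, count is an action
numbersRDD.filter(_ <= 2).count()     // 2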

Related

Spark: put hashmap into Dataset column?

I have a Dataset<Row> which comes from reading a Parquet file. One column inside it, InfoMap, is of type Map.
Now I want to update this column, but when I use withColumn it tells me that I cannot put a HashMap inside because it is not a literal.
I want to know the correct way to update a column of type Map for a Dataset.
Try using typedLit instead of lit. From the typedLit documentation:
"...The difference between this function and lit() is that this
function can handle parameterized scala types e.g.: List, Seq and Map"
data.withColumn("dictionary", typedLit(Map("foo" -> 1, "bar" -> 2)))

How to get back a normal DataFrame after invoking groupBy

For a simple grouping operation, the returned type is apparently no longer a DataFrame:
val itemsQtyDf = pkgItemsDf.groupBy($"packageid").withColumn("totalqty",sum("qty"))
We cannot, however, invoke DataFrame operations after the groupBy, since it returns a GroupedData:
Error:(26, 55) value withColumn is not a member of org.apache.spark.sql.GroupedData
So how do I get my DataFrame back after a grouping? Is it necessary to use DataFrame.agg() instead?
Grouping without an aggregate function suggests you may want the distinct() function instead, which does return a DataFrame (see the sketch after the code below). But since your example shows you want sum("qty"), just change your code to this:
pkgItemsDf.groupBy($"packageid").agg(sum("qty").alias("totalqty"))
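For the case where you really only need the unique grouping keys and no aggregate, a minimal sketch assuming the same pkgItemsDf:
// distinct() on the selected grouping column returns a DataFrame of the unique keys
val uniquePackages = pkgItemsDf.select($"packageid").distinct()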

Spark: Join within UDF or map function

I have to write a complex UDF in which I have to do a join with a different table and return the number of matches. The actual use case is much more complex, but I've simplified it here to a minimal reproducible example. Here is the UDF code:
def predict_id(date, zip):
    filtered_ids = contest_savm.where((F.col('postal_code') == zip) & (F.col('start_date') >= date))
    return filtered_ids.count()
When I define the UDF using the below code, I get a long list of console errors:
predict_id_udf = F.udf(predict_id,types.IntegerType())
The final line of the error is:
py4j.Py4JException: Method __getnewargs__([]) does not exist
I want to know the best way to go about this. I also tried map, like this:
result_rdd = df.select("party_id").rdd \
    .map(lambda x: predict_id(x[0], x[1])) \
    .distinct()
It also resulted in a similar final error. I want to know if there is any way I can do a join within a UDF or map function, for each row of the original dataframe.
I have to write a complex UDF, in which I have to do a join with a different table, and return the number of matches.
It is not possible by design. If you want to achieve an effect like this, you have to use high-level DataFrame / RDD operators:
df.join(contest_savm,
        (F.col('postal_code') == df["zip"]) & (F.col('start_date') >= df["date"])
).groupBy(*df.columns).count()

Grouping a column as list in spark

I have data in the form ("name", "id"), where one name can have multiple ids. I wish to convert it to a List or Set so that I have a list of ids corresponding to each name, and "name" becomes a unique field.
I tried the following but it seems to be wrong:
val group = dataFrame.map( r => (dataFrame.rdd.filter(s => s.getAs(0) == r.getAs(0)).collect()))
I get the following error:
org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
What is the solution to this? Does groupBy work here, and if so, how?
Assuming this is your DataFrame:
val df = Seq(("dave",1),("dave",2),("griffin",3),("griffin",4)).toDF("name","id")
You can then do:
df.groupBy(col("name")).agg(collect_list(col("id")) as "ids")
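If you want a Set rather than a List (i.e. duplicate ids removed per name), collect_set works the same way; a minimal sketch assuming the same df:
import org.apache.spark.sql.functions.{col, collect_set}
// collect_set gathers the ids per name and eliminates duplicates
df.groupBy(col("name")).agg(collect_set(col("id")) as "ids")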

How to return a single field from each line in a pyspark RDD?

I'm sure this is very simple, but despite my trying and research I can't find the solution. I'm working with flight info here.
I have an RDD with contents of:
[u'2007-09-22,9E,20363,TUL,OK,36.19,-95.88,MSP,MN,44.88,-93.22,1745,1737,-8,1953,1934,-19', u'2004-02-12,NW,19386,DEN,CO,39.86,-104.67,MSP,MN,44.88,-93.22,1050,1050,0,1341,1342,1', u'2007-05-07,F9,20436,DEN,CO,39.86,-104.67,MSP,MN,44.88,-93.22,1030,1040,10,1325,1347,22']
What transformation do I need in order to make a new RDD containing only the 2nd field from each line?
[u'9E',u'NW',u'F9']
I've tried filter but can't make it work. This just gives me the entire line, and I only want the 2nd field from each line:
new_rdd = current_rdd.filter(lambda x: x.split(',')[1])
Here is the solution:
data = [u'2007-09-22,9E,20363,TUL,OK,36.19,-95.88,MSP,MN,44.88,-93.22,1745,1737,-8,1953,1934,-19', u'2004-02-12,NW,19386,DEN,CO,39.86,-104.67,MSP,MN,44.88,-93.22,1050,1050,0,1341,1342,1', u'2007-05-07,F9,20436,DEN,CO,39.86,-104.67,MSP,MN,44.88,-93.22,1030,1040,10,1325,1347,22']
current_rdd = sc.parallelize(data)
rdd = current_rdd.map(lambda x : x.split(',')[1])
rdd.take(10)
# [u'9E', u'NW', u'F9']
You are using filter for the wrong purpose. Let's recall the definition of the filter function:
filter(f) - Return a new RDD containing only the elements that satisfy a predicate.
whereas map returns a new RDD by applying a function to each element of this RDD, which is what you need here.
I advise reading the PySpark RDD API documentation to learn more about it.