Grouping a column as list in spark - apache-spark

I have data in the form ("name","id"), where one name can have multiple ids. I wish to convert it to a List or Set so that I have a list of ids corresponding to each name and "name" becomes a unique field.
I tried the following but it seems to be wrong:
val group = dataFrame.map( r => (dataFrame.rdd.filter(s => s.getAs(0) == r.getAs(0)).collect()))
I get the following error:
org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
What is the solution to this? Does groupBy work here, and if so, how?

Assuming this is your DataFrame:
val df = Seq(("dave",1),("dave",2),("griffin",3),("griffin",4)).toDF("name","id")
You can then do:
df.groupBy(col("name")).agg(collect_list(col("id")) as "ids")
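As a self-contained sketch reusing the df defined above (and assuming spark.implicits._ is imported for toDF), here is the same idea with the needed import; collect_set gives the de-duplicated "Set" variant mentioned in the question:
import org.apache.spark.sql.functions.{col, collect_list, collect_set}
// one row per name, with all ids gathered into an array column
val idsAsList = df.groupBy(col("name")).agg(collect_list(col("id")) as "ids")
// same idea, but with duplicate ids removed (the "Set" variant from the question)
val idsAsSet = df.groupBy(col("name")).agg(collect_set(col("id")) as "ids")
idsAsList.show()
// name "dave" -> ids [1, 2], name "griffin" -> ids [3, 4] (row order may vary)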

Related

PySpark isin function

I am converting my legacy Python code to Spark using PySpark.
I would like to get a PySpark equivalent of:
usersofinterest = actdataall[actdataall['ORDValue'].isin(orddata['ORDER_ID'].unique())]['User ID']
Both actdataall and orddata are Spark dataframes.
I don't want to use the toPandas() function, given the drawbacks associated with it.
If both dataframes are big, you should consider using an inner join which will work as a filter:
First let's create a dataframe containing the order IDs we want to keep:
orderid_df = orddata.select(orddata.ORDER_ID.alias("ORDValue")).distinct()
Now let's join it with our actdataall dataframe:
usersofinterest = actdataall.join(orderid_df, "ORDValue", "inner").select('User ID').distinct()
If your target list of order IDs is small, then you can use the pyspark.sql isin function as mentioned in furianpandit's post; don't forget to broadcast your variable before using it (Spark will copy the object to every node, making tasks a lot faster):
# collect the distinct order IDs to the driver as a plain Python list
orderid_list = orddata.select('ORDER_ID').distinct().rdd.flatMap(lambda x: x).collect()
# broadcast the list so each executor gets a local read-only copy
orderid_list_broadcast = sc.broadcast(orderid_list)
The most direct translation of your code would be:
from pyspark.sql import functions as F
# collect all the unique ORDER_IDs to the driver
order_ids = [x.ORDER_ID for x in orddata.select('ORDER_ID').distinct().collect()]
# filter ORDValue column by list of order_ids, then select only User ID column
usersofinterest = actdataall.filter(F.col('ORDValue').isin(order_ids)).select('User ID')
However, you should only filter like this if the number of 'ORDER_ID's is definitely small (perhaps <100,000 or so).
If the number of 'ORDER_ID's is large, you should use a broadcast variable, which sends the list of order_ids to each executor so it can compare against them locally for faster processing. Note that this will work even if the number of 'ORDER_ID's is small.
order_ids = [x.ORDER_ID for x in orddata.select('ORDER_ID').distinct().collect()]
order_ids_broadcast = sc.broadcast(order_ids) # send to broadcast variable
usersofinterest = actdataall.filter(F.col('ORDValue').isin(order_ids_broadcast.value)).select('User ID')
For more information on broadcast variables, check out: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-broadcast.html
So, you have two Spark dataframes, actdataall and orddata; use the following command to get your desired result:
usersofinterest = actdataall.where(actdataall['ORDValue'].isin(orddata.select('ORDER_ID').distinct().rdd.flatMap(lambda x: x).collect())).select('User ID')

Apache Spark 2.0 Dataframes (Dataset) group by multiple aggregations and new column naming

Aggregating multiple columns:
I have a dataframe input.
I would like to apply different aggregation functions per grouped columns.
In the simple case, I can do this, and it works as intended:
val x = input.groupBy("user.lang").agg(Map("user.followers_count" -> "avg", "user.friends_count" -> "avg"))
However, if I want to add more aggregation functions for the same column, some of them are dropped (a Scala Map keeps only the last entry for a duplicate key), for instance:
val x = input.groupBy("user.lang").agg(Map("user.followers_count" -> "avg", "user.followers_count" -> "max", "user.friends_count" -> "avg"))
As I am passing a map it is not exactly surprising. How can I resolve this problem and add another aggregation function for the same column?
It is my understanding that this could be a possible solution:
val x = input.groupBy("user.lang").agg(avg($"user.followers_count"), max($"user.followers_count"), avg("user.friends_count"))
This, however, returns an error: error: not found: value avg.
New column naming:
In the first case, I end up with new column names such as avg(user.followers_count AS `followers_count`) and avg(user.friends_count AS `friends_count`). Is it possible to define a new column name for the aggregation process?
I know that using SQL syntax might be a solution for this, but my goal eventually is to be able to pass arguments via command line (group by columns, aggregation columns and functions) so I'm trying to construct the pipeline that would allow this.
Thanks for reading this!
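For what it's worth, the not found: value avg error usually just means the built-in aggregate functions are not in scope. A minimal sketch (assuming Spark 2.x, the same input DataFrame, and spark.implicits._ imported for the $ syntax), which also renames the result columns with .as:
import org.apache.spark.sql.functions.{avg, max}
// several aggregations on the same column, each given an explicit result name
val x = input.groupBy("user.lang").agg(
  avg($"user.followers_count").as("avg_followers_count"),
  max($"user.followers_count").as("max_followers_count"),
  avg($"user.friends_count").as("avg_friends_count")
)
The .as (or .alias) call also addresses the column-naming question: each aggregate column gets exactly the name you assign.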

Can we apply a Action on another Action in Spark?

I am trying to run some basic Spark applications.
Can we apply an action on another action?
or
Can an action be applied only on a transformed RDD?
val numbersRDD = sc.parallelize(Array(1,2,3,4,5));
val topnumbersRDD = numbersRDD.take(2)
scala> topnumbersRDD.count
<console>:17: error: missing arguments for method count in trait TraversableOnce;
follow this method with `_' if you want to treat it as a partially applied function
topnumbersRDD.count
^
I would like to know why I am getting the above error.
Also, what can I do if I want to find the count of the first 2 numbers? I need the output to be 2.
Actions can be applied to both RDDs and DataFrames. The take method returns an Array, so you can use the array's length (or size) to count its elements.
If you want to select data based on a condition, you can use filter, which returns a new RDD.
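To make the distinction concrete, a short sketch along those lines (same numbersRDD as above): take is an action that returns a plain local Array, so counting its elements is ordinary Scala, whereas staying inside Spark means applying a transformation first and a single action last:
val numbersRDD = sc.parallelize(Array(1, 2, 3, 4, 5))
// take is an action: it returns a local Array on the driver, not an RDD
val topNumbers = numbersRDD.take(2)
println(topNumbers.length) // 2 -- plain Scala collection method, no further Spark action
// to keep the work in Spark, transform first (filter) and finish with one action (count)
val firstTwoCount = numbersRDD.filter(_ <= 2).count() // 2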

How to return a single field from each line in a pyspark RDD?

I'm sure this is very simple, but despite my trying and research I can't find the solution. I'm working with flight info here.
I have an RDD with contents of:
[u'2007-09-22,9E,20363,TUL,OK,36.19,-95.88,MSP,MN,44.88,-93.22,1745,1737,-8,1953,1934,-19', u'2004-02-12,NW,19386,DEN,CO,39.86,-104.67,MSP,MN,44.88,-93.22,1050,1050,0,1341,1342,1', u'2007-05-07,F9,20436,DEN,CO,39.86,-104.67,MSP,MN,44.88,-93.22,1030,1040,10,1325,1347,22']
What transform do I need in order to make a new RDD with just the 2nd field from each line?
[u'9E',u'NW',u'F9']
I've tried filtering but can't make it work. This just gives me the entire line and I only want the 2nd field from each line.
new_rdd = current_rdd.filter(lambda x: x.split(',')[1])
Here is the solution:
data = [u'2007-09-22,9E,20363,TUL,OK,36.19,-95.88,MSP,MN,44.88,-93.22,1745,1737,-8,1953,1934,-19', u'2004-02-12,NW,19386,DEN,CO,39.86,-104.67,MSP,MN,44.88,-93.22,1050,1050,0,1341,1342,1', u'2007-05-07,F9,20436,DEN,CO,39.86,-104.67,MSP,MN,44.88,-93.22,1030,1040,10,1325,1347,22']
current_rdd = sc.parallelize(data)
rdd = current_rdd.map(lambda x : x.split(',')[1])
rdd.take(10)
# [u'9E', u'NW', u'F9']
You are using filter for the wrong purpose. So let's recall the definition of the filter function:
filter(f) - Return a new RDD containing only the elements that satisfy a predicate.
whereas map returns a new RDD by applying a function to each element of this RDD, which is what you need.
I advise reading the PythonRDD API documentation to learn more about it.

non-ordinal access to rows returned by Spark SQL query

In the Spark documentation, it is stated that the result of a Spark SQL query is a SchemaRDD. Each row of this SchemaRDD can in turn be accessed by ordinal. I am wondering if there is any way to access the columns using the field names of the case class on top of which the SQL query was built. I appreciate the fact that the case class is not associated with the result, especially if I have selected individual columns and/or aliased them: however, some way to access fields by name rather than ordinal would be convenient.
A simple way is to use the "language-integrated" select method on the resulting SchemaRDD to select the column(s) you want -- this still gives you a SchemaRDD, and if you select more than one column then you will still need to use ordinals, but you can always select one column at a time. Example:
// setup and some data
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class Score(name: String, value: Int)
val scores =
sc.textFile("data.txt").map(_.split(",")).map(s => Score(s(0),s(1).trim.toInt))
scores.registerAsTable("scores")
// initial query
val original =
sqlContext.sql("Select value AS myVal, name FROM scores WHERE name = 'foo'")
// now a simple "language-integrated" query -- no registration required
val secondary = original.select('myVal)
secondary.collect().foreach(println)
Now secondary is a SchemaRDD with just one column, and it works despite the alias in the original query.
Edit: but note that you can register the resulting SchemaRDD and query it with straight SQL syntax without needing another case class.
original.registerAsTable("original")
val secondary = sqlContext.sql("select myVal from original")
secondary.collect().foreach(println)
Second edit: When processing an RDD one row at a time, it's possible to access the columns by name using pattern-matching syntax:
val secondary = original.map {case Row(myVal: Int, _) => myVal}
although this could get cumbersome if the right-hand side of the '=>' requires access to many of the columns, as they would each need to be matched on the left. (This is from a very useful comment in the source code for the Row companion object.)
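As an aside, in later Spark releases (1.3+), where SchemaRDD evolved into DataFrame, a Row can also be read by field name directly via getAs, which avoids long patterns when many columns are involved. A small sketch under that assumption, with original standing in for the query result above:
// access the column by name instead of by position or by pattern
val byName = original.rdd.map(_.getAs[Int]("myVal"))
byName.collect().foreach(println)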
