Spark: put HashMap into Dataset column?

I have a Dataset<Row> that comes from reading a Parquet file. One of its columns, InfoMap, is of type Map.
Now I want to update this column, but when I use withColumn, Spark tells me that I cannot put a HashMap inside because it is not a literal.
What is the correct way to update a column of type Map on a Dataset?

Try using typedLit instead of lit. From the typedLit documentation:
"...The difference between this function and lit() is that this function can handle parameterized scala types e.g.: List, Seq and Map"
data.withColumn("dictionary", typedLit(Map("foo" -> 1, "bar" -> 2)))
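For example, a minimal sketch of updating a Map column read from Parquet (the path is a placeholder, the InfoMap value type Map[String, Int] is an assumption, and map_concat needs Spark 2.4+):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, map_concat, typedLit}

val spark = SparkSession.builder().getOrCreate()

// Hypothetical input: a Dataset[Row] with a Map-typed column "InfoMap"
val data = spark.read.parquet("/path/to/data.parquet")

// Replace the column wholesale with a literal Map
val replaced = data.withColumn("InfoMap", typedLit(Map("foo" -> 1, "bar" -> 2)))

// Or merge new entries into the existing map (key/value types must match the column's)
val merged = data.withColumn("InfoMap", map_concat(col("InfoMap"), typedLit(Map("baz" -> 3))))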

Related

Returning a default value when looking up a pyspark map Column

I have a map column that I created using pyspark.sql.functions.create_map. I am performing some actions that require me to look up values in this map column, as shown below:
lookup_map[col("col1")]
If a key does not exist in the lookup_map column, I want the lookup to return a default value. How can I achieve this?
Use coalesce:
F.coalesce(lookup_map[col("col1")], F.lit("default"))
E.g., for the map below:
from itertools import chain
from pyspark.sql import functions as F

mapping = {'1': 'value'}
mapping_expr = F.create_map([F.lit(x) for x in chain(*mapping.items())])
given an input DataFrame df with an id column, the output of
df.withColumn("value", F.coalesce(mapping_expr[F.col("id")], F.lit("x"))).show()
will show 'value' for rows where id is '1' and the default 'x' for every other id.
I managed to do it using when and otherwise
df.withColumn("col", when(col("mapColumn").getItem("key").isNotNull(), col("mapColumn").getItem("key")).otherwise(lit("DEFAULT_VALUE")))
There is another answer that suggests using coalesce. I haven't tried it, and I am not sure whether there is any difference in performance between the two.

How to add a column to a DataFrame whose value is fetched from a map, using another column from the row as the key

I'm new to Spark and trying to figure out how to add a column to a DataFrame whose value is fetched from a HashMap, using another column on the same row as the key.
For example, I have a map defined as follows:
var myMap: Map[Int, Int] = generateMap()
I want to add a new column to my DataFrame whose value is fetched from this map, using a current column value as the key. A solution might look like this:
val newDataFrame = dataFrame.withColumn("NEW_COLUMN", lit(myMap.get(col("EXISTING_COLUMN"))))
My issue with this code is that col returns a Column, not an Int like the keys in my HashMap.
Any suggestions?
I would create a DataFrame from the map and then do a join operation. It should be faster than a UDF, and the result can be reused.
A UDF (user-defined function) could also be used, but UDFs are black boxes to Catalyst, so I would be prudent in using them. Depending on where the content of the map lives, it may also be complicated to pass it to a UDF.
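A minimal sketch of the join approach (myMap, dataFrame, and the column names are from the question; it assumes a SparkSession named spark and a map small enough to broadcast):

import org.apache.spark.sql.functions.broadcast
import spark.implicits._

// Turn the map into a two-column DataFrame
val mapDf = myMap.toSeq.toDF("key", "NEW_COLUMN")

// A left join keeps rows whose key has no entry in the map (NEW_COLUMN stays null there)
val newDataFrame = dataFrame
  .join(broadcast(mapDf), dataFrame("EXISTING_COLUMN") === mapDf("key"), "left")
  .drop("key")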
As of the next version of the Kotlin API for Apache Spark, you will be able to simply create a udf that is usable in almost this way:
val mapUDF by udf { input: Int -> myMap[input] }
dataFrame.withColumn("NEW_COLUMN", mapUDF(col("EXISTING_COLUMN")))
You need to use a UDF:
import org.apache.spark.sql.functions.udf

val mapUDF = udf((i: Int) => myMap.getOrElse(i, 0))
val newDataFrame = dataFrame.withColumn("NEW_COLUMN", mapUDF(col("EXISTING_COLUMN")))

How to get back a normal DataFrame after invoking groupBy

For a simple grouping operation, the returned type is apparently no longer a DataFrame?
val itemsQtyDf = pkgItemsDf.groupBy($"packageid").withColumn("totalqty",sum("qty"))
We cannot, however, invoke DataFrame ops after the groupBy, since the result is a GroupedData:
Error:(26, 55) value withColumn is not a member of org.apache.spark.sql.GroupedData
So how do I get my DataFrame back after a grouping? Is it necessary to use agg() instead?
Grouping without an aggregate function suggests you may want distinct() instead, which does return a DataFrame. But your example shows you want sum("qty"), so just change your code to this:
pkgItemsDf.groupBy($"packageid").agg(sum("qty").alias("totalqty"))
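For illustration, a quick self-contained sketch (the sample rows are invented; a SparkSession named spark is assumed):

import org.apache.spark.sql.functions.sum
import spark.implicits._

// Toy data standing in for pkgItemsDf
val pkgItemsDf = Seq(("p1", 2), ("p1", 3), ("p2", 5)).toDF("packageid", "qty")

// agg() turns the grouped result back into a DataFrame
val itemsQtyDf = pkgItemsDf.groupBy($"packageid").agg(sum("qty").alias("totalqty"))
itemsQtyDf.show()  // p1 -> 5, p2 -> 5 (row order may vary)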

Apache Spark 2.0 Dataframes (Dataset) group by multiple aggregations and new column naming

Aggregating multiple columns:
I have a DataFrame input.
I would like to apply different aggregation functions to different columns after grouping.
In the simple case, I can do this, and it works as intended:
val x = input.groupBy("user.lang").agg(Map("user.followers_count" -> "avg", "user.friends_count" -> "avg"))
However, if I want to apply additional aggregation functions to the same column, all but one are lost, for instance:
val x = input.groupBy("user.lang").agg(Map("user.followers_count" -> "avg", "user.followers_count" -> "max", "user.friends_count" -> "avg"))
As I am passing a Map, whose keys must be unique, this is not exactly surprising. How can I resolve this and apply another aggregation function to the same column?
It is my understanding that this could be a possible solution:
val x = input.groupBy("user.lang").agg(avg($"user.followers_count"), max($"user.followers_count"), avg("user.friends_count"))
This, however, returns an error: not found: value avg.
New column naming:
In the first case, I end up with new column names such as avg(user.followers_count AS followers_count) and avg(user.friends_count AS friends_count). Is it possible to define a new column name for the aggregation?
I know that using SQL syntax might be a solution, but my eventual goal is to pass arguments via the command line (group-by columns, aggregation columns and functions), so I'm trying to construct a pipeline that allows this.
Thanks for reading this!
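For what it's worth, not found: value avg usually just means the aggregate functions were not imported. Here is a sketch of the column-based approach with explicit aliases (the alias names are illustrative, and a SparkSession named spark is assumed):

// avg and max live in the functions object; without this import,
// "not found: value avg" is exactly the error the compiler raises
import org.apache.spark.sql.functions.{avg, max}
import spark.implicits._  // for the $"..." column syntax

val x = input.groupBy("user.lang").agg(
  avg($"user.followers_count").alias("avg_followers"),
  max($"user.followers_count").alias("max_followers"),
  avg($"user.friends_count").alias("avg_friends")
)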

Can we apply an action on another action in Spark?

I am trying to run some basic Spark applications.
Can we apply an action on the result of another action, or can actions be applied only to a (transformed) RDD?
val numbersRDD = sc.parallelize(Array(1, 2, 3, 4, 5))
val topnumbersRDD = numbersRDD.take(2)
scala> topnumbersRDD.count
<console>:17: error: missing arguments for method count in trait TraversableOnce;
follow this method with `_' if you want to treat it as a partially applied function
topnumbersRDD.count
^
I would like to know why I am getting the above error.
Also, what can I do if I want to find the count of the first 2 numbers? I need the output to be 2.
Actions can be applied to both RDDs and DataFrames. The take method returns an Array, so you can use the array's length (or size) to count its elements.
If you want to select data matching a condition, use filter, which returns a new RDD.
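A short sketch putting both points together (numbersRDD is from the question):

val numbersRDD = sc.parallelize(Array(1, 2, 3, 4, 5))

// take is an action: it returns a plain Scala Array on the driver, not an RDD
val topNumbers: Array[Int] = numbersRDD.take(2)
println(topNumbers.length)  // 2

// filter is a transformation: it returns a new RDD, so actions like count work on it
println(numbersRDD.filter(_ <= 2).count())  // 2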
