I have a Dataset with the following schema
|-- Name: string (nullable = true)
|-- Values: long (nullable = true)
|-- Count: integer (nullable = true)
Input Dataset
|Name |Values |Count |
|A |1000 |1 |
|B |1150 |0 |
|C |500 |3 |
My result dataset needs to be of the format
|Sum(count>0)| sum(all) | Percentage |
| 1500 | 2650 | 56.60 |
I am currently able to get the sum(count>0) and sum(all) in individual datasets by running
val non_zero = df.filter(col(COUNT).>(0)).select(sum(VALUES).as(NON_ZERO_SUM))
val total = df.select(sum(col(VALUES)).as(TOTAL_SUM))
I'm at a loss as to how to merge the two independent datasets into a single dataset from which I could calculate the percentage.
Also, could this problem be solved in a better way?
I'd use a single aggregation:
import org.apache.spark.sql.functions._
df.select(
  sum(when($"count" > 0, $"values")).alias("NON_ZERO_SUM"),
  sum($"values").alias("TOTAL_SUM"))
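To make the arithmetic behind the single aggregation concrete, here is a plain-Python sketch (not Spark code) of what it computes on the sample rows from the question. The variable names rows, non_zero_sum, total_sum and percentage are illustrative assumptions.

```python
# Sample rows from the question: (Name, Values, Count)
rows = [("A", 1000, 1), ("B", 1150, 0), ("C", 500, 3)]

# Conditional sum: only rows whose Count is greater than 0 contribute,
# which mirrors sum(when(count > 0, values))
non_zero_sum = sum(values for _, values, count in rows if count > 0)

# Unconditional sum over all rows, which mirrors sum(values)
total_sum = sum(values for _, values, _ in rows)

# Share of the total that comes from rows with Count > 0, in percent
percentage = round(non_zero_sum / total_sum * 100, 2)

print(non_zero_sum, total_sum, percentage)
```

Because both sums come out of one pass over the same rows, there is nothing to merge afterwards; the percentage is just a derived column on the single result row.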
After running the ALS algorithm in PySpark over a dataset, I ended up with a final dataframe which looks like the following
The recommendation column is an array type. I want to split this column so that my final dataframe looks like this
Can anyone suggest which PySpark function can be used to build this dataframe?
Schema of the dataframe
|-- person: string (nullable = false)
|-- recommendation: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ID: string (nullable = true)
| | |-- rating: float (nullable = true)
Assuming ID doesn't duplicate in each array, you can try the following:
import pyspark.sql.functions as f
df.withColumn('recommendation', f.explode('recommendation'))\
    .withColumn('ID', f.col('recommendation').getItem('ID'))\
    .withColumn('rating', f.col('recommendation').getItem('rating'))\
    .groupby('person')\
    .pivot('ID')\
    .agg(f.first('rating')).show()
|person| a| b| c|
| xyz|0.4|0.3|0.3|
| abc|0.5|0.3|0.2|
| def|0.3|0.2|0.5|
Or transform with RDD:
from pyspark.sql import Row
df.rdd.map(lambda r: Row(
    person=r.person, **{s.ID: s.rating for s in r.recommendation})
).toDF().show()
|person| a| b| c|
| abc| 0.5|0.30000001192092896|0.20000000298023224|
| def|0.30000001192092896|0.20000000298023224| 0.5|
| xyz| 0.4000000059604645|0.30000001192092896|0.30000001192092896|
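The per-row reshaping done by the RDD variant can be sketched in plain Python without a Spark session. This is only an illustration: Rec is a hypothetical stand-in for the (ID, rating) struct, and the sample values are taken from the output table above.

```python
from collections import namedtuple

# Hypothetical stand-in for the struct elements of the recommendation array
Rec = namedtuple("Rec", ["ID", "rating"])

data = [
    ("xyz", [Rec("a", 0.4), Rec("b", 0.3), Rec("c", 0.3)]),
    ("abc", [Rec("a", 0.5), Rec("b", 0.3), Rec("c", 0.2)]),
    ("def", [Rec("a", 0.3), Rec("b", 0.2), Rec("c", 0.5)]),
]

def pivot(person, recommendation):
    # One output key per ID; if an ID repeated within the array,
    # the later rating would silently overwrite the earlier one.
    return {"person": person, **{s.ID: s.rating for s in recommendation}}

wide = [pivot(person, recs) for person, recs in data]
for row in wide:
    print(row)
```

The long decimals in the RDD output above come from the rating field being a 32-bit float in the schema: once the values pass through Python they are widened to 64-bit floats, so 0.3 shows up as 0.30000001192092896.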
I have a Spark dataframe on which I am performing the operations shown below. How can I make certain records skip all of the subsequent operations?
finalDf = df.map(mapFunc).reduceGroups(reduceFunc).map(mapFunc2).write().format().option().mode().save();
In the mapFunc, I want to write logic so that, if a certain condition is true, nothing is returned and the record is essentially dropped from all further operations: .reduceGroups(reduceFunc).map(mapFunc2).write().format().option().mode().save()
I tried returning Optional.empty() from the map, but the code fails in reduceGroups with the error below.
Exception in thread "main" java.util.NoSuchElementException: head of empty list
at scala.collection.immutable.Nil$.head(List.scala:420)
at scala.collection.immutable.Nil$.head(List.scala:417)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$5.apply(ExpressionEncoder.scala:121)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$5.apply(ExpressionEncoder.scala:120)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.tuple(ExpressionEncoder.scala:120)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.tuple(ExpressionEncoder.scala:187)
at org.apache.spark.sql.expressions.ReduceAggregator.bufferEncoder(ReduceAggregator.scala:38)
at org.apache.spark.sql.expressions.Aggregator.toColumn(Aggregator.scala:100)
at org.apache.spark.sql.KeyValueGroupedDataset.reduceGroups(KeyValueGroupedDataset.scala:436)
at org.apache.spark.sql.KeyValueGroupedDataset.reduceGroups(KeyValueGroupedDataset.scala:448)
Schema for dataframe :
|-- id: string (nullable = true)
|-- mid: integer (nullable = true)
|-- responses: string (nullable = true)
|-- version: string (nullable = true)
In the mapFunc, I am processing the responses string by taking a substring and fetching a value from it. In some cases the responses string can be empty, and in such cases I don't want that record to be in finalDf at all.
id mid responses version
A 1 "hello123" 1
B 1 "hello456" 2
A 2 "hello789" 5
A 1 "hello143" 4
B 3 "hello153" 6
C 3 "" 1
Output (Grouping by id, mid column)
id mid responses version
A 1 "143" 4
B 1 "456" 2
A 2 "789" 5
B 3 "153" 6
If responses for the (id, mid) combination (C, 3) were not empty, it would be in the output. So I want to remove the (C, 3) record in the mapFunc.
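If you have a map function like the one below, you can return the row unchanged and remove the empty records afterwards with filter:
val myFunc = (row: Row) => {
  val number = row.getString(2).replaceAll("\\D+","")
  // getInt(3) assumes version is an integer column; use getString(3) if it is a string, as in the question's schema
  (row.getString(0), row.getInt(1), number, row.getInt(3))
}
df.map(myFunc)
  .toDF(df.columns: _*)
  .filter(trim($"responses") =!= "")
  .show(false)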
|id |mid|responses|version|
|A |1 |123 |1 |
|B |1 |456 |2 |
|A |2 |789 |5 |
|A |1 |143 |4 |
|B |3 |153 |6 |
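The map-then-filter idea is independent of Spark, so it can be sketched in plain Python on the question's sample data. The helper name extract_digits and the tuple layout are assumptions made for this illustration only.

```python
import re

# Sample rows from the question: (id, mid, responses, version)
rows = [
    ("A", 1, "hello123", "1"),
    ("B", 1, "hello456", "2"),
    ("A", 2, "hello789", "5"),
    ("A", 1, "hello143", "4"),
    ("B", 3, "hello153", "6"),
    ("C", 3, "", "1"),
]

def extract_digits(row):
    # Keep the row shape and only strip the non-digit characters,
    # like replaceAll("\\D+", "") in the Scala answer.
    id_, mid, responses, version = row
    return (id_, mid, re.sub(r"\D+", "", responses), version)

# Step 1: map every row, including the ones that end up empty
mapped = [extract_digits(r) for r in rows]

# Step 2: drop rows whose responses value is empty after trimming
kept = [r for r in mapped if r[2].strip() != ""]

for r in kept:
    print(r)
```

Keeping the map total (every input row yields an output row) and filtering in a separate step avoids having to encode "no result" in the mapped type, which is what the Optional.empty() attempt ran into.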
I have an input PySpark df:
|timestamp|user_id|results |event_name|product_id|
|1000 |user_1 |result 1|Click |1 |
|1001 |user_1 |result 1|View |1 |
|1002 |user_1 |result 2|Click |3 |
|1003 |user_1 |result 2|View |4 |
|1004 |user_1 |result 2|View |5 |
|-- timestamp: timestamp (nullable = true)
|-- user_id: string (nullable = true)
|-- results: string (nullable = true)
|-- event_name: string (nullable = true)
|-- product_id: string (nullable = true)
I'd like to convert this to following making sure that I keep unique combinations of user_id and results, and aggregate product_ids based on given event_name like this:
|user_id|results |product_clicked|products_viewed|
|user_1 |result 1|[1] |[1] |
|user_1 |result 2|[3] |[4,5] |
|-- user_id: string (nullable = true)
|-- results: string (nullable = true)
|-- product_clicked: array (nullable = true)
| |-- element: string (containsNull = true)
|-- products_viewed: array (nullable = true)
| |-- element: string (containsNull = true)
I have looked into pivot. It is close, but I do not need the aggregation part of it; instead I need arrays to be created in columns that are derived from the event_name column. I cannot figure out how to do it.
NOTE: The order in product_clicked and product_viewed columns above is important and is based on timestamp column of input dataframe.
You can use collect_list during the pivot aggregation:
import pyspark.sql.functions as F
df2 = (df.groupBy('user_id', 'results')
         .pivot('event_name')
         .agg(F.collect_list('product_id'))
         .selectExpr("user_id", "results", "Click as product_clicked", "View as product_viewed"))
| user_1|result2| [3]| [4, 5]|
| user_1|result1| [1]| [1]|
To ensure ordering, you can collect a list of structs containing the timestamp, sort the list, and transform the list to only keep the product_id:
df2 = (df.groupBy('user_id', 'results')
         .pivot('event_name')
         .agg(F.sort_array(F.collect_list(F.struct('timestamp', 'product_id'))))
         .selectExpr("user_id", "results", "transform(Click, x -> x.product_id) as product_clicked", "transform(View, x -> x.product_id) as product_viewed")
      )
| user_1|result2| [3]| [4, 5]|
| user_1|result1| [1]| [1]|
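The collect, sort and project steps can be illustrated in plain Python (not Spark) on the question's sample rows. The names events, grouped and output are assumptions for this sketch, and two input rows are deliberately listed out of timestamp order to show that the sort restores it.

```python
from collections import defaultdict

# (timestamp, user_id, results, event_name, product_id)
events = [
    (1000, "user_1", "result 1", "Click", "1"),
    (1001, "user_1", "result 1", "View", "1"),
    (1002, "user_1", "result 2", "Click", "3"),
    (1004, "user_1", "result 2", "View", "5"),  # out of order on purpose
    (1003, "user_1", "result 2", "View", "4"),
]

# Collect (timestamp, product_id) pairs per (user_id, results) and event_name,
# the analogue of collect_list(struct('timestamp', 'product_id')) under a pivot
grouped = defaultdict(lambda: defaultdict(list))
for ts, user, result, event, product in events:
    grouped[(user, result)][event].append((ts, product))

# Sort each list by timestamp (first tuple field), then keep only product_id
output = {
    key: {event: [p for _, p in sorted(pairs)] for event, pairs in by_event.items()}
    for key, by_event in grouped.items()
}

for key, value in output.items():
    print(key, value)
```

Sorting tuples compares the first field first, which is the same reason sort_array works on structs whose leading field is the timestamp.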
I have a dataframe like the following that I want to convert to ISO-8601:
| production_date | expiration_date |
|["20/05/1996","01/01/2018"] | ["15/01/1997","27/03/2019"] |
| .... .... |
I want:
| good_prod_date | good_exp_date |
|[1996-05-20,2018-01-01] | [1997-01-15,2019-03-27] |
| .... .... |
However, there are over 20 columns and millions of rows. I'm trying to avoid UDFs since they are inefficient and, most of the time, a poor approach. I'm also avoiding exploding each column because that is:
Inefficient (hundreds of millions of rows are unnecessarily created)
Not an elegant solution
I tried that and it doesn't work
So far I have the following:
def explodeCols(df):
    return (df
            .withColumn("production_date", sf.explode("production_date"))
            .withColumn("expiration_date", sf.explode("expiration_date")))
def fixTypes(df):
    return (df
            .withColumn("production_date", sf.to_date("production_date", "dd/MM/yyyy"))
            .withColumn("expiration_date", sf.to_date("expiration_date", "dd/MM/yyyy")))
def consolidate(df):
    cols = ["production_date", "expiration_date"]
    return df.groupBy("id").agg(*[sf.collect_list(c) for c in cols])
historyDF = (df
However, when I run this code on Databricks, the jobs never complete; in fact, the run ends with failed/dead executors.
Another solution I tried is the following:
df.withColumn("good_prod_date", col("production_date").cast(ArrayType(DateType())))
But the result I get is an array of nulls:
| production_date | good_prod_date |
|["20/05/1996","01/01/2018"] | [null,null] |
| .... .... |
Use the transform higher-order function (called here through F.expr) instead of explode to convert each value inside the array:
import pyspark.sql.functions as F
df = (df
      .withColumn("production_date", F.expr("transform(production_date, v -> to_date(v, 'dd/MM/yyyy'))"))
      .withColumn("expiration_date", F.expr("transform(expiration_date, v -> to_date(v, 'dd/MM/yyyy'))")))
df.withColumn("good_prod_date", col("production_date").cast(ArrayType(DateType())))
This will not work because the strings in production_date are in dd/MM/yyyy format. Casting an array of strings to an array of dates only works when the strings are already in yyyy-MM-dd format, as in this example:
|-- actual_date: array (nullable = true)
| |-- element: string (containsNull = true)
|actual_date |
|[1997-01-15, 2019-03-27]|
df.select("actual_date").withColumn("actual_date", F.col("actual_date").cast("array<date>")).printSchema()
|-- actual_date: array (nullable = true)
| |-- element: date (containsNull = true)
df.select("actual_date").withColumn("actual_date", F.col("actual_date").cast("array<date>")).show()
|actual_date |
|[1997-01-15, 2019-03-27]|
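The difference between the two approaches can be illustrated with Python's standard library alone (this is not Spark code). The helpers to_iso and cast_like are hypothetical names: the first parses with an explicit dd/MM/yyyy pattern, element by element, and the second only accepts ISO strings, loosely mimicking how the plain cast produced nulls.

```python
from datetime import date, datetime

def to_iso(dates, fmt="%d/%m/%Y"):
    # Parse each element with an explicit pattern, then render it as ISO-8601
    return [datetime.strptime(d, fmt).date().isoformat() for d in dates]

def cast_like(value):
    # Accept only yyyy-MM-dd strings; anything else becomes None
    try:
        return date.fromisoformat(value)
    except ValueError:
        return None

production = ["20/05/1996", "01/01/2018"]
expiration = ["15/01/1997", "27/03/2019"]

print(to_iso(production))
print(to_iso(expiration))
print([cast_like(d) for d in production])        # non-ISO input
print([cast_like(d) for d in to_iso(production)])  # ISO input
```

The per-element conversion keeps one row per input row, which is the property that makes the array transform attractive compared with explode followed by a group-by.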
It seems that the Spark SQL window function is not working properly.
I am running a Spark job on a Hadoop cluster where the HDFS block size is 128 MB, using Spark 1.5 on CDH 5.5.
My requirement:
If there are multiple records with the same data_rfe_id, keep the single record with the maximum seq_id and maximum service_id.
In the raw data there are some records with the same data_rfe_id and the same seq_id, so I applied row_number using a window function so that I can filter the records with row_num === 1.
But it does not seem to work on huge datasets: the same row number is assigned to every row.
Why is this happening?
Do I need to reshuffle before I apply the window function to the dataframe?
I am expecting a unique row number for each record within a data_rfe_id.
I want to achieve this using only a window function.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber
scala> df.printSchema
|-- transitional_key: string (nullable = true)
|-- seq_id: string (nullable = true)
|-- data_rfe_id: string (nullable = true)
|-- service_id: string (nullable = true)
|-- event_start_date_time: string (nullable = true)
|-- event_id: string (nullable = true)
val windowFunction = Window.partitionBy(df("data_rfe_id")).orderBy(df("seq_id").desc,df("service_id").desc)
val rankDF =df.withColumn("row_num",rowNumber.over(windowFunction))
Expected result :
|data_rfe_id |seq_id |service_id|row_num|
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695826 |4039 |1 |
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695821 |3356 |2 |
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695802 |1857 |3 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2156 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2103 |2 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2083 |3 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2082 |4 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2076 |5 |
Actual Result I got as per above code :
|data_rfe_id |seq_id |service_id|row_num|
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695826 |4039 |1 |
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695821 |3356 |1 |
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695802 |1857 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2156 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2103 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2083 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2082 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2076 |1 |
Could someone explain why I am getting these unexpected results, and how I can resolve this?
Basically, you want to rank the rows with seq_id and service_id in descending order. Add rangeBetween with the range you need; rank may work for you. The following is a snippet of code:
val windowFunction = Window.partitionBy(df("data_rfe_id")).orderBy(df("seq_id").desc, df("service_id").desc).rangeBetween(-MAXNUMBER, MAXNUMBER)
val rankDF = df.withColumn("rank", rank().over(windowFunction))
As you are using an older version of Spark, I don't know whether it will work. There is an issue with windowSpec; here is a reference
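For comparison, the numbering that the question expects from row_number can be sketched in plain Python (not Spark) on the sample rows. The function name row_number and the tuple layout are assumptions for this illustration, and the sketch compares seq_id and service_id as integers, which is also an assumption since the schema declares them as strings.

```python
from itertools import groupby

A = "9ih67fshs-de11-4f80-a66d-b52a12c14b0e"
B = "23sds222-9669-429e-a95b-bc984ccf0fb0"

# (data_rfe_id, seq_id, service_id), listed in arbitrary order
rows = [
    (B, "1695541", "2083"), (A, "1695821", "3356"), (B, "1695541", "2156"),
    (A, "1695802", "1857"), (B, "1695541", "2076"), (A, "1695826", "4039"),
    (B, "1695541", "2103"), (B, "1695541", "2082"),
]

def row_number(rows):
    # Order by partition key, then seq_id and service_id descending
    ordered = sorted(rows, key=lambda r: (r[0], -int(r[1]), -int(r[2])))
    numbered = []
    # Restart the counter at 1 for every data_rfe_id partition
    for _, partition in groupby(ordered, key=lambda r: r[0]):
        for n, r in enumerate(partition, start=1):
            numbered.append(r + (n,))
    return numbered

numbered = row_number(rows)
latest = [r for r in numbered if r[3] == 1]
for r in latest:
    print(r)
```

Since the columns are strings in the dataframe, Spark would order them lexicographically unless they are cast to a numeric type first; that does not matter for these equal-length sample values but can change the result on real data.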