Spark merging two single value datasets

I have a Dataset with the following schema
|-- Name: string (nullable = true)
|-- Values: long (nullable = true)
|-- Count: integer (nullable = true)
Input Dataset
+------------+-----------------------+--------------+
|Name |Values |Count |
+------------+-----------------------+--------------+
|A |1000 |1 |
|B |1150 |0 |
|C |500 |3 |
+------------+-----------------------+--------------+
My result dataset needs to be of the format
+------------+-----------------------+--------------+
|Sum(count>0)| sum(all) | Percentage |
+------------+-----------------------+--------------+
| 1500 | 2650 | 56.60 |
+------------+-----------------------+--------------+
I am currently able to get the sum(count>0) and sum(all) in individual datasets by running
val non_zero = df.filter(col(COUNT).>(0)).select(sum(VALUES).as(NON_ZERO_SUM))
val total = df.select(sum(col(VALUES)).as(TOTAL_SUM))
I'm at a loss on what to do to merge the two independent datasets into a single dataset, with which I would calculate the percentage.
Also could this same problem be solved in a better way?
Thanks,

I'd use a single aggregation:
import org.apache.spark.sql.functions._
df.select(
  sum(when($"count" > 0, $"values")).alias("NON_ZERO_SUM"),
  sum($"values").alias("TOTAL_SUM")
)
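To also get the percentage, you can derive it from the two aggregated columns in the same pipeline. A minimal sketch, assuming the aggregation above is assigned to a value named sums (a hypothetical name) and that spark.implicits._ is in scope for the $ syntax:
import org.apache.spark.sql.functions.round

// `sums` is the single-row result of the aggregation shown above.
val result = sums.withColumn("Percentage", round($"NON_ZERO_SUM" / $"TOTAL_SUM" * 100, 2))

result.show()
// For the sample data: NON_ZERO_SUM = 1500, TOTAL_SUM = 2650, Percentage = 56.6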

Related

pyspark split array type column to multiple columns

After running the ALS algorithm in PySpark over a dataset, I have a final dataframe with the schema shown below. The recommendation column is an array type, and I want to split this column into multiple columns. Can anyone suggest which PySpark function can be used to form this dataframe?
Schema of the dataframe
root
|-- person: string (nullable = false)
|-- recommendation: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ID: string (nullable = true)
| | |-- rating: float (nullable = true)
Assuming ID doesn't duplicate in each array, you can try the following:
import pyspark.sql.functions as f
df.withColumn('recommendation', f.explode('recommendation'))\
  .withColumn('ID', f.col('recommendation').getItem('ID'))\
  .withColumn('rating', f.col('recommendation').getItem('rating'))\
  .groupby('person')\
  .pivot('ID')\
  .agg(f.first('rating')).show()
+------+---+---+---+
|person| a| b| c|
+------+---+---+---+
| xyz|0.4|0.3|0.3|
| abc|0.5|0.3|0.2|
| def|0.3|0.2|0.5|
+------+---+---+---+
Or transform with an RDD (note the extra import for Row):
from pyspark.sql import Row

df.rdd.map(lambda r: Row(
    person=r.person, **{s.ID: s.rating for s in r.recommendation})
).toDF().show()
+------+-------------------+-------------------+-------------------+
|person| a| b| c|
+------+-------------------+-------------------+-------------------+
| abc| 0.5|0.30000001192092896|0.20000000298023224|
| def|0.30000001192092896|0.20000000298023224| 0.5|
| xyz| 0.4000000059604645|0.30000001192092896|0.30000001192092896|
+------+-------------------+-------------------+-------------------+

Skip records in dataframe's map transformation

I have a Spark dataframe on which I am doing certain operations as follows. I want to know how I can skip processing certain records, so that they do not go through all the operations.
finalDf = df.map(mapFunc).reduceGroups(reduceFunc).map(mapFunc2).write().format().option().mode().save();
In the mapFunc, I want to write logic such that, if a certain condition is true, nothing is returned and that record is essentially dropped from the remaining operations .reduceGroups(reduceFunc).map(mapFunc2).write().format().option().mode().save().
I tried returning Optional.empty() from the map, but the code fails in reduceGroups with below error.
Exception in thread "main" java.util.NoSuchElementException: head of empty list
at scala.collection.immutable.Nil$.head(List.scala:420)
at scala.collection.immutable.Nil$.head(List.scala:417)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$5.apply(ExpressionEncoder.scala:121)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$5.apply(ExpressionEncoder.scala:120)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.tuple(ExpressionEncoder.scala:120)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.tuple(ExpressionEncoder.scala:187)
at org.apache.spark.sql.expressions.ReduceAggregator.bufferEncoder(ReduceAggregator.scala:38)
at org.apache.spark.sql.expressions.Aggregator.toColumn(Aggregator.scala:100)
at org.apache.spark.sql.KeyValueGroupedDataset.reduceGroups(KeyValueGroupedDataset.scala:436)
at org.apache.spark.sql.KeyValueGroupedDataset.reduceGroups(KeyValueGroupedDataset.scala:448)
Schema for dataframe :
root
|-- id: string (nullable = true)
|-- mid: integer (nullable = true)
|-- responses: string (nullable = true)
|-- version: string (nullable = true)
In the mapFunc, I am processing the responses string by taking a substring and fetching a value from it. In some cases the responses string can be empty, and in such cases I don't want that record to be in finalDf.
Input
id mid responses version
A 1 "hello123" 1
B 1 "hello456" 2
A 2 "hello789" 5
A 1 "hello143" 4
B 3 "hello153" 6
C 3 "" 1
Output (Grouping by id, mid column)
id mid responses version
A 1 "143" 4
B 1 "456" 2
A 2 "789" 5
B 3 "153" 6
If the responses value for the (id, mid) combination (C, 3) were not empty, then it would be in the output. So I want to remove the (C, 3) record in the mapFunc.
If you have a map function like the one below, you can simply return the transformed row and filter out the rows with empty responses afterwards using filter:
val myFunc = (row: Row) => {
  val number = row.getString(2).replaceAll("\\D+", "")
  (row.getString(0), row.getInt(1), number, row.getInt(3))
}

df.map(myFunc)
  .toDF(df.columns: _*)
  .filter(trim($"responses") =!= "")
  .show(false)
Output:
+---+---+---------+-------+
|id |mid|responses|version|
+---+---+---------+-------+
|A |1 |123 |1 |
|B |1 |456 |2 |
|A |2 |789 |5 |
|A |1 |143 |4 |
|B |3 |153 |6 |
+---+---+---------+-------+
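As an alternative sketch (not from the original answer), you could drop the record inside the transformation step itself instead of filtering afterwards, by using flatMap and returning zero or one element per row. This assumes the same column layout and types as in the snippet above and a SparkSession named spark in scope:
import org.apache.spark.sql.Row
import spark.implicits._ // encoders for the tuple result

val cleaned = df
  .flatMap { row: Row =>
    val responses = row.getString(2)
    if (responses == null || responses.trim.isEmpty)
      Seq.empty[(String, Int, String, Int)] // skip this record entirely
    else
      Seq((row.getString(0), row.getInt(1), responses.replaceAll("\\D+", ""), row.getInt(3)))
  }
  .toDF("id", "mid", "responses", "version")

cleaned.show(false)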

Pyspark DF Pivot and Create Arrays columns

I have an input PySpark df:
+---------+-------+--------+----------+----------+
|timestamp|user_id|results |event_name|product_id|
+---------+-------+--------+----------+----------+
|1000 |user_1 |result 1|Click |1 |
|1001 |user_1 |result 1|View |1 |
|1002 |user_1 |result 2|Click |3 |
|1003 |user_1 |result 2|View |4 |
|1004 |user_1 |result 2|View |5 |
+---------+-------+--------+----------+----------+
root
|-- timestamp: timestamp (nullable = true)
|-- user_id: string (nullable = true)
|-- results: string (nullable = true)
|-- event_name: string (nullable = true)
|-- product_id: string (nullable = true)
I'd like to convert this to the following, making sure that I keep unique combinations of user_id and results, and aggregate product_ids based on the given event_name, like this:
+-------+--------+---------------+---------------+
|user_id|results |product_clicked|products_viewed|
+-------+--------+---------------+---------------+
|user_1 |result 1|[1] |[1] |
|user_1 |result 2|[3]            |[4,5]          |
+-------+--------+---------------+---------------+
root
|-- user_id: string (nullable = true)
|-- results: string (nullable = true)
|-- product_clicked: array (nullable = true)
| |-- element: string (containsNull = true)
|-- products_viewed: array (nullable = true)
| |-- element: string (containsNull = true)
I have looked into pivot; it's close, but I do not need the aggregation part of it. Instead I need array creation in columns that are created based on the event_name column. I cannot figure out how to do it.
NOTE: The order in the product_clicked and product_viewed columns above is important and is based on the timestamp column of the input dataframe.
You can use collect_list during the pivot aggregation:
import pyspark.sql.functions as F
df2 = (df.groupBy('user_id', 'results')
       .pivot('event_name')
       .agg(F.collect_list('product_id'))
       .selectExpr("user_id", "results", "Click as product_clicked", "View as product_viewed")
      )
df2.show()
+-------+-------+---------------+--------------+
|user_id|results|product_clicked|product_viewed|
+-------+-------+---------------+--------------+
| user_1|result2| [3]| [4, 5]|
| user_1|result1| [1]| [1]|
+-------+-------+---------------+--------------+
To ensure ordering, you can collect a list of structs containing the timestamp, sort the list, and transform the list to only keep the product_id:
df2 = (df.groupBy('user_id', 'results')
       .pivot('event_name')
       .agg(F.sort_array(F.collect_list(F.struct('timestamp', 'product_id'))))
       .selectExpr("user_id", "results",
                   "transform(Click, x -> x.product_id) as product_clicked",
                   "transform(View, x -> x.product_id) as product_viewed")
      )
df2.show()
+-------+-------+---------------+--------------+
|user_id|results|product_clicked|product_viewed|
+-------+-------+---------------+--------------+
| user_1|result2| [3]| [4, 5]|
| user_1|result1| [1]| [1]|
+-------+-------+---------------+--------------+

Convert Column of ArrayType(StringType()) to ArrayType(DateType()) in PySpark

I have a dataframe like the following that I want to convert to ISO-8601:
| production_date | expiration_date |
--------------------------------------------------------------
|["20/05/1996","01/01/2018"] | ["15/01/1997","27/03/2019"] |
| .... .... |
--------------------------------------------------------------
I want:
| good_prod_date | good_exp_date |
-------------------------------------------------------------
|[1996-05-20,2018-01-01] | [1997-01-01,2019-03-27] |
| .... .... |
-------------------------------------------------------------
However, there are over 20 columns and millions of rows. I'm trying to avoid using UDFs since they are inefficient and, most of the time, a poor approach. I'm also avoiding exploding each column because that is:
Inefficient (hundreds of millions of rows are unnecessarily created)
Not an elegant solution
I tried that and it doesn't work
So far I have the following:
import pyspark.sql.functions as sf

def explodeCols(df):
    return (df
            .withColumn("production_date", sf.explode("production_date"))
            .withColumn("expiration_date", sf.explode("expiration_date")))

def fixTypes(df):
    return (df
            .withColumn("production_date", sf.to_date("production_date", "dd/MM/yyyy"))
            .withColumn("expiration_date", sf.to_date("expiration_date", "dd/MM/yyyy")))

def consolidate(df):
    cols = ["production_date", "expiration_date"]
    return df.groupBy("id").agg(*[sf.collect_list(c) for c in cols])

historyDF = (df
             .transform(explodeCols)
             .transform(fixTypes)
             .transform(consolidate))
However, when I run this code on Databricks, the jobs never finish; in fact, it results in failed/dead executors (which isn't good).
Another solution I tried is the following:
df.withColumn("good_prod_date", col("production_date").cast(ArrayType(DateType())))
But the result I get is an array of nulls:
| production_date | good_prod_date |
-------------------------------------------------------------
|["20/05/1996","01/01/2018"] | [null,null] |
| .... .... |
-------------------------------------------------------------
Use the pyspark.sql.functions.transform higher-order function instead of the explode function to transform each value in the array.
import pyspark.sql.functions as F

df.withColumn("production_date", F.expr("transform(production_date, v -> to_date(v, 'dd/MM/yyyy'))")) \
  .withColumn("expiration_date", F.expr("transform(expiration_date, v -> to_date(v, 'dd/MM/yyyy'))")) \
  .show()
df.withColumn("good_prod_date", col("production_date").cast(ArrayType(DateType())))
This will not work because production_date has a different date format; if the column had the yyyy-MM-dd format, the cast would work.
df.select("actual_date").printSchema()
root
|-- actual_date: array (nullable = true)
| |-- element: string (containsNull = true)
df.select("actual_date").show(false)
+------------------------+
|actual_date |
+------------------------+
|[1997-01-15, 2019-03-27]|
+------------------------+
df.select("actual_date").withColumn("actual_date", F.col("actual_date").cast("array<date>")).printSchema()
root
|-- actual_date: array (nullable = true)
| |-- element: date (containsNull = true)
df.select("actual_date").withColumn("actual_date", F.col("actual_date").cast("array<date>")).show()
+------------------------+
|actual_date |
+------------------------+
|[1997-01-15, 2019-03-27]|
+------------------------+

Getting unexpected results from Spark sql Windows Functions

It seems like the Spark SQL window function is not working properly.
I am running a Spark job on a Hadoop cluster where the HDFS block size is 128 MB, with
Spark version 1.5 on CDH 5.5.
My requirement:
If there are multiple records with the same data_rfe_id, then take a single record according to the maximum seq_id and maximum service_id.
I see that in the raw data there are some records with the same data_rfe_id and the same seq_id, so I applied row_number using a window function so that I could filter the records with row_num === 1.
But it seems it is not working on huge datasets; I see that the same row number is applied.
Why is it happening like this?
Do I need to reshuffle before I apply the window function on the dataframe?
I am expecting a unique rank number within each data_rfe_id.
I want to use only a window function to achieve this.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber
.....
scala> df.printSchema
root
|-- transitional_key: string (nullable = true)
|-- seq_id: string (nullable = true)
|-- data_rfe_id: string (nullable = true)
|-- service_id: string (nullable = true)
|-- event_start_date_time: string (nullable = true)
|-- event_id: string (nullable = true)
val windowFunction = Window.partitionBy(df("data_rfe_id")).orderBy(df("seq_id").desc,df("service_id").desc)
val rankDF =df.withColumn("row_num",rowNumber.over(windowFunction))
rankDF.select("data_rfe_id","seq_id","service_id","row_num").show(200,false)
Expected result:
+------------------------------------+-----------------+-----------+-------+
|data_rfe_id |seq_id |service_id|row_num|
+------------------------------------+-----------------+-----------+-------+
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695826 |4039 |1 |
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695821 |3356 |2 |
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695802 |1857 |3 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2156 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2103 |2 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2083 |3 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2082 |4 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2076 |5 |
Actual result I got with the above code:
+------------------------------------+-----------------+-----------+-------+
|data_rfe_id |seq_id |service_id|row_num|
+------------------------------------+-----------------+-----------+-------+
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695826 |4039 |1 |
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695821 |3356 |1 |
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695802 |1857 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2156 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2103 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2083 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2082 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2076 |1 |
Could someone explain why I am getting these unexpected results, and how do I resolve this?
Basically you want to rank, with seq_id and service_id in descending order. Add rangeBetween with the range you need; rank may work for you. The following is a snippet of the code:
import org.apache.spark.sql.functions.rank

val windowFunction = Window.partitionBy(df("data_rfe_id"))
  .orderBy(df("seq_id").desc, df("service_id").desc)
  .rangeBetween(-MAXNUMBER, MAXNUMBER)

val rankDF = df.withColumn("rank", rank().over(windowFunction))
As you are using an older version of Spark, I don't know whether it will work or not. There is an issue with windowSpec; here is the reference.
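For reference, on later Spark versions (where the function is row_number rather than rowNumber) the usual deduplication pattern is row_number plus a filter on row_num === 1. A minimal sketch, assuming the schema shown in the question:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// One record per data_rfe_id: the one with the highest seq_id and service_id.
// Note: seq_id and service_id are strings in the schema above, so .desc orders them
// lexicographically; cast them to numeric types first if that is not what you want.
val w = Window
  .partitionBy(col("data_rfe_id"))
  .orderBy(col("seq_id").desc, col("service_id").desc)

val deduped = df
  .withColumn("row_num", row_number().over(w))
  .filter(col("row_num") === 1)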
