I have a Spark DataFrame on which I am performing a series of operations, as follows. I want to know how to skip certain records so that they do not go through all of the operations.
finalDf = df.map(mapFunc).reduceGroups(reduceFunc).map(mapFunc2).write().format().option().mode().save();
In mapFunc, I want to write logic so that, if a certain condition is true, nothing is returned and that record is no longer considered by the rest of the pipeline: .reduceGroups(reduceFunc).map(mapFunc2).write().format().option().mode().save()
I tried returning Optional.empty() from the map, but the code fails in reduceGroups with the error below.
Exception in thread "main" java.util.NoSuchElementException: head of empty list
at scala.collection.immutable.Nil$.head(List.scala:420)
at scala.collection.immutable.Nil$.head(List.scala:417)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$5.apply(ExpressionEncoder.scala:121)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$5.apply(ExpressionEncoder.scala:120)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.tuple(ExpressionEncoder.scala:120)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.tuple(ExpressionEncoder.scala:187)
at org.apache.spark.sql.expressions.ReduceAggregator.bufferEncoder(ReduceAggregator.scala:38)
at org.apache.spark.sql.expressions.Aggregator.toColumn(Aggregator.scala:100)
at org.apache.spark.sql.KeyValueGroupedDataset.reduceGroups(KeyValueGroupedDataset.scala:436)
at org.apache.spark.sql.KeyValueGroupedDataset.reduceGroups(KeyValueGroupedDataset.scala:448)
Schema for the dataframe:
root
|-- id: string (nullable = true)
|-- mid: integer (nullable = true)
|-- responses: string (nullable = true)
|-- version: string (nullable = true)
In mapFunc, I process the responses string by taking a substring and fetching a value from it. In some cases the responses string can be empty, and in such cases I don't want that record to appear in finalDf at all.
Input
id mid responses version
A 1 "hello123" 1
B 1 "hello456" 2
A 2 "hello789" 5
A 1 "hello143" 4
B 3 "hello153" 6
C 3 "" 1
Output (grouping by the id and mid columns)
id mid responses version
A 1 "143" 4
B 1 "456" 2
A 2 "789" 5
B 3 "153" 6
If the responses value for the (id, mid) combination (C, 3) were not empty, then it would be in the output. So I want to drop the (C, 3) record inside mapFunc.
If you have a map function like the one below, you can simply return the complete row from the map and then drop the unwanted records afterwards with filter:
val myFunc = (row: Row) => {
  val number = row.getString(2).replaceAll("\\D+", "")
  (row.getString(0), row.getInt(1), number, row.getInt(3))
}

df.map(myFunc)
  .toDF(df.columns: _*)
  .filter(trim($"responses") =!= "")
  .show(false)
Output:
+---+---+---------+-------+
|id |mid|responses|version|
+---+---+---------+-------+
|A |1 |123 |1 |
|B |1 |456 |2 |
|A |2 |789 |5 |
|A |1 |143 |4 |
|B |3 |153 |6 |
+---+---+---------+-------+
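Alternatively, if you want the record to be dropped inside the mapping step itself (as the question asks), you can switch that step to flatMap and return an empty iterator for the rows you want to skip. A minimal sketch, assuming spark.implicits._ is in scope and the same column layout as above:
val mapped = df.flatMap { row =>
  val responses = row.getString(2)
  if (responses == null || responses.trim.isEmpty)
    Iterator.empty                                   // skip this record entirely
  else
    Iterator((row.getString(0), row.getInt(1),
              responses.replaceAll("\\D+", ""), row.getInt(3)))
}
// mapped contains only the rows you want to keep; continue with groupByKey/reduceGroups from here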
Related
I have a dataframe like the following that I want to convert to ISO-8601:
| production_date | expiration_date |
--------------------------------------------------------------
|["20/05/1996","01/01/2018"] | ["15/01/1997","27/03/2019"] |
| .... .... |
--------------------------------------------------------------
I want:
| good_prod_date | good_exp_date |
-------------------------------------------------------------
|[1996-05-20,2018-01-01] | [1997-01-01,2019-03-27] |
| .... .... |
-------------------------------------------------------------
However, there are over 20 columns and millions of rows. I'm trying to avoid UDFs since they are inefficient and, most of the time, a poor approach. I'm also avoiding exploding each column because that is:
Inefficient (hundreds of millions of rows are unnecessarily created)
Not an elegant solution
I tried that and it doesn't work
So far I have the following:
import pyspark.sql.functions as sf

def explodeCols(df):
    return (df
        .withColumn("production_date", sf.explode("production_date"))
        .withColumn("expiration_date", sf.explode("expiration_date")))

def fixTypes(df):
    return (df
        .withColumn("production_date", sf.to_date("production_date", "dd/MM/yyyy"))
        .withColumn("expiration_date", sf.to_date("expiration_date", "dd/MM/yyyy")))

def consolidate(df):
    cols = ["production_date", "expiration_date"]
    return df.groupBy("id").agg(*[sf.collect_list(c) for c in cols])

historyDF = (df
    .transform(explodeCols)
    .transform(fixTypes)
    .transform(consolidate))
However, when I run this code on Databricks the jobs never finish; instead I end up with failed/dead executors (which isn't good).
Another solution I tried is the following:
df.withColumn("good_prod_date", col("production_date").cast(ArrayType(DateType())))
But the result I get is an array of nulls:
| production_date | good_prod_date |
-------------------------------------------------------------
|["20/05/1996","01/01/2018"] | [null,null] |
| .... .... |
-------------------------------------------------------------
Use the SQL transform higher-order function (here via expr) instead of explode to convert each value inside the arrays:
from pyspark.sql import functions as F

(df
  .withColumn("production_date", F.expr("transform(production_date, v -> to_date(v, 'dd/MM/yyyy'))"))
  .withColumn("expiration_date", F.expr("transform(expiration_date, v -> to_date(v, 'dd/MM/yyyy'))"))
  .show())
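The same expression also works unchanged from Scala through expr(); a minimal sketch, assuming the same two array<string> columns:
import org.apache.spark.sql.functions.expr

val fixedDF = df
  .withColumn("production_date", expr("transform(production_date, v -> to_date(v, 'dd/MM/yyyy'))"))
  .withColumn("expiration_date", expr("transform(expiration_date, v -> to_date(v, 'dd/MM/yyyy'))"))

fixedDF.show(false)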
df.withColumn("good_prod_date", col("production_date").cast(ArrayType(DateType())))
This will not work, because production_date uses a different date format; if the column were already in a format like yyyy-MM-dd, the cast would work.
df.select("actual_date").printSchema()
root
|-- actual_date: array (nullable = true)
| |-- element: string (containsNull = true)
df.select("actual_date").show(false)
+------------------------+
|actual_date |
+------------------------+
|[1997-01-15, 2019-03-27]|
+------------------------+
df.select("actual_date").withColumn("actual_date", F.col("actual_date").cast("array<date>")).printSchema()
root
|-- actual_date: array (nullable = true)
| |-- element: date (containsNull = true)
df.select("actual_date").withColumn("actual_date", F.col("actual_date").cast("array<date>")).show()
+------------------------+
|actual_date |
+------------------------+
|[1997-01-15, 2019-03-27]|
+------------------------+
Dataframe schema:
root
|-- ID: decimal(15,0) (nullable = true)
|-- COL1: array (nullable = true)
| |-- element: string (containsNull = true)
|-- COL2: array (nullable = true)
| |-- element: string (containsNull = true)
|-- COL3: array (nullable = true)
| |-- element: string (containsNull = true)
Sample data
+--------------------+--------------------+--------------------+
| COL1 | COL2 | COL3 |
+--------------------+--------------------+--------------------+
|[A, B, C, A] |[101, 102, 103, 104]|[P, Q, R, S] |
+--------------------+--------------------+--------------------+
I want to apply nested conditions on array elements.
For example,
Find the COL3 elements where the corresponding COL1 element is A and the COL2 element is even.
Expected Output : [S]
I looked at various functions, e.g. array_position, but it returns only the first occurrence.
Is there any straightforward way, or do I have to explode the arrays?
Assuming your condition applies to array elements with the same index, it is possible to filter arrays with lambda functions in SQL since Spark 2.4.0, but this is still not exposed via the other language APIs and you need to use expr(). You simply zip the three arrays and then filter the resulting array of structs:
scala> df.show()
+---+------------+--------------------+------------+
| ID| COL1| COL2| COL3|
+---+------------+--------------------+------------+
| 1|[A, B, C, A]|[101, 102, 103, 104]|[P, Q, R, S]|
+---+------------+--------------------+------------+
scala> df.select($"ID", expr(s"""
| filter(
| arrays_zip(COL1, COL2, COL3),
| e -> e.COL1 == "A" AND CAST(e.COL2 AS integer) % 2 == 0
| ).COL3 AS result
| """)).show()
+---+------+
| ID|result|
+---+------+
| 1| [S]|
+---+------+
Since this uses expr() to supply an SQL expression as a column, it also works with PySpark:
>>> from pyspark.sql.functions import expr
>>> df.select(df.ID, expr("""
... filter(
... arrays_zip(COL1, COL2, COL3),
... e -> e.COL1 == "A" AND CAST(e.COL2 AS integer) % 2 == 0
... ).COL3 AS result
... """)).show()
+---+------+
| ID|result|
+---+------+
| 1| [S]|
+---+------+
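On Spark 3.0 and later, the same filter can also be expressed with the Column-based higher-order functions instead of expr(); a minimal Scala sketch, assuming the schema above:
import org.apache.spark.sql.functions.{arrays_zip, col, filter}

val result = df.select(
  col("ID"),
  filter(
    arrays_zip(col("COL1"), col("COL2"), col("COL3")),          // array of structs, one struct per index
    e => e.getField("COL1") === "A" && e.getField("COL2").cast("int") % 2 === 0
  ).getField("COL3").as("result")                               // keep only the COL3 field of the surviving structs
)
result.show()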
I have an RDD (or DataFrame) of measuring data which is ordered by the timestamp, and I need to do a pairwise operation on two subsequent records for the same key (e.g., doing a trapezium integration of accelerometer data to get velocities).
Is there a function in Spark that "remembers" the last record for each key and has it available when the next record for the same key arrives?
I currently thought of this approach:
Get all the keys of the RDD
Use a custom Partitioner to partition the RDD by the found keys so I know there is one partition for each key
Use mapPartitions to do the calculation
However, this approach has a flaw:
Getting the keys can be a very lengthy task, because the input data can be several GiB or even TiB large. I could write a custom InputFormat to just extract the keys, which would be significantly faster (as I use Hadoop's API and sc.newAPIHadoopFile to get the data in the first place), but that would be an additional thing to consider and an additional source of bugs.
So my question is: Is there anything like reduceByKey that doesn't aggregate the data but just gives me the current record and the last one for that key and lets me output one or more records based on that information?
Here is what you can do with a DataFrame:
import java.sql.Timestamp
import org.apache.spark.sql.types.{TimestampType, IntegerType}
import org.apache.spark.sql.functions._
**Create a window for the lag function**
val w = org.apache.spark.sql.expressions.Window.partitionBy("key").orderBy("timestamp")
val df = spark.sparkContext.parallelize(List((1, 23, Timestamp.valueOf("2017-12-02 03:04:00")),
(1, 24, Timestamp.valueOf("2017-12-02 01:45:20")),
(1, 26, Timestamp.valueOf("2017-12-02 01:45:20")),
(1, 27, Timestamp.valueOf("2017-12-02 01:45:20")),
(2, 30, Timestamp.valueOf("2017-12-02 01:45:20")),
(2, 33, Timestamp.valueOf("2017-12-02 01:45:20")),
(2, 39, Timestamp.valueOf("2017-12-02 01:45:20")))).toDF("key","value","timestamp")
scala> df.printSchema
root
|-- key: integer (nullable = false)
|-- value: integer (nullable = false)
|-- timestamp: timestamp (nullable = true)
scala> val lagDF = df.withColumn("lag_value",lag("value", 1, 0).over(w))
lagDF: org.apache.spark.sql.DataFrame = [key: int, value: int ... 2 more fields]
**The previous record and the current record are now in the same row**
scala> lagDF.show(10, false)
+---+-----+-------------------+---------+
|key|value|timestamp |lag_value|
+---+-----+-------------------+---------+
|1 |24 |2017-12-02 01:45:20|0 |
|1 |26 |2017-12-02 01:45:20|24 |
|1 |27 |2017-12-02 01:45:20|26 |
|1 |23 |2017-12-02 03:04:00|27 |
|2 |30 |2017-12-02 01:45:20|0 |
|2 |33 |2017-12-02 01:45:20|30 |
|2 |39 |2017-12-02 01:45:20|33 |
+---+-----+-------------------+---------+
**Put your distance calculation logic here; I'm using a dummy operation for the demo**
scala> val result = lagDF.withColumn("dummy_operation_for_dist_calc", lagDF("value") - lagDF("lag_value"))
result: org.apache.spark.sql.DataFrame = [key: int, value: int ... 3 more fields]
scala> result.show(10, false)
+---+-----+-------------------+---------+-----------------------------+
|key|value|timestamp |lag_value|dummy_operation_for_dist_calc|
+---+-----+-------------------+---------+-----------------------------+
|1 |24 |2017-12-02 01:45:20|0 |24 |
|1 |26 |2017-12-02 01:45:20|24 |2 |
|1 |27 |2017-12-02 01:45:20|26 |1 |
|1 |23 |2017-12-02 03:04:00|27 |-4 |
|2 |30 |2017-12-02 01:45:20|0 |30 |
|2 |33 |2017-12-02 01:45:20|30 |3 |
|2 |39 |2017-12-02 01:45:20|33 |6 |
+---+-----+-------------------+---------+-----------------------------+
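For the original use case (trapezium integration of accelerometer data), the same lag trick can be extended by lagging both the value and the timestamp and combining them. A rough sketch, assuming the value column holds acceleration samples; the per-step formula is the standard trapezium rule and the new column names are hypothetical:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag, sum, unix_timestamp, when}

val w2 = Window.partitionBy("key").orderBy("timestamp")

val velocityDF = df
  .withColumn("prev_value", lag("value", 1).over(w2))                 // previous sample (null on the first row per key)
  .withColumn("prev_ts", lag("timestamp", 1).over(w2))
  .withColumn("dt", unix_timestamp(col("timestamp")) - unix_timestamp(col("prev_ts")))
  .withColumn("delta_v",                                              // trapezium rule for one step: (v_prev + v_curr) / 2 * dt
    when(col("prev_value").isNull, 0.0)
      .otherwise((col("value") + col("prev_value")) / 2.0 * col("dt")))
  .withColumn("velocity",                                             // running sum per key gives the integrated value so far
    sum("delta_v").over(w2.rowsBetween(Window.unboundedPreceding, Window.currentRow)))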
It seems like the Spark SQL window function is not working properly.
I am running a Spark job on a Hadoop cluster where the HDFS block size is 128 MB, using Spark version 1.5 on CDH 5.5.
My requirement:
If there are multiple records with the same data_rfe_id, take the single record with the maximum seq_id and maximum service_id.
I see that in the raw data there are some records with the same data_rfe_id and the same seq_id, so I applied row_number using a window function so that I can filter for the records with row_num === 1.
But it does not seem to work on huge datasets: I see the same row number assigned to every record.
Why is this happening?
Do I need to reshuffle before applying the window function to the dataframe?
I am expecting a unique row number within each data_rfe_id group.
I want to achieve this using only a window function.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber
.....
scala> df.printSchema
root
|-- transitional_key: string (nullable = true)
|-- seq_id: string (nullable = true)
|-- data_rfe_id: string (nullable = true)
|-- service_id: string (nullable = true)
|-- event_start_date_time: string (nullable = true)
|-- event_id: string (nullable = true)
val windowFunction = Window.partitionBy(df("data_rfe_id")).orderBy(df("seq_id").desc, df("service_id").desc)
val rankDF = df.withColumn("row_num", rowNumber.over(windowFunction))
rankDF.select("data_rfe_id","seq_id","service_id","row_num").show(200,false)
Expected result :
+------------------------------------+-----------------+-----------+-------+
|data_rfe_id |seq_id |service_id|row_num|
+------------------------------------+-----------------+-----------+-------+
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695826 |4039 |1 |
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695821 |3356 |2 |
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695802 |1857 |3 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2156 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2103 |2 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2083 |3 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2082 |4 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2076 |5 |
Actual result I got with the code above:
+------------------------------------+-----------------+-----------+-------+
|data_rfe_id |seq_id |service_id|row_num|
+------------------------------------+-----------------+-----------+-------+
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695826 |4039 |1 |
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695821 |3356 |1 |
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695802 |1857 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2156 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2103 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2083 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2082 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2076 |1 |
Could someone explain why I am getting these unexpected results, and how do I resolve them?
Basically you want to rank, with seq_id and service_id in descending order. Add rangeBetween with the range you need; rank may work for you. The following is a snippet of code:
val windowFunction = Window.partitionBy(df("data_rfe_id")).orderBy(df("seq_id").desc, df("service_id").desc).rangeBetween(-MAXNUMBER, MAXNUMBER)
val rankDF = df.withColumn("rank", rank().over(windowFunction))
As you are using an older version of Spark, I don't know whether it will work or not. There is an issue with WindowSpec; here is the reference.
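For reference, on Spark 1.6 and later the same requirement can be written with row_number and a filter; a minimal sketch, assuming the schema above (note that seq_id and service_id are strings here, so the ordering is lexicographic unless you cast them):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val w = Window.partitionBy(col("data_rfe_id"))
              .orderBy(col("seq_id").desc, col("service_id").desc)

val latestPerRfe = df
  .withColumn("row_num", row_number().over(w))    // 1 = highest seq_id, ties broken by highest service_id
  .filter(col("row_num") === 1)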
I have a Dataset with the following schema
|-- Name: string (nullable = true)
|-- Values: long (nullable = true)
|-- Count: integer (nullable = true)
Input Dataset
+------------+-----------------------+--------------+
|Name |Values |Count |
+------------+-----------------------+--------------+
|A |1000 |1 |
|B |1150 |0 |
|C |500 |3 |
+------------+-----------------------+--------------+
My result dataset needs to be of the format
+------------+-----------------------+--------------+
|Sum(count>0)| sum(all) | Percentage |
+------------+-----------------------+--------------+
| 1500 | 2650 | 56.60 |
+------------+-----------------------+--------------+
I am currently able to get the sum(count>0) and sum(all) in individual datasets by running
val non_zero = df.filter(col(COUNT).>(0)).select(sum(VALUES).as(NON_ZERO_SUM))
val total = df.select(sum(col(VALUES)).as(TOTAL_SUM))
I'm at a loss on what to do to merge the two independent datasets into a single dataset, with which I would calculate the percentage.
Also could this same problem be solved in a better way?
Thanks,
I'd use a single aggregation:
import org.apache.spark.sql.functions._
df.select(
sum(when($"count" > 0, $"values")).alias("NON_ZERO_SUM"),
sum($"values").alias("TOTAL_SUM")
)
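To also get the percentage, the same single pass can be extended with one more column; a minimal sketch, assuming spark.implicits._ is in scope for the $ syntax:
import org.apache.spark.sql.functions.{round, sum, when}

val summary = df
  .select(
    sum(when($"count" > 0, $"values")).alias("NON_ZERO_SUM"),   // sum of values where count > 0
    sum($"values").alias("TOTAL_SUM")                           // sum of all values
  )
  .withColumn("PERCENTAGE", round($"NON_ZERO_SUM" / $"TOTAL_SUM" * 100, 2))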