Custom OrderBy in Spark SQL - apache-spark

I have two columns that need to be sorted in a custom order.
For example:
The Month column should be sorted from Jan 2015 to Dec of the current year.
Similarly, I have a Quarter column that should be ordered as Q1-2015, Q2-2015, ..., Q4-CurrentYear.
In Spark SQL I call orderBy("Month","Quarter"), but the result should follow the custom sequence described above.
I have tried the code below:
import org.apache.spark.sql.SaveMode
import org.apache.spark.storage.StorageLevel
val vDF = spark.sql("""select month, quarter from table group by month, quarter order by month, quarter""")
// write returns a DataFrameWriter, so the output path is passed to csv(...)
vDF.repartition(10).orderBy("Month", "Quarter").write.csv("results.csv")
Currently Month is sorted alphabetically (Apr, Aug, Dec, ...) and Quarter as Q1-2015, Q1-2016, ..., but I need the order described above.

I'd just parse the dates:
import org.apache.spark.sql.functions._
val df = Seq(
("Jul", "Q3-2017"), ("May", "Q2-2017"),
("Jan", "Q1-2016"), ("Dec", "Q4-2016"), ("Aug", "Q1-2016")
).toDF("month", "quater")
df.orderBy(unix_timestamp(
concat_ws(" ", col("month"), substring(col("quater"), 4, 6)), "MMM yyyy"
)).show()
+-----+-------+
|month| quater|
+-----+-------+
| Jan|Q1-2016|
| Aug|Q1-2016|
| Dec|Q4-2016|
| May|Q2-2017|
| Jul|Q3-2017|
+-----+-------+
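The same idea (build a real date from the month abbreviation plus the year taken from the quarter, then sort on it) can be checked outside Spark. This is only a plain-Python sketch, not Spark code; the helper name sort_key is made up for illustration, and it assumes an English locale so that %b matches "Jan", "Aug" and so on.

```python
from datetime import datetime

# Same sample rows as in the answer above: (month, quarter)
rows = [
    ("Jul", "Q3-2017"), ("May", "Q2-2017"),
    ("Jan", "Q1-2016"), ("Dec", "Q4-2016"), ("Aug", "Q1-2016"),
]

def sort_key(row):
    """Hypothetical helper: turn ('Jan', 'Q1-2016') into a sortable datetime."""
    month, quarter = row
    year = quarter.split("-")[1]          # 'Q1-2016' -> '2016'
    return datetime.strptime(f"{month} {year}", "%b %Y")

ordered = sorted(rows, key=sort_key)
for month, quarter in ordered:
    print(month, quarter)
```

Sorting on a parsed date rather than on the raw strings is what avoids the alphabetical Apr, Aug, Dec order from the question.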

Related

Parse Date Format

I have the following DataFrame containing the date format - yyyyMMddTHH:mm:ss+UTC
Data Preparation
sparkDF = sql.createDataFrame([("20201021T00:00:00+0530",),
("20211011T00:00:00+0530",),
("20200212T00:00:00+0300",),
("20211021T00:00:00+0530",),
("20211021T00:00:00+0900",),
("20211021T00:00:00-0500",)
]
,['timestamp'])
sparkDF.show(truncate=False)
+----------------------+
|timestamp |
+----------------------+
|20201021T00:00:00+0530|
|20211011T00:00:00+0530|
|20200212T00:00:00+0300|
|20211021T00:00:00+0530|
|20211021T00:00:00+0900|
|20211021T00:00:00-0500|
+----------------------+
I am aware of the date format needed to parse and convert the values to DateType.
Timestamp Parsed
from pyspark.sql import functions as F
sparkDF.select(F.to_date(F.col('timestamp'),"yyyyMMdd'T'HH:mm:ss+0530").alias('timestamp_parsed')).show()
+----------------+
|timestamp_parsed|
+----------------+
| 2020-10-21|
| 2021-10-11|
| null|
| 2021-10-21|
| null|
| null|
+----------------+
As you can see, the pattern only matches strings ending in +0530. I am aware that I can use multiple patterns and coalesce the first non-null value.
Multiple Patterns & Coalesce
sparkDF.withColumn('p1',F.to_date(F.col('timestamp'),"yyyyMMdd'T'HH:mm:ss+0530"))\
.withColumn('p2',F.to_date(F.col('timestamp'),"yyyyMMdd'T'HH:mm:ss+0900"))\
.withColumn('p3',F.to_date(F.col('timestamp'),"yyyyMMdd'T'HH:mm:ss-0500"))\
.withColumn('p4',F.to_date(F.col('timestamp'),"yyyyMMdd'T'HH:mm:ss+0300"))\
.withColumn('timestamp_parsed',F.coalesce(F.col('p1'),F.col('p2'),F.col('p3'),F.col('p4')))\
.drop(*['p1','p2','p3','p4'])\
.show(truncate=False)
+----------------------+----------------+
|timestamp |timestamp_parsed|
+----------------------+----------------+
|20201021T00:00:00+0530|2020-10-21 |
|20211011T00:00:00+0530|2021-10-11 |
|20200212T00:00:00+0300|2020-02-12 |
|20211021T00:00:00+0530|2021-10-21 |
|20211021T00:00:00+0900|2021-10-21 |
|20211021T00:00:00-0500|2021-10-21 |
+----------------------+----------------+
Is there a better way to accomplish this? There might be many other UTC offsets in the data source. Is there a standard offset pattern available in Spark that parses all of these cases?
I think the second argument of your to_date call is wrong, which is causing the null values in your output.
The +0530 in your timestamp is the UTC offset, which denotes how many hours and minutes the local time is ahead of (for +) or behind (for -) UTC.
Please refer to the answer by Basil in "Java / convert ISO-8601 (2010-12-16T13:33:50.513852Z) to Date object", which has the full details.
To answer your question: if you replace the literal +0530 in the pattern with the pattern letter Z, it should solve your problem.
Here is the Spark code in Scala that I tried and that worked:
val data = Seq("20201021T00:00:00+0530",
"20211011T00:00:00+0530",
"20200212T00:00:00+0300",
"20211021T00:00:00+0530",
"20211021T00:00:00+0900",
"20211021T00:00:00-0500")
import spark.implicits._
val sparkDF = data.toDF("custom_time")
import org.apache.spark.sql.functions._
val spark_DF2 = sparkDF.withColumn("new_timestamp", to_date($"custom_time", "yyyyMMdd'T'HH:mm:ssZ"))
spark_DF2.show(false)
In the output there are no null values.
You can usually use x, X or Z as the offset pattern, as described on the Spark date pattern documentation page. You can then parse your date with the following complete pattern: yyyyMMdd'T'HH:mm:ssxx
However, if you use this kind of offset pattern, your timestamp is first converted to UTC, so a midnight timestamp with a positive offset falls on the previous day. For instance, "20201021T00:00:00+0530" is mapped to 2020-10-20 by to_date with the previous pattern.
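The day shift is easy to reproduce outside Spark. The following is a plain-Python sketch (standard library only, not Spark's parser) that parses the same string with its offset and compares the date as written with the date after converting to UTC:

```python
from datetime import datetime, timezone

raw = "20201021T00:00:00+0530"

# %z accepts offsets written as +HHMM, like the values in the question
parsed = datetime.strptime(raw, "%Y%m%dT%H:%M:%S%z")

# Same instant, expressed in UTC
as_utc = parsed.astimezone(timezone.utc)

local_date = parsed.date()   # the date as written in the string
utc_date = as_utc.date()     # the date after shifting to UTC

print(local_date, utc_date)
```

Midnight at +05:30 is 18:30 of the previous day in UTC, which is why the two dates differ by one day.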
If you want the date as displayed, ignoring the offset, you should first extract the date string from the complete timestamp string with the regexp_extract function and then apply to_date.
In your example "20201021T00:00:00+0530", the part to extract with a regexp is 20201021, and to_date is then applied to it. You can do it with the following pattern: ^(\\d+). If you're interested, you can find how to build other patterns in Java's Pattern documentation.
So your code should be:
from pyspark.sql import functions as F
sparkDF.select(
F.to_date(
F.regexp_extract(F.col('timestamp'), '^(\\d+)', 0), 'yyyyMMdd'
).alias('timestamp_parsed')
).show()
And with your input you will get:
+----------------+
|timestamp_parsed|
+----------------+
|2020-10-21 |
|2021-10-11 |
|2020-02-12 |
|2021-10-21 |
|2021-10-21 |
|2021-10-21 |
+----------------+
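To see what the ^(\d+) pattern captures, here is a small plain-Python sketch of the same extract-then-parse step. It uses Python's re module rather than Spark's regexp_extract, and the function name displayed_date is invented for this example.

```python
import re
from datetime import datetime

samples = [
    "20201021T00:00:00+0530",
    "20211011T00:00:00+0530",
    "20200212T00:00:00+0300",
    "20211021T00:00:00+0530",
    "20211021T00:00:00+0900",
    "20211021T00:00:00-0500",
]

def displayed_date(ts):
    """Hypothetical helper: keep the leading digits and parse them as yyyyMMdd."""
    match = re.match(r"^(\d+)", ts)
    if match is None:
        raise ValueError(f"no leading digits in {ts!r}")
    return datetime.strptime(match.group(1), "%Y%m%d").date()

parsed = [displayed_date(ts) for ts in samples]
for ts, d in zip(samples, parsed):
    print(ts, d)
```

The match stops at the T, so the offset never takes part in the parsing and the date stays as written.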
You can create a UDF in Spark and use it. Below is the code in Scala.
import java.time.{OffsetDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.functions
import spark.implicits._
// just to create the dataset for the example you have given
val data = Seq(
"20201021T00:00:00+0530",
"20211011T00:00:00+0530",
"20200212T00:00:00+0300",
"20211021T00:00:00+0530",
"20211021T00:00:00+0900",
"20211021T00:00:00-0500")
val dataset = data.toDF("timestamp")
// parse the string together with its offset, then shift it to the same instant in UTC
val udfToDateUTC = functions.udf((ts: String) => {
val formatter = DateTimeFormatter.ofPattern("yyyyMMdd'T'HH:mm:ssZ")
val res = OffsetDateTime.parse(ts, formatter).withOffsetSameInstant(ZoneOffset.UTC)
res.toString()
})
dataset.select(dataset.col("timestamp"),udfToDateUTC(dataset.col("timestamp")).alias("timestamp_parsed")).show(false)
//output
+----------------------+-----------------+
|timestamp |timestamp_parsed |
+----------------------+-----------------+
|20201021T00:00:00+0530|2020-10-20T18:30Z|
|20211011T00:00:00+0530|2021-10-10T18:30Z|
|20200212T00:00:00+0300|2020-02-11T21:00Z|
|20211021T00:00:00+0530|2021-10-20T18:30Z|
|20211021T00:00:00+0900|2021-10-20T15:00Z|
|20211021T00:00:00-0500|2021-10-21T05:00Z|
+----------------------+-----------------+
from pyspark.sql.functions import date_format
# df is a placeholder for your DataFrame; alias names the formatted column
customer_data = df.select("<column_name>", date_format("<column_name>", "yyyyMMdd").alias("customer"))

Add the day information to a timestamp in a DataFrame

I am trying to read a CSV file into a DataFrame.
The cell values contain only the time of day and lack the day information. I would like to read this CSV file into a DataFrame and transform the timing information into a format like 2021-05-07 04:04.00, i.e., I would like to add the day information. How can I achieve that?
I used the following code, but PySpark just fills in the day as 1970-01-01, apparently a default.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_timestamp
spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.legacy.timeParserPolicy","LEGACY")
df_1 = spark.read.csv('test1.csv', header = True)
df_1 = df_1.withColumn('Timestamp', to_timestamp(col('Timing'), 'HH:mm'))
df_1.show(truncate=False)
And I got the following result.
+-------+-------------------+
| Timing| Timestamp|
+-------+-------------------+
|04:04.0|1970-01-01 04:04:00|
|19:04.0|1970-01-01 19:04:00|
+-------+-------------------+
You can concat a date string before calling to_timestamp:
import pyspark.sql.functions as F
df2 = df_1.withColumn(
'Timestamp',
F.to_timestamp(
F.concat_ws(' ', F.lit('2021-05-07'), 'Timing'),
'yyyy-MM-dd HH:mm.s'
)
)
df2.show()
+-------+-------------------+
| Timing| Timestamp|
+-------+-------------------+
|04:04.0|2021-05-07 04:04:00|
|19:04.0|2021-05-07 19:04:00|
+-------+-------------------+
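The concat-then-parse step can be tried without Spark as well. This is a plain-Python sketch of the same idea; the day 2021-05-07 is the one used in the answer above, and the helper name with_day is made up for illustration.

```python
from datetime import datetime

DAY = "2021-05-07"  # the day to prepend, as in the answer above

def with_day(timing, day=DAY):
    """Hypothetical helper: prepend a day to an 'HH:mm.s' value and parse the result."""
    return datetime.strptime(f"{day} {timing}", "%Y-%m-%d %H:%M.%S")

timestamps = [with_day(t) for t in ["04:04.0", "19:04.0"]]
for ts in timestamps:
    print(ts)
```

Without the prepended day, a parser has nothing to fill the date part with, which is where the 1970-01-01 in the question comes from.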

Extract Numeric data from the Column in Spark Dataframe

I have a Dataframe with 20 columns and I want to update one particular column (whose data is null) with the data extracted from another column and do some formatting. Below is a sample input
+------------------------+----+
|col1 |col2|
+------------------------+----+
|This_is_111_222_333_test|NULL|
|This_is_111_222_444_test|3296|
|This_is_555_and_666_test|NULL|
|This_is_999_test |NULL|
+------------------------+----+
and my output should be like below
+------------------------+-----------+
|col1 |col2 |
+------------------------+-----------+
|This_is_111_222_333_test|111,222,333|
|This_is_111_222_444_test|3296 |
|This_is_555_and_666_test|555,666 |
|This_is_999_test |999 |
+------------------------+-----------+
Here is the code I have tried. It works only when the numbers are contiguous. Could you please help me with a solution?
df.withColumn("col2",when($"col2".isNull,regexp_replace(regexp_replace(regexp_extract($"col1","([0-9]+_)+",0),"_",","),".$","")).otherwise($"col2")).show(false)
I could do this with a UDF, but I am wondering whether it is possible with Spark's built-in functions. My Spark version is 2.2.0.
Thank you in advance.
A UDF is a good choice here. Performance is similar to that of the withColumn approach given in the OP (see benchmark below), and it works even if the numbers are not contiguous, which is one of the issues mentioned in the OP.
import org.apache.spark.sql.functions.udf
import scala.util.Try
def getNums = (c: String) => {
c.split("_").map(n => Try(n.toInt).getOrElse(0)).filter(_ > 0)
}
I recreated your data as follows
val data = Seq(("This_is_111_222_333_test", null.asInstanceOf[Array[Int]]),
("This_is_111_222_444_test",Array(3296)),
("This_is_555_666_test",null.asInstanceOf[Array[Int]]),
("This_is_999_test",null.asInstanceOf[Array[Int]]))
.toDF("col1","col2")
data.createOrReplaceTempView("data")
Register the UDF and run it in a query
spark.udf.register("getNums",getNums)
spark.sql("""select col1,
case when size(col2) > 0 then col2 else getNums(col1) end new_col
from data""").show
Which returns
+--------------------+---------------+
| col1| new_col|
+--------------------+---------------+
|This_is_111_222_3...|[111, 222, 333]|
|This_is_111_222_4...| [3296]|
|This_is_555_666_test| [555, 666]|
| This_is_999_test| [999]|
+--------------------+---------------+
Performance was tested with a larger data set created as follows:
val bigData = (0 to 1000).map(_ => data union data).reduce( _ union _)
bigData.createOrReplaceTempView("big_data")
With that, the solution given in the OP was compared to the UDF solution and found to be about the same.
// With UDF
spark.sql("""select col1,
case when size(col2) > 0 then col2 else getNums(col1) end new_col
from big_data""").count
// OP solution:
bigData.withColumn("col2",when($"col2".isNull,regexp_replace(regexp_replace(regexp_extract($"col1","([0-9]+_)+",0),"_",","),".$","")).otherwise($"col2")).count
Here is another way; please check its performance. Note that the filter and array_join SQL functions require Spark 2.4 or later.
df.withColumn("col2", expr("coalesce(col2, array_join(filter(split(col1, '_'), x -> CAST(x as INT) IS NOT NULL), ','))"))
.show(false)
+------------------------+-----------+
|col1 |col2 |
+------------------------+-----------+
|This_is_111_222_333_test|111,222,333|
|This_is_111_222_444_test|3296 |
|This_is_555_666_test |555,666 |
|This_is_999_test |999 |
+------------------------+-----------+
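The logic shared by the answers above (split on the underscore, keep only the numeric tokens, join them with commas, and fill col2 only when it is null) can be written down in plain Python as a reference for checking results. This is only a sketch outside Spark; the names extract_numbers and fill_col2 are hypothetical.

```python
def extract_numbers(text):
    """Keep the purely numeric tokens of an underscore-separated string."""
    tokens = text.split("_")
    return ",".join(t for t in tokens if t.isascii() and t.isdigit())

def fill_col2(col1, col2):
    """Mimic coalesce(col2, <numbers extracted from col1>)."""
    return col2 if col2 is not None else extract_numbers(col1)

rows = [
    ("This_is_111_222_333_test", None),
    ("This_is_111_222_444_test", "3296"),
    ("This_is_555_and_666_test", None),
    ("This_is_999_test", None),
]

result = [(c1, fill_col2(c1, c2)) for c1, c2 in rows]
for c1, c2 in result:
    print(c1, c2)
```

Because each token is tested on its own, the non-contiguous case (555_and_666) is handled the same way as the contiguous ones.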

Conditional aggregation Spark DataFrame

I would like to understand the best way to do an aggregation in Spark in this scenario:
import java.sql.Timestamp
import sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.TimestampType
case class Person(name:String, acc:Int, logDate:String)
val dateFormat = "dd/MM/yyyy"
val filterType = "MIN" // can be "MIN" or "MAX" depending on a run parameter
val filterDate = new Timestamp(System.currentTimeMillis)
val df = sc.parallelize(List(Person("Giorgio",20,"31/12/9999"),
Person("Giorgio",30,"12/10/2009"),
Person("Diego", 10,"12/10/2010"),
Person("Diego", 20,"12/10/2010"),
Person("Diego", 30,"22/11/2011"),
Person("Giorgio",10,"31/12/9999"),
Person("Giorgio",30,"31/12/9999"))).toDF()
val df2 = df.withColumn("logDate",unix_timestamp($"logDate",dateFormat).cast(TimestampType))
val df3 = df.groupBy("name").agg(/*conditional aggregation*/)
df3.show /*Expected output show below */
Basically, I want to group all records by the name column and then, based on the filterType parameter, keep only the valid records for each Person. After filtering, I want to sum all acc values, obtaining a final DataFrame with name and totalAcc columns.
For example:
filterType = MIN: I want to take all records having min(logDate) (there could be many of them), so in this case I completely ignore the filterDate parameter:
Diego,10,12/10/2010
Diego,20,12/10/2010
Giorgio,30,12/10/2009
Final result expected from aggregation is: (Diego, 30),(Giorgio,30)
filterType = MAX: I want to take all records with logDate > filterDate. If for a key I don't have any records satisfying this condition, I need to take the records with min(logDate), as in the MIN scenario, so:
Diego, 10, 12/10/2010
Diego, 20, 12/10/2010
Giorgio, 20, 31/12/9999
Giorgio, 10, 31/12/9999
Giorgio, 30, 31/12/9999
Final result expected from aggregation is: (Diego,30),(Giorgio,60)
In this case Diego has no records with logDate > filterDate, so I fall back to the MIN scenario and take, just for Diego, all records with the min logDate.
You can write your conditional aggregation using when/otherwise as
df2.groupBy("name").agg(sum(when(lit(filterType) === "MIN" && $"logDate" < filterDate, $"acc").otherwise(when(lit(filterType) === "MAX" && $"logDate" > filterDate, $"acc"))).as("sum"))
.filter($"sum".isNotNull)
which would give you your desired output according to filterType.
However, you would eventually need both aggregated DataFrames, so I suggest dropping the filterType field and aggregating with an additional grouping column created with the when/otherwise function. That way you get both aggregated values in one DataFrame:
df2.withColumn("additionalGrouping", when($"logDate" < filterDate, "less").otherwise("more"))
.groupBy("name", "additionalGrouping").agg(sum($"acc"))
.drop("additionalGrouping")
.show(false)
which would output as
+-------+--------+
|name |sum(acc)|
+-------+--------+
|Diego |10 |
|Giorgio|60 |
+-------+--------+
Updated
Since the question was updated with changed logic, here is the idea and a solution for the changed scenario:
import org.apache.spark.sql.expressions._
def windowSpec = Window.partitionBy("name").orderBy($"logDate".asc)
val minDF = df2.withColumn("minLogDate", first("logDate").over(windowSpec)).filter($"minLogDate" === $"logDate")
.groupBy("name")
.agg(sum($"acc").as("sum"))
val finalDF =
if(filterType == "MIN") {
minDF
}
else if(filterType == "MAX"){
val tempMaxDF = df2
.groupBy("name")
.agg(sum(when($"logDate" > filterDate,$"acc")).as("sum"))
tempMaxDF.filter($"sum".isNull).drop("sum").join(minDF, Seq("name"), "left").union(tempMaxDF.filter($"sum".isNotNull))
}
else {
df2
}
so for filterType = MIN you should have
+-------+---+
|name |sum|
+-------+---+
|Diego |30 |
|Giorgio|30 |
+-------+---+
and for filterType = MAX you should have
+-------+---+
|name |sum|
+-------+---+
|Diego |30 |
|Giorgio|60 |
+-------+---+
If filterType is neither MAX nor MIN, the original DataFrame is returned.
I hope the answer is helpful.
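For cross-checking the expected totals, the MIN/MAX rules from the question can be written as a small plain-Python reference. This is only a sketch outside Spark: the function name total_acc is hypothetical, and filter_date is fixed to 2020-01-01 as an assumption so that the result is reproducible (the question uses the current time).

```python
from collections import defaultdict
from datetime import datetime

records = [
    ("Giorgio", 20, "31/12/9999"),
    ("Giorgio", 30, "12/10/2009"),
    ("Diego", 10, "12/10/2010"),
    ("Diego", 20, "12/10/2010"),
    ("Diego", 30, "22/11/2011"),
    ("Giorgio", 10, "31/12/9999"),
    ("Giorgio", 30, "31/12/9999"),
]

def total_acc(records, filter_type, filter_date):
    """Sum acc per name following the MIN / MAX rules described in the question."""
    by_name = defaultdict(list)
    for name, acc, log_date in records:
        by_name[name].append((datetime.strptime(log_date, "%d/%m/%Y"), acc))

    totals = {}
    for name, rows in by_name.items():
        selected = []
        if filter_type == "MAX":
            # MAX: keep records strictly after filter_date
            selected = [acc for d, acc in rows if d > filter_date]
        if not selected:
            # MIN, or MAX with no matching record: fall back to the earliest logDate
            earliest = min(d for d, _ in rows)
            selected = [acc for d, acc in rows if d == earliest]
        totals[name] = sum(selected)
    return totals

filter_date = datetime(2020, 1, 1)  # assumed value for illustration
min_totals = total_acc(records, "MIN", filter_date)
max_totals = total_acc(records, "MAX", filter_date)
print(min_totals, max_totals)
```

With this data the MIN run gives 30 for both names, and the MAX run gives 30 for Diego (fallback) and 60 for Giorgio, matching the tables above.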
You don't need conditional aggregation. Just filter:
df2
.where(if (filterType == "MAX") $"logDate" > filterDate else $"logDate" < filterDate)
.groupBy("name").agg(sum($"acc"))

Spark Dataframe groupBy and sort results into a list

I have a Spark DataFrame and I would like to group the elements by a key and have the results as a sorted list.
Currently I am using:
df.groupBy("columnA").agg(collect_list("columnB"))
How do I sort the items in the list in ascending order?
You could try the function sort_array available in the functions package:
import org.apache.spark.sql.functions._
df.groupBy("columnA").agg(sort_array(collect_list("columnB")))
Just wanted to add another hint to Daniel de Paula's answer regarding the sort_array solution.
If you want to sort elements according to a different column, you can form a struct of two fields: the sort-by field first, then the result field.
Since structs are sorted field by field, you'll get the order you want; all you need to do is get rid of the sort-by column in each element of the resulting list.
The same approach can be applied with several sort-by columns when needed.
Here's an example that can be run in local spark-shell (use :paste mode):
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import spark.implicits._
case class Employee(name: String, department: String, salary: Double)
val employees = Seq(
Employee("JSMITH", "A", 20.0),
Employee("AJOHNSON", "A", 650.0),
Employee("CBAKER", "A", 650.2),
Employee("TGREEN", "A", 13.0),
Employee("CHORTON", "B", 111.0),
Employee("AIVANOV", "B", 233.0),
Employee("VSMIRNOV", "B", 11.0)
)
val employeesDF = spark.createDataFrame(employees)
val getNames = udf { salaryNames: Seq[Row] =>
salaryNames.map { case Row(_: Double, name: String) => name }
}
employeesDF
.groupBy($"department")
.agg(collect_list(struct($"salary", $"name")).as("salaryNames"))
.withColumn("namesSortedBySalary", getNames(sort_array($"salaryNames", asc = false)))
.show(truncate = false)
The result:
+----------+--------------------------------------------------------------------+----------------------------------+
|department|salaryNames |namesSortedBySalary |
+----------+--------------------------------------------------------------------+----------------------------------+
|B |[[111.0, CHORTON], [233.0, AIVANOV], [11.0, VSMIRNOV]] |[AIVANOV, CHORTON, VSMIRNOV] |
|A |[[20.0, JSMITH], [650.0, AJOHNSON], [650.2, CBAKER], [13.0, TGREEN]]|[CBAKER, AJOHNSON, JSMITH, TGREEN]|
+----------+--------------------------------------------------------------------+----------------------------------+
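The struct trick relies on field-by-field comparison, which is the same way Python compares tuples. As a rough plain-Python analogue (not Spark code, and using the same sample employees), you can put the sort-by value first in each pair, sort, and then drop it:

```python
from collections import defaultdict

# (name, department, salary), same sample data as above
employees = [
    ("JSMITH", "A", 20.0),
    ("AJOHNSON", "A", 650.0),
    ("CBAKER", "A", 650.2),
    ("TGREEN", "A", 13.0),
    ("CHORTON", "B", 111.0),
    ("AIVANOV", "B", 233.0),
    ("VSMIRNOV", "B", 11.0),
]

# Collect (salary, name) pairs per department: sort-by field first, result field second
pairs_by_department = defaultdict(list)
for name, department, salary in employees:
    pairs_by_department[department].append((salary, name))

# Sort each list in descending order, then keep only the name
names_sorted_by_salary = {
    department: [name for _, name in sorted(pairs, reverse=True)]
    for department, pairs in pairs_by_department.items()
}
print(names_sorted_by_salary)
```

Adding more sort-by columns just means adding more leading elements to each tuple before the result field.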

Resources