Related
How does Spark SQL implement the group by aggregate? I want to group by name field and based on the latest data to get the latest salary. How to write the SQL
The data is:
+-------+------|+---------|
// | name |salary|date |
// +-------+------|+---------|
// |AA | 3000|2022-01 |
// |AA | 4500|2022-02 |
// |BB | 3500|2022-01 |
// |BB | 4000|2022-02 |
// +-------+------+----------|
The expected result is:
+-------+------|
// | name |salary|
// +-------+------|
// |AA | 4500|
// |BB | 4000|
// +-------+------+
Assuming that the dataframe is registered as a temporary view named tmp, first use the row_number windowing function for each group (name) in reverse order by date Assign the line number (rn), and then take all the lines with rn=1.
sql = """
select name, salary from
(select *, row_number() over (partition by name order by date desc) as rn
from tmp)
where rn = 1
"""
df = spark.sql(sql)
df.show(truncate=False)
First convert your string to a date.
Covert the date to an UNixTimestamp.(number representation of a date, so you can use Max)
User "First" as an aggregate
function that retrieves a value of your aggregate results. (The first results, so if there is a date tie, it could pull either one.)
:
simpleData = [("James","Sales","NY",90000,34,'2022-02-01'),
("Michael","Sales","NY",86000,56,'2022-02-01'),
("Robert","Sales","CA",81000,30,'2022-02-01'),
("Maria","Finance","CA",90000,24,'2022-02-01'),
("Raman","Finance","CA",99000,40,'2022-03-01'),
("Scott","Finance","NY",83000,36,'2022-04-01'),
("Jen","Finance","NY",79000,53,'2022-04-01'),
("Jeff","Marketing","CA",80000,25,'2022-04-01'),
("Kumar","Marketing","NY",91000,50,'2022-05-01')
]
schema = ["employee_name","name","state","salary","age","updated"]
df = spark.createDataFrame(data=simpleData, schema = schema)
df.printSchema()
df.show(truncate=False)
df.withColumn(
"dateUpdated",
unix_timestamp(
to_date(
col("updated") ,
"yyyy-MM-dd"
)
)
).groupBy("name")
.agg(
max("dateUpdated"),
first("salary").alias("Salary")
).show()
+---------+----------------+------+
| name|max(dateUpdated)|Salary|
+---------+----------------+------+
| Sales| 1643691600| 90000|
| Finance| 1648785600| 90000|
|Marketing| 1651377600| 80000|
+---------+----------------+------+
My usual trick is to "zip" date and salary together (depends on what do you want to sort first)
from pyspark.sql import functions as F
(df
.groupBy('name')
.agg(F.max(F.array('date', 'salary')).alias('max_date_salary'))
.withColumn('max_salary', F.col('max_date_salary')[1])
.show()
)
+----+---------------+----------+
|name|max_date_salary|max_salary|
+----+---------------+----------+
| AA|[2022-02, 4500]| 4500|
| BB|[2022-02, 4000]| 4000|
+----+---------------+----------+
I'm having trouble spliting a dataframe's column into more columns in PySpark:
I have a list of lists and I want to transform it into a dataframe, each value in one column.
What I have tried:
I created a dataframe from this list:
[['COL-4560', 'COL-9655', 'NWG-0610', 'D81-3754'],
['DLL-7760', 'NAT-9885', 'PED-0550', 'MAR-0004', 'LLL-5554']]
Using this code:
from pyspark.sql import Row
R = Row('col1', 'col2')
# use enumerate to add the ID column
df_from_list = spark.createDataFrame([R(i, x) for i, x in enumerate(recs_list)])
The result I got is:
+----+--------------------+
|col1| col2|
+----+--------------------+
| 0|[COL-4560, COL-96...|
| 1|[DLL-7760, NAT-98...|
+----+--------------------+
I want to separate the values by comma into columns, so I tried:
from pyspark.sql import functions as F
df2 = df_from_list.select('col1', F.split('col2', ', ').alias('col2'))
# If you don't know the number of columns:
df_sizes = df2.select(F.size('col2').alias('col2'))
df_max = df_sizes.agg(F.max('col2'))
nb_columns = df_max.collect()[0][0]
df_result = df2.select('col1', *[df2['col2'][i] for i in range(nb_columns)])
df_result.show()
But I get an error on this line df2 = df_from_list.select('col1', F.split('col2', ', ').alias('col2')):
AnalysisException: cannot resolve 'split(`col2`, ', ', -1)' due to data type mismatch: argument 1 requires string type, however, '`col2`' is of array<string> type.;;
My ideal final output would be like this:
+----------+----------+----------+----------+----------+
| SKU | REC_01 | REC_02 | REC_03 | REC_04 |
+----------+----------+----------+----------+----------+
| COL-4560 | COL-9655 | NWG-0610 | D81-3754 | null |
| DLL-7760 | NAT-9885 | PED-0550 | MAR-0004 | LLL-5554 |
+---------------------+----------+----------+----------+
Some rows may have four values, but some my have more or less, I don't know the exact number of columns the final dataframe will have.
Does anyone have any idea of what is happening? Thank you very much in advance.
Dataframe df_from_list col2 column is already array type, so no need to split (as split works with stringtype here we have arraytype).
Here are the steps that will work for you.
recs_list=[['COL-4560', 'COL-9655', 'NWG-0610', 'D81-3754'],
['DLL-7760', 'NAT-9885', 'PED-0550', 'MAR-0004', 'LLL-5554']]
from pyspark.sql import Row
R = Row('col1', 'col2')
# use enumerate to add the ID column
df_from_list = spark.createDataFrame([R(i, x) for i, x in enumerate(recs_list)])
from pyspark.sql import functions as F
df2 = df_from_list
# If you don't know the number of columns:
df_sizes = df2.select(F.size('col2').alias('col2'))
df_max = df_sizes.agg(F.max('col2'))
nb_columns = df_max.collect()[0][0]
cols=['SKU','REC_01','REC_02','REC_03','REC_04']
df_result = df2.select(*[df2['col2'][i] for i in range(nb_columns)]).toDF(*cols)
df_result.show()
#+--------+--------+--------+--------+--------+
#| SKU| REC_01| REC_02| REC_03| REC_04|
#+--------+--------+--------+--------+--------+
#|COL-4560|COL-9655|NWG-0610|D81-3754| null|
#|DLL-7760|NAT-9885|PED-0550|MAR-0004|LLL-5554|
#+--------+--------+--------+--------+--------+
I am using spark-sql 2.4.x version , datastax-spark-cassandra-connector for Cassandra-3.x version. Along with kafka.
I have rates meta data of currency sample as below :
val ratesMetaDataDf = Seq(
("EUR","5/10/2019","1.130657","USD"),
("EUR","5/9/2019","1.13088","USD")
).toDF("base_code", "rate_date","rate_value","target_code")
.withColumn("rate_date", to_date($"rate_date" ,"MM/dd/yyyy").cast(DateType))
.withColumn("rate_value", $"rate_value".cast(DoubleType))
Sales records which i received from kafka topic is , as (sample) below
:
val kafkaDf = Seq((15,2016, 4, 100.5,"USD","2021-01-20","EUR",221.4)
).toDF("companyId", "year","quarter","sales","code","calc_date","c_code","prev_sales")
To calculate "prev_sales" , I need get its "c_code" 's respective "rate_value" which is nearest to the "calc_date" i.e. rate_date"
Which i am doing as following
val w2 = Window.orderBy(col("rate_date") desc)
val rateJoinResultDf = kafkaDf.as("k").join(ratesMetaDataDf.as("e"))
.where( ($"k.c_code" === $"e.base_code") &&
($"rate_date" < $"calc_date")
).orderBy($"rate_date" desc)
.withColumn("row",row_number.over(w2))
.where($"row" === 1).drop("row")
.withColumn("prev_sales", (col("prev_sales") * col("rate_value")).cast(DoubleType))
.select("companyId", "year","quarter","sales","code","calc_date","prev_sales")
In the above to get nearest record (i.e. "5/10/2019" from ratesMetaDataDf ) for given "rate_date" I am using window and row_number function and sorting the records by "desc".
But in the spark-sql streaming it is causing the error as below
"
Sorting is not supported on streaming DataFrames/Datasets, unless it is on aggregated DataFrame/Dataset in Complete output mode;;"
So how to fetch first record to join in the above.
Replace your last code part with below code. This code will do left join and calculate date difference calc_date & rate_date. Next Window function we will pick nearest date and calculate prev_sales by using same your calculation.
Please note I have added one filter condition filter(col("diff") >=0),
which will handle a case of calc_date < rate_date. I have added few
more records for better understanding of this case.
scala> ratesMetaDataDf.show
+---------+----------+----------+-----------+
|base_code| rate_date|rate_value|target_code|
+---------+----------+----------+-----------+
| EUR|2019-05-10| 1.130657| USD|
| EUR|2019-05-09| 1.12088| USD|
| EUR|2019-12-20| 1.1584| USD|
+---------+----------+----------+-----------+
scala> kafkaDf.show
+---------+----+-------+-----+----+----------+------+----------+
|companyId|year|quarter|sales|code| calc_date|c_code|prev_sales|
+---------+----+-------+-----+----+----------+------+----------+
| 15|2016| 4|100.5| USD|2021-01-20| EUR| 221.4|
| 15|2016| 4|100.5| USD|2019-06-20| EUR| 221.4|
+---------+----+-------+-----+----+----------+------+----------+
scala> val W = Window.partitionBy("companyId","year","quarter","sales","code","calc_date","c_code","prev_sales").orderBy(col("diff"))
scala> val rateJoinResultDf= kafkaDf.alias("k").join(ratesMetaDataDf.alias("r"), col("k.c_code") === col("r.base_code"), "left")
.withColumn("diff",datediff(col("calc_date"), col("rate_date")))
.filter(col("diff") >= 0)
.withColumn("closedate", row_number.over(W))
.filter(col("closedate") === 1)
.drop("diff", "closedate")
.withColumn("prev_sales", (col("prev_sales") * col("rate_value")).cast("Decimal(14,5)"))
.select("companyId", "year","quarter","sales","code","calc_date","prev_sales")
scala> rateJoinResultDf.show
+---------+----+-------+-----+----+----------+----------+
|companyId|year|quarter|sales|code| calc_date|prev_sales|
+---------+----+-------+-----+----+----------+----------+
| 15|2016| 4|100.5| USD|2021-01-20| 256.46976|
| 15|2016| 4|100.5| USD|2019-06-20| 250.32746|
+---------+----+-------+-----+----+----------+----------+
I'm trying to implement some kind of pagination feature for my app that using cassandra in the backend.
CREATE TABLE sample (
some_pk int,
some_id int,
name1 txt,
name2 text,
value text,
PRIMARY KEY (some_pk, some_id, name1, name2)
)
WITH CLUSTERING ORDER BY(some_id DESC)
I want to query 100 records, then store the last records keys in memory to use them later.
+---------+---------+-------+-------+-------+
| sample_pk| some_id | name1 | name2 | value |
+---------+---------+-------+-------+-------+
| 1 | 125 | x | '' | '' |
+---------+---------+-------+-------+-------+
| 1 | 124 | a | '' | '' |
+---------+---------+-------+-------+-------+
| 1 | 124 | b | '' | '' |
+---------+---------+-------+-------+-------+
| 1 | 123 | y | '' | '' |
+---------+---------+-------+-------+-------+
(for simplicity, i left some columns empty. partition key(sample_pk) is not important)
let's assume my page size is 2.
select * from sample where sample_pk=1 limit 2;
returns first 2 rows. now i store the last record in my query result and run query again to get next 2 rows;
this is the query that does not work because of restriction of a single non-EQ relation
select * from where sample_pk=1 and some_id <= 124 and name1>='a' and name2>='' limit 2;
and this one returns wrong results because some_id is in descending order and name columns are in ascending order.
select * from where sample_pk=1 and (some_id, name1, name2) <= (124, 'a', '') limit 2;
So I'm stuck. How can I implement pagination?
You can run your second query like,
select * from sample where some_pk =1 and some_id <= 124 limit x;
Now after fetching the records ignore the record(s) which you have already read (this can be done because you are storing the last record from the previous select query).
And after ignoring those records if you are end up with empty list of rows/records that means you have iterated over all the records else continue doing this for your pagination task.
You don't have to store any keys in memory, also you don't need to use limit in your cqlsh query. Just use the capabilities of datastax driver in your application code for doing pagination like the following code:
public Response getFromCassandra(Integer itemsPerPage, String pageIndex) {
Response response = new Response();
String query = "select * from sample where sample_pk=1";
Statement statement = new SimpleStatement(query).setFetchSize(itemsPerPage); // set the number of items we want per page (fetch size)
// imagine page '0' indicates the first page, so if pageIndex = '0' then there is no paging state
if (!pageIndex.equals("0")) {
statement.setPagingState(PagingState.fromString(pageIndex));
}
ResultSet rows = session.execute(statement); // execute the query
Integer numberOfRows = rows.getAvailableWithoutFetching(); // this should get only number of rows = fetchSize (itemsPerPage)
Iterator<Row> iterator = rows.iterator();
while (numberOfRows-- != 0) {
response.getRows.add(iterator.next());
}
PagingState pagingState = rows.getExecutionInfo().getPagingState();
if(pagingState != null) { // there is still remaining pages
response.setNextPageIndex(pagingState.toString());
}
return response;
}
note that if you make the while loop like the following:
while(iterator.hasNext()) {
response.getRows.add(iterator.next());
}
it will first fetch number of rows as equal as the fetch size we set, then as long as the query still matches some rows in Cassandra it will go fetch again from cassandra till it fetches all rows matching the query from cassandra which may not be intended if you want to implement a pagination feature
source: https://docs.datastax.com/en/developer/java-driver/3.2/manual/paging/
I would like to understand the best way to do an aggregation in Spark in this scenario:
import sqlContext.implicits._
import org.apache.spark.sql.functions._
case class Person(name:String, acc:Int, logDate:String)
val dateFormat = "dd/MM/yyyy"
val filterType = // Could has "MIN" or "MAX" depending on a run parameter
val filterDate = new Timestamp(System.currentTimeMillis)
val df = sc.parallelize(List(Person("Giorgio",20,"31/12/9999"),
Person("Giorgio",30,"12/10/2009")
Person("Diego", 10,"12/10/2010"),
Person("Diego", 20,"12/10/2010"),
Person("Diego", 30,"22/11/2011"),
Person("Giorgio",10,"31/12/9999"),
Person("Giorgio",30,"31/12/9999"))).toDF()
val df2 = df.withColumn("logDate",unix_timestamp($"logDate",dateFormat).cast(TimestampType))
val df3 = df.groupBy("name").agg(/*conditional aggregation*/)
df3.show /*Expected output show below */
Basically I want to group all records by name column and then based on the filterType parameter, I want to filter all valid records for a Person, then after filtering, I want to sum all acc values obtaining a final
DataFrame with name and totalAcc columns.
For example:
filterType = MIN , I want to take all records with having min(logDate) , could be many of them, so basically in this case I completely ignore filterDate param:
Diego,10,12/10/2010
Diego,20,12/10/2010
Giorgio,30,12/10/2009
Final result expected from aggregation is: (Diego, 30),(Giorgio,30)
filterType = MAX , I want to take all records with logDate > filterDate, I for a key I don't have any records respecting this condition, I need to take records with min(logDate) as done in MIN scenario, so:
Diego, 10, 12/10/2010
Diego, 20, 12/10/2010
Giorgio, 20, 31/12/9999
Giorgio, 10, 31/12/9999
Giorgio, 30, 31/12/9999
Final result expected from aggregation is: (Diego,30),(Giorgio,60)
In this case for Diego I didn't have any records with logDate > logFilter, so I fallback to apply MIN scenario, taking just for Diego all records with min logDate.
You can write your conditional aggregation using when/otherwise as
df2.groupBy("name").agg(sum(when(lit(filterType) === "MIN" && $"logDate" < filterDate, $"acc").otherwise(when(lit(filterType) === "MAX" && $"logDate" > filterDate, $"acc"))).as("sum"))
.filter($"sum".isNotNull)
which would give you your desired output according to filterType
But
Eventually you would require both aggregated dataframes so I would suggest you to avoid filterType field and just go with aggregation by creating additional column for grouping using when/otherwise function. So that you can have both aggregated values in one dataframe as
df2.withColumn("additionalGrouping", when($"logDate" < filterDate, "less").otherwise("more"))
.groupBy("name", "additionalGrouping").agg(sum($"acc"))
.drop("additionalGrouping")
.show(false)
which would output as
+-------+--------+
|name |sum(acc)|
+-------+--------+
|Diego |10 |
|Giorgio|60 |
+-------+--------+
Updated
Since the question is updated with the logic changed, here is the idea and solution to the changed scenario
import org.apache.spark.sql.expressions._
def windowSpec = Window.partitionBy("name").orderBy($"logDate".asc)
val minDF = df2.withColumn("minLogDate", first("logDate").over(windowSpec)).filter($"minLogDate" === $"logDate")
.groupBy("name")
.agg(sum($"acc").as("sum"))
val finalDF =
if(filterType == "MIN") {
minDF
}
else if(filterType == "MAX"){
val tempMaxDF = df2
.groupBy("name")
.agg(sum(when($"logDate" > filterDate,$"acc")).as("sum"))
tempMaxDF.filter($"sum".isNull).drop("sum").join(minDF, Seq("name"), "left").union(tempMaxDF.filter($"sum".isNotNull))
}
else {
df2
}
so for filterType = MIN you should have
+-------+---+
|name |sum|
+-------+---+
|Diego |30 |
|Giorgio|30 |
+-------+---+
and for filterType = MAX you should have
+-------+---+
|name |sum|
+-------+---+
|Diego |30 |
|Giorgio|60 |
+-------+---+
In case if the filterType isn't MAX or MIN then original dataframe is returned
I hope the answer is helpful
You don't need conditional aggregation. Just filter:
df
.where(if (filterType == "MAX") $"logDate" < filterDate else $"logDate" > filterDate)
.groupBy("name").agg(sum($"acc")