Spark structured streaming drop duplicates keep last - apache-spark

I would like to maintain a streaming dataframe that receives updates.
To do so I am using dropDuplicates.
But dropDuplicates keeps the first record it sees and drops the latest change.
How can I retain only the last one?

Assuming you need to keep the last record per id and remove the other duplicates, you can use window functions and filter on row_number = count. For example:
scala> val df = Seq((120,34.56,"2018-10-11"),(120,65.73,"2018-10-14"),(120,39.96,"2018-10-20"),(122,11.56,"2018-11-20"),(122,24.56,"2018-10-20")).toDF("id","amt","dt")
df: org.apache.spark.sql.DataFrame = [id: int, amt: double ... 1 more field]
scala> val df2=df.withColumn("dt",'dt.cast("date"))
df2: org.apache.spark.sql.DataFrame = [id: int, amt: double ... 1 more field]
scala> df2.show(false)
+---+-----+----------+
|id |amt |dt |
+---+-----+----------+
|120|34.56|2018-10-11|
|120|65.73|2018-10-14|
|120|39.96|2018-10-20|
|122|11.56|2018-11-20|
|122|24.56|2018-10-20|
+---+-----+----------+
scala> df2.createOrReplaceTempView("ido")
scala> spark.sql(""" select id,amt,dt,row_number() over(partition by id order by dt) rw, count(*) over(partition by id) cw from ido """).show(false)
+---+-----+----------+---+---+
|id |amt |dt |rw |cw |
+---+-----+----------+---+---+
|122|24.56|2018-10-20|1 |2 |
|122|11.56|2018-11-20|2 |2 |
|120|34.56|2018-10-11|1 |3 |
|120|65.73|2018-10-14|2 |3 |
|120|39.96|2018-10-20|3 |3 |
+---+-----+----------+---+---+
scala> spark.sql(""" select id,amt,dt from (select id,amt,dt,row_number() over(partition by id order by dt) rw, count(*) over(partition by id) cw from ido) where rw=cw """).show(false)
+---+-----+----------+
|id |amt |dt |
+---+-----+----------+
|122|11.56|2018-11-20|
|120|39.96|2018-10-20|
+---+-----+----------+
If you want to sort on dt descending, you can just give "order by dt desc" in the over() clause; the latest row then has rw = 1, so filter on rw = 1 instead of rw = cw. Does this help?
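The question is about a stream, while the window query above runs over a static table. As a rough illustration of the same "keep only the latest row per key" idea applied one record at a time, here is a plain-Python sketch (it is not Spark code). The function name keep_latest and the tuple layout are made up for this example; the data is the sample from the answer.

```python
from datetime import date


def keep_latest(records):
    """Return one record per id: the one with the greatest dt seen so far."""
    latest = {}
    for rec_id, amt, dt in records:
        current = latest.get(rec_id)
        # Replace the stored row only when the incoming row is at least as recent.
        if current is None or dt >= current[2]:
            latest[rec_id] = (rec_id, amt, dt)
    # Sort by id so the output order is deterministic.
    return sorted(latest.values())


rows = [
    (120, 34.56, date(2018, 10, 11)),
    (120, 65.73, date(2018, 10, 14)),
    (120, 39.96, date(2018, 10, 20)),
    (122, 11.56, date(2018, 11, 20)),
    (122, 24.56, date(2018, 10, 20)),
]
result = keep_latest(rows)
for row in result:
    print(row)
```

The two rows it keeps are the same two that the rw = cw filter returns. How such per-key state would be wired into a Structured Streaming query is not covered by the answer, so treat this only as a picture of the comparison logic.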

Related

SparkSQL query using "PARTITION by" giving wrong output

I have a bunch of csv files which I am processing with PySpark for speed. However, I am a total noob with Spark (PySpark). So far I have been able to create an RDD, a subsequent data frame and a temporary view (country_name) to easily query the data.
Input Data
+---+--------------------------+-------+--------------------------+-------------------+
|ID |NAME |COUNTRY|ADDRESS |DESCRIPTION |
+---+--------------------------+-------+--------------------------+-------------------+
|1 | |QAT | |INTERIOR DECORATING|
|2 |S&T |QAT |AL WAAB STREET |INTERIOR DECORATING|
|3 | |QAT | |INTERIOR DECORATING|
|4 |THE ROSA BERNAL COLLECTION|QAT | |INTERIOR DECORATING|
|5 | |QAT |AL SADD STREET |INTERIOR DECORATING|
|6 |AL MANA |QAT |SALWA ROAD |INTERIOR DECORATING|
|7 | |QAT |SUHAIM BIN HAMAD STREET |INTERIOR DECORATING|
|8 |INTERTEC |QAT |AL MIRQAB AL JADEED STREET|INTERIOR DECORATING|
|9 | |EGY | |HOTELS |
|10 | |EGY |QASIM STREET |HOTELS |
|11 |AIRPORT HOTEL |EGY | |HOTELS |
|12 | |EGY |AL SOUQ |HOTELS |
+---+--------------------------+-------+--------------------------+-------------------+
I am stuck trying to convert this particular PostgreSQL query into sparksql.
select country,
name as 'col_name',
description,
ct,
ct_desc,
(ct*100/ct_desc)
from
(select description,
country,
count(name) over (PARTITION by description) as ct,
count(description) over (PARTITION by description) as ct_desc
from country_table
) x
group by 1,2,3,4,5,6
Correct output from PostgreSQL -
+-------+--------+-------------------+--+-------+----------------+
|country|col_name|description |ct|ct_desc|(ct*100/ct_desc)|
+-------+--------+-------------------+--+-------+----------------+
|QAT |name |INTERIOR DECORATING|7 |14 |50.0 |
+-------+--------+-------------------+--+-------+----------------+
Here is the sparksql query I am using -
df_fill_by_col = spark.sql("""select country,
name as 'col_name',
description,
ct,
ct_desc,
(ct*100/ct_desc)
from
( Select description,
country,
count(name) over (PARTITION by description) as ct,
count(description) over (PARTITION by description) as ct_desc
from country_name
)x
group by 1,2,3,4,5,6 """)
df_fill_by_col.show()
From SparkSQL -
+-------+--------+-------------------+--+-------+----------------+
|country|col_name|description |ct|ct_desc|(ct*100/ct_desc)|
+-------+--------+-------------------+--+-------+----------------+
|QAT |name |INTERIOR DECORATING|14|14 |100.0 |
+-------+--------+-------------------+--+-------+----------------+
The Spark SQL query gives odd output, especially where some values in the dataframe are null.
For the same file and record, the ct column comes out doubled: 14 instead of 7.
Below is the entire code, from reading the csv file to creating dataframe and querying data.
from __future__ import print_function
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit
import csv, copy, os, sys, unicodedata, string, time, glob
from pyspark.sql.types import Row, StructField, StructType, StringType, IntegerType
if __name__ == "__main__":
    spark = SparkSession.builder.appName("PythonSQL").config("spark.some.config.option", "some-value").getOrCreate()
    sc = spark.sparkContext
    lines = sc.textFile("path_to_csvfiles")
    parts = lines.map(lambda l: l.split("|"))
    country_name = parts.map(lambda p: (p[0], p[1], p[2], p[3], p[4].strip()))
    schemaString = "ID NAME COUNTRY ADDRESS DESCRIPTION"
    fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
    df_schema = StructType(fields)
    df_schema1 = spark.createDataFrame(country_name, df_schema)
    df_schema1.createOrReplaceTempView("country_name")
    df_schema1.cache()
    df_fill_by_col = spark.sql("select country, name as 'col_name', description, ct, ct_desc, (ct*100/ct_desc) from ( Select description, country, count(name) over (PARTITION by description) as ct, count(description) over (PARTITION by description) as ct_desc from country_name )x group by 1,2,3,4,5,6 ")
    df_fill_by_col.show()
Please let me know if there is a way of getting the sparksql query to work.
Thanks,
Pankaj
Edit - This code will run on multiple countries and columns
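One possible cause, offered as an assumption and not something confirmed in this thread: count(name) counts non-NULL values, and the loading code above builds each row with l.split("|"), which produces empty strings for blank fields instead of NULLs. An empty string is not NULL, so every row would be counted. The plain-Python sketch below (not Spark code) shows the difference on the NAME values of the eight QAT rows in the sample table; count_non_null and blank_to_none are hypothetical helper names.

```python
def count_non_null(values):
    """Mimic SQL count(col): count the entries that are not NULL (None)."""
    return sum(1 for v in values if v is not None)


def blank_to_none(value):
    """Treat empty or whitespace-only strings as NULL."""
    return value if value.strip() else None


# Splitting a pipe-delimited line leaves blanks as "" rather than None.
fields = "1||QAT||INTERIOR DECORATING".split("|")

# NAME column of the eight QAT rows shown in the input table.
names = ["", "S&T", "", "THE ROSA BERNAL COLLECTION", "", "AL MANA", "", "INTERTEC"]

count_with_blanks = count_non_null(names)
count_with_nulls = count_non_null([blank_to_none(n) for n in names])
print(count_with_blanks, count_with_nulls)
```

If this is indeed the cause, something along the lines of count(nullif(trim(name), '')) in the Spark SQL query, or converting blanks to nulls while building the dataframe, might bring the two results in line.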

SparkSQL - Extract multiple regex matches (using SQL only)

I have a dataset of SQL queries in raw text and another with a regular expression of all the possible table names:
# queries
+-----+----------------------------------------------+
| id | query |
+-----+----------------------------------------------+
| 1 | select * from table_a, table_b |
| 2 | select * from table_c join table_d... |
+-----+----------------------------------------------+
# regexp
'table_a|table_b|table_c|table_d'
And I wanted the following result:
# expected result
+-----+----------------------------------------------+
| id | tables |
+-----+----------------------------------------------+
| 1 | [table_a, table_b] |
| 2 | [table_c, table_d] |
+-----+----------------------------------------------+
But using the following SQL in Spark, all I get is the first match...
select
id,
regexp_extract(query, 'table_a|table_b|table_c|table_d') as tables
from queries
# actual result
+-----+----------------------------------------------+
| id | tables |
+-----+----------------------------------------------+
| 1 | table_a |
| 2 | table_c |
+-----+----------------------------------------------+
Is there any way to do this using only Spark SQL? This is the function I am using https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/sql/#regexp_extract
EDIT
I would also accept a solution that returned the following:
# alternative solution
+-----+----------------------------------------------+
| id | tables |
+-----+----------------------------------------------+
| 1 | table_a |
| 1 | table_b |
| 2 | table_c |
| 2 | table_d |
+-----+----------------------------------------------+
SOLUTION
#chlebek solved this below. I reformatted his SQL using CTEs for better readability:
with
split_queries as (
select
id,
explode(split(query, ' ')) as col
from queries
),
extracted_tables as (
select
id,
regexp_extract(col, 'table_a|table_b|table_c|table_d', 0) as rx
from split_queries
)
select
id,
collect_set(rx) as tables
from extracted_tables
where rx != ''
group by id
Bear in mind that the split(query, ' ') part of the query will split your SQL only by spaces. If you have other things such as tabs, line breaks, comments, etc., you should deal with these before or when splitting.
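To make that caveat concrete: regexp_extract returns only the first match in each token, so two table names that end up in the same token (for example "table_a,table_b" with no space) lose the second name. The plain-Python sketch below (not Spark code) imitates the split-then-first-match approach and compares it with collecting every match; the function names are made up for this example.

```python
import re

PATTERN = re.compile(r"table_a|table_b|table_c|table_d")


def tables_via_space_split(query):
    """Imitate explode(split(query, ' ')) followed by a first-match extract."""
    found = []
    for token in query.split(" "):
        match = PATTERN.search(token)  # first match only, like regexp_extract
        if match and match.group(0) not in found:
            found.append(match.group(0))
    return found


def tables_via_findall(query):
    """Collect every match in the whole string, keeping first-seen order."""
    found = []
    for name in PATTERN.findall(query):
        if name not in found:
            found.append(name)
    return found


tight_query = "select * from table_a,table_b"
print(tables_via_space_split(tight_query))
print(tables_via_findall(tight_query))
```

Newer Spark releases (3.1 and later, as far as I know) also provide a regexp_extract_all SQL function, which behaves much like findall here and may remove the need for the split altogether.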
If you have only a few values to check you can achieve it using contains function instead of regexp:
val names = Seq("table_a","table_b","table_c","table_d")
def c(col: Column) = names.map(n => when(col.contains(n),n).otherwise(""))
df.select('id,array_remove(array(c('query):_*),"").as("result")).show(false)
but using regexp it will look like below (Spark SQL API):
df.select('id,explode(split('query," ")))
.select('id,regexp_extract('col,"table_a|table_b|table_c|table_d",0).as("rx"))
.filter('rx=!="")
.groupBy('id)
.agg(collect_list('rx))
and it could be translated to below SQL query:
select id, collect_list(rx) from
(select id, regexp_extract(col,'table_a|table_b|table_c|table_d',0) as rx from
(select id, explode(split(query,' ')) as col from df) q1
) q2
where rx != '' group by id
so output will be:
+---+------------------+
| id| collect_list(rx)|
+---+------------------+
| 1|[table_a, table_b]|
| 2|[table_c, table_d]|
+---+------------------+
As you are using spark-sql, you can use the SQL parser and it will do the job for you.
def getTables(query: String): Seq[String] = {
val logicalPlan = spark.sessionState.sqlParser.parsePlan(query)
import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
logicalPlan.collect { case r: UnresolvedRelation => r.tableName }
}
val query = "select * from table_1 as a left join table_2 as b on a.id=b.id"
scala> getTables(query).foreach(println)
table_1
table_2
You can register 'getTables' as udf & use in query
You can use another SQL function available in Spark called collect_list https://docs.databricks.com/spark/latest/spark-sql/language-manual/functions.html#collect_list. You can find another sample https://mungingdata.com/apache-spark/arraytype-columns/
Basically, applying to your code it should be
val df = spark.sql("select 1 id, 'select * from table_a, table_b' query" )
val df1 = spark.sql("select 2 id, 'select * from table_c join table_d' query" )
val df3 = df.union(df1)
df3.createOrReplaceTempView("tabla")
spark.sql("""
select id, collect_list(tables) from (
select id, explode(split(query, ' ')) as tables
from tabla)
where tables like 'table%' group by id""").show
The output will be
+---+--------------------+
| id|collect_list(tables)|
+---+--------------------+
| 1| [table_a,, table_b]|
| 2| [table_c, table_d]|
+---+--------------------+
Hope this helps
If you are on Spark >= 2.4, you can avoid the explode and collect steps, and any subqueries, by using higher-order functions on arrays:
Load the test data
val data =
"""
|id | query
|1 | select * from table_a, table_b
|2 | select * from table_c join table_d on table_c.id=table_d.id
""".stripMargin
val stringDS = data.split(System.lineSeparator())
.map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(";"))
.toSeq.toDS()
val df = spark.read
.option("sep", ";")
.option("inferSchema", "true")
.option("header", "true")
.option("nullValue", "null")
.csv(stringDS)
df.printSchema()
df.show(false)
/**
* root
* |-- id: integer (nullable = true)
* |-- query: string (nullable = true)
*
* +---+-----------------------------------------------------------+
* |id |query |
* +---+-----------------------------------------------------------+
* |1 |select * from table_a, table_b |
* |2 |select * from table_c join table_d on table_c.id=table_d.id|
* +---+-----------------------------------------------------------+
*/
Extract the tables from query
// spark >= 2.4.0
df.createOrReplaceTempView("queries")
spark.sql(
"""
|select id,
| array_distinct(
| FILTER(
| split(query, '\\.|=|\\s+|,'), x -> x rlike 'table_a|table_b|table_c|table_d'
| )
| )as tables
|FROM
| queries
""".stripMargin)
.show(false)
/**
* +---+------------------+
* |id |tables |
* +---+------------------+
* |1 |[table_a, table_b]|
* |2 |[table_c, table_d]|
* +---+------------------+
*/

How do i change string to HH:mm:ss only in spark

I am getting the time as a string like 134455 and I need to convert it into 13:44:55 using Spark SQL. How can I get it in the right format?
You can try the regexp_replace function.
scala> val df = Seq((134455 )).toDF("ts_str")
df: org.apache.spark.sql.DataFrame = [ts_str: int]
scala> df.show(false)
+------+
|ts_str|
+------+
|134455|
+------+
scala> df.withColumn("ts",regexp_replace('ts_str,"""(\d\d)""","$1:")).show(false)
+------+---------+
|ts_str|ts |
+------+---------+
|134455|13:44:55:|
+------+---------+
scala> df.withColumn("ts",trim(regexp_replace('ts_str,"""(\d\d)""","$1:"),":")).show(false)
+------+--------+
|ts_str|ts |
+------+--------+
|134455|13:44:55|
+------+--------+
scala>
val df = Seq("133456").toDF
+------+
| value|
+------+
|133456|
+------+
df.withColumn("value", unix_timestamp('value, "HHmmss"))
.withColumn("value", from_unixtime('value, "HH:mm:ss"))
.show
+--------+
| value|
+--------+
|13:34:56|
+--------+
Note that a unix timestamp is stored as the number of seconds since 00:00:00, 1 January 1970. If you try to convert a time with millisecond accuracy to a timestamp, you will lose the millisecond part of the time. For times including milliseconds, you will need to use a different approach.
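Both approaches above can be pictured in plain Python (this is not Spark code, and the function names are made up for the example). One thing the sketch brings out: when the time arrives as an integer, as in the first answer's ts_str column, a value such as 090000 loses its leading zero, and inserting a colon after each digit pair then gives the wrong result unless the value is padded back to six digits first.

```python
import re
from datetime import datetime


def format_hhmmss(raw):
    """Turn 134455 or "134455" into "13:44:55", left-padding to six digits."""
    digits = str(raw).zfill(6)
    # Parse as a time of day, then print it with colons.
    return datetime.strptime(digits, "%H%M%S").strftime("%H:%M:%S")


def format_with_regex(raw):
    """Same idea as the regexp_replace answer: add ':' after each digit pair."""
    return re.sub(r"(\d\d)", r"\1:", str(raw)).rstrip(":")


print(format_hhmmss(134455))      # 13:44:55
print(format_with_regex(134455))  # 13:44:55
print(format_with_regex(90000))   # 90:00:0 (leading zero was lost)
print(format_hhmmss(90000))       # 09:00:00
```

In Spark the padding step would presumably be an lpad on the string form of the column before applying either approach; that part is a suggestion, not something from the answers above.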

pyspark AnalysisException: "Reference '<COLUMN>' is ambiguous [duplicate]

I have two dataframes with the following columns:
df1.columns
// Array(ts, id, X1, X2)
and
df2.columns
// Array(ts, id, Y1, Y2)
After I do
val df_combined = df1.join(df2, Seq(ts,id))
I end up with the following columns: Array(ts, id, X1, X2, ts, id, Y1, Y2). I would expect the common columns to be dropped. Is there something additional that needs to be done?
The simple answer (from the Databricks FAQ on this matter) is to perform the join where the joined columns are expressed as an array of strings (or one string) instead of a predicate.
Below is an example adapted from the Databricks FAQ but with two join columns in order to answer the original poster's question.
Here is the left dataframe:
val llist = Seq(("bob", "b", "2015-01-13", 4), ("alice", "a", "2015-04-23",10))
val left = llist.toDF("firstname","lastname","date","duration")
left.show()
/*
+---------+--------+----------+--------+
|firstname|lastname| date|duration|
+---------+--------+----------+--------+
| bob| b|2015-01-13| 4|
| alice| a|2015-04-23| 10|
+---------+--------+----------+--------+
*/
Here is the right dataframe:
val right = Seq(("alice", "a", 100),("bob", "b", 23)).toDF("firstname","lastname","upload")
right.show()
/*
+---------+--------+------+
|firstname|lastname|upload|
+---------+--------+------+
| alice| a| 100|
| bob| b| 23|
+---------+--------+------+
*/
Here is an incorrect solution, where the join columns are defined as the predicate left("firstname")===right("firstname") && left("lastname")===right("lastname").
The incorrect result is that the firstname and lastname columns are duplicated in the joined data frame:
left.join(right, left("firstname")===right("firstname") &&
left("lastname")===right("lastname")).show
/*
+---------+--------+----------+--------+---------+--------+------+
|firstname|lastname| date|duration|firstname|lastname|upload|
+---------+--------+----------+--------+---------+--------+------+
| bob| b|2015-01-13| 4| bob| b| 23|
| alice| a|2015-04-23| 10| alice| a| 100|
+---------+--------+----------+--------+---------+--------+------+
*/
The correct solution is to define the join columns as an array of strings Seq("firstname", "lastname"). The output data frame does not have duplicated columns:
left.join(right, Seq("firstname", "lastname")).show
/*
+---------+--------+----------+--------+------+
|firstname|lastname| date|duration|upload|
+---------+--------+----------+--------+------+
| bob| b|2015-01-13| 4| 23|
| alice| a|2015-04-23| 10| 100|
+---------+--------+----------+--------+------+
*/
This is an expected behavior. DataFrame.join method is equivalent to SQL join like this
SELECT * FROM a JOIN b ON joinExprs
If you want to ignore duplicate columns, just drop them or select the columns of interest afterwards. If you want to disambiguate, you can access them through the parent DataFrames:
val a: DataFrame = ???
val b: DataFrame = ???
val joinExprs: Column = ???
a.join(b, joinExprs).select(a("id"), b("foo"))
// drop equivalent
a.alias("a").join(b.alias("b"), joinExprs).drop(b("id")).drop(a("foo"))
or use aliases:
// As for now aliases don't work with drop
a.alias("a").join(b.alias("b"), joinExprs).select($"a.id", $"b.foo")
For equi-joins there exists a special shortcut syntax which takes either a sequence of strings:
val usingColumns: Seq[String] = ???
a.join(b, usingColumns)
or a single string
val usingColumn: String = ???
a.join(b, usingColumn)
which keeps only one copy of the columns used in the join condition.
I was stuck on this for a while, and only recently came up with a solution that is quite easy.
Say a is
scala> val a = Seq(("a", 1), ("b", 2)).toDF("key", "vala")
a: org.apache.spark.sql.DataFrame = [key: string, vala: int]
scala> a.show
+---+----+
|key|vala|
+---+----+
| a| 1|
| b| 2|
+---+----+
and
scala> val b = Seq(("a", 1)).toDF("key", "valb")
b: org.apache.spark.sql.DataFrame = [key: string, valb: int]
scala> b.show
+---+----+
|key|valb|
+---+----+
| a| 1|
+---+----+
and I can do this to select only the value in dataframe a:
scala> a.join(b, a("key") === b("key"), "left").select(a.columns.map(a(_)) : _*).show
+---+----+
|key|vala|
+---+----+
| a| 1|
| b| 2|
+---+----+
You can simply use this
df1.join(df2, Seq("ts","id"),"TYPE-OF-JOIN")
Here TYPE-OF-JOIN can be
left
right
inner
fullouter
For example, I have two dataframes like this:
// df1
word count1
w1 10
w2 15
w3 20
// df2
word count2
w1 100
w2 150
w5 200
If you do fullouter join then the result looks like this
df1.join(df2, Seq("word"),"fullouter").show()
word count1 count2
w1 10 100
w2 15 150
w3 20 null
w5 null 200
try this,
val df_combined = df1.join(df2, df1("ts") === df2("ts") && df1("id") === df2("id")).drop(df2("ts")).drop(df2("id"))
This is normal SQL behavior. What I do about it:
Drop or Rename source columns
Do the join
Drop renamed column if any
Here I am replacing "fullname" column:
Some code in Java:
this
.sqlContext
.read()
.parquet(String.format("hdfs:///user/blablacar/data/year=%d/month=%d/day=%d", year, month, day))
.drop("fullname")
.registerTempTable("data_original");
this
.sqlContext
.read()
.parquet(String.format("hdfs:///user/blablacar/data_v2/year=%d/month=%d/day=%d", year, month, day))
.registerTempTable("data_v2");
this
.sqlContext
.sql(etlQuery)
.repartition(1)
.write()
.mode(SaveMode.Overwrite)
.parquet(outputPath);
Where the query is:
SELECT
d.*,
concat_ws('_', product_name, product_module, name) AS fullname
FROM
{table_source} d
LEFT OUTER JOIN
{table_updates} u ON u.id = d.id
This is something you can do only with Spark I believe (drop column from list), very very helpful!
Inner join is the default join in Spark. Below is the simple syntax for it.
leftDF.join(rightDF, "Common Col Name")
For other joins you can follow the syntax below
leftDF.join(rightDF, Seq("col1", "col2"), "join type")
If the column names are not common then
leftDF.join(rightDF, leftDF.col("x") === rightDF.col("y"), "join type")
Best practice is to make the column names different in both DataFrames before joining them, and drop accordingly.
df1.columns = [id, age, income]
df2.columns = [id, age_group]
df1.join(df2, on=df1.id == df2.id, how='inner').write.saveAsTable('table_name')
will return an error for duplicate columns.
Try this instead:
df2_id_renamed = df2.withColumnRenamed('id', 'id_2')
df1.join(df2_id_renamed, on=df1.id == df2_id_renamed.id_2, how='inner').drop('id_2')
If anyone is using spark-SQL and wants to achieve the same thing then you can use USING clause in join query.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
val df1 = List((1, 4, 3), (5, 2, 4), (7, 4, 5)).toDF("c1", "c2", "C3")
val df2 = List((1, 4, 3), (5, 2, 4), (7, 4, 10)).toDF("c1", "c2", "C4")
df1.createOrReplaceTempView("table1")
df2.createOrReplaceTempView("table2")
spark.sql("select * from table1 inner join table2 using (c1, c2)").show(false)
/*
+---+---+---+---+
|c1 |c2 |C3 |C4 |
+---+---+---+---+
|1 |4 |3 |3 |
|5 |2 |4 |4 |
|7 |4 |5 |10 |
+---+---+---+---+
*/
After I've joined multiple tables together, I run them through a simple function to rename columns in the DF if it encounters duplicates. Alternatively, you could drop these duplicate columns too.
Where Names is a table with columns ['Id', 'Name', 'DateId', 'Description'] and Dates is a table with columns ['Id', 'Date', 'Description'], the columns Id and Description will be duplicated after being joined.
Names = sparkSession.sql("SELECT * FROM Names")
Dates = sparkSession.sql("SELECT * FROM Dates")
NamesAndDates = Names.join(Dates, Names.DateId == Dates.Id, "inner")
NamesAndDates = deDupeDfCols(NamesAndDates, '_')
NamesAndDates.saveAsTable("...", format="parquet", mode="overwrite", path="...")
Where deDupeDfCols is defined as:
def deDupeDfCols(df, separator=''):
    newcols = []
    for col in df.columns:
        if col not in newcols:
            newcols.append(col)
        else:
            for i in range(2, 1000):
                if (col + separator + str(i)) not in newcols:
                    newcols.append(col + separator + str(i))
                    break
    return df.toDF(*newcols)
With the '_' separator passed in the call above, the resulting data frame will contain columns ['Id', 'Name', 'DateId', 'Description', 'Id_2', 'Date', 'Description_2'].
Apologies this answer is in Python - I'm not familiar with Scala, but this was the question that came up when I Googled this problem and I'm sure Scala code isn't too different.

Spark SCALA - Joining two dataframes where join value in one dataframe is between two fields in the second dataframe

I have two dataframes (deleting the fields that are not relevant to the question):
df1: org.apache.spark.sql.DataFrame = [rawValue: bigint]
df2: org.apache.spark.sql.DataFrame = [startLong: bigint, endLong: bigint]
I now want to join the two dataframes where:
rawValue(df1) >= startLong(df2) AND <= endLong(df2)
Can anyone recommend an efficient way of doing this? The one option I was thinking of was to flatmap df2 and then do a straight join, but I don't want to do that if there is an efficient way to do the above join.
You can directly use the condition that you have while joining the two dataframes
Let me illustrate with an example. I created two dataframes identical to the ones you've mentioned
val df1 = Seq((2L), (5L), (15L), (9L)).toDF("rawValue")
//df1: org.apache.spark.sql.DataFrame = [rawValue: bigint]
val df2 = Seq((3L, 5L), (10L, 16L), (9L, 9L)).toDF("startLong", "endLong")
//df2: org.apache.spark.sql.DataFrame = [startLong: bigint, endLong: bigint]
I now want to join the two dataframes where rawValue(df1) >= startLong(df2) AND <= endLong(df2)
For that you can use the condition as
df1.join(df2, df1("rawValue") >= df2("startLong") && df1("rawValue") <= df2("endLong")).show(false)
which should give you
+--------+---------+-------+
|rawValue|startLong|endLong|
+--------+---------+-------+
|5 |3 |5 |
|15 |10 |16 |
|9 |9 |9 |
+--------+---------+-------+
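The semantics of that join condition can be checked with a plain-Python sketch (not Spark code; range_join is a made-up name). It pairs every raw value with every range that contains it, using the same sample data as the answer.

```python
def range_join(raw_values, ranges):
    """Pair each raw value with every (start, end) range that contains it."""
    return [
        (value, start, end)
        for value in raw_values
        for start, end in ranges
        if start <= value <= end  # rawValue >= startLong AND rawValue <= endLong
    ]


df1_values = [2, 5, 15, 9]
df2_ranges = [(3, 5), (10, 16), (9, 9)]

joined = range_join(df1_values, df2_ranges)
for row in joined:
    print(row)
```

Note that this sketch compares every value against every range, so its cost grows with the product of the two input sizes. Whether the Spark join behaves similarly for large inputs is worth checking in the query plan, since efficiency was the original concern.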

Resources