how to handle this in spark - apache-spark

I am using spark-sql 2.4.x and the datastax-spark-cassandra-connector for Cassandra 3.x, along with Kafka.
I have a scenario where some finance data comes from a Kafka topic. The data (base dataset) contains companyId, year, and prev_year fields.
If year === prev_year, then I need to join with a different table, i.e. exchange_rates.
If year =!= prev_year, then I need to return the base dataset itself.
How do I do this in spark-sql?

You can refer to the approach below for your case.
scala> Input_df.show
+---------+----+---------+----+
|companyId|year|prev_year|rate|
+---------+----+---------+----+
|        1|2016|     2017|  12|
|        1|2017|     2017|21.4|
|        2|2018|     2017|11.7|
|        2|2018|     2018|44.6|
|        3|2016|     2017|34.5|
|        4|2017|     2017|  56|
+---------+----+---------+----+
scala> exch_rates.show
+---------+----+
|companyId|rate|
+---------+----+
|        1|12.3|
|        2|12.5|
|        3|22.3|
|        4|34.6|
|        5|45.2|
+---------+----+
scala> val equaldf = Input_df.filter(col("year") === col("prev_year"))
scala> val notequaldf = Input_df.filter(col("year") =!= col("prev_year"))
scala> val joindf = notequaldf.alias("n").drop("rate").join(exch_rates.alias("e"), List("companyId"), "left")
scala> val finalDF = equaldf.union(joindf)
scala> finalDF.show()
+---------+----+---------+----+
|companyId|year|prev_year|rate|
+---------+----+---------+----+
|        1|2017|     2017|21.4|
|        2|2018|     2018|44.6|
|        4|2017|     2017|  56|
|        1|2016|     2017|12.3|
|        2|2018|     2017|12.5|
|        3|2016|     2017|22.3|
+---------+----+---------+----+
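If you want to avoid the split-and-union, a single left join plus a conditional column gives the same result. A minimal sketch, assuming the same Input_df and exch_rates frames as above (the exchange-rate column is renamed to exch_rate only to avoid ambiguity):
import org.apache.spark.sql.functions._
// Left-join every row to the exchange rates, keep the original rate when
// year === prev_year, and take the exchange rate otherwise.
val joined = Input_df.join(exch_rates.withColumnRenamed("rate", "exch_rate"), Seq("companyId"), "left")
val finalDF2 = joined
  .withColumn("rate",
    when(col("year") === col("prev_year"), col("rate"))
      .otherwise(coalesce(col("exch_rate"), col("rate"))))
  .drop("exch_rate")
finalDF2.show()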

Related

Pyspark: join on some date-type data

For example:
I have two dataframes in Pyspark.
A_dataframe (table name: link_data_test), which is very large, about 1 billion rows:
+-----+--------------------+--------------+
|   id|           link_date|      tuch_url|
+-----+--------------------+--------------+
|day_1|2020-01-01 06:00:...|www.google.com|
|day_2|2020-01-01 11:00:...|www.33e.......|
|day_3|2020-01-03 22:21:...|www.3tg.......|
|day_4|2019-01-04 20:00:...|www.96g.......|
|  ...|                 ...|           ...|
+-----+--------------------+--------------+
B_dataframe (table name: url_data_test):
+--------------+----------+
|           url|extra_date|
+--------------+----------+
|www.google.com|2019-02-01|
|www.23........|2020-01-02|
|www.hsi.......|2020-01-03|
|www.cc........|2020-01-05|
|           ...|       ...|
+--------------+----------+
I can use spark.sql() to create a query:
sql_str="""
select
t1.*,t2.*
from
link_data_test as t1
inner join
url_data_test as t2
on
t1.link_date> t2.extra_date and t1.link_date< date_add(t2.extra_date,8)
where
t1.tuch_url like "%t2.url%"
"""
test1=spark.sql(sql_str).saveAsTable("xxxx",mode="overwrite")
I tried to use the following style to replace the SQL above for some other tests, but I don't know how to write it:
A_dataframe.join(B_dataframe, ......,'inner').select(....).saveAsTable("xxxx",mode="overwrite")
Thank you for your help!
Here is the way.
df1 = spark.read.option("header","true").option("inferSchema","true").csv("test1.csv")
df1.show(10, False)
df2 = spark.read.option("header","true").option("inferSchema","true").csv("test2.csv")
df2.show(10, False)
+-----+-------------------+--------------+
|id   |link_date          |tuch_url      |
+-----+-------------------+--------------+
|day_1|2020-01-08 23:59:59|www.google.com|
+-----+-------------------+--------------+
+--------------+----------+
|url           |extra_date|
+--------------+----------+
|www.google.com|2020-01-01|
+--------------+----------+
from pyspark.sql.functions import broadcast, col, date_add

df1.join(broadcast(df2),
         col('link_date').between(col('extra_date'), date_add('extra_date', 7))
         & col('url').contains(col('tuch_url')), 'inner') \
   .show(10, False)
+-----+-------------------+--------------+--------------+----------+
|id   |link_date          |tuch_url      |url           |extra_date|
+-----+-------------------+--------------+--------------+----------+
|day_1|2020-01-08 23:59:59|www.google.com|www.google.com|2020-01-01|
+-----+-------------------+--------------+--------------+----------+

How to select columns from each dataset after an inner join when a few non-join column names are the same?

Hi, I have data something like below:
+----------+-----+------+------------+
|company_id|city |state |updated_date|
+----------+-----+------+------------+
|       111|city1|state1|  1990-12-01|
|       222|city2|state2|  1991-12-01|
+----------+-----+------+------------+
+---------+-----+------+-------+
|companyId|city |state |zipcode|
+---------+-----+------+-------+
|      111|city1|state1| 111111|
|      222|city2|state2| 222222|
+---------+-----+------+-------+
I am doing a join on companyId as below
Dataset<Row> joinDs = firstDs.join(secondDs, firstDs.col("company_id").equalTo(secondDs.col("companyId")), "inner");
joinDs has the ambiguous columns "city" and "state".
Questions
How do I deal with the ambiguous columns? Is there any way to distinguish the ambiguous columns in joinDs?
How can I select specific columns from joinDs, a few from "firstDs" and a few from "secondDs"?
Joining a static DF to a streaming DF?
When I join the static dataframe "secondDs" (from HDFS) with the streaming dataframe "firstDs" (from Kafka) as below
Dataset<Row> joinUpdatedRecordsDs = firstDs.join(secondDs, firstDs.col("companyId").equalTo(secondDs.col("company_id")), "inner");
it is not giving any results,
whereas the same works fine (gives results) in Spark batch processing.
What am I doing wrong here? How do I fix this?
Error :
Caused by: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 385, Column 34: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 385, Column 34: A method named "toString" is not declared in any enclosing class nor any supertype, nor through a static import
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1304)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1376)
There are two ways to implement this: one using the DataFrame API and the other using Spark SQL.
Dataframe:
var df1 = spark.createDataFrame(Seq((111,"city1","state1","1990-12-01"),(222,"city2","state2","1991-12-01"))).toDF("company_id","city","state","updated_date")
df1.show
+----------+-----+------+------------+
|company_id| city| state|updated_date|
+----------+-----+------+------------+
|       111|city1|state1|  1990-12-01|
|       222|city2|state2|  1991-12-01|
+----------+-----+------+------------+
var df2 = spark.createDataFrame(Seq((111,"city1","state1",111111),(222,"city2","state2",222222))).toDF("company_id","city","state","zipcode")
scala> df1.join(df2,Seq("company_id"),"inner").show
+----------+-----+------+------------+-----+------+-------+
|company_id| city| state|updated_date| city| state|zipcode|
+----------+-----+------+------------+-----+------+-------+
|       111|city1|state1|  1990-12-01|city1|state1| 111111|
|       222|city2|state2|  1991-12-01|city2|state2| 222222|
+----------+-----+------+------------+-----+------+-------+
Performing an inner join using select on specific columns:
scala> df1.select("company_id","city","updated_date").join(df2.select("company_id","state","zipcode"),Seq("company_id"),"inner").show
+----------+-----+------------+------+-------+
|company_id| city|updated_date| state|zipcode|
+----------+-----+------------+------+-------+
|       111|city1|  1990-12-01|state1| 111111|
|       222|city2|  1991-12-01|state2| 222222|
+----------+-----+------------+------+-------+
Spark SQL:
Register both DataFrames as temp tables, then perform the join using Spark SQL:
scala> df1.registerTempTable("temp1")
scala> df2.registerTempTable("temp2")
scala> spark.sql("select a.company_id,a.city,a.updated_date,b.state,b.zipcode from temp1 a inner join temp2 as b on a.company_id = b.company_id ").show
+----------+-----+------------+------+-------+
|company_id| city|updated_date| state|zipcode|
+----------+-----+------------+------+-------+
|       111|city1|  1990-12-01|state1| 111111|
|       222|city2|  1991-12-01|state2| 222222|
+----------+-----+------------+------+-------+
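If you would rather keep both sets of city/state columns and still reference them unambiguously, you can also alias the DataFrames before joining and qualify the columns through the aliases. A small sketch using the same df1/df2 as above (the same alias trick works with the Java Dataset API via firstDs.alias("a") / secondDs.alias("b")):
scala> import org.apache.spark.sql.functions.col
scala> // Qualify the ambiguous columns through the aliases "a" and "b"
scala> val aliasedJoin = df1.alias("a").join(df2.alias("b"), col("a.company_id") === col("b.company_id"), "inner")
scala> aliasedJoin.select(col("a.company_id"), col("a.city"), col("a.updated_date"), col("b.state"), col("b.zipcode")).show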
If you have any query related to this, please let me know. Happy Hadoop!

Spark SQL query giving data type mismatch error

I have a small SQL query which works perfectly fine in SQL, and the same query works in Hive as expected.
The table has user information, and below is the query:
spark.sql("select * from users where (id,id_proof) not in ((1232,345))").show;
But I am getting the below exception in Spark:
org.apache.spark.sql.AnalysisException: cannot resolve '(named_struct('age', deleted_inventory.`age`, 'id_proof', deleted_inventory.`id_proof`) IN (named_struct('col1',1232, 'col2', 345)))' due to data type mismatch: Arguments must be same type but were: StructType(StructField(id,IntegerType,true), StructField(id_proof,IntegerType,true)) != StructType(StructField(col1,IntegerType,false), StructField(col2,IntegerType,false));
Both id and id_proof are of integer type.
Try using a WITH clause (CTE) instead; it works:
scala> val df = Seq((101,121), (1232,345),(222,2242)).toDF("id","id_proof")
df: org.apache.spark.sql.DataFrame = [id: int, id_proof: int]
scala> df.show(false)
+----+--------+
|id |id_proof|
+----+--------+
|101 |121     |
|1232|345     |
|222 |2242    |
+----+--------+
scala> df.createOrReplaceTempView("girish")
scala> spark.sql("with t1( select 1232 id,345 id_proof ) select id, id_proof from girish where (id,id_proof) not in (select id,id_proof from t1) ").show(false)
+---+--------+
|id |id_proof|
+---+--------+
|101|121     |
|222|2242    |
+---+--------+
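If you prefer to stay in the DataFrame API, an equivalent way to express NOT IN over the column pair is a left anti join against a small exclusion DataFrame. A minimal sketch using the same df as above, which drops the (1232, 345) row just like the query:
scala> // Rows whose (id, id_proof) pair appears in `excluded` are removed
scala> val excluded = Seq((1232, 345)).toDF("id", "id_proof")
scala> df.join(excluded, Seq("id", "id_proof"), "left_anti").show(false)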

How to access nested schema column?

I have a Kafka streaming source with JSONs, e.g. {"type":"abc","1":"23.2"}.
The query gives the following exception:
org.apache.spark.sql.catalyst.parser.ParseException: extraneous
input '.1' expecting {<EOF>, .......}
== SQL ==
person.1
What is the correct syntax to access "person.1"?
I have even changed DoubleType to StringType, but that didn't work either. The example works fine just by keeping person.type and removing person.1 in selectExpr:
val personJsonDf = inputDf.selectExpr("CAST(value AS STRING)")
val struct = new StructType()
.add("type", DataTypes.StringType)
.add("1", DataTypes.DoubleType)
val personNestedDf = personJsonDf
.select(from_json($"value", struct).as("person"))
val personFlattenedDf = personNestedDf
.selectExpr("person.type", "person.1")
val consoleOutput = personNestedDf.writeStream
.outputMode("update")
.format("console")
.start()
Interesting, since select($"person.1") should work fine (but you used selectExpr which could've confused Spark SQL).
StructField(1,DoubleType,true) won't work however since the type should actually be StringType.
Let's see...
$ cat input.json
{"type":"abc","1":"23.2"}
val input = spark.read.text("input.json")
scala> input.show(false)
+-------------------------+
|value                    |
+-------------------------+
|{"type":"abc","1":"23.2"}|
+-------------------------+
import org.apache.spark.sql.types._
val struct = new StructType()
.add("type", DataTypes.StringType)
.add("1", DataTypes.StringType)
val q = input.select(from_json($"value", struct).as("person"))
scala> q.show
+-----------+
|     person|
+-----------+
|[abc, 23.2]|
+-----------+
val q = input.select(from_json($"value", struct).as("person")).select($"person.1")
scala> q.show
+----+
|   1|
+----+
|23.2|
+----+
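For what it's worth, selectExpr can also reach the numeric field once you backtick-quote it, since the bare .1 is what trips the parser. A small sketch with the same input and struct as above:
// Backticks stop the parser from reading ".1" as part of a decimal literal
val flattened = input
  .select(from_json($"value", struct).as("person"))
  .selectExpr("person.type", "person.`1`")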
I have solved this problem by using person.*
+----+----+
|type|   1|
+----+----+
| abc|23.2|
+----+----+

Spark: DataFrame Aggregation (Scala)

I have a requirement to aggregate data on a Spark DataFrame in Scala, as below.
I have two datasets.
Dataset 1 contains values (val1, val2, ...) for each "t" type, distributed across several different columns (t1, t2, ...).
val data1 = Seq(
  ("1","111",200,"221",100,"331",1000),
  ("2","112",400,"222",500,"332",1000),
  ("3","113",600,"223",1000,"333",1000)
).toDF("id1","t1","val1","t2","val2","t3","val3")
data1.show()
+---+---+----+---+----+---+----+
|id1| t1|val1| t2|val2| t3|val3|
+---+---+----+---+----+---+----+
|  1|111| 200|221| 100|331|1000|
|  2|112| 400|222| 500|332|1000|
|  3|113| 600|223|1000|333|1000|
+---+---+----+---+----+---+----+
Dataset 2 represents the same thing with a separate row for each "t" type.
val data2 = Seq(
  ("1","111",200), ("1","221",100), ("1","331",1000),
  ("2","112",400), ("2","222",500), ("2","332",1000),
  ("3","113",600), ("3","223",1000), ("3","333",1000)
).toDF("id*","t*","val*")
data2.show()
+---+---+----+
|id*| t*|val*|
+---+---+----+
|  1|111| 200|
|  1|221| 100|
|  1|331|1000|
|  2|112| 400|
|  2|222| 500|
|  2|332|1000|
|  3|113| 600|
|  3|223|1000|
|  3|333|1000|
+---+---+----+
Now, I need to group by the (id, t, t*) fields and print the balances sum(val) and sum(val*) for each group.
And both balances should be equal.
My output should look like below:
+---+---+--------+---+---------+
|id1| t |sum(val)| t*|sum(val*)|
+---+---+--------+---+---------+
|  1|111|     200|111|      200|
|  1|221|     100|221|      100|
|  1|331|    1000|331|     1000|
|  2|112|     400|112|      400|
|  2|222|     500|222|      500|
|  2|332|    1000|332|     1000|
|  3|113|     600|113|      600|
|  3|223|    1000|223|     1000|
|  3|333|    1000|333|     1000|
+---+---+--------+---+---------+
I'm thinking of exploding dataset 1 into multiple records, one for each "t" type, and then joining with dataset 2.
But could you please suggest a better approach which wouldn't hurt performance if the datasets become bigger?
The simplest solution is to do sub-selects and then union the datasets:
import org.apache.spark.sql.functions._
val ts = Seq(1, 2, 3)
val dfs = ts.map(t => data1.select(col("id1"), col("t" + t) as "t", col("val" + t) as "val"))
val unioned = dfs.reduce(_ union _)
val ds = unioned.join(data2, 't === col("t*"))
Then apply the aggregation shown in the last step below.
You can also try array with explode:
val exploded = data1
  .withColumn("t", explode(array('t1, 't2, 't3)))
  .withColumn("val",
    when('t === 't1, 'val1)
      .when('t === 't2, 'val2)
      .when('t === 't3, 'val3)
      .otherwise(0))
  .select('id1, 't, 'val)
The last step is to join this Dataset with data2 and aggregate (the same groupBy/agg applies to ds from the first approach):
exploded.join(data2, 't === col("t*"))
  .groupBy("t", "t*")
  .agg(first("id1") as "id1", sum("val"), sum("val*"))
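Another option worth sketching (untested here): the SQL stack() function unpivots the three (t, val) pairs in one pass, which avoids both the union of sub-selects and the array/explode step:
// stack(3, ...) emits one (t, val) row per pair of columns
val unpivoted = data1.selectExpr(
  "id1",
  "stack(3, t1, val1, t2, val2, t3, val3) as (t, val)")
unpivoted.join(data2, col("t") === col("t*"))
  .groupBy("id1", "t", "t*")
  .agg(sum("val") as "sum(val)", sum("val*") as "sum(val*)")
  .show()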
