How to efficiently select partial data from RDBMS tables in Pyspark

Let's assume I have an employee table like this:
| employee_id | employee_name | department | created_at | updated_at |
|-------------|---------------|------------|---------------------|---------------------|
| 1 | Jessica | Finance | 2020-10-10 12:00:00 | 2020-10-10 12:00:00 |
| 2 | Michael | IT | 2020-10-10 15:00:00 | 2020-10-10 15:00:00 |
| 3 | Sheila | HR | 2020-10-11 17:00:00 | 2020-10-11 17:00:00 |
| ... | ... | ... | ... | ... |
| 1000 | Emily | IT | 2020-10-20 20:00:00 | 2020-10-20 20:00:00 |
Usually, I can batch the data in Pyspark using a JDBC connection and write it to GCS like this:
df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://{ip_address}/{database}") \
    .option("dbtable", table_source) \
    .option("user", user_source) \
    .option("password", password_source) \
    .option("driver", "org.postgresql.Driver") \
    .load()

df.write.parquet("gs://{bucket_name}/{target_directory}/")
When I create df with the code above and call .load(), does the data stay in the database server, or does Spark download all the data from the table into the Spark cluster (assuming the database and the Spark cluster are on different servers)?
And if I need data in a specific time range, say everything where created_at > 2020-10-15 00:00:00,
is the code below enough? I found it really slow once the data size reached more than 25 GB:
df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://{ip_address}/{database}") \
    .option("dbtable", table_source) \
    .option("user", user_source) \
    .option("password", password_source) \
    .option("driver", "org.postgresql.Driver") \
    .load()

df.createOrReplaceTempView("get_specific_data")

get_specific_data = spark.sql('''
    SELECT employee_id, employee_name, department, created_at, updated_at
    FROM get_specific_data
    WHERE created_at > '2020-10-15 00:00:00'
''')

get_specific_data.write.parquet("gs://{bucket_name}/{target_directory}/")
My question is really about how to efficiently fetch specific data in Pyspark when I know which rows I need, filtered by created_at (or any other column, by ID, or something else). Do I need Spark SQL for this, or another tool? (The purpose is to batch the data daily.)

Turns out that if I specify only the table name in table_source, Spark loads the whole table into the cluster.
To select only the specific data I need, I can push the filter down to the database by passing a subquery as dbtable, like this:
last_update = '2020-10-15 00:00:00'

# The subquery runs inside PostgreSQL, so it should reference the source
# table there (the employee table from the example).
table_source = "(SELECT employee_id, employee_name, department, created_at, updated_at " \
               "FROM employee WHERE created_at > '{0}') t1".format(last_update)
df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://{ip_address}/{database}") \
    .option("dbtable", table_source) \
    .option("user", user_source) \
    .option("password", password_source) \
    .option("driver", "org.postgresql.Driver") \
    .load()

# And then write the data to a parquet file
df.write.parquet("gs://{bucket_name}/{target_directory}/")
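Not part of the original answer: a sketch of a parallel JDBC read using Spark's standard partitioning options (partitionColumn, lowerBound, upperBound, numPartitions), which can help once the filtered range is large. Spark 2.4+ also has a query option as an alternative to wrapping a subquery in dbtable. The placeholder values ({ip_address}, {database}, bucket/paths, credentials) and the 1-1000 bounds (taken from the sample employee_id range) are assumptions.
last_update = '2020-10-15 00:00:00'
table_source = ("(SELECT employee_id, employee_name, department, created_at, updated_at "
                "FROM employee WHERE created_at > '{0}') t1").format(last_update)

# Parallel read: Spark splits the filtered range into numPartitions slices
# on employee_id and issues one query per slice.
df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://{ip_address}/{database}") \
    .option("dbtable", table_source) \
    .option("user", user_source) \
    .option("password", password_source) \
    .option("driver", "org.postgresql.Driver") \
    .option("partitionColumn", "employee_id") \
    .option("lowerBound", "1") \
    .option("upperBound", "1000") \
    .option("numPartitions", "4") \
    .load()

df.write.parquet("gs://{bucket_name}/{target_directory}/")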

Related

Updated dataframe column value failed to overwrite in Hive

Consider hive table tbl with column aid and bid
| aid | bid |
---------------
| | 12 |
| 24 | 13 |
| 18 | 3 |
| | 7 |
---------------
The requirement is: when aid is null or an empty string, aid should be overwritten by the value of bid.
| aid | bid |
---------------
| 12 | 12 |
| 24 | 13 |
| 18 | 3 |
| 7 | 7 |
---------------
The code is simple:
val df01 = spark.sql("select * from db.tbl")
val df02 = df01.withColumn("aid", when(col("aid").isNull || col("aid") <=> "", col("bid")).otherwise(col("aid")))
When running it in spark-shell, df02.show displayed the correct data, just like the table above.
The problem is when I write the data back to Hive:
df02.write
    .format("orc")
    .mode("Overwrite")
    .option("header", "false")
    .option("orc.compress", "snappy")
    .insertInto(tbl)
There is no error, but when I validate the data with
select * from db.tbl where aid is null or aid = '' limit 10;
I can still see multiple rows returned with aid being null.
How do I overwrite the data back to Hive after updating the column value as in the example above?
I would try this:
import org.apache.spark.sql.SaveMode

df02.write
    .format("orc")
    .mode(SaveMode.Overwrite)
    .option("compression", "snappy")
    .insertInto(tbl)
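Not from the original answer: a rough PySpark sketch of the same fix, in case you are working from Python. It assumes the Hive table db.tbl from the question; note that some Spark/Hive configurations refuse to insert-overwrite a table that is also being read in the same job.
from pyspark.sql.functions import col, when

# Rough PySpark equivalent of the Scala fix above (an assumption, not the
# original answer): replace null/empty aid with bid, then overwrite the table.
df01 = spark.sql("select * from db.tbl")
df02 = df01.withColumn(
    "aid",
    when(col("aid").isNull() | (col("aid") == ""), col("bid")).otherwise(col("aid"))
)

df02.write \
    .format("orc") \
    .option("compression", "snappy") \
    .mode("overwrite") \
    .insertInto("db.tbl")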

How to CREATE TABLE USING delta with Spark 2.4.4?

This is Spark 2.4.4 and Delta Lake 0.5.0.
I'm trying to create a table using the delta data source and it seems I'm missing something. Although the CREATE TABLE USING delta command worked fine, neither is the table directory created nor does insertInto work.
The following CREATE TABLE USING delta worked fine, but insertInto failed.
scala> sql("""
create table t5
USING delta
LOCATION '/tmp/delta'
""").show
scala> spark.catalog.listTables.where('name === "t5").show
+----+--------+-----------+---------+-----------+
|name|database|description|tableType|isTemporary|
+----+--------+-----------+---------+-----------+
| t5| default| null| EXTERNAL| false|
+----+--------+-----------+---------+-----------+
scala> spark.range(5).write.option("mergeSchema", true).insertInto("t5")
org.apache.spark.sql.AnalysisException: `default`.`t5` requires that the data to be inserted have the same number of columns as the target table: target table has 0 column(s) but the inserted data has 1 column(s), including 0 partition column(s) having constant value(s).;
at org.apache.spark.sql.execution.datasources.PreprocessTableInsertion.org$apache$spark$sql$execution$datasources$PreprocessTableInsertion$$preprocess(rules.scala:341)
...
I thought I'd create the table with the columns defined, but that didn't work either.
scala> sql("""
create table t6
(id LONG, name STRING)
USING delta
LOCATION '/tmp/delta'
""").show
org.apache.spark.sql.AnalysisException: delta does not allow user-specified schemas.;
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:325)
at org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:78)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:194)
at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:3370)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3370)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:194)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:79)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
... 54 elided
The OSS version of Delta does not have the SQL CREATE TABLE syntax yet. This will be implemented in future versions using Spark 3.0.
To create a Delta table, you must write out a DataFrame in Delta format. An example in Python:
df.write.format("delta").save("/some/data/path")
Here's a link to the create table documentation for Python, Scala, and Java.
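A slightly fuller sketch of that approach with Delta Lake 0.5.0 on Spark 2.4 (the /tmp/delta path is just the example location from the question; delta-core must be on the classpath):
# Sketch: create the Delta table by writing a DataFrame in delta format,
# then read it back by path.
df = spark.range(5)
df.write.format("delta").mode("overwrite").save("/tmp/delta")

reloaded = spark.read.format("delta").load("/tmp/delta")
reloaded.show()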
An example with PySpark 3.0.0 & Delta 0.7.0:
print(f"LOCATION '{location}'")
spark.sql(f"""
CREATE OR REPLACE TABLE {TABLE_NAME} (
CD_DEVICE INT,
FC_LOCAL_TIME TIMESTAMP,
CD_TYPE_DEVICE STRING,
CONSUMTION DOUBLE,
YEAR INT,
MONTH INT,
DAY INT )
USING DELTA
PARTITIONED BY (YEAR , MONTH , DAY, FC_LOCAL_TIME)
LOCATION '{location}'
""")
Where "location" is a dir HDFS for spark cluster mode save de delta table.
tl;dr CREATE TABLE USING delta is not supported by Spark before 3.0.0 and Delta Lake before 0.7.0.
Delta Lake 0.7.0 with Spark 3.0.0 (both just released) do support the CREATE TABLE SQL command.
Be sure to "install" Delta SQL using spark.sql.catalog.spark_catalog configuration property with org.apache.spark.sql.delta.catalog.DeltaCatalog.
$ ./bin/spark-submit \
--packages io.delta:delta-core_2.12:0.7.0 \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
scala> spark.version
res0: String = 3.0.0
scala> sql("CREATE TABLE delta_101 (id LONG) USING delta").show
++
||
++
++
scala> spark.table("delta_101").show
+---+
| id|
+---+
+---+
scala> sql("DESCRIBE EXTENDED delta_101").show(truncate = false)
+----------------------------+---------------------------------------------------------+-------+
|col_name |data_type |comment|
+----------------------------+---------------------------------------------------------+-------+
|id |bigint | |
| | | |
|# Partitioning | | |
|Not partitioned | | |
| | | |
|# Detailed Table Information| | |
|Name |default.delta_101 | |
|Location |file:/Users/jacek/dev/oss/spark/spark-warehouse/delta_101| |
|Provider |delta | |
|Table Properties |[] | |
+----------------------------+---------------------------------------------------------+-------+
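If you start the session from PySpark rather than spark-submit, the same properties can be set on the SparkSession builder; a minimal sketch assuming Delta 0.7.0 on Spark 3.0.0:
from pyspark.sql import SparkSession

# Minimal sketch: the same configuration as the spark-submit example above,
# set on the SparkSession builder from PySpark.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "io.delta:delta-core_2.12:0.7.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

spark.sql("CREATE TABLE delta_101 (id LONG) USING delta")
spark.table("delta_101").show()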

How to pivot a pyspark streaming dataframe

I receive streaming data in pyspark structured streaming, and I need to pivot it so that I get a single row from that data.
The structure of the data coming to my cluster is:
{
"version": 1.0.0,
"message": {
"data": [{
"name": "name_1",
"value": 1.0},
...
{
"name": "name_2",
"value": 2.0}]
}
}
My code is the following:
from pyspark.sql.functions import explode

dfStreaming = spark \
    .readStream \
    .format("eventhubs") \
    .options(**optionConf()) \
    .load() \
    .select(explode("message.data").alias("data")) \
    .select("data.*")
I get the following result dataframe:
|---------------------|------------------|
| Name | Value |
|---------------------|------------------|
| Name_1 | 1.0 |
|---------------------|------------------|
| Name_2 | 2.0 |
|---------------------|------------------|
But I need the following structure (it's actually a pivot of the table):
|---------------------|------------------|
| Name_1 | Name_2 |
|---------------------|------------------|
| 1.0 | 2.0 |
|---------------------|------------------|
Pivot on a streaming dataframe is not permitted, but I suppose there should be a solution for that.
Thank you so much for your help.
The solution was adding several aggregations with case when expressions to recreate the single row of the dataframe.
from pyspark.sql.functions import explode

dfStreaming = spark \
    .readStream \
    .format("eventhubs") \
    .options(**optionConf()) \
    .load() \
    .select(explode("message.data").alias("data")) \
    .select("data.*") \
    .selectExpr(
        "sum(case when Name = 'Name_1' then Value end) as Name_1",
        "sum(case when Name = 'Name_2' then Value end) as Name_2"
    )  # one conditional aggregation per desired output column

Duplicate Records move to other temp table in pyspark

I am using Pyspark
My input data looks like this:
|COL1|COL2|
|TYCO|130003|
|EMC |120989|
|VOLVO|102329|
|BMW|130157|
|FORD|503004|
|TYCO|130003|
I have created a DataFrame and am querying for duplicates like this:
from pyspark.sql import Row
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Test") \
.getOrCreate()
data = spark.read.csv("filepath")
data.registerTempTable("data")
spark.sql("SELECT count(col2)CNT, col2 from data GROUP BY col2 ").show()
This gives the correct result, but can we get the duplicate values in a separate temp table?
output data in Temp1
+----+------+
| 1|120989|
| 1|102329|
| 1|130157|
| 1|503004|
+----+------+
output data in temp2
+----+------+
| 2|130003|
+----+------+
sqlDF = spark.sql("SELECT count(col2) AS CNT, col2 FROM data GROUP BY col2 HAVING count(col2) > 1")
sqlDF.createOrReplaceTempView("temp2")
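If you also want the non-duplicate values in their own view (temp1 in the question), the same pattern with the opposite HAVING condition should work; a sketch:
# Sketch: split unique and duplicated col2 values into the two temp views
# named in the question (temp1 for unique values, temp2 for duplicates).
spark.sql("""
    SELECT count(col2) AS CNT, col2
    FROM data
    GROUP BY col2
    HAVING count(col2) = 1
""").createOrReplaceTempView("temp1")

spark.sql("""
    SELECT count(col2) AS CNT, col2
    FROM data
    GROUP BY col2
    HAVING count(col2) > 1
""").createOrReplaceTempView("temp2")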

Spark Structured Streaming watermark error

Follow-up to this question.
I have json streaming data in the same format as below:
| A | B |
|-------|------------------------------------------|
| ABC | [{C:1, D:1}, {C:2, D:4}] |
| XYZ | [{C:3, D:6}, {C:9, D:11}, {C:5, D:12}] |
I need to transform it to the format below
| A | C | D |
|-------|-----|------|
| ABC | 1 | 1 |
| ABC | 2 | 4 |
| XYZ | 3 | 6 |
| XYZ | 9 | 11 |
| XYZ | 5 | 12 |
To achieve this, I performed the transformations suggested in the previous question.
val df1 = df0.select($"A", explode($"B")).toDF("A", "Bn")
val df2 = df1.withColumn("SeqNum", monotonically_increasing_id()).toDF("A", "Bn", "SeqNum")
val df3 = df2.select($"A", explode($"Bn"), $"SeqNum").toDF("A", "B", "C", "SeqNum")
val df4 = df3.withColumn("dummy", concat( $"SeqNum", lit("||"), $"A"))
val df5 = df4.select($"dummy", $"B", $"C").groupBy("dummy").pivot("B").agg(first($"C"))
val df6 = df5.withColumn("A", substring_index(col("dummy"), "||", -1)).drop("dummy")
Now I am trying to save the result to a file in HDFS:
df6.withWatermark("event_time", "0 seconds")
    .writeStream
    .trigger(Trigger.ProcessingTime("0 seconds"))
    .queryName("query_db")
    .format("parquet")
    .option("checkpointLocation", "/path/to/checkpoint")
    .option("path", "/path/to/output")
    // .outputMode("complete")
    .start()
Now I get the below error.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;;
EventTimeWatermark event_time#223: timestamp, interval
My doubt is that I am not performing any aggregation that would require it to store the aggregated value beyond the processing time of that row. Why do I get this error? Can I keep the watermark at 0 seconds?
Any help on this will be deeply appreciated.
As per my understanding, watermarking is required only when you are performing window operations on event time. Spark uses watermarking to handle late data, and for that purpose it needs to keep older aggregations around.
The following link explains this very well with an example:
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#handling-late-data-and-watermarking
I don't see any window operations in your transformations, and if that is the case then I think you can try running the streaming query without watermarking.
When grouping a streaming structured DataFrame, you have to already have the watermark on the DataFrame and take it into account while grouping, by including a window over the watermarked column in your aggregation, for example:
df.groupBy(col("dummy"), window(col("event_time"), "1 day")).agg(first(col("C")))
