Spark Action after dataframe load from jdbc driver taking too long - apache-spark

I am loading 10Gb or more data from two databases namely, redshift and snowflake into the dataframe. The data is loaded for comparison between the datasets. The code for loading data is same except for the jars used with no partition or with partition as well
This is the one below with one partition
df1 = sparkcontext.read \
.format("jdbc") \
.option("driver", "com.amazon.redshift.jdbc42.Driver") \
.option("url", jdbcurl) \
.option("query", query) \
.option("user", user) \
.option("password", password).load()
df1.count()
df2 = sparkcontext.read \
.format("jdbc") \
.option("url", jdbcurl) \
.option("driver", "net.snowflake.client.jdbc.SnowflakeDriver") \
.option("schema", SchemaName) \
.option("query", query) \
.option("user", user) \
.option("password", password).load()
df2.count()
the time taken by second one is 5 secs and while the first one count takes 60 secs.
Since, at this moment we are not performing any calculation except counting number of rows in each dataframe, why is there a difference of ~50-55 secs between the two?
df1.join(df2, on=<clause>,how=outer).where(<clause>)
If I bypass this operation, the df1 and df2 are then joined with partitioning being done at the time of loading by passing , the snowflake stage completes in 4 minutes, but the redshift one takes 10 minutes, why is there a difference of 6 minutes between the two stages. The partition at the time of loading is done like this-
.option("partitionColumn", partitioncolumn) \
.option("lowerBound", lowerbound) \
.option("upperBound", upperbound) \
.option("numPartitions", numpartitions) \
Just wanted to understand why is there a difference between two datasets when the data is same and code is also same?

Related

How to filter data using spark.read in place?

I try read data in Delta format from ADLS. I want read some portion of that data using filter in place. Same approach worked for me during reading JDBC format
query = f"""
select * from {table_name}
where
createdate < to_date('{createdate}','YYYY-MM-DD HH24:MI:SS') or
modifieddate < to_date('{modifieddate}','YYYY-MM-DD HH24:MI:SS')
"""
return spark.read \
.format("jdbc") \
.option("url", url) \
.option("query", query) \
.option("user", username) \
.option("password", password) \
.option("driver", "oracle.jdbc.driver.OracleDriver") \
.load()
So I tried to create in similar way reading delta using query but it reads whole table.
return spark.read \
.format("delta") \
.option("query", query) \
.load(path)
How could I solve this issue without reading full df and then filter it?
Thanks in advance!
Spark uses a functionality called predicate pushdown to optimize queries.
In the first case, the filters can be passed on to the oracle database.
Delta does not work that way. There can be optimisations through data skipping and Z-ordering, but since you are essentially querying parquet files, you have to read the all of them in memory and filter afterwards.

In Azure databricks writing pyspark dataframe to eventhub is taking too long as there3 Million records in dataframe

Oracle database table has 3 million records. I need to read it into dataframe and then convert it to json format and send it to eventhub for downstream systems.
Below is my pyspark code to connect and read oracle db table as dataframe
df = spark.read \
.format("jdbc") \
.option("url", databaseurl) \
.option("query","select * from tablename") \
.option("user", loginusername) \
.option("password", password) \
.option("driver", "oracle.jdbc.driver.OracleDriver") \
.option("oracle.jdbc.timezoneAsRegion", "false") \
.load()
then I am converting the column names and values of each row into json (placing under a new column named body) and then sending it to Eventhub.
I have defined ehconf and eventhub connection string. Below is my write to eventhub code
df.select("body") \
.write\
.format("eventhubs") \
.options(**ehconf) \
.save()
my pyspark code is taking 8 hours to send 3 million records to eventhub.
Could you please suggest how to write pyspark dataframe to eventhub faster ?
My Eventhub is created under eventhub cluster which has 1 CU in capacity
Databricks cluster config :
mode: Standard
runtime: 10.3
worker type: Standard_D16as_v4 64GB Memory,16 cores (min workers :1, max workers:5)
driver type: Standard_D16as_v4 64GB Memory,16 cores
The problem is that the jdbc connector just uses one connection to the database by default so most of your workers are probably idle. That is something you can confirm in Cluster Settings > Metrics > Ganglia UI.
To actually make use of all the workers the jdbc connector needs to know how to parallelize retrieving your data. For this you need a field that has evenly distributed data over its values. For example if you have a date field in your data and every date has a similar amount of records, you can use it to split up the data:
df = spark.read \
.format("jdbc") \
.option("url", jdbcUrl) \
.option("dbtable", tableName) \
.option("user", jdbcUsername) \
.option("password", jdbcPassword) \
.option("numPartitions", 64) \
.option("partitionColumn", "<dateField>") \
.option("lowerBound", "2019-01-01") \
.option("upperBound", "2022-04-07") \
.load()
You have to define the field name and the min and max value of that field so that the jdbc connector can try to split the work evenly between the workers. The numPartitions is the amount of individual connections opened and the best value depends on the count of workers in your cluster and how many connections your datasource can handle.

Refreshing static Dataframe periodically not working after unpersist in Spark Structured Streaming

I am writing a Pyspark Structured Streaming application (version 3.0.1) and I'm trying to refresh a static dataframe from JDBC source periodically.
I have followed the instructions in this post using a rate stream.
Stream-Static Join: How to refresh (unpersist/persist) static Dataframe periodically
However, whenever the first unpersist (wether with or without blocking=True) occurs, the following persist is ignored, and from then on the dataframe is read in each trigger instead of being cached in storage.
this is how my code looks like:
# Read static dataframe
static_df = spark.read \
.option("multiline", True) \
.format("jdbc") \
.option("url", JDBC_URL) \
.option("user", USERNAME) \
.option("password", PASSWORD) \
.option("numPartitions", NUMPARTITIONS) \
.option("query", QUERY) \
.load()\
.na.drop(subset=[MY_COL]) \
.repartition(MY_COL) \
.persist()
# Create rate stream
staticRefreshStream = spark.readStream.format("rate") \
.option("rowsPerSecond", 1) \
.option("numPartitions", 1) \
.load()
def foreachBatchRefresher(batch_df, batch_id):
global static_df
print("Refreshing static table")
static_df.unpersist()
static_df = spark.read \
.option("multiline", True) \
.format("jdbc") \
.option("url", JDBC_URL) \
.option("user", USERNAME) \
.option("password", PASSWORD) \
.option("numPartitions", NUMPARTITIONS) \
.option("query", QUERY) \
.load()\
.na.drop(subset=[MY_COL]) \
.repartition(MY_COL) \
.persist()
staticRefreshStream.writeStream \
.format("console") \
.outputMode("append") \
.queryName("RefreshStaticDF") \
.foreachBatch(foreachBatchRefresher) \
.trigger(processingTime='1 minutes') \
.start()
staticRefreshStream.awaitTermination()
The other parts including reading the streaming Dataframe, dataframe transformations and writing to a sink are omitted.
Any idea what I'm missing?

Upsert data in postgresql using spark structured streaming

I am trying to run a structured streaming application using (py)spark. My data is read from a Kafka topic and then I am running windowed aggregation on event time.
# I have been able to create data frame pn_data_df after reading data from Kafka
Schema of pn_data_df
|
- id StringType
- source StringType
- source_id StringType
- delivered_time TimeStamp
windowed_report_df = pn_data_df.filter(pn_data_df.source == 'campaign') \
.withWatermark("delivered_time", "24 hours") \
.groupBy('source_id', window('delivered_time', '15 minute')) \
.count()
windowed_report_df = windowed_report_df \
.withColumn('start_ts', unix_timestamp(windowed_report_df.window.start)) \
.withColumn('end_ts', unix_timestamp(windowed_report_df.window.end)) \
.selectExpr('CAST(source_id as LONG)', 'start_ts', 'end_ts', 'count')
I am writing this windowed aggregation to my postgresql database which I have already created.
CREATE TABLE pn_delivery_report(
source_id bigint not null,
start_ts bigint not null,
end_ts bigint not null,
count integer not null,
unique(source_id, start_ts)
);
Writing to postgresql using spark jdbc allows me to either Append or Overwrite. Append mode fails if there is an existing composite key existing in the database, and Overwrite just overwrites entire table with current batch output.
def write_pn_report_to_postgres(df, epoch_id):
df.write \
.mode('append') \
.format('jdbc') \
.option("url", "jdbc:postgresql://db_endpoint/db") \
.option("driver", "org.postgresql.Driver") \
.option("dbtable", "pn_delivery_report") \
.option("user", "postgres") \
.option("password", "PASSWORD") \
.save()
windowed_report_df.writeStream \
.foreachBatch(write_pn_report_to_postgres) \
.option("checkpointLocation", '/home/hadoop/campaign_report_df_windowed_checkpoint') \
.outputMode('update') \
.start()
How can I execute a query like
INSERT INTO pn_delivery_report (source_id, start_ts, end_ts, COUNT)
VALUES (1001, 125000000001, 125000050000, 128),
(1002, 125000000001, 125000050000, 127) ON conflict (source_id, start_ts) DO
UPDATE
SET COUNT = excluded.count;
in foreachBatch.
Spark has a jira feature ticket open for it, but it seems that it has not been prioritised till now.
https://issues.apache.org/jira/browse/SPARK-19335
that's worked for me:
def _write_streaming(self,
df,
epoch_id
) -> None:
df.write \
.mode('append') \
.format("jdbc") \
.option("url", f"jdbc:postgresql://localhost:5432/postgres") \
.option("driver", "org.postgresql.Driver") \
.option("dbtable", 'table_test') \
.option("user", 'user') \
.option("password", 'password') \
.save()
df_stream.writeStream \
.foreachBatch(_write_streaming) \
.start() \
.awaitTermination()
You need to add ".awaitTermination()" at the end.

spark sql load all data into memory without using where clause

so I have a really huge table contains billions of rows, I tried the Spark DataFrame API to load data, here is my code:
sql = "select * from mytable where day = 2016-11-25 and hour = 10"
df = sqlContext.read \
.format("jdbc") \
.option("driver", driver) \
.option("url", url) \
.option("user", user) \
.option("password", password) \
.option("dbtable", table) \
.load(sql)
df.show()
I typed the sql in mysql, it returns about 100 rows,but the above sql did not work in spark sql, it occurs OOM error, it seems like that spark sql load all data into memory without using where clause. So how can spark sql using where clause?
I have solved the problem. the spark doc gives the answer:
spark doc
So the key is to change the "dbtalble" option, make your sql a subquery. The correct answer is :
// 1. write your query sql as a subquery
sql = "(select * from mytable where day = 2016-11-25 and hour = 10) t1"
df = sqlContext.read \
.format("jdbc") \
.option("driver", driver) \
.option("url", url) \
.option("user", user) \
.option("password", password) \
.option("dbtable", sql) \ // 2. change "dbtable" option to your subquery sql
.load(sql)
df.show()
sql = "(select * from mytable where day = 2016-11-25 and hour = 10) as t1"
df = sqlContext.read
.format("jdbc")
.option("driver", driver)
.option("url", url)
.option("user", user)
.option("password", password)
.option("dbtable", sql)
.load(sql)
df.show()

Resources