Spark SQL loads all data into memory without using the where clause - apache-spark

I have a really huge table containing billions of rows, and I tried the Spark DataFrame API to load the data. Here is my code:
sql = "select * from mytable where day = 2016-11-25 and hour = 10"
df = sqlContext.read \
.format("jdbc") \
.option("driver", driver) \
.option("url", url) \
.option("user", user) \
.option("password", password) \
.option("dbtable", table) \
.load(sql)
df.show()
When I run this SQL directly in MySQL it returns about 100 rows, but the code above does not work in Spark SQL; it fails with an OOM error. It seems that Spark SQL loads the whole table into memory without applying the where clause. So how can I make Spark SQL use the where clause?

I have solved the problem. The Spark JDBC data source documentation (https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html) gives the answer: the "dbtable" option can be anything that is valid in a FROM clause, including a subquery in parentheses. So the key is to change the "dbtable" option and make your SQL a subquery. The correct code is:
# 1. Write your query as a parenthesised subquery with an alias
sql = "(select * from mytable where day = 2016-11-25 and hour = 10) t1"
df = sqlContext.read \
.format("jdbc") \
.option("driver", driver) \
.option("url", url) \
.option("user", user) \
.option("password", password) \
.option("dbtable", sql) \
.load()  # 2. pass the subquery through the "dbtable" option and call load() with no argument
df.show()

sql = "(select * from mytable where day = 2016-11-25 and hour = 10) as t1"
df = sqlContext.read
.format("jdbc")
.option("driver", driver)
.option("url", url)
.option("user", user)
.option("password", password)
.option("dbtable", sql)
.load(sql)
df.show()
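On Spark 2.4 and later, the JDBC source also accepts a "query" option, which achieves the same thing without the manual subquery alias. A minimal sketch under that assumption, reusing the driver, url, user and password variables from above (spark is the SparkSession):

# Equivalent read on Spark 2.4+ using the "query" option instead of a subquery
df = spark.read \
    .format("jdbc") \
    .option("driver", driver) \
    .option("url", url) \
    .option("user", user) \
    .option("password", password) \
    .option("query", "select * from mytable where day = 2016-11-25 and hour = 10") \
    .load()
df.show()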

Related

Spark Action after dataframe load from jdbc driver taking too long

I am loading 10 GB or more of data from two databases, Redshift and Snowflake, into DataFrames in order to compare the datasets. The loading code is the same for both except for the JDBC jars used, whether I load with or without partitioning.
This is the version with a single partition:
df1 = sparkcontext.read \
.format("jdbc") \
.option("driver", "com.amazon.redshift.jdbc42.Driver") \
.option("url", jdbcurl) \
.option("query", query) \
.option("user", user) \
.option("password", password).load()
df1.count()
df2 = sparkcontext.read \
.format("jdbc") \
.option("url", jdbcurl) \
.option("driver", "net.snowflake.client.jdbc.SnowflakeDriver") \
.option("schema", SchemaName) \
.option("query", query) \
.option("user", user) \
.option("password", password).load()
df2.count()
The count on the second DataFrame takes 5 seconds, while the count on the first takes 60 seconds.
Since at this point we are not performing any calculation other than counting the rows in each DataFrame, why is there a difference of ~50-55 seconds between the two?
df1.join(df2, on=<clause>, how="outer").where(<clause>)
If I skip the counts and instead join df1 and df2, with partitioning done at load time by passing the options below, the Snowflake stage completes in 4 minutes but the Redshift one takes 10 minutes. Why is there a difference of 6 minutes between the two stages? The partitioning at load time is done like this:
.option("partitionColumn", partitioncolumn) \
.option("lowerBound", lowerbound) \
.option("upperBound", upperbound) \
.option("numPartitions", numpartitions) \
I just want to understand why there is such a difference between the two sources when the data and the code are the same.
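For reference, a sketch of what the full partitioned Redshift read described above might look like (variable names taken from the question). Note that Spark's JDBC source does not allow the "query" option together with "partitionColumn", so the partitioned variant has to wrap the query as a "dbtable" subquery:

# Partitioned Redshift read, combining the options from the question.
# jdbcurl, query, user, password, partitioncolumn, lowerbound, upperbound
# and numpartitions are assumed to be defined as in the question.
df1 = sparkcontext.read \
    .format("jdbc") \
    .option("driver", "com.amazon.redshift.jdbc42.Driver") \
    .option("url", jdbcurl) \
    .option("dbtable", "({}) t".format(query)) \
    .option("partitionColumn", partitioncolumn) \
    .option("lowerBound", lowerbound) \
    .option("upperBound", upperbound) \
    .option("numPartitions", numpartitions) \
    .option("user", user) \
    .option("password", password) \
    .load()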

How to filter data using spark.read in place?

I am trying to read data in Delta format from ADLS, and I want to read only a portion of that data by filtering it in place. The same approach worked for me when reading from a JDBC source:
query = f"""
select * from {table_name}
where
createdate < to_date('{createdate}','YYYY-MM-DD HH24:MI:SS') or
modifieddate < to_date('{modifieddate}','YYYY-MM-DD HH24:MI:SS')
"""
return spark.read \
.format("jdbc") \
.option("url", url) \
.option("query", query) \
.option("user", username) \
.option("password", password) \
.option("driver", "oracle.jdbc.driver.OracleDriver") \
.load()
So I tried to read the Delta table in a similar way using a query option, but it reads the whole table:
return spark.read \
.format("delta") \
.option("query", query) \
.load(path)
How can I solve this without reading the full DataFrame and filtering it afterwards?
Thanks in advance!
Spark uses a functionality called predicate pushdown to optimize queries.
In the first case, the filters can be passed on to the Oracle database.
Delta does not work that way: there is no "query" option for the delta format. There can be optimisations through data skipping and Z-ordering, but since you are essentially querying Parquet files, you have to read them and filter afterwards.
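A sketch of the usual way to do this with Delta: load the path and apply the filter as a DataFrame operation, which still lets Delta skip whole files via partition pruning and per-file statistics where possible. The column names, the path variable and the timestamp format below are taken from the question and are only illustrative:

from pyspark.sql import functions as F

# Load the Delta path and filter immediately; the filter is applied at scan time,
# so whole files can be skipped via partition values and min/max statistics.
filtered_df = (
    spark.read.format("delta").load(path)
    .where(
        (F.col("createdate") < F.to_timestamp(F.lit(createdate), "yyyy-MM-dd HH:mm:ss"))
        | (F.col("modifieddate") < F.to_timestamp(F.lit(modifieddate), "yyyy-MM-dd HH:mm:ss"))
    )
)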

Refreshing static Dataframe periodically not working after unpersist in Spark Structured Streaming

I am writing a PySpark Structured Streaming application (Spark 3.0.1) and I'm trying to periodically refresh a static DataFrame read from a JDBC source.
I have followed the instructions in this post using a rate stream.
Stream-Static Join: How to refresh (unpersist/persist) static Dataframe periodically
However, whenever the first unpersist occurs (whether with or without blocking=True), the following persist is ignored, and from then on the DataFrame is read on every trigger instead of being cached in storage.
This is what my code looks like:
# Read static dataframe
static_df = spark.read \
.option("multiline", True) \
.format("jdbc") \
.option("url", JDBC_URL) \
.option("user", USERNAME) \
.option("password", PASSWORD) \
.option("numPartitions", NUMPARTITIONS) \
.option("query", QUERY) \
.load()\
.na.drop(subset=[MY_COL]) \
.repartition(MY_COL) \
.persist()
# Create rate stream
staticRefreshStream = spark.readStream.format("rate") \
.option("rowsPerSecond", 1) \
.option("numPartitions", 1) \
.load()
def foreachBatchRefresher(batch_df, batch_id):
global static_df
print("Refreshing static table")
static_df.unpersist()
static_df = spark.read \
.option("multiline", True) \
.format("jdbc") \
.option("url", JDBC_URL) \
.option("user", USERNAME) \
.option("password", PASSWORD) \
.option("numPartitions", NUMPARTITIONS) \
.option("query", QUERY) \
.load()\
.na.drop(subset=[MY_COL]) \
.repartition(MY_COL) \
.persist()
staticRefreshStream.writeStream \
.format("console") \
.outputMode("append") \
.queryName("RefreshStaticDF") \
.foreachBatch(foreachBatchRefresher) \
.trigger(processingTime='1 minutes') \
.start()
staticRefreshStream.awaitTermination()
The other parts, including reading the streaming DataFrame, the DataFrame transformations and writing to a sink, are omitted.
Any idea what I'm missing?

Query table names

I have been given access to a database, and I am querying the data from a Spark cluster. How do I list all the databases/tables I have access to?
jdbcUrl = "jdbc:sqlserver://{0}:{1};database={2};user='{3}';password='{4}'".format(jdbcHostname, jdbcPort, jdbcDatabase, jdbcUsername, jdbcPassword)
connectionProperties = {
"user" : jdbcUsername,
"password" : jdbcPassword,
"driver" : "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}
df = spark.read.jdbc(url=jdbcUrl, properties=connectionProperties)
Access to the database has been authenticated.
In SQL Server itself:
select *
from sys.tables
I'm not sure whether or not you use a synonym as the way into the sys schema.
val tables = spark.read.jdbc(jdbc_url, "sys.tables", connectionProperties)
tables.select(...
If you have a synonym, replace sys.tables with that. There are different ways of writing this: you can go down the tables approach or use your own SQL query. The line above is the tables approach; here is an example of the SQL query approach:
val dataframe_mysql = spark.read.jdbc(jdbcUrl, "(select k, v from sample order by k DESC) e", connectionProperties)
I just realized this is the Scala version, and you asked about PySpark specifically.
See: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
It's the same approach in PySpark; the documentation example below happens to use Postgres:
jdbcDF = spark.read \
.format("jdbc") \
.option("url", "jdbc:postgresql:dbserver") \
.option("dbtable", "schema.tablename") \
.option("user", "username") \
.option("password", "password") \
.load()
JDBC loading and saving can be achieved via either the load/save or the jdbc methods; see the guide.
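Putting that together for the SQL Server case in the question, a minimal PySpark sketch (jdbcUrl, jdbcUsername and jdbcPassword are assumed from the question; sys.tables only shows objects the login has permission to see):

# List the tables visible to this login by querying sys.tables through JDBC
tables_df = spark.read \
    .format("jdbc") \
    .option("url", jdbcUrl) \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .option("dbtable", "(select name, schema_id, create_date from sys.tables) t") \
    .option("user", jdbcUsername) \
    .option("password", jdbcPassword) \
    .load()
tables_df.show(truncate=False)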

Upsert data in postgresql using spark structured streaming

I am trying to run a Structured Streaming application using (Py)Spark. My data is read from a Kafka topic, and I then run a windowed aggregation on event time.
# I have been able to create data frame pn_data_df after reading data from Kafka
Schema of pn_data_df
|
- id StringType
- source StringType
- source_id StringType
- delivered_time TimeStamp
windowed_report_df = pn_data_df.filter(pn_data_df.source == 'campaign') \
.withWatermark("delivered_time", "24 hours") \
.groupBy('source_id', window('delivered_time', '15 minute')) \
.count()
windowed_report_df = windowed_report_df \
.withColumn('start_ts', unix_timestamp(windowed_report_df.window.start)) \
.withColumn('end_ts', unix_timestamp(windowed_report_df.window.end)) \
.selectExpr('CAST(source_id as LONG)', 'start_ts', 'end_ts', 'count')
I am writing this windowed aggregation to a PostgreSQL table which I have already created:
CREATE TABLE pn_delivery_report(
source_id bigint not null,
start_ts bigint not null,
end_ts bigint not null,
count integer not null,
unique(source_id, start_ts)
);
Writing to PostgreSQL with the Spark JDBC writer only lets me Append or Overwrite. Append mode fails if the composite key already exists in the database, and Overwrite simply replaces the entire table with the current batch output.
def write_pn_report_to_postgres(df, epoch_id):
df.write \
.mode('append') \
.format('jdbc') \
.option("url", "jdbc:postgresql://db_endpoint/db") \
.option("driver", "org.postgresql.Driver") \
.option("dbtable", "pn_delivery_report") \
.option("user", "postgres") \
.option("password", "PASSWORD") \
.save()
windowed_report_df.writeStream \
.foreachBatch(write_pn_report_to_postgres) \
.option("checkpointLocation", '/home/hadoop/campaign_report_df_windowed_checkpoint') \
.outputMode('update') \
.start()
How can I execute a query like
INSERT INTO pn_delivery_report (source_id, start_ts, end_ts, count)
VALUES (1001, 125000000001, 125000050000, 128),
       (1002, 125000000001, 125000050000, 127)
ON CONFLICT (source_id, start_ts) DO UPDATE
SET count = excluded.count;
in foreachBatch?
Spark has a JIRA feature ticket open for this, but it does not seem to have been prioritised so far:
https://issues.apache.org/jira/browse/SPARK-19335
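Until that lands, a common workaround is to run the upsert yourself inside foreachBatch instead of going through the JDBC writer. A minimal sketch, assuming psycopg2 is installed on the driver and using the connection details from the question; the batch is collected to the driver, which is only reasonable for small aggregated batches like this one:

import psycopg2
from psycopg2.extras import execute_values

UPSERT_SQL = """
    INSERT INTO pn_delivery_report (source_id, start_ts, end_ts, count)
    VALUES %s
    ON CONFLICT (source_id, start_ts) DO UPDATE SET count = excluded.count
"""

def upsert_pn_report(df, epoch_id):
    # Collect the (small) aggregated batch to the driver and upsert it in one statement
    rows = [tuple(r) for r in df.select("source_id", "start_ts", "end_ts", "count").collect()]
    if not rows:
        return
    conn = psycopg2.connect(host="db_endpoint", dbname="db",
                            user="postgres", password="PASSWORD")
    try:
        with conn, conn.cursor() as cur:
            execute_values(cur, UPSERT_SQL, rows)
    finally:
        conn.close()

windowed_report_df.writeStream \
    .foreachBatch(upsert_pn_report) \
    .option("checkpointLocation", "/home/hadoop/campaign_report_df_windowed_checkpoint") \
    .outputMode("update") \
    .start()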
This is what worked for me:
def _write_streaming(self,
df,
epoch_id
) -> None:
df.write \
.mode('append') \
.format("jdbc") \
.option("url", f"jdbc:postgresql://localhost:5432/postgres") \
.option("driver", "org.postgresql.Driver") \
.option("dbtable", 'table_test') \
.option("user", 'user') \
.option("password", 'password') \
.save()
df_stream.writeStream \
.foreachBatch(_write_streaming) \
.start() \
.awaitTermination()
You need to add ".awaitTermination()" at the end.
