How to create a CSV file with PySpark? - apache-spark

I have a short question about PySpark's write API.
read_jdbc = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql:dbserver") \
    .option("dbtable", "schema.tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .load()

read_jdbc.show()
When I call show(), it works perfectly. However, when I try to write:
read_jdbc.write.csv("some/aaa")
it only creates the aaa folder, and no CSV file appears. I also tried
read_jdbc.write.jdbc(mysql_url, table="csv_test", mode="append")
but that does not work either. Any help?

You can write a DataFrame to CSV with either
df.write.csv("file-path")
or
df.write.format("csv").save("file-path")
Note that Spark writes CSV output as a directory: file-path becomes a folder containing one part-*.csv file per partition (plus a _SUCCESS marker), not a single CSV file. Seeing only the aaa folder is therefore expected; the data is in the part files inside it.

Related

Spark Action after dataframe load from jdbc driver taking too long

I am loading 10 GB or more of data from two databases, Redshift and Snowflake, into DataFrames in order to compare the two datasets. The loading code is the same for both except for the JDBC jars and drivers used, and I see the same behaviour with or without partitioning.
This is the single-partition version:
df1 = sparkcontext.read \
    .format("jdbc") \
    .option("driver", "com.amazon.redshift.jdbc42.Driver") \
    .option("url", jdbcurl) \
    .option("query", query) \
    .option("user", user) \
    .option("password", password) \
    .load()
df1.count()

df2 = sparkcontext.read \
    .format("jdbc") \
    .option("url", jdbcurl) \
    .option("driver", "net.snowflake.client.jdbc.SnowflakeDriver") \
    .option("schema", SchemaName) \
    .option("query", query) \
    .option("user", user) \
    .option("password", password) \
    .load()
df2.count()
The count on the second DataFrame takes about 5 seconds, while the count on the first takes about 60 seconds.
Since at this point we are not performing any computation other than counting the rows in each DataFrame, why is there a difference of roughly 50-55 seconds between the two?
df1.join(df2, on=<clause>, how=outer).where(<clause>)
If I skip the counts and go straight to this join, with partitioning applied at load time by passing the options below, the Snowflake stage completes in 4 minutes while the Redshift stage takes 10 minutes. Why is there a difference of 6 minutes between the two stages? The partitioning at load time is specified like this:
.option("partitionColumn", partitioncolumn) \
.option("lowerBound", lowerbound) \
.option("upperBound", upperbound) \
.option("numPartitions", numpartitions) \
I just want to understand why there is such a difference between the two sources when the data and the code are the same.
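For reference, a partitioned JDBC read that uses these options end to end would look roughly like the sketch below. Spark's JDBC source does not accept the query option together with partitionColumn, so the sketch wraps the query as a dbtable subquery; the alias name src is made up for illustration.

# Sketch of a partitioned JDBC read against Redshift (not the asker's exact code).
# partitionColumn must be numeric, date, or timestamp; lowerBound/upperBound only
# control how the ranges are split across partitions, they do not filter rows.
df1 = sparkcontext.read \
    .format("jdbc") \
    .option("driver", "com.amazon.redshift.jdbc42.Driver") \
    .option("url", jdbcurl) \
    .option("dbtable", "(" + query + ") AS src") \
    .option("user", user) \
    .option("password", password) \
    .option("partitionColumn", partitioncolumn) \
    .option("lowerBound", lowerbound) \
    .option("upperBound", upperbound) \
    .option("numPartitions", numpartitions) \
    .load()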

How to filter data using spark.read in place?

I am trying to read data in Delta format from ADLS, and I want to read only a portion of it by filtering in place. The same approach worked for me when reading from a JDBC source:
query = f"""
select * from {table_name}
where
createdate < to_date('{createdate}','YYYY-MM-DD HH24:MI:SS') or
modifieddate < to_date('{modifieddate}','YYYY-MM-DD HH24:MI:SS')
"""
return spark.read \
    .format("jdbc") \
    .option("url", url) \
    .option("query", query) \
    .option("user", username) \
    .option("password", password) \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .load()
So I tried to read the Delta table in a similar way using a query option, but it reads the whole table:
return spark.read \
    .format("delta") \
    .option("query", query) \
    .load(path)
How could I solve this without reading the full DataFrame and then filtering it?
Thanks in advance!
Spark uses a feature called predicate pushdown to optimize queries.
In the first case, the filter is part of the SQL query, so it is executed by the Oracle database itself.
The Delta source does not accept a query option; it does not work that way. You express the filter on the DataFrame after load() instead. Delta can still reduce the work through data skipping and Z-ordering (it is essentially a collection of Parquet files with statistics), but there is no SQL query that gets sent to the source.
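A sketch of the filter-after-load approach, reusing the two date columns from the question (the cutoff variables are assumed to be Python datetime objects or ISO-8601 strings, which is an assumption on my part):

from pyspark.sql import functions as F

# Sketch: filter the Delta table after load(). Delta's per-file statistics let
# Spark skip files whose min/max values rule out matching rows, so this is not
# necessarily a full scan of the table.
df = spark.read \
    .format("delta") \
    .load(path) \
    .filter((F.col("createdate") < F.lit(createdate)) |
            (F.col("modifieddate") < F.lit(modifieddate)))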

Refreshing static Dataframe periodically not working after unpersist in Spark Structured Streaming

I am writing a PySpark Structured Streaming application (version 3.0.1), and I am trying to periodically refresh a static DataFrame loaded from a JDBC source.
I followed the instructions in this post, using a rate stream:
Stream-Static Join: How to refresh (unpersist/persist) static Dataframe periodically
However, as soon as the first unpersist() runs (whether with or without blocking=True), the persist() that follows is ignored, and from then on the DataFrame is re-read from the source on every trigger instead of being served from the cache.
This is what my code looks like:
# Read static dataframe
static_df = spark.read \
    .option("multiline", True) \
    .format("jdbc") \
    .option("url", JDBC_URL) \
    .option("user", USERNAME) \
    .option("password", PASSWORD) \
    .option("numPartitions", NUMPARTITIONS) \
    .option("query", QUERY) \
    .load() \
    .na.drop(subset=[MY_COL]) \
    .repartition(MY_COL) \
    .persist()

# Create rate stream
staticRefreshStream = spark.readStream.format("rate") \
    .option("rowsPerSecond", 1) \
    .option("numPartitions", 1) \
    .load()
def foreachBatchRefresher(batch_df, batch_id):
    global static_df
    print("Refreshing static table")
    static_df.unpersist()
    static_df = spark.read \
        .option("multiline", True) \
        .format("jdbc") \
        .option("url", JDBC_URL) \
        .option("user", USERNAME) \
        .option("password", PASSWORD) \
        .option("numPartitions", NUMPARTITIONS) \
        .option("query", QUERY) \
        .load() \
        .na.drop(subset=[MY_COL]) \
        .repartition(MY_COL) \
        .persist()
refresh_query = staticRefreshStream.writeStream \
    .format("console") \
    .outputMode("append") \
    .queryName("RefreshStaticDF") \
    .foreachBatch(foreachBatchRefresher) \
    .trigger(processingTime='1 minutes') \
    .start()

refresh_query.awaitTermination()
The other parts, including reading the streaming DataFrame, the DataFrame transformations, and writing to a sink, are omitted.
Any idea what I'm missing?
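One thing worth checking after each refresh is whether the new DataFrame is actually marked for caching; a small diagnostic sketch (the helper name is made up, and is_cached / storageLevel only reflect what persist() requested, since the cache is populated lazily the next time static_df is used in a trigger):

# Diagnostic sketch, not a fix: call this at the end of foreachBatchRefresher.
def log_cache_status(df, label):
    print(f"{label}: is_cached={df.is_cached}, storageLevel={df.storageLevel}")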

Upsert data in postgresql using spark structured streaming

I am trying to run a Structured Streaming application using (Py)Spark. My data is read from a Kafka topic, and then I run a windowed aggregation on event time.
# I have been able to create the data frame pn_data_df after reading from Kafka
Schema of pn_data_df:
- id: StringType
- source: StringType
- source_id: StringType
- delivered_time: TimestampType
windowed_report_df = pn_data_df.filter(pn_data_df.source == 'campaign') \
    .withWatermark("delivered_time", "24 hours") \
    .groupBy('source_id', window('delivered_time', '15 minute')) \
    .count()

windowed_report_df = windowed_report_df \
    .withColumn('start_ts', unix_timestamp(windowed_report_df.window.start)) \
    .withColumn('end_ts', unix_timestamp(windowed_report_df.window.end)) \
    .selectExpr('CAST(source_id as LONG)', 'start_ts', 'end_ts', 'count')
I am writing this windowed aggregation to my postgresql database which I have already created.
CREATE TABLE pn_delivery_report (
    source_id bigint not null,
    start_ts  bigint not null,
    end_ts    bigint not null,
    count     integer not null,
    unique (source_id, start_ts)
);
Writing to PostgreSQL with the Spark JDBC writer only gives me Append or Overwrite. Append mode fails when a row with the same composite key already exists in the table, and Overwrite simply replaces the entire table with the current batch output.
def write_pn_report_to_postgres(df, epoch_id):
    df.write \
        .mode('append') \
        .format('jdbc') \
        .option("url", "jdbc:postgresql://db_endpoint/db") \
        .option("driver", "org.postgresql.Driver") \
        .option("dbtable", "pn_delivery_report") \
        .option("user", "postgres") \
        .option("password", "PASSWORD") \
        .save()

windowed_report_df.writeStream \
    .foreachBatch(write_pn_report_to_postgres) \
    .option("checkpointLocation", '/home/hadoop/campaign_report_df_windowed_checkpoint') \
    .outputMode('update') \
    .start()
How can I execute a query like
INSERT INTO pn_delivery_report (source_id, start_ts, end_ts, count)
VALUES (1001, 125000000001, 125000050000, 128),
       (1002, 125000000001, 125000050000, 127)
ON CONFLICT (source_id, start_ts)
DO UPDATE SET count = excluded.count;
in foreachBatch.
Spark has a JIRA feature ticket open for this, but it seems it has not been prioritised yet:
https://issues.apache.org/jira/browse/SPARK-19335
This is what worked for me:
def _write_streaming(df, epoch_id) -> None:
    df.write \
        .mode('append') \
        .format("jdbc") \
        .option("url", "jdbc:postgresql://localhost:5432/postgres") \
        .option("driver", "org.postgresql.Driver") \
        .option("dbtable", 'table_test') \
        .option("user", 'user') \
        .option("password", 'password') \
        .save()

df_stream.writeStream \
    .foreachBatch(_write_streaming) \
    .start() \
    .awaitTermination()
You need to add ".awaitTermination()" at the end.
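If a true upsert is needed, one common workaround is to run the ON CONFLICT statement yourself inside foreachBatch with a PostgreSQL client library. A sketch using psycopg2 (the connection details are placeholders taken from the question, psycopg2 must be installed on the driver, and collecting the micro-batch to the driver is only reasonable for small aggregated batches):

import psycopg2
from psycopg2.extras import execute_values

def upsert_pn_report(df, epoch_id):
    # Sketch, not the original answer: collect the micro-batch on the driver
    # and run one multi-row INSERT ... ON CONFLICT. Connection details are
    # placeholders based on the question.
    rows = [tuple(r) for r in
            df.select("source_id", "start_ts", "end_ts", "count").collect()]
    if not rows:
        return
    conn = psycopg2.connect(host="db_endpoint", dbname="db",
                            user="postgres", password="PASSWORD")
    try:
        # The outer "with conn" commits the transaction on success.
        with conn, conn.cursor() as cur:
            execute_values(cur, """
                INSERT INTO pn_delivery_report (source_id, start_ts, end_ts, count)
                VALUES %s
                ON CONFLICT (source_id, start_ts)
                DO UPDATE SET count = excluded.count
            """, rows)
    finally:
        conn.close()

windowed_report_df.writeStream \
    .foreachBatch(upsert_pn_report) \
    .option("checkpointLocation", '/home/hadoop/campaign_report_df_windowed_checkpoint') \
    .outputMode('update') \
    .start()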

Structured Streaming (Spark 2.3.0) cannot write to Parquet file sink when submitted as a job

I'm consuming from Kafka and writing to Parquet on EMRFS. The code below works in spark-shell:
val filesink_query = outputdf.writeStream
    .partitionBy(<some column>)
    .format("parquet")
    .option("path", <some path in EMRFS>)
    .option("checkpointLocation", "/tmp/ingestcheckpoint")
    .trigger(Trigger.ProcessingTime(10.seconds))
    .outputMode(OutputMode.Append)
    .start
SBT packages the code without errors, but when the .jar is submitted with spark-submit, the job is accepted and stays in the RUNNING state forever without writing any data to HDFS.
There is no ERROR in the .inprogress log.
Some posts suggest that a large watermark duration can cause this, but I have not set a custom watermark duration.
I can write to Parquet using PySpark; here is my code in case it is useful:
stream = self.spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", self.kafka_bootstrap_servers) \
    .option("subscribe", self.topic) \
    .option("startingOffsets", self.startingOffsets) \
    .option("max.poll.records", self.max_poll_records) \
    .option("auto.commit.interval.ms", self.auto_commit_interval_ms) \
    .option("session.timeout.ms", self.session_timeout_ms) \
    .option("key.deserializer", self.key_deserializer) \
    .option("value.deserializer", self.value_deserializer) \
    .load()

self.query = stream \
    .select(col("value")) \
    .select((self.proto_function("value")).alias("value_udf")) \
    .select(*columns,
            date_format(column_time, "yyyy").alias("year"),
            date_format(column_time, "MM").alias("month"),
            date_format(column_time, "dd").alias("day"),
            date_format(column_time, "HH").alias("hour"))

query = self.query \
    .writeStream \
    .format("parquet") \
    .option("checkpointLocation", self.path) \
    .partitionBy("year", "month", "day", "hour") \
    .option("path", self.path) \
    .start()
Also, you need to run the code like this: spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 <code>
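One more detail, which is an assumption on my part rather than something from the original answer: when the application is launched with spark-submit instead of the shell, the driver should also block on the streaming query so the process does not exit before any micro-batch runs, for example:

query.awaitTermination()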
