Incremental data loading and querying in PySpark without restarting the Spark job

Hi all, I want to run queries against data that is loaded incrementally.
df = spark.read.csv('csvFile', header=True)  # 1000 rows
df.persist()  # assume this takes 5 min
df.createOrReplaceTempView('data_table')  # registerTempTable is deprecated since Spark 2.0
result = spark.sql('select * from data_table where column1 > 10')  # 100 rows
df_incremental = spark.read.csv('incremental.csv')  # 200 rows
df_combined = df.union(df_incremental)  # unionAll is an alias of union
df_combined.persist()  # takes more than 5 min; I want to avoid this because other queries might be running at this time
df_combined.createOrReplaceTempView('data_table')
result = spark.sql('select * from data_table where column1 > 10')  # 105 rows
P1: Read a CSV file or MySQL table into a Spark DataFrame.
P2: Persist that DataFrame in memory only (reason: I need performance, and my dataset fits in memory).
P3: Register it as a temp table and run Spark SQL queries. Up to this point my Spark job is up and running.
P4: The next day I receive an incremental dataset (in a temp MySQL table or a CSV file). Now I want to run the same query on the total set, i.e. persisted_prevData + recent_read_IncrementalData. I will call it mixedDataset.
Note: there is no certainty about when incremental data arrives; it can come up to 30 times a day.
Even at this point I don't want the Spark application to go down. It should always be up, and querying mixedDataset should take about the same time as if the whole set were persisted.
My concerns:
In P4, do I need to unpersist the previous data and persist the union DataFrame of the previous and incremental data again?
My most important concern is that I don't want to restart the Spark job to load the updated data (only if the server goes down do I have to restart, of course).
So, at a high level, I need to query the dataset plus any incremental data dynamically, with fast performance.
Currently I handle this by keeping all the data in one folder and placing each incremental file in the same directory. Every 2-3 hours I restart the server, and my Spark app starts by reading all the CSV files present in that folder. Then the queries run on them.
I am also exploring Hive persistent tables and Spark Streaming, and will update here if I find anything.
Please suggest a way or an architecture to achieve this.
Please comment if anything in the question is unclear, without downvoting it :)
Thanks.

Try streaming instead. It will be much faster, since the session is already running and the query is triggered every time you place something in the folder:
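# input_schema and input_path must be defined beforehand;
# streaming file sources require an explicit schema.
df_incremental = spark \
    .readStream \
    .option("sep", ",") \
    .schema(input_schema) \
    .csv(input_path)
# Keep only the rows of interest and expose them as an in-memory table.
df_incremental.where("column1 > 10") \
    .writeStream \
    .queryName("data_table") \
    .format("memory") \
    .start()
spark.sql("SELECT * FROM data_table").show()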

Related

Method to optimize PySpark dataframe saving time

I'm running a notebook on Azure Databricks using a multi-node cluster with 1 driver and 1-8 workers (each with 16 cores and 56 GB RAM). It reads source data with 30K records from Azure ADLS. The notebook consists of a few transformation steps and also uses two UDFs that are necessary for the implementation. While all the transformation steps run within 12 minutes (which is expected), it takes more than 2 hours to save the final DataFrame to an ADLS Delta table. I'm providing a code snippet here (I can't provide the entire code); please suggest ways to reduce the saving time.
# All the data reading and transformation code
# only one display statement before saving it to delta table. Up to this statement it is taking 12 minutes to run
data.display()
# Persisting the data frame
from pyspark import StorageLevel
data.persist(StorageLevel.MEMORY_ONLY)
# Finally writing the data to delta table
# This part is taking more than 2 hours to run
# Persist Brand Extraction Output
(
data
.write
.format('delta')
.mode('overwrite')
.option('overwriteSchema', 'true')
.saveAsTable('output_table')
)
I also tried another save option, without much improvement:
mount_path = "/mnt/********/"
table_name = "********"
adls_path = mount_path + table_name
(data.write.format('delta').mode('overwrite').option('overwriteSchema', 'true').save(adls_path))

Is my use case for GCP Dataproc feasible?

I'm not sure if there is a place or people to ask for one-on-one advice on Dataproc setup and tuning, but I figure here is as good a place as any to find some help.
Our team has been primarily using BigQuery to do our data analysis on location-driven data. We're carrying data back to 2019, so we're carrying a lot of data. We've added some clustering (we always had date partitioning) to help keep costs down, but it's getting to the point where it's just not feasible. At the moment we have upwards of 200 TB of data, and daily raw data ranges from 3-8 TB (it gets reduced quite a bit after a few steps).
First we'd like to move our 200 TB of data to GCS and segment it to more granular level. The schema for this data is:
uid -- STRING
timestamp_of_observation -- TIMESTAMP
lat -- FLOAT
lon -- FLOAT
datasource -- STRING
cbg (short for census_block_group) -- STRING
We would like to save the data to GCS using Hive partitioning so that our bucket folder structure looks like
year > month > day > cbg
Knowing that we are processing about 200 TB and 3 years of data, and that cbgs alone have about 200,000 possible values, is this feasible?
We have a few other options, using either census block tracts (84,414 subfolders) or counties (35,000); the more granularity, the better for us.
In my first attempts I either get an OOM or stages that run forever. My initial PySpark code looks like the following:
from pyspark import SparkFiles
from pyspark.sql import SparkSession  # needed for SparkSession.builder below
from pyspark.sql.functions import year, month, dayofmonth, rand
from pyspark.sql.functions import col, spark_partition_id, asc, desc
# use appropriate version for jar depending on the scala version
spark = SparkSession.builder\
.appName('BigNumeric')\
.getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", 365*100)
df = spark.read \
.format("bigquery") \
.load("data-location-338617.Raw_Location_Data.data_cbg_enrich_proto")
df1 = df.withColumn("year", year(col("visit_timestamp"))) \
.withColumn("month", month(col("visit_timestamp"))) \
.withColumn("day", dayofmonth(col("visit_timestamp"))) \
.withColumn("cbg", col("boundary_partition")) \
.withColumn('salt', rand())
df1.repartition(365*100,'salt','year','month','day') \
.drop('salt') \
.write.mode("overwrite") \
.format("parquet") \
.partitionBy("year", "month", "day", "cbg") \
.save("gs://initial_test/cbg_data/")
This code was given to me by a fellow engineer. He told me to add salt for skewness and to increase my partitions.
Any and all advice would be helpful. The goal here is to do one huge batch to migrate our data to GCS and then begin saving our transformed raw data to GCS daily, as opposed to BigQuery.
I would envision the number of files to be written as 3 * 12 * 30 * 200,000 (216,000,000), which seems like a lot. Is there a better way to organize this? Our original purpose was to make this data MUCH cheaper to query downstream. Right now the date partition has been the best way to minimize cost; we have clustering on the CBG column, but it doesn't seem to drive cost down very much. My thought is that with the GCS Hive structure, CBG (or another spatial grouping) would essentially become a true partition and not just a cluster.
Lastly, I'm not doing much to the cluster configuration. I've played around with the number of worker nodes and machine types but haven't truly gotten anything to work. Again, any help is appreciated, and thank you for looking!
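The directory estimate above can be checked with a few lines of plain Python. This is only a back-of-the-envelope sketch: it assumes 30-day months as in the estimate, uses the group counts quoted in the question, and counts leaf directories in the worst case where every group appears on every day (each leaf directory holds at least one file, usually more). The helper name leaf_directories is made up for illustration.

```python
# Back-of-the-envelope count of leaf directories for a
# year/month/day/<spatial group> layout. All inputs come from the
# question; 30-day months are an approximation.
YEARS = 3
MONTHS_PER_YEAR = 12
DAYS_PER_MONTH = 30  # approximation used in the question's estimate


def leaf_directories(groups_per_day: int) -> int:
    """Upper bound on leaf directories if every group appears every day."""
    days = YEARS * MONTHS_PER_YEAR * DAYS_PER_MONTH
    return days * groups_per_day


# Group counts as quoted in the question.
options = {
    "census block group": 200_000,
    "census block tract": 84_414,
    "county": 35_000,
}

estimates = {name: leaf_directories(n) for name, n in options.items()}
for name, count in estimates.items():
    print(f"{name:>20}: {count:,} leaf directories")
```

Even the coarsest option yields tens of millions of directories under these assumptions, which is worth weighing against the per-object overhead of listing and writing on GCS.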
This is the cluster setup CLI code
gcloud dataproc clusters create cluster-f35f --autoscaling-policy location_data --enable-component-gateway --bucket cbg-test-patino --region us-central1 --zone us-central1-f --master-machine-type n1-standard-8 --master-boot-disk-type pd-ssd --master-boot-disk-size 500 --num-workers 30 --worker-machine-type n2-standard-16 --worker-boot-disk-type pd-ssd --worker-boot-disk-size 1000 --image-version 2.0-debian10 --optional-components JUPYTER --project data-*********** --initialization-actions gs://goog-dataproc-initialization-actions-us-central1/connectors/connectors.sh --metadata bigquery-connector-version=1.2.0 --metadata spark-bigquery-connector-version=0.21.0

Why is repartition faster than partitionBy in Spark?

I am attempting to use Spark for a very simple use case: given a large set of files (90k) with device time-series data for millions of devices, group all of the time-series reads for a given device into a single set of files (partition). For now let's say we are targeting 100 partitions, and it is not critical that a given device's data shows up in the same output file, just the same partition.
Given this problem we’ve come up with two ways to do this - repartition then write or write with partitionBy applied to the Writer. The code for either of these is very simple:
repartition (hash column is added to ensure that comparison to partitionBy code below is one-to-one):
df = spark.read.format("xml") \
.options(rowTag="DeviceData") \
.load(file_path, schema=meter_data) \
.withColumn("partition", hash(col("_DeviceName")).cast("Long") % num_partitions) \
.repartition("partition") \
.write.format("json") \
.option("codec", "org.apache.hadoop.io.compress.GzipCodec") \
.mode("overwrite") \
.save(output_path)
partitionBy:
df = spark.read.format("xml") \
.options(rowTag="DeviceData") \
.load(file_path, schema=meter_data) \
.withColumn("partition", hash(col("_DeviceName")).cast("Long") % num_partitions) \
.write.format("json") \
.partitionBy("partition") \
.option("codec", "org.apache.hadoop.io.compress.GzipCodec") \
.mode("overwrite") \
.save(output_path)
In our testing repartition is 10x faster than partitionBy. Why is this?
Based on my understanding, repartition incurs a shuffle, which my Spark learnings have told me to avoid whenever possible. On the other hand, partitionBy (based on my understanding) only incurs a sort operation local to each node, so no shuffle is needed. Am I misunderstanding something that makes me think partitionBy would be faster?
TLDR: Spark triggers a sort when you call partitionBy, and not a hash re-partitioning. This is why it is much slower in your case.
We can check that with a toy example:
spark.range(1000).withColumn("partition", 'id % 100)
.repartition('partition).write.csv("/tmp/test.csv")
Don't pay attention to the grey stage, it is skipped because it was computed in a previous job.
Then, with partitionBy:
spark.range(1000).withColumn("partition", 'id % 100)
.write.partitionBy("partition").csv("/tmp/test2.csv")
You can check that even if you add repartition before partitionBy, the sort will still be there. So what's happening? Notice that the sort in the second DAG does not trigger a shuffle. It is a map partition. In fact, when you call partitionBy, Spark does not shuffle the data as one might expect at first. Spark sorts each partition individually, and then each executor writes its data to the corresponding output partition, in a separate file. Therefore, with partitionBy you are not writing num_partitions files but something between num_partitions and num_partitions * num_executors files. Each partition has one file per executor containing data belonging to that partition.
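The range of file counts described above can be illustrated with a small simulation in plain Python (no Spark involved). It is a rough sketch under stated assumptions: it counts one output file per write task and partition value, which is also the granularity the 10K * 100 estimate further down uses, and it uses value % num_tasks as a stand-in for Spark's hash partitioner. The function count_output_files is a made-up name.

```python
import random


def count_output_files(num_rows, num_tasks, num_partition_values,
                       shuffle_by_key, seed=0):
    """Count distinct (task, partition value) pairs, i.e. output files."""
    rng = random.Random(seed)
    files = set()
    for _ in range(num_rows):
        value = rng.randrange(num_partition_values)
        if shuffle_by_key:
            # Like repartition("partition"): the task depends only on the key.
            task = value % num_tasks
        else:
            # No shuffle: a row can sit in any task regardless of its key.
            task = rng.randrange(num_tasks)
        files.add((task, value))
    return len(files)


no_shuffle = count_output_files(100_000, num_tasks=50,
                                num_partition_values=100,
                                shuffle_by_key=False)
with_shuffle = count_output_files(100_000, num_tasks=50,
                                  num_partition_values=100,
                                  shuffle_by_key=True)
print("partitionBy only:", no_shuffle, "files")
print("repartition + partitionBy:", with_shuffle, "files")
```

Without the keyed shuffle the count approaches num_tasks * num_partition_values; with it, the count cannot exceed the number of distinct partition values.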
I think @Oli has explained the issue perfectly in his comments to the main answer. I just want to add my 2 cents and try to explain the same.
Let's say that when you read the XML files [90K files], Spark reads them into N partitions. N is decided by a number of factors such as spark.sql.files.maxPartitionBytes, file format, compression type, etc.
Let's assume it to be 10K partitions. This is happening in the below part of the code.
df = spark.read.format("xml") \
.options(rowTag="DeviceData") \
.load(file_path, schema=meter_data) \
Assuming you are using num_partitions = 100, you are adding a new column called partition with values 0-99. Spark is just adding a new column to the existing dataframe [or rdd] which is split across the 10K partitions.
.withColumn("partition", hash(col("_DeviceName")).cast("Long") % num_partitions) \
Till this point, both the codes are the same.
Now, let's compare what happens with repartition vs. partitionBy.
Case 1: repartition
.repartition("partition") \
.write.format("json") \
Here, you are repartitioning the existing dataframe based on the column "partition" which has 100 distinct values. So the existing dataframe will incur a full shuffle bringing down the number of partitions from 10K to 100. This stage will be compute-heavy since a full shuffle is involved. This could also fail if the size of one particular partition is really huge [skewed partition].
But the advantage here is that in the next stage where write happens, Spark has to write only 100 files to the output_path. Each file will only have data corresponding to only one value of column "partition"
Case 2: partitionBy
.write.format("json") \
.partitionBy("partition") \
Here, you are asking spark to write the existing dataframe into output_path partitioned by the distinct values of the column "partition". You are nowhere asking spark to reduce the existing partition count of the dataframe.
So spark will create new folders inside the output_path
and write data corresponding to each partitions inside it.
output_path + "/partition=0/"
output_path + "/partition=1/"
...
output_path + "/partition=99/"
Now, since you have 10K Spark partitions in the existing DataFrame, and assuming the worst case where each of these 10K partitions contains all the distinct values of the column "partition", Spark will have to write 10K * 100 = 1M files.
That is, some part of each of the 10K partitions will be written to each of the 100 folders created for the column "partition". This way Spark writes 1M files to the output_path, inside the sub-directories. The advantage is that this method skips the full shuffle.
Compared to the in-memory, compute-intensive shuffle in Case 1, this will be much slower, since Spark has to create 1M files and write them to persistent storage,
and it writes them first to a temporary folder and then to the output_path.
It will be slower still if the write goes to an object store like AWS S3 or Google Cloud Storage.
Case 3: coalesce + partitionBy
.coalesce(num_partitions) \
.write.format("json") \
.partitionBy("partition") \
In this case, you reduce the number of Spark partitions from 10K to 100 with coalesce() and write to output_path partitioned by the column "partition".
So, assuming the worst case where each of these 100 partitions contains all the distinct values of the column "partition", Spark will have to write 100 * 100 = 10K files.
This will still be faster than Case 2, but slower than Case 1.
This is because coalesce() only merges existing partitions without a full shuffle, so you still end up writing 10K files to output_path.
Case 4: repartition + partitionBy
.repartition("partition") \
.write.format("json") \
.partitionBy("partition") \
In this case, you reduce the number of Spark partitions from 10K to 100 [the distinct values of the column "partition"] with repartition() and write to output_path partitioned by the column "partition".
Since each of these 100 partitions has only one distinct value of the column "partition", Spark will have to write 100 * 1 = 100 files. Each sub-folder created by partitionBy() will have only 1 file inside it.
This will take about the same time as Case 1, since both cases involve a full shuffle followed by writing 100 files. The only difference is that the 100 files will be inside sub-folders under the output_path.
This layout is useful for partition pruning (filters on the column "partition") when reading the output_path via Spark or Hive.
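The worst-case file counts of the four cases can be put side by side in a short Python sketch. It only restates the arithmetic of this answer (10K input partitions, 100 distinct values of "partition"); the function name worst_case_files is made up, and the figure for Case 1 follows this answer's assumption that the shuffle leaves 100 non-empty partitions.

```python
def worst_case_files(spark_partitions: int, distinct_values: int,
                     keyed_shuffle: bool) -> int:
    """Worst-case number of files written for a partitioned output."""
    if keyed_shuffle:
        # After repartition("partition") each value lives in exactly one
        # Spark partition, so there is one file per distinct value.
        return distinct_values
    # Otherwise every Spark partition may contain every value.
    return spark_partitions * distinct_values


INPUT_PARTITIONS = 10_000
DISTINCT_VALUES = 100

cases = {
    "Case 1: repartition": worst_case_files(
        INPUT_PARTITIONS, DISTINCT_VALUES, keyed_shuffle=True),
    "Case 2: partitionBy": worst_case_files(
        INPUT_PARTITIONS, DISTINCT_VALUES, keyed_shuffle=False),
    "Case 3: coalesce(100) + partitionBy": worst_case_files(
        100, DISTINCT_VALUES, keyed_shuffle=False),
    "Case 4: repartition + partitionBy": worst_case_files(
        INPUT_PARTITIONS, DISTINCT_VALUES, keyed_shuffle=True),
}

for name, files in cases.items():
    print(f"{name:<38} {files:>9,} files")
```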
Conclusion:
Even though partitionBy avoids the shuffle that repartition incurs, using partitionBy alone might end up more costly, depending on the number of DataFrame partitions and the distribution of data inside those partitions.

PySpark structured streaming apply udf to window

I am trying to apply a pandas udf to a window of a pyspark structured stream. The problem is that as soon as the stream has caught up with the current state all new windows only contain a single value somehow.
As you can see in the screenshot all windows after 2019-10-22T15:34:08.730+0000 only contain a single value. The code used to generate this is this:
import pandas as pd
from pyspark.sql.functions import col, pandas_udf, PandasUDFType, window
@pandas_udf("Count long, Resampled long, Start timestamp, End timestamp", PandasUDFType.GROUPED_MAP)
def myudf(df):
    df = df.dropna()
    df = df.set_index("Timestamp")
    df.sort_index(inplace=True)
    # resample the dataframe
    resampled = pd.DataFrame()
    oidx = df.index
    nidx = pd.date_range(oidx.min(), oidx.max(), freq="30S")
    resampled["Value"] = df.Value.reindex(oidx.union(nidx)).interpolate('index').reindex(nidx)
    return pd.DataFrame([[len(df.index), len(resampled.index), df.index.min(), df.index.max()]], columns=["Count", "Resampled", "Start", "End"])
predictionStream = sensorStream.withWatermark("Timestamp", "90 minutes").groupBy(col("Name"), window(col("Timestamp"), "70 minutes", "5 minutes"))
predictionStream.apply(myudf).writeStream \
    .queryName("aggregates") \
    .format("memory") \
    .start()
The stream does get new values every 5 minutes. It's just that the window somehow only takes values from the last batch, even though the watermark should not have expired.
Is there anything I am doing wrong? I already tried playing with the watermark; that had no effect on the result. I need all values of the window inside the UDF.
I am running this in databricks on a cluster set to 5.5 LTS ML (includes Apache Spark 2.4.3, Scala 2.11)
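For reference, a sliding window of "70 minutes" with a slide of "5 minutes" puts every event into 14 overlapping windows, so each window is expected to collect values from many triggers. The sketch below is plain Python, not Spark code: the helper name windows_for is made up for illustration, and it assumes windows are aligned to multiples of the slide counted from the Unix epoch with no start offset.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def windows_for(ts, window=timedelta(minutes=70), slide=timedelta(minutes=5)):
    """Return the [start, end) windows that contain ts, oldest first."""
    # Latest window start at or before ts, aligned to the slide.
    start = ts - ((ts - EPOCH) % slide)
    result = []
    # Walk backwards while the window still covers ts.
    while start + window > ts:
        result.append((start, start + window))
        start -= slide
    return sorted(result)


event = datetime(2019, 10, 22, 15, 34, 8, tzinfo=timezone.utc)
windows = windows_for(event)
print(len(windows), "windows contain", event.isoformat())
for start, end in windows:
    print(start.time(), "->", end.time())
```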
It looks like you could specify the output mode you want for your writeStream.
See documentation at Output Modes
By default it's using Append Mode:
This is the default mode, where only the new rows added to the Result Table since the last trigger will be outputted to the sink.
Try using:
predictionStream.apply(myudf).writeStream \
.queryName("aggregates") \
.format("memory") \
.outputMode("complete") \
.start()
I found a Spark JIRA issue concerning this problem but it was closed without resolution. The bug appears to be, and I confirmed this independently on Spark version 3.1.1, that the Pandas UDF is executed on every trigger only with the data since the last trigger. So you are likely only processing a subset of the data you want to take into account on each trigger. Grouped Map Pandas UDFs do not appear to be functional for structured streaming with a delta table source. Please do follow up if you previously found a solution, otherwise I’ll just leave this here for folks that also find this thread.
Edit: There's some discussion in the Databricks forums about first doing a streaming aggregation and following that up with a Pandas UDF (that will likely expect a single record with columns containing arrays) as shown below. I tried it. It works. However, my batch duration is high and I'm uncertain how much this additional work is contributing to it.
agg_exprs = [f.collect_list('col_of_interest_1'),
             f.collect_list('col_of_interest_2'),
             f.collect_list('col_of_interest_3')]
# agg() takes columns as separate arguments, so unpack the list
intermediate_sdf = source_sdf.groupBy('time_window', ...).agg(*agg_exprs)
final_sdf = intermediate_sdf.groupBy('time_window', ...).applyInPandas(func, schema)

SparkSQL queries execution is slower than my database

Greetings,
I have created a Spark 2.1.1 cluster on Amazon EC2 with instance type m4.large, with 1 master and 5 slaves to start. My PostgreSQL 9.5 database (t2.large) has a table of over 2 billion rows and 7 columns that I would like to process. I have followed the directions from the Apache Spark website and various other sources on how to connect to and process this data.
My problem is that Spark SQL performance is way slower than my database. My SQL statement (see the code below) takes about 21 minutes in PSQL, but Spark SQL takes about 42 minutes to finish. My main goal is to measure the performance of PSQL vs. Spark SQL, and so far I am not getting the desired results. I would appreciate the help.
Thank you
I have tried increasing fetchSize from 10000 to 100000, caching the DataFrame, increasing numPartitions to 100, setting spark.sql.shuffle to 2000, doubling my cluster size, and using a larger instance type, and so far I have not seen any improvement.
val spark = SparkSession.builder()
.appName("Spark SQL")
.getOrCreate();
val jdbcDF = spark.read.format("jdbc")
.option("url", DBI_URL)
.option("driver", "org.postgresql.Driver")
.option("dbtable", "ghcn_all")
.option("fetchsize", 10000)
.load()
.createOrReplaceTempView("ghcn_all");
val sqlStatement = """SELECT ghcn_date, element_value/10.0
FROM ghcn_all
WHERE station_id = 'USW00094846'
AND (ghcn_date >= '2015-01-01' AND ghcn_date <= '2015-12-31')
AND qflag IS NULL
AND element_type = 'PRCP'
ORDER BY ghcn_date""";
val sqlDF = spark.sql(sqlStatement);
var start:Long = System.nanoTime;
val num_rows:Long = sqlDF.count();
var end:Long = System.nanoTime;
println("Total Row : " + num_rows);
println("Total Collect Time Lapse : " + ((end - start) / 1000000) + " ms");
There is no good reason for this code to ever run faster on Spark than on the database alone. First of all, it is not even distributed, as you made the same mistake as many before you and didn't partition the data.
But more important is that you actually load the data from the database. As a result, the database has to do at least as much work (and in practice more), then send the data over the network, and then the data has to be parsed and processed by Spark. You basically do way more work and expect things to be faster. That's not going to happen.
If you want to reliably improve performance on Spark you should at least:
Extract data from the database.
Write to efficient (like not S3) distributed storage.
Use proper bucketing and partitioning to enable partition pruning and predicate pushdown.
Then you might have better luck. But again, proper indexing of your data on the cluster should improve performance as well, likely at a lower overall cost.
It is very important to set partitionColumn when you read from SQL. It is used to parallelize the query, so you should decide which column to use as your partitionColumn.
In your case for example:
val jdbcDF = spark.read.format("jdbc")
.option("url", DBI_URL)
.option("driver", "org.postgresql.Driver")
.option("dbtable", "ghcn_all")
.option("fetchsize", 10000)
.option("partitionColumn", "ghcn_date")
.option("lowerBound", "2015-01-01")
.option("upperBound", "2015-12-31")
.option("numPartitions",16 )
.load()
.createOrReplaceTempView("ghcn_all");
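To see what these four options do, here is a rough Python sketch of how partitionColumn, lowerBound, upperBound and numPartitions turn into one WHERE clause per partition. It is a simplified approximation of Spark's behaviour, not its actual code. It uses integer bounds on a hypothetical numeric column day_index (days since 2015-01-01) because, as far as I know, date and timestamp partition columns are only accepted from Spark 2.4 onward, while the question uses 2.1.1. The function name jdbc_partition_predicates is made up for illustration.

```python
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    """Approximate the per-partition WHERE clauses of a partitioned JDBC read.

    Assumes non-negative integer bounds and at least two partitions.
    """
    if num_partitions < 2:
        raise ValueError("this sketch only covers num_partitions >= 2")
    # The bounds only determine the stride; they do not filter rows.
    stride = upper // num_partitions - lower // num_partitions
    predicates = []
    current = lower
    for i in range(num_partitions):
        lower_clause = f"{column} >= {current}" if i != 0 else None
        current += stride
        is_last = i == num_partitions - 1
        upper_clause = f"{column} < {current}" if not is_last else None
        if upper_clause is None:
            # Last partition is open-ended above.
            predicates.append(lower_clause)
        elif lower_clause is None:
            # First partition is open-ended below and also picks up NULLs.
            predicates.append(f"{upper_clause} OR {column} IS NULL")
        else:
            predicates.append(f"{lower_clause} AND {upper_clause}")
    return predicates


# day_index is a hypothetical numeric column: days since 2015-01-01.
predicates = jdbc_partition_predicates("day_index", 0, 365, 16)
for p in predicates:
    print(p)
```

Because the first and last clauses are open-ended, rows outside the bounds are still read. With a table that spans many years, bounds covering only 2015 would therefore push most rows into the first and last partitions.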
More references:
How Apache Spark Makes Your Slow MySQL Queries 10x Faster (or More)
Tips for using JDBC in Apache Spark SQL
