Is my use case for GCP Dataproc feasible? - apache-spark

Not sure if there is a place (or people) to ask for one-on-one advice on Dataproc setup and tuning, but I figure here is as good a place as any to find some help.
Our team has been primarily using BigQuery to do our data analysis on location-driven data. We're carrying data back to 2019, so we have a lot of it. We've added some clustering (we've always had date partitioning) to help keep cost down, but it's getting to the point where it's just not feasible. At the moment we have upwards of 200 TB of data, and daily raw data ranges from 3-8 TB (it gets reduced quite a bit after a few steps).
First we'd like to move our 200 TB of data to GCS and segment it at a more granular level. The schema for this data is:
uid -- STRING
timestamp_of_observation -- TIMESTAMP,
lat -- FLOAT,
lon -- FLOAT,
datasource -- STRING,
cbg (short for census_block_group) -- STRING
We would like to save the data to GCS using hive partitioning so that our bucket folder structure looks like
year > month > day > cbg
Knowing we are processing about 200 TB and 3 years of data, and that CBGs alone have about 200,000 possible values, is this feasible?
We have a few other options, using either census block tracts (84,414 subfolders) or counties (35,000); the more granularity for us the better.
On my first attempts I either get an OOM or stages that just run forever. My initial PySpark code looks like the following:
from pyspark.sql import SparkSession
from pyspark.sql.functions import year, month, dayofmonth, rand, col

spark = SparkSession.builder \
    .appName('BigNumeric') \
    .getOrCreate()

# Spread the shuffle across many tasks for the large backfill.
spark.conf.set("spark.sql.shuffle.partitions", 365 * 100)

# Read the raw table via the spark-bigquery connector
# (use the connector jar that matches the cluster's Scala version).
df = spark.read \
    .format("bigquery") \
    .load("data-location-338617.Raw_Location_Data.data_cbg_enrich_proto")

# Derive the partition columns and add a random salt to fight skew.
df1 = df.withColumn("year", year(col("visit_timestamp"))) \
    .withColumn("month", month(col("visit_timestamp"))) \
    .withColumn("day", dayofmonth(col("visit_timestamp"))) \
    .withColumn("cbg", col("boundary_partition")) \
    .withColumn("salt", rand())

# Shuffle into many partitions, then write hive-partitioned parquet to GCS.
df1.repartition(365 * 100, "salt", "year", "month", "day") \
    .drop("salt") \
    .write.mode("overwrite") \
    .format("parquet") \
    .partitionBy("year", "month", "day", "cbg") \
    .save("gs://initial_test/cbg_data/")
This code was given to me by a fellow engineer; he told me to add the salt for skew and to increase my shuffle partitions.
Any and all advice would be helpful. The goal here is to do one huge batch job to migrate our data to GCS and then start saving our transformed daily raw data to GCS instead of BigQuery.
I would envision the number of files to be written to be roughly 1,080 days * 200,000 CBGs (about 216,000,000), which seems like a lot. Is there a better way to organize this? Our original purpose was to make this data MUCH cheaper to query downstream. Right now the date partition has been the best way to minimize cost; we have clustering on the CBG column but it doesn't seem to drive cost down very much. My thought is that with the GCS hive structure, CBG (or another spatial grouping) would essentially become a true partition and not just a cluster.
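To illustrate the trade-off described above (a sketch I'm adding, not a recommendation from the original post): a coarser layout keeps only the date as a physical partition and clusters rows by cbg inside each file, so the file count stays near the number of date partitions times the write parallelism instead of being multiplied by 200,000 CBG values. Column names follow the question; cbg_bucket, files_per_day and the output path are hypothetical, and maxRecordsPerFile-style tuning is left out to keep the sketch short.

from pyspark.sql import SparkSession
from pyspark.sql.functions import year, month, dayofmonth, col, abs as abs_, hash as hash_

spark = SparkSession.builder.appName("cbg-layout-sketch").getOrCreate()

df = spark.read.format("bigquery") \
    .load("data-location-338617.Raw_Location_Data.data_cbg_enrich_proto")

files_per_day = 256   # illustrative parallelism per date, not a tuned value

# Keep only the date as a physical partition; bucket cbg into a bounded number of
# groups per day and sort by cbg inside each file, so parquet min/max statistics
# can still prune on cbg without creating ~200,000 directories per day.
(df.withColumn("year", year(col("visit_timestamp")))
   .withColumn("month", month(col("visit_timestamp")))
   .withColumn("day", dayofmonth(col("visit_timestamp")))
   .withColumn("cbg", col("boundary_partition"))
   .withColumn("cbg_bucket", abs_(hash_(col("cbg"))) % files_per_day)
   .repartition("year", "month", "day", "cbg_bucket")
   .sortWithinPartitions("cbg")
   .drop("cbg_bucket")
   .write.mode("overwrite")
   .format("parquet")
   .partitionBy("year", "month", "day")
   .save("gs://initial_test/cbg_data_by_date/"))   # hypothetical output path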
Lastly I"m not doing much to the cluster configuration, I've played around with number of worker nodes and machines but haven't truly gotten anything to work again any help is appreciated and thank you for looking!
This is the cluster setup CLI code
gcloud dataproc clusters create cluster-f35f --autoscaling-policy location_data --enable-component-gateway --bucket cbg-test-patino --region us-central1 --zone us-central1-f --master-machine-type n1-standard-8 --master-boot-disk-type pd-ssd --master-boot-disk-size 500 --num-workers 30 --worker-machine-type n2-standard-16 --worker-boot-disk-type pd-ssd --worker-boot-disk-size 1000 --image-version 2.0-debian10 --optional-components JUPYTER --project data-*********** --initialization-actions gs://goog-dataproc-initialization-actions-us-central1/connectors/connectors.sh --metadata bigquery-connector-version=1.2.0 --metadata spark-bigquery-connector-version=0.21.0
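As a companion to the cluster command above (a sketch I'm adding, not something from the original post): per-job Spark properties can also be set when building the SparkSession; the keys below are standard Spark configuration properties, but the values are placeholders rather than tuned recommendations.

from pyspark.sql import SparkSession

# Illustrative per-job settings; real values depend on the chosen machine types.
spark = (SparkSession.builder
    .appName("cbg-migration")
    .config("spark.executor.memory", "24g")            # placeholder value
    .config("spark.executor.cores", "4")               # placeholder value
    .config("spark.sql.shuffle.partitions", "20000")   # placeholder value
    .config("spark.sql.adaptive.enabled", "true")      # let AQE coalesce/split shuffle partitions
    .getOrCreate())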

Related

Slow performance writing from pyspark dataframe to Azure Synapse pool

I am writing data from a Spark dataframe in an Azure Databricks notebook into a dedicated Synapse pool. The problem is that this takes an extremely long time given the small size of the data involved.
Read performance is fine: this syntax will happily read 100,000 rows in a couple of seconds. However, it takes about 25 minutes to write a similar number of rows (of 3 columns).
Are there any options I should be adding to improve write performance? Or is there a faster way of completing the same task?
(df.write
.format("jdbc")
.option("url", f"jdbc:sqlserver://<blahblah>.sql.azuresynapse.net:1433;database=<blahblah>;user=<blah>#<blahblah>;password={password};encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.sql.azuresynapse.net;loginTimeout=30;")
.option("dbtable", "dbo.newtablename")
.option("user", <username>)
.option("password", <password>)
.option("createTableColumnTypes", "column1 VARCHAR(36), column2 VARCHAR(1), column3 VARCHAR(1)")
.save()
)
I added the createTableColumnTypes option as the default column type did not permit a columnstore index to be created.
Other options I have reviewed in the documentation (https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html) do not seem relevant.
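Not part of the original question, but a minimal sketch of the documented generic JDBC write-tuning knobs (batchsize plus an explicit repartition to control the number of concurrent connections), assuming the plain JDBC path is kept; jdbc_url, username and password are placeholders.

# Illustrative only: batchsize controls rows per JDBC batch insert, and the
# repartition controls how many connections write in parallel.
(df.repartition(8)                          # placeholder degree of write parallelism
   .write
   .format("jdbc")
   .option("url", jdbc_url)                 # placeholder Synapse JDBC URL
   .option("dbtable", "dbo.newtablename")
   .option("user", username)
   .option("password", password)
   .option("batchsize", 10000)              # default is 1000; larger batches mean fewer round trips
   .option("createTableColumnTypes", "column1 VARCHAR(36), column2 VARCHAR(1), column3 VARCHAR(1)")
   .save())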

Why is repartition faster than partitionBy in Spark?

I am attempting to use Spark for a very simple use case: given a large set of files (90k) with device time-series data for millions of devices, group all of the time-series reads for a given device into a single set of files (partition). For now let's say we are targeting 100 partitions, and it is not critical that a given device's data shows up in the same output file, just the same partition.
Given this problem we’ve come up with two ways to do this - repartition then write or write with partitionBy applied to the Writer. The code for either of these is very simple:
repartition (hash column is added to ensure that comparison to partitionBy code below is one-to-one):
df = spark.read.format("xml") \
.options(rowTag="DeviceData") \
.load(file_path, schema=meter_data) \
.withColumn("partition", hash(col("_DeviceName")).cast("Long") % num_partitions) \
.repartition("partition") \
.write.format("json") \
.option("codec", "org.apache.hadoop.io.compress.GzipCodec") \
.mode("overwrite") \
.save(output_path)
partitionBy:
df = spark.read.format("xml") \
.options(rowTag="DeviceData") \
.load(file_path, schema=meter_data) \
.withColumn("partition", hash(col("_DeviceName")).cast("Long") % num_partitions) \
.write.format("json") \
.partitionBy("partition") \
.option("codec", "org.apache.hadoop.io.compress.GzipCodec") \
.mode("overwrite") \
.save(output_path)
In our testing repartition is 10x faster than partitionBy. Why is this?
Based on my understanding, repartition will incur a shuffle, which my Spark learnings have told me to avoid whenever possible. On the other hand, partitionBy (based on my understanding) only incurs a sort operation local to each node; no shuffle is needed. Am I misunderstanding something that is causing me to think partitionBy would be faster?
TLDR: Spark triggers a sort when you call partitionBy, and not a hash re-partitioning. This is why it is much slower in your case.
We can check that with a toy example:
spark.range(1000).withColumn("partition", 'id % 100)
.repartition('partition).write.csv("/tmp/test.csv")
In the Spark UI DAG for this job, don't pay attention to the grey stage; it is skipped because it was computed in a previous job.
Then, with partitionBy:
spark.range(1000).withColumn("partition", 'id % 100)
.write.partitionBy("partition").csv("/tmp/test2.csv")
You can check that even if you add a repartition before the partitionBy, the sort will still be there. So what's happening? Notice that the sort in the second DAG does not trigger a shuffle; it is a per-partition (map-side) sort. In fact, when you call partitionBy, Spark does not shuffle the data as one would expect at first. Spark sorts each partition individually, and then each executor writes its data into the corresponding partition directory, in a separate file. Therefore, note that with partitionBy you are not writing num_partitions files but something between num_partitions and num_partitions * num_executors files: each partition value gets one file per executor that holds data belonging to it.
I think @Oli has explained the issue perfectly in his comments on the main answer. I just want to add my 2 cents and try to explain the same.
Let's say that when you are reading the XML files [90K files], Spark reads them into N partitions. This is decided based on a number of factors like spark.sql.files.maxPartitionBytes, file format, compression type, etc.
Let's assume it to be 10K partitions. This is happening in the below part of the code.
df = spark.read.format("xml") \
.options(rowTag="DeviceData") \
.load(file_path, schema=meter_data) \
Assuming you are using num_partitions = 100, you are adding a new column called partition with values 0-99. Spark is just adding a new column to the existing dataframe [or rdd] which is split across the 10K partitions.
.withColumn("partition", hash(col("_DeviceName")).cast("Long") % num_partitions) \
Up to this point, both versions of the code are the same.
Now, let's compare what happens with repartition vs. partitionBy.
Case 1: repartition
.repartition("partition") \
.write.format("json") \
Here, you are repartitioning the existing dataframe based on the column "partition" which has 100 distinct values. So the existing dataframe will incur a full shuffle bringing down the number of partitions from 10K to 100. This stage will be compute-heavy since a full shuffle is involved. This could also fail if the size of one particular partition is really huge [skewed partition].
But the advantage here is that in the next stage, where the write happens, Spark has to write only 100 files to the output_path. Each file will contain data for only one value of the column "partition".
Case 2: partitionBy
.write.format("json") \
.partitionBy("partition") \
Here, you are asking spark to write the existing dataframe into output_path partitioned by the distinct values of the column "partition". You are nowhere asking spark to reduce the existing partition count of the dataframe.
So Spark will create new folders inside the output_path and write the data corresponding to each partition inside them:
output_path + "/partition=0/"
output_path + "/partition=1/"
...
output_path + "/partition=99/"
Now, since you have 10K Spark partitions in the existing dataframe, and assuming the worst case where each of these 10K partitions contains all the distinct values of the column "partition", Spark will have to write 10K * 100 = 1M files.
That is, part of every one of the 10K partitions gets written to each of the 100 folders created for the column "partition". This way Spark ends up writing 1M files to the output_path, spread across the sub-directories it creates. The advantage is that we skip a full shuffle with this method.
Compared to the in-memory, compute-intensive shuffle in Case 1, this will be much slower, since Spark has to create 1M files and write them to persistent storage.
On top of that, the files are written first to a temporary folder and then moved to the output_path.
This is even slower when the write goes to an object store like AWS S3 or Google Cloud Storage.
Case 3: coalesce + partitionBy
.coalesce(num_partitions) \
.write.format("json") \
.partitionBy("partition") \
In this case, you will be reducing the number of spark partitions from 10K to 100 with coalesce() and writing it to output_path partitioned by column "partition".
So, assuming the worst case where each of these 100 partitions has all the distinct values of the column "partition", Spark will have to write 100 * 100 = 10K files.
This will still be faster than Case 2, but slower than Case 1.
This is because coalesce() only merges existing partitions without a full shuffle, but you still end up writing 10K files to the output_path.
Case 4: repartition+ partitionBy
.repartition("partition") \
.write.format("json") \
.partitionBy("partition") \
In this case, you will be reducing the number of spark partitions from 10K to 100 [distinct values of column "partition"] with repartition() and writing it to output_path partitioned by column "partition".
So, since each of these 100 partitions holds only one distinct value of the column "partition", Spark will have to write 100 * 1 = 100 files. Each sub-folder created by partitionBy() will have only 1 file inside it.
This will take the same time as Case 1 since both the cases involve a full-shuffle and then writing 100 files. The only difference here will be that 100 files will be inside sub-folders under the output_path.
This layout is also useful for partition pruning (filter push-down on the partition column) when reading the output_path via Spark or Hive.
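To make Case 4 concrete, here is a minimal PySpark sketch (not from the original answer) using the question's variable names (file_path, meter_data, output_path, num_partitions); repartitioning on the derived partition column before partitionBy means each partition-value folder ends up with a single file.

from pyspark.sql.functions import col, hash

num_partitions = 100

(spark.read.format("xml")
    .options(rowTag="DeviceData")
    .load(file_path, schema=meter_data)
    # bucket devices into num_partitions groups
    .withColumn("partition", hash(col("_DeviceName")).cast("long") % num_partitions)
    # full shuffle: all rows for a given bucket end up in one Spark partition
    .repartition("partition")
    .write.format("json")
    .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
    .mode("overwrite")
    # one sub-folder per bucket, and (thanks to the repartition) one file per folder
    .partitionBy("partition")
    .save(output_path))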
Conclusion:
Even though partitionBy avoids the full shuffle that repartition incurs, depending on the number of dataframe partitions and the distribution of data inside those partitions, using partitionBy alone might end up being far more costly because of the number of files it has to write.

delta lake in databricks - a consistent "view" of just the last half hour of a stream

I have a continuously updated table from Spark Structured Streaming (Kafka source),
written like this (in foreachBatch):
parsedDf \
.select("somefield", "anotherField",'partition', 'offset') \
.write \
.format("delta") \
.mode("append") \
.option("mergeSchema", "true") \
.save(f"/mnt/defaultDatalake/{append_table_name}")
I need to keep a fast view on this table for "items inserted in the last half hour"
How can this be achieved?
I can have a readStream from this table, but what I'm missing is how to keep just the "tail" of the stream there.
Databricks 7.5, Spark 3.
Given that Delta Lake does not have materialized views, and that Delta Lake time travel is not relevant because you want the most current data:
You can load the data and include a key that does not need to be looked up whilst inserting.
Pre-populate a time dimension for joining with your data. See it as a dimension with a grain of a minute.
Join the data with this dimension, relying on dynamic file pruning. That means querying on a rolling 30-minute window at minute granularity and setting those bounds in the query.
See https://databricks.com/blog/2020/04/30/faster-sql-queries-on-delta-lake-with-dynamic-file-pruning.html
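Not part of the original answer, but a minimal sketch of the simpler batch-read variant it builds on, assuming the appended records carry an event timestamp column (hypothetically named event_time here): filter the Delta table down to the last 30 minutes each time you query.

from pyspark.sql.functions import col, current_timestamp, expr

# append_table_name matches the path variable used in the question's write.
recent = (spark.read.format("delta")
    .load(f"/mnt/defaultDatalake/{append_table_name}")
    # keep only the last half hour; event_time is a hypothetical column name
    .where(col("event_time") >= current_timestamp() - expr("INTERVAL 30 MINUTES")))

recent.createOrReplaceTempView("last_half_hour")

How fast this stays on a large table depends on how well Delta can skip files for the timestamp predicate, which is exactly what the time-dimension join and dynamic file pruning suggestion above aims to improve.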

Spark Cluster configuration

I'm using a Spark cluster of two nodes, each having two executors (each using 2 cores and 6 GB memory).
Is this a good cluster configuration for a faster execution of my spark jobs?
I am kind of new to Spark, and I am running a job on 80 million rows of data which includes shuffle-heavy tasks like aggregations (count) and join operations (a self-join on a dataframe).
Bottlenecks:
I see "insufficient resources" reported for my executors while reading the data.
On a smaller dataset, it's taking a lot of time.
What should be my approach and how can I do away with my bottlenecks?
Any suggestion would be highly appreciated.
query= "(Select x,y,z from table) as df"
jdbcDF = spark.read.format("jdbc").option("url", mysqlUrl) \
.option("dbtable", query) \
.option("user", mysqldetails[2]) \
.option("password", mysqldetails[3]) \
.option("numPartitions", "1000")\
.load()
This gives me a dataframe on which jdbcDF.rdd.getNumPartitions() returns 1. Am I missing something here? I think I am not parallelizing my dataset.
There are different ways to improve the performance of your application. Please find below some points which may help.
Try to reduce the number of records and columns to process. As you have mentioned that you are new to Spark, you might not need all 80 million rows, so you can filter the rows down to whatever you require. Also, select only the columns you need rather than all of them.
If you are using some data frequently then try considering caching the data, so that for the next operation it will be read from the memory.
If you are joining two DataFrames and if one of them is small enough to fit in memory then you can consider broadcast join.
Increasing the resources might not improve the performance of your application in all cases, but looking at your cluster configuration, it should help. It might be a good idea to throw some more resources at it and check the performance.
You can also try using the Spark UI to monitor your application and see if there are a few tasks taking much longer than others. In that case you probably need to deal with skewness in your data.
You can also consider partitioning your data based on the columns you use in your filter criteria.
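On the numPartitions question raised above: a minimal sketch based on the documented Spark JDBC options, showing that numPartitions only splits the read when partitionColumn, lowerBound and upperBound are also supplied (the column name and bounds here are placeholders).

jdbcDF = (spark.read.format("jdbc")
    .option("url", mysqlUrl)
    .option("dbtable", "(Select x, y, z from table) as df")
    .option("user", mysqldetails[2])
    .option("password", mysqldetails[3])
    # numPartitions alone does not parallelize the read; Spark also needs a numeric
    # partitionColumn plus lowerBound/upperBound to generate the range predicates.
    .option("partitionColumn", "x")          # placeholder: any numeric or date column
    .option("lowerBound", "1")               # placeholder bounds of that column
    .option("upperBound", "80000000")
    .option("numPartitions", "100")
    .load())

print(jdbcDF.rdd.getNumPartitions())         # should now be close to numPartitions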

Incremental Data loading and Querying in Pyspark without restarting Spark JOB

Hi all, I want to do incremental data querying.
df = spark .read.csv('csvFile', header=True) #1000 Rows
df.persist() #Assume it takes 5 min
df.registerTempTable('data_table') #or createOrReplaceTempView
result = spark.sql('select * from data_table where column1 > 10') #100 rows
df_incremental = spark.read.csv('incremental.csv') #200 Rows
df_combined = df.unionAll(df_incremental)
df_combined.persist() # This will take more than 5 mins; I want to avoid it because other queries might be running at that time
df_combined.registerTempTable("data_table")
result = spark.sql('select * from data_table where column1 > 10') # 105 Rows.
Read CSV/MySQL table data into a Spark dataframe.
Persist that dataframe in memory only (reason: I need performance and my dataset can fit in memory).
Register it as a temp table and run Spark SQL queries. (Up to this point my Spark job is up and running.)
The next day I will receive an incremental dataset (in a temp MySQL table or a CSV file). Now I want to run the same query on the total set, i.e. persisted_prevData + recent_read_IncrementalData. I will call it mixedDataset.
*** There is no certainty about when incremental data arrives; it can come 30 times a day.
Through all of this I don't want the Spark application to be down; it should always be up. And I need the performance of querying mixedDataset to be the same as if it were persisted.
My Concerns :
In point 4, do I need to unpersist the prev_data and persist the unioned dataframe of previous and incremental data again?
And my most important concern: I don't want to restart the Spark job to load/start with updated data (of course I have to restart if the server goes down).
So, at a high level, I need to query (with fast performance) dataset + incremental_data_if_any dynamically.
Currently I am doing this by creating a folder for all the data, with incremental files also placed in the same directory. Every 2-3 hours I restart the server, and my Spark app starts by reading all the CSV files present in that directory and then runs queries on them.
I am also exploring Hive persistent tables and Spark Streaming, and will update here if I find anything.
Please suggest me a way/architecture to achieve this.
Please comment if anything in the question is unclear, rather than downvoting it :)
Thanks.
Try streaming instead; it will be much faster since the session is already running, and it will be triggered every time you place something in the folder:
df_incremental = spark \
.readStream \
.option("sep", ",") \
.schema(input_schema) \
.csv(input_path)
df_incremental.where("column1 > 10") \
.writeStream \
.queryName("data_table") \
.format("memory") \
.start()
spark.sql("SELECT * FROM data_table).show()
