Pyspark Checkpointing in Dataproc (StackOverflowError) - apache-spark

I am facing a StackOverflowError when persisting a dataset with pyspark. I cast the whole dataframe to DoubleType and then persist it to calculate statistics, and I read that checkpointing is a solution to the stack overflow. However, I am having trouble implementing it in Dataproc.
I am working with pyspark, and when I checkpoint the dataframe and check it with df.isCheckpointed(), it returns False. However, when I debug it, df.rdd.is_checkpointed says True. Is there an issue with the package, or am I doing something wrong?
I thought localCheckpoint would be more appropriate for my purpose (https://spark.apache.org/docs/2.3.1/api/java/org/apache/spark/rdd/RDD.html#localCheckpoint()), as my problem was simply that the DAG depth was too deep, but I couldn't find any use case. Also, if I just checkpoint, the RDD says it is checkpointed (as in the first question), but if I try localCheckpoint, it says it is not. Has anyone tried this function?
After trying it in local standalone mode, I tried it on Dataproc. I tried both HDFS and Google Cloud Storage, but either way the storage was empty, yet the RDD says it is checkpointed.
Can anyone help me with this? Thanks in advance!

If you're using localCheckpoint, it will write to the local disk of executors, not to HDFS/GCS: https://spark.apache.org/docs/2.3.1/api/java/org/apache/spark/rdd/RDD.html#localCheckpoint--.
Also note that checkpointing has an eager mode (checkpoint immediately) and a non-eager mode (checkpoint only when the RDD is actually materialized), which may affect what those methods return. The code is often the best documentation: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1615.
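For illustration, a minimal sketch of the two modes in PySpark (the checkpoint directory is a placeholder path):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")  # placeholder path

df = spark.range(1000)

# Eager (the default): the checkpoint is materialized immediately.
eager_df = df.checkpoint()           # same as df.checkpoint(eager=True)

# Non-eager: nothing is written until an action materializes the result.
lazy_df = df.checkpoint(eager=False)
lazy_df.count()                      # the checkpoint is written here

# checkpoint() returns a new DataFrame, so inspecting the original df (or a
# freshly built df.rdd) can still report "not checkpointed" even though the
# returned DataFrame's lineage has been truncated.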
In general, please attach sample code (repros) to questions like this -- that way we can answer your question more directly.

Set a checkpoint directory via your SparkSession's SparkContext:
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")
dataframe_name = ...  # any Spark DataFrame of type <pyspark.sql.dataframe.DataFrame>
At this point, dataframe_name is just a DAG of transformations (nothing has been computed yet). You can materialize and store it as a checkpoint like this:
dataframe_checkpoint = dataframe_name.checkpoint()
dataframe_checkpoint is also a Spark dataframe of type <pyspark.sql.dataframe.DataFrame>, but instead of the DAG it stores the result of the query.
Use checkpoints if:
the computation takes a long time
the computing chain is too long
it depends on too many RDDs
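Putting the above together, a minimal end-to-end sketch (paths are placeholders; locally /tmp works, while on Dataproc you would point setCheckpointDir at an hdfs:// or gs:// location):
import os

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")  # placeholder path

# Build a DataFrame with some lineage, then checkpoint it (eager by default,
# so the checkpoint files are written immediately).
df = spark.range(10000).withColumn("doubled", col("id").cast("double") * 2)
df = df.checkpoint()

# With a reliable checkpoint the directory should no longer be empty: Spark
# creates a per-application subdirectory containing rdd-* folders. If it stays
# empty, check whether localCheckpoint (executor-local storage) was used instead.
print(os.listdir("/tmp/checkpoints"))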

Related

How to use Pyspark's csv reader on every element of Pyspark RDD? (without "reference SparkContext from a broadcast variable")

I want to use Pyspark to read in hundreds of csv files and create a single dataframe that is (roughly) the concatenation of all the csvs. Since each csv can fit in memory, but not more than one or two at a time, this seems like a good fit for Pyspark. My strategy is not working, and I think it is because I make a Pyspark dataframe in the kernel function of my map call, resulting in an error:
# initiate spark session and other variables
sc = SparkSession.builder.master("local").appName("Test").config(
    "spark.driver.bindAddress", "127.0.0.1").getOrCreate()
file_path_list = [path1, path2]  ## list of string path variables
# make an rdd object so i can use .map:
rdd = sc.sparkContext.parallelize(file_path_list)
# make a kernel function for my future .map() application
def kernel_f(path):
    df = sc.read.options(delimiter=",", header=True).csv(path)
    return df
# apply .map
rdd2 = rdd.map(kernel_f)
# see first dataframe (so excited)
rdd2.take(2)[0].show(3)
this throws an error:
PicklingError: Could not serialize object: RuntimeError: It appears
that you are attempting to reference SparkContext from a broadcast
variable, action, or transformation. SparkContext can only be used on
the driver, not in code that it run on workers. For more information,
see SPARK-5063.
My next step (supposing no error had appeared) was to use a reduce step to concatenate all the members (dataframes with the same schema) of that rdd2.
It seems related to this post but I don't understand the answer.
Questions:
I think this means that since my kernel_f calls sc. methods, it is against the rules. Is that right?
I (think I) could use plain-old Python (not pyspark) map to apply kernel_f to my file_path_list, then use plain-old functools.reduce to concatenate all these into a single pyspark dataframe, but that doesn't seem to leverage pyspark much. Does this seem like a good route?
Can you teach me a good, ideally a "tied-for-best" way to do this?
I don't have a definitive answer, just some comments that might help. First off, I think the easiest way to do this is to read the CSVs with a wildcard, as shown here
A Spark cluster is composed of the scheduler and the workers. You use the SparkSession to pass work to the scheduler. It seems workers are not allowed to send work to the scheduler, which makes sense since that would be an anti-pattern in a lot of use cases.
The design pattern is also odd here because you are not actually passing a DataFrame back. Spark operations are lazy, unlike Pandas, so that read does not happen immediately. I suspect that even if it worked, it would pass a DAG back, not data.
That doesn't sound good, because you want the loading of files to stay lazy. Given that you can't use Spark to read on a worker, you'd have to use Pandas/plain Python, which evaluate immediately, and you would run out of memory even sooner trying that.
Speaking of memory, Spark lets you perform out-of-memory computation, but there are limits to how much larger than the available memory the data can be. You will inevitably run into errors if you really don't have enough memory by a considerable margin.
I think you should use the wildcard as shown above.
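For concreteness, a minimal sketch of the wildcard read (the glob path and CSV options are placeholders for your layout; an explicit list of paths works as well):
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("Test").getOrCreate()

# A glob pattern (or a list of paths) lets the driver plan a single read;
# the actual loading stays lazy and is distributed across the workers.
df = (spark.read
      .options(delimiter=",", header=True)
      .csv("/data/csvs/*.csv"))

df.show(3)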

Dataframe Checkpoint Example Pyspark

I read about checkpoint and it looks great for my needs but I couldn't find a good example of how to use it.
My questions are:
Should I specify the checkpoint dir? Is it possible to do it like this:
df.checkpoint()
Are there any optional params that I should be aware about?
Is there a default checkpoint dir, or must I specify one?
When I checkpoint a dataframe and reuse it, does it automatically read the data from the dir where we wrote the files?
It will be great if you can share with me example of using checkpoint in pyspark with some explanation. Thanks!
You should assign the checkpointed dataframe to a variable as checkpoint "Returns a checkpointed version of this Dataset" (https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.checkpoint.html). So
df = df.checkpoint()
The only parameter is eager, which dictates whether you want the checkpoint to trigger an action and be saved immediately. It is True by default, and you usually want to keep it that way.
You have to set the checkpoint directory with SparkContext.setCheckpointDir(dirName) somewhere in your script before using checkpoints. Alternatively, if you want to save to memory instead, you can use localCheckpoint() instead of checkpoint(), but that is not reliable: in case of issues or after termination the checkpoints will be lost (it should be faster, though, as it uses the caching subsystem instead of only writing to disk).
And yes, it should be read automatically; you can look at the history server, where there should be "load data" nodes (I don't remember the exact name) at the start of blocks/queries.
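A short sketch contrasting the two options described above (the checkpoint directory is a placeholder):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Reliable checkpoint: needs a checkpoint directory set up front and writes there.
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")  # placeholder path
df = spark.range(100)
df = df.checkpoint()         # eager=True by default

# Local checkpoint: no directory needed and usually faster, but not fault
# tolerant -- if an executor is lost, its checkpointed partitions are lost too.
df2 = spark.range(100)
df2 = df2.localCheckpoint()  # also accepts eager=False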

Lazy loading of partitioned parquet in Apache Spark

As I understand it, Apache Spark uses lazy evaluation. So for example code like the following that consists only of transformations will do no actual processing:
val transformed_df = df.filter("some_field = 10").select("some_other_field", "yet_another_field")
Only when we do an "action" will any processing actually occur:
transformed_df.show()
I had been under the impression that load operations are also lazy in spark. (See How spark loads the data into memory.)
However, my experiences with spark have not borne this out. When I do something like the following,
val df = spark.read.parquet("/path/to/parquet/")
execution seems to depend greatly on the size of the data in the path. In other words, it's not strictly lazy. This is inconvenient if the data is partitioned and I only need to look at a fraction of the partitions.
For example:
df.filter("partitioned_field = 10").show()
If the data is partitioned in storage on "partitioned_field", I would have expected spark to wait until show() is called, and then read only data under "/path/to/parquet/partitioned_field=10/". But again, this doesn't seem to be the case. Spark appears to perform at least some operations on all of the data as soon as read or load is called.
I could get around this by only loading /path/to/parquet/partitioned_field=10/ in the first place, but this is much less elegant than just calling "read" and filtering on the partitioned field, and it's harder to generalize.
Is there a more elegant preferred way to lazily load partitions of parquet data?
(To clarify, I am using Spark 2.4.3)
I think I've stumbled on an answer to my question while learning about a key distinction that is often overlooked when talking about lazy evaluation in spark.
Data is lazily evaluated, but schemas are not. So if we are reading parquet, which is a structured data type, spark does have to at least determine the schema of any files it's reading as soon as read() or load() is called. So calling read() on a large number of files will take longer than on a small number of files.
Given that partitions are part of the schema, it's less surprising to me now that spark has to look at all of the files in the path to determine the schema before filtering on a partition field.
It would be convenient for my purposes if spark were to wait until schema evaluation was strictly necessary and was able to filter on partition fields prior to determining the rest of the schema, but it sounds like this is not the case. I believe Dataset objects always must have a schema, so I'm not sure there's a way around this problem without significant changes to Spark.
In conclusion, it seems like my only option currently is to pass in a list of paths for the partitions that I need rather than the base path if I want to avoid evaluating the schema over the entire data repository.
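For what it's worth, a sketch of that workaround in PySpark syntax (paths are placeholders; the basePath option should keep the partition column in the schema when specific partition directories are passed):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read only the partitions of interest; schema resolution is then limited to
# the files under those paths rather than the whole repository.
df = (spark.read
      .option("basePath", "/path/to/parquet/")
      .parquet("/path/to/parquet/partitioned_field=10/"))

df.show()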

Issues with long lineages (DAG) in Spark

We usually use Spark as a processing engine for data stored on S3 or HDFS. We use the Databricks and EMR platforms.
One of the issues I frequently face is that when the task size grows, job performance degrades severely. For example, say I read data from five tables, each with different levels of transformation (filtering, exploding, joins, etc.), union subsets of data from these transformations, then do further processing (e.g. remove some rows based on a criterion that requires windowing functions), then some other processing stages, and finally save the final output to a destination S3 path. If we run this job without staging, it takes a very long time. However, if we save (stage) temporary intermediate dataframes to S3 and use these saved (on S3) dataframes for the next steps of queries, the job finishes faster. Does anyone have similar experience? Is there a better way to handle this kind of long task lineage other than checkpointing?
What is even stranger is that for longer lineages Spark throws errors such as "column not found", while the same code works if intermediate results are temporarily staged.
Writing the intermediate data by saving the dataframe, or using a checkpoint, is the only way to fix it. You're probably running into an issue where the optimizer takes a really long time to generate the plan. The quickest/most efficient fix is to use localCheckpoint, which materializes a checkpoint locally.
val df = df.localCheckpoint()
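In PySpark, the same idea looks roughly like this (a sketch; the S3 staging path is a placeholder), either truncating the lineage in place or staging to storage and reading back:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000)  # stand-in for a DataFrame with a long lineage

# Option 1: truncate the lineage in place (fast, but not fault tolerant).
df = df.localCheckpoint()

# Option 2: stage the intermediate result and read it back, which also resets
# the optimizer's view of the plan (placeholder S3 path).
df.write.mode("overwrite").parquet("s3://my-bucket/tmp/stage1/")
df = spark.read.parquet("s3://my-bucket/tmp/stage1/")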

Why does persist(StorageLevel.MEMORY_AND_DISK) give different results than cache() with HBase?

I may sound naive asking this question, but it is a problem I recently faced in my project, and I need a better understanding of it.
df.persist(StorageLevel.MEMORY_AND_DISK)
Whenever we use such a persist on an HBase read, the same data is returned again and again for subsequent batches of the streaming job, even though HBase is updated on every batch run.
HBase Read Code:
val df = sqlContext.read.options(Map(HBaseTableCatalog.tableCatalog -> schema)).format(dbSetup.dbClass).load().persist(StorageLevel.MEMORY_AND_DISK)
I replaced persist(StorageLevel.MEMORY_AND_DISK) with cache() and it was returning updated records from HBase table as expected.
The reason we tried to use persist(StorageLevel.MEMORY_AND_DISK) is to ensure that the in-memory storage does not get full and we do not end up doing all transformations all over again during the execution of a particular stream.
Spark Version - 1.6.3
HBase Version - 1.1.2.2.6.4.42-1
Could someone explain this to me and help me get a better understanding?
Since you mentioned you are looking for the reason "why", I'm answering this because otherwise the question will remain unanswered; there's no rational reason these days to run Spark 1.6.3 just to see what happens with that specific HBase version.
Internally, Spark calls persist() when you use cache(), and it behaves differently on RDDs than on Datasets (or Dataframes).
On RDDs it uses MEMORY_ONLY, and on Datasets MEMORY_AND_DISK. I can't see (all of) what you've coded, but generally speaking you shouldn't see a difference between the two ways of caching and persisting; your issue is either a version incompatibility between them or simply a bug that wasn't fixed by Apache.
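To make that concrete, a hedged sketch of the call patterns (defaults as described above for current Spark; behavior on 1.6.3 may differ):
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# For DataFrames/Datasets, cache() is shorthand for persist() with the
# MEMORY_AND_DISK default, so these two requests end up with the same level.
df1 = spark.range(100).cache()
df2 = spark.range(100).persist(StorageLevel.MEMORY_AND_DISK)

# For raw RDDs, cache() defaults to MEMORY_ONLY instead; an explicit
# persist(MEMORY_AND_DISK) allows spilling to disk when memory fills up.
rdd1 = spark.sparkContext.parallelize(range(100)).cache()
rdd2 = spark.sparkContext.parallelize(range(100)).persist(StorageLevel.MEMORY_AND_DISK)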
There are several places to check to see what's wrong.
In this link https://spark.apache.org/releases/spark-release-1-6-3.html you can find that maintenance of the code happens in branch 1.6, so this is the place to find the code: https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/CacheManager.scala
Hope it helped.
