How to manage HDFS memory with Structured Streaming Checkpoints - apache-spark

I have a long-running Structured Streaming job which consumes several Kafka topics and aggregates over a sliding window. I need to understand how checkpoints are managed/cleaned up within HDFS.
Jobs run fine and I am able to recover from a failed step with no data loss; however, I can see the HDFS utilisation increasing day by day. I cannot find any documentation on how Spark manages/cleans up the checkpoints. Previously the checkpoints were stored on S3, but that turned out to be quite costly due to the large number of small files being read/written.
query = formatted_stream.writeStream \
    .format("kafka") \
    .outputMode(output_mode) \
    .option("kafka.bootstrap.servers", bootstrap_servers) \
    .option("checkpointLocation", "hdfs:///path_to_checkpoints") \
    .start()
From what I understand, the checkpoints should be cleaned up automatically, yet after several days I just see my HDFS utilisation increasing linearly. How can I ensure the checkpoints are managed so that HDFS does not run out of space?
The accepted answer to Spark Structured Streaming Checkpoint Cleanup says that Structured Streaming should deal with this issue, but not how, or how it can be configured.

As you can see in the code for Checkpoint.scala, the checkpointing mechanism persists the data for the last 10 checkpoints, but that should not be a problem over a couple of days.
A usual reason for this is that the RDDs you are persisting on disk are also growing linearly with time. This may be due to some RDDs being persisted that you don't actually care about.
You need to make sure that your use of Structured Streaming does not require persisting RDDs that grow without bound. For example, if you want to calculate a precise count of distinct elements over a column of a Dataset, you need to know the full input data (which means persisting data that increases linearly with time, if you have a constant influx of data per batch). If you can work with an approximate count instead, you can use algorithms such as HyperLogLog++, which typically require much less memory at the cost of some precision.
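As a hedged illustration of the approximate-count idea (the DataFrame events and column user_id are hypothetical), PySpark exposes HyperLogLog++ through approx_count_distinct:

from pyspark.sql import functions as F

# Bounded-memory sketch instead of an exact distinct count; rsd is the
# allowed relative standard deviation (here ~5%).
approx = events.agg(
    F.approx_count_distinct("user_id", rsd=0.05).alias("approx_users")
)
approx.show()

An exact countDistinct over an ever-growing input needs state that grows with the data, which is exactly the kind of pattern that inflates checkpoints.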
Keep in mind that if you are using Spark SQL, you may want to further inspect what your optimized queries turn into, since this is related to how Catalyst optimizes your query. If you are not, Catalyst might have been able to optimize your query for you if you did.
In any case, a further thought: if the checkpoint usage is increasing over time, this should be reflected in your streaming job also consuming more RAM linearly over time, since the checkpoint is essentially a serialization of the Spark context (plus constant-size metadata). If that is the case, check SO for related questions, such as "Why does memory usage of Spark Worker increase with time?".
Also, be mindful of which RDDs you call .persist() on (and with which storage level, so that you can spill RDDs to disk and only load parts of them into memory at a time).
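A minimal sketch of choosing an explicit storage level (the DataFrame df is hypothetical):

from pyspark import StorageLevel

# Spill partitions that do not fit in memory to disk instead of recomputing them;
# DISK_ONLY would keep the executor heap even smaller at the cost of extra I/O.
df.persist(StorageLevel.MEMORY_AND_DISK)
# ... use df ...
df.unpersist()  # release the cached partitions once they are no longer needed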

Related

Spark SQL data storage life cycle

I recently had an issue with one of my Spark jobs, where I was reading a Hive table with several billion records, which resulted in job failure due to high disk utilisation. After adding an AWS EBS volume, the job ran without any issues. Although that resolved the problem, I have a few doubts; I tried doing some research but couldn't find any clear answers. So my questions are:
When Spark SQL reads a Hive table, where is the data stored for processing initially, and what is the entire life cycle of the data in terms of its storage, if I didn't explicitly specify anything? And how does adding EBS volumes solve the issue?
Spark will read the data; if it does not fit in memory, it will spill it to disk.
A few things to note:
Data in memory is compressed; from what I read, you gain about 20% (e.g. a 100 MB file will take only 80 MB of memory).
Ingestion starts as soon as you call read(); it is not part of the DAG, and you can limit how much you ingest in the SQL query itself. The read operation is done by the executors. This example should give you a hint: https://github.com/jgperrin/net.jgp.books.spark.ch08/blob/master/src/main/java/net/jgp/books/spark/ch08/lab300_advanced_queries/MySQLWithWhereClauseToDatasetApp.java
In the latest versions of Spark, you can push down the filter (for example, if you filter right after the ingestion, Spark will know and optimize the ingestion); I think this works only for CSV, Avro, and Parquet. For databases (including Hive), the previous example is what I'd recommend.
Storage MUST be visible/accessible from the executors, so if you have EBS volumes, make sure they are accessible from the cluster nodes where the executors/workers are running, not just the node where the driver is running.
Initially the data is at the table location in HDFS/S3/etc. Spark spills data to local storage if it does not fit in memory.
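As a hedged sketch of limiting ingestion in the query itself (the table my_db.big_table and the partition column dt are hypothetical), you can push the predicate into the SQL you hand to Spark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Only the partitions/rows matching the predicate are read, assuming dt is a
# partition column of the Hive table.
df = spark.sql("""
    SELECT id, amount
    FROM my_db.big_table
    WHERE dt >= '2024-01-01'
""")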
Read Apache Spark FAQ
Does my data need to fit in memory to use Spark?
No. Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level.
Whenever Spark reads data from Hive tables, it stores it in an RDD. One point I want to make clear here: Hive is just a warehouse, a layer on top of HDFS; when Spark interacts with Hive, Hive provides Spark with the HDFS location where the data exists.
Thus, when Spark reads a file from HDFS, it creates a single partition for a single input split. The input split is set by Hadoop (by whatever InputFormat is used to read the file; e.g. if you use textFile() it would be TextInputFormat in Hadoop), which returns a single partition for a single HDFS block (note: the split between partitions is done on line boundaries, not the exact block boundary), unless you have a compressed file format like Avro/Parquet.
If you manually call rdd.repartition(x), it performs a shuffle of the data from the N partitions you have in the RDD to the x partitions you want; partitioning is done on a round-robin basis.
If you have a 10 GB uncompressed text file stored on HDFS, then with the default HDFS block size setting (256 MB) it would be stored in 40 blocks, which means that the RDD you read from this file would have 40 partitions. When you call repartition(1000), your RDD is only marked to be repartitioned; it is actually shuffled to 1000 partitions when you execute an action on top of this RDD (lazy execution).
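A rough sketch of the block-to-partition relationship and the lazy repartition described above (the HDFS path is hypothetical):

rdd = spark.sparkContext.textFile("hdfs:///data/big_file.txt")
print(rdd.getNumPartitions())          # roughly one partition per HDFS block

repartitioned = rdd.repartition(1000)  # only marks the RDD; no shuffle happens yet
print(repartitioned.count())           # the action triggers the actual shuffle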
Now it is all up to Spark how it processes the data. Since Spark does lazy evaluation, before doing the processing it prepares a DAG for optimal execution. One more point: Spark needs configuration for driver memory, number of cores, number of executors, etc., and if the configuration is inappropriate the job will fail.
Once it has prepared the DAG, it starts processing the data. It divides your job into stages and stages into tasks. Each task uses specific executors, shuffles and partitioning. So in your case, when you process billions of records, maybe your configuration was not adequate for the processing. One more point: when we say Spark loads the data into an RDD/DataFrame, it is managed by Spark; there are options to keep the data in memory, on disk, memory only, etc. (ref: storage_spark).
Briefly:
Hive --> HDFS --> Spark --> RDD (where the data is actually stored depends on the storage level and on lazy evaluation).
You may refer to the following link: Spark RDD - is partition(s) always in RAM?

Processing an entire SQL table via JDBC by streaming or batching on a constrained environment

I am trying to set up a pipeline for processing entire SQL tables one by one, with the initial ingestion happening through JDBC. I need to be able to use higher-level processing capabilities such as the ones available in Apache Spark or Flink, and would like to use any existing capabilities rather than having to write my own, although it could be an inevitability. I need to be able to execute this pipeline on a constrained setup (potentially a single laptop). Please note that I am not talking about capturing or ingesting CDC here; I just want to batch process an existing table in a way that would not OOM a single machine.
As a trivial example, I have a table in SQL Server that's 500 GB. I want to break it down into smaller chunks that would fit into the 16-32 GB of memory available in a reasonably modern laptop, apply a transformation function to each of the rows and then forward them into a sink.
Some of the available solutions that seem close to doing what I need:
Apache Spark partitioned reads:
spark.read.format("jdbc")
    .option("driver", driver)
    .option("url", url)
    .option("partitionColumn", id)
    .option("lowerBound", min)
    .option("upperBound", max)
    .option("numPartitions", 10)
    .option("fetchsize", 1000)
    .option("dbtable", query)
    .option("user", "username")
    .option("password", "password")
    .load()
It looks like I can even repartition the datasets further after the initial read.
Problem is, in local execution mode I expect the entire table to be partitioned across multiple CPU cores, which will all try to load their respective chunk into memory, OOMing the whole business.
Is there a way to throttle the reading jobs so that only as many execute as can fit in memory? Can I force jobs to run sequentially?
Could I perhaps partition the table into much smaller chunks, many more than there are cores, so that only a small amount is processed at once? Wouldn't that hamper everything with endless task-scheduling overhead?
If I wanted to write my own source for streaming into Spark, would that alleviate my memory woes? Does something like this help me?
Does Spark's memory management kick into play here at all? Why does it need to load the entire partition into memory at once during the read?
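(For reference, a hedged sketch of the throttling idea being asked about here; the connection details and table/column names are hypothetical. Capping the local cores limits how many partitions are read at once, and a small fetchsize bounds the rows buffered per JDBC round trip.)

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[2]")   # at most two partitions are read/processed concurrently
         .getOrCreate())

df = (spark.read.format("jdbc")
      .option("url", "jdbc:sqlserver://host;databaseName=db")  # hypothetical
      .option("dbtable", "dbo.big_table")                      # hypothetical
      .option("partitionColumn", "id")
      .option("lowerBound", "1")
      .option("upperBound", "100000000")
      .option("numPartitions", "1000")   # many small partitions
      .option("fetchsize", "1000")       # rows pulled per round trip
      .load())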
I looked at Apache Flink as an alternative as the streaming model is perhaps a little more appropriate here. Here's what it offers in terms of JDBC:
JDBCInputFormat.buildJDBCInputFormat()
.setDrivername("com.mysql.jdbc.Driver")
.setDBUrl("jdbc:mysql://localhost/log_db")
.setUsername("username")
.setPassword("password")
.setQuery("select id, something from SOMETHING")
.setRowTypeInfo(rowTypeInfo)
.finish()
However, it seems like this is also designed for batch processing and still attempts to load everything into memory.
How would I go about wrangling Flink to stream micro-batches of SQL data for processing?
Could I potentially write my own streaming source that wraps the JDBC input format?
Is it safe to assume that OOMs do not happen with Flink unless some state/accumulators become too big?
I also saw that Kafka has JDBC connectors but it looks like it is not really possible to run it locally (i.e. same JVM) like the other streaming frameworks. Thank you all for the help!
It's true that with Flink, input formats are only intended to be used for batch processing, but that shouldn't be a problem. Flink does batch processing one event at a time, without loading everything into memory. I think what you want should just work.

How to speed up recovery partition on S3 with Spark?

I am using Spark 3.0 on EMR to write some data to S3 with daily partitioning (the data goes back ~5 years), in this way:
writer.option("path", somepath).saveAsTable("my_schema.my_table")
Due to the large number of partitions, the process takes a very long time just to "recover partitions" after all tasks appear to have completed. Is there any way to reduce this intermediate time?
In the above code, you haven't specified the write mode. The default write mode is ErrorIfExists. This could add overhead by checking whether the output exists while writing.
We could also use dynamic partition overwrite, which can optimize a huge volume of writes, as discussed here.
Here is a sample snippet:
# while building the session
sparkSession.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
...
...
# while writing
yourDataFrame.write \
    .option("path", somepath) \
    .partitionBy("date") \
    .mode("overwrite") \
    .saveAsTable("my_schema.my_table")
If you are writing only once and it is not a repeated process, your use case may not need dynamic partitioning. Dynamic partition overwrite is useful for skipping partitions that have already been written: the operation becomes idempotent, with performance benefits.

Memory Management Pyspark

1.) I understand that "Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data".
If this is true, why do we ever get OOM (Out of Memory) errors?
2.) Increasing the number of executor cores increases parallelism. Would that also increase the chances of OOM, because the same memory is now divided into smaller parts for each core?
3.) Spark is much more susceptible to OOM because it performs operations in memory, as compared to Hive, which repeatedly reads from and writes to disk. Is that correct?
There is one angle you need to consider here. You may run into memory problems if the data is not properly distributed. That means you need to distribute your data as evenly as possible across the tasks, so that you reduce shuffling as much as possible and each task manages its own data. So if you need to perform a join and the data is distributed randomly, every task (and therefore executor) will have to:
See what data it has
Send data to other executors (and tasks) so they have the keys they need
Request the data that this task needs from the others
All that data exchange may cause network bottlenecks if you have a large dataset, and it also makes every task hold its own data in memory, plus whatever has been sent, plus temporary objects. All of that will blow up memory.
So to prevent that situation you can:
Load the data already partitioned. By that I mean, if you are loading from a DB, try Spark's stride-based partitioned reads as defined here. Please refer to the partitionColumn, lowerBound and upperBound options. That way you create a number of partitions on the DataFrame that place the data on different tasks based on the criteria you need. If you are going to join two DataFrames, try a similar approach on both so that their partitioning is similar (if not the same), which prevents shuffling over the network.
When you define partitions, try to make the values as evenly distributed among tasks as possible.
The size of each partition should fit in memory. Although it could spill to disk, that would slow down performance.
If you don't have a column that makes the data evenly distributed, try to create one with n different values, matching the n tasks that you have.
If you are reading from a CSV, it is harder to create partitions, but still possible. You can either split the data (CSV) into multiple files and create multiple DataFrames (performing a union after they are loaded), or you can read that big CSV and apply a repartition on the column you need. That creates shuffling as well, but it is done only once if you cache the DataFrame after repartitioning (see the sketch after this list).
Reading from Parquet, you may have multiple files, but if they are not evenly distributed (because the previous process that generated them didn't do it well) you may end up with OOM errors. To prevent that, you can load and repartition the DataFrame too.
Another trick, valid for CSV, Parquet, ORC, etc., is to create a Hive table on top of the files and run a query from Spark with a DISTRIBUTE BY clause on the data, so that Hive redistributes the data instead of Spark.
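A hedged sketch of the repartition-then-cache idea from the CSV point above (the file path, column name, and partition count are hypothetical):

# Read the big CSV, repartition once on the join key, and cache so the shuffle
# is paid a single time before the DataFrame is reused in joins.
df = spark.read.csv("hdfs:///data/big.csv", header=True, inferSchema=True)
df = df.repartition(200, "customer_id").cache()
df.count()   # materialize the cache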
To your question about Hive and Spark, I think you are right up to a point. Depending on the execution engine that Hive uses in your case (MapReduce, Tez, Hive on Spark, LLAP) you can see different behaviours. With MapReduce, as the operations are mostly disk-based, the chance of an OOM is much lower than with Spark. In fact, from a memory point of view, MapReduce is not that affected by a skewed data distribution. But (IMHO) your goal should always be to find the best data distribution for the Spark job you are running, and that will prevent the problem.
Another consideration is whether you are testing in a dev environment that doesn't have the same data as prod. I suppose the data distribution should be similar although the volumes may differ a lot (I am talking from experience ;)). In that case, the Spark tuning parameters you assign on the spark-submit command may need to be different in prod. So you need to invest some time in finding the best approach in dev and then fine-tune in prod.
The huge majority of OOMs in Spark happen on the driver, not the executors. This is usually the result of running .collect or similar actions on a dataset that won't fit in the driver memory.
Spark does a lot of work under the hood to parallelize the work; when using the structured APIs (in contrast to RDDs) the chances of causing an OOM on an executor are really slim. Some combinations of cluster configuration and jobs can cause memory pressure that impacts performance and causes lots of garbage collection, so you need to address it, but Spark should be able to handle low memory without an explicit exception.
Not really: as above, Spark should be able to recover from memory issues when using the structured APIs, though it may need intervention if you see garbage collection and performance impact.
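A hedged sketch of the driver-OOM point above (result and process are hypothetical names): avoid pulling the full result to the driver with collect(); write it out, or iterate partition by partition.

result.write.mode("overwrite").parquet("hdfs:///output/result")  # stays distributed

for row in result.toLocalIterator():  # streams one partition at a time to the driver
    process(row)                      # hypothetical per-row handler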

Apache Spark running out of memory with a smaller number of partitions

I have a Spark application that keeps running out of memory. The cluster has two nodes with around 30 GB of RAM each, and the input data size is a few hundred GB.
The application is a Spark SQL job; it reads data from HDFS, creates a table and caches it, then runs some Spark SQL queries and writes the results back to HDFS.
Initially I split the data into 64 partitions and got OOM; then I was able to fix the memory issue by using 1024 partitions. But why did using more partitions help me solve the OOM issue?
The solution to big data is partitioning (divide and conquer), since not all the data can fit into memory, nor can it all be processed on a single machine.
Each partition can fit into memory and be processed (map) in a relatively short time. After the data has been processed for each partition, it needs to be merged (reduce). This is traditional map-reduce.
Splitting the data into more partitions means that each partition gets smaller.
[Edit]
Spark is built around a concept called the Resilient Distributed Dataset (RDD).
There are two types of operations: transformations and actions.
Transformations map one RDD to another. They are lazily evaluated. Those RDDs can be treated as intermediate results we don't want to materialize.
Actions are used when you really want to get the data. Those RDDs/data can be treated as the results we actually want, like taking the top records.
Spark analyses all the operations and creates a DAG (Directed Acyclic Graph) before execution.
Spark starts computing from the source RDD when an action is fired, and then forgets it.
[DAG diagram omitted; source: cloudera.com]
I made a small screencast for a presentation on YouTube: Spark Makes Big Data Sparking.
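A tiny sketch of the transformation/action distinction described above (the log path is hypothetical):

lines = spark.sparkContext.textFile("hdfs:///logs/app.log")
errors = lines.filter(lambda l: "ERROR" in l)  # transformation: lazy, nothing runs yet
top5 = errors.take(5)                          # action: triggers execution of the DAG
print(top5)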
"Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data." The issue is that large partitions can still generate OOM.
Partitions determine the degree of parallelism. The Apache Spark docs say that the number of partitions should be at least equal to the number of cores in the cluster.
Fewer partitions result in:
Less concurrency,
Increased memory pressure for transformations that involve a shuffle,
More susceptibility to data skew.
Too many partitions can also have a negative impact:
Too much time spent scheduling multiple tasks.
When you store your data on HDFS, it is already partitioned into 64 MB or 128 MB blocks as per your HDFS configuration. When reading HDFS files with Spark, the number of DataFrame partitions (df.rdd.getNumPartitions) depends on the following properties:
spark.default.parallelism (Cores available for the application)
spark.sql.files.maxPartitionBytes (default 128MB)
spark.sql.files.openCostInBytes (default 4MB)
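A hedged sketch of how these properties interact (the path and value are illustrative only): lowering maxPartitionBytes produces more, smaller input partitions.

spark.conf.set("spark.sql.files.maxPartitionBytes", str(64 * 1024 * 1024))  # 64 MB
df = spark.read.parquet("hdfs:///data/events")  # hypothetical path
print(df.rdd.getNumPartitions())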
Links :
https://spark.apache.org/docs/latest/tuning.html
https://databricks.com/session/a-deeper-understanding-of-spark-internals
https://spark.apache.org/faq.html
During a Spark Summit talk, Aaron Davidson gave some tips about partition tuning. He also defined a reasonable number of partitions, summarized in the three points below:
Commonly between 100 and 10000 partitions (note: the two points below are more reliable, because "commonly" depends on the sizes of the dataset and the cluster)
lower bound = at least 2 * the number of cores in the cluster
upper bound = each task should finish within 100 ms
Rockie's answer is right, but it doesn't get to the point of your question.
When you cache an RDD, all of its partitions are persisted (according to the storage level), respecting the spark.memory.fraction and spark.memory.storageFraction properties.
Besides that, at a certain moment Spark can automatically drop some partitions from memory (or you can do this manually for the entire RDD with RDD.unpersist()), according to the documentation.
Thus, when you have more partitions, Spark keeps fewer of them in the LRU cache at a time, so they do not cause OOM (this can have a negative impact too, such as the need to re-cache evicted partitions).
Another important point is that when you write the result back to HDFS using X partitions, you have X tasks for all your data. Take the total data size and divide it by X: that is roughly the memory needed per task, each of which executes on a (virtual) core. So it is not difficult to see why X = 64 led to OOM but X = 1024 did not.
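A hedged sketch of the properties and explicit un-caching mentioned above (the values shown are just the documented defaults, and the path is hypothetical):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.memory.fraction", "0.6")         # share of heap for execution + storage
         .config("spark.memory.storageFraction", "0.5")  # portion of that protected for storage
         .getOrCreate())

cached = spark.read.parquet("hdfs:///data/events").cache()
cached.count()      # materializes the cache; LRU eviction drops partitions if space runs out
cached.unpersist()  # drop the cached partitions explicitly once no longer needed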
