I am trying to implement Spark Streaming checkpoints, using GCS as the checkpoint storage. Enabling checkpointing causes the performance of the job to degrade. I am wondering whether checkpoints can be written to SQL or some other storage that would be faster than writing to HDFS or GCS.
Spark 3.x (and earlier versions) does not provide native support for checkpointing data directly to a SQL database. You have to checkpoint to a file system or a distributed file system like HDFS/GCS/S3.
Having said that, you can write (and later read back) your own custom checkpointing mechanism targeting a different destination.
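For example, a hand-rolled mechanism could persist Kafka offsets to a relational table over JDBC instead of a checkpoint directory. The sketch below is only illustrative: the `stream_offsets` table, its columns, and the connection URL are assumptions, and you would need matching read logic to restore the offsets on restart.

```scala
import java.sql.DriverManager
import org.apache.spark.streaming.kafka010.OffsetRange

// Hypothetical offset store: one row per topic-partition per batch.
// Assumed table: stream_offsets(topic VARCHAR, kafka_partition INT, until_offset BIGINT).
def saveOffsets(jdbcUrl: String, offsets: Array[OffsetRange]): Unit = {
  val conn = DriverManager.getConnection(jdbcUrl)
  try {
    val stmt = conn.prepareStatement(
      "INSERT INTO stream_offsets (topic, kafka_partition, until_offset) VALUES (?, ?, ?)")
    offsets.foreach { range =>
      stmt.setString(1, range.topic)
      stmt.setInt(2, range.partition)
      stmt.setLong(3, range.untilOffset)
      stmt.addBatch()
    }
    stmt.executeBatch()
  } finally {
    conn.close()
  }
}
```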
Related
As far as I understand, for a Spark streaming application (Structured Streaming or otherwise), Spark provides checkpointing so that you do not have to manage offsets manually: you just configure the checkpoint location (HDFS most of the time) while writing the data to your sink, and Spark itself takes care of managing the offsets.
But I see a lot of use cases where checkpointing is not preferred and instead an offset management framework is created to save offsets in HBase or MongoDB etc. I just wanted to understand why checkpointing is not preferred and why a custom framework is created to manage the offsets instead.
Is it because it will lead to the small-files problem in HDFS?
https://blog.cloudera.com/offset-management-for-apache-kafka-with-apache-spark-streaming/
Small files are just one problem for HDFS. ZooKeeper would be more recommended out of your listed options, since you would likely have a ZooKeeper cluster (or several) as part of the Kafka and Hadoop ecosystem anyway.
The reason checkpoints aren't used is that they are tightly coupled to the code's topology. For example, if you run map, filter, reduce, or other Spark functions, then the exact order of those operations matters and is recorded in the checkpoint.
Storing externally will keep consistent ordering, but with different delivery semantics.
You could also just store the offsets in Kafka itself (but disable auto-commit):
https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#storing-offsets
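The pattern from that guide looks roughly like this; `ssc` is assumed to be an existing StreamingContext, and broker, group, and topic names are placeholders:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "my-group",
  "enable.auto.commit" -> (false: java.lang.Boolean)  // commit manually below
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... write the batch to your sink here ...
  // Only after the output has completed, commit the offsets back to Kafka.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
```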
In the past, the general consensus was that you should not use S3 as a checkpointing location for Spark Structured Streaming applications.
However, now that S3 offers strong read after write consistency, is it safe to use S3 as a checkpointing location? If it is not safe, why?
In my experiments, I continue to see checkpointing related exceptions in my Spark Structured streaming applications, but I am uncertain where the problem actually lies.
Not really. You get consistency of listings and updates, but rename is still emulated with a copy and a delete... and I think the standard checkpoint algorithm depends on it.
Hadoop 3.3.1 added a new API, Abortable, to aid a custom S3 stream checkpoint committer. The idea is that the checkpointer would write straight to the destination but abort the write when aborting the checkpoint; a normal close() would finish the write and manifest the file. See https://issues.apache.org/jira/browse/HADOOP-16906
AFAIK nobody has written the actual committer yet. Opportunity for you to contribute there...
You have really answered your own question. You do not state whether you are on Databricks or EMR, so I am going to assume EC2.
Use HDFS as the checkpoint location, on local EC2 disk.
Where I am now, we have HDFS (via HDP) alongside IBM S3, and HDFS is still used for checkpointing.
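As a rough sketch of that setup (the paths and the `streamingDf` DataFrame are placeholders), a Structured Streaming query just points its checkpoint at the HDFS cluster running on the local EC2 disks:

```scala
// Minimal sketch: checkpoint a streaming query to HDFS instead of GCS/S3.
val query = streamingDf.writeStream
  .format("parquet")
  .option("path", "hdfs:///data/output")
  .option("checkpointLocation", "hdfs:///checkpoints/my-query")
  .start()

query.awaitTermination()
```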
I am trying to understand whether Spark is an alternative to the vanilla MapReduce approach for analysis of big data. Since Spark keeps the operations on data in memory, when using HDFS as the storage system for Spark, does it take advantage of the distributed storage that HDFS provides? For instance, suppose I have a 100 GB CSV file stored in HDFS and I want to analyse it. If I load it from HDFS into Spark, will Spark load the complete data in memory to do the transformations, or will it use the distributed environment that HDFS provides for storage, the way MapReduce programs written in Hadoop do? If not, what is the advantage of using Spark over HDFS?
PS: I know Spark spills to disk if there is a RAM overflow, but does this spill occur per node of the cluster (say 5 GB per node), or for the complete data (100 GB)?
Spark jobs can be configured to spill to local executor disk, if there is not enough memory to read your files. Or you can enable HDFS snapshots and caching between Spark stages.
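For example, explicitly persisting an intermediate result with a memory-and-disk storage level lets Spark spill partitions that do not fit in RAM rather than fail; `df` below is only a stand-in for whatever intermediate DataFrame your job produces:

```scala
import org.apache.spark.storage.StorageLevel

// Cache an intermediate result so later stages reuse it, letting Spark spill
// partitions to local executor disk when memory runs out.
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()  // materialise the cache before the downstream stages run
```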
You mention CSV, which is just a bad format to have in Hadoop in general. If you have 100GB of CSV, you could just as easily have less than half that if written in Parquet or ORC...
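A one-off re-encoding into Parquet is usually all it takes; the paths below are placeholders and `spark` is assumed to be an existing SparkSession:

```scala
// Re-encode the CSV as Parquet once, then query the columnar copy afterwards.
val csv = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///raw/events.csv")

csv.write.mode("overwrite").parquet("hdfs:///curated/events")
```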
At the end of the day, you need some processing engine and some storage layer. For example, Spark on Mesos or Kubernetes might work just as well as on YARN, but those are separate systems and are not bundled and tied together as nicely as HDFS and YARN. Plus, like MapReduce, when using YARN you are moving the execution to the NodeManagers on the datanodes, rather than pulling data over the network, which you would be doing with other Spark execution modes. The NameNode and ResourceManager coordinate this communication for where data is stored and processed.
If you are convinced that MapReduce v2 can be better than Spark, I would encourage looking at Tez instead.
I am using mapWithState in my Java Spark Streaming application, and I would like to avoid doing a checkpoint. The reason is that I do not want to install HDFS. I believe a checkpoint is only required for fault tolerance.
However, if I do not care about fault tolerance, is it possible to skip the checkpoint but still use mapWithState?
It is not possible; the checkpoint directory configuration is mandatory for the mapWithState operator. It is not only for fault tolerance, it is also for storing the previous state of objects (typically in HDFS). One alternative you can try is providing a local file system path as the checkpoint directory when running the application in local mode.
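A sketch of that local-mode workaround, with a placeholder `file://` path as the checkpoint directory (suitable for local testing only):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Local-mode streaming context with a local checkpoint directory instead of HDFS.
val conf = new SparkConf().setMaster("local[2]").setAppName("stateful-local")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("file:///tmp/spark-checkpoints")
```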
Is it possible to skip the checkpoint with mapWithState?
No. At least not as of Spark 2.0.
mapWithState uses the checkpoint directory to store the state of the streaming data. That is the cost you have to pay for making Spark Streaming stateful. You should consider using Kryo serialization when building stateful applications with Spark Streaming.
Snapshot from Databricks.
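For reference, enabling Kryo comes down to a couple of configuration settings; `SessionState` below is just a placeholder for whatever class you keep in the mapWithState state store:

```scala
import org.apache.spark.SparkConf

// Placeholder for the state class kept by mapWithState.
case class SessionState(count: Long, lastSeen: Long)

// Switch the serializer to Kryo; registering classes avoids writing full
// class names into the serialized state.
val conf = new SparkConf()
  .setAppName("stateful-streaming")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[SessionState]))
```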
I am new to Apache Spark. I would like to know whether it is possible to store data using Apache Spark, or is it only a processing tool?
Thanks for spending your time,
Satya
Spark is not a database, so it cannot "store data". It processes data and holds it temporarily in memory, but that is not persistent storage.
In real-life use cases you usually have a database, or a data repository, from which you access data with Spark.
Spark can access data that's in:
SQL Databases (Anything that can be connected using JDBC driver)
Local files
Cloud storage (eg. Amazon S3)
NoSQL databases.
Hadoop Distributed File System (HDFS)
and many more...
Detailed description can be found here: http://spark.apache.org/docs/latest/sql-programming-guide.html#sql
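To illustrate a couple of the sources listed above, reads might look like the following; hostnames, table names, credentials, and bucket paths are all placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("read-examples").getOrCreate()

// Read a table from a SQL database over JDBC.
val jdbcDf = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/analytics")
  .option("dbtable", "public.events")
  .option("user", "report_user")
  .option("password", "report_pass")
  .load()

// Read Parquet files from cloud storage (S3).
val s3Df = spark.read.parquet("s3a://my-bucket/events/")
```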
Apache Spark is primarily a processing engine. It works with underlying file systems such as HDFS, S3, and other supported file systems. It can also read data from relational databases. But primarily it is an in-memory distributed processing tool.
As you can read in Wikipedia, Apache Spark is defined as:
is an open source cluster computing framework
When we talk about computing, it relates to a processing tool; in essence, it lets you work in a pipeline scheme (somewhat like ETL): you read the dataset, you process the data, and then you store the processed data, or models that describe the data.
If your main objective is to distribute your data, there are good alternatives like HDFS (Hadoop Distributed File System) and others.