Does Spark need HDFS? - apache-spark

Hi, can anyone explain to me whether Apache Spark Standalone needs HDFS?
If it is required, how does Spark use the HDFS block size during Spark application execution?
What I am trying to understand is what role HDFS plays during Spark application execution.
The Spark documentation says that processing parallelism is controlled through RDD partitions and the executors/cores.
Can anyone please help me understand this?

Spark can work without any issues without HDFS, and it is most certainly not required for core execution.
Some distributed storage (not necessarily HDFS) is required for checkpointing and is useful for saving results.
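For illustration, here is a minimal sketch of a standalone Spark job that never touches HDFS; the master URL, checkpoint directory, and partition count are placeholders, and any shared storage visible to all executors would do for checkpointing:

    import org.apache.spark.sql.SparkSession

    // Standalone cluster, no YARN or HDFS assumed; the master URL is a placeholder.
    val spark = SparkSession.builder()
      .master("spark://master:7077")
      .appName("no-hdfs-example")
      .getOrCreate()
    val sc = spark.sparkContext

    // Checkpointing needs some fault-tolerant shared storage -- not necessarily HDFS.
    sc.setCheckpointDir("/shared/checkpoints")

    // Parallelism comes from the number of partitions and the executors/cores,
    // not from the HDFS block size.
    val rdd = sc.parallelize(1 to 1000000, numSlices = 8)
    rdd.checkpoint()
    println(rdd.sum())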

Related

What is the advantage of using spark with HDFS as file storage system and YARN as resource manager?

I am trying to understand whether Spark is an alternative to the vanilla MapReduce approach for analysing big data. Since Spark keeps the operations on the data in memory, when using HDFS as the storage system for Spark, does it take advantage of HDFS's distributed storage? For instance, suppose I have a 100 GB CSV file stored in HDFS and I want to run some analysis on it. If I load it from HDFS into Spark, will Spark load the complete data in memory to do the transformations, or will it use the distributed environment that HDFS provides for storage, the way MapReduce programs written in Hadoop do? If not, what is the advantage of using Spark over HDFS?
PS: I know Spark spills to disk if there is a RAM overflow, but does this spill happen per node of the cluster (say 5 GB per node) or for the complete data (100 GB)?
Spark jobs can be configured to spill to local executor disk if there is not enough memory to read your files. Alternatively, you can enable HDFS snapshots and caching between Spark stages.
You mention CSV, which is just a bad format to have in Hadoop in general. If you have 100 GB of CSV, you could easily have less than half that if it were written as Parquet or ORC...
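As a rough sketch (the paths are placeholders, and the actual size reduction depends on the data), rewriting the CSV as compressed Parquet from Spark itself is just a read followed by a write:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("csv-to-parquet").getOrCreate()

    // Read the raw CSV (placeholder path) and rewrite it as snappy-compressed Parquet.
    val csv = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/input/huge.csv")

    csv.write
      .option("compression", "snappy")
      .parquet("hdfs:///data/input/huge_parquet")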
At the end of the day, you need some processing engine and some storage layer. For example, Spark on Mesos or Kubernetes might work just as well as on YARN, but those are separate systems and are not bundled and tied together as nicely as HDFS and YARN. Plus, like MapReduce, when using YARN you move the execution to the NodeManagers on the DataNodes rather than pulling data over the network, which is what you would be doing in other Spark execution modes. The NameNode and ResourceManager coordinate this communication for where data is stored and processed.
If you are convinced that MapReduce v2 can be better than Spark, I would encourage looking at Tez instead.

Spark S3 Eventual Consistency Issues

I have several Spark jobs that write data to and read data from S3. Occasionally (about once per week for approximately 3 hours), the Spark jobs will fail with the following exception:
org.apache.spark.sql.AnalysisException: Path does not exist.
I've found that this is likely due to the S3 consistency model, where list operations are eventually consistent. S3Guard claims to solve this issue, but I'm in a Spark environment that doesn't support that utility.
Has anyone else run into this issue and figured out a reasonable approach for dealing with it?
- If you are using AWS EMR, they offer the EMRFS consistent view.
- If you are using Databricks, they offer a consistency mechanism in their transactional I/O.
- Both HDP and CDH ship with S3Guard.
- If you are running your own home-rolled Spark stack, move to Hadoop 2.9+ to get S3Guard, or better yet Hadoop 3.1 for the zero-rename S3A committer.
- Otherwise: don't use S3 as the direct destination of your work.
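If none of those options are available, one stopgap some teams use is simply retrying the read until the listing becomes visible. The helper below is a hypothetical sketch (the name, retry count, and backoff are made up); it does not fix the underlying consistency problem:

    import org.apache.spark.sql.{AnalysisException, DataFrame, SparkSession}

    // Hypothetical helper: retry a read a few times on "Path does not exist",
    // assuming the S3 listing usually becomes consistent within a few minutes.
    def readWithRetry(spark: SparkSession, path: String,
                      attempts: Int = 5, backoffMs: Long = 60000L): DataFrame = {
      var lastError: Throwable = null
      for (_ <- 1 to attempts) {
        try {
          return spark.read.parquet(path)
        } catch {
          case e: AnalysisException =>
            lastError = e
            Thread.sleep(backoffMs)
        }
      }
      throw lastError
    }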

Spark and HDFS on Kubernetes data locality

I'm trying to run Spark on Kubernetes and struggling a bit with data locality. I'm using the native Spark support, but I just watched https://databricks.com/session/hdfs-on-kubernetes-lessons-learned. I've followed the steps there in setting up my HDFS cluster (NameNode on the first Kubernetes node, using host networking). Does anyone know whether the fix to the Spark driver presented there has been merged into the mainline Spark code?
I ask because I still see ANY locality in places where I'd expect NODE_LOCAL.
The code has been part of version v2.2.0-kubernetes-0.4.0.
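Independently of that fix, one scheduler knob worth checking when tasks report ANY where NODE_LOCAL is expected is spark.locality.wait, i.e. how long Spark waits for a local slot before degrading locality. A small hedged sketch, with illustrative values:

    import org.apache.spark.sql.SparkSession

    // Give the scheduler more time to wait for node-local slots before
    // falling back to less local ones (the default wait is 3s); values are illustrative.
    val spark = SparkSession.builder()
      .appName("locality-tuning")
      .config("spark.locality.wait", "10s")
      .config("spark.locality.wait.node", "10s")
      .getOrCreate()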

Spark as Data Ingestion/Onboarding to HDFS

While exploring various tools like [NiFi, Gobblin etc.], I have observed that Databricks is now promoting the use of Spark for data ingestion/on-boarding.
We have a Spark [Scala] based application running on YARN. So far we have been working on a Hadoop and Spark cluster where we manually place the required data files in HDFS first and then run our Spark jobs later.
Now that we are planning to make our application available to clients, we expect any type and number of files [mainly CSV, JSON, XML etc.] from any data source [FTP, SFTP, any relational or NoSQL database] of huge size [ranging from GBs to PBs].
Keeping this in mind, we are looking for options which could be used for data on-boarding and data sanity checks before pushing data into HDFS.
Options we are looking at, in order of priority:
1) Spark for data ingestion and sanity checks: As our application is written for and runs on a Spark cluster, we are planning to use the same for the data ingestion and sanity task as well.
We are a bit worried about Spark's support for many data sources/file types/etc. Also, we are not sure whether, if we try to copy data from, say, an FTP/SFTP source, all workers will write data to HDFS in parallel. Are there any limitations when using it this way? Does Spark maintain any audit trail while copying this data?
2) NiFi in clustered mode: How good would NiFi be for this purpose? Can it be used for any data source and for any size of file? Will it maintain an audit trail? Would NiFi be able to handle such large files? How large a cluster would be required if we try to copy GBs to PBs of data and perform certain sanity checks on top of that data before pushing it into HDFS?
3) Gobblin in clustered mode: We would like to hear answers similar to those for NiFi.
4) Is there any other good option available for this purpose with less infrastructure/cost involved and better performance?
Any guidance/pointers/comparisons for the above-mentioned tools and technologies would be appreciated.
Best Regards,
Bhupesh
After doing some R&D, and considering the fact that using NiFi or Gobblin would demand more infrastructure cost, I have started testing Spark for data on-boarding.
So far I have tried using a Spark job for importing data [present at a remote staging area/node] into my HDFS, and I am able to do that by mounting that remote location on all of my Spark cluster's worker nodes. Doing this made that location local to those workers, so the Spark job ran properly and the data was on-boarded to my HDFS.
Since my whole project is going to be on Spark, keeping the data on-boarding part on Spark does not cost me anything extra. So far it is going well. Hence I would suggest to others as well: if you already have a Spark cluster and a Hadoop cluster up and running, then instead of adding extra cost [where cost could be a major constraint], go for a Spark job for data on-boarding.
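For reference, a minimal sketch of the flow described above (the paths and the sanity rule are placeholders): the staging area is mounted at the same path on every worker, so Spark reads it like a local filesystem, applies a basic sanity check, and writes to HDFS with all executors writing their partitions in parallel:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("onboarding").getOrCreate()

    // The remote staging area is mounted at the same path on every worker node.
    val raw = spark.read
      .option("header", "true")
      .csv("file:///mnt/staging/incoming/*.csv")

    // Hypothetical sanity rule: drop rows where every column is null.
    val clean = raw.na.drop("all")

    // Each executor writes its own partitions to HDFS in parallel.
    clean.write.mode("overwrite").parquet("hdfs:///landing/onboarded/")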

How to share Spark RDD between 2 Spark contexts?

I have an RMI cluster. Each RMI server has a Spark context.
Is there any way to share an RDD between different Spark contexts?
As already stated by Daniel Darabos, it is not possible. Every distributed object in Spark is bound to the specific context which was used to create it (a SparkContext in the case of an RDD, a SQLContext in the case of a DataFrame/Dataset). If you want to share objects between applications you have to use shared contexts (see for example spark-jobserver, Livy, or Apache Zeppelin). Since the RDD or DataFrame itself is just a small local object, there is really not much to share.
Sharing data is a completely different problem. You can use a specialized in-memory cache (Apache Ignite) or a distributed in-memory file system (like Alluxio, formerly Tachyon) to minimize the latency when switching between applications, but you cannot really avoid it.
No, an RDD is tied to a single SparkContext. The general idea is that you have a Spark cluster and one driver program that tells the cluster what to do. This driver would have the SparkContext and kick off operations on the RDDs.
If you want to just move an RDD from one driver program to another, the solution is to write it to disk (S3/HDFS/...) in the first driver and load it from disk in the other driver.
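A minimal sketch of that hand-off (application names, paths, and the example dataset are placeholders):

    import org.apache.spark.sql.SparkSession

    // Application A: persist the result to shared storage.
    val sparkA = SparkSession.builder().appName("producer").getOrCreate()
    val result = sparkA.range(0, 1000000)   // stand-in for the real dataset
    result.write.mode("overwrite").parquet("hdfs:///shared/handoff/result")

    // Application B: a completely separate driver reloads it from storage.
    val sparkB = SparkSession.builder().appName("consumer").getOrCreate()
    val reloaded = sparkB.read.parquet("hdfs:///shared/handoff/result")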
You can't natively. In my understanding, an RDD is not data, but a way to create data via transformations/filters from the original data.
Another idea is to share the final data instead. So, you would store the RDD's output in a data store, such as:
- HDFS (a Parquet file, etc.)
- Elasticsearch
- Apache Ignite (in-memory)
I think you will love Apache Ignite: https://ignite.apache.org/features/igniterdd.html
Apache Ignite provides an implementation of the Spark RDD abstraction which allows you to easily share state in memory across multiple Spark jobs, either within the same application or between different Spark applications.
IgniteRDD is implemented as a view over a distributed Ignite cache, which may be deployed either within the Spark job executing process, or on a Spark worker, or in its own cluster.
(I let you dig their documentation to find what you are looking for.)
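As a rough sketch of how that looks in code, based on the documentation linked above (the cache name and data are placeholders, and sc is assumed to be an existing SparkContext):

    import org.apache.ignite.configuration.IgniteConfiguration
    import org.apache.ignite.spark.IgniteContext

    // Build an Ignite context on top of an existing SparkContext (sc).
    val igniteContext = new IgniteContext(sc, () => new IgniteConfiguration())

    // IgniteRDD is a view over a distributed Ignite cache; the name is a placeholder.
    val sharedRDD = igniteContext.fromCache[Int, Int]("sharedNumbers")
    sharedRDD.savePairs(sc.parallelize(1 to 100000).map(i => (i, i * 2)))

    // Another Spark application using the same cache name sees the same data:
    // igniteContext2.fromCache[Int, Int]("sharedNumbers").count()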
