Separating Hive storage from Spark cluster (compute layer) - apache-spark

We have a scenario to use Storage capabilities of Hive (HDFS underneath) and Computing power of Spark cluster in the cloud environment. Is there a way that we can separate the two layers clearly.
Scenario:
Hive keeps getting data on regular basis (persistence layer). This
can't be deleted/removed at wish.
Processing the data sitting in Hive layer using Spark cluster at any
point. But we don't want to keep the cluster infrastructure in idle
state once computing is finished.
So, we are thinking of having cluster created in cloud just before the processing is needed and delete the spark cluster as soon as the processing is over. Advantage will be in saving cost of keeping the cluster resources .
If we load data onto Hive in one cluster of nodes, then can we read this data for processing in a spark cluster without doing data movement.
Assumption- the datanodes of Hadoop are not using a high end configuration and they are not suitable for doing spark in memory processing (low on CPUs; low on RAM).
Please suggest if this kind of scenario is possible in cloud infrastructure (GCP). Is there a better way to approach this.

Related

Can we create a Hadoop Cluster on Dataproc with 0%-2% of HDFS?

Is it possible to create a Hadoop cluster on Dataproc with no or very minimal HDFS space by setting dfs.datanode.du.reserved to about 95% or 100% of the total node size? The plan is to use GCS for all persistent storage while the local file system will primarily be used for Spark's shuffle data. Some of the Hive queries may still need the scratch on HDFS which explains the need for minimal HDFS.
I did create a cluster with a 10-90 split and did not notice any issues with my test jobs.
Could there be stability issues with Dataproc if this approach is taken?
Also, are there concerns with deleting the Data Node daemon from Dataproc's worker nodes, thereby using the Primary workers as compute only nodes. The rationale is that Dataproc currently doesn't allow a mix of preemptible and non preemptible secondary workers. So want to check if we can repurpose primary workers as compute only non-PVM nodes while the other secondary workers can be compute only PVM nodes.
I am starting a GCP project and am well-versed enough in AZURE and AWS less so, but know enough there having done a DDD setup.
What you describe is similar to AWS setup and I looked recently here: https://jayendrapatil.com/google-cloud-dataproc/
My impression is you can run without HDFS here as well - 0%. The key point is that performance with a suite of jobs will - like also for AWS & AZURE - benefit from writing to and reading from ephemeral HDFS, as it is faster than Google Cloud Storage. I cannot see stability issues; I can use Spark now without HDFS if I really want.
On the 2nd question, stick to what they have engineered. Why try and force things? On AWS we lived with the limitations on scaling down with Spark.

What is the advantage of using spark with HDFS as file storage system and YARN as resource manager?

I am trying to understand if spark is an alternative to the vanilla MapReduce approach for analysis of BigData. Since spark saves the operations on the data in the memory so while using the HDFS as storage system for spark , does it take the advantage of distributed storage of the HDFS? For instance suppose i have 100GB CSV file stored in HDFS, now i want to do analysis on it. If i load that from HDFS to spark , will spark load the complete data in-memory to do the transformations or it will use the distributed environment for doing its jobs that HDFS provides for Storage which is leveraged by the MapReduce programs written in hadoop. If not then what is the advantage of using spark over HDFS ?
PS: I know spark spills on the disks if there is RAM overflow but does this spill occur for data per node(suppose 5 GB per node) of the cluster or for the complete data(100GB)?
Spark jobs can be configured to spill to local executor disk, if there is not enough memory to read your files. Or you can enable HDFS snapshots and caching between Spark stages.
You mention CSV, which is just a bad format to have in Hadoop in general. If you have 100GB of CSV, you could just as easily have less than half that if written in Parquet or ORC...
At the end of the day, you need some processing engine, and some storage layer. For example, Spark on Mesos or Kubernetes might work just as well as on YARN, but those are separate systems, and are not bundled and tied together as nicely as HDFS and YARN. Plus, like MapReduce, when using YARN, you are moving the execution to the NodeManagers on the datanodes, rather than pulling over data over the network, which you would be doing with other Spark execution modes. The NameNode and ResourceManagers coordinate this communication for where data is stored and processed
If you are convinced that MapReduceV2 can be better than Spark, I would encourage looking at Tez instead

Can Apache Spark worker nodes be different machines than HDFS data nodes?

I have a HDFS cluster (say it has 5 data nodes), if I want to setup a Spark cluster (say it has 3 worker nodes) read/write data to the HDFS cluster, do I need to make sure the Spark worker nodes are in the same machines of the HDFS data nodes? IMO they can be different machines. But if Spark worker nodes and HDFS data nodes are different machines, when read data from HDFS, Spark worker nodes need to download data from different machines which can lead to higher latency. While if they are on same machines latency can be reduced. Is my understanding correct?
In a bare metal set up and as originally postulated by MR, the Data Locality principle applies as you state, and Spark would be installed on all the Data Nodes, implying they would be also a Worker Node. So, Spark Worker resides on Data Node for rack-awareness and Data Locality for HDFS. That said, there are other storage managers such as KUDU now and other NOSQL variants that do not use HDFS.
With Cloud approaches for Hadoop you see Storage and compute divorced necessarily, e.g. AWS EMR and EC2, et al. That cannot be otherwise in terms of elasticity in compute. Not that bad as Spark shuffles to same Workers once data gotten for related keys where possible.
So, for Cloud the question is not actually relevant anymore. For bare metal Spark can be installed on different machines but would not make sense. I would install on all HDFS nodes, 5 not 3 as I understand in such a case.

auto scale spark cluster

I have a spark streaming job running on a cluster. Spark job pulls messages from Kafka and do the required processing before dumping the processed data to database. I have sized my cluster as per the current load. But this load requirement may go up/down in the future. I want to know the techniques to facilitate this auto scaling without restarting the job. Scaling becomes more complicated if kakfa is being used (as in my case) as I won't like the partitions to be moved around in stateful streaming. Currently the cluster is completely in house but I won't mind migrating to cloud if that assists the scaling use case.
it is not an answer. Just some notes
"in stateful streaming". What did you mean by that? All state in spark is distributed. And you should not rely on local system, as if some task failed, it can be send to any other executor.
do you speak about increasing size of cluster or resources dedicated for your spark job in cluster?
If the first one, you need to monitor each node (memory, cpu) and when it's time (hit some threshold) add more nodes.
If the second one: we didn't find nice solution. Spark provides 'autoscaling' feature, however it doesn't work properly with kafka streaming.

Spark as Data Ingestion/Onboarding to HDFS

While exploring various tools like [Nifi, Gobblin etc.], I have observed that Databricks is now promoting for using Spark for data ingestion/on-boarding.
We have a spark[scala] based application running on YARN. So far we are working on a hadoop and spark cluster where we manually place required data files in HDFS first and then run our spark jobs later.
Now when we are planning to make our application available for the client we are expecting any type and number of files [mainly csv, jason, xml etc.] from any data source [ftp, sftp, any relational and nosql database] of huge size [ranging from GB to PB].
Keeping this in mind we are looking for options which could be used for data on-boarding and data sanity before pushing data into HDFS.
Options which we are looking for based on priority:
1) Spark for data ingestion and sanity: As our application is written and is running on spark cluster, we are planning to use the same for data ingestion and sanity task as well.
We are bit worried about Spark's support for many datasources/file types/etc. Also, we are not sure if we try to copy data from let's say any FTP/SFTP then will all workers will write data on HDFS in parallel? Is there any limitation while using it? Is there any Audit trail maintained by Spark while this data copy?
2) Nifi in clustered mode: How good Nifi would be for this purpose? Can it be used for any datasource and for any size of file? Will be maintain the Audit trail? Would Nifi we able to handle such large files? How large cluster would be required in case we try to copy GB - PB of data and perform certain sanity on top of that data before pushing it to HDFS?
3) Gobblin in clustered mode: Would like to hear similar answers as that for Nifi?
4) If at all there is any other good option available for this purpose with lesser infra/cost involved and better performance?
Any guidance/pointers/comparisions for above mentioned tools and technologies would be appreciated.
Best Regards,
Bhupesh
After doing certain R&D and considering the fact that using NIFI or goblin will demand for more infrastructure cost. I have started testing Spark for data on-boarding.
SO far I have tried using Spark job for importing data [present at a remote staging area/node] into my HDFS and I am able to do that by mounting that remote location with all my spark cluster worker nodes. Doing this made that location local to those workers, hence spark job ran properly and data is on-boarded to my HDFS.
Since my whole project is going to be on Spark, hence keeping data on-boarding part on spark would not cost anything extra to me. So far I am going good. Hence I would suggest to others as well, if you already have spark cluster and hadoop cluster up and running then instead of adding extra cost [where cost could be a major constraint] go for spark job for data on-boarding.

Resources