How does data locality work with OpenStack Swift on IBM Bluemix? - apache-spark

I'm currently playing around with the Apache Spark Service in IBM Bluemix. Since the IBM Cloud relies on OpenStack Swift as the data storage for this service, I'm wondering whether any data locality is possible (or at least achievable) with that architecture.
If I understand correctly, with HDFS the Spark driver asks the HDFS NameNode which DataNodes hold the various blocks of a file and then schedules the work onto the Spark workers accordingly.
I've checked the Swift API and there is a Range parameter that would at least allow a Spark worker to read only its own portion of an object, but how can the Spark driver find out these ranges?
Any ideas?

This is the disaggregation of compute and storage: the Spark compute nodes are not shared with the Swift storage nodes at all. That confers benefits, since compute and storage can each be scaled independently. But in this model you cannot have data locality, by definition. What happens instead, roughly, is that each Spark executor pulls its own range of blocks of the object from the Swift cluster, so that no executor needs to pull in all of the object's data just to operate on its own portion, which would be inefficient. The blocks are still pulled from the remote Swift cluster, though, so they are not local. The only question is whether pulling the blocks into each executor takes long enough to slow you down. In the case of the Bluemix Apache Spark Service and the Bluemix or SoftLayer Object Storage service, there is low latency and a fast network between them.
re: "Since the IBM Cloud relies on OpenStack Swift as Data Storage for this service". There will be other data sources available to the spark service as the beta progresses, so it will not be 100% reliance.

Related

Can we create a Hadoop Cluster on Dataproc with 0%-2% of HDFS?

Is it possible to create a Hadoop cluster on Dataproc with no or very minimal HDFS space by setting dfs.datanode.du.reserved to about 95% or 100% of the total node size? The plan is to use GCS for all persistent storage, while the local file system will primarily be used for Spark's shuffle data. Some of the Hive queries may still need scratch space on HDFS, which explains the need for a minimal HDFS.
I did create a cluster with a 10-90 split and did not notice any issues with my test jobs.
Could there be stability issues with Dataproc if this approach is taken?
Also, are there concerns with removing the DataNode daemon from Dataproc's worker nodes, thereby using the primary workers as compute-only nodes? The rationale is that Dataproc currently doesn't allow a mix of preemptible and non-preemptible secondary workers, so I want to check whether we can repurpose primary workers as compute-only non-PVM nodes while the secondary workers act as compute-only PVM nodes.
I am starting a GCP project; I am reasonably well versed in Azure and less so in AWS, but I know enough there, having done a DDD setup.
What you describe is similar to the AWS setup, and I recently looked at this overview: https://jayendrapatil.com/google-cloud-dataproc/
My impression is that you can run without HDFS here as well, down to 0%. The key point is that performance across a suite of jobs will, as on AWS and Azure, benefit from writing to and reading from ephemeral HDFS, since it is faster than Google Cloud Storage. I cannot see stability issues; you can use Spark today without HDFS if you really want to.
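For reference, a minimal sketch of the pattern described above: on Dataproc the GCS connector is preinstalled, so a Spark job can read and write gs:// paths directly while only shuffle data touches the local disks. The bucket and paths below are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-only").getOrCreate()

# Read directly from Google Cloud Storage; no HDFS involved for persistent data.
df = spark.read.parquet("gs://my-bucket/raw/events/")

# A wide transformation: its shuffle files land on the workers' local disks.
daily = df.groupBy("event_date").count()

# Write results back to GCS, again bypassing HDFS.
daily.write.mode("overwrite").parquet("gs://my-bucket/curated/daily_counts/")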
On the second question: stick to what they have engineered. Why try to force things? On AWS we lived with the limitations on scaling down with Spark.

HDFS vs. HDFS with YARN, and if I use Spark can I plug in a different resource manager?

What is the use case for HDFS alone, without YARN, and with it? Should I use MapReduce, or can I only use Spark? Also, if I use Spark, can I use a different resource manager for Spark instead of YARN on the same system? And which is the optimal solution here; how do I decide between them, based on the use case?
Sorry, I don't have a specific use case!
Hadoop Distributed File System
HDFS is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks. When that quantity and quality of enterprise data is available in HDFS, and YARN enables multiple data access applications to process it, Hadoop users can confidently answer questions that eluded previous data platforms.
HDFS is a scalable, fault-tolerant, distributed storage system that works closely with a wide variety of concurrent data access applications, coordinated by YARN. HDFS will "just work" under a variety of physical and systemic circumstances. By distributing storage and computation across many servers, the combined storage resource can grow linearly with demand while remaining economical at every scale of storage.
Take Away
HDFS is based on a master/slave architecture, with the NameNode (NN) being the master and the DataNodes (DN) being the slaves.
The NameNode stores only the metadata about files; the actual data is stored on the DataNodes.
Both the NameNode and the DataNode are processes, not any special hardware.
The DataNode uses the underlying OS file system to store the data.
You need an HDFS client to interact with HDFS. The client always talks to the NameNode for metadata and then talks to the DataNodes to read/write data; no data I/O goes through the NameNode.
HDFS clients never send data to the NameNode, hence the NameNode never becomes a bottleneck for any data I/O in the cluster.
The HDFS client supports "short-circuit" reads: if the client runs on a node that also hosts a DataNode holding the block, it can read the data directly from that DataNode, making the read/write completely local.
To simplify even further, think of the HDFS client as a web client and HDFS as a whole as a web service with predefined operations such as GET, PUT, COPYFROMLOCAL, etc.
How is a 400 MB file saved on HDFS with an HDFS block size of 100 MB? It is split into four 100 MB blocks, and each block is stored (and replicated) across the DataNodes.
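As a quick sketch of the client view described above, reading a file with pyarrow goes through exactly this metadata-then-data flow. The NameNode host, port, and file path are hypothetical, and this assumes the Hadoop native client libraries (libhdfs, CLASSPATH) are available on the machine running it.

from pyarrow import fs

# Connect to the NameNode (hypothetical host/port).
hdfs = fs.HadoopFileSystem("namenode.example.com", port=8020)

# Metadata request: answered by the NameNode (file type, size); no data bytes yet.
info = hdfs.get_file_info("/data/logs/2023-01-01.log")
print(info.size)

# Data request: the bytes stream directly from the DataNodes, never via the NameNode.
with hdfs.open_input_stream("/data/logs/2023-01-01.log") as stream:
    first_block = stream.read(100 * 1024 * 1024)  # roughly one 100 MB block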
YARN (Yet Another Resource Negotiator)
(Does that ring a bell? "Yet Another Hierarchically Organized Oracle", i.e. YAHOO.)
YARN is essentially a system for managing distributed applications. It consists of a central Resource Manager (RM), which arbitrates all available cluster resources, and a per-node Node Manager (NM), which takes direction from the Resource Manager. The Node Manager is responsible for managing the available resources on a single node.
Take Away
YARN is based on a master/slave architecture, with the Resource Manager being the master and the Node Managers being the slaves.
The Resource Manager keeps metadata about which jobs are running on which Node Manager and how much memory and CPU each consumes, and hence has a holistic view of the total CPU and RAM consumption of the whole cluster.
Jobs run on the Node Managers and never execute on the Resource Manager, hence the RM never becomes a bottleneck for job execution. Both the RM and the NM are processes, not special hardware.
A container is a logical abstraction of CPU and RAM.
YARN schedules containers (CPU and RAM) over the whole cluster. Hence, when an end user needs CPU and RAM in the cluster, they need to interact with YARN.
When requesting CPU and RAM, you can specify the host on which you need it.
To interact with YARN you need to use a YARN client, which negotiates the requested containers with the Resource Manager on your behalf, as in the sketch below.
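A minimal sketch of a YARN client in action via Spark: the driver asks the Resource Manager for executor containers. The resource sizes and app name are arbitrary examples, and this assumes HADOOP_CONF_DIR points at a cluster whose Resource Manager is reachable.

from pyspark.sql import SparkSession

# Ask YARN for two executor containers of 2 cores / 4 GB each (example values).
spark = (
    SparkSession.builder
    .master("yarn")
    .appName("yarn-container-demo")
    .config("spark.executor.instances", "2")
    .config("spark.executor.cores", "2")
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)

# The driver has negotiated containers with the Resource Manager; Node Managers
# on the worker hosts have launched the executors inside those containers.
print(spark.sparkContext.applicationId)
spark.stop()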
How HDFS and YARN work in tandem
The NameNode and Resource Manager processes are hosted on different hosts, as they hold key metadata.
The DataNode and Node Manager processes are co-located on the same hosts.
A file is saved onto HDFS (DataNodes). To access it in a distributed way, one writes a YARN application (MR2, Spark, Distributed Shell, a Slider application) using the YARN client, and uses the HDFS client to read the data.
The distributed application can fetch the file's block locations (metadata from the NameNode) and ask the Resource Manager (YARN) to provide containers on the hosts that hold those blocks.
Remember the short-circuit optimization provided by HDFS: if the distributed job gets a container on a host that holds the file block and reads it there, the read is local rather than over the network.
The same file that would take 4 seconds to read sequentially (at 100 MB/s) can be read in about 1 second, because the distributed processes run in parallel in different YARN containers (Node Managers), each reading its own 100 MB block, i.e. 4 × 100 MB in one second.
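A back-of-the-envelope sketch of that 400 MB example in PySpark (the HDFS path is hypothetical): by default Spark creates roughly one input partition per HDFS block, so four tasks can read the four 100 MB blocks in parallel, ideally on the hosts that store them.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("yarn").appName("block-parallel-read").getOrCreate()
sc = spark.sparkContext

# Hypothetical path to a 400 MB file stored as four 100 MB HDFS blocks.
rdd = sc.textFile("hdfs://namenode.example.com:8020/data/400mb_file.txt")

# Roughly one partition per HDFS block, so four tasks can run in parallel.
print(rdd.getNumPartitions())  # typically 4 here

# With containers placed on the hosts holding the blocks, each task reads
# ~100 MB locally in ~1 s, instead of one reader spending ~4 s sequentially.
print(rdd.count())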
HDFS can be used as a fault-tolerant filesystem without YARN, yes. But so can Ceph, MinIO, GlusterFS, etc., each of which can work with Spark.
That addresses storage. For processing, you can only configure one scheduler per Spark job, but the same code should be able to run in any environment, whether that be YARN, Spark Standalone, Mesos, or Kubernetes; ideally you would not install these together.
Therefore, neither HDFS nor YARN is required for Spark.
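As one concrete sketch of the storage point (endpoint, credentials, and bucket names are placeholders, and this assumes the hadoop-aws/S3A connector jars are on Spark's classpath), Spark can target an S3-compatible store such as MinIO with no HDFS at all:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark-without-hdfs")
    # Point the S3A connector at a MinIO (or any S3-compatible) endpoint.
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio.example.com:9000")
    .config("spark.hadoop.fs.s3a.access.key", "PLACEHOLDER_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "PLACEHOLDER_SECRET")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Read and write against the object store instead of HDFS.
df = spark.read.json("s3a://my-bucket/input/")
df.write.parquet("s3a://my-bucket/output/")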

Do nodes in Spark Cluster share the same storage?

I am a newbie to Spark. I am using Azure Databricks and I am writing Python code with PySpark. There is one particular topic that is confusing me:
Do nodes have separate storage (I don't mean RAM/cache), or do they all share the same storage? If they share the same storage, can two different applications running in different Spark contexts exchange data that way?
I also don't understand why we sometimes refer to the storage as dbfs:/tmp/... and at other times as /dbfs/tmp/... For example, if I am using the dbutils package from Databricks, we use something like dbfs:/tmp/... to refer to a directory in the file system; however, if I'm using regular Python code, I write /dbfs/tmp/...
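For example, this is the pattern I keep running into (the paths are made up, and this only works inside a Databricks notebook, where dbutils and spark already exist):

# DBFS API style: dbutils and Spark readers take "dbfs:/..." URIs.
print(dbutils.fs.ls("dbfs:/tmp/my_dir"))
df = spark.read.csv("dbfs:/tmp/my_dir/data.csv", header=True)

# Local-file style: plain Python sees the same storage through the /dbfs mount.
with open("/dbfs/tmp/my_dir/notes.txt") as f:
    print(f.read())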
Your help is much appreciated!!
Each node has its own RAM and cache. For example, say you have a cluster with 3 nodes of 4 GB each: when you deploy your Spark application, it runs worker processes, depending on the cluster configuration and query requirements, on separate nodes or on the same node. These nodes' memories are not shared with each other during the life of the application.
This is more of a Hadoop resource-sharing question; you can find more information under YARN resource management. Here is a very brief overview:
https://databricks.com/session/resource-management-and-spark-as-a-first-class-data-processing-framework-on-hadoop

How is data distribution achieved in Azure HDInsight during processing?

One of the selling points of Hadoop is that the data sits with the compute. How does that work with WASB?
When processing a MapReduce job, the map and reduce tasks are executed where the blocks of data reside. This is how data locality is achieved.
But in the case of HDInsight, the data is stored in WASB. So when the MapReduce job is executed, is the data copied from WASB to each of the compute nodes before processing proceeds? If so, wouldn't the single channel for copying data to the compute nodes become a bottleneck?
Can anyone explain to me how data is stored in WASB and how it is handled during processing?
Just like with any Hadoop system, the data is loaded into memory on the individual nodes at compute time (when the job runs). The difference with WASB is that the data is loaded from Azure storage accounts instead of from local disks. Given the way Azure data-center backbones are built, the performance is generally on par with disks locally attached to the VMs.
HDInsight clusters can be located in any of Azure's regions. The storage accounts that a cluster reads from should be in the same region to avoid high latency. Azure has done a lot of work on its data centers so that performance is comparable.
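For illustration, a minimal sketch of how a job on an HDInsight cluster addresses data in WASB (a Spark job is used here for brevity, and the account, container, and path are hypothetical; the cluster's default storage account can also be referenced with relative paths):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wasb-read").getOrCreate()

# Fully qualified WASB URI: wasbs://<container>@<account>.blob.core.windows.net/<path>
df = spark.read.csv(
    "wasbs://mycontainer@mystorageaccount.blob.core.windows.net/data/input.csv",
    header=True,
)

# Each task streams its own split of the blob over the data-center network at
# job time; nothing is pre-copied onto the workers' local disks.
print(df.count())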
If you want to learn more, Ashish's quote comes from this article:
https://blogs.msdn.microsoft.com/cindygross/2015/02/04/understanding-wasb-and-hadoop-storage-in-azure/

For Hadoop, which data storage to choose, Amazon S3 or Azure Blob Store?

I am working on a Hadoop project and generating lots of data in my local cluster. Sooner or later I will be using a cloud-based Hadoop solution, because my Hadoop cluster is very small compared to the real workload; however, I have not yet decided which one I will use, i.e. Windows Azure based, EMR, or something else. I am generating lots of data locally and want to store it in some cloud-based storage, given that I will use this data with Hadoop later, but very soon.
I am looking for suggestions, based on someone's experience, to help decide which cloud store to choose. Thanks in advance.
First of all, it is a great question. Let's try to understand how data is processed in Hadoop:
In Hadoop, all data is processed on the Hadoop cluster; that means when you process any data, it is first copied from its source to HDFS, which is an essential component of Hadoop.
Only after the data is copied to HDFS do you run Map/Reduce jobs on it to get your results.
That means it does not matter what or where your data source is (Amazon S3, Azure Blob, SQL Azure, SQL Server, an on-premises source, etc.): you will have to move/transfer/copy your data from the source to HDFS, within the limits of Hadoop.
Once the data is processed in the Hadoop cluster, the result is stored in the location you have configured in your job. The output destination can be HDFS or an outside location accessible from the Hadoop cluster.
Once you have data copied to HDFS, you can keep it in HDFS as long as you want, but you will have to pay the price of keeping the Hadoop cluster running.
In some cases, when you run Hadoop jobs at intervals and the data move/copy can be done quickly, it is good to have a strategy of 1) acquiring a Hadoop cluster, 2) copying the data, 3) running the job, and 4) releasing the cluster.
So, based on the above, when you choose a data store in the cloud for your Hadoop cluster you have to consider the following:
If you have large data to process (which is normal with Hadoop clusters), consider the different data sources and the time it will take to copy/move data from those sources to HDFS, because this will be your first step.
You should choose the data source with the lowest network latency, so you can get data in and out as fast as possible.
You also need to consider how you will move a large amount of data from your current location to any cloud store. The best option is a storage service to which you can ship your data on physical media (HDD/tape, etc.), because uploading multiple terabytes of data will take a great amount of time.
Amazon EMR (already available), Windows Azure (HadoopOnAzure, in CTP), and Google (BigQuery in preview, based on Google Dremel) provide pre-configured Hadoop clusters in the cloud, so first decide where you want to run your Hadoop job and then consider the corresponding cloud storage.
Even if you choose one cloud data store and later decide to move to another because you want to use a different Hadoop cluster in the cloud, you certainly can transfer the data; however, consider the time and the data-transfer support available to you.
For example, with HadoopOnAzure you can connect various data sources, e.g. Amazon S3, Azure Blob Storage, SQL Server, and SQL Azure, so support for a variety of data sources is valuable with any cloud Hadoop cluster.