One of the selling points of Hadoop is that the data sits with the compute. How does that work with WASB?
When a MapReduce job runs, the map and reduce tasks are executed on the nodes where the blocks of data reside. This is how data locality is achieved.
But in the case of HDInsight, the data is stored in WASB. So when a MapReduce job is executed, is the data copied from WASB to each compute node before processing proceeds? If so, the single channel used to copy data to the compute nodes will be a bottleneck.
Just like with any Hadoop system, the data is loaded into memory on the individual nodes at compute time (when the job runs). The difference with WASB is that the data is loaded from Azure storage accounts instead of from local disks. Given the way Azure data center backbones are built, performance is generally comparable to that of disks locally attached to the VMs.
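For orientation, a job on HDInsight refers to its input by a storage URI rather than by a local HDFS path. The sketch below builds such a URI in the commonly documented form wasb[s]://<container>@<account>.blob.core.windows.net/<path>. The helper name wasb_uri and the container, account, and path values are made up for illustration.

```python
def wasb_uri(container, account, path, secure=True):
    """Build a WASB URI for a blob path in an Azure storage account."""
    # "wasbs" is the TLS variant of the "wasb" scheme
    scheme = "wasbs" if secure else "wasb"
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


# Hypothetical names, used only for illustration
input_uri = wasb_uri("mycontainer", "mystorageaccount", "/data/input/logs.csv")
print(input_uri)
```

A URI of this shape is what would typically be passed as the input or output location of a job, in place of an hdfs:// path.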

HDInsight clusters can be located in any of Azure's regions. To avoid high latency, the storage accounts that a cluster reads from must be in the same region as the cluster. Azure has done a lot of work on its data centers so that performance is comparable to local disks.
If you want to learn more, Ashish's quote comes from this article:


We have a scenario where we want to use the storage capabilities of Hive (HDFS underneath) and the computing power of a Spark cluster in a cloud environment. Is there a way to separate the two layers cleanly?
Hive keeps receiving data on a regular basis (persistence layer). This can't be deleted/removed at will.
The data sitting in the Hive layer is processed using a Spark cluster at any point. But we don't want to keep the cluster infrastructure idle once computing is finished.
So, we are thinking of creating the Spark cluster in the cloud just before processing is needed and deleting it as soon as processing is over. The advantage is saving the cost of keeping the cluster resources running.
If we load data into Hive on one cluster of nodes, can we then read this data for processing in a Spark cluster without moving the data?
Assumption: the Hadoop datanodes do not have a high-end configuration and are not suitable for Spark in-memory processing (low on CPUs, low on RAM).
What Hadoop configuration should be used to analyze 100 GB of CSV files in Spark?

I have around 100 GB of data in CSV format on which I intend to do some transformations, such as aggregation and data splitting, and after that do some clustering using the ML package of Apache Spark.
I tried uploading the data to MySQL and automating the process in Python, but it is taking too much time to build any solution.
What configuration do I need to set up, and how should I get started with Spark?
I am new to Spark. I am planning to use cloud services.
I'm going to recommend you learn to use Spark locally with a small subset of the data; you can run it standalone with a few tens, moving up to hundreds, of MB. It's limited, but you can learn the tooling without paying. Your first Spark DataFrame query could be sampling the source data and saving it into a more efficient query format.
CSV isn't a great format for big data; Spark likes Parquet (and, for 2.3+, ORC). Embrace them for better performance.
Play with "notebooks"; Apache Zeppelin is one you can install and run locally.
Like I say, learn to play with small amounts. Spark is very interactive & working with small datasets is an easy way to learn fast.
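One way to carve out the small subset mentioned above, before Spark is even installed, is a plain Python pass over the CSV. This is only a sketch using the standard library; the function name sample_csv, the file names, and the default fraction are assumptions for illustration.

```python
import csv
import random


def sample_csv(src_path, dst_path, fraction=0.001, seed=42):
    """Copy the header plus a random ~fraction of rows; return the number of rows kept."""
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    kept = 0
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader, None)
        if header is None:  # empty input file
            return 0
        writer.writerow(header)
        for row in reader:  # streams row by row, so memory use stays small
            if rng.random() < fraction:
                writer.writerow(row)
                kept += 1
    return kept
```

For example, sample_csv("big.csv", "small.csv", fraction=0.001) would keep roughly 0.1% of the rows (on the order of 100 MB from a 100 GB file), which is a comfortable size for a local standalone Spark or a notebook.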
There are many ways to do that, but it depends on your case. As far as I know, HDFS with its default configuration (without any specific tuning) works fine. The majority of Hadoop tuning guides focus on the YARN side. So, here is a plan:
Generally speaking, you can put your (raw) data in HDFS, load it in Apache Spark, and save it as Parquet/ORC like below:
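from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
spark = SparkSession.builder.getOrCreate()
myschema = StructType([StructField("FirstName", StringType(), True), StructField("LastName", StringType(), True)])
mydf = spark.read.format("com.databricks.spark.csv").option("header", "true").schema(myschema).option("delimiter", ",").load("hdfs://hadoopmaster:9000/user/hduser/mydata.csv")
# Save the data as Parquet once, then read the Parquet copy back
mydf.write.parquet("hdfs://hadoopmaster:9000/user/hduser/DataInParquet")
newdf = spark.read.parquet("hdfs://hadoopmaster:9000/user/hduser/DataInParquet")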
Finally, compare mydf.count() with newdf.count(). The count on the Parquet data should run faster than on the raw CSV. In addition, your data size should decrease substantially (from 100 GB to roughly 24 GB in this estimate).
If you are new to Hadoop and Spark and want to set up a Hadoop environment in the cloud, I would suggest going with Elastic MapReduce (EMR) from AWS. You can create an on-demand Spark cluster with a user-defined configuration to process a wide range of data sets.
You can also set up a Hadoop cluster on top of EC2 instances, or on any cloud platform, with the required number of nodes and sufficient RAM and CPU. Storage-optimized instances are preferred here for analyzing a large data set.
We do not need to worry about storage cost: for storage-optimized instances, AWS includes ephemeral storage at no extra charge, with data disks of 1 - 2 TB depending on instance size.
Note: Data in ephemeral storage is lost when the instance is stopped or terminated. We can persist the processed data in S3 at low cost.
When it comes to cluster configuration, here is a list of things to check:
Spark on YARN is preferred.
Set the minimum and maximum cores and memory in the YARN NodeManager container settings for your Spark executors.
Enable dynamic resource allocation in Spark.
Set the container size and the Spark memory fraction as high as practical to avoid repeated shuffling, frequent spilling, and eviction of cached data.
Use Kryo serialization for better performance.
Enable compression for map outputs before shuffling.
Enable the Spark web UI to track your application's tasks and stages.
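As a rough sketch of how several items in this checklist map onto Spark configuration keys, the snippet below assembles a spark-submit command line. The property names are standard Spark settings; the memory and core values, the helper name spark_submit_args, and the script name my_job.py are placeholders that would need tuning for a real cluster.

```python
# Illustrative values only; tune memory and cores to your nodes.
spark_conf = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",  # Kryo serialization
    "spark.shuffle.compress": "true",           # compress map outputs before shuffling
    "spark.dynamicAllocation.enabled": "true",  # scale the number of executors with load
    "spark.shuffle.service.enabled": "true",    # commonly needed with dynamic allocation on YARN
    "spark.executor.memory": "8g",              # placeholder value
    "spark.executor.cores": "4",                # placeholder value
    "spark.ui.enabled": "true",                 # Spark web UI
}


def spark_submit_args(conf, app, master="yarn"):
    """Build a spark-submit argument list from a dict of Spark properties."""
    args = ["spark-submit", "--master", master]
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    args.append(app)
    return args


print(" ".join(spark_submit_args(spark_conf, "my_job.py")))
```

Keeping the settings in one dict makes it easy to try different executor sizes between runs while leaving the rest of the command unchanged.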
How does data locality work with OpenStack Swift on IBM Bluemix?

I'm currently playing around with the Apache Spark Service in IBM Bluemix. Since the IBM Cloud relies on OpenStack Swift as data storage for this service, I'm wondering if there is any data locality (or at least the possibility of it) with that architecture.
If I'm right, with HDFS the Spark driver asks the HDFS namenode about the datanodes containing the various blocks of a file and then schedules the work to the Spark workers.
So I've checked the Swift API: there is a Range parameter which would allow a Spark worker to at least read only local blocks, but how can the Spark driver find out these ranges?
Any ideas?
This is the disaggregation of compute and storage. That is, the Spark compute nodes are not shared at all with the Swift cluster storage nodes. This brings the benefit of scaling compute separately from storage, and vice versa. But in this model you cannot have data locality, by definition. Roughly, the way this works is that each Spark executor pulls its own range of blocks of the object from the Swift cluster, so that no executor needs to pull in all the object data only to operate on its own portion, which would be inefficient. But the blocks are still pulled from the remote Swift cluster, so they are not local. The only question is whether pulling the blocks into each executor is fast enough not to slow you down. In the case of the Bluemix Apache Spark Service and the Bluemix or SoftLayer Object Storage service, there is low latency and a fast network between them.
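To make the "each executor pulls its own range" idea concrete, here is a minimal sketch of how an object could be divided into byte ranges and expressed as HTTP Range headers. It is not the actual Spark or Swift connector logic; the function names and the sizes are assumptions for illustration.

```python
def byte_ranges(object_size, split_size):
    """Return inclusive (start, end) byte offsets that together cover the object."""
    if object_size < 0 or split_size <= 0:
        raise ValueError("object_size must be >= 0 and split_size must be > 0")
    return [
        (start, min(start + split_size, object_size) - 1)
        for start in range(0, object_size, split_size)
    ]


def range_header(start, end):
    """HTTP Range header asking for bytes start..end (inclusive) of an object."""
    return {"Range": f"bytes={start}-{end}"}


# A 1000-byte object read in 300-byte splits, one split per task
ranges = byte_ranges(object_size=1000, split_size=300)
print(ranges)  # [(0, 299), (300, 599), (600, 899), (900, 999)]
print(range_header(*ranges[0]))  # {'Range': 'bytes=0-299'}
```

The ranges here are derived purely from the object size, so the driver needs no knowledge of where the bytes physically live. That is consistent with the point that there is no locality to exploit in this model.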
What's the best way to mirror a live Cassandra cluster for analytics tasks?

Assuming a live cluster with several DCs, what's the best way to set up some nodes that are dedicated to analytic queries?
Analytic nodes will be hosted in a separate (routed) network and must not write any data back to the production nodes. They also must not count toward any consistency level (CL). This especially applies to EACH_QUORUM, which will be used for some writes. Analytics nodes may be offline at any time.
All solutions I've looked into seem to have their own drawbacks.
1) Take snapshots on production and transfer to independent analytics cluster
Significant update delay
IO intensive either on network or disk (e.g. rsync)
Lots of duplicate data due to different replication factors (3:1 prod. vs analytics)
Mismatch in SSTable row ranges and cluster topology on the analytics cluster may require using sstableloader
2) Use write survey mode to establish read-only nodes
Not 100% sure how this could be done for setting up multiple survey nodes to cover the whole ring
Queries can only be executed against each node locally as they could not be part of a coordinated execution
3) Add regular DC dedicated for analytics
EACH_QUORUM will fail if the analytics cluster is not available
Queries on production should not be served from analytics
Would require a way to prevent users on analytics from executing queries or updates on production
For Hadoop, which data storage to choose, Amazon S3 or Azure Blob Store?

I am working on a Hadoop project and generating lots of data in my local cluster. Sooner or later I will be using a cloud-based Hadoop solution, because my Hadoop cluster is very small compared to the real workload; however, I have not yet decided which one I will use, i.e. Windows Azure based, EMR, or something else. I am generating lots of data locally and want to store it in some cloud-based storage, given that I will use this data with Hadoop later, but very soon.
I am looking for suggestions on which cloud store to choose, based on someone's experience. Thanks in advance.
First of all, it is a great question. Let's try to understand how data is processed in Hadoop:
In Hadoop, all data is processed on the Hadoop cluster. This means that when you process any data, that data is copied from its source to HDFS, which is an essential component of Hadoop.
Only after the data is copied to HDFS do you run Map/Reduce jobs on it to get your results.
That means it does not matter what or where your data source is (Amazon S3, Azure Blob, SQL Azure, SQL Server, an on-premises source, etc.); you will have to move/transfer/copy your data from the source to HDFS, within the limits of Hadoop.
Once the data is processed in the Hadoop cluster, the result is stored at the location you configured in your job. The output destination can be HDFS or an outside location accessible from the Hadoop cluster.
Once you have data copied to HDFS, you can keep it on HDFS as long as you want, but you will have to pay the price of using the Hadoop cluster.
In some cases, when you run Hadoop jobs at intervals and the data move/copy can be done quickly, it is good to have a strategy to 1) acquire a Hadoop cluster, 2) copy the data, 3) run the job, 4) release the cluster.
So, based on the above details, when you choose a data source in the cloud for your Hadoop cluster you would have to consider the following:
If you have large data to process (which is normal with Hadoop clusters), consider different data sources and the time it will take to copy/move data from those data sources to HDFS, because this will be your first step.
You would need to choose a data source with the lowest network latency, so you can get data in and out as fast as possible.
You also need to consider how you will move a large amount of data from your current location to any cloud store. The best option would be a storage service to which you can ship your data disks (HDD, tape, etc.), because uploading multiple TB of data will take a great amount of time.
Amazon EMR (already available), Windows Azure (HadoopOnAzure in CTP) and Google (BigQuery in Preview, based on Google Dremel) provide pre-configured Hadoop clusters in the cloud, so you can choose where you want to run your Hadoop job and then consider the cloud storage.
Even if you choose one cloud data store and later decide to move to another because you want to use a different Hadoop cluster in the cloud, you can certainly transfer the data; however, consider the time and the data transfer support available to you.
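To put a number on the point about uploading multiple TB, a back-of-the-envelope estimate is just size divided by sustained bandwidth. The sketch below assumes decimal units (1 TB = 10^12 bytes) and a link that sustains its full rated speed; the function name and the example figures are illustrative, not measurements.

```python
def transfer_days(size_tb, bandwidth_mbps, efficiency=1.0):
    """Estimate days needed to move size_tb terabytes over a bandwidth_mbps link.

    efficiency is the fraction of the rated bandwidth actually sustained (assumed).
    """
    bits = size_tb * 1e12 * 8                        # terabytes -> bits
    seconds = bits / (bandwidth_mbps * 1e6 * efficiency)
    return seconds / 86400                           # seconds -> days


# 5 TB over a 100 Mbit/s uplink at full speed
print(round(transfer_days(5, 100), 1))  # 4.6
```

At several days per transfer even under ideal conditions, shipping physical disks or generating the data close to the chosen cloud store can be the more practical option.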
