Processing single file on S3 using Spark - apache-spark

I have a single file located on S3 that I want to process with Spark using multiple nodes. How does Spark implement that under the hood? Does each worker node read a portion of the data from S3 (using byte-range requests)? I'm trying to understand the differences between using Spark on HDFS and on S3 in terms of parallel processing. Does it matter when I use EMR?

How does Spark implement that under the hood?
There are many public articles explaining how Spark works under the hood. In short: for a splittable format, Spark divides the file into partitions and each worker task reads its assigned byte range from S3 with ranged GET requests (a compressed format such as gzip cannot be split, so it is read by a single task).
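As a rough illustration of that splitting for a splittable format, here is a minimal PySpark sketch; the bucket, file path and partition size are hypothetical, and it assumes the s3a connector (hadoop-aws) and AWS credentials are already configured:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("single-s3-file")
         # cap each partition at ~128 MB, so each task issues a ranged GET for its slice
         .config("spark.sql.files.maxPartitionBytes", 128 * 1024 * 1024)
         .getOrCreate())

df = spark.read.option("header", "true").csv("s3a://my-bucket/big-file.csv")
print(df.rdd.getNumPartitions())  # number of byte-range splits the file was divided into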
I'm trying to understand the differences between using Spark on HDFS and on S3 in terms of parallel processing. Does it matter when I use EMR?
It depends on your use case. In general, it boils down to:
You would choose S3 over HDFS as a persistent storage option which can contain your data beyond your EMR cluster lifetime.
Unlimited (theoretically) storage limit.
High SLA and durability.
Cost: HDFS on EMR is ephemeral, so with S3 you do not need to keep clusters running just to keep the data available.
etc.
Vs
HDFS is faster for I/O and for intermediate/temporary data, since S3 access involves API calls over the network.

Related

How to measure the impact of data-movement in my Spark Job?

Some concepts of how to use Apache Spark efficiently with a database are not yet clear to me.
I was reading the book Spark: Big Data made simple and the author states (ch.1 pg.5):
"Data is expensive to move so Spark focuses on performing computations over the data, no matter where it resides."
and
"Although Spark runs well on Hadoop storage, today it is also used broadly in environments for which the Hadoop architecture does not make sense, such as the public cloud (where storage can be purchased separately from computing) or streaming applications."
I understood that, as part of its philosophy, Spark decouples storage from computing. In practice, this can lead to data movement when the data does not reside on the same physical machine as the Spark workers.
My questions are:
How to measure the impact of data movement in my Job? For example, how to know if the network/database throughput is the bottleneck in my Spark job?
What's the IDEAL (if exists) use of spark? Tightly coupled processing + data storage, with the workers in the same physical machine as the database instances, for minimal data movement? Or can I use a single database instance (with various workers) as long as it can handle a high throughput and network traffic?
With a super-fast network connection, data is no longer costly to move. That was the case 15 years ago, but not anymore. Most Spark jobs nowadays run with the data residing in an object store like S3. When Spark runs, it fetches the data from S3 and performs the operations. We like this approach because it means we do not have to maintain a massive long-running Hadoop cluster; we run the Spark job only when required.
The minimal data movement hypothesis is no longer valid. The major bottleneck in modern computing is CPU speed, not the data transfer cost.
However, to your question about how to measure the data transfer cost: you can run two experiments, one with the data in a Hadoop cluster and one with the data in an object store like S3, and check the difference in the Spark job's run time.
An important thing to note: it is not always important to run the Spark job as fast as possible. You need to keep a balance between your workflow's SLA requirements and the maintainability of the cluster and data.
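As a sketch of that experiment, assuming the same Parquet dataset has been copied to both an HDFS path and an S3 bucket (both paths below are hypothetical):
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-vs-s3-timing").getOrCreate()

def time_full_scan(path):
    # force a full read of the dataset and return the elapsed wall-clock seconds
    start = time.time()
    spark.read.parquet(path).count()
    return time.time() - start

print("HDFS:", time_full_scan("hdfs:///data/events"))     # hypothetical path
print("S3:  ", time_full_scan("s3a://my-bucket/events"))   # hypothetical path
The stage pages in the Spark UI (input size and task time per stage) give a finer-grained view of where the time actually goes.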
If you are working with data in S3, IO is limited to roughly 5K reads/s and 3K writes/s per bucket prefix, and the inefficiencies of GET requests dominate bandwidth. If you do too much S3 IO against the same part of an S3 bucket, you get throttled, and adding more workers just makes it worse.
Also S3 latency isn't great when the input stream needs to drain/abort the current request and start a new GET with a different range.
If you are using the s3a connector with recent Hadoop 3.3.3+ jars (and you should be), you can get it to print lots of stats on S3 IO.
If you call toString() on an input stream, it prints the IO it has done: bytes read, bytes discarded, GET calls, latency.
If you set spark.hadoop.fs.iostatistics.logging.level to "info", you get a summary of all IO done against a bucket when a worker process shuts down.
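For example, a minimal sketch of wiring that property into a PySpark session; the bucket and path are hypothetical:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("s3a-io-stats")
         # ask the s3a connector to log an IOStatistics summary when filesystems are closed
         .config("spark.hadoop.fs.iostatistics.logging.level", "info")
         .getOrCreate())

df = spark.read.parquet("s3a://my-bucket/table/")  # hypothetical path
df.count()
# GET counts, bytes read/discarded and latencies then show up in the executor logs
# when the worker processes shut down.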

Can we create a Hadoop Cluster on Dataproc with 0%-2% of HDFS?

Is it possible to create a Hadoop cluster on Dataproc with no or very minimal HDFS space by setting dfs.datanode.du.reserved to about 95% or 100% of the total node size? The plan is to use GCS for all persistent storage, while the local file system will primarily be used for Spark's shuffle data. Some of the Hive queries may still need scratch space on HDFS, which explains the need for minimal HDFS.
I did create a cluster with a 10-90 split and did not notice any issues with my test jobs.
Could there be stability issues with Dataproc if this approach is taken?
Also, are there concerns with removing the DataNode daemon from Dataproc's worker nodes, thereby using the primary workers as compute-only nodes? The rationale is that Dataproc currently doesn't allow a mix of preemptible and non-preemptible secondary workers, so I want to check whether we can repurpose primary workers as compute-only non-PVM nodes while the secondary workers remain compute-only PVM nodes.
I am starting a GCP project. I am well versed in Azure, less so in AWS, but I know enough there, having done a DDD setup.
What you describe is similar to AWS setup and I looked recently here: https://jayendrapatil.com/google-cloud-dataproc/
My impression is that you can run without HDFS here as well - 0%. The key point is that performance for a suite of jobs will - as on AWS and Azure - benefit from writing to and reading from ephemeral HDFS, as it is faster than Google Cloud Storage. I cannot see stability issues; I can use Spark now without HDFS if I really want.
On the 2nd question, stick to what they have engineered. Why try and force things? On AWS we lived with the limitations on scaling down with Spark.
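For reference, here is a rough sketch of how the reservation in the original plan could be passed at cluster-creation time with the google-cloud-dataproc Python client; the project, region, cluster name, machine types and the reservation value are all placeholders and would need to match your actual disk sizes:
from google.cloud import dataproc_v1

region = "us-central1"  # placeholder
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": "my-project",            # placeholder
    "cluster_name": "mostly-gcs-cluster",  # placeholder
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        "software_config": {
            # reserve most of each data disk for non-HDFS use (bytes, placeholder value)
            "properties": {"hdfs:dfs.datanode.du.reserved": "900000000000"}
        },
    },
}

operation = client.create_cluster(
    request={"project_id": "my-project", "region": region, "cluster": cluster}
)
operation.result()  # blocks until the cluster is created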

Is distributed file storage(HDFS/Cassandra/S3 etc.) mandatory for spark to run in clustered mode? if yes, why?

Is distributed file storage(HDFS/Cassandra/S3 etc.) mandatory for spark to run in clustered mode? if yes, why?
Spark is a distributed data processing engine used for computing over huge volumes of data. Let's say I have a huge volume of data stored in MySQL which I want to process. Spark reads the data from MySQL and performs in-memory (or on-disk) computation on the cluster nodes themselves. I am still not able to understand why distributed file storage is needed to run Spark in clustered mode.
is distributed file storage(HDFS/Cassandra/S3 etc.) mandatory for spark to run in clustered mode?
Pretty Much
if yes, why?
Because the Spark workers take input from a shared table, distribute the computation amongst themselves, and are then choreographed by the Spark driver to write their data back to another shared table.
If you are trying to work exclusively with MySQL, you might be able to use the local filesystem ("file://") as the cluster FS. However, if any RDD or stage in a Spark query does try to use a shared filesystem as a way of committing work, the output isn't going to propagate from the workers (which will have written to their local filesystems) to the Spark driver (which can only read its own local filesystem).
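As a sketch of that MySQL-only pattern (the host, database, tables and credentials are hypothetical, and the MySQL JDBC driver is assumed to be on the classpath):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mysql-only").getOrCreate()

# Read from MySQL over JDBC; Spark can split the read into parallel
# partitions on a numeric column.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://dbhost:3306/shop")
      .option("dbtable", "orders")
      .option("user", "spark").option("password", "secret")
      .option("partitionColumn", "order_id")
      .option("lowerBound", "1").option("upperBound", "1000000")
      .option("numPartitions", "8")
      .load())

counts = df.groupBy("customer_id").count()

# Writing back over JDBC keeps everything in MySQL, so no shared filesystem is needed...
(counts.write.format("jdbc")
 .option("url", "jdbc:mysql://dbhost:3306/shop")
 .option("dbtable", "order_counts")
 .option("user", "spark").option("password", "secret")
 .mode("overwrite").save())

# ...whereas counts.write.parquet("file:///tmp/out") would scatter part-files across
# each worker's local disk, which the driver cannot read back as one dataset.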

What is the advantage of using spark with HDFS as file storage system and YARN as resource manager?

I am trying to understand whether Spark is an alternative to the vanilla MapReduce approach for the analysis of big data. Since Spark keeps the operations on the data in memory, when using HDFS as the storage system for Spark, does it take advantage of the distributed storage of HDFS? For instance, suppose I have a 100 GB CSV file stored in HDFS and I want to do analysis on it. If I load it from HDFS into Spark, will Spark load the complete data in memory to do the transformations, or will it use the distributed environment that HDFS provides for storage, which is leveraged by MapReduce programs written in Hadoop? If not, then what is the advantage of using Spark over HDFS?
PS: I know Spark spills to disk if there is a RAM overflow, but does this spill occur per node of the cluster (say 5 GB per node) or for the complete data (100 GB)?
Spark jobs can be configured to spill to local executor disk, if there is not enough memory to read your files. Or you can enable HDFS snapshots and caching between Spark stages.
You mention CSV, which is just a bad format to have in Hadoop in general. If you have 100GB of CSV, you could just as easily have less than half that if written in Parquet or ORC...
At the end of the day, you need some processing engine and some storage layer. For example, Spark on Mesos or Kubernetes might work just as well as on YARN, but those are separate systems and are not bundled and tied together as nicely as HDFS and YARN. Plus, like MapReduce, when using YARN you move the execution to the NodeManagers on the DataNodes rather than pulling the data over the network, which you would be doing with other Spark execution modes. The NameNode and ResourceManager coordinate this communication for where data is stored and processed.
If you are convinced that MapReduceV2 can be better than Spark, I would encourage looking at Tez instead

What should be the Hadoop configuration to use for 100 GB of CSV files for analysis in Spark?

I have around 100 GB of data in CSV format on which I intend to do some transformations like aggregation and data splitting, and after that do some clustering using the ML package of Apache Spark.
I have tried uploading the data to MySQL and automating the process in Python, but it's taking too much time to build any solution.
What configuration do I need to set up, and how should I start with Spark?
I am new to Spark. I am planning to use cloud services.
I'm going to recommend you learn to use Spark locally with a small subset of the data; you can run it standalone with a few tens, moving to hundreds, of MB. It's limited, but you can learn the tooling without paying. Your first Spark dataframe query could be sampling the source data and saving it into a more efficient query format.
CSV isn't a great format for big data; Spark likes Parquet (and, for 2.3+, ORC). Embrace them for better performance.
Play with "notebooks"; Apache Zeppelin is one you can install and run locally.
Like I say, learn to play with small amounts. Spark is very interactive & working with small datasets is an easy way to learn fast.
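For instance, a minimal local sketch of that "sample, then convert to a better format" idea; the file paths and sample fraction are placeholders:
from pyspark.sql import SparkSession

# local[*] runs Spark inside a single process on your laptop - no cluster needed
spark = (SparkSession.builder
         .master("local[*]")
         .appName("learn-spark-locally")
         .getOrCreate())

df = (spark.read.option("header", "true")
      .option("inferSchema", "true")
      .csv("data/sample.csv"))

# Take a small sample and store it as Parquet for faster follow-up queries
df.sample(False, 0.01).write.mode("overwrite").parquet("data/sample_parquet")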
There are many ways to do that, but it depends on your case. As far as I know, HDFS with the default configuration (without any specific tuning) works fine. The majority of Hadoop tuning guides focus on the YARN side. So, let me lay out a plan like below:
Generally speaking, you can put your (raw) data in HDFS, load it into Apache Spark, and save it in Parquet/ORC like below:
from pyspark.sql.types import StructType, StructField, StringType

# Explicit schema avoids an extra pass over the CSV to infer column types
myschema = StructType([StructField("FirstName", StringType(), True),
                       StructField("LastName", StringType(), True)])

# Read the raw CSV from HDFS with the built-in CSV source
mydf = (spark.read.format("csv").option("header", "true").option("delimiter", ",")
        .schema(myschema).load("hdfs://hadoopmaster:9000/user/hduser/mydata.csv"))
mydf.count()

# Rewrite the data as Parquet (columnar, compressed) and read it back
mydf.repartition(6).write.format("parquet").save("hdfs://hadoopmaster:9000/user/hduser/DataInParquet")
newdf = spark.read.parquet("hdfs://hadoopmaster:9000/user/hduser/DataInParquet")
newdf.count()
Finally, compare the run time of mydf.count() with newdf.count(). The Parquet version will run faster than the raw CSV. In addition, your data size will decrease from 100 GB to roughly 24 GB.
If you are new to Hadoop and Spark and want to set up a Hadoop environment in the cloud, I would suggest going with Elastic MapReduce (EMR) on AWS. You can create an on-demand Spark cluster with a user-defined configuration to process a wide range of data sets.
https://aws.amazon.com/emr/
https://aws.amazon.com/emr/details/spark/
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-launch.html
Or
You can set up a Hadoop cluster on top of EC2 instances, or on any cloud platform, with the required number of nodes and sufficient RAM and CPU. Storage-optimized instances are preferred here for analyzing a large data set.
You do not need to worry much about storage cost: for storage-optimized instances, AWS includes ephemeral instance-store disks of roughly 1-2 TB, depending on instance size, at no extra charge.
Note: data in the ephemeral storage is lost when the VM is stopped or terminated, so persist the processed data in S3, which is the cheapest option.
When it comes to cluster configuration, here is a list of things to check (a configuration sketch follows the reference link below):
Spark on YARN is preferred
Set minimum and maximum cores and memory in the YARN NodeManager container settings for your Spark executors.
Enable dynamic allocation in Spark so the executor count scales with the workload.
Set the container size towards the maximum and the Spark memory fraction towards the maximum to avoid repeated shuffling, frequent spilling, and eviction of cached data.
Use Kryo serialization for better performance.
Enable compression for map outputs before shuffling.
Enable the Spark web UI to track your application's tasks and their stages.
Apache Spark Config Reference: https://spark.apache.org/docs/2.1.0/configuration.html
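A hedged sketch of how parts of that checklist might look as SparkSession settings; all values are illustrative starting points rather than tuned recommendations, and on EMR the YARN container limits must accommodate the executor sizes:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("emr-100gb-csv")
         # executor sizing - must fit within the YARN NodeManager container limits
         .config("spark.executor.memory", "8g")
         .config("spark.executor.cores", "4")
         # let YARN grow and shrink the number of executors with the workload
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.shuffle.service.enabled", "true")
         # Kryo serialization and compressed map output for shuffles
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .config("spark.shuffle.compress", "true")
         # fraction of the heap shared by execution and cached data
         .config("spark.memory.fraction", "0.6")
         .getOrCreate())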
