Apache Spark AWS Glue job versus Spark on Hadoop cluster for transferring data between buckets

Let's say I need to transfer data between two S3 buckets as an ETL job and perform a simple transformation on the data during the transfer (taking only some of the columns and filtering by ID).
The data is Parquet files whose size ranges from 1 GB to 100 GB.
Which should be more efficient in terms of speed and cost: an Apache Spark Glue job, or Spark on a Hadoop cluster with X machines?

The answer is basically the same for any serverless (Glue) vs. non-serverless (EMR) pair of equivalent services.
The first is faster to set up, but less configurable and probably more expensive. The second gives you more options for optimizing performance and cost, but don't forget to include the cost of managing the service yourself. You can use the AWS Pricing Calculator if you need a price estimate up front.
I would definitely start with Glue and move to something more complicated only if problems arise. Also, keep in mind that EMR Serverless is now available as well.
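Whichever service runs it, the transformation described in the question is only a few lines of Spark; a minimal PySpark sketch (bucket names, column names, and ID values below are hypothetical placeholders) might look like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bucket-to-bucket-etl").getOrCreate()

(
    spark.read.parquet("s3://source-bucket/input/")   # use s3a:// paths on a plain Spark/Hadoop cluster
    .select("id", "event_time", "value")              # take only part of the columns
    .filter("id IN (101, 102, 103)")                  # filter by ID
    .write.mode("overwrite")
    .parquet("s3://target-bucket/output/")
)

On Glue the same logic would normally sit inside the generated job boilerplate; DynamicFrames are optional for a simple projection and filter like this.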

I read this question when determining whether it was worthwhile to switch from AWS Glue to AWS EMR.
With configurable EC2 Spot Instances on EMR we drastically reduced the runtime of a Glue job that read 1 GB-4 TB of uncompressed CSV data. Spot Instances let us use much larger and faster Graviton EC2 instances that could load more data into RAM, reducing spills to disk. Another benefit was dropping DynamicFrames, which are very useful when you do not know the schema but were overhead we did not need. The larger instances (bigger than anything AWS Glue provides) reduced our runtime somewhat, but more importantly we cut our costs by 40-75%, yes, even with the EC2 + EBS + EMR overhead per instance. We went from $25-$250 a day on Glue to $2-$60 on EMR; the monthly cost for this process was $1,600 in AWS Glue and is now under $500. We run EMR as a transient job flow (run_job_flow) that terminates when idle, so it essentially acts like serverless Glue (see the sketch below).
We did not go with EMR Serverless because it does not support Spot Instances, which were probably the biggest benefit for us.
The only problem is that we did not switch earlier. We are now moving all of our AWS Glue jobs to AWS EMR.
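For illustration, a hedged boto3 sketch of that transient-cluster pattern with Spot Instances; the instance types, counts, script path, and IAM role names below are placeholders, not the poster's actual setup:

import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="parquet-etl",
    ReleaseLabel="emr-6.9.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "InstanceType": "r6g.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE", "Market": "SPOT",
             "InstanceType": "r6g.2xlarge", "InstanceCount": 4},
        ],
        # Tear the cluster down as soon as the steps finish, so it behaves
        # like a serverless job rather than a long-running cluster
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "spark-etl",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/etl.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])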

Related

How to measure the impact of data-movement in my Spark Job?

Some concepts of how to use Apache Spark efficiently with a database are not yet clear to me.
I was reading the book Spark: Big Data made simple and the author states (ch.1 pg.5):
"Data is expensive to move so Spark focuses on performing computations over the data, no matter where it resides."
and
"Although Spark runs well on Hadoop storage, today it is also used broadly in environments for which the Hadoop architecture does not make sense, such as the public cloud (where storage can be purchased separately from computing) or streaming applications."
I understand that, as part of its philosophy, Spark decouples storage from compute. In practice, this can lead to data movement when the data does not reside on the same physical machine as the Spark workers.
My questions are:
How to measure the impact of data movement in my Job? For example, how to know if the network/database throughput is the bottleneck in my Spark job?
What's the IDEAL use of Spark (if one exists)? Tightly coupled processing and data storage, with the workers on the same physical machines as the database instances, for minimal data movement? Or can I use a single database instance (with various workers) as long as it can handle high throughput and network traffic?
With a super-fast network connection, data is no longer costly to move. That was the case 15 years ago, but not anymore. Most Spark jobs nowadays run with the data residing in an object store like S3: when Spark runs, it fetches the data from S3 and performs the operation. We like this approach because it means we do not have to maintain a massive, long-running Hadoop cluster; we run the Spark job only when required.
The minimal-data-movement hypothesis is no longer valid. The major bottleneck in modern computing is CPU speed, not the data transfer cost.
However, to your question about how to measure the data transfer cost: you can run two experiments, one with the data in a Hadoop cluster and one with the data in an object store like S3, and check the time difference in the Spark job (a rough sketch follows below).
An important thing to note: it is not always important to run a Spark job super fast. You need to keep a balance between your workflow's SLA requirements and the maintainability of the cluster and data.
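A rough sketch of that two-experiment comparison, assuming the same Parquet dataset has been copied to both an HDFS path and an S3 bucket (both paths below are hypothetical):

import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-movement-probe").getOrCreate()

def timed_count(path):
    # count() forces a full scan of the input, so the elapsed time includes
    # whatever it costs to pull the data from the given storage system
    start = time.perf_counter()
    rows = spark.read.parquet(path).count()
    return rows, time.perf_counter() - start

hdfs_rows, hdfs_secs = timed_count("hdfs://namenode:9000/data/events/")
s3_rows, s3_secs = timed_count("s3a://my-bucket/data/events/")
print(f"HDFS: {hdfs_rows} rows in {hdfs_secs:.1f}s | S3: {s3_rows} rows in {s3_secs:.1f}s")

For a finer-grained view of where the time goes, the Spark UI's stage metrics (input size, task time, shuffle read/write) are the place to look.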
If you are working with data in S3, the per-prefix request-rate limits (roughly 5K reads/s and 3K writes/s) and the inefficiencies of GET requests dominate bandwidth. If you do too much S3 IO against the same part of an S3 bucket, you get throttled, and adding more workers just makes it worse.
Also, S3 latency isn't great when the input stream needs to drain/abort the current request and start a new GET with a different range.
If you are using the s3a connector with recent Hadoop 3.3.3+ jars (and you should), you can get it to print lots of statistics on its S3 IO.
If you call toString() on an input stream, it prints the IO it has done: bytes read, bytes discarded, GET calls, latency.
If you set spark.hadoop.fs.iostatistics.logging.level to "info", you get a summary of all IO done against a bucket when a worker process is shut down.
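As a sketch, turning that logging on from PySpark might look like this; the bucket path is hypothetical, and it assumes recent Hadoop 3.3.x s3a jars are on the classpath:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-iostats-demo")
    # Ask the s3a connector to log a per-bucket IOStatistics summary
    # (GET count, bytes read/discarded, latencies) when workers shut down
    .config("spark.hadoop.fs.iostatistics.logging.level", "info")
    .getOrCreate()
)

df = spark.read.parquet("s3a://my-bucket/some/prefix/")
df.count()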

Can we create a Hadoop Cluster on Dataproc with 0%-2% of HDFS?

Is it possible to create a Hadoop cluster on Dataproc with no or very minimal HDFS space by setting dfs.datanode.du.reserved to about 95% or 100% of the total node size? The plan is to use GCS for all persistent storage while the local file system will primarily be used for Spark's shuffle data. Some of the Hive queries may still need the scratch on HDFS which explains the need for minimal HDFS.
I did create a cluster with a 10-90 split and did not notice any issues with my test jobs.
Could there be stability issues with Dataproc if this approach is taken?
Also, are there concerns with removing the DataNode daemon from Dataproc's worker nodes, thereby using the primary workers as compute-only nodes? The rationale is that Dataproc currently doesn't allow a mix of preemptible and non-preemptible secondary workers, so I want to check whether we can repurpose primary workers as compute-only non-PVM nodes while the secondary workers remain compute-only PVM nodes.
I am starting a GCP project and am well versed in Azure, less so in AWS, but I know enough there, having done a DDD setup.
What you describe is similar to the AWS setup; I looked recently here: https://jayendrapatil.com/google-cloud-dataproc/
My impression is that you can run without HDFS here as well, at 0%. The key point is that, as on AWS and Azure, a suite of jobs will benefit in performance from writing to and reading from ephemeral HDFS, since it is faster than Google Cloud Storage. I cannot see stability issues; you can use Spark without HDFS if you really want to.
On the second question, stick to what they have engineered. Why try to force things? On AWS we lived with the limitations on scaling down with Spark.

In which scenario should one prefer to create a Spark cluster on EC2 machines instead of using Elastic MapReduce?

Between processing realtime data using a Spark cluster on EC2 machines and using Elastic MapReduce, some of the differences are:
With Elastic MapReduce, one does not have to manage the infrastructure and cluster, whereas with a Spark cluster on EC2 machines one has to create the cluster and manage it.
With a Spark cluster on EC2, one has more control over the cluster than with Elastic MapReduce, which is a PaaS offering.
I went through the below related link:
Hadoop on EC2 vs Elastic Map Reduce
I understand that going with Elastic MapReduce gives the advantage of not having to manage the infrastructure and cluster. What I want to know is when one should prefer the other option, that is, creating a Spark cluster on EC2 machines instead of using Elastic MapReduce. Thanks.
You and the answer you shared have pretty much summed up the advantages and disadvantages of both, but I would like to mention a few things.
Someone mentioned in a comment on the answer you shared (and this impression does exist among people) that EMR just adds some cost on top of the EC2 nodes (the underlying master/compute nodes of Spark) and provides nothing but the cluster. That isn't the case.
What Elastic MapReduce focuses on is the elasticity and scalability part, meaning scalability for your jobs, where scalability is not just the number of nodes in the cluster but also things like:
Dynamically resizing the cluster while jobs are running.
Reduced and optimized spin-up time, efficient resubmission of steps, and options such as automatic termination on step completion.
Configuration, management, and upgrade time. As a small example, release versions automatically handle the Spark/Hadoop/other-application versions, giving you an easy way to upgrade, which you would otherwise have to do manually on EC2.
Ecosystem availability. The EMR ecosystem keeps growing; it may not matter when you start, but when your requirements grow, for example when you begin to integrate other systems (stream processing with Flink, say), it is much easier to simply select Flink, Pig, Hive and many more at launch time if you need them later.
There are libraries built on the AWS SDK, such as boto3 in Python, that help you submit steps, poll for completion, and so on, which are very helpful when you need to scale (see the sketch after this answer). There is also integration of EMR with orchestration frameworks like Airflow, which can sense the cluster state, resubmit steps, and spin up the cluster with one command within the pipeline.
Expanding on the previous point, EMR Notebooks, for example, give you a quick and interactive way to submit Spark jobs from a Jupyter notebook and see the results and job progress immediately, which can boost your productivity.
This point is the most important one from my experience: sometimes scaling a job out with more nodes saves you more money than a long-running job on a small number of nodes, because the cost of the added nodes is sometimes lower than the normalized hours you would spend on EC2 or a small EMR cluster. To share my experience, we had a job that used to run for 3 days; we started running it on a bigger EMR cluster, which reduced it to 6-8 hours at the same cost, and in fact a bit less.
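As an illustration of that boto3 workflow, here is a small sketch of submitting a step to an existing cluster and polling until it finishes; the cluster ID, step name, and script path are placeholders:

import boto3

emr = boto3.client("emr")

resp = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # ID of an already-running EMR cluster
    Steps=[{
        "Name": "nightly-aggregation",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/aggregate.py"],
        },
    }],
)

# Block until the step completes or fails; orchestration tools such as
# Airflow wrap these same calls in operators and sensors
emr.get_waiter("step_complete").wait(
    ClusterId="j-XXXXXXXXXXXXX",
    StepId=resp["StepIds"][0],
)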

Instance type for AWS Spark EMR Cluster

I am trying to pick an instance type for my Spark EMR clusters. I was wondering if anyone ever runs these types of clusters with EBS-only instances? By this I mean instance types such as r5.2xlarge, which do not have local disk. It strikes me as a bad idea, but I thought I'd check here to see if I am missing anything.
I am thinking of using r5d.2xlarge for masters and slaves as a general mix of compute, memory, and local storage for general workloads. Does that sound reasonable? My use case is hosting a Jupyter notebook interface for Spark that will do a wide variety of analytics, so I can't really pin down the precise workload for you to review, because I will end up doing ad-hoc analysis with this. Some analyses will involve large joins of two or more data sets, however.
Thanks,
Setjmp
If you need local storage you can rely on r3 instances; they come with fairly large instance storage, which is used for HDFS, and I think they are cheaper. But these days you can store almost everything on S3. I would recommend configuring S3 persistence for the Jupyter notebooks as well.
Even if there is no instance storage, you can easily attach EBS volumes. During EMR cluster creation there is a step to select the amount of EBS storage in advanced mode, so storage should not be a problem, I think.

What should the Hadoop configuration be for 100 GB of CSV files to be analyzed in Spark?

I have around 100 GB of data in CSV format on which I intend to do some transformations like aggregation and data splitting, and after that do some clustering using the ML package of Apache Spark.
I have tried uploading the data to MySQL and automating the process with Python, but it is taking too much time to build any solution.
What configuration do I need to set up, and how should I start with Spark?
I am new to Spark. I am planning to use cloud services.
I'm going to recommend you learn to use Spark locally with a small subset of the data; you can run it standalone with a few tens to hundreds of MB. It's limited, but you can learn the tooling without paying. Your first Spark DataFrame query could be sampling the source data and saving it into a more efficient query format.
CSV isn't a great format for big data; Spark likes Parquet (and, for 2.3+, ORC). Embrace them for better performance.
Play with "notebooks"; Apache Zeppelin is one you can install and run locally.
Like I say, learn to play with small amounts of data. Spark is very interactive, and working with small datasets is an easy way to learn fast.
There are many ways to do this, but it depends on your case. As far as I know, HDFS with the default configuration (without any specific tuning) works fine; the majority of Hadoop tuning guides focus on the YARN side. So let me lay out a plan along these lines:
Generally speaking, you can put your (raw) data in HDFS, load it in Apache Spark, and save it as Parquet/ORC like below:
from pyspark.sql.types import StructType, StructField, StringType

# Supplying an explicit schema avoids a full pass over the CSV just to infer types
myschema = StructType([StructField("FirstName", StringType(), True),
                       StructField("LastName", StringType(), True)])

# Spark's built-in CSV reader (Spark 2+) replaces the external com.databricks.spark.csv package
mydf = (spark.read.format("csv").option("header", "true").option("delimiter", ",")
        .schema(myschema).load("hdfs://hadoopmaster:9000/user/hduser/mydata.csv"))
mydf.count()
# Rewrite the data as compressed, columnar Parquet, then read it back
mydf.repartition(6).write.format("parquet").save("hdfs://hadoopmaster:9000/user/hduser/DataInParquet")
newdf = spark.read.parquet("hdfs://hadoopmaster:9000/user/hduser/DataInParquet")
newdf.count()
Finally, compare the runtime of mydf.count() with newdf.count(); the Parquet-backed count will run faster than the raw CSV. In addition, your data size will decrease from 100 GB to roughly 24 GB.
If you are new to Hadoop and Spark and are interested in setting up a Hadoop environment in the cloud, I would suggest going with Elastic MapReduce (EMR) on AWS. You can create an on-demand Spark cluster with a user-defined configuration to process a wide range of data sets.
https://aws.amazon.com/emr/
https://aws.amazon.com/emr/details/spark/
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-launch.html
Or
You can set up a Hadoop cluster on top of EC2 instances, or on any cloud platform, with the required number of nodes and sufficient RAM and CPU. Storage-optimized instances are preferred here for analyzing a large data set.
We do not need to worry much about storage cost: with storage-optimized instances, AWS includes ephemeral storage data disks of 1-2 TB depending on the instance size.
Note: data in the ephemeral storage is lost when the VM is rebooted. We can persist the processed data in S3 very cheaply.
When it comes to cluster configuration, here is the list of things to check (a config sketch follows the reference link below):
Spark on YARN is preferred.
Set minimum and maximum cores and memory in the YARN NodeManager container settings for your Spark executors.
Enable dynamic resource allocation in Spark.
Set the container size and the Spark memory fraction high enough to avoid repeated shuffles, frequent spilling, and eviction of cached data.
Use Kryo serialization for better performance.
Enable compression of map outputs before shuffling.
Enable the Spark web UI to track your application's tasks and stages.
Apache Spark Config Reference: https://spark.apache.org/docs/2.1.0/configuration.html
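A hedged sketch of how several of the settings above might be applied when building a SparkSession; the values are illustrative placeholders, not tuned recommendations for any particular cluster:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tuned-etl")
    .config("spark.dynamicAllocation.enabled", "true")   # dynamic resource allocation
    .config("spark.shuffle.service.enabled", "true")     # external shuffle service, needed for dynamic allocation on YARN
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.shuffle.compress", "true")            # compress map outputs before shuffling
    .config("spark.memory.fraction", "0.7")              # fraction of heap shared by execution and storage
    .getOrCreate()
)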
