How to measure the impact of data-movement in my Spark Job? - apache-spark

Some concepts of how to use Apache Spark efficiently with a database are not yet clear to me.
I was reading the book Spark: Big Data made simple and the author states (ch.1 pg.5):
"Data is expensive to move so Spark focuses on performing computations over the data, no matter where it resides."
and
"Although Spark runs well on Hadoop storage, today it is also used broadly in environments for which the Hadoop architecture does not make sense, such as the public cloud (where storage can be purchased separately from computing) or streaming applications."
I understand that, as part of its philosophy, Spark decouples storage from compute. In practice, this can lead to data movement when the data does not reside on the same physical machine as the Spark workers.
My questions are:
How can I measure the impact of data movement in my job? For example, how do I know whether network/database throughput is the bottleneck in my Spark job?
What is the IDEAL use of Spark (if one exists)? Tightly coupled processing + data storage, with the workers on the same physical machines as the database instances, for minimal data movement? Or can I use a single database instance (with several workers), as long as it can handle high throughput and network traffic?

With a super-fast network connection, data is no longer costly to move. That was the case 15 years ago, but not anymore. Most Spark jobs nowadays run with the data residing in an object store like S3: when Spark runs, it fetches the data from S3 and performs the operation. We like this approach because it means we do not have to maintain a massive, long-running Hadoop cluster; we run the Spark job only when required.
The minimal-data-movement hypothesis is no longer valid. The major bottleneck in modern computing is CPU speed, not data transfer cost.
However, to your question about how to measure the data transfer cost: you can run two experiments, one with the data in a Hadoop cluster and one with the data in an object store like S3, and check the time difference in the Spark job.
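A minimal sketch of that kind of experiment in PySpark, timing the same query against two copies of the data (the paths, bucket name, and aggregation column are placeholders, not anything from the original question):
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-movement-benchmark").getOrCreate()

def time_job(path):
    # Run the same simple aggregation against one copy of the data and time it
    start = time.time()
    spark.read.parquet(path).groupBy("some_column").count().collect()
    return time.time() - start

# Hypothetical locations: one copy on the cluster's HDFS, one in an S3 bucket
hdfs_seconds = time_job("hdfs:///data/events")
s3_seconds = time_job("s3a://my-bucket/data/events")
print(f"HDFS: {hdfs_seconds:.1f}s, S3: {s3_seconds:.1f}s")
If the S3 run is much slower while the executors sit mostly idle, the network/storage path is your bottleneck; if both runs take roughly the same time, the job is CPU-bound.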
One important thing to note: it is not always necessary to run the Spark job super fast. You need to keep a balance between your workflow's SLA requirements and the maintainability of the cluster and the data.

If you are working with data in S3, IO limits (roughly 5K/s reads and 3K/s writes) and the inefficiencies of GET requests dominate bandwidth. If you do too much S3 IO against the same part of an S3 bucket, you get throttled, and adding more workers just makes it worse.
Also, S3 latency isn't great when the input stream needs to drain/abort the current request and start a new GET with a different range.
If you are using the s3a connector with recent Hadoop 3.3.3+ jars (and you should be), you can get it to print lots of statistics on S3 IO.
If you call toString() on an input stream, it prints the IO it has done: bytes read, bytes discarded, GET calls, latency.
If you set spark.hadoop.fs.iostatistics.logging.level to "info", you get a summary of all IO done against a bucket when a worker process is shut down.
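For example, a session wired up for that logging might look like the sketch below; the option name comes straight from the setting above, while the bucket and dataset path are placeholders:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("s3a-io-stats")
         # Ask the s3a connector (Hadoop 3.3.3+) to log per-bucket IO statistics
         # at "info" level when filesystem instances are closed, e.g. at worker shutdown
         .config("spark.hadoop.fs.iostatistics.logging.level", "info")
         .getOrCreate())

# Placeholder bucket/path; GET counts, bytes read/discarded and latencies
# will then show up in the worker logs
spark.read.parquet("s3a://my-bucket/some/dataset").count()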

Related

Kubernetes vs Spark vs Spark on Kubernetes

So I have a use case where I will stream about 1,000 records per minute from Kafka. I just need to dump these records in raw form into a NoSQL DB, or something like a data lake for that matter.
I ran this through two approaches.
Approach 1
——————————
Create Kafka consumers in Java and run them as three different containers in Kubernetes. Since all the containers are in the same Kafka consumer group, they all contribute towards reading from the same Kafka topic and dumping data into the data lake. This works pretty quickly for the volume of workload I have.
Approach 2
——————————-
I then created a Spark cluster and used the same Java logic to read from Kafka and dump the data into the data lake.
Observations
———————————-
The performance of Kubernetes, if not better, was equal to that of the Spark job running in cluster mode.
So my question is: what is the real use case for using Spark over Kubernetes the way I am using it, or even Spark on Kubernetes?
Is Spark only going to rise and shine on much, much heavier workloads, say something of the order of 50,000 records per minute, or in cases where some real-time processing needs to be done on the data before dumping it to the sink?
Spark has more cost associated with it, so I need to make sure I use it only if it would scale better than the Kubernetes solution.
If your use case is only to archive/snapshot/dump records, I would recommend you look into Kafka Connect.
If you need to process the records you stream, e.g. aggregate or join streams, then Spark comes into the game. For this case you may also look into Kafka Streams.
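If you do go the Spark route for that kind of processing, a minimal Structured Streaming sketch looks roughly like this; the broker address, topic name, and the per-minute count are placeholders, not part of the question:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("kafka-aggregation").getOrCreate()

# Read the stream from Kafka (placeholder broker and topic names)
events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "records-topic")
          .load())

# Example processing: count records per minute -- the kind of aggregation
# where Spark or Kafka Streams earns its keep over a plain consumer
counts = (events
          .withColumn("value", col("value").cast("string"))
          .groupBy(window(col("timestamp"), "1 minute"))
          .count())

# Write the aggregated stream out (console sink here, just for the sketch)
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()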
Each of these frameworks has its own tradeoffs and performance overheads, but in any case you save a lot of development effort by using tools made for the job rather than developing your own consumers. These frameworks also already handle most of the failure handling, scaling, and configurable delivery semantics, and they have enough config options to tune the behaviour to most cases you can imagine. Just choose the available integration and you're good to go! And of course, beware of open-source bugs ;).
Hope it helps.
Running Spark inside Kubernetes is only recommended when you have a lot of expertise doing it: Kubernetes doesn't know it's hosting Spark, and Spark doesn't know it's running inside Kubernetes, so you will need to double-check every feature you decide to rely on.
For your workload, I'd recommend sticking with Kubernetes. The elasticity, performance, monitoring tools, and scheduling features, plus the huge community support, add up well in the long run.
Spark is an open-source, scalable, massively parallel, in-memory execution engine for analytics applications, so it will really spark when your workload becomes more processing-heavy. It simply doesn't have much room to rise and shine if you are only dumping data, so keep it simple.

Processing single file on S3 using Spark

I have a single file located on S3 that I want to process with Spark using multiple nodes. How does Spark implement that under the hood? Does each worker node read a portion of the data from S3 (using byte-range requests)? I'm trying to understand the differences between using Spark on HDFS and on S3 in terms of parallel processing. Does it matter if I use EMR?
How does Spark implement that under the hood?
There are many public articles explaining how Spark works, like this.
I'm trying to understand the differences between using Spark on HDFS and on S3 in terms of parallel processing. Does it matter if I use EMR?
It depends on what your use case is. In general, it boils down to:
You would choose S3 over HDFS as a persistent storage option which can hold your data beyond your EMR cluster's lifetime.
Unlimited (theoretically) storage.
High SLA and durability.
Cost: HDFS on EMR is ephemeral, so with S3 you do not need to keep clusters running just to have data available.
etc.
vs.
HDFS is faster for I/O operations and for intermediate/temporary data, since S3 communication involves API calls over the internet.
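As for the "under the hood" part: with a splittable format (uncompressed CSV, Parquet, ORC), Spark plans one task per byte range of the single object, and each task issues its own ranged GET through the s3a/EMRFS connector. A quick way to see that split, with a placeholder bucket and file name:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("single-s3-file").getOrCreate()

# Placeholder bucket/key; an uncompressed CSV is splittable, so Spark plans
# one partition per byte range of the single object
df = spark.read.option("header", "true").csv("s3a://my-bucket/big-file.csv")
print(df.rdd.getNumPartitions())

# The target split size is governed by this setting (128 MB by default),
# so a 1 GB file typically becomes ~8 partitions, each read by a task
# that may run on a different worker
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))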

What Hadoop configuration should be used for 100 GB of CSV files for analysis in Spark?

I have around 100 GB of data in CSV format on which I intend to do some transformations like aggregation and data splitting, and after that do some clustering using the ML package of Apache Spark.
I have tried uploading the data to MySQL and automating the process in Python, but it's taking too much time to build any solution.
What configuration do I need to set up, and how should I get started with Spark?
I am new to Spark. I am planning to use cloud services.
I'm going to recommend you learn to use Spark locally with a small subset of the data; you can run it standalone with a few tens to hundreds of MB. It's limited, but you can learn the tooling without paying. Your first Spark dataframe query could be sampling the source data and saving it into a more efficient query format (sketched below).
CSV isn't a great format for big data; Spark likes Parquet (and, for 2.3+, ORC). Embrace them for better performance.
Play with "notebooks"; Apache Zeppelin is one you can install and run locally.
Like I say, learn to play with small amounts of data. Spark is very interactive, and working with small datasets is an easy way to learn fast.
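A sketch of that first sampling query, assuming a local standalone session and placeholder file names and sample fraction:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("csv-sampling").getOrCreate()

# Read a CSV, take a ~1% sample, and persist it as Parquet so later
# experiments query the efficient columnar copy instead of the raw CSV
raw = spark.read.option("header", "true").csv("mydata.csv")
raw.sample(fraction=0.01, seed=42).write.mode("overwrite").parquet("mydata_sample.parquet")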
There are many ways to do that, but it depends on your case. As far as I know, HDFS with the default configuration (without any specific tuning) works fine; the majority of Hadoop tuning guides focus on the YARN side. So let me lay out a plan like the one below.
Generally speaking, you can put your (raw) data in HDFS, load it into Apache Spark, and save it as Parquet/ORC like below:
from pyspark.sql.types import StructType, StructField, StringType

# Explicit schema, so Spark does not have to infer it from the CSV
myschema = StructType([StructField("FirstName", StringType(), True), StructField("LastName", StringType(), True)])

# Read the raw CSV from HDFS
mydf = (spark.read.format("csv").option("header", "true").option("delimiter", ",")
        .schema(myschema).load("hdfs://hadoopmaster:9000/user/hduser/mydata.csv"))
mydf.count()

# Rewrite the data as Parquet, then read it back
mydf.repartition(6).write.format("parquet").save("hdfs://hadoopmaster:9000/user/hduser/DataInParquet")
newdf = spark.read.parquet("hdfs://hadoopmaster:9000/user/hduser/DataInParquet")
newdf.count()
Finally, compare the time taken by mydf.count() with that of newdf.count(): the Parquet version will run faster than the raw format. In addition, your data size will decrease from 100 GB to ~24 GB.
If you are new to Hadoop and Spark and are interested in setting up a Hadoop environment in the cloud, I would suggest you go with Elastic MapReduce (EMR) on AWS. You can create an on-demand Spark cluster with a user-defined configuration to process a wide range of data sets.
https://aws.amazon.com/emr/
https://aws.amazon.com/emr/details/spark/
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-launch.html
Or
You can set up a Hadoop cluster on top of EC2 instances, or on any cloud platform, with the required number of nodes and sufficient RAM and CPU. Storage-optimized instances are preferred here for analyzing a large data set.
We do not need to worry much about storage cost: for storage-optimized instances, AWS includes ephemeral storage disks of 1 - 2 TB, depending on the instance size.
Note: data in the ephemeral storage will be lost when the VM is rebooted. We can persist the processed data in S3 at low cost.
When it comes to cluster configuration, here is the list of things to check (a minimal sketch of these settings follows the list):
Spark on YARN is preferred
Set minimum and maximum cores and memory in the YARN NodeManager container settings for your Spark executors.
Enable dynamic resource allocation in Spark.
Set the container size and the Spark memory fraction high enough to avoid repeated shuffles, frequent spilling, and cached-data eviction.
Use Kryo serialization for higher performance.
Enable compression for map outputs before shuffling.
Enable the Spark web UI to track your application's tasks and stages.
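A minimal sketch of what some of these settings might look like when building a PySpark session on YARN; the executor sizes below are placeholders to be tuned against your NodeManager container limits, not recommendations:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("csv-analysis")
         # Dynamic allocation lets YARN grow and shrink the executor count with load;
         # it needs the external shuffle service
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.shuffle.service.enabled", "true")
         # Placeholder executor sizing
         .config("spark.executor.memory", "8g")
         .config("spark.executor.cores", "4")
         # Kryo serialization and compressed map outputs for the shuffle
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .config("spark.shuffle.compress", "true")
         .getOrCreate())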
Apache Spark Config Reference: https://spark.apache.org/docs/2.1.0/configuration.html

Auto-scaling a Spark cluster

I have a Spark Streaming job running on a cluster. The Spark job pulls messages from Kafka and does the required processing before dumping the processed data into a database. I have sized my cluster for the current load, but this load requirement may go up or down in the future. I want to know the techniques to facilitate this auto scaling without restarting the job. Scaling becomes more complicated if Kafka is being used (as in my case), as I would not like the partitions to be moved around in stateful streaming. Currently the cluster is completely in-house, but I would not mind migrating to the cloud if that helps the scaling use case.
This is not an answer, just some notes.
"In stateful streaming": what did you mean by that? All state in Spark is distributed, and you should not rely on the local system; if some task fails, it can be sent to any other executor.
Are you talking about increasing the size of the cluster, or the resources dedicated to your Spark job within the cluster?
If the first, you need to monitor each node (memory, CPU) and, when you hit some threshold, add more nodes.
If the second: we didn't find a nice solution. Spark provides an 'autoscaling' feature, however it doesn't work properly with Kafka streaming.
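For reference, the standard knobs for that Spark 'autoscaling' (dynamic allocation) look like the sketch below; the executor bounds are placeholders, and, as noted above, this mechanism has known issues with long-running Kafka streaming jobs, so test it against your own workload:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("kafka-streaming-job")
         # Core dynamic allocation switches; they require the external shuffle service
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.shuffle.service.enabled", "true")
         # Placeholder bounds on how far the cluster manager may scale executors
         .config("spark.dynamicAllocation.minExecutors", "2")
         .config("spark.dynamicAllocation.maxExecutors", "20")
         .getOrCreate())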

Can somebody give a high-level, simple explanation to a beginner about how Hadoop works?

I know how memcached works. How does Hadoop work?
Hadoop consists of a number of components which are each subprojects of the Apache Hadoop project. Two of the main ones are the Hadoop Distributed File System (HDFS) and the MapReduce framework.
The idea is that you can network together a number of off-the-shelf computers to create a cluster. HDFS runs on the cluster. As you add data to the cluster, it is split into large chunks/blocks (generally 64 MB) and distributed around the cluster. HDFS allows data to be replicated so the cluster can recover from hardware failures; it almost expects hardware failures, since it is meant to work with standard hardware. HDFS is based on Google's paper about their distributed file system, GFS.
The Hadoop MapReduce framework runs over the data stored in HDFS. MapReduce 'jobs' provide key/value-based processing in a highly parallel fashion. Because the data is distributed over the cluster, a MapReduce job can be split up to run many parallel processes over the data stored on the cluster. The Map parts of MapReduce run only on the data they can see, i.e. the data blocks on the particular machine they are running on. The Reduce brings together the output from the Maps.
The result is a system that provides a highly parallel batch-processing capability. The system scales well, since you just need to add more hardware to increase its storage capacity or decrease the time a MapReduce job takes to run.
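To make the Map/Reduce idea concrete, here is a tiny plain-Python illustration of the model (it is not Hadoop code; the sample "blocks" stand in for HDFS blocks):
from collections import defaultdict

def map_phase(block_of_text):
    # The map step runs on one local block and emits (key, value) pairs
    for word in block_of_text.split():
        yield (word.lower(), 1)

def reduce_phase(key, values):
    # The reduce step combines all values that share a key
    return (key, sum(values))

blocks = ["the quick brown fox", "the lazy dog"]   # stand-ins for HDFS blocks
grouped = defaultdict(list)
for block in blocks:                               # in Hadoop, maps run in parallel, one per block
    for key, value in map_phase(block):
        grouped[key].append(value)                 # the framework's "shuffle" groups pairs by key

counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(counts)   # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}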
Some links:
Word Count introduction to Hadoop MapReduce
The Google File System
MapReduce: Simplified Data Processing on large clusters
