My use case is to write to DynamoDB from a Spark application. As I have limited write capacity for DynamoDB and do not want to increase it because of the cost implications, how can I limit the Spark application to write at a regulated speed?
Can this be achieved by reducing the partitions to 1 and then executing foreachPartition()?
I already have auto-scaling enabled but don't want to increase it any further.
Please suggest other ways of handling this.
EDIT: This needs to be achieved when the Spark application is running on a multi-node EMR cluster.
Bucket scheduler
The way I would do this is to create a token bucket scheduler in your Spark application. The token bucket pattern is a common design used to ensure an application does not breach API limits. I have used this design successfully in very similar situations. You may find someone has written a library you can use for this purpose.
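One library that fits this purpose is Guava's RateLimiter, which is a token-bucket style limiter. A minimal sketch (assuming Guava is on the classpath, and with a hypothetical writeItem helper standing in for the actual DynamoDB put) could look like this:

    import com.google.common.util.concurrent.RateLimiter
    import org.apache.spark.sql.{Row, SparkSession}

    object RateLimitedDynamoWrite {

      // Placeholder for the actual DynamoDB write; included only so the sketch is self-contained.
      def writeItem(row: Row): Unit = {
        // e.g. build a PutItemRequest from `row` and send it with the AWS SDK client
      }

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("rate-limited-ddb-write").getOrCreate()
        import spark.implicits._

        // Stand-in for whatever DataFrame you actually need to persist.
        val df = Seq(("id-1", "payload-1"), ("id-2", "payload-2")).toDF("id", "payload")

        // Hypothetical budget: the table's write capacity divided across concurrent partitions.
        val writesPerSecondPerPartition = 25.0

        df.foreachPartition { (rows: Iterator[Row]) =>
          // Built on the executor, one limiter per partition; RateLimiter is a token-bucket limiter.
          val limiter = RateLimiter.create(writesPerSecondPerPartition)
          rows.foreach { row =>
            limiter.acquire() // blocks until a token is available
            writeItem(row)    // the DynamoDB put for this row
          }
        }
      }
    }

Creating the limiter inside foreachPartition keeps it on the executor, so each partition paces itself; the per-partition rate is simply your table's write budget divided by the number of partitions writing concurrently.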
DynamoDB retry
Another (less attractive) option would be to increase the retry count on your DynamoDB connection. When your write does not succeed because provisioned throughput was exceeded, you can essentially instruct your DynamoDB SDK to keep retrying for as long as you like. Details in this answer. This option may appeal if you want a 'quick and dirty' solution.
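For illustration only (not the exact code from the linked answer), with the AWS SDK for Java v1 the retry count can be raised on the client configuration; the value of maxRetries below is a made-up placeholder:

    import com.amazonaws.ClientConfiguration
    import com.amazonaws.services.dynamodbv2.{AmazonDynamoDB, AmazonDynamoDBClientBuilder}

    // ProvisionedThroughputExceededException is a retryable error, so raising the retry
    // count makes the SDK back off and retry instead of failing the Spark task.
    val maxRetries = 20 // made-up value; tune to how long you are willing to block

    val clientConfig = new ClientConfiguration().withMaxErrorRetry(maxRetries)

    val dynamo: AmazonDynamoDB = AmazonDynamoDBClientBuilder.standard()
      .withClientConfiguration(clientConfig)
      .build()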
Related
Some concepts of how to use Apache Spark efficiently with a database are not yet clear to me.
I was reading the book Spark: Big Data made simple and the author states (ch.1 pg.5):
"Data is expensive to move so Spark focuses on performing computations over the data, no matter where it resides."
and
"Although Spark runs well on Hadoop storage, today it is also used broadly in environments for which the Hadoop architecture does not make sense, such as the public cloud (where storage can be purchased separately from computing) or streaming applications."
I understood that, in its philosophy, Spark decouples storage from computing. In practice, this can lead to data movement when the data does not reside on the same physical machine as the Spark workers.
My questions are:
How to measure the impact of data movement in my Job? For example, how to know if the network/database throughput is the bottleneck in my Spark job?
What is the IDEAL use of Spark (if one exists)? Tightly coupled processing + data storage, with the workers on the same physical machines as the database instances, for minimal data movement? Or can I use a single database instance (with various workers) as long as it can handle high throughput and network traffic?
With a super-fast network connection, data is no longer costly to move. That was the case 15 years ago, but not anymore. Most Spark jobs nowadays run with the data residing in an object store like S3. When Spark runs, it fetches the data from S3 and performs the operation. We like this approach because it allows us not to maintain a massive long-running Hadoop cluster. We run the Spark job when required.
The minimal data movement hypothesis is no longer valid. The major bottleneck in modern computing is CPU speed, not the data transfer cost.
However, to your question about how to measure the data transfer cost: you can run two experiments, one with the data in a Hadoop cluster and one with the data in an object store like S3, and check the time difference in the Spark job.
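A crude sketch of such an experiment (hypothetical paths, with an explicit count() so the read is actually forced):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("io-comparison").getOrCreate()

    // Times a full scan; count() is the action that forces the data to actually be read.
    def timeCount(path: String): (Long, Double) = {
      val start = System.nanoTime()
      val rows = spark.read.parquet(path).count()
      (rows, (System.nanoTime() - start) / 1e9)
    }

    // Hypothetical locations holding the same dataset twice.
    val (hdfsRows, hdfsSeconds) = timeCount("hdfs:///data/events")
    val (s3Rows, s3Seconds)     = timeCount("s3a://my-bucket/data/events")

    println(f"HDFS: $hdfsRows rows in $hdfsSeconds%.1f s; S3: $s3Rows rows in $s3Seconds%.1f s")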
An important thing to note: it is not always important to run the Spark job super fast. You need to keep a balance between your workflow's SLA requirements and the maintainability of the cluster and data.
If you are working with data in S3, IO is capped (roughly 5K/s reads and 3K/s writes) and the inefficiencies of GET requests dominate bandwidth. If you are doing too much S3 IO against the same part of an S3 bucket, you get throttled, and adding more workers just makes it worse.
Also S3 latency isn't great when the input stream needs to drain/abort the current request and start a new GET with a different range.
If you are using the s3a connector with recent Hadoop 3.3.3+ jars (and you should be), then you can get it to print lots of stats on S3 IO.
If you call toString() on an input stream, it prints the IO it has done: bytes read, bytes discarded, GET calls, latency.
If you set spark.hadoop.fs.iostatistics.logging.level to "info", you get a summary of all IO done against a bucket when a worker process is shut down.
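For example, that property can be set when the session is built (the bucket path here is hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("s3a-iostats")
      // spark.hadoop.* properties are copied into the Hadoop configuration used by s3a.
      .config("spark.hadoop.fs.iostatistics.logging.level", "info")
      .getOrCreate()

    // Any S3 read now feeds the per-bucket IOStatistics that get logged when the
    // filesystem instances are closed at worker shutdown.
    val df = spark.read.parquet("s3a://my-bucket/tables/events") // hypothetical path
    println(df.count())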
Let's say I need to transfer data between two S3 buckets as an ETL process and perform a simple transformation on the data during the transfer (taking only some of the columns and filtering by ID).
The data is Parquet files, and the size ranges from 1 GB to 100 GB.
What would be more efficient in terms of speed and cost: an AWS Glue Spark job, or Spark on a Hadoop cluster with X machines?
The answer to this is basically the same for any serverless (Glue)/non-serverless (EMR) service equivalents.
The first should be faster to set up, but will be less configurable and probably more expensive. The second will give you more options for optimization (performance and cost), but you should not forget to include the cost of managing the service yourself. You can use the AWS Pricing Calculator if you need a price estimate upfront.
I would definitely start with Glue and move to something more complicated if problems arise. Also, don't forget that EMR Serverless is now available.
I read this question when determining if it was worthwhile to switch from AWS Glue to AWS EMR.
With configurable EC2 spot instances on EMR we drastically reduced the cost of a previous Glue job that read 1 GB-4 TB of uncompressed CSV data. We were able to use spot instances to leverage much larger and faster Graviton-based EC2 instances that could load more data into RAM, reducing spills to disk. Another benefit was that we got rid of dynamic frames, which are very helpful when you do not know the schema, but were overhead we did not need. In addition, the spot instances, being larger than what AWS Glue provides, reduced our run time, though not by much. More importantly, we cut our costs by 40-75%, and yes, that is even with the EC2 + EBS + EMR overhead cost per EC2 instance. We went from $25-$250 a day on Glue to $2-$60 on EMR. Monthly costs for this process were $1600 on AWS Glue and are now under $500. We run EMR as job_flow_run and TERMINATE when idle, so it essentially acts like serverless Glue.
We did not go with EMR Serverless because it has no spot instances, which were probably the biggest benefit.
The only problem is that we did not switch earlier. We are now moving all AWS Glue jobs to AWS EMR.
Is it possible to create a Hadoop cluster on Dataproc with no or very minimal HDFS space by setting dfs.datanode.du.reserved to about 95% or 100% of the total node size? The plan is to use GCS for all persistent storage, while the local file system will primarily be used for Spark's shuffle data. Some of the Hive queries may still need scratch space on HDFS, which explains the need for minimal HDFS.
I did create a cluster with a 10-90 split and did not notice any issues with my test jobs.
Could there be stability issues with Dataproc if this approach is taken?
Also, are there concerns with removing the DataNode daemon from Dataproc's worker nodes, thereby using the primary workers as compute-only nodes? The rationale is that Dataproc currently doesn't allow a mix of preemptible and non-preemptible secondary workers. So I want to check whether we can repurpose primary workers as compute-only non-PVM nodes while the secondary workers act as compute-only PVM nodes.
I am starting a GCP project; I am well versed in Azure, less so in AWS, but I know enough there, having done a DDD setup.
What you describe is similar to the AWS setup, and I looked at this recently: https://jayendrapatil.com/google-cloud-dataproc/
My impression is that you can run without HDFS here as well (0%). The key point is that performance for a suite of jobs will, as on AWS and Azure, benefit from writing to and reading from ephemeral HDFS, as it is faster than Google Cloud Storage. I cannot see stability issues; you can run Spark without HDFS if you really want to.
On the 2nd question, stick to what they have engineered. Why try and force things? On AWS we lived with the limitations on scaling down with Spark.
So I have a use case where I will stream about 1000 records per minute from Kafka. I just need to dump these records in raw form into a NoSQL DB or something like a data lake, for that matter.
I ran this with two approaches.
Approach 1
——————————
Create Kafka consumers in Java and run them as three different containers in Kubernetes. Since all the containers are in the same Kafka consumer group, they all contribute towards reading from the same Kafka topic and dumping the data into the data lake. This works pretty quickly for the volume of workload I have.
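Roughly, each container runs a plain consumer loop along these lines (an illustrative sketch; the broker address, group id, and topic name are placeholders):

    import java.time.Duration
    import java.util.{Collections, Properties}
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import scala.jdk.CollectionConverters._

    val props = new Properties()
    props.put("bootstrap.servers", "kafka:9092")   // placeholder broker address
    props.put("group.id", "raw-dump-group")        // the same group id in all three containers
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("raw-events")) // placeholder topic

    while (true) {
      val records = consumer.poll(Duration.ofMillis(500))
      for (record <- records.asScala) {
        // here each raw record would be written to the data lake / NoSQL store
        println(s"${record.key} -> ${record.value}")
      }
    }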
Approach 2
——————————-
I then created a Spark cluster and used the same Java logic to read from Kafka and dump the data into the data lake.
Observations
———————————-
The performance of Kubernetes, if not better, was equal to that of a Spark job running in cluster mode.
So my question is: what is the real use case for using Spark over Kubernetes the way I am using it, or even Spark on Kubernetes?
Is Spark only going to rise and shine with much, much heavier workloads, say on the order of 50,000 records per minute, or in cases where some real-time processing needs to be done on the data before dumping it to the sink?
Spark has more cost associated with it, so I need to make sure I use it only if it would scale better than the Kubernetes solution.
If your case is only to archive/snapshot/dump records, I would recommend you look into Kafka Connect.
If you need to process the records you stream, e.g. aggregate or join streams, then Spark comes into the game (see the sketch below). For this case you may also look into Kafka Streams.
Each of these frameworks has its own tradeoffs and performance overheads, but in any case you save a lot of development effort by using tools made for this rather than developing your own consumers. These frameworks already support most failure handling, scaling, and configurable delivery semantics, and they have enough config options to tune the behaviour for most cases you can imagine. Just choose the available integration and you're good to go! And of course, beware of open source bugs ;).
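If you do go the Spark route, a minimal Structured Streaming sketch that just lands the raw Kafka records as Parquet in a lake might look like this (broker, topic, and paths are hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("kafka-raw-dump").getOrCreate()

    // Read the raw records from Kafka; key/value arrive as binary, so cast them to strings.
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092") // hypothetical broker
      .option("subscribe", "raw-events")               // hypothetical topic
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")

    // Land them as Parquet in the lake; the checkpoint gives exactly-once file output.
    val query = raw.writeStream
      .format("parquet")
      .option("path", "s3a://my-lake/raw/events")                          // hypothetical sink path
      .option("checkpointLocation", "s3a://my-lake/checkpoints/raw-events")
      .start()

    query.awaitTermination()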
Hope it helps.
Running Kafka inside Kubernetes is only recommended when you have a lot of expertise doing it. The same caution applies to Spark on Kubernetes: Kubernetes doesn't know it's hosting Spark, and Spark doesn't know it's running inside Kubernetes, so you will need to double-check every feature you decide to run.
For your workload, I'd recommend sticking with Kubernetes. The elasticity, performance, monitoring tools, and scheduling features, plus the huge community support, add up well in the long run.
Spark is an open source, scalable, massively parallel, in-memory execution engine for analytics applications, so it will really shine when your workload becomes more processing-heavy. It simply doesn't have much room to rise and shine if you are only dumping data, so keep it simple.
We're using Spark to run an ETL process by which data gets loaded in from a massive (500+GB) MySQL database and converted into aggregated JSON files, then gets written out to Amazon S3.
My question is two-fold:
This job could take a long time to run, and it would be nice to know how the mapping is going. I know Spark has a built-in log manager. Is it as simple as just putting a log statement inside each map? I'd like to know when each record gets mapped.
Suppose this massive job fails in the middle (maybe it chokes on a DB record or the MySQL connection drops). Is there an easy way to recover from this in Spark? I've heard that caching/checkpointing can potentially solve this, but I'm not sure how.
Thanks!
Seems like two questions with lots of possible answers and detail. Anyway, assuming a non-Spark-Streaming setup and referencing other material based on my own reading/research, a limited response:
The following covers logging and progress checking of stages, tasks, and jobs:
Global logging via log4j, tailored using the log4j.properties.template file stored under the SPARK_HOME/conf folder, which serves as a basis for defining logging requirements for your own purposes, at the Spark level.
Programmatically, by using a Logger via import org.apache.log4j.{Level, Logger}.
The REST API, to get the status of Spark jobs. See this enlightening blog: http://arturmkrtchyan.com/apache-spark-hidden-rest-api
There is also a SparkListener that can be used; see the sketch after this list for both the programmatic Logger and a listener.
The Web UI at http://<master>:8080 to see progress.
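As a combined illustration of the programmatic Logger and the SparkListener options (an illustrative sketch, not a drop-in solution; the logger name is arbitrary):

    import org.apache.log4j.{Level, Logger}
    import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerTaskEnd}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("progress-logging").getOrCreate()

    // Programmatic logging: quieten Spark's own chatter and use a logger of your own.
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    val progressLog = Logger.getLogger("etl-progress") // arbitrary logger name

    // A listener that reports task and stage completion on the driver.
    spark.sparkContext.addSparkListener(new SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
        progressLog.info(s"Task ${taskEnd.taskInfo.taskId} of stage ${taskEnd.stageId} ended: ${taskEnd.reason}")

      override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit =
        progressLog.info(s"Stage ${stageCompleted.stageInfo.stageId} completed " +
          s"(${stageCompleted.stageInfo.numTasks} tasks)")
    })

Note that logging every single record from inside a map will be very noisy at this scale; per-task or per-stage reporting via the listener is usually enough to see how the job is progressing.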
It depends on the type of failure: graceful vs. non-graceful, fault-tolerance aspects or memory usage issues, and things like serious database duplicate key errors, depending on the API used.
See How does Apache Spark handles system failure when deployed in YARN? Spark handles its own failures by looking at the DAG and attempting to reconstruct a partition by re-executing what is needed. This all falls under fault tolerance, for which nothing needs to be done.
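On the caching/checkpointing the question mentions, a minimal sketch of checkpointing an intermediate DataFrame so that a long lineage does not have to be recomputed from MySQL after a task failure (connection details, paths, and the checkpoint directory are all hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("checkpoint-example").getOrCreate()

    // Checkpoint data must live somewhere durable; on EMR an S3 (or HDFS) path is typical.
    spark.sparkContext.setCheckpointDir("s3a://my-bucket/spark-checkpoints")

    val fromMysql = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://db-host:3306/app") // hypothetical connection details
      .option("dbtable", "events")
      .option("user", "etl")
      .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
      .load()

    // checkpoint() materialises the DataFrame and truncates its lineage, so later stages
    // that fail can be retried from the checkpointed data rather than re-reading MySQL.
    val stable = fromMysql.checkpoint()

    stable.write.mode("overwrite").json("s3a://my-bucket/aggregated-json") // hypothetical output

This protects against task and stage retries within a running application; it does not by itself make the job resumable after a driver crash, so a fully restartable pipeline still needs its own bookkeeping of what has already been written.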
Things outside of Spark's domain and control mean it's over, e.g. memory issues that may result from exceeding various parameters in large-scale computations, a DataFrame JDBC write against a store hitting a duplicate key error, or JDBC connection outages. These mean re-execution.
As an aside, some aspects are not logged as failures even though they are, e.g. duplicate key inserts on some Hadoop Storage Managers.