Streaming ETL jobs in AWS Glue: how much does it cost?

I want to use an AWS Glue job that streams data from a Kafka source (it runs continuously). How much does it cost? (Let's suppose we're using the minimum possible number of DPUs.)
Any ideas, please?

The minimum possible number of DPUs is 2, and AWS charges $0.44 per DPU-hour. Running a streaming job continuously would therefore cost you $21.12 per day (2 DPUs × 24 hours × $0.44).
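As a back-of-the-envelope check (a sketch assuming the 2-DPU minimum and the $0.44 list price; actual pricing varies by region):

```python
# Rough daily/monthly cost estimate for a continuously running Glue streaming job.
dpus = 2                      # minimum for a Glue streaming job
price_per_dpu_hour = 0.44     # assumed list price, varies by region
daily = dpus * 24 * price_per_dpu_hour    # $21.12 per day
monthly = daily * 30                      # ~$633.60 per 30-day month
print(f"daily: ${daily:.2f}, monthly: ${monthly:.2f}")
```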

Related

Apache Spark AWS Glue job versus Spark on Hadoop cluster for transferring data between buckets

Let's say I need to transfer data between two S3 buckets as an ETL process and perform a simple transformation on the data during the transfer (taking only some of the columns and filtering by ID).
The data is Parquet files, and its size varies between 1 GB and 100 GB.
Which would be more efficient in terms of speed and cost - using an Apache Spark Glue job, or Spark on a Hadoop cluster with X machines?
The answer to this is basically the same for any serverless (Glue)/non-serverless (EMR) service equivalents.
The first should be faster to set up, but will be less configurable and probably more expensive. The second will give you more options for optimization (performance and cost), but you should not forget to include the cost of managing the service yourself. You can use the AWS pricing calculator if you need a price estimate upfront.
I would definitely start with Glue and move to something more complicated if problems arise. Also, don't forget that EMR Serverless is now available as well.
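Either way, the transformation described in the question is straightforward Spark code. A minimal PySpark sketch (bucket paths, column names, and the ID filter are placeholders) that would run as a Glue script or on EMR:

```python
# Read Parquet from one bucket, keep a subset of columns, filter by ID,
# and write the result to another bucket.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-to-s3-etl").getOrCreate()

(spark.read.parquet("s3://source-bucket/input/")
      .select("id", "event_time", "value")      # keep only the needed columns
      .filter("id IN (101, 102, 103)")          # filter by ID
      .write.mode("overwrite")
      .parquet("s3://target-bucket/output/"))
```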
I read this question when determining if it was worthwhile to switch from AWS Glue to AWS EMR.
With configurable EC2 Spot Instances on EMR we drastically sped up a previous Glue job that read 1 GB-4 TB of uncompressed CSV data. We were able to use Spot Instances to leverage much larger and faster Graviton EC2 instances that could load more data into RAM, reducing spills to disk. Another benefit was getting rid of dynamic frames, which are very useful when you do not know the schema, but were overhead we did not need. In addition, the Spot Instances, which are larger than what AWS Glue provides, reduced our run time, though not by much. More importantly, we cut our costs by 40-75% - yes, even including the EC2 + EBS + EMR overhead cost per instance. We went from $25-$250 a day on Glue to $2-$60 on EMR. The monthly cost for this process was about $1,600 on AWS Glue and is now under $500. We run EMR via run_job_flow and terminate the cluster when idle, so it essentially acts like serverless Glue.
We did not go with EMR Serverless because it did not support Spot Instances, which were probably the biggest benefit for us.
Our only regret is that we did not switch earlier. We are now moving all our AWS Glue jobs to AWS EMR.
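For reference, a rough sketch of the transient-cluster pattern described above, using boto3's run_job_flow with Spot core nodes and auto-termination; names, instance types, counts, and S3 paths are placeholders, not the poster's actual configuration:

```python
# Launch a transient EMR cluster that runs one Spark step on Spot instances
# and terminates itself when the step finishes.
import boto3

emr = boto3.client("emr", region_name="us-east-1")
response = emr.run_job_flow(
    Name="transient-spark-etl",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m7g.xlarge",
             "InstanceCount": 1, "Market": "ON_DEMAND"},
            {"InstanceRole": "CORE", "InstanceType": "r7g.4xlarge",
             "InstanceCount": 4, "Market": "SPOT"},          # Graviton Spot capacity
        ],
        "KeepJobFlowAliveWhenNoSteps": False,                 # terminate when idle
    },
    Steps=[{
        "Name": "spark-etl",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/etl.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```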

Azure Data Explorer High Ingestion Latency with Streaming

We are using stream ingestion from Event Hubs to Azure Data Explorer. The Documentation states the following:
The streaming ingestion operation completes in under 10 seconds, and your data is immediately available for query after completion.
I am also aware of the limitations such as
Streaming ingestion performance and capacity scales with increased VM and cluster sizes. The number of concurrent ingestion requests is limited to six per core. For example, for 16 core SKUs, such as D14 and L16, the maximal supported load is 96 concurrent ingestion requests. For two core SKUs, such as D11, the maximal supported load is 12 concurrent ingestion requests.
But we are currently experiencing an ingestion latency of 5 minutes (as shown in the Azure metrics) and see that data is actually available for querying only 10 minutes after ingestion.
Our dev environment is the cheapest SKU, Dev(No SLA)_Standard_D11_v2, but given that we only ingest ~5,000 events per day in this environment (per the "Events Received" metric), this latency is very high and not usable in a streaming scenario where we need the data available for queries in under 1 minute.
Is this the latency we have to expect from the dev environment, or are there any tweaks we can apply in order to achieve lower latency in these environments as well?
How will latency behave in a production environment like Standard_D12_v2? Do we have to expect such high numbers there too, or is there a fundamental difference in behavior between dev/test and production environments in this regard?
Did you follow the two steps needed to enable the streaming ingestion for the specific table, i.e. enabling streaming ingestion on the cluster and on the table?
In general, this is not expected: the dev/test cluster should exhibit the same behavior as a production cluster, with the expected limitations around the size and scale of operations. If you test it with a few events and see the same latency, it means something is wrong.
If you did follow these steps and it still does not work, please open a support ticket.

best way to run 300+ concurrent spark jobs in Dataproc?

I have a Dataproc cluster with 2 worker nodes (n1s2). An external server submits around 360 Spark jobs within an hour (with a couple of minutes between submissions). The first job completes successfully, but the subsequent ones get stuck and do not proceed at all.
Each job crunches some time-series numbers and writes to Cassandra. The time taken is usually 3-6 minutes when the cluster is completely free.
I feel this can be solved by just scaling up the cluster, but that would become very costly for me.
What would be the other options to best solve this use case?
Running 300+ concurrent jobs on a cluster with 2 worker nodes doesn't sound feasible. You first want to estimate how much resource (CPU, memory, disk) each job needs, then plan the cluster size accordingly. YARN metrics such as available CPU, available memory, and especially pending memory are helpful for identifying situations where the cluster lacks resources.

Spark: Writing to DynamoDB, limited write capacity

My use case is to write to DynamoDB from a Spark application. As I have limited write capacity for DynamoDB and do not want to increase it because of the cost implications, how can I limit the Spark application to write at a regulated speed?
Can this be achieved by reducing the partitions to 1 and then executing foreachPartition()?
I already have auto-scaling enabled but don't want to increase it any further.
Please suggest other ways of handling this.
EDIT: This needs to be achieved when the Spark application is running on a multi-node EMR cluster.
Bucket scheduler
The way I would do this is to create a token bucket scheduler in your Spark application. The token bucket pattern is a common design for ensuring an application does not breach API limits. I have used this design successfully in very similar situations. You may find that someone has written a library you can use for this purpose.
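A minimal sketch of that idea, assuming PySpark and boto3; the table name, region, and rate figure are placeholders. Note that each partition gets its own bucket, so the per-partition rate should be the table's write capacity divided by the number of partitions writing concurrently:

```python
# Token-bucket rate limiter applied inside foreachPartition when writing to DynamoDB.
import time
import boto3

class TokenBucket:
    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec          # tokens refilled per second
        self.capacity = capacity          # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def acquire(self, tokens=1):
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= tokens:
                self.tokens -= tokens
                return
            time.sleep((tokens - self.tokens) / self.rate)   # wait for refill

def write_partition(rows):
    bucket = TokenBucket(rate_per_sec=25, capacity=25)       # placeholder per-partition limit
    table = boto3.resource("dynamodb", region_name="us-east-1").Table("my-table")
    for row in rows:
        bucket.acquire()
        table.put_item(Item=row.asDict())

# df.rdd.foreachPartition(write_partition)
```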
DynamoDB retry
Another (less attractive) option would be to increase the number of retries on your DynamoDB connection. When a write does not succeed because provisioned throughput is exceeded, you can essentially instruct the DynamoDB SDK to keep retrying for as long as you like. Details are in this answer. This option may appeal if you want a 'quick and dirty' solution.
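A hedged sketch of that option using botocore's retry configuration; the region, table name, and retry budget are illustrative only:

```python
# Raise the SDK retry budget so throttled writes back off and retry instead of failing.
import boto3
from botocore.config import Config

dynamodb = boto3.resource(
    "dynamodb",
    region_name="us-east-1",                                   # placeholder region
    config=Config(retries={"max_attempts": 20, "mode": "adaptive"}),
)
table = dynamodb.Table("my-table")                             # placeholder table name
table.put_item(Item={"id": "example"})
```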

Streaming Analytics job is only able to use up to 6 Streaming Units

I am load testing my Azure IoT Suite solution. I am currently testing with 3,000 messages/second. My Azure Stream Analytics job pulls data from IoT Hub and stores it in a partitioned DocumentDB database. I have calculated that I will require 12 Streaming Units. When I start the job I get the following error:
Streaming Analytics job '' is only able to use up to 6 Streaming Units based on the provided query. Please adjust the number of Streaming Units or update your query and restart the job
What do I update in my query to get past the 6 Streaming Units limit?
You need to add a PARTITION BY clause to be able to use more than 6 SUs. Please take a look at this article: https://azure.microsoft.com/en-us/documentation/articles/stream-analytics-scale-jobs/
