Difference in counts on a Delta Table immediately after write operation - apache-spark

I have a Databricks job that writes to a certain Delta table. After the write has completed, the job calls another function that reads the same Delta table and calculates some metrics (counts, to be specific) on it.
But the counts being calculated are lower than the actual counts for the given partition_dates.
To verify that the function calculating the metrics works correctly, I called it again after the Databricks job had completed, and the counts were correct in that run.
I suspect the Delta table is not completely updated by the time I read from it, even though the write operation completed successfully.
Any help would be appreciated.

Related

How to check if the DAG is complete within Given time or not?

I have a DAG, A, that runs at a set time, say 10 AM, and typically completes within 15-20 minutes. Sometimes it takes longer, and because of some tables in the database it gets stuck in an endless running state. How can I tell whether my DAG completed within a given time frame and, if it did not, send an email alert saying that it has not completed in time and needs to be checked?
My thought process:
Build a parallel DAG, or a process within the same DAG, containing a Python function that takes the start time, compares it with the current time, and keeps checking until the difference reaches some fixed value, say 10 minutes, and then sends an email saying that the DAG has not completed.
Please correct me if I am wrong, or tell me what other ways there are to check this.
It sounds like you just need to define an SLA. You can find an example here.
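For comparison, the manual check described in the question boils down to comparing elapsed time against a limit. A minimal sketch in plain Python, independent of Airflow; the function name `is_overdue` and the 30-minute limit are illustrative assumptions, not part of any Airflow API:

```python
from datetime import datetime, timedelta


def is_overdue(started_at: datetime, now: datetime, limit: timedelta) -> bool:
    """Return True when a run that began at started_at has exceeded limit."""
    return now - started_at > limit


# Hypothetical run that started at 10:00 with a 30-minute allowance.
start = datetime(2024, 1, 1, 10, 0)
print(is_overdue(start, datetime(2024, 1, 1, 10, 20), timedelta(minutes=30)))  # False
print(is_overdue(start, datetime(2024, 1, 1, 10, 45), timedelta(minutes=30)))  # True
```

An SLA expresses the same kind of limit declaratively, so a hand-written watcher like this should not be needed.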

Executing Azure Function back to back

If I schedule a timer triggered Azure function to run every second and my function is taking 2 seconds to execute, will I just get back-to-back executions or will some execution queue eventually overflow?
Background:
We have a timer-triggered Azure function that currently executes every 30 seconds and checks for new rows in a database table. If there are new rows, the data is processed and the rows are marked as handled.
If there are no new rows, the execution is very fast. If there are 500 new rows (which is the maximum we fetch at the moment), the execution takes about 20-25 seconds.
We would like to decrease the interval to one second to reduce the latency of row processing.
Update: I want back-to-back executions and I want to avoid overlapping executions.
Multiple Azure functions can run concurrently. This means the function can be triggered again while the previous invocation is still running, and both will run concurrently. They will only queue up if you set options to run only 1 function at a time on 1 instance, but it doesn't look like you want that.
With concurrency, 2 functions can read the same table in the DB at the same time, so you should read your table with the UPDLOCK option. This prevents a subsequently triggered function from reading the same rows that the previous function read.
In short, the answer to your question is neither. If your functions overlap, by default, you will get multiple functions running at the same time.
To achieve back-to-back execution for timer triggers, set WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT and FUNCTIONS_WORKER_PROCESS_COUNT to 1 in the application settings. This ensures that only 1 function execution runs at a time.

Spark and 100000 sequential HTTP calls: driver vs workers

I have to make 100000 sequential HTTP requests with Spark and store the responses in S3. I say sequential because each request returns around 50KB of data, and I have to wait 1 second between requests so as not to exceed API rate limits.
Where to make HTTP calls: from Spark Job's code (executed on driver/master node) or from dataset transformation (executed on worker node)?
Workarounds
1. Make the HTTP requests from my Spark job (on the driver/master node), create a dataset from each HTTP response (each contains 5000 JSON items), and save each dataset to S3 with the help of Spark. A dataset does not need to be kept after it has been saved.
2. Create a dataset from all 100000 URLs (moving all further computations to the workers), make the HTTP requests inside map or mapPartition, and save a single dataset to S3.
The first option
It's simpler, and it reflects the nature of my computations: they're sequential because of the 1-second delay. But:
Is it bad to make 100_000 HTTP calls from the driver/master node?
Is it more efficient to create/save one 100_000 * 5_000 dataset than to create/save 100_000 small datasets of size 5_000?
Each time I create a dataset from an HTTP response, I'll move the response to a worker and then save it to S3, right? Double shuffling then...
Second option
It won't actually benefit from parallel processing, since you have to keep an interval of 1 second between requests. The only bonus is moving the computations (even if they aren't too heavy) off the driver. But:
Is it worth moving the computations to the workers?
Is it a good idea to make API call inside transformation?
Saving a file <32MB (or whatever fs.s3a.block.size is) to S3 is ~2xGET, 1xLIST and a PUT; you get billed a bit by AWS for each of these calls, plus storage costs.
For larger files, a POST to initiate multipart upload after that first block, one POST per 32 MB (of 32MB, obviously) and a final POST of a JSON file to complete. So: slightly more efficient
Where small S3 file sizes matter is in the bills from AWS and in follow-up Spark queries: with anything you use in Spark, PySpark, SQL, etc., many small files are slower. There's a high cost in listing files in S3, and every task pushed out to a Spark worker has some setup/commit/complete costs.
Regarding doing HTTP API calls inside a worker: well, you can do fun things there. If the result isn't replicable, then task failures and retries can give bad answers, but for a GET it should be OK. What is hard is throttling the work; I'll leave you to come up with a strategy there.
Here's an example of uploading files to S3 or another object store in workers; first the RDD of the copy src/dest operations is built up, then they are pushed out to workers. The result of the worker code includes upload duration info, in case someone ever wants to aggregate the stats (though there you'd probably need timestamps for some time-series view).
Given you have to serialize the work to one request/second, 100K requests is going to take over a day. If each request takes <1 second, you may as well run it on a single machine. What's important is to save the work incrementally so that if your job fails partway through you can restart from the last checkpoint. I'd personally focus on that problem: how could you do this operation such that every 15-20 minutes of work is saved, and on a restart you can carry on from there?
Spark does not handle recovery of a failed job, only task failures. Lose the driver and you get to restart your last query. Break things up.
Something which comes to mind could be
* first RDD takes list of queries and some summary info about any existing checkpointed data, calculates the next 15 minutes of work,
* building up a list of GET calls to delegate to 1+ worker. Either 1 URL/row, or have multiple URLs in a single row
* run that job, save the results
* test recovery works with a smaller window and killing things.
* once happy: do the full run
Maybe also: recognise & react to any throttle events coming off the far end by
1. sleeping in the worker
2. returning a count of throttle events in the results, so that the driver can initially collect aggregate stats and maybe later tune the sleep window for subsequent tasks.
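The incremental-save idea above can be sketched on a single machine in plain Python. This is only an illustration under stated assumptions: `run`, `load_done`, `save_done`, the JSON checkpoint layout, and the `fetch`/`store` callables are all hypothetical names, and a batch of 900 requests stands in for roughly 15 minutes of work at one request per second.

```python
import json
import os
import tempfile
import time
from typing import Callable, List


def load_done(checkpoint_path: str) -> int:
    """Return how many URLs were already processed, or 0 if no checkpoint exists."""
    if not os.path.exists(checkpoint_path):
        return 0
    with open(checkpoint_path) as f:
        return json.load(f)["done"]


def save_done(checkpoint_path: str, done: int) -> None:
    """Write the checkpoint to a temp file, then rename it over the old one."""
    tmp = checkpoint_path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"done": done}, f)
    os.replace(tmp, checkpoint_path)


def run(urls: List[str],
        fetch: Callable[[str], str],
        store: Callable[[str, str], None],
        checkpoint_path: str,
        interval: float = 1.0,
        batch: int = 900,
        sleep: Callable[[float], None] = time.sleep) -> int:
    """Fetch urls in order, pausing between requests; checkpoint after each batch.

    Returns the number of URLs processed in this call.
    """
    done = load_done(checkpoint_path)
    for i in range(done, len(urls)):
        store(urls[i], fetch(urls[i]))
        if (i + 1) % batch == 0 or i + 1 == len(urls):
            save_done(checkpoint_path, i + 1)
        if i + 1 < len(urls):
            sleep(interval)
    return len(urls) - done


# Demo with stand-in fetch/store functions and no real sleeping.
results = {}
path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
urls = [f"https://api.example.invalid/item/{n}" for n in range(5)]
processed = run(urls, fetch=lambda u: u.upper(), store=results.__setitem__,
                checkpoint_path=path, batch=2, sleep=lambda s: None)
print(processed, load_done(path))  # 5 5
```

After a crash, up to one batch of requests is repeated on restart, which is acceptable only if the calls are safe to repeat (as with the GETs discussed above).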

Stream Analytics Query working but no output to table

I have a problem with my Stream Analytics job. I'm pulling events from an IoT Hub and grouping them in time windows based on their custom timestamps; I've already written a query that does this correctly. But the problem is that it just doesn't write anything into my output table (a NoSQL table on my Storage Account).
The query runs without problems in the query editor (when testing with a sample input file) and produces the correct output, but when running 'for real', it doesn't output anything (the output table remains empty). I've even tried renaming the table and outputting to a blob storage, but no dice. Here's the query:
SELECT
    'general' AS partitionKey,
    MIN(ID_frame) AS rowKey,
    DATEADD(second, 1, DATEADD(hour, -3, System.TimeStamp)) AS window_start,
    System.TimeStamp AS window_end,
    COUNT(ID_frame) AS device_count
INTO
    [IoT-Hub-output-table]
FROM
    [IoT-Hub-input] TIMESTAMP BY custom_timestamp
GROUP BY TumblingWindow(Duration(hour, 3), Offset(second, -1))
The interesting part is that, if I omit any windowing in my query, then the table output works just fine.
I've been beating my head against the wall about this for a few days now, so I think I've already tried most of the obvious things.
As you are using a TumblingWindow of 3 hours, it means you will get a single output every 3 hours which contains an aggregate of all the events within that period.
So did you already wait for 3 hours for the first output to be generated?
I would try and set the window smaller, and try again to see if the output works correctly.
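To see why nothing can appear early, here is a rough Python sketch of when a tumbling window closes. It assumes windows are aligned to the Unix epoch and include their end but not their start; both are assumptions made for illustration, not a description of how the Stream Analytics engine is implemented, and `window_end` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_end(event_time: datetime, size: timedelta,
               offset: timedelta = timedelta(0)) -> datetime:
    """End of the tumbling window (start, end] that contains event_time.

    Window ends are assumed to fall at EPOCH + k * size + offset.
    """
    shifted = event_time - EPOCH - offset
    k = -(-shifted // size)  # ceiling division on timedeltas
    return EPOCH + k * size + offset


event = datetime(2024, 1, 1, 0, 10, tzinfo=timezone.utc)
print(window_end(event, timedelta(hours=3), timedelta(seconds=-1)))
# 2024-01-01 02:59:59+00:00
```

Under these assumptions, an event stamped 00:10 belongs to a window that only closes at 02:59:59, so its aggregate cannot be emitted before then.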
It turns out the query did output into my table, but with a delay I didn't expect: I was waiting 20-30 minutes at most, while the first insertions only began a little more than half an hour in. So I was cancelling the Analytics job before any output was produced and falsely assuming it just wouldn't output anything.
I found this to be the case after I noticed that 'sometimes' (when the job had been running for long enough) there appeared to be some output. In those output records I noticed the big delay between my custom timestamp field and the general timestamp field (which the engine uses to record when the entity was last updated).

Azure Data Factory Data-Set Slicing

I have some trouble understanding slicing (Dataset Availability) in Azure Data Factory. Let's say I have a source dataset which never changes. Then I for some reason set up hourly slicing for my source data set. Will each slice then be identical? What is the point of using slices at all in such case (i.e. why is it Required)?
Or another case, let's say my source dataset is appended with new data continuously (for example an event log). And each morning I want to do some analysis on all history of that log. Should I then set up daily slicing? Will each slice include the full history or just the last day?
The slices are the intervals in which the pipeline is executed within the period defined in the start and end properties of the pipeline.
If you have a fixed source and you execute an activity more than once, it will always use the same source (because it does not change). Let's say you set the start time and end time to span a day, and set the frequency to 1 hour: the activity will be executed 24 times. You will have 24 slices, all using the same data source.
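The slice arithmetic can be illustrated with a short Python sketch. It only models the intervals, not Data Factory itself; the `slices` function is a hypothetical helper and the dates are arbitrary.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def slices(start: datetime, end: datetime,
           frequency: timedelta) -> List[Tuple[datetime, datetime]]:
    """Split [start, end) into consecutive intervals of length frequency."""
    out = []
    current = start
    while current < end:
        out.append((current, min(current + frequency, end)))
        current += frequency
    return out


# One day of hourly slices.
day = slices(datetime(2024, 1, 1), datetime(2024, 1, 2), timedelta(hours=1))
print(len(day))  # 24
```

Each tuple is one slice window; whether the activity reads only that window or the whole source is decided by the activity's own logic.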
For your second scenario, if the data keeps changing, you can set the frequency to once a day. What gets processed depends on the activity you define in the pipeline: for example, the pipeline might delete the old source once it finishes processing, or the activity might contain logic that takes only the new data.
