I am trying to copy a backup file provided by an HTTP source. The download URL is only valid for 60 seconds, and the Copy step is timing out before it can complete (timeout set to 1 minute 1 second). It completes on occasion but is very inconsistent: when it succeeds, the step queues for around 40 seconds; other times it is queued for over a minute and the link has expired by the time it eventually gets to downloading the file. The download is a single zipped JSON file of less than 100 KB.
Both Source and Sink datasets are using a Managed VNet IR we have created (it must be used on the Sink due to company policy); using the AutoResolve IR on the Source takes even longer to queue.
I've tried every variation of 'Max concurrent connections', 'DIU' and 'Degree of copy parallelism' in the Copy activity I can think of, and none of them seem to have any effect. It appears to be random whether the queue time is short enough for the download to succeed.
Is there any way to speed up the queue process to try and get more consistent successful downloads?
Both Source and Sink datasets are using a Managed VNet IR we have created (must be used on the Sink due to company policy); using the AutoResolve IR on the Source takes even longer to queue.
This is pretty confusing. If your Source and Sink are in a Managed VNet, your IR should also be in the same Managed VNet for better security and performance.
As per this official document:
By design, Managed VNet IR takes longer queue time than Azure IR as we are not reserving one compute node per service instance, so there is a warm-up for each copy activity to start, and it occurs primarily on VNet join rather than Azure IR.
Since there is no reserved node, there is no way to speed up the queue process.
I am reading 10 million records from BigQuery, doing some transformation, and creating a .csv file; the same .csv stream is then uploaded to an SFTP server using Node.js.
This job takes approximately 5 to 6 hours to complete when run locally.
The solution has been deployed on GCP Cloud Run, but after 2 to 3 seconds Cloud Run closes the container with a 503 error.
Please find below the configuration of GCP Cloud Run:
Autoscaling: Up to 1 container instance
CPU allocated: default
Memory allocated: 2Gi
Concurrency: 10
Request timeout: 900 seconds
Is GCP Cloud Run a good option for a long-running background process?
You can use a VM instance with your container deployed and perform your job on it. At the end, kill or stop your VM.
But, personally, I prefer a serverless solution and approach, like Cloud Run. However, long-running jobs on Cloud Run will come one day! Until then, you have to deal with the 60-minute limit or use another service.
As a workaround, I propose that you use Cloud Build. Yes, Cloud Build for running any container in it. I wrote an article on this: I ran a Terraform container on Cloud Build, but in reality you can run any container.
Set the timeout correctly, take care of the default service account and the roles assigned to it, and, something not yet available on Cloud Run, choose the number of CPUs (1, 8 or 32) for the processing to speed up your job.
Want a bonus? You get 120 free minutes per day and per billing account (be careful, it's not per project!).
Update: 2021-Oct
Cloud Run supports background activities.
Configure CPU to be always-allocated if you use background activities
Background activity is anything that happens after your HTTP response has been delivered. To determine whether there is background activity in your service that is not readily apparent, check your logs for anything that is logged after the entry for the HTTP request.
Configure CPU to be always-allocated
If you want to support background activities in your Cloud Run service, set your Cloud Run service CPU to be always allocated so you can run background activities outside of requests and still have CPU access.
Is GCP Cloud Run a good option for a long-running background process?
Not a good option, because your container is 'brought to life' by an incoming HTTP request, and as soon as the container responds (e.g. sends something back), Google assumes the processing of the request is finished and cuts off the CPU.
Which may explain this:
The solution has been deployed on GCP Cloud Run, but after 2 to 3 seconds Cloud Run closes the container with a 503 error.
You can try using an Apache Beam pipeline deployed via Cloud Dataflow. Using Python, you can perform the task with the following steps:
Stage 1. Read the data from the BigQuery table.
beam.io.Read(beam.io.BigQuerySource(query=your_query, use_standard_sql=True))
Stage 2. Upload the Stage 1 result as a CSV file to a GCS bucket.
beam.io.WriteToText(file_path_prefix="",
                    file_name_suffix='.csv',
                    header='list of csv file headers')
Stage 3. Call a ParDo function which then takes the CSV file created in Stage 2 and uploads it to the SFTP server. You can refer to this link.
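To make the three stages concrete, here is a minimal sketch of how such a pipeline could be wired together. The project, bucket, query and column names are placeholders, and beam.io.BigQuerySource mirrors the snippet above (it is tied to older Beam SDK versions; newer SDKs use beam.io.ReadFromBigQuery instead). Treat it as a sketch, not a drop-in solution.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholders: project, bucket, query and column list are assumptions.
options = PipelineOptions(
    runner="DataflowRunner",
    project="your-project",
    region="us-central1",
    temp_location="gs://your-bucket/tmp",
)

columns = ["col1", "col2"]  # hypothetical schema

with beam.Pipeline(options=options) as p:
    (
        p
        # Stage 1: read rows from BigQuery with standard SQL.
        | "ReadFromBQ" >> beam.io.Read(
            beam.io.BigQuerySource(
                query="SELECT col1, col2 FROM `your-project.your_dataset.your_table`",
                use_standard_sql=True))
        # Turn each row dict into one CSV line.
        | "ToCsvLine" >> beam.Map(lambda row: ",".join(str(row[c]) for c in columns))
        # Stage 2: write the lines as CSV shards on GCS.
        | "WriteCsv" >> beam.io.WriteToText(
            file_path_prefix="gs://your-bucket/export/results",
            file_name_suffix=".csv",
            header=",".join(columns))
    )

# Stage 3 (after the pipeline finishes): push the CSV shards to the SFTP server,
# e.g. by listing gs://your-bucket/export/results*.csv and uploading each file
# with an SFTP client such as paramiko.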
You may consider a serverless, event-driven approach:
configure a Google Storage trigger for a Cloud Function that runs the transformation
extract/export from BigQuery to the CF trigger bucket - this is the fastest way to get BigQuery data out
Sometimes data exported that way may be too large to be suitable for Cloud Function processing, due to restrictions like the max execution time (currently 9 min) or the 2 GB memory limit.
In that case, you can split the original data file into smaller pieces and/or push them to Pub/Sub with storage mirror.
All that said, we've used CFs to process a billion records, from building Bloom filters to publishing data to Aerospike, in under a few minutes end to end.
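A minimal sketch of the two pieces of that approach is below; the project, dataset, bucket and function names are hypothetical, and the function uses the 1st-gen GCS background-trigger signature.

# Piece 1: export the BigQuery data to the trigger bucket (fastest way out of BigQuery).
from google.cloud import bigquery

bq = bigquery.Client(project="your-project")          # hypothetical project
destination = "gs://cf-trigger-bucket/export-*.csv"   # hypothetical bucket; * lets BQ shard the output
extract_job = bq.extract_table(
    bigquery.TableReference.from_string("your-project.your_dataset.your_table"),
    destination,
    job_config=bigquery.ExtractJobConfig(destination_format="CSV"),
)
extract_job.result()  # wait for the export to finish


# Piece 2: Cloud Function (deployed with --trigger-bucket=cf-trigger-bucket)
# that runs the transformation on each exported shard as it lands.
def transform_shard(event, context):
    """Background Cloud Function triggered by a finalized GCS object."""
    from google.cloud import storage

    client = storage.Client()
    blob = client.bucket(event["bucket"]).blob(event["name"])
    data = blob.download_as_bytes()
    # ... transform `data` and forward it (e.g. to Pub/Sub or the SFTP server) ...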
I will try using Dataflow to create the .csv file from BigQuery and will upload that file to GCS.
I have a pipeline with a few copy activities. Some of those activities are in charge of copying large amounts of data from a storage account to the same storage account, but in a compressed form (I'm talking about a few TB of data).
After running the pipeline for a few hours, I noticed that some activities show "Queue" time on the monitoring blade, and I was wondering what the reason for that "Queue" time could be, and more importantly whether I'm also being billed for that time, because from what I understand my ADF isn't doing anything during it.
Can someone shed some light? :)
(Posting this as an answer because of the comment chars limit)
After a long discussion with Azure Support and reaching out to someone on the ADF product team, I got some answers:
1 - The queue time is not being billed.
2 - Initially, the ADF orchestration system puts the job in a queue, and it accrues "queue time" until the infrastructure picks it up and starts the processing part.
3 - In my case the queue time kept increasing after the job started because of a bug in the underlying backend executor (it uses Azure Batch). Apparently the executors were crashing and my job was suffering from "re-pickup" time, thus increasing the queue time. This explained why, after some time, I started to see the execution time and the amount of transferred data decreasing. The ETA for this bugfix is the end of the month. Additionally, the job I was executing timed out (after 7 days), and after checking the billing I confirmed that I wasn't charged a dime for it.
Based on the chart in the ADF Monitor, you can find the same metrics in the example. In fact, these are the metrics in the executionDetails parameter: Queue Time + Transfer Time = Duration.
More details on the stages the copy activity goes through, and the corresponding steps, duration, used configurations, etc. It's not recommended to parse this section as it may change.
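If you want to see the queue/transfer breakdown programmatically rather than in the monitoring blade, a sketch along these lines could work, assuming the azure-identity and azure-mgmt-datafactory Python packages; the resource names are placeholders, and the nested field names inside executionDetails are what I have observed rather than a stable contract.

from datetime import datetime, timedelta

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

# Hypothetical names: replace with your own subscription, resource group, factory and run id.
client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
runs = client.activity_runs.query_by_pipeline_run(
    "<resource-group>", "<factory-name>", "<pipeline-run-id>",
    RunFilterParameters(
        last_updated_after=datetime.utcnow() - timedelta(days=1),
        last_updated_before=datetime.utcnow(),
    ),
)

for run in runs.value:
    if run.activity_type != "Copy":
        continue
    # executionDetails is part of the copy activity output; detailedDurations,
    # queuingDuration and transferDuration may differ by service version.
    for detail in (run.output or {}).get("executionDetails", []):
        durations = detail.get("detailedDurations", {})
        print(run.activity_name,
              "queued:", durations.get("queuingDuration"),
              "transfer:", durations.get("transferDuration"))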
Please refer to Parallel Copy: the copy activity creates parallel tasks to transfer data internally. Activities are in an active state during both the queue time and the transfer time; they never stop during the queue time, so the whole duration is billed. I think it's an inevitable loss in the data transfer process that ADF absorbs internally. You could try adjusting the parallelCopies parameter to see if anything changes.
If you are concerned about the cost, you can submit feedback here to ask for a statement from the Azure team.
Situation:
A user with a TB worth of files in our Azure Blob Storage and gigabytes of data in our Azure databases decides to leave our service. At that point, we need to export all of his data into 2 GB packages and deposit them in blob storage for a short period (two weeks or so).
This should happen very rarely, and we're trying to cut costs. Where would it be optimal to implement a task that, over the course of a day or two, downloads the corresponding user's blobs (240 KB files) and zips them into the packages?
I've looked at a separate web app running a dedicated continuous WebJob, but WebJobs seem to shut down when the app unloads, and I need this to hibernate and not use resources when it's not running, so "Always on" is out. Plus, I can't seem to find a complete tutorial on how to implement the interface so that I can cancel the running task and so on.
Our last resort is abandoning web apps (we have three of them) and running it all on a virtual machine, but that comes at a greater cost. Is there a method I've missed that could get the job done?
This sounds like a job for a serverless model on Azure Functions to me. You get the compute scale you need without paying for idle resources.
I don't believe that there are any time limits on running the function (unlike AWS Lambda), but even so you'll probably want to implement something to split the job up first so it can be processed in parallel (and to provide some resilience to failures). Queue these tasks up and trigger the function off the queue.
It's worth noting that Functions are still in "preview" at the moment, though.
Edit - I've just noticed your comment on file size... that might be a problem, but in theory you should be able to use local storage rather than doing it all in memory.
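To make the queue-plus-function idea concrete, here is a minimal sketch in Python (treat it as illustrative; the original discussion predates this programming model). The queue name, container names and message shape are assumptions; each queue message describes one package worth of blobs so the packages can be built in parallel, and the zip is assembled in local temp storage rather than in memory.

import os
import zipfile

import azure.functions as func
from azure.storage.blob import BlobServiceClient

app = func.FunctionApp()


@app.queue_trigger(arg_name="msg", queue_name="export-packages",
                   connection="AzureWebJobsStorage")
def build_package(msg: func.QueueMessage) -> None:
    # Hypothetical message shape: {"user": "user-123", "package": 7, "blobs": ["a.dat", ...]}
    job = msg.get_json()

    service = BlobServiceClient.from_connection_string(os.environ["AzureWebJobsStorage"])
    source = service.get_container_client("user-data")   # assumed source container
    exports = service.get_container_client("exports")    # assumed destination container

    # Build the zip on local temp storage instead of keeping everything in memory.
    zip_path = os.path.join("/tmp", f"{job['user']}-{job['package']}.zip")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as package:
        for blob_name in job["blobs"]:
            data = source.download_blob(blob_name).readall()
            package.writestr(blob_name, data)

    with open(zip_path, "rb") as fh:
        exports.upload_blob(name=os.path.basename(zip_path), data=fh, overwrite=True)
    os.remove(zip_path)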
I'm currently trying to divide a processor-intensive simulation task into a few hundred chunks that are processed in parallel within Azure. I thought that Azure WebSites which offer an easy-to-setup dedicated virtual machine and WebJobs with their easy-to-use abstraction over a Storage Queue would fit my bill perfectly.
I have the following Azure setup, which gets freshly created by my code each time I run it:
A single storage account
One storage queue with job descriptions
A storage container with static data
A storage container for the results (unique files per job)
n (for example 8) "Standard" WebSites, meaning there are n different *.azurewebsites.net URIs
One WebJob on each WebSite running continuously (8 WebJobs in the example) using the WebJobs SDK (JobHost)
Each job description is <1k
Each job consists of about 100k of Blob-input-data
Each result is about 100k of Blob-output-data
With the current scaling, each job runs for about one and a half minutes
Here is the signature of the job.
public static void RunGeant4Simulation(
    [QueueTrigger("simulationjobs")] JobDescription jobDescription,               // job description taken from the queue
    [Blob("input/{Archive}", FileAccess.Read)] Stream archive,                    // zip archive with the executable and input data
    [Blob("result/{Name}-{Energy}-output.zip", FileAccess.Write)] Stream output,  // result blob, one file per job
    [Blob("result/{Name}-{Energy}-log.dat")] TextWriter debug                     // per-job timing/debug log
)
The code then goes ahead and sets up a WebSite-local, job-specific directory, extracts the zip archive containing an executable, runs this executable with Process.Start, and writes the captured output to the blob. Everything the process accesses is available on the machine.
The debug TextWriter is for capturing timing information within the job.
What I expected to see was that each WebSite would take a job from the queue, run it, post the results into the container and take the next job.
What I'm actually seeing is that only a single WebSite is actually running jobs while the remaining ones just idle, although the WebJob is reported as being started and running on each site. The net result is the same number of jobs finished per minute as with one WebSite.
Here is a log of a run where two WebSites "decided" to participate in running jobs: simulation-log.zip. The storage account mentioned in the connection strings has already been deleted, so I did not remove the access keys from the logs.
I have added some timing instrumentation to the WebJob, and from that I can see that sometimes running the executable takes twice or three times (pretty much exactly) the time it would take in a "normal" run:
// stopwatch, processStartInfo and outputBytes are declared earlier in the job.
stopwatch.Start();
using (var process = Process.Start(processStartInfo))
{
    debug.WriteLine("After Starting Process: {0}", DateTime.UtcNow);

    // Read the simulation's stdout and wait for it to finish.
    var outputData = process.StandardOutput.ReadToEnd();
    process.WaitForExit();

    stopwatch.Stop();
    debug.WriteLine("Process Finished: {0} {1}", DateTime.UtcNow, stopwatch.Elapsed);

    outputBytes = Encoding.UTF8.GetBytes(outputData);
}
The stopwatch shows times of 1:15, 2:27, 3:43, etc.
But some of the jobs that take longer than expected also show the expected time on the stopwatch.
However, in both cases, jobs run on another WebSite instead, and results show up in the storage account's result container.
In the end, the number of jobs finished per minute does not change.
Update
Today, I went one step further and created a separate storage account per WebSite and distributed the jobs manually between 8 queues in 8 storage accounts, each for one of the 8 WebSites. That means that, from an outside perspective, nothing had anything in common besides accidentally running the same code.
This did not help.
It still looks like I have one single processor that has to run all WebJobs on whatever WebSites I create, no matter how independent they are. I have captured the CPU Time as shown in the portal.
My thinking about Azure WebSites was actually wrong and that's why I got confused:
With non-Free WebSites, there are two things that scale completely independently:
The computing power available for all those WebSites (a "ServerFarm" in the SDK). This means you select a machine size (Small to Large) and a number of instances, and these are responsible for running all your Basic or Standard WebSites.
The software running on a URI, such as ASP.NET, PHP, or WebJobs.
In my thinking, WebSites were directly linked to the virtual machine(s) backing them, but there is no such direct connection.
I have now a ServerFarm with n Large instances.
In this ServerFarm, there are n WebSites.
Each WebSite has 5 WebJobs, so that the 4 Processors in a Large instance can be used thoroughly.
Now, everything scales as expected.
I'm trying to figure out the best performing approach when writing thousands of small Blobs to Azure Storage.
The application scenario is the following:
thousands of files are being created or overwritten by a constantly running Windows service installed on a Windows Azure VM
writing to the Temporary Storage available to the VM, the service can reach more than 9,000 file creations per second
file sizes range between 1 KB and 60 KB
on other VMs running the same software, other files are being created at the same rate and with the same criteria
given the need to build and keep updated a central repository, another service running on each VM copies the newly created files from the Temporary Storage to Azure Blobs
other servers should then read the Azure Blobs in their most recent version
Please note that, for many constraints that I'm not listing for brevity, it's not currently possible to modify the main service to directly create blobs instead of files on the temporary file system... and from what I'm currently seeing, it would mean a slower creation rate, which is not acceptable per the original requirements.
This copy operation, which I'm testing in a tight loop on 10,000 files, seems to be limited to 200 blob creations per second. I've been able to reach this result after tweaking the sample code named "Windows Azure ImportExportBlob" found here: http://code.msdn.microsoft.com/windowsazure/Windows-Azure-ImportExportB-9d30ddd5 with the async suggestions found in this answer: Using Parallel.Foreach in a small azure instance
I obtained this apparent maximum of 200 blob creations per second on an extra-large VM with 8 cores, setting the "maxConcurrentThingsToProcess" semaphore accordingly. The network utilization during the test is at most 1% of the available 10 Gbps shown in Task Manager. This means roughly 100 Mbps of the 800 Mbps that should be available on that VM size.
I see that the total size copied during the elapsed time gives me around 10 MB/sec.
Is there some limitation on the Azure Storage traffic you can generate, or should I use a different approach when writing so many small files?
@breischl Thank you for the scalability targets. After reading that post, I started searching for more target figures possibly prepared by Microsoft and found 4 posts (too many for my "reputation" to post here; the other 3 are parts 2, 3 and 4 of the same series):
http://blogs.microsoft.co.il/blogs/applisec/archive/2012/01/04/windows-azure-benchmarks-part-1-blobs-read-throughput.aspx
The first post contains an important hint: "You may have to increase the ServicePointManager.DefaultConnectionLimit for multiple threads to establish more than 2 concurrent connections with the storage."
I've set this to 300, re-run the test and seen an important increase in MB/s. As I previously wrote, I thought I was hitting a limit in the underlying blob service when "too many" threads were writing blobs; this was the confirmation of my worries. Thus, I removed all the changes made to the code to work with a semaphore and replaced them again with a parallel.for to start as many blob upload operations as possible. The result has been awesome: 61 MB/s writing blobs and 65 MB/s reading.
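For comparison, here is a minimal sketch of the same idea, many concurrent small-blob uploads, using the current Python SDK (azure-storage-blob) and a thread pool in place of the original C# sample's Parallel.For and the ServicePointManager tweak; the connection string, container name and source directory are placeholders.

import os
from concurrent.futures import ThreadPoolExecutor

from azure.storage.blob import BlobServiceClient

# Hypothetical settings: adjust the connection string, container and source directory.
service = BlobServiceClient.from_connection_string(os.environ["AZURE_STORAGE_CONNECTION_STRING"])
container = service.get_container_client("central-repository")
source_dir = r"D:\temp-storage"

def upload_one(file_name: str) -> None:
    path = os.path.join(source_dir, file_name)
    with open(path, "rb") as fh:
        # overwrite=True because the service keeps re-copying files as they change.
        container.upload_blob(name=file_name, data=fh, overwrite=True)

files = os.listdir(source_dir)

# Hundreds of in-flight uploads play the role of raising
# ServicePointManager.DefaultConnectionLimit in the .NET sample.
with ThreadPoolExecutor(max_workers=300) as pool:
    list(pool.map(upload_one, files))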
The scalability target is 60 MB/s and I'm finally happy with the result.
Thank you all again for your answers.