I am reading 10 million records from BigQuery and doing some transformation and creating the .csv file, the same .csv stream data I am uploading to SFTP server using Node.JS.
This job taking approximately 5 to 6 hrs to complete the request locally.
Solution has been delpoyed on GCP Cloud run but after 2 to 3 second cloud run is closing the container with 503 error.
Please find below configuration of GCP Cloud Run.
Autoscaling: Up to 1 container instances
CPU allocated: default
Memory allocated: 2Gi
Concurrency: 10
Request timeout: 900 seconds
Is GCP Cloud Run is good option for long running background process?
You can use a VM instance with your container deployed and perform you job on it. At the end kill or stop your VM.
But, personally, I prefer serverless solution and approach, like Cloud Run. However, Long running job on Cloud Run will come, a day! Until this, you have to deal with the limit of 60 minutes or to use another service.
As workaround, I propose you to use Cloud Build. Yes, Cloud Build for running any container in it. I wrote an article on this. I ran a Terraform container on Cloud Build, but, in reality, you can run any container.
Set the timeout correctly, take care of default service account and assigned role, and, thing not yet available on Cloud Run, choose the number of CPUs (1, 8 or 32) for the processing and speed up your process.
Want a bonus? You have 120 minutes free per day and per billing account (be careful, it's not per project!)
Update: 2021-Oct
Cloudrun supports background activities.
Configure CPU to be always-allocated if you use background activities
Background activity is anything that happens after your HTTP response has been delivered. To determine whether there is background activity in your service that is not readily apparent, check your logs for anything that is logged after the entry for the HTTP request.
Configure CPU to be always-allocated
If you want to support background activities in your Cloud Run service, set your Cloud Run service CPU to be always allocated so you can run background activities outside of requests and still have CPU access.
Is GCP Cloud Run is good option for long running background process?
Not a good option because your container is 'brought to life' by incoming HTTP request and as soon as the container responds (e.g. sends something back), Google assumes the processing of the request is finished and cuts the CPU off.
Which may explain this:
Solution has been delpoyed on GCP Cloud run but after 2 to 3 second cloud run is closing the container with 503 error.
You can try using an Apache Beam pipeline deployed via Cloud Dataflow. Using Python, you can perform the task with the following steps:
Stage 1. Read the data from BigQuery table.
beam.io.Read(beam.io.BigQuerySource(query=your_query,use_standard_sql=True))
Stage 2. Upload Stage 1 result into a CSV file on a GCS bucket.
beam.io.WriteToText(file_path_prefix="", \
file_name_suffix='.csv', \
header='list of csv file headers')
Stage 3. Call a ParDo function which will then take CSV file created in Stage 2 and upload it to the SFTP server. You can refer this link.
You may consider a serverless, event-driven approach:
configure google storage trigger cloud function running transformation
extract/export BigQuery to CF trigger bucker - this is the fastest way to get BigQuery data out
Sometimes exported data in that way may be too large not be suitable in that form for Cloud Function processing, due to restriction like max execution time (9 min currently) or memory limitation 2GB,
In that case, you can split the original data file to smaller pieces and/or push then to Pub/Sub with storage mirror
All that said we've used CF to process a billion records from building bloom filters to publishing data to aerospike under a few minutes end to end.
I will try to use Dataflow for creating .csv file from Big Query and will upload that file to GCS.
Related
I have been recently challenged with an architectural problem. Basically, I developed a Node.js application that fetches three zip files from Census.gov (13 MB, 1.2 MB and 6.7 GB) which takes about 15 to 20 minutes. After the files are downloaded the application unzips these files and extracts needed data to an AWS RDS Database. The issue for me is that this application needs to run only one time each year. What would be the best solution for this kind of task? Also, the zip files are deleted after the processing is done.
I would use a cron job. You can use this website (https://crontab.guru/every-year) to determine the correct settings for the crontab.
0 0 1 12 1
This setting will run “At 00:00 on day-of-month 1 and on Monday in December.”
To run the nodeJS program you simply put node yourcode.js aftewards. So it would look like the code below. Where node is you may need to put the path to the node program, and where yourprogram.js is you simply need to add the path there as well.
0 0 1 12 1 node yourprogram.js
Hei, I would give u suggestion. But according what Services do you use. In example if using Google Cloud with Google Scheduller. If using Openshift or another u can use Cronjob. But it worst case configuration I think where u need make some yaml file deployment that need trigger to publisher/subscriber:
Make some subscriber, on services which can trigger by Google PubSub by Topic to do your task and after all executed publish to the broker (Google PubSuB) again.
And than make another subscriber to trigger deleting file after receive a publisher message if all task execute.
The Idea i suggest because the process like that, it best practices if using the Asyncrhrounus process.
Thanks,
I would look into AWS Batch service which can run a scheduled job on an EC2 instance (virtual machine) or Fargate (serverless container runner).
Alternative #2: Use AWS Lambda serverless function to execute a NodeJS script (no need to set up an EC2 Instance or Fargate). Lambda functions can be triggered by EventBridge Rules using cron expressions. With Lambda, you pay for number of executions and the execution time in 1ms increments, however this use case could be covered within the AWS Free Tier Lambda pricing. AWS Free Tier
Note on Lambda limits: Lambda execution time is limited to 15 minutes and 10GB of local storage maximum (source: Lambda Quotas). Lambda CPU is allocated in proportion to memory configuration, you may need to increase it to improve execution time. Lambda Memory Configuration
Alternative #3: You can build a state machine using AWS Step Functions to trigger Lambda functions in steps.
For example, a state machine can trigger three Lambda functions in parallel where each function downloads its corresponding .zip file from census.gov and stores it to an Amazon S3 bucket. When all functions complete, the state machine can progress to next step and trigger a fourth function to grab data from S3 for processing and loading into the database. Once the data has been processed and loaded, the final step function can delete the .zip files from S3 if you no longer need them. EventBridge can also be used here to execute the state machine using a cron expression. You can also use Amazon SNS to publish notifications (email/sms/http endpoint) to alert if any step fails/completes.
The simple solution is to Schedule AWS Lambda Functions Using CloudWatch Events
So, you will have an AWS lambda function that will download the .zip files in the S3 buckets, unzip it and extract the data to database. After that, the same function can empty the S3 buckets.
This function will be yearly trigger by CloudWatch Events.
For more information, check out this tutorial here
I need an Azure VM (Ubuntu) to do some task (java application) every 10 minutes. Because the task lasts usually less than a minute I would save money if could start the machine every 10 minutes and stop it when the task accomplishes. I learned that I can schedule start and stop times in automation account, but more optimal would be to stop the VM in the very moment that task is completed. Is there a simple way to do that?
This really sounds like a job for Azure Batch. If you are looking for an IaaS solution, Azure Batch will do the job for you. Have a look at it: https://azure.microsoft.com/en-gb/services/batch/#overview.
It allows you to use VM's with your preferred OS (in Azure Batch it is called a node), and run a set of tasks. Once finished, the VM will be de-allocated.
So each node runs a set of pools, in each pool you have a job, and in each job you can have tasks. A task can be for example a cmd line that runs a specific app. So for instance you could just run example.exe 1 2 on a windows OS or the equivalent command line for an Ubuntu OS.
The power here is that it will allocate the tasks to run on the VM when you add them to the job, and then the VM will be disposed off once finished, and you would only pay for the compute time.
The disadvantages of this is that it is a stateless VM, therefore anything that you need installing or storing you would have to use alternative methods. Azure Batch allows you to pre-install a program (for example your Java application) each time it initiates. Also if you are using files and/or expecting files to be created, you would need a blob storage to support this. So if you are expecting it to use a certain amount of files, store them on blob storage and then write back to the blob storage if your program is doing this.
Finally your scheduler, this really depends on how you want to deal with it, if you have a local server or a server on Azure that is already running 24/7 you can add a scheduled job to the scheduler and run a program that will add the task to the Azure Batch. Or if you don't mind using Azure Functions, you can just add a timer Azure Function that will add a task to the job. There are multiple ways of dealing with this, you may already have an existing solution.
Hope you find this useful!
This is what our WebApp for containers service do:
Take in multiple pdf files
Send them to Azure OCR service (which is rate-limited at 10 per sec)
Get the results back, do some processing and send back the response
Issue:
Throughout the process, we use multiprocessing to send the files to the OCR service and receive files back (Azure OCR is async in our case). Sometimes when scale hits, we tend to cross the 10 RPS limit.
One thing which can be done is, add a delay. However, our WebApp for container service autoscales on load.
So, the more it autoscales, the higher the delay need to be, as the number of hits increase due to an increase in server instances.
So, what is a good way to handle this issue?
Language: Python
I have a python code which reads data from one cloud system via rest api using the requests module and then writes data back to another cloud system via rest api . This code runs anywhere from 1 to 4 hours every week. Is there a place in Google Cloud Platform , I can execute this code on a periodic basis. Sort of like a scheduled batch job . Is there a serverless option to do this in App Engine . I know about the App engine cron service but seems like it is only for calling a URL regularly . Any thoughts ? Appreciate your help.
Google Cloud Scheduler could be the tool you are looking for. As it is mentioned in its documentation:
Cloud Scheduler is a fully managed enterprise-grade cron job scheduler. It allows you to schedule virtually any job, including batch, big data jobs, cloud infrastructure operations, and more. You can automate everything, including retries in case of failure to reduce manual toil and intervention.
Here you have the quickstart for Cloud Scheduler, and also another tutorial for Cron jobs.
You can use the Google Genomics API pipelines.run endpoint to run a long-running job on a Google Compute Engine virtual machine and then it will destroy the machine when it's done. If your job will run for less than 24 hours and it can handle a failure, then you can use a Preemptible VM to save cost.
Pipelines: Run
https://cloud.google.com/genomics/reference/rest/v2alpha1/pipelines/run
Preemptible Virtual Machines
https://cloud.google.com/preemptible-vms/
You could use Cloud Scheduler to kick off the job
Pipelines may be preferred to trying to use one of the serverless technologies because they don't tend to handle the long running jobs as well.
You can use AI Platform Training to run any arbitrary Python package — it doesn’t have to be a machine learning job.
I have a C# console application which extracts 15GB FireBird database file on a server location to multiple files and loads the data from files to SQLServer database. The console application uses System.Threading.Tasks.Parallel class to perform parallel execution of the dataload from files to sqlserver database.
It is a weekly process and it takes 6 hours to complete.
What is best option to move this (console application) process to azure cloud - WebJob or WorkerRole or Any other cloud service ?
How to reduce the execution time (6 hrs) after moving to cloud ?
How to implement the suggested option ? Please provide pointers or code samples etc.
Your help in detail comments is very much appreciated.
Thanks
Bhanu.
let me give some thought on this question of yours
"What is best option to move this (console application) process to
azure cloud - WebJob or WorkerRole or Any other cloud service ?"
First you can achieve the task with both WebJob and WorkerRole, but i would suggest you to go with WebJob.
PROS about WebJob is:
Deployment time is quicker, you can turn your console app without any change into a continues running webjob within mintues (https://azure.microsoft.com/en-us/documentation/articles/web-sites-create-web-jobs/)
Build in timer support, where WorkerRole you will need to handle on your own
Fault tolerant, when your WebJob fail, there is built-in resume logic
You might want to check out Azure Functions. You pay only for the processing time you use and there doesn't appear to be a maximum run time (unlike AWS Lambda).
They can be set up on a schedule or kicked off from other events.
If you are already doing work in parallel you could break out some of the parallel tasks into separate azure functions. Aside from that, how to speed things up would require specific knowledge of what you are trying to accomplish.
In the past when I've tried to speed up work like this, I would start by spitting out log messages during the processing that contain the current time or that calculate the duration (using the StopWatch class). Then find out which areas can be improved. The slowness may also be due to slowdown on the SQL Server side. More investigation would be needed on your part. But the first step is always capturing metrics.
Since Azure Functions can scale out horizontally, you might want to first break out the data from the files into smaller chunks and let the functions handle each chunk. Then spin up multiple parallel processing of those chunks. Be sure not to spin up more than your SQL Server can handle.