I have a pipeline built with luigi. One luigi task downloads data from an external service, based on a txt file listing the information to fetch. Because the task makes over 3,000 requests to the external service, it takes a long time to finish and the pipeline often fails.
What could be done to improve the scalability of the pipeline and to make sure it doesn't fail? Threading, multiprocessing? What solution would be optimal so the pipeline doesn't fail on big, long-running tasks?
I didn't provide any code example because I would need a general approach, not an example based one.
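For illustration only, one common pattern (a sketch with hypothetical task names, file names, and URL) is to split the txt file into batches and fan the work out across many small luigi tasks, so a single failure only retries its own batch:

```python
import luigi
import requests  # assumed HTTP client; swap in whatever the external service needs


class FetchBatch(luigi.Task):
    """Download one batch of items listed in the input txt file (hypothetical layout)."""
    input_file = luigi.Parameter(default="items.txt")
    batch_id = luigi.IntParameter()
    batch_size = luigi.IntParameter(default=100)

    def output(self):
        return luigi.LocalTarget(f"data/batch_{self.batch_id}.txt")

    def run(self):
        with open(self.input_file) as f:
            lines = [line.strip() for line in f if line.strip()]
        start = self.batch_id * self.batch_size
        batch = lines[start:start + self.batch_size]
        with self.output().open("w") as out:
            for item in batch:
                # Hypothetical endpoint; replace with the real external service call.
                resp = requests.get(f"https://example.com/api/{item}", timeout=30)
                resp.raise_for_status()
                out.write(resp.text + "\n")


class FetchAll(luigi.WrapperTask):
    """Fan out over all batches of the input file."""
    input_file = luigi.Parameter(default="items.txt")
    batch_size = luigi.IntParameter(default=100)

    def requires(self):
        with open(self.input_file) as f:
            n_items = sum(1 for line in f if line.strip())
        n_batches = (n_items + self.batch_size - 1) // self.batch_size
        return [FetchBatch(input_file=self.input_file,
                           batch_id=i,
                           batch_size=self.batch_size)
                for i in range(n_batches)]
```

Running this with e.g. `luigi --module <your_module> FetchAll --workers 8 --local-scheduler` gives parallelism across batches without threading inside a single task, and batches that already completed are skipped when the pipeline is retried.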
I'm taking my first steps with prefect, and I'm trying to see what its degrees of freedom are. To this end, I'm investigating whether prefect supports running different tasks on different schedules in the same python process. For example, Task A might have to run every 5 minutes, while Task B might run twice a day with a Cron scheduler.
It seems to me that schedules are associated with a Flow, not with a task, so to do the above, one would have to create two distinct one-task Flows, each with its own schedule. But even then, given that running a flow is a blocking operation, I can't see how to "start" both flows concurrently (or pseudo-concurrently; I'm perfectly aware the flows won't execute on separate threads).
Is there a built-in way of getting the tasks running on their independent schedules? I'm under the impression that there is a way to achieve this, but given my limited experience with prefect, I'm completely missing it.
Many thanks in advance for any pointers.
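For reference, the setup described above (two one-task Flows, each with its own schedule) might look roughly like this in Prefect 1.x; the task bodies and flow names are made up:

```python
from datetime import timedelta

from prefect import task, Flow
from prefect.schedules import IntervalSchedule, CronSchedule


@task
def task_a():
    print("Task A: runs every 5 minutes")


@task
def task_b():
    print("Task B: runs twice a day")


with Flow("flow-a", schedule=IntervalSchedule(interval=timedelta(minutes=5))) as flow_a:
    task_a()

# Cron schedule: 00:00 and 12:00 every day
with Flow("flow-b", schedule=CronSchedule("0 0,12 * * *")) as flow_b:
    task_b()

# flow_a.run() would block here, which is exactly the problem described above.
```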
You are right that schedules are associated with Flows and not Tasks, so the only place to add a schedule is a Flow. Running a Flow is a blocking operation if you are using the open-source Prefect core only. For production use cases, it's recommended to run your Flows against Prefect Cloud or Prefect Server. Cloud is the managed offering, and Server is the version you host yourself. Note that Cloud has a very generous free tier.
When using a backend, an agent kicks off each flow run in a new process, so running flows is no longer blocking.
To get started with a backend, you can check the docs here.
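As a rough sketch, assuming Prefect 1.x and a project named "example" that you have already created, registering both flows and letting a local agent pick up the scheduled runs could look like this:

```python
# Register both flows with the backend (Cloud or Server).
# The project name "example" is an assumption; create it first with
# `prefect create project example`.
flow_a.register(project_name="example")
flow_b.register(project_name="example")

# Then start an agent in a separate terminal:
#   prefect agent local start
# The agent polls the backend and launches each scheduled flow run in its own
# process, so neither flow blocks the other.
```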
This Prefect Discourse topic discusses a very similar problem and shows how you could solve it using a flow-of-flows orchestrator pattern.
One way to approach this is to leverage caching to avoid recomputing tasks that need to run at a lower frequency than the main flow.
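For example, in Prefect 1.x a task accepts a `cache_for` duration, so a flow scheduled every 5 minutes would only recompute that task roughly twice a day (a sketch with hypothetical task names):

```python
from datetime import timedelta

from prefect import task, Flow
from prefect.schedules import IntervalSchedule


@task(cache_for=timedelta(hours=12))
def slow_twice_daily_task():
    # Expensive work; the cached result is reused for 12 hours.
    return "refreshed"


@task
def frequent_task(upstream):
    print(f"runs every 5 minutes, using: {upstream}")


with Flow("main", schedule=IntervalSchedule(interval=timedelta(minutes=5))) as flow:
    frequent_task(slow_twice_daily_task())
```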
Deploying our company's product involves several tasks. For example:
task 1 copies some build files to server A
task 2 copies some build files to server B
Task 1 or 2 could fail, and we need to redeploy only the failed task because each task takes a long time to finish.
I can split the tasks into different stages, but we have a long list of tasks, and if we include both staging and production it will be difficult to manage.
So my questions are:
Is there an easy way to redeploy only some of the tasks without editing and disabling the other tasks in the stage?
Or is there a better way to organize multiple stages into one group, like 'Staging' or 'Production', so I can get a better visualization of the release stages?
thanks
Update:
Thanks #jessehouwing
Found there is an option when I click redeploy. See screenshot below.
You can group each stage into one or more jobs. You can easily retry a job without having to rerun the whole stage. You will get the overhead of each job fetching sources or downloading artifacts, and to use the output of a previous job you need to publish its results. One advantage is that jobs can run in parallel, so your overall duration may actually be shorter that way.
I have a hard time searching for this, as I'm getting a lot of results about the parallelism of the steps inside the pipeline itself, which is not my problem (I'm concerned about parallelism one level above the pipeline steps). I was looking through Google/SO and the Atlassian documentation, but I'm probably searching for it under the wrong term.
I have two steps in my pipeline: build the HTML files and deploy them. The deployment just does a git push of the final HTML files to the final repository. This works very well, but my concern is that I might accidentally make multiple commits and pushes quickly one after the other. Depending on their content, the builds might then finish in a different order than they started, resulting in an out-of-order deployment, which I want to avoid.
There might be more robust ways of deployment, but because this is a fairly simple project, I wouldn't want to overcomplicate it and I would like to keep the deployment as it is. I would just limit my CI pipeline to running one job/task at a time, and if I push faster than it can build, have new runs block/wait for the previous one to finish.
In essence, I want my CI queue size to be just one job, so that incoming jobs triggered by commits are blocking instead of asynchronous. Is there some way or workaround to achieve something like that?
I have two pipelines (also called "build definitions") in azure pipelines, one is executing system tests and one is executing performance tests. Both are using the same test environment. I have to make sure that the performance pipeline is not triggered when the system test pipeline is running and vice versa.
What I've tried so far: I can access the Azure DevOps REST API to check whether a build is running for a certain definition. So it would be possible for me to implement a job that runs a script before the actual pipeline. The script would poll the REST API every second to check the build status of the other pipeline and time out after e.g. 1 hour.
However, this seems quite hacky to me. Is there a better way to block a build pipeline while another one is running?
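For reference, the polling script described above might be sketched like this in Python; the organization, project, PAT, and definition ID are placeholders, and the Builds - List endpoint is queried with `statusFilter=inProgress`:

```python
import time

import requests

ORGANIZATION = "my-org"        # placeholder
PROJECT = "my-project"         # placeholder
OTHER_DEFINITION_ID = 42       # build definition ID of the other pipeline (placeholder)
PAT = "personal-access-token"  # placeholder; read from a secret variable in practice
TIMEOUT_SECONDS = 3600

url = (
    f"https://dev.azure.com/{ORGANIZATION}/{PROJECT}/_apis/build/builds"
    f"?definitions={OTHER_DEFINITION_ID}&statusFilter=inProgress&api-version=6.0"
)

deadline = time.time() + TIMEOUT_SECONDS
while time.time() < deadline:
    resp = requests.get(url, auth=("", PAT), timeout=30)
    resp.raise_for_status()
    if resp.json().get("count", 0) == 0:
        break  # no in-progress build of the other pipeline; safe to continue
    time.sleep(1)
else:
    raise SystemExit("Timed out waiting for the other pipeline to finish")
```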
If your project is private, the Microsoft-hosted CI/CD parallel job limit is one free parallel job that can run for up to 60 minutes each time, until you've used 1,800 minutes (30 hours) per month.
The self-hosted CI/CD parallel job limit is one self-hosted parallel job. Additionally, for each active Visual Studio Enterprise subscriber who is a member of your organization, you get one additional self-hosted parallel job.
Currently, there is no setting to control the parallel job limit per agent pool. However, there is a similar problem on the community forum, and an answer has been marked as accepted. I recommend checking whether that answer is helpful for you. Here is the link.
A few questions regarding the HDInsight jobs approach.
1) How do I schedule an HDInsight job? Is there any ready-made solution for it? For example, if my system constantly collects a large number of new input files that we need to run a map/reduce job on, what is the recommended way to implement ongoing processing?
2) From a price perspective, it is recommended to remove the HDInsight cluster when no job is running. As I understand it, there is no way to automate this process if we decide to run the job daily? Any recommendations here?
3) Is there a way to ensure that the same files are not processed more than once? How do you solve this issue?
4) I might be mistaken, but it looks like every HDInsight job requires a new output storage folder to store the reducer results in. What is the best practice for merging those results so that reporting always works on the whole data set?
OK, there are a lot of questions in there! Here are, I hope, a few quick answers.
There isn't really a way of scheduling job submission in HDInsight, though of course you can schedule a program to run the job submissions for you. Depending on your workflow, it may be worth taking a look at Oozie, which can be a little awkward to get going on HDInsight, but should help.
On the price front, I would recommend that if you're not using the cluster, you destroy it and bring it back when you need it (those compute hours can really add up!). Note that this will lose anything you have in HDFS, which should be mainly intermediate results; any output or input data held in asv storage will persist in an Azure Storage account. You can certainly automate this with the CLI tools, or the REST interface used by the CLI tools (see my answer on Hadoop on Azure Create New Cluster; the first one is out of date).
I would do this by making sure I only submit the job once for each file, relying on Hadoop to handle the retry and reliability side, which removes the need to manage retries in your application.
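A minimal sketch of that idea, tracking already-submitted files in a simple manifest so each file is submitted exactly once (the submit function, folder, and manifest path are hypothetical):

```python
from pathlib import Path

MANIFEST = Path("processed_files.txt")  # hypothetical manifest location
INCOMING = Path("incoming")             # hypothetical folder of new input files


def submit_job(input_file: Path) -> None:
    # Placeholder for whatever submits the MapReduce job for one file.
    print(f"submitting job for {input_file}")


def process_new_files() -> None:
    processed = set(MANIFEST.read_text().splitlines()) if MANIFEST.exists() else set()
    for input_file in sorted(INCOMING.glob("*")):
        if input_file.name in processed:
            continue  # already submitted once; Hadoop handles retries from here
        submit_job(input_file)
        # Record the file only after a successful submission.
        with MANIFEST.open("a") as manifest:
            manifest.write(input_file.name + "\n")


if __name__ == "__main__":
    process_new_files()
```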
Once you have the outputs from your initial processes, if you want to reduce them to a single output for reporting, the best bet is probably a secondary MapReduce job with those outputs as its inputs.
If you don't care about the individual intermediate jobs, you can chain these directly in one MapReduce job (which can contain as many map and reduce steps as you like) through job chaining; see Chaining multiple MapReduce jobs in Hadoop for a Java-based example. Sadly, the .NET API does not currently support this form of job chaining.
However, you may be able to just use the ReducerCombinerBase class if your case allows for a Reducer->Combiner approach.
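For what it's worth, the chaining idea itself is easy to see in Python with the mrjob library (a sketch for illustration only; this is not the .NET SDK discussed above, and the job itself is a made-up word-count example):

```python
from mrjob.job import MRJob
from mrjob.step import MRStep


class WordCountTopWord(MRJob):
    """Two chained steps: count words, then pick the most frequent one."""

    def steps(self):
        return [
            MRStep(mapper=self.mapper_count, reducer=self.reducer_sum),
            MRStep(reducer=self.reducer_top),
        ]

    def mapper_count(self, _, line):
        for word in line.split():
            yield word.lower(), 1

    def reducer_sum(self, word, counts):
        # Send every (count, word) pair to a single key so the next step sees them all.
        yield None, (sum(counts), word)

    def reducer_top(self, _, count_word_pairs):
        count, word = max(count_word_pairs)
        yield word, count


if __name__ == "__main__":
    WordCountTopWord.run()
```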