This sounds counterintuitive, but what would be the pros and cons of updating the Airflow database by deploying a job to Airflow?
I am considering this as an option for setting up role-based access by directly making updates to the database, and because Airflow is a scheduler, it would make sense to schedule this process on Airflow.
Thanks
We actually do this to purge the logs table periodically, along with some other general Airflow housekeeping. The downsides aren't too bad, assuming you tested your code elsewhere first and you're not running the process on an extremely frequent schedule.
I would recommend reading the airflow.models module and its classes to see how they're used, and leveraging them as examples for your process; it'll help make sure you're doing things correctly and save you from needless duplication.
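For anyone curious what that looks like in practice, here is a minimal sketch of such a housekeeping DAG, assuming Airflow 1.10-style imports; the DAG id, the daily schedule, and the 30-day retention window are arbitrary choices for illustration, not anything from the answer above.

```python
# Minimal sketch of a housekeeping DAG that purges old rows from the Airflow
# metadata database using the models Airflow ships with. Assumes Airflow
# 1.10-style imports; the DAG id, schedule, and 30-day retention window are
# arbitrary illustrative choices.
from datetime import datetime, timedelta

from airflow import DAG, settings
from airflow.models import Log
from airflow.operators.python_operator import PythonOperator


def purge_old_logs(**_):
    """Delete Log rows older than the retention window in one transaction."""
    cutoff = datetime.utcnow() - timedelta(days=30)
    session = settings.Session()
    try:
        deleted = (
            session.query(Log)
            .filter(Log.dttm < cutoff)
            .delete(synchronize_session=False)
        )
        session.commit()
        print(f"Purged {deleted} log rows older than {cutoff}")
    finally:
        session.close()


with DAG(
    dag_id="airflow_db_housekeeping",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",  # keep this infrequent, as noted above
    catchup=False,
) as dag:
    PythonOperator(task_id="purge_old_logs", python_callable=purge_old_logs)
```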
I'm taking my first steps with Prefect, and I'm trying to see what its degrees of freedom are. To this end, I'm investigating whether Prefect supports running different tasks on different schedules in the same Python process. For example, Task A might have to run every 5 minutes, while Task B might run twice a day on a cron schedule.
It seems to me that schedules are associated with a Flow, not with a task, so to do the above, one would have to create two distinct one-task Flows, each with its own schedule. But even then, given that running a flow is a blocking operation, I can't see how to "start" both flows concurrently (or pseudo-concurrently; I'm perfectly aware the flows won't execute on separate threads).
Is there a built-in way of getting the tasks running on their independent schedules? I'm under the impression that there is a way to achieve this, but given my limited experience with Prefect, I'm completely missing it.
Many thanks in advance for any pointers.
You are right that schedules are associated with Flows and not Tasks, so the only place to add a schedule is a Flow. Running a Flow is a blocking operation if you are using only the open-source Prefect core. For production use cases, it's recommended to run your Flows against Prefect Cloud or Prefect Server. Cloud is the managed offering, and Server is what you use when you host it yourself. Note that Cloud has a very generous free tier.
When using a backend, an agent kicks off each flow run in a new process, so it is not blocking.
To get started with a backend, you can check the docs here.
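To make that concrete, here is a rough sketch of the two one-task Flows approach, assuming the Prefect 1.x API this answer refers to; the flow names, cron expression, and project name are illustrative only.

```python
# Sketch of two one-task flows, each with its own schedule, assuming the
# Prefect 1.x API. Flow names, the cron expression, and the project name
# are illustrative assumptions.
from datetime import timedelta

from prefect import task, Flow
from prefect.schedules import CronSchedule, IntervalSchedule


@task
def task_a():
    print("Task A: runs every 5 minutes")


@task
def task_b():
    print("Task B: runs twice a day")


with Flow("flow-a", schedule=IntervalSchedule(interval=timedelta(minutes=5))) as flow_a:
    task_a()

with Flow("flow-b", schedule=CronSchedule("0 6,18 * * *")) as flow_b:
    task_b()

# Registered against a backend (Cloud or Server), an agent picks up each
# scheduled run in its own process, so neither flow blocks the other.
flow_a.register(project_name="example")
flow_b.register(project_name="example")
```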
This Prefect Discourse topic discusses a very similar problem and shows how you could solve it using a flow-of-flows orchestrator pattern.
One way to approach it is to leverage Caching to avoid recomputation of certain tasks that require lower-frequency scheduling than the main flow.
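As a rough illustration of the caching idea, assuming Prefect 1.x, where a task accepts a cache_for duration; the 12-hour window and the task bodies are made up:

```python
# Illustrative sketch of caching in Prefect 1.x: a task given cache_for
# reuses its result across runs of a frequently scheduled flow. The
# 12-hour window and the task bodies are invented for the example.
from datetime import timedelta

from prefect import task, Flow
from prefect.schedules import IntervalSchedule


@task(cache_for=timedelta(hours=12))
def slow_daily_lookup():
    # Recomputed at most every 12 hours even though the flow runs often.
    return "reference data"


@task
def frequent_work(reference):
    print(f"Using {reference}")


with Flow("mixed-frequency", schedule=IntervalSchedule(interval=timedelta(minutes=5))) as flow:
    frequent_work(slow_daily_lookup())
```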
I have a web application, where users can schedule different jobs.
I'm not sure how to proceed with this.
All the Node.js schedulers out there basically read the schedule from within the code. I can of course implement this cron-like schedule so that it's read from a database, but I'm not sure if that's the most effective way?
If I back the solution with a database, I would need to query that database, let's say every second, to see if there are any scheduled jobs that need to be handled. I can't read them once a day, because new jobs might be added on a regular basis.
Keeping them in memory doesn't seem very efficient either?
Am I looking for a different kind of technology to handle this, than a scheduler+database?
We are talking about 10,000 jobs for the time being (as a maximum). They are mostly related to sending emails and/or giving notifications within the application itself.
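To make the polling approach described above concrete, here is a rough, language-agnostic sketch, written in Python purely for illustration even though the question is about Node.js; the jobs table, its columns, and the handler are hypothetical.

```python
# Rough sketch of the database-backed polling approach described above.
# Python is used only for illustration; the `jobs` table, its columns,
# the 1-second poll interval, and the handler are hypothetical.
import sqlite3
import time
from datetime import datetime


def send_email_or_notification(payload):
    # Placeholder for the real work (sending an email / in-app notification).
    print(f"Handling job: {payload}")


def poll_and_run(db_path="jobs.db", interval_seconds=1.0):
    conn = sqlite3.connect(db_path)
    while True:
        now = datetime.utcnow().isoformat()
        # Fetch jobs whose scheduled time has passed and that are still pending.
        due = conn.execute(
            "SELECT id, payload FROM jobs WHERE run_at <= ? AND status = 'pending'",
            (now,),
        ).fetchall()
        for job_id, payload in due:
            # Mark as running first so a second poller does not pick it up too.
            conn.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (job_id,))
            conn.commit()
            send_email_or_notification(payload)
            conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))
            conn.commit()
        time.sleep(interval_seconds)
```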
I have a team where many members have permission to submit Spark jobs to YARN (the resource manager) from the command line. It's hard to track who is using how many cores, who is using how much memory, etc. Now I'm looking for a piece of software, a framework, or something else that could help me monitor the parameters each member used. It would act as a bridge between the client and YARN, and I could then use it to filter the submit commands.
I did take a look at MLflow, and I really like MLflow Tracking, but it was designed for ML training processes. I wonder if there is an alternative suited to my purpose, or any other solution to the problem.
Thank you!
My recommendation would be to build such a tool yourself, as it's not too complicated.
Have a wrapper script around spark-submit that logs the requested resources in a DB; when the Spark job finishes, the wrapper knows to mark those resources as released. This could be done really easily.
In addition, you can even block new spark-submits if your team has already asked for too many resources.
And since you build it yourself, it's really flexible; you can even create "sub-teams" or anything you want.
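As a sketch of what such a wrapper might look like (the SQLite table, the flags captured, and the "block if over budget" threshold are illustrative assumptions, not a known tool):

```python
# Minimal sketch of the spark-submit wrapper described above: record the
# requested resources in a DB, run the real spark-submit, then mark the job
# finished. The SQLite table, the flags captured, and the over-budget rule
# are all illustrative assumptions.
import getpass
import sqlite3
import subprocess
import sys
from datetime import datetime


def parse_flag(args, flag, default=None):
    """Return the value following a spark-submit flag like --executor-cores."""
    return args[args.index(flag) + 1] if flag in args else default


def main():
    args = sys.argv[1:]
    cores = int(parse_flag(args, "--executor-cores", 1))
    executors = int(parse_flag(args, "--num-executors", 1))
    memory = parse_flag(args, "--executor-memory", "1g")

    conn = sqlite3.connect("spark_usage.db")
    conn.execute(
        """CREATE TABLE IF NOT EXISTS submissions
           (user TEXT, started TEXT, finished TEXT,
            cores INTEGER, executors INTEGER, memory TEXT)"""
    )

    # Example policy: block the submission if this user already holds too
    # many cores across unfinished jobs (the threshold is arbitrary).
    in_use = conn.execute(
        "SELECT COALESCE(SUM(cores * executors), 0) FROM submissions "
        "WHERE user = ? AND finished IS NULL",
        (getpass.getuser(),),
    ).fetchone()[0]
    if in_use + cores * executors > 100:
        sys.exit("Denied: you are already using too many cores.")

    cur = conn.execute(
        "INSERT INTO submissions VALUES (?, ?, NULL, ?, ?, ?)",
        (getpass.getuser(), datetime.utcnow().isoformat(), cores, executors, memory),
    )
    conn.commit()

    # Hand everything through to the real spark-submit.
    result = subprocess.run(["spark-submit"] + args)

    # Release the reservation once the job is done.
    conn.execute(
        "UPDATE submissions SET finished = ? WHERE rowid = ?",
        (datetime.utcnow().isoformat(), cur.lastrowid),
    )
    conn.commit()
    sys.exit(result.returncode)


if __name__ == "__main__":
    main()
```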
A few questions regarding the HDInsight jobs approach.
1) How do I schedule an HDInsight job? Is there any ready-made solution for it? For example, if my system will constantly collect a large number of new input files that we need to run a map/reduce job on, what is the recommended way to implement ongoing processing?
2) From a price perspective, it is recommended to remove the HDInsight cluster when no job is running. As I understand it, there is no way to automate this process if we decide to run the job daily? Any recommendations here?
3) Is there a way to ensure that the same files are not processed more than once? How do you solve this issue?
4) I might be mistaken, but it looks like every HDInsight job requires a new output storage folder to store the reducer results in. What is the best practice for merging those results so that reporting always works on the whole data set?
OK, there are a lot of questions in there! Here are, I hope, a few quick answers.
There isn't really a way of scheduling job submission in HDInsight, though of course you can schedule a program to run the job submissions for you. Depending on your workflow, it may be worth taking a look at Oozie, which can be a little awkward to get going on HDInsight, but should help.
On the price front, I would recommend that if you're not using the cluster, you destroy it and bring it back again when you need it (those compute hours can really add up!). Note that this will lose anything you have in HDFS, which should be mainly intermediate results; any output or input data held in ASV storage will persist in an Azure Storage account. You can certainly automate this using the CLI tools, or the REST interface used by the CLI tools (see my answer on Hadoop on Azure Create New Cluster; the first answer there is out of date).
I would do this by making sure I only submitted the job once for each file, and relying on Hadoop to handle the retry and reliability side, removing the need to manage any retries in your application.
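A simple illustration of that "submit each file exactly once" bookkeeping, independent of HDInsight (the record file and the submit_job callable are placeholders, not HDInsight APIs):

```python
# Simple illustration of submitting each input file exactly once: keep a
# record of files already submitted and skip them on the next pass. The
# record file and the submit_job callable are placeholders.
import json
import os

SUBMITTED_RECORD = "submitted_files.json"


def load_submitted():
    if os.path.exists(SUBMITTED_RECORD):
        with open(SUBMITTED_RECORD) as f:
            return set(json.load(f))
    return set()


def submit_new_files(input_files, submit_job):
    """Submit a job for each unseen file; Hadoop handles retries from there."""
    submitted = load_submitted()
    for path in input_files:
        if path in submitted:
            continue
        submit_job(path)  # placeholder for the real job submission
        submitted.add(path)
        with open(SUBMITTED_RECORD, "w") as f:
            json.dump(sorted(submitted), f)
```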
Once you have the outputs from your initial processes, if you want to reduce them to a single output for reporting the best bet is probably a secondary MapReduce job with the outputs as its inputs.
If you don't care about the individual intermediate jobs, you can just chain these directly into one MapReduce job (which can contain as many map and reduce steps as you like) through job chaining; see Chaining multiple MapReduce jobs in Hadoop for a Java-based example. Sadly, the .NET API does not currently support this form of job chaining.
However, you may be able to just use the ReducerCombinerBase class if your case allows for a Reducer->Combiner approach.
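For comparison, here is what the chaining pattern looks like in Python with the mrjob library (not something the answer above mentions; just an illustration of multiple map/reduce steps in one job):

```python
# Illustration of the job-chaining pattern using the Python mrjob library:
# two steps run back to back in a single job, a word count followed by a
# step that keeps only words above an arbitrary threshold.
from mrjob.job import MRJob
from mrjob.step import MRStep


class ChainedWordCount(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_words, reducer=self.reducer_count),
            MRStep(reducer=self.reducer_filter),
        ]

    def mapper_words(self, _, line):
        for word in line.split():
            yield word.lower(), 1

    def reducer_count(self, word, counts):
        yield word, sum(counts)

    def reducer_filter(self, word, totals):
        total = sum(totals)
        if total >= 10:  # arbitrary threshold for the example
            yield word, total


if __name__ == "__main__":
    ChainedWordCount.run()
```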
I am in the process of beginning to write a worker queue for node using node's cluster API and mongoose.
I noticed that a lot of libs already exist that do this, but using Redis and forking. Is there a good reason to fork versus using the cluster API?
Edit: and now I've also found this: https://github.com/xk/node-threads-a-gogo -- too many options!
I would rather not add Redis to the mix since I already use Mongo. Also, my requirements are very loose; I would like persistence but could go without it for the first version.
Part two of the question:
What are the most stable/used Node.js worker queue libs out there today?
Wanted to follow up on this. My solution ended up being a roll-your-own cluster implementation where some of my cluster workers are dedicated job workers (i.e. they just have code to work on jobs).
I use agenda for job scheduling.
Cron-type jobs are scheduled by the cluster master. The rest of the jobs are created in the non-worker processes as they are needed (verification emails, etc.).
Before that I was using kue, but dropped it because the rest of my app uses MongoDB and I didn't like having to use Redis just for job scheduling.
Have you tried https://github.com/rvagg/node-worker-farm?
It is very lightweight and doesn't require a separate server.
I personally am partial to cluster-master.
https://github.com/isaacs/cluster-master
The reason I like cluster-master is that it does very little besides add logic for forking your process, give you the ability to manage the number of processes you're running, and throw in a little bit of logging/recovery to boot! I find overly bloated process-management libraries tend to be unstable, and sometimes even slow things down.
This library will be good for you if the following are true:
Your module is largely asynchronous
You don't have a huge number of different types of events triggering
The events that fire have small amounts of work to do, but you have lots of similar events firing (things like web servers)
The reasoning behind the above list is also why threads-a-gogo may be good for you, for the opposite reasons. If you have a few spots in your code where there is a lot of work to do within your event loop, something like threads-a-gogo, which launches a "thread" specifically for that work, is awesome, because you aren't determining ahead of time how many workers to spawn, but rather spawning them to do work when needed. Note: this can also be bad if there is the potential for a lot of them to spawn; if you start launching too many processes, things can actually bog down. But I digress.
To summarize, if your module is largely asynchronous already, what you really want is a worker pool, to minimize the downtime when your process is not listening for events and to maximize the amount of processor you can use. Unless you have a very busy synchronous call, a single Node event loop will have trouble taking advantage of even a single core of a processor. Under this circumstance, you are best off with cluster-master. What I recommend is doing a little benchmarking and seeing how much of a single core your program can use under the worst-case scenario. Let's say this is 33% of one core. If you have a quad-core machine, you then tell cluster-master to launch 12 workers.
Hope this helped!