I have a Concourse pipeline for a Node.js application with multiple jobs (unit test etc.). Currently, I am doing a yarn install in every job. I would prefer to do it in just one job and then pass those node modules to the other jobs as needed. Is there a way to do this without having to push the modules to an S3 bucket?
I'll ask your question in a slightly different way: is there a reason you need to have multiple jobs? Would they logically make sense as different tasks in the same job? If you did that, you could share outputs between tasks.
I am having difficulty figuring out how to run cron jobs in a single pod and not in all of them. I also have a MongoDB db-trigger that listens for any changes in the db; based on those changes, notifications are sent to users, and this is also getting executed multiple times.
I came across the solutions below, which do not fit my requirements:
Use a queue with the help of Redis or RabbitMQ
Create a separate microservice and run those jobs in a single pod
Thank you in advance.
I ended up using the bee-queue Node package to push all the jobs into Redis and process them there.
I am looking at migrating several Node.js jobs to Apache Airflow.
These jobs log to standard output. I am new to Airflow and have set it up running in Docker. Ideally, we would update these jobs to use connections provided by Airflow, but I'm not sure that will be possible.
We have succeeded in running a job by installing Node.js and invoking it from a BashOperator:
t1 = BashOperator(
    task_id='task_1',
    bash_command='/usr/bin/nodejs /usr/local/airflow/dags/test.js',
    dag=dag)
Would this be a good approach? Or would writing a Node.js operator be a better approach?
I also thought of putting the Node code behind an HTTP service, which would be my preferred approach, but then we lose the logs.
Any thoughts on how best to architect this in Airflow?
The bash approach is feasible, but it is going to be very hard to maintain the Node.js dependencies.
I would migrate the code to containers and use the DockerOperator / KubernetesPodOperator afterwards.
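For illustration, here is a minimal sketch of what the containerised version of the task above could look like, assuming the job and its node_modules are baked into a hypothetical my-registry/node-jobs image (the image name, DAG name and schedule are placeholders, and the DockerOperator import path varies by Airflow version):

from datetime import datetime

from airflow import DAG
from airflow.operators.docker_operator import DockerOperator  # Airflow 1.10-style import path

dag = DAG('node_jobs', start_date=datetime(2020, 1, 1), schedule_interval='@hourly')

run_node_job = DockerOperator(
    task_id='run_node_job',
    image='my-registry/node-jobs:latest',  # hypothetical image containing Node.js + node_modules
    command='node /app/test.js',           # the container's stdout/stderr is streamed into the task log
    auto_remove=True,                      # remove the container once the task finishes
    dag=dag,
)

A nice side effect is that the container's standard output ends up in the Airflow task logs, so the existing console logging keeps working.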
I am trying to run several EMR steps in parallel.
I saw other questions regarding this issue on SO, as well as googled options.
So, the things I have tried:
Configure CapacityScheduler with set of queues
Configure FairScheduler
Try to use AWS Data Pipeline with PARALLEL_FAIR_SCHEDULING, PARALLEL_CAPACITY_SCHEDULING
None of this worked for me: YARN created all the queues properly and submissions went to different queues, but EMR still ran just a single step at a time (one step was RUNNING, the rest were PENDING).
I also saw in one of the answers that steps are meant to be sequential, but that you can put several jobs inside a single step. I didn't manage to find a way to do this, and according to the UI there is no option for it.
I haven't tried submitting jobs to the YARN cluster directly (Submit Hadoop Jobs Interactively); I wanted to submit jobs from the AWS API, and I haven't found a way to do this from the API.
This is my CapacityScheduler configuration: (CapacityScheduler screenshot)
This is my steps configuration: (StepsConfiguration screenshot)
Might be late, but I hope this is helpful.
Spark provides an option specifying whether the caller (the step) will wait for the Spark application to complete after submission. If you set this value to false, the AWS EMR step will submit the application and return immediately:
spark.yarn.submit.waitAppCompletion: "false"
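For illustration, a rough sketch of submitting such steps through the AWS API with boto3 (the region, cluster ID, bucket and script names below are placeholders, not values from the question):

import boto3

emr = boto3.client('emr', region_name='us-east-1')  # region is an assumption

def spark_step(name, script):
    # command-runner.jar runs spark-submit on the master node. With
    # waitAppCompletion=false the step completes as soon as the application is
    # submitted, so the next step can start while the previous app is still
    # running on YARN.
    return {
        'Name': name,
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': [
                'spark-submit', '--deploy-mode', 'cluster',
                '--conf', 'spark.yarn.submit.waitAppCompletion=false',
                script,
            ],
        },
    }

emr.add_job_flow_steps(
    JobFlowId='j-XXXXXXXXXXXX',  # placeholder cluster ID
    Steps=[
        spark_step('job-a', 's3://my-bucket/jobs/job_a.py'),
        spark_step('job-b', 's3://my-bucket/jobs/job_b.py'),
    ],
)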
I have a number of Spark batch jobs, each of which needs to be run every x hours. I'm sure this must be a common problem, but there seems to be relatively little on the internet about best practice for setting this up. My current setup is as follows:
Build system (sbt) builds a tar.gz containing a fat jar + a script that will invoke spark-submit.
Once tests have passed, the CI system (Jenkins) copies the tar.gz to HDFS.
I set up a Chronos job to unpack the tar.gz to the local filesystem and run the script that submits to Spark.
This setup works reasonably well, but there are some aspects of step 3) that I'm not fond of. Specifically:
I need a separate script (executed by Chronos) that copies from HDFS, unpacks, and runs the spark-submit task. As far as I can tell Chronos can't run scripts from HDFS, so I have to keep a copy of this script on every Mesos worker, which makes deployment more complex than it would be if everything just lived on HDFS.
I have a feeling that I have too many moving parts. For example, I was wondering if I could create an executable jar that could submit itself (the args would be the Spark master and the main class), in which case I could do away with at least one of the wrapper scripts. Unfortunately, I haven't found a good way of doing this.
As this is a problem that everyone faces, I was wondering if anyone could suggest a better solution.
To download and extract the archive, you can use the Mesos fetcher by setting the uris field in the Chronos job config.
To do the same on the executor side, you can set the spark.executor.uri parameter in the default Spark conf.
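To make that concrete, here is a rough sketch of the relevant parts of such a Chronos job definition, written as a Python dict purely for illustration (the names, paths and schedule are placeholders, and exact field names can differ between Chronos versions):

import json

# Hypothetical Chronos job: the "uris" entry asks the Mesos fetcher to download
# the tar.gz into the task sandbox (and extract it, since it is a recognised
# archive type) before "command" runs, so nothing has to be pre-installed on
# the Mesos workers.
chronos_job = {
    "name": "spark-batch-job",
    "schedule": "R/2016-01-01T00:00:00Z/PT4H",       # repeat every 4 hours
    "uris": ["hdfs:///deploy/my-spark-job.tar.gz"],   # fetched by the Mesos fetcher
    "command": "./my-spark-job/submit.sh",            # script shipped inside the archive
    "cpus": 0.5,
    "mem": 512,
}

print(json.dumps(chronos_job, indent=2))  # this JSON is what gets posted to the Chronos API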
I have created an Azure HDInsight cluster using PowerShell. Now I need to install some custom software on the worker nodes that is required for the mappers I will be running using Hadoop streaming. I haven't found any PowerShell command that could help me with this task. I can prepare a custom job that will setup all the workers, but I'm not convinced that this is the best solution. Are there better options?
edit:
With AWS Elastic MapReduce there is an option to install additional software in a bootstrap action that is defined when you create a cluster. I was looking for something similar.
You can use a bootstrap action to install additional software and to change the configuration of applications on the cluster. Bootstrap actions are scripts that are run on the cluster nodes when Amazon EMR launches the cluster. They run before Hadoop starts and before the node begins processing data.
from: Create Bootstrap Actions to Install Additional Software
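For comparison, this is roughly what that looks like when creating an EMR cluster through boto3 (the region, bucket, script path, instance types and roles below are placeholders):

import boto3

emr = boto3.client('emr', region_name='us-east-1')  # region is an assumption

# The bootstrap action runs the given script on every node before Hadoop starts,
# which is where custom software gets installed on EMR.
emr.run_job_flow(
    Name='streaming-cluster',
    ReleaseLabel='emr-5.30.0',
    Instances={
        'MasterInstanceType': 'm5.xlarge',
        'SlaveInstanceType': 'm5.xlarge',
        'InstanceCount': 3,
    },
    BootstrapActions=[{
        'Name': 'install-custom-software',
        'ScriptBootstrapAction': {
            'Path': 's3://my-bucket/bootstrap/install.sh',  # placeholder install script
            'Args': [],
        },
    }],
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
)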
The short answer is that you don't. It's not ideal from a caching perspective, but you ought to be able to bundle all your job dependencies into the MapReduce jar, which is distributed across the cluster for you by YARN (part of Hadoop). This is, broadly speaking, transparent to the end user, as it's all handled through the job submission process.
If you need something large which is a shared dependency across many jobs, and you don't want it copied out every time, you can keep it on wasb:// storage, and reference that in a class path, but that might cause you complexity if you are for instance using the .NET Streaming API.
I've just heard from a colleague that I need to update my Azure PowerShell, because a new cmdlet, Add-AzureHDInsightScriptAction, was recently added and it does just that.
Customize HDInsight clusters using Script Action