Migrating Node.js jobs to Airflow

I am looking at migrating several Node.js jobs to Apache Airflow.
These jobs log to standard output. I am new to Airflow and have set it up running in Docker. Ideally, we would update these jobs to use connections provided by Airflow, but I'm not sure that will be possible.
We have succeeded in running the job by installing Node.js into the Airflow container and invoking the script from a BashOperator:
t1 = BashOperator(
    task_id='task_1',
    bash_command='/usr/bin/nodejs /usr/local/airflow/dags/test.js',
    dag=dag)
Would this be a good approach? Or would writing a Node.js operator be better?
I also thought of putting the Node.js code behind an HTTP service, which would be my preferred approach, but then we lose the logs.
Any thoughts on how best to architect this in Airflow?

The bash approach is feasible, but it is going to be very hard to maintain the Node.js dependencies.
I would migrate the code to containers and then use the DockerOperator / KubernetesPodOperator.
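For example, a minimal sketch of the DockerOperator approach, assuming the Node.js job has been packaged into a hypothetical image called my-node-job and that the Docker provider package is installed (import paths and argument names vary a bit across Airflow versions):

from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

# Sketch only: `my-node-job:latest` and the script path inside it are hypothetical.
# The container logs to stdout, which Airflow captures as the task log.
with DAG(
    dag_id="node_job",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    run_node_job = DockerOperator(
        task_id="run_node_job",
        image="my-node-job:latest",
        command="node /app/test.js",
        docker_url="unix://var/run/docker.sock",
    )

Because the DockerOperator streams the container's stdout into the task log, the jobs keep the logging behaviour they already have, which is the main thing lost with the HTTP-service approach.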

Related

Running a Spark application as a scheduled job

We have written a program to fetch data from different sources, make modifications, and write the modified data into a MySQL database. The program uses Apache Spark for the ETL process and makes use of the Spark Java API for this. We will be deploying the live application on YARN or Kubernetes.
I need to run the program as a scheduled job, say with an interval of five minutes. I did some research and got different suggestions from blogs and articles, such as a plain cron job, AWS Glue, Apache Airflow, etc., for scheduling a Spark application. From my reading, it seems I can't run my code (Spark Java API) using AWS Glue, as it supports only Python and Scala.
Can someone provide insights or suggestions on this? Which is the best option for running a Spark application (on Kubernetes or YARN) as a scheduled job?
Is there an option for this in Amazon EMR? Thanks in advance.
The best option, I think, and the one I have used before, is a cron job, either:
From inside your container with crontab -e, with logging set up in case of failure, such as:
SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin:/spark/bin
40 13 * * * . /PERSIST_DATA/02_SparkPi_Test_Spark_Submit.sh > /PERSIST_DATA/work-dir/cron_logging/cronSpark 2>&1
Or with a Kubernetes CronJob; see the documentation for the different settings:
https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/
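For the Kubernetes route, here is a minimal sketch using the official kubernetes Python client (the image name is hypothetical, and it assumes a cluster and client recent enough to have the batch/v1 CronJob API):

from kubernetes import client, config

config.load_kube_config()

cronjob = client.V1CronJob(
    api_version="batch/v1",
    kind="CronJob",
    metadata=client.V1ObjectMeta(name="spark-submit-job"),
    spec=client.V1CronJobSpec(
        schedule="*/5 * * * *",  # every five minutes, as in the question
        job_template=client.V1JobTemplateSpec(
            spec=client.V1JobSpec(
                template=client.V1PodTemplateSpec(
                    spec=client.V1PodSpec(
                        restart_policy="Never",
                        containers=[
                            client.V1Container(
                                name="spark-submit",
                                image="my-spark-image:latest",  # hypothetical image containing Spark and the script
                                command=["/bin/bash", "/PERSIST_DATA/02_SparkPi_Test_Spark_Submit.sh"],
                            )
                        ],
                    )
                )
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_cron_job(namespace="default", body=cronjob)

The equivalent YAML manifest from the linked documentation does the same thing; the Python client is just convenient if the rest of your tooling is already in Python.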

How to pass node modules between jobs in Concourse

I have a Concourse pipeline for a Node.js application with multiple jobs (unit tests, etc.). Currently, I am doing a yarn install in every job. I would prefer to do it in just one job and then pass those node modules to the other jobs as needed. Is there a way to do this without having to push the modules to an S3 bucket?
I'll ask your question in a slightly different way: is there a reason you need to have multiple jobs? Would it logically make sense for them to be different tasks in the same job? If you did that, you could share outputs between tasks.

How to deploy a Spark job to an EMR YARN cluster from Jenkins?

I have several Spark jobs on an EMR cluster using YARN that must run on a regular basis and are submitted from Jenkins. Currently the Jenkins machine SSHes into the master node on EMR, where a copy of the code sits in a folder ready to be executed. I would like to be able to clone my repo into the Jenkins workspace and submit the code from Jenkins to be executed on the cluster. Is there a simple way to do this? What is the best way to deploy Spark jobs from Jenkins?
You can use this REST API to make HTTP requests from Jenkins to start/stop the jobs.
If you have Python available in Jenkins, implementing a script using Boto3 is a good, easy, flexible and powerful option.
You can manage EMR (and therefore Spark) by creating a full cluster or by adding steps to an existing one.
Using the same library, you can also manage the other AWS services.
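As a rough illustration of the Boto3 route, the sketch below adds a spark-submit step to an existing cluster; the region, cluster ID, main class and S3 path are placeholders, and the Jenkins node is assumed to have AWS credentials configured:

import boto3

emr = boto3.client("emr", region_name="us-east-1")  # region is a placeholder

# Add a spark-submit step to an already-running EMR cluster.
response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster ID
    Steps=[
        {
            "Name": "my-spark-job",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",  # EMR's built-in command runner
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "--class", "com.example.MyJob",  # placeholder main class
                    "s3://my-bucket/jars/my-spark-job.jar",  # placeholder artifact built by Jenkins
                ],
            },
        }
    ],
)
print("Submitted step IDs:", response["StepIds"])

Jenkins can build the jar, upload it to S3, and then run a script like this, so nothing needs to be copied to the master node over SSH.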

Scheduler for jobs executing Apache Spark SQL on Bluemix

I am using Apache Spark on Bluemix.
I want to implement a scheduler for SparkSQL jobs. I saw this link to a blog that describes scheduling, but it's not clear how I update the manifest. Maybe there is some other way to schedule my jobs.
The manifest file guides the deployment of Cloud Foundry (cf) apps. So in your case, it sounds like you want to deploy a cf app that acts as a SparkSQL scheduler and use the manifest file to declare that your app doesn't need any of the web-app routing stuff, or anything else for user-facing apps, because you just want to run a background scheduler. This is all well and good, and the cf docs will help you make that happen.
However, you cannot run a SparkSQL scheduler against the Bluemix Spark service today, because it only supports Jupyter notebooks through the Data-Analytics section of Bluemix; i.e., only a notebook UI. You need a Spark API you could drive from your scheduler cf app, e.g. a spark-submit type thing where you can create your Spark context and then run programs, like the SparkSQL you mention. This API is supposed to be coming to the Apache Spark Bluemix service.
UPDATE: spark-submit was made available sometime around the end of 1Q16. It is a shell script, but inside it makes REST calls via curl. The REST API does not yet seem to be supported, but you could either call the script from your scheduler, or take the risk of calling the REST API directly and hope it doesn't change and break you.
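To illustrate the "call the script from your scheduler" option, here is a minimal sketch of a background cf app that periodically shells out to the downloaded spark-submit script; the script arguments, jar name and interval are all placeholders, so adapt them to whatever the script you downloaded actually expects:

import subprocess
import time

# Placeholder invocation: the real spark-submit.sh from the Bluemix Spark service
# takes its own set of flags, so adjust this command accordingly.
SUBMIT_CMD = [
    "./spark-submit.sh",
    "--class", "com.example.SparkSqlJob",
    "my-sparksql-job.jar",
]

def run_job():
    # Run the wrapper script and surface failures in the cf app's own logs.
    result = subprocess.run(SUBMIT_CMD, capture_output=True, text=True)
    print(result.stdout)
    if result.returncode != 0:
        print("spark-submit failed:", result.stderr)

if __name__ == "__main__":
    # Naive fixed-interval loop; a real scheduler app might use APScheduler or cron instead.
    while True:
        run_job()
        time.sleep(60 * 60)  # hourly, purely as an example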

Apache Spark application deployment best practices

I have a couple of use cases for Apache Spark applications/scripts, generally of the following form:
General ETL use case -
more specifically a transformation of a Cassandra column family containing many events (think event sourcing) into various aggregated column families.
Streaming use case -
real-time analysis of the events as they arrive in the system.
For (1), I'll need to kick off the Spark application periodically.
For (2), just kick off the long-running Spark Streaming process at boot time and let it run.
(Note: I'm using Spark Standalone as the cluster manager, so no YARN or Mesos.)
I'm trying to figure out the most common / best practice deployment strategies for Spark applications.
So far the options I can see are:
Deploying my program as a jar and running the various tasks with spark-submit, which seems to be the way recommended in the Spark docs. Some thoughts about this strategy:
how do you start/stop tasks - just using simple bash scripts?
how is scheduling managed? - simply use cron?
any resilience? (e.g. Who schedules the jobs to run if the driver server dies?)
Creating a separate webapp as the driver program (a rough sketch of this option follows the list below):
creates a Spark context programmatically to talk to the Spark cluster
allows users to kick off tasks through an HTTP interface
uses Quartz (for example) to manage scheduling
could use a cluster with ZooKeeper leader election for resilience
Spark job server (https://github.com/ooyala/spark-jobserver)
I don't think there's much benefit over (2) for me, as I don't (yet) have many teams and projects talking to Spark, and would still need some app to talk to the job server anyway
no scheduling built in as far as I can see
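To make option (2) a bit more concrete, here is a rough Python sketch of such a driver webapp, using PySpark, Flask and APScheduler as stand-ins for the JVM stack and Quartz; the master URL, data paths and schedule are hypothetical:

from apscheduler.schedulers.background import BackgroundScheduler
from flask import Flask, jsonify
from pyspark.sql import SparkSession

# Driver app that holds a long-lived Spark session against the standalone cluster.
spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")  # hypothetical standalone master
    .appName("etl-driver-webapp")
    .getOrCreate()
)

app = Flask(__name__)

def run_etl():
    # Placeholder ETL: aggregate raw events and write the result back out.
    events = spark.read.parquet("/data/raw_events")
    events.groupBy("event_type").count().write.mode("overwrite").parquet("/data/aggregates")

@app.route("/jobs/etl", methods=["POST"])
def trigger_etl():
    run_etl()  # users can kick off the task over HTTP
    return jsonify({"status": "submitted"})

if __name__ == "__main__":
    scheduler = BackgroundScheduler()
    scheduler.add_job(run_etl, "interval", hours=1)  # scheduled runs, Quartz-style
    scheduler.start()
    app.run(host="0.0.0.0", port=8080)

Resilience would still need something on top of this (e.g. running two instances with leader election), which is exactly the ZooKeeper point above.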
I'd like to understand the general consensus with respect to a simple but robust deployment strategy; I haven't been able to determine one by trawling the web so far.
Thanks very much!
Even though you are not using Mesos for Spark, you could have a look at:
- Chronos, offering a distributed and fault-tolerant cron
- Marathon, a Mesos framework for long-running applications
Note that this doesn't mean you have to move your Spark deployment to Mesos; e.g. you could just use Chronos to trigger the spark-submit.
I hope I understood your problem correctly and this helps you a bit!
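As an illustration of that last point, here is a hedged sketch that registers a recurring spark-submit with Chronos over HTTP; the host, port and /scheduler/iso8601 endpoint are assumptions based on Chronos's documented API, and the job fields are placeholders, so check your Chronos version's docs before relying on it:

import requests

# Assumed Chronos endpoint; verify the host, port and path for your installation.
CHRONOS_URL = "http://chronos.example.com:4400/scheduler/iso8601"

job = {
    "name": "etl-spark-job",
    "owner": "team@example.com",
    # ISO 8601 repeating interval: repeat forever, every hour, from the given start time.
    "schedule": "R/2024-01-01T00:00:00Z/PT1H",
    "epsilon": "PT15M",
    "command": (
        "/opt/spark/bin/spark-submit --master spark://spark-master:7077 "
        "--class com.example.EtlJob /opt/jobs/etl.jar"
    ),
}

resp = requests.post(CHRONOS_URL, json=job)
resp.raise_for_status()
print("Chronos accepted the job definition:", resp.status_code)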
