How to schedule a Spark job using crontab - apache-spark

I am trying to schedule a Spark job using crontab. Yes, Oozie is made for this purpose, but due to certain challenges I am not able to use it.
I have put the script in an sh file and am executing it from cron.
Cron:
/bin/sh /home/user/job.sh
job.sh
source /etc/hadoop/conf/hadoop.sh
source /hadoop/spark/conf/spark.sh
./sparksubmit .py file
I am getting the error below:
An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
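For reference, a minimal sketch of how a crontab entry plus wrapper script for spark-submit is commonly laid out; the schedule, paths, and file names below are assumptions, not the asker's actual setup. A py4j error naming None.org.apache.spark.api.java.JavaSparkContext usually means the JVM-side SparkContext never started, and under cron, which launches jobs with a very small environment, that is often an environment rather than a code problem, so the wrapper exports everything Spark and Hadoop need and uses absolute paths throughout:

# crontab -e entry: cron needs all five schedule fields; this runs daily at 01:00
0 1 * * * /bin/bash /home/user/job.sh >> /home/user/job.log 2>&1

# /home/user/job.sh (paths below are assumptions for the sketch)
#!/bin/bash
set -e
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk      # whatever JVM the cluster uses
export HADOOP_CONF_DIR=/etc/hadoop/conf
export SPARK_HOME=/hadoop/spark
export PATH="$SPARK_HOME/bin:$PATH"

# absolute paths everywhere: cron does not start in your home directory
"$SPARK_HOME/bin/spark-submit" \
  --master yarn \
  --deploy-mode client \
  /home/user/my_job.py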

Related

How to return/get a variable from a Spark job to a shell script?

I have a Spark job which I run using a shell script. My requirement is that I return a couple of variables from the Spark job/script, which I need to validate in the shell script so I can decide whether to exit 0 or exit 1 based on that validation. I am unable to receive the variables returned from Spark. How do I do this?
I am using "sh <script_name.sh>" to run the shell script and "spark-submit" to execute the Spark job on a Linux OS.

Databricks init scripts not working sometimes

OK, it is very strange. I have some init scripts that I would like to run when a cluster starts.
The cluster has the init script, which is in a file (in DBFS),
basically this:
dbfs:/databricks/init-scripts/custom-cert.sh
Now, when I create the init script like this, it works (no SSL errors for my endpoints). Also, the event log for the cluster shows the duration as 1 second for the init script:
dbutils.fs.put("/databricks/init-scripts/custom-cert.sh", """#!/bin/bash
cp /dbfs/orgcertificates/orgcerts.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
echo "export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt" >> /databricks/spark/conf/spark-env.sh
""")
However, if I just put the init script in a bash script and upload it to DBFS through a pipeline, the init script does not do anything. It executes, as per the event log, but the execution duration is 0 seconds.
I have the sh script in a file named
custom-cert.sh
with the same contents as above, i.e.
#!/bin/bash
cp /dbfs/orgcertificates/orgcerts.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
echo "export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt"
but when I check /usr/local/share/ca-certificates/, it does not contain /dbfs/orgcertificates/orgcerts.crt, even though the cluster init script has run.
Also, I have compared the contents of the init script in both cases and, at least to the naked eye, I can't figure out any difference, i.e.
%sh
cat /dbfs/databricks/init-scripts/custom-cert.sh
shows the same contents in both scenarios. What is the problem with the second case?
EDIT: I read a bit more about init scripts and found that the logs of init scripts are written here
%sh
ls /databricks/init_scripts/
Looking at the err file in that location, it seems there is an error
sudo: update-ca-certificates
: command not found
Why is it that update-ca-certificates is found in the first case but not when I put the same script in an sh file and upload it to DBFS (instead of executing dbutils.fs.put within a notebook)?
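One way to look for differences the naked eye misses (a diagnostic sketch, not a confirmed explanation of the error above) is to inspect the uploaded file byte by byte from a notebook cell after each kind of upload and compare the output, since things like Windows line endings or a byte-order mark would not show up in a plain cat:

%sh
# non-printing characters become visible: a CRLF ending shows as ^M before the
# trailing $ on each line, a BOM as extra bytes at the start of the file
cat -A /dbfs/databricks/init-scripts/custom-cert.sh

# 'file' reports CRLF terminators and the detected encoding, if any
file /dbfs/databricks/init-scripts/custom-cert.sh

# raw byte dump of the first part of the script if the above is inconclusive
head -c 200 /dbfs/databricks/init-scripts/custom-cert.sh | od -c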
EDIT 2: In response to the first answer. After running the command
dbutils.fs.put("/databricks/init-scripts/custom-cert.sh", """#!/bin/bash
cp /dbfs/orgcertificates/orgcerts.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
echo "export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt" >> /databricks/spark/conf/spark-env.sh
""")
the output is the file custom-cert.sh. I then restart the cluster with the init script location set to dbfs:/databricks/init-scripts/custom-cert.sh, and it works. So it is essentially the same content that the init script is reading (the generated sh file). Why can't it read the script if I do not use dbutils.fs.put but instead just put the contents in a bash file and upload it during the CI/CD process?
As we are aware, an init script is a shell script that runs during startup of each cluster node, before the Apache Spark driver or worker JVM starts. In case 2, when you run a bash command through the %sh magic command, you are executing it only on the local driver node, so the worker nodes are not able to access the result. In case 1, by using the filesystem utilities (dbutils.fs.put) you write the file from the root, so the worker nodes as well as the driver node can access the path.
Ref : https://docs.databricks.com/data/databricks-file-system.html#summary-table-and-diagram
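A quick way to see what is actually reachable (a sketch to run from a notebook attached to the cluster; the second path is the init-script log directory already mentioned in the question's EDIT) is to list the DBFS-backed location through the /dbfs FUSE mount next to the local log directory:

%sh
# the DBFS path the cluster configuration points at, seen through the FUSE mount
ls -l /dbfs/databricks/init-scripts/

# local directory on the node where the init script's stdout/stderr logs land
ls -l /databricks/init_scripts/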
It seems that the observations I made in the comments section of my question are the way to go.
I now create the init script using a Databricks job that I run during the CI/CD pipeline from Azure DevOps.
The notebook has the commands
dbutils.fs.rm("/databricks/init-scripts/custom-cert.sh")
dbutils.fs.put("/databricks/init-scripts/custom-cert.sh", """#!/bin/bash
cp /dbfs/internal-certificates/certs.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
echo "export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt" >> /databricks/spark/conf/spark-env.sh
""")
I then create a Databricks job (pointing to this notebook); the cluster is a job cluster, which is just temporary. Of course, in my case, even this job creation is automated using a PowerShell script.
I then call this Databricks job in the release pipeline, again using a PowerShell script.
This creates the file
/databricks/init-scripts/custom-cert.sh
I then use this file in any other cluster that accesses my org's endpoints (without certificate errors).
I do not know (or still do not understand) why the same script file can't just be part of a repo and be uploaded during the release process, instead of this Databricks job calling a notebook. I would love to know the reason. The other answer on this question does not hold true, as you can see: the init script is created by a job cluster and is then accessed from another cluster as part of its init script.
It simply boils down to how the init script gets created.
But I get my job done. I am posting this in case it helps someone get their job done too.
I have raised a support case though to understand the reason.
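For completeness, the "call the Databricks job from the release pipeline" step looks roughly like this if done with plain curl against the Jobs API rather than the PowerShell script mentioned above (workspace URL, token, and job id are placeholders, and the API version is an assumption):

# trigger the notebook job that (re)creates the init script in DBFS
DATABRICKS_HOST="https://<your-workspace>.azuredatabricks.net"   # placeholder
DATABRICKS_TOKEN="<personal-access-token>"                       # placeholder
JOB_ID=123                                                       # placeholder job id

curl -s -X POST "$DATABRICKS_HOST/api/2.1/jobs/run-now" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"job_id\": $JOB_ID}"

The pipeline can then poll the returned run until it finishes before creating the clusters that consume the init script.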

Puppet task run throwing invalid json error

My Puppet server is PE 2017.3.1.
My agent node is on version 5.5.
I am facing an error while executing the command:
/opt/puppetlabs/puppet/bin/puppet-task run sample --nodes puppet-agent
My sample file is a bash file which contains:
#!/usr/bin/env bash
hostnamectl
I was able to list my task using the CLI.
The above puppet-task command throws an error on the command line:
1. Starting Job
2. Invalid Json
The Puppet access login token had expired. After recreating it, I was able to perform the task.
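For anyone hitting the same thing, regenerating the token is a one-liner with the PE client tools (a sketch; the exact binary path and token lifetime below are assumptions that depend on the install):

# log in again to obtain a fresh RBAC token; this prompts for the console user's password
/opt/puppetlabs/bin/puppet-access login --lifetime 1d

# then retry the task
/opt/puppetlabs/puppet/bin/puppet-task run sample --nodes puppet-agent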

How to run a Node.js script every second

I need to run my Node.js script every second, similar to PHP cron jobs. I have tried some Node.js cron libraries like https://github.com/ncb000gt/node-cron, but the issue is that the first run has to be manual, i.e. I have to run the file containing the cron script manually the first time.
But PHP cron jobs are run by the server, so if the Apache server is running, the script will start automatically, and even if the script returns an error in one cycle, it will run again from the beginning in the next cycle.
So is there any way to achieve this in Node.js?
You have two options:
Using Node as a daemon, with something like Supervisord to run your node-cron script. This alternative is wasteful of resources such as RAM, because Node and Supervisord are running all the time.
Using the system's crontab, you can run your script the way you would call Node on the command line, with an entry such as * * * * * node /path/to/your/script.js. This alternative is highly efficient but lacks some control, such as being able to log the output in case of an error, although you could just redirect the output to a file: node script.js > logfile
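The second option in practice looks like the entry below (the script path and log location are assumptions). Note that cron's finest granularity is one minute; if the job really must fire every second, the usual pattern is to let the script itself loop and use cron or a process manager only to make sure it is running:

# crontab -e entry: run once a minute, appending stdout and stderr to a log file
* * * * * /usr/bin/node /path/to/your/script.js >> /var/log/myscript.log 2>&1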

How to schedule a sqoop job command using Oozie in Azure HDInsight (remote machine)

I am trying to schedule a sqoop job using Oozie in HDInsight (remote machine). I have executed the sqoop job command from the command line and am trying to schedule "job --exec job1" in Oozie, but this is not working. I have checked the logs in Azure Blob storage and the Oozie logs, and there is no error. The YARN app shows success, while Oozie shows a failure/killed error.
If I run the same command from the command line, it works well.
This is my sqoop job command:
sqoop job --create job1 -- import --connect "jdbc:sqlserver://<ip:port>;database=<dbname>;username=<name>;password=<pwd>" --table tablename --target-dir /example/sqoopoutput --incremental append --check-column latestdate --last-value "1991-01-01 00:00:00.000"
I am getting this error:
[org.apache.oozie.action.hadoop.SqoopMain], exit code [1]
I tried with all the jar files in the Sqoop and Oozie share lib.
While scheduling a sqoop job in Oozie, do not leave a space in the date value you pass; use T instead of the space.
Change your command as follows:
job --create job1 -- import --connect "jdbc:sqlserver://<ip:port>;database=<dbname>;username=<name>;password=<pwd>" --table tablename --target-dir /example/sqoopoutput --incremental append --check-column latestdate --last-value "1991-01-01T00:00:00.000"
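A quick way to sanity-check the fix from a shell before handing the command back to Oozie (a sketch; the connection placeholders are copied from the question) is to recreate the saved job with the T-separated timestamp and inspect it:

# drop the old saved job first (ignore the error if it does not exist yet)
sqoop job --delete job1

# recreate it with the corrected --last-value
sqoop job --create job1 -- import \
  --connect "jdbc:sqlserver://<ip:port>;database=<dbname>;username=<name>;password=<pwd>" \
  --table tablename --target-dir /example/sqoopoutput \
  --incremental append --check-column latestdate \
  --last-value "1991-01-01T00:00:00.000"

# confirm the stored parameters, then run it once by hand
sqoop job --show job1
sqoop job --exec job1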
