exec sh from PySpark - apache-spark

I'm trying to run a .sh file from a .py file in a PySpark job, but I always get a message saying that the .sh file is not found.
This is my code:
test.py:
import os,sys
os.system("sh ./check.sh")
and my gcloud command:
gcloud beta dataproc jobs submit pyspark --cluster mserver file:///home/myuser/test.py
The test.py file loads fine, but the system can't find the check.sh file.
I figure it's something related to the file's path, but I'm not sure.
I also tried os.system("sh home/myuser/check.sh") with the same result.
I think this should be easy to do, so ... any ideas?

The "current working directory" used by Dataproc jobs submitted through the API is a temporary directory with a unique name for each job; if the file wasn't uploaded with the job itself, you'll have to access it using your absolute path.
If you indeed added the check.sh file manually to /home/myuser/check.sh, then you should be able to call it using the fully qualified path, os.system("sh /home/myuser/check.sh"); make sure to start your absolute path with a /.
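As an illustration (not from the original answer), here is a small sketch of how test.py could call the script by absolute path and fail loudly if it is missing; the /home/myuser/check.sh location is the one assumed in the question:
import os
import subprocess

SCRIPT = "/home/myuser/check.sh"  # absolute path assumed from the question

# Fail with a clear message instead of relying on os.system's exit code.
if not os.path.isfile(SCRIPT):
    raise RuntimeError("script not found: " + SCRIPT)
subprocess.check_call(["sh", SCRIPT])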

Related

Pyspark local env path handling from a cloned pyspark code

I'm new to PySpark and have a question about handling environment configs in PySpark.
I read a file in PySpark from HDFS like this:
airportDf = spark.read.format('csv')\
    .option('sep', ',')\
    .option('header', 'false')\
    .schema(airportSchema)\
    .load(configs['airport_dat'])
here,
configs['airport_dat'] = "/mnt/g/pythonProject/data/airports.DAT"
which is configured in a config.json and would be the HDFS path
However, when I clone this repo locally and want to load the file from a local Windows path, I have to manually edit this file with the Windows path.
I want to know whether this is the correct approach, or whether there is a guideline for handling such environment-specific configuration so I can avoid manually editing config files when running on my local system.
Any link, article, or sample repo demonstrating the approach would be really helpful.
One more issue I am facing: the code works fine locally in the IDE, but when I run spark-submit on the cluster it is not able to find the config file.
My code to read the config file is
with open("config/config.json", "r") as config_file:
configs = json.load(config_file)
and my directory structure is
config
-- config.json
main.py
jobs
-- __init__.py
-- load_country.py
I have packaged all the files in packages.zip and I am passing it via the --py-files parameter.
My spark-submit command is
$SPARK_HOME/bin/spark-submit --py-files /mnt/g/pythonProject/sparkproject/packages.zip /mnt/g/pythonProject/sparkproject/main.py
The error I am getting is
Traceback (most recent call last):
File "/mnt/g/pythonProject/sparkproject/main.py", line 20, in <module>
with open("config/config.json", "r") as config_file:
FileNotFoundError: [Errno 2] No such file or directory: 'config/config.json'
I don't understand why the file is not discoverable by spark-submit even though it is in the zip file. Do I need to add anything else to make the JSON file discoverable?
P.S. The other functions in the jobs folder are accessible, though.
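Not part of the thread, but one common workaround is to ship config.json separately with --files instead of packing it into packages.zip, and then resolve it through SparkFiles; a rough sketch, assuming the same file layout as above:
import json
from pyspark import SparkFiles
from pyspark.sql import SparkSession

# Assumes spark-submit was also given:
#   --files /mnt/g/pythonProject/sparkproject/config/config.json
spark = SparkSession.builder.appName("config-sketch").getOrCreate()

# SparkFiles.get resolves the local path the file was distributed to,
# independent of the driver's current working directory.
with open(SparkFiles.get("config.json"), "r") as config_file:
    configs = json.load(config_file)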

Databricks init scripts not working sometimes

OK, this is very strange. I have some init scripts that I would like to run when a cluster starts.
The cluster has the init script, which is in a file (in DBFS),
basically this
dbfs:/databricks/init-scripts/custom-cert.sh
Now, when I create the init script like this, it works (no SSL errors for my endpoints). Also, the event logs for the cluster show the duration as 1 second for the init script:
dbutils.fs.put("/databricks/init-scripts/custom-cert.sh", """#!/bin/bash
cp /dbfs/orgcertificates/orgcerts.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
echo "export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt" >> /databricks/spark/conf/spark-env.sh
""")
However, if I just put the init script in a bash script and upload it to DBFS through a pipeline, the init script does not do anything. It executes, as per the event log, but the execution duration is 0 seconds.
I have the sh script in a file named
custom-cert.sh
with the same contents as above, i.e.
#!/bin/bash
cp /dbfs/orgcertificates/orgcerts.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
echo "export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt"
but when I check /usr/local/share/ca-certificates/, it does not contain /dbfs/orgcertificates/orgcerts.crt, even though the cluster init script has run.
Also, I have compared the contents of the init script in both cases and, at least to the naked eye, I can't see any difference,
i.e.
%sh
cat /dbfs/databricks/init-scripts/custom-cert.sh
shows the same contents in both scenarios. What is the problem with the second case?
EDIT: I read a bit more about init scripts and found that their logs are written here:
%sh
ls /databricks/init_scripts/
Looking at the err file in that location, it seems there is an error
sudo: update-ca-certificates
: command not found
Why is update-ca-certificates found in the first case, but not when I put the same script in a sh script and upload it to DBFS (instead of executing dbutils.fs.put within a notebook)?
EDIT 2: In response to the first answer: after running the command
dbutils.fs.put("/databricks/init-scripts/custom-cert.sh", """#!/bin/bash
cp /dbfs/orgcertificates/orgcerts.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
echo "export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt" >> /databricks/spark/conf/spark-env.sh
""")
the output is the file custom-cert.sh. I then restart the cluster with the init script location set to dbfs:/databricks/init-scripts/custom-cert.sh, and it works. So it is essentially the same content that the init script is reading (the generated sh script). Why can't it read it if I do not use dbutils.fs.put, but just put the contents in a bash file and upload it during the CI/CD process?
As we know, an init script is a shell script that runs during startup of each cluster node, before the Apache Spark driver or worker JVM starts. In case 2, when you run a bash command with the %sh magic command, you are executing it only on the local driver node, so the worker nodes cannot access it. In case 1, by using dbutils.fs.put you are writing the file to the DBFS root, so the worker nodes can access that path along with the driver node.
Ref: https://docs.databricks.com/data/databricks-file-system.html#summary-table-and-diagram
It seems that the observations I made in the comments section of my question are the way to go.
I now create the init script using a Databricks job that I run during the CI/CD pipeline from Azure DevOps.
The notebook has the commands:
dbutils.fs.rm("/databricks/init-scripts/custom-cert.sh")
dbutils.fs.put("/databricks/init-scripts/custom-cert.sh", """#!/bin/bash
cp /dbfs/internal-certificates/certs.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
echo "export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt" >> /databricks/spark/conf/spark-env.sh
""")
I then create a Databricks job (pointing to this notebook); the cluster is a job cluster, which is just temporary. Of course, in my case even this job creation is automated using a PowerShell script.
I then call this Databricks job in the release pipeline, again using a PowerShell script.
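For reference, a rough Python sketch of what that release-pipeline call might look like against the Databricks Jobs run-now REST endpoint (the post uses PowerShell; the workspace URL, token, and job id below are placeholders):
import requests

# Hypothetical values; the actual workspace, token, and job id are not shown in the post.
WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "dapiXXXXXXXXXXXXXXXX"
JOB_ID = 123

# Trigger the job that (re)generates /databricks/init-scripts/custom-cert.sh.
resp = requests.post(
    WORKSPACE_URL + "/api/2.1/jobs/run-now",
    headers={"Authorization": "Bearer " + TOKEN},
    json={"job_id": JOB_ID},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # contains the run_id of the triggered run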
This creates the file
/databricks/init-scripts/custom-cert.sh
I then use this file in any other cluster that accesses my org's endpoints (without certificate errors).
I still do not understand why the same script file can't just be part of a repo and uploaded during the release process (instead of this Databricks job calling a notebook). I would love to know the reason. The other answer on this question does not hold true: as you can see, the init script is created by a job cluster and then used by another cluster as part of its init script.
It simply boils down to how the init script gets created.
But I get my job done. I'm sharing this in case it helps someone get their job done too.
I have raised a support case though to understand the reason.

Spark Streaming reading from local file gives NullPointerException

Using Spark 2.2.0 on OS X High Sierra. I'm running a Spark Streaming application to read a local file:
val lines = ssc.textFileStream("file:///Users/userName/Documents/Notes/MoreNotes/sampleFile")
lines.print()
This gives me
org.apache.spark.streaming.dstream.FileInputDStream logWarning - Error finding new files
java.lang.NullPointerException
at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:192)
The file exists, and I am able to read it using SparkContext (sc) from spark-shell in the terminal. For some reason, going through the IntelliJ application and Spark Streaming is not working. Any ideas are appreciated!
Quoting the doc comments of textFileStream:
Create an input stream that monitors a Hadoop-compatible filesystem
for new files and reads them as text files (using key as LongWritable, value
as Text and input format as TextInputFormat). Files must be written to the
monitored directory by "moving" them from another location within the same
file system. File names starting with . are ignored.
@param directory HDFS directory to monitor for new file
So, the method expects a directory path as its parameter.
So I believe this should avoid that error:
ssc.textFileStream("file:///Users/userName/Documents/Notes/MoreNotes/")
Spark Streaming will not read old files, so first run the spark-submit command and then create the local file in the specified directory. Make sure that in the spark-submit command you give only the directory name and not the file name. Below is a sample command; here, I am passing the directory name to the application as its first parameter. You can specify this path in your Scala program as well.
spark-submit --class com.spark.streaming.streamingexample.HdfsWordCount --jars /home/cloudera/pramod/kafka_2.12-1.0.1/libs/kafka-clients-1.0.1.jar --master local[4] /home/cloudera/pramod/streamingexample-0.0.1-SNAPSHOT.jar /pramod/hdfswordcount.txt

sc.addFile throwing error=2, No such file or directory in cluster mode

I am trying to run an R script by piping an RDD to an Rscript using Spark's pipe(). I am using sc.addFile() to ship the R script to the executors.
I call sc.addFile(rScript) and use SparkFiles.get(rName) to get the file path.
While running the job in cluster mode, I am getting the error below:
Cannot run program "/data/tmp/spark-b8b8053e-0110-4ddb-91a3-ae6f0f633c68/userFiles-78ed11c0-483b-4615-88eb-8d1c97571997/RSCRIPT_NAME.R": error=2, No such file or directory
But the file is getting copied to the /data/tmp/spark-b8b8053e-0110-4ddb-91a3-ae6f0f633c68/userFiles-78ed11c0-483b-4615-88eb-8d1c97571997 location.
Not sure how to fix this issue.
I think you are trying to execute it as
rdd.pipe("scriptName.R")
Please add "./" before you call the script.
rdd.pipe("./scriptName.R")

Where is Spark writing SaveAsTextFile in cluster?

I'm a bit at a loss here (Spark newbie). I spun up an EC2 cluster and submitted a Spark job which saves a text file in the last step. The code reads
reduce_tuples.saveAsTextFile('september_2015')
and the working directory of the Python file I'm submitting is /root. I cannot find the directory called september_2015, and if I try to run the job again I get the error:
: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://ec2-54-172-88-52.compute-1.amazonaws.com:9000/user/root/september_2015 already exists
The EC2 address is the master node I'm SSHing into, but I don't have a folder /user/root.
It seems like Spark is creating the september_2015 directory somewhere, but find doesn't find it. Where does Spark write the resulting directory? Why is it pointing me to a directory that doesn't exist on the master node's filesystem?
You're not saving it to the local file system; you're saving it to the HDFS cluster. Try eph*-hdfs/bin/hadoop fs -ls /, and you should see your file. See eph*-hdfs/bin/hadoop help for more commands, e.g. -copyToLocal.
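For illustration (not from the answer), adding an explicit scheme prefix in the job itself makes the destination unambiguous; reduce_tuples is the RDD from the question:
# Writes into HDFS under /user/root, the default filesystem on the cluster.
reduce_tuples.saveAsTextFile("hdfs:///user/root/september_2015")

# A file:// path would instead scatter part files across each worker's local
# disk, which is usually not what you want on a cluster.
# reduce_tuples.saveAsTextFile("file:///root/september_2015")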
