Pyspark local env path handling from a cloned pyspark code - apache-spark

I'm new to PySpark and have a question about handling environment-specific configs in PySpark.
I read a file from HDFS in PySpark like this:
airportDf = spark.read.format('csv')\
    .option('sep', ',')\
    .option('header', 'false')\
    .schema(airportSchema)\
    .load(configs['airport_dat'])
Here,
configs['airport_dat'] = "/mnt/g/pythonProject/data/airports.DAT"
is configured in a config.json and would be the HDFS path.
However, when I clone this repo locally and want to load the file from a local Windows path, I have to manually edit this config file with the Windows path.
I wanted to know whether this is the correct approach, or whether there is a guideline for handling such environment-specific configurations so that manual edits to the config files can be avoided when running on a local system.
Any link, article, or sample repo describing the approach would be really helpful.
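For reference, a minimal sketch of one possible pattern (the APP_ENV variable and the per-environment file names are assumptions, not something from this post): keep one config file per environment and select it with an environment variable, so nothing has to be edited by hand.
import json
import os

# Hypothetical layout: one config file per environment, e.g.
#   config/local.json -> {"airport_dat": "file:///G:/pythonProject/data/airports.DAT"}
#   config/prod.json  -> {"airport_dat": "/mnt/g/pythonProject/data/airports.DAT"}
# APP_ENV is an assumed environment variable, defaulting to "local".
env = os.environ.get("APP_ENV", "local")
with open("config/{}.json".format(env), "r") as config_file:
    configs = json.load(config_file)

# spark and airportSchema are assumed to be defined as in the snippet above.
airportDf = spark.read.format('csv')\
    .option('sep', ',')\
    .option('header', 'false')\
    .schema(airportSchema)\
    .load(configs['airport_dat'])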
One more issue I am facing here: the code works fine locally in the IDE, but with spark-submit on the cluster it is not able to find the config file.
My code to read the config file is
with open("config/config.json", "r") as config_file:
    configs = json.load(config_file)
and my directory structure is
config/
    config.json
main.py
jobs/
    __init__.py
    load_country.py
I have packaged all the files into a packages.zip and am passing it as the --py-files parameter.
my spark-submit command is
$SPARK_HOME/bin/spark-submit --py-files /mnt/g/pythonProject/sparkproject/packages.zip /mnt/g/pythonProject/sparkproject/main.py
the error i am getting is
Traceback (most recent call last):
File "/mnt/g/pythonProject/sparkproject/main.py", line 20, in <module>
with open("config/config.json", "r") as config_file:
FileNotFoundError: [Errno 2] No such file or directory: 'config/config.json'
I don't understand why the file is not discoverable by spark-submit even though it is in the zip. Do I need to add anything else to make the JSON file discoverable?
P.S. The other functions in the jobs folder are accessible, though.
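For what it's worth, a sketch of one possible workaround (spark-submit's --files option and pyspark.SparkFiles are real APIs, but the exact layout here is illustrative): ship config.json with --files instead of relying on the current working directory, and resolve its local path through SparkFiles.
import json
from pyspark import SparkFiles

# Submitted with, e.g. (illustrative):
#   spark-submit --py-files packages.zip --files config/config.json main.py
# Files passed via --files are distributed to the driver and executors, and
# SparkFiles.get() returns the local path where the file was placed.
# (Must be called after the SparkSession/SparkContext has been created.)
with open(SparkFiles.get("config.json"), "r") as config_file:
    configs = json.load(config_file)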

Related

MLflow saves models to relative place instead of tracking_uri

Sorry if my question is too basic, but I cannot solve it.
I am currently experimenting with MLflow and facing the following issue:
Even though I have set the tracking_uri, the MLflow artifacts are saved to the ./mlruns/... folder relative to the path from where I run mlflow run path/to/train.py (on the command line). The MLflow server then searches for the artifacts following the tracking_uri (mlflow server --default-artifact-root here/comes/the/same/tracking_uri).
Through the following example it will be clear what I mean:
I set the following in the training script before the with mlflow.start_run() as run:
mlflow.set_tracking_uri("file:///home/#myUser/#SomeFolders/mlflow_artifact_store/mlruns/")
My expectation would be that MLflow saves all the artifacts to the place I gave in the tracking URI. Instead, it saves the artifacts relative to the place from where I run mlflow run path/to/train.py, i.e. running the following from /home/#myUser/:
mlflow run path/to/train.py
creates the structure:
/home/#myUser/mlruns/#experimentID/#runID/artifacts
/home/#myUser/mlruns/#experimentID/#runID/metrics
/home/#myUser/mlruns/#experimentID/#runID/params
/home/#myUser/mlruns/#experimentID/#runID/tags
and therefore it doesn't find the run artifacts in the tracking_uri, giving the error message:
Traceback (most recent call last):
File "train.py", line 59, in <module>
with mlflow.start_run() as run:
File "/home/#myUser/miniconda3/envs/mlflow-ff56d6062d031d43990effc19450800e72b9830b/lib/python3.6/site-packages/mlflow/tracking/fluent.py", line 204, in start_run
active_run_obj = client.get_run(existing_run_id)
File "/home/#myUser/miniconda3/envs/mlflow-ff56d6062d031d43990effc19450800e72b9830b/lib/python3.6/site-packages/mlflow/tracking/client.py", line 151, in get_run
return self._tracking_client.get_run(run_id)
File "/home/#myUser/miniconda3/envs/mlflow-ff56d6062d031d43990effc19450800e72b9830b/lib/python3.6/site-packages/mlflow/tracking/_tracking_service/client.py", line 57, in get_run
return self.store.get_run(run_id)
File "/home/#myUser/miniconda3/envs/mlflow-ff56d6062d031d43990effc19450800e72b9830b/lib/python3.6/site-packages/mlflow/store/tracking/file_store.py", line 524, in get_run
run_info = self._get_run_info(run_id)
File "/home/#myUser/miniconda3/envs/mlflow-ff56d6062d031d43990effc19450800e72b9830b/lib/python3.6/site-packages/mlflow/store/tracking/file_store.py", line 544, in _get_run_info
"Run '%s' not found" % run_uuid, databricks_pb2.RESOURCE_DOES_NOT_EXIST
mlflow.exceptions.MlflowException: Run '788563758ece40f283bfbf8ba80ceca8' not found
2021/07/23 16:54:16 ERROR mlflow.cli: === Run (ID '788563758ece40f283bfbf8ba80ceca8') failed ===
Why is that so? How can I change the place where the artifacts are stored and where this directory structure is created? I have tried mlflow run --storage-dir here/comes/the/path, and setting the tracking_uri and registry_uri. If I run mlflow run path/to/train.py from /home/path/to/tracking/uri it works, but I need to run the scripts remotely.
My end goal is to change the artifact URI to an NFS drive, but even on my local computer I cannot make it work.
Thanks for reading it, even more thanks if you suggest a solution! :)
Have a great day!
This issue was solved by the following:
I had mixed up the tracking_uri with the backend_store_uri.
The tracking_uri is where the MLflow-related data (e.g. tags, parameters, metrics, etc.) is saved, which can be a database. The artifact_location, on the other hand, is where the artifacts (other, non-MLflow data produced by the preprocessing/training/evaluation/etc. scripts) are saved.
What led me to mistakes is that when running mlflow server from the command line, one should pass the tracking_uri as --backend-store-uri (and also set it in the script with mlflow.set_tracking_uri()) and the location of the artifacts as --default-artifact-root. Somehow I didn't get that tracking_uri = backend_store_uri.
Here's my solution
Launch the server
mlflow server -h 0.0.0.0 -p 5000 --backend-store-uri postgresql://DB_USER:DB_PASSWORD@DB_ENDPOINT:5432/DB_NAME --default-artifact-root s3://S3_BUCKET_NAME
Set the tracking URI to an HTTP URI like
mlflow.set_tracking_uri("http://my-tracking-server:5000/")
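For completeness, a minimal client-side sketch against such a server (the server address, parameter, metric, and file names are illustrative); artifacts then end up under the --default-artifact-root configured on the server.
import mlflow

# Assumed server address; must match the running `mlflow server` instance.
mlflow.set_tracking_uri("http://my-tracking-server:5000/")

with mlflow.start_run() as run:
    mlflow.log_param("learning_rate", 0.01)   # illustrative parameter
    mlflow.log_metric("rmse", 0.42)           # illustrative metric
    mlflow.log_artifact("model_summary.txt")  # illustrative local file to upload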

No such file or directory in spark cluster mode

I am writing a Spark Streaming application using PySpark which basically processes the data.
A short packaging overview:
This application contains several modules and some config files which are non-.py files (e.g. .yaml or .json).
I am packaging this entire application into a package.zip file and submitting that package.zip to Spark.
Now the problem is that when I issue the spark-submit command in yarn-cluster mode, I get an IOError. Below is the stack trace:
Traceback (most recent call last):
File "main/main.py", line 10, in <module>
import logger.logger
File "package.zip/logger/logger.py", line 36, in get_logger
IOError: [Errno 2] No such file or directory: 'logger/config.yaml'
Spark command:
spark-submit --master yarn-cluster --py-files package.zip main/main.py
But when I submit the job in yarn-client mode, the application works as expected.
My understanding:
When I submit the job in client mode, the Spark driver runs on the same machine where I issued the command, and the package is distributed across all nodes.
When I issue the command in cluster mode, both the Spark driver and the application master run on a single node (which is not the client that submitted the code), and the package is still distributed to all nodes in the cluster.
In both cases package.zip is available to all nodes, so why is it that only the .py files are loaded while the non-.py files fail to load in cluster mode?
Can anyone please help me understand the situation here and resolve the problem?
Update - observations:
In client mode, the zipped package is unzipped in the path where the driver script is running.
In cluster mode, the zip package is shipped to all nodes but not unzipped.
Do I need to unzip the package on all nodes here?
Is there any way to tell Spark to unzip the package on the worker nodes?
You can pass your extra files with the --files option.
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-submit.html
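A sketch of what that could look like for the layout in the question (the --files flag is real; reading by the bare filename assumes the file is localized to the container's working directory, which is what YARN does for files shipped this way).
# spark-submit --master yarn-cluster --py-files package.zip \
#     --files logger/config.yaml main/main.py
#
# Inside logger/logger.py, open the bare filename instead of the original
# 'logger/config.yaml' relative path:
import yaml  # PyYAML, assumed to be available on the cluster

with open("config.yaml", "r") as f:
    config = yaml.safe_load(f)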

Spark Streaming reading from local file gives NullPointerException

Using Spark 2.2.0 on OS X High Sierra. I'm running a Spark Streaming application to read a local file:
val lines = ssc.textFileStream("file:///Users/userName/Documents/Notes/MoreNotes/sampleFile")
lines.print()
This gives me
org.apache.spark.streaming.dstream.FileInputDStream logWarning - Error finding new files
java.lang.NullPointerException
at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:192)
The file exists, and I am able to read it using the SparkContext (sc) from spark-shell in the terminal. For some reason, going through the IntelliJ application with Spark Streaming is not working. Any ideas appreciated!
Quoting the doc comments of textFileStream:
Create an input stream that monitors a Hadoop-compatible filesystem
for new files and reads them as text files (using key as LongWritable, value
as Text and input format as TextInputFormat). Files must be written to the
monitored directory by "moving" them from another location within the same
file system. File names starting with . are ignored.
@param directory HDFS directory to monitor for new file
So, the method expects the path to a directory in the parameter.
So I believe this should avoid that error:
ssc.textFileStream("file:///Users/userName/Documents/Notes/MoreNotes/")
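For example, a minimal PySpark sketch of the same idea (batch interval and app name are illustrative); note that only files moved into the directory after the stream starts will be picked up.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "TextFileStreamExample")
ssc = StreamingContext(sc, 10)  # 10-second batch interval (illustrative)

# Monitor the directory, not a single file; files moved in after start() are read.
lines = ssc.textFileStream("file:///Users/userName/Documents/Notes/MoreNotes/")
lines.pprint()

ssc.start()
ssc.awaitTermination()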
Spark Streaming will not read old files, so first run the spark-submit command and then create the local file in the specified directory. Make sure that in the spark-submit command you give only the directory name and not the file name. Below is a sample command. Here, I am passing the directory name through the Spark command as my first parameter. You can specify this path in your Scala program as well.
spark-submit --class com.spark.streaming.streamingexample.HdfsWordCount --jars /home/cloudera/pramod/kafka_2.12-1.0.1/libs/kafka-clients-1.0.1.jar --master local[4] /home/cloudera/pramod/streamingexample-0.0.1-SNAPSHOT.jar /pramod/hdfswordcount.txt

Pyspark - Load file: Path does not exist

I am a newbie to Spark. I'm trying to read a local csv file within an EMR cluster. The file is located in: /home/hadoop/. The script that I'm using is this one:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Protob Conversion to Parquet") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

df = spark.read.csv('/home/hadoop/observations_temp.csv', header=True)
When I run the script, it raises the following error message:
pyspark.sql.utils.AnalysisException: u'Path does not exist:
hdfs://ip-172-31-39-54.eu-west-1.compute.internal:8020/home/hadoop/observations_temp.csv
Then I found out that I have to add file:// to the file path so it reads the file locally:
df = spark.read.csv('file:///home/hadoop/observations_temp.csv', header=True)
But this time, the above approach raised a different error:
Lost task 0.3 in stage 0.0 (TID 3,
ip-172-31-41-81.eu-west-1.compute.internal, executor 1):
java.io.FileNotFoundException: File
file:/home/hadoop/observations_temp.csv does not exist
I think this is because the file:// prefix just reads the file locally and does not distribute the file across the other nodes.
Do you know how I can read the csv file and make it available to all the other nodes?
You are right that your file is missing from the worker nodes, which is what raises the error you got.
Here is the official documentation Ref. External Datasets.
If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
So basically you have two solutions:
You copy your file onto each worker before starting the job;
Or you upload it to HDFS with something like this (recommended solution):
hadoop fs -put localfile /user/hadoop/hadoopfile.csv
Now you can read it with :
df = spark.read.csv('/user/hadoop/hadoopfile.csv', header=True)
It seems that you are also using AWS S3. You can always try to read it directly from S3 without downloading it (with the proper credentials, of course).
Some suggest that the --files flag provided with spark-submit uploads the files to the execution directories. I don't recommend this approach unless your csv file is very small, but then you won't need Spark.
Alternatively, I would stick with HDFS (or any distributed file system).
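As a sketch of the S3 option (bucket and key names are illustrative; assumes the cluster already has S3 access configured, as EMR does by default):
# Read the csv directly from S3 instead of the local filesystem.
df = spark.read.csv('s3://my-bucket/observations_temp.csv', header=True)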
I think what you are missing is explicitly setting the master while initializing the SparkSession; try something like this:
spark = SparkSession \
    .builder \
    .master("local") \
    .appName("Protob Conversion to Parquet") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
and then read the file in the same way you have been doing
df = spark.read.csv('file:///home/hadoop/observations_temp.csv')
this should solve the problem...
Might be useful for someone running Zeppelin on a Mac using Docker.
Copy the files to a custom folder: /Users/my_user/zeppspark/myjson.txt
docker run -p 8080:8080 -v /Users/my_user/zeppspark:/zeppelin/notebook --rm --name zeppelin apache/zeppelin:0.9.0
On Zeppelin you can run this to get your file:
%pyspark
json_data = sc.textFile('/zeppelin/notebook/myjson.txt')

exec sh from PySpark

I'm trying to run a .sh file from a .py file in a PySpark job, but I always receive a message saying that the .sh file is not found.
This is my code:
test.py:
import os,sys
os.system("sh ./check.sh")
and my gcloud command:
gcloud beta dataproc jobs submit pyspark --cluster mserver file:///home/myuser/test.py
The test.py file is loaded fine, but the system can't find the check.sh file.
I figure it is something related to the file's path, but I'm not sure.
I also tried os.system("sh home/myuser/check.sh") with the same result.
I think this should be easy to do, so... ideas?
The "current working directory" used by Dataproc jobs submitted through the API is a temporary directory with a unique name for each job; if the file wasn't uploaded with the job itself, you'll have to access it using your absolute path.
If you indeed added the check.sh file manually to /home/myuser/check.sh, then you should be able to call it using the fully qualified path, os.system("sh /home/myuser/check.sh"); make sure to start your absolute path with a /.
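A slightly more defensive variant (a sketch; subprocess is standard library and the script path is the one from the question) that surfaces the exit status and output instead of discarding them:
import subprocess

# Use an absolute path, since Dataproc jobs run from a per-job temporary directory.
result = subprocess.run(["sh", "/home/myuser/check.sh"],
                        capture_output=True, text=True)
print(result.returncode, result.stdout, result.stderr)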
