Copy src code ZIP to Dataproc cluster from GCS in Spark-Submit - apache-spark

I am trying to run a Spark job on a Dataproc cluster in GCP. All of my src code is zipped and stored in a GCS bucket, and the main Python file and additional jars sit in the same GCS bucket.
Now, when I do spark-submit, the main Python file and the jars are copied to the cluster, but the src code zip (.zip) is not.
Here is the spark-submit command I am using
gcloud dataproc jobs submit pyspark gs://gcs-bucket/spark-submit/main_file.py \
--project XYZ-data \
--cluster=ABC-v1 \
--region=us-central1 \
--jars gs://qc-dmart/tmp/gcs-connector-hadoop3-2.2.2-shaded.jar,gs://qc-dmart/tmp/spark-bigquery-with-dependencies_2.12-0.24.2.jar \
--archives gs://gcs-bucket/spark-submit/src/pyfiles.zip \
-- /bin/sh -c "gsutil cp gs://gcs-bucket/spark-submit/src/pyfiles.zip . && unzip -n pyfiles.zip && chmod +x" \
-- --config-path=../configs env=dev
Here I tried using the --archives and --files arguments separately, but with no luck.
Additionally, based on a StackOverflow answer, I also tried to copy the files directly using gsutil; you can see how I am using this in the command above.
None of the above attempts was fruitful.
Here is the error thrown from the main Python file:
File "/tmp/b1f7408ed1444754909e368cc1dba47f/promo_roi.py", line 10, in <module>
from src.promo_roi.compute.spark.context import SparkContext
ModuleNotFoundError: No module named 'src'
Any help would be really appreciated.
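For context, the mechanism Spark documents for Python dependencies (see the packaging link in the answers below) is to ship the zip via --py-files, or to add it from the driver with SparkContext.addPyFile, so that the archive ends up on sys.path. The sketch below is a hypothetical illustration of that approach; it assumes pyfiles.zip has the src/ package (with __init__.py files) at its top level, and that the GCS connector is available on the cluster (it is on Dataproc by default):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("promo_roi").getOrCreate()

# Either pass the zip with --py-files on the submit command, or add it at runtime;
# both place the archive on the Python path of the driver and the executors.
spark.sparkContext.addPyFile("gs://gcs-bucket/spark-submit/src/pyfiles.zip")

# Only once the zip is on sys.path does this import resolve:
from src.promo_roi.compute.spark.context import SparkContext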

Related

How to get basic Spark program running on Kubernetes

I'm trying to get off the ground with Spark and Kubernetes but I'm facing difficulties. I used the helm chart here:
https://github.com/bitnami/charts/tree/main/bitnami/spark
I have 3 workers and they all report running successfully. I'm trying to run the following program remotely:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("spark://<master-ip>:<master-port>").getOrCreate()
df = spark.read.json('people.json')
Here's the part that's not entirely clear. Where should the file people.json actually live? I have it locally where I'm running the python code and I also have it on a PVC that the master and all workers can see at /sparkdata/people.json.
When I run the third line with simply 'people.json', it starts running but errors out with:
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
If I run it as '/sparkdata/people.json' then I get
pyspark.sql.utils.AnalysisException: Path does not exist: file:/sparkdata/people.json
Not sure where I go from here. To be clear I want it to read files from the PVC. It's an NFS share that has the data files on it.
Your people.json file needs to be accessible to your driver + executor pods. This can be achieved in multiple ways:
having some kind of network/cloud drive that each pod can access
mounting volumes on your pods, and then uploading the data to those volumes using --files in your spark-submit.
The latter option might be the simpler one to set up. This page discusses in more detail how you could do this, but let's get straight to the point. If you add the following arguments to your spark-submit you should be able to get your people.json onto your driver + executors (you just have to choose sensible values for the $VAR variables in there):
--files people.json \
--conf spark.kubernetes.file.upload.path=$SOURCE_DIR \
--conf spark.kubernetes.driver.volumes.$VOLUME_TYPE.$VOLUME_NAME.mount.path=$MOUNT_PATH \
--conf spark.kubernetes.driver.volumes.$VOLUME_TYPE.$VOLUME_NAME.options.path=$MOUNT_PATH \
--conf spark.kubernetes.executor.volumes.$VOLUME_TYPE.$VOLUME_NAME.mount.path=$MOUNT_PATH \
--conf spark.kubernetes.executor.volumes.$VOLUME_TYPE.$VOLUME_NAME.options.path=$MOUNT_PATH \
You can always verify the existence of your data by going inside of the pods themselves like so:
kubectl exec -it <driver/executor pod name> bash
(now you should be inside of a bash process in the pod)
cd <mount-path-you-chose>
ls -al
That last ls -al command should show you a people.json file in there (after having done your spark-submit of course).
Hope this helps!
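As a side note, files shipped with --files (or programmatically with SparkContext.addFile) can also be located through SparkFiles rather than a hard-coded path. A minimal sketch, assuming the driver has people.json in its working directory:
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Ship the driver-local file to every executor's working directory.
sc.addFile("people.json")

# Inside a task, SparkFiles.get returns the path of that executor's local copy.
def first_line(_):
    with open(SparkFiles.get("people.json")) as f:
        return [f.readline()]

print(sc.parallelize([0], numSlices=1).flatMap(first_line).collect())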

Spark --archives file not found error, exception from executor

I am submitting a Spark job using Dataproc Serverless. My Spark code uses a few .yaml files as configuration and I pass them as --archives to the code.
Command to run the code:
gcloud dataproc batches submit pyspark src/mapper.py \
--project=$PROJECT_ID \
--region=$REGION \
--deps-bucket=$DEPS_BUCKET \
--container-image=$CONTAINER_IMAGE \
--service-account=$SERVICE_ACCOUNT \
--subnet=$SUBNETWORK_URI \
--py-files=dist/src.zip \
--archives=dist/config.zip \
-- --arg1="value1"
But I am getting an error with the below message:
An exception was thrown from the Python worker. Please see the stack trace below.
FileNotFoundError: [Errno 2] No such file or directory: '/var/tmp/spark/work/config/config.yaml'
Code used to access config:
with open("config/config.yaml", "r") as f:
COUNTRY_CODES = yaml.load(f, yaml.SafeLoader)
How can I submit dependent files to Dataproc so that they will be available inside the /var/tmp/spark/work/ folder on the executor?
When you access files in an archive that is passed to a Spark job via the --archives parameter, you do not need to specify the full path to these files; instead, you need to use the current working directory (.). In your specific case it will probably be ./config/config.yaml (depending on the folder structure inside your archive).
You can read more about Python package management in Spark docs: https://spark.apache.org/docs/3.3.1/api/python/user_guide/python_packaging.html
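In code, that relative access could look like the sketch below; the exact relative path is an assumption and depends on how config.zip was built (i.e. whether config/ sits at the root of the archive):
import yaml

# The --archives content is unpacked under the task's working directory, so a
# relative path is enough (assumed layout: config/config.yaml at the zip root).
with open("./config/config.yaml", "r") as f:
    COUNTRY_CODES = yaml.load(f, yaml.SafeLoader)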

Spark external jars and files on hdfs

I have a spark job that I run using the spark-submit command.
The jar that I use is hosted on HDFS and I reference it directly in the spark-submit command using its HDFS file path.
Following the same logic, I'm trying to do the same for the --jars option, the --files option, and also the extraClassPath option (in spark.conf), but it seems there is an issue with the fact that they point to an HDFS file path.
My command looks like this:
spark-submit \
--class Main \
--jars 'hdfs://path/externalLib.jar' \
--files 'hdfs://path/log4j.xml' \
--properties-file './spark.conf' \
'hdfs://path/job_name.jar'
So not only does Spark raise an exception telling me it can't find the method when I call a method that refers to externalLib.jar, but from the start I also get these warning logs:
Source and destination file systems are the same. Not copying externalLib.jar
Source and destination file systems are the same. Not copying log4j.xml
It must come from the fact that I specify an HDFS path, because it works flawlessly when I refer to those jars on the local file system.
Maybe it isn't possible? What can I do?

Passing multiple typesafe config files to a yarn cluster mode application

I'm struggling a bit trying to use multiple (via include) Typesafe config files in my Spark application, which I am submitting to a YARN queue in cluster mode. I basically have two config files, and the file layouts are provided below:
env-common.properties
application-txn.conf (this file uses an "include" to reference the above one)
Both of the above files are external to my application.jar, so I pass them to YARN using "--files" (as can be seen below).
I am using the Typesafe config library to parse my "application-main.conf", and in this main conf I am trying to use a property from the env.properties file via substitution, but the variable name does not get resolved :( and I'm not sure why.
env.properties
txn.hdfs.fs.home=hdfs://dev/1234/data
application-txn.conf:
# application-txn.conf
include required(file("env.properties"))

app {
  raw-data-location = "${txn.hdfs.fs.home}/input/txn-raw"
}
Spark Application Code:
// propFile in the below block maps to "application-txn.conf" from the app's main method
def main(args: Array[String]): Unit = {
  val config = loadConf("application-txn.conf")
  val spark = SparkSession.builder.getOrCreate()
  // Code fails here:
  val inputDF = spark.read.parquet(config.getString("app.raw-data-location"))
}

def loadConf(propFile: String): Config = {
  ConfigFactory.load()
  val cnf = ConfigFactory.parseResources(propFile)
  cnf.resolve()
}
Spark Submit Code (called from a shell script):
spark-submit --class com.nic.cage.app.Transaction \
--master yarn \
--queue QUEUE_1 \
--deploy-mode cluster \
--name MyTestApp \
--files application-txn.conf,env.properties \
--jars #Typesafe config 1.3.3 and my app.jar go here \
--executor-memory 2g \
--executor-cores 2 \
app.jar application-txn.conf
When I run the above, I am able to parse the config file, but my app fails on trying to read the files from HDFS because it cannot find a directory with the name:
${txn.hdfs.fs.home}/input/txn-raw
I believe that the config is actually able to read both files...or else it would fail because of the "required" keyword. I verified this by adding another include statement with a dummy file name, and the application failed while parsing the config. I'm really not sure what's going on right now :(.
Any ideas what could be causing this resolution to fail?
If it helps: When I run locally with multiple config files, the resolution works fine
The syntax in application-txn.conf is wrong.
The variable should be outside the string, like so:
raw-data-location = ${txn.hdfs.fs.home}"/input/txn-raw"

Pyspark - Load file: Path does not exist

I am a newbie to Spark. I'm trying to read a local csv file within an EMR cluster. The file is located in: /home/hadoop/. The script that I'm using is this one:
spark = SparkSession \
    .builder \
    .appName("Protob Conversion to Parquet") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

df = spark.read.csv('/home/hadoop/observations_temp.csv', header=True)
When I run the script, it raises the following error message:
pyspark.sql.utils.AnalysisException: u'Path does not exist:
hdfs://ip-172-31-39-54.eu-west-1.compute.internal:8020/home/hadoop/observations_temp.csv
Then, I found out that I have to add file:// to the file path so it can read the file locally:
df = spark.read.csv('file:///home/hadoop/observations_temp.csv', header=True)
But this time, the above approach raised a different error:
Lost task 0.3 in stage 0.0 (TID 3,
ip-172-31-41-81.eu-west-1.compute.internal, executor 1):
java.io.FileNotFoundException: File
file:/home/hadoop/observations_temp.csv does not exist
I think this is because the file:// prefix just reads the file locally and does not distribute the file across the other nodes.
Do you know how can I read the csv file and make it available to all the other nodes?
You are right about the fact that your file is missing from your worker nodes, and that is what raises the error you got.
Here is the relevant part of the official documentation (External Datasets):
If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
So basically you have two solutions:
You copy your file onto each worker before starting the job;
Or you upload it to HDFS with something like this (recommended solution):
hadoop fs -put localfile /user/hadoop/hadoopfile.csv
Now you can read it with:
df = spark.read.csv('/user/hadoop/hadoopfile.csv', header=True)
It seems that you are also using AWS S3. You can always try to read it directly from S3 without downloading it (with the proper credentials, of course).
Some suggest that the --files flag provided with spark-submit uploads the files to the execution directories. I don't recommend this approach unless your CSV file is very small, but then you wouldn't need Spark.
Alternatively, I would stick with HDFS (or any distributed file system).
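For the direct S3 route, a minimal sketch (the bucket name is a placeholder; the cluster's IAM role needs read access to the object):
# Read the CSV straight from S3 instead of copying it onto the nodes first.
df = spark.read.csv('s3://<your-bucket>/observations_temp.csv', header=True)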
I think what you are missing is explicitly setting the master node while initializing the SparkSession. Try something like this:
spark = SparkSession \
    .builder \
    .master("local") \
    .appName("Protob Conversion to Parquet") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
and then read the file in the same way you have been doing
df = spark.read.csv('file:///home/hadoop/observations_temp.csv')
this should solve the problem...
This might be useful for someone running Zeppelin on a Mac using Docker.
Copy your files to a custom folder: /Users/my_user/zeppspark/myjson.txt
docker run -p 8080:8080 -v /Users/my_user/zeppspark:/zeppelin/notebook --rm --name zeppelin apache/zeppelin:0.9.0
On Zeppelin you can run this to get your file:
%pyspark
json_data = sc.textFile('/zeppelin/notebook/myjson.txt')
