ModuleNotFoundError: No module named 'pyspark.sql' when used with EMR Serverless - apache-spark

I'm getting the following error when I try to run a job on EMR Serverless:
ModuleNotFoundError: No module named 'pyspark.sql'. Please refer to user guide on how to use python libraries with EMR Serverless.
It happens when I try to import pyspark.sql in a python file located within a zip package.
The file -
pyspark.zip
|--__init__.py
|--spark.py
The content -
# __init__.py
from .spark import *

# spark.py
from pyspark.sql import SparkSession

def run():
    print("Create Spark Session")
    spark_session = SparkSession \
        .builder \
        .appName("First pyspark project") \
        .getOrCreate()
The Spark properties I gave to the job -
--conf spark.submit.pyFiles=s3://my-bucket/pyspark.zip
--conf spark.executorEnv.PYSPARK_PYTHON=python
I am afraid I missed something. Do I need to install something first?
All I did was compress the project into a zip file and upload it to S3.
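For reference, a job like this is normally submitted through the EMR Serverless StartJobRun API with the zip passed in sparkSubmitParameters. A hedged sketch using boto3 (the application ID, role ARN, and main.py entry point are placeholders, not from the original post):

import boto3

# Submit the job with the packaged zip on spark.submit.pyFiles (placeholder IDs/paths)
client = boto3.client("emr-serverless")
client.start_job_run(
    applicationId="<application-id>",
    executionRoleArn="arn:aws:iam::123456789012:role/emr-serverless-job-role",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://my-bucket/main.py",
            "sparkSubmitParameters": (
                "--conf spark.submit.pyFiles=s3://my-bucket/pyspark.zip "
                "--conf spark.executorEnv.PYSPARK_PYTHON=python"
            ),
        }
    },
)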

Related

JavaPackage object is not callable error for pydeequ constraint suggestion

I'm getting a "JavaPackage object is not callable" error while trying to run the PyDeequ constraint suggestion method on databricks.
I have tried running this code on Apache Spark 3.1.2 cluster as well as Apache Spark 3.0.1 cluster but no luck.
suggestionResult = ConstraintSuggestionRunner(spark).onData(df).addConstraintRule(DEFAULT()).run()
print(suggestionResult)
Please refer to the attached screenshots (the PyDeequ error and the expanded PyDeequ error) for the full error status.
I was able to combine some solutions found here, as well as other solutions, to get past the above JavaPackage error in Azure Databricks. Here are the details, if helpful for anyone.
From this link, I downloaded the appropriate JAR file to match my Spark version. In my case, that was deequ_2_0_1_spark_3_2.jar. I then installed this file using the JAR type under Libraries in my cluster configurations.
The following then worked, run in different cells of a notebook.
%pip install pydeequ

%sh export SPARK_VERSION=3.2.1

df = spark.read.load("abfss://container-name@account.dfs.core.windows.net/path/to/data")

from pyspark.sql import SparkSession
import pydeequ

spark = (SparkSession
         .builder
         .getOrCreate())

from pydeequ.analyzers import *

analysisResult = AnalysisRunner(spark) \
    .onData(df) \
    .addAnalyzer(Size()) \
    .addAnalyzer(Completeness("column_name")) \
    .run()

analysisResult_df = AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult)
analysisResult_df.show()
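With the JAR installed and the session above in place, the constraint suggestion call from the original question should then run as well; a sketch, with imports as documented by PyDeequ:

from pydeequ.suggestions import *

# Re-run the suggestion call from the question against the same DataFrame
suggestionResult = ConstraintSuggestionRunner(spark) \
    .onData(df) \
    .addConstraintRule(DEFAULT()) \
    .run()
print(suggestionResult)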

How to upload files to Amazon EMR?

My code is as follows:
# test2.py
from pyspark import SparkContext, SparkConf, SparkFiles

conf = SparkConf()
sc = SparkContext(
    appName="test",
    conf=conf)

from pyspark.sql import SQLContext
sqlc = SQLContext(sparkContext=sc)

with open(SparkFiles.get("test_warc.txt")) as f:
    print("opened")

sc.stop()
It works when I run it locally with:
spark-submit --deploy-mode client --files ../input/test_warc.txt test2.py
But after adding a step to the EMR cluster:
spark-submit --deploy-mode cluster --files s3://brand17-stock-prediction/test_warc.txt s3://brand17-stock-prediction/test2.py
I am getting this error:
FileNotFoundError: [Errno 2] No such file or directory:
'/mnt1/yarn/usercache/hadoop/appcache/application_1618078674774_0001/spark-e7c93ba0-7d30-4e52-8f1b-14dda6ff599c/userFiles-5bb8ea9f-189d-4256-803f-0414209e7862/test_warc.txt'
The path to the file is correct, but for some reason it is not being uploaded from S3.
I tried to load it from an executor:
from pyspark import SparkContext, SparkConf, SparkFiles
from operator import add

conf = SparkConf()
sc = SparkContext(
    appName="test",
    conf=conf)

from pyspark.sql import SQLContext
sqlc = SQLContext(sparkContext=sc)

def f(_):
    a = 0
    with open(SparkFiles.get("test_warc.txt")) as f:
        a += 1
        print("opened")
    return a  # test_module.test()

count = sc.parallelize(range(1, 3), 2).map(f).reduce(add)
print(count)  # prints 2

sc.stop()
And it works without errors.
It looks like the --files argument uploads files to the executors only. How can I upload to the master?
Your understanding is correct.
The --files argument uploads files to the executors only.
See this in the Spark documentation:
file: - Absolute paths and file:/ URIs are served by the driver’s HTTP
file server, and every executor pulls the file from the driver HTTP
server.
You can read more about this at advanced-dependency-management
Now, coming back to your second question:
How can I upload to master?
There is a concept of a bootstrap action in EMR. From the official documentation, it means the following:
You can use a bootstrap action to install additional software or
customize the configuration of cluster instances. Bootstrap actions
are scripts that run on cluster after Amazon EMR launches the instance
using the Amazon Linux Amazon Machine Image (AMI). Bootstrap actions
run before Amazon EMR installs the applications that you specify when
you create the cluster and before cluster nodes begin processing data.
How do I use it in my case?
While spawning the cluster, you can specify your script in the BootstrapActions JSON, something like the following, along with other custom configurations:
BootstrapActions=[
    {
        'Name': 'Setup Environment for downloading my script',
        'ScriptBootstrapAction': {
            'Path': 's3://your-bucket-name/path-to-custom-scripts/download-file.sh'
        }
    }
]
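For context, this fragment is a keyword argument of the EMR API call that creates the cluster; a minimal sketch of where it might sit if the cluster is created with boto3 (the release label, instance settings, and roles are illustrative placeholders):

import boto3

# Create a cluster with the bootstrap action attached (all values are placeholders)
emr = boto3.client("emr")
emr.run_job_flow(
    Name="cluster-with-bootstrap",
    ReleaseLabel="emr-6.6.0",
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    BootstrapActions=[
        {
            "Name": "Setup Environment for downloading my script",
            "ScriptBootstrapAction": {
                "Path": "s3://your-bucket-name/path-to-custom-scripts/download-file.sh"
            }
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)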
The content of download-file.sh should look something like this:
#!/bin/bash
set -x
workingDir=/opt/your-path/
sudo mkdir -p $workingDir
sudo aws s3 cp s3://your-bucket/path-to-your-file/test_warc.txt $workingDir
Now, in your Python script, you can read the file from workingDir/test_warc.txt (i.e. /opt/your-path/test_warc.txt).
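For instance, the driver code could then open it directly; a minimal sketch, with the path matching the workingDir set in download-file.sh above:

# Read the file placed on the node by the bootstrap action
with open("/opt/your-path/test_warc.txt") as f:
    print("opened", f.readline()[:80])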
There is also an option to execute your bootstrap action on the master node only, on task nodes only, or on a mix of both; bootstrap-actions/run-if is the script we can use for this case, as sketched below. More reading on this can be done at emr-bootstrap-runif.
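A hedged sketch of a master-only bootstrap action, following the run-if example in the EMR documentation (the S3 path may need a region prefix, and the command here is just the documented echo example):

BootstrapActions=[
    {
        'Name': 'Run a command on the master node only',
        'ScriptBootstrapAction': {
            # run-if evaluates the predicate and runs the command only where it is true
            'Path': 's3://elasticmapreduce/bootstrap-actions/run-if',
            'Args': ['instance.isMaster=true', 'echo running on master node']
        }
    }
]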

How to spark-submit a script directly in the terminal instead of using a script file?

I have some jobs that use the following command to execute some tasks:
pyspark --master yarn --deploy-mode cluster --py-files file.py --name file file.py
The script in my Python file is very simple:
from pyspark import SparkContext;
from pyspark.sql import HiveContext;
sc =SparkContext();
hive_context = HiveContext(sc);
table_1 = hive_context.sql("SELECT * FROM table_1");
table_1.write.insertInto("table_to_insert", overwrite=True);
My question is: can I run this command directly with the script instead of using a file? Something like:
"pyspark --master yarn --deploy-mode cluster --py-script 'from pyspark import SparkContext; from pyspark.sql import HiveContext; sc =SparkContext(); hive_context = HiveContext(sc); table_1 = hive_context.sql("SELECT * FROM table_1"); table_1.write.insertInto("table_to_insert", overwrite=True);'"
Is this possible?
Many thanks for your support!

Submit an application to a standalone spark cluster running in GCP from Python notebook

I am trying to submit a Spark application to a standalone Spark (2.1.1) cluster of 3 VMs running in GCP from my Python 3 notebook (running on my local laptop), but for some reason the Spark session is throwing the error "StandaloneAppClient$ClientEndpoint: Failed to connect to master sparkmaster:7077".
Environment details: IPython and the Spark master are running on one GCP VM called "sparkmaster". 3 additional GCP VMs are running Spark workers and Cassandra clusters. I connect from my local laptop (MBP) using Chrome to the IPython notebook on the "sparkmaster" GCP VM.
Please note that it works from the terminal:
bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.1 --master spark://sparkmaster:7077 ex.py 1000
Running it from the Python notebook:
import os
os.environ["PYSPARK_SUBMIT_ARGS"] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.1 pyspark-shell'

from pyspark.sql import SparkSession
from pyspark.sql.functions import *

spark = SparkSession.builder.master("spark://sparkmaster:7077").appName('somatic').getOrCreate()  # This step works if I use .master('local')

df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka1:9092,kafka2:9092,kafka3:9092") \
    .option("subscribe", "gene") \
    .load()
So far I have tried these:
I have tried changing the Spark master node's spark-defaults.conf and spark-env.sh to add SPARK_MASTER_IP.
I tried to find the STANDALONE_SPARK_MASTER_HOST=`hostname -f` setting so that I could remove the "-f". For some reason my Spark master UI shows FQDN:7077, not hostname:7077.
I passed the FQDN as a parameter to .master() and to os.environ["PYSPARK_SUBMIT_ARGS"].
Please let me know if you need more details.
After doing some more research I was able to resolve the issue. It was due to a simple environment variable called SPARK_HOME. In my case it was pointing to Conda's /bin (pyspark was running from that location), whereas my Spark setup was in a different path. The simple fix was to add
export SPARK_HOME="/home/<<your location path>>/spark/" to the .bashrc file (I want this to be attached to my profile, not to the Spark session).
How I have done it:
Step 1: SSH to the master node; in my case it was the same as the IPython kernel/server VM in GCP.
Step 2:
cd ~
sudo nano .bashrc
Scroll down to the last line and paste the line below:
export SPARK_HOME="/home/your/path/to/spark-2.1.1-bin-hadoop2.7/"
Press Ctrl+X, then Y, then Enter to save the changes.
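After reopening the shell (or re-sourcing .bashrc) and restarting the notebook kernel, a quick sanity check from Python; a minimal sketch, where the expected output is the path added above:

import os
# Should print the Spark installation path set in .bashrc
print(os.environ.get("SPARK_HOME"))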
Note: I have also added a few more details to the environment section for clarity.

Pyspark - Load file: Path does not exist

I am a newbie to Spark. I'm trying to read a local csv file within an EMR cluster. The file is located in: /home/hadoop/. The script that I'm using is this one:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Protob Conversion to Parquet") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

df = spark.read.csv('/home/hadoop/observations_temp.csv', header=True)
When I run the script, it raises the following error message:
pyspark.sql.utils.AnalysisException: u'Path does not exist:
hdfs://ip-172-31-39-54.eu-west-1.compute.internal:8020/home/hadoop/observations_temp.csv
Then, I found out that I have to add file:// in the file path so it can read the file locally:
df = spark.read.csv('file:///home/hadoop/observations_temp.csv', header=True)
But this time, the above approach raised a different error:
Lost task 0.3 in stage 0.0 (TID 3,
ip-172-31-41-81.eu-west-1.compute.internal, executor 1):
java.io.FileNotFoundException: File
file:/home/hadoop/observations_temp.csv does not exist
I think this is because the file:// prefix just reads the file locally and does not distribute the file across the other nodes.
Do you know how I can read the csv file and make it available to all the other nodes?
You are right: your file is missing from your worker nodes, and that is what raises the error you got.
Here is the official documentation reference, External Datasets:
If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
So basically you have two solutions:
You copy your file onto each worker before starting the job;
Or you upload it to HDFS with something like this (recommended solution):
hadoop fs -put localfile /user/hadoop/hadoopfile.csv
Now you can read it with:
df = spark.read.csv('/user/hadoop/hadoopfile.csv', header=True)
It seems that you are also using AWS S3. You can always try to read it directly from S3 without downloading it (with the proper credentials, of course), for example as shown below.
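A sketch of a direct S3 read (the bucket and key are placeholders; on EMR the s3:// scheme is handled by EMRFS):

df = spark.read.csv('s3://your-bucket/path/observations_temp.csv', header=True)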
Some suggest that the --files tag provided with spark-submit uploads the files to the execution directories. I don't recommend this approach unless your csv file is very small, but then you won't need Spark anyway.
Alternatively, I would stick with HDFS (or any distributed file system).
I think what you are missing is explicitly setting the master node while initializing the SparkSession; try something like this:
spark = SparkSession \
    .builder \
    .master("local") \
    .appName("Protob Conversion to Parquet") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
and then read the file in the same way you have been doing
df = spark.read.csv('file:///home/hadoop/observations_temp.csv')
this should solve the problem...
This might be useful for someone running Zeppelin on a Mac using Docker.
Copy your files to a custom folder: /Users/my_user/zeppspark/myjson.txt
docker run -p 8080:8080 -v /Users/my_user/zeppspark:/zeppelin/notebook --rm --name zeppelin apache/zeppelin:0.9.0
On Zeppelin you can run this to get your file:
%pyspark
json_data = sc.textFile('/zeppelin/notebook/myjson.txt')
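If you need a DataFrame rather than an RDD, something like this should also work in the same interpreter (a sketch, assuming the file contains JSON lines and the built-in spark session is available):

%pyspark
# Load the mounted file as a DataFrame (path comes from the -v mount in the docker run above)
df = spark.read.json('/zeppelin/notebook/myjson.txt')
df.show()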
