AWS EMR - ModuleNotFoundError: No module named 'arrow' - apache-spark

I'm running into this issue when trying to upgrade to Python 3.9 for our EMR jobs using PySpark 3.0.1 / EMR release 6.2.1. I've created the EMR cluster using a bootstrap script, and here are the Spark environment variables that were set:
export PYSPARK_PYTHON=/usr/local/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/local/bin/python3
export SPARK_HOME=/usr/lib/spark
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9-src.zip
I've installed all the application dependency libraries using a shell script; they are located in /home/ec2-user. However, when I try to spark-submit a job with the following command as the hadoop user, I see the "ModuleNotFoundError".
Spark-submit cmd:
/bin/sh -c "MYAPP_ENV=dev PYSPARK_PYTHON=/usr/local/bin/python3 PYTHONHASHSEED=0 SETUPTOOLS_USE_DISTUTILS=stdlib spark-submit --master yarn --deploy-mode client --jars /home/hadoop/ext_lib/*.jar --py-files /home/hadoop/myapp.zip --conf spark.sql.parquet.compression.codec=gzip --conf spark.executorEnv.MYAPP_ENV=dev /home/hadoop/myapp/oasis/etl/spark/daily/run_daily_etl.py '--lookback_days' '1' '--s3_file_system' 's3'"
Error: ModuleNotFoundError: No module named 'arrow'
However, the same setup works when we use an EMR cluster with "EMR Release label: emr-5.28.0" and Spark 2.4.4.
Can someone help identify the cause, as I'm completely stuck on this? I suspect it may be due to the hadoop user not having access to the ec2-user home folder.
Thanks
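One thing worth verifying (a hedged sketch, not a confirmed fix): with PYSPARK_PYTHON pointed at /usr/local/bin/python3, the arrow package has to be importable by that exact interpreter for whichever OS user runs the job (hadoop here, yarn on the executors), and per-user installs under /home/ec2-user will not be visible to those accounts. A bootstrap-script snippet along these lines makes the install system-wide; the package list is illustrative:
# Hypothetical bootstrap snippet: make dependencies importable by /usr/local/bin/python3
# for every OS user (hadoop, yarn), not just ec2-user.
/usr/local/bin/python3 -c "import arrow" 2>/dev/null || echo "arrow not importable by /usr/local/bin/python3"
sudo /usr/local/bin/python3 -m pip install arrow
Alternatively, pure-Python dependencies can be zipped and shipped with --py-files alongside myapp.zip.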

Related

Running Spark Job on Zeppelin

I have written a custom Spark library in Scala. I am able to run it successfully as a spark-submit step by spawning the cluster and running the following commands. First I fetch my two jars:
aws s3 cp s3://jars/RedshiftJDBC42-1.2.10.1009.jar .
aws s3 cp s3://jars/CustomJar .
and then I run my Spark job as
spark-submit --deploy-mode client --jars RedshiftJDBC42-1.2.10.1009.jar --packages com.databricks:spark-redshift_2.11:3.0.0-preview1,com.databricks:spark-avro_2.11:3.2.0 --class com.activities.CustomObject CustomJar.jar
This runs my CustomObject successfully. I want to do the same thing in Zeppelin, but I do not know how to add the jars and then run the equivalent of a spark-submit step.
You can add these dependencies to the Spark interpreter within Zeppelin:
Go to "Interpreter"
Choose edit and add the jar file
Restart the interpreter
More info is available in the Zeppelin documentation on interpreter dependencies.
EDIT
You might also want to use the %dep paragraph to access the z variable (an implicit Zeppelin context) and do something like this:
%dep
z.load("/some_absolute_path/myjar.jar")
It depends on how you run Spark. Most of the time, the Zeppelin interpreter will embed the Spark driver.
The solution is to configure the Zeppelin interpreter instead:
ZEPPELIN_INTP_JAVA_OPTS configures the Java options
SPARK_SUBMIT_OPTIONS configures the Spark options
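For example, a minimal conf/zeppelin-env.sh sketch (the jar path and package coordinates are illustrative, reusing the ones from the question) that makes Zeppelin pass the dependencies to spark-submit:
# conf/zeppelin-env.sh (illustrative values)
export SPARK_HOME=/usr/lib/spark
export SPARK_SUBMIT_OPTIONS="--jars /home/hadoop/RedshiftJDBC42-1.2.10.1009.jar --packages com.databricks:spark-redshift_2.11:3.0.0-preview1,com.databricks:spark-avro_2.11:3.2.0"
Restart the Spark interpreter afterwards so the options are picked up.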

Apache Toree to connect to a remote spark cluster

Is there a way to connect Apache Toree to a remote Spark cluster? I see the common command is
jupyter toree install --spark_home=/usr/local/bin/apache-spark/
How can I go about using Spark on a remote server without having to install it locally?
There is indeed a way of getting Toree to connect to a remote Spark cluster.
The easiest way I've discovered is to clone the existing Toree Scala/Python kernel, and create a new Toree Scala/Python Remote kernel. That way you can have the choice of running locally or remotely.
Steps:
Make a copy of the existing kernel. On my particular Toree install, the kernels were located at /usr/local/share/jupyter/kernels/, so I ran the following command:
cp -pr /usr/local/share/jupyter/kernels/apache_toree_scala/ /usr/local/share/jupyter/kernels/apache_toree_scala_remote/
Edit the new kernel.json file in /usr/local/share/jupyter/kernels/apache_toree_scala_remote/ and add the requisite Spark options to the __TOREE_SPARK_OPTS__ variable. Technically, only --master <path> is required, but you can also add --num-executors, --executor-memory, etc. to the variable.
Restart Jupyter.
My kernel.json file looks like this:
{
  "display_name": "Toree - Scala Remote",
  "argv": [
    "/usr/local/share/jupyter/kernels/apache_toree_scala_remote/bin/run.sh",
    "--profile",
    "{connection_file}"
  ],
  "language": "scala",
  "env": {
    "PYTHONPATH": "/opt/spark/python:/opt/spark/python/lib/py4j-0.9-src.zip",
    "SPARK_HOME": "/opt/spark",
    "DEFAULT_INTERPRETER": "Scala",
    "PYTHON_EXEC": "python",
    "__TOREE_OPTS__": "",
    "__TOREE_SPARK_OPTS__": "--master spark://192.168.0.255:7077 --deploy-mode client --num-executors 4 --executor-memory 4g --executor-cores 8 --packages com.databricks:spark-csv_2.10:1.4.0"
  }
}
This is a possible example with some intuitive details for ANY remote cluster install. For my remote cluster, which is Cloudera 5.9.2, these are the specific steps. (You can also use this example with non-Cloudera clusters with some smart edits.)
On OS X, to build the CDH version (skip if using a distribution):
Go to https://github.com/Myllyenko/incubator-toree and clone the repo
Download Docker
Set up 'signing' - it's been some time since I set this up - you'll need to sign the build above. TBD
Create a new git branch, then edit the .travis.yml, README.md, and build.sbt files to change 5.10.x to 5.9.2
Start Docker, cd into the release directory, build with make release, wait, then sign the 3 builds
Copy the file ./dist/toree-pip/toree-0.2.0-spark-1.6.0-cdh5.9.2.tar.gz to your spark-shell machine that can reach your YARN-controlled Spark cluster
Merge, commit, etc. your branch back to your master repo if this will be mission critical
Spark Machine Installs:
Warning: Some steps may need to be done as root as a last resort
Install pip / Anaconda (see other docs)
Install Jupyter: sudo pip install jupyter
Install Toree: sudo pip install toree-0.2.0-spark-1.6.0-cdh5.9.2, or use the apache-toree distribution
Configure Toree to run with Jupyter (example):
Edit ~/.bash_profile and add the following:
echo $PATH
PATH=$PATH:$HOME/bin
export PATH
echo $PATH
export CDH_SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
export SPARK_HOME=/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/spark
export PYTHONPATH=$PYTHONPATH:$SPARK_HOME/python:$SPARK_HOME/python/lib
export SPARK_CONF_DIR=/etc/spark/conf
export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
PATH=$PATH:$SPARK_HOME/bin
export PATH
echo $PATH
export SPARK_PKGS=$(cat << END | xargs echo | sed 's/ /,/g'
com.databricks:spark-csv_2.10:1.5.0
END
)
export SPARK_JARS=$(cat << END | xargs echo | sed 's/ /,/g'
/home/mymachine/extras/someapp.jar
/home/mymachine/extras/jsoup-1.10.3.jar
END
)
export TOREE_JAR="/usr/local/share/jupyter/kernels/apache_toree_scala/lib/toree-assembly-0.2.0-spark-1.6.0-cdh5.9.2-incubating.jar"
export SPARK_OPTS="--master yarn-client --conf spark.yarn.config.gatewayPath=/opt/cloudera/parcels --conf spark.scheduler.mode=FAIR --conf spark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/hadoop/lib/native:/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/hadoop --conf spark.driver.extraLibraryPath=/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/hadoop/lib/native:/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/hadoop --conf spark.yarn.am.extraLibraryPath=/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/hadoop/lib/native:/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/hadoop --conf spark.yarn.historyServer.address=http://yourCDHcluster.net:18088 --conf spark.default.parallelism=20 --conf spark.driver.maxResultSize=1g --conf spark.driver.memory=1g --conf spark.executor.cores=4 --conf spark.executor.instances=5 --conf spark.executor.memory=1g --packages $SPARK_PKGS --jars $SPARK_JARS"
function jti() {
jupyter toree install \
--replace \
--user \
--kernel_name="CDH 5.9.2 Toree" \
--debug \
--spark_home=${SPARK_HOME} \
--spark_opts="$SPARK_OPTS" \
--log-level=0
}
function jn() {
jupyter notebook --ip=127.0.0.1 --port=8888 --debug --log-level=0
}
If you want to reach Toree on a different port, now is your chance to change 8888.
Log out of your Toree / spark-shell machine
ssh back to that machine with ssh -L 8888:localhost:8888 toreebox.cdhcluster.net (assuming 8888 is the port in the bash file)
As a user (not root), you should be able to type jti to install Toree into Jupyter. (Note: understanding this step may help when installing other kernels into Jupyter. Sidebar: #jamcom mentioned the produced file; this step produces it automatically, buried in your home directory's tree since it was installed as a user rather than as root.)
As the user, type jn to start a Jupyter Notebook. Wait a few seconds until the notebook URL is printed, then paste that URL into your browser.
You now have Jupyter running, so pick a new 'CDH 5.9.2 Toree' notebook (or whichever version you installed). This launches a new browser window. Since you have some Toree experience, run something like sc.getConf.getAll.sortWith(_._1 < _._1).foreach(println) to get the lazily instantiated Spark context going. Be really patient as your job is submitted to the cluster; you may have to wait a long time if your cluster is busy, or a little while for your job to be processed on the cluster.
Tips and Tricks:
I ran into an issue on the first run that subsequent runs never hit. (The issue might be fixed in the GitHub repo.)
Sometimes, I have to kill the old 'Apache Toree' app on YARN to start a new Toree.
Sometimes my VM ends up with an orphaned JVM. If you get memory errors starting a Jupyter Notebook / Toree, or have unexpectedly disconnected, check your process list with top and kill the extra JVM (be careful identifying your lost process).
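A quick sketch of what that cleanup might look like (the grep pattern and PID are illustrative; double-check before killing anything):
# list running JVMs and look for a leftover Toree / Spark driver
jps -lm | grep -i toree
# or: ps aux | grep -i [t]oree
kill <pid>    # only after confirming <pid> is the orphaned JVM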

spark-submit classNotFoundException

I'm building a Spark app with Maven (with the shade plugin) and scp'ing it to a data node for execution with spark-submit --deploy-mode cluster (since launching right from the build system with --deploy-mode client doesn't work because of an asymmetric network not under my control).
Here's my launch command
spark-submit \
  --class Test \
  --master yarn \
  --deploy-mode cluster \
  --supervise \
  --verbose \
  jarName.jar \
  hdfs:///somePath/Test.txt \
  hdfs:///somePath/Test.out
The job quickly fails with a ClassNotFoundException for Test$1, one of the anonymous classes Java creates from my main class:
6/03/18 12:59:41 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, dataNode3): java.lang.ClassNotFoundException: Test$1
I've seen this error mentioned many times (google) and most recommendations boil down to calling conf.setJars(jarPaths) or similar.
I really don't see why this is needed when the missing class is definitely (I've checked) available in jarName.jar, why specifying this at compile time is preferable to doing it at run time with --jars as a spark-submit argument, and in either case what path I should provide for the jar. I've been copying it to my home directory on the data node from target/jarName.jar on the build system, but it seems spark-submit copies it to HDFS somewhere that's hard to pin down as a hard-coded path at either compile time or launch time.
And most of all, why isn't spark-submit handling this automatically based on the someJar.jar argument, and if not, what should I do to fix it?
Check the answer from here
spark submit java.lang.ClassNotFoundException
spark-submit --class Test --master yarn --deploy-mode cluster --supervise --verbose jarName.jar hdfs:///somePath/Test.txt hdfs:///somePath/Test.out
Try using the fully qualified class name; also check the package path of the class in your project, e.g.:
--class com.myclass.Test
I had the same issue with my Scala Spark application when I tried to run it in "cluster" mode:
--master yarn --deploy-mode cluster
I found the solution on this page. Basically, what I was missing (and what is also missing in your command) is the "--jars" parameter, which lets you distribute the application jars to your cluster.
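As a sketch, the launch command from the question with a --jars argument added would look something like this (the /home/me/ path is illustrative; list whatever jars your job needs at runtime):
spark-submit \
  --class Test \
  --master yarn \
  --deploy-mode cluster \
  --supervise \
  --verbose \
  --jars /home/me/jarName.jar \
  jarName.jar \
  hdfs:///somePath/Test.txt \
  hdfs:///somePath/Test.out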
Suggestion: to troubleshoot this kind of error you can use the following command:
yarn logs --applicationId yourApplicationId
where yourApplicationId should appear in your YARN exception log.

Spark driver always fail to bind to submit host in cluster mode

Hi, I'm trying to deploy a Spark Streaming job using a standalone cluster. All the jars are installed locally on each node and I run spark-submit on one of the nodes. The driver is then started on one of the workers at random, but it always tries to bind to the node where I submitted the job, and if that happens to be a different node, the driver always fails. I tried setting spark.driver.host to different values but it didn't help.
Has anyone had the same problem? Or are there better ways to submit Spark jobs, ideally on a standalone cluster?
spark-env.sh
export SPARK_MASTER_WEBUI_PORT=18080
export SPARK_MASTER_PORT=7077
export SPARK_LOCAL_HOSTNAME=local_host_name
export SPARK_LOG_DIR=/var/log/spark
export SPARK_WORKER_DIR=/var/run/spark/work
export SPARK_LOCAL_DIRS=/var/run/spark/tmp
export STANDALONE_SPARK_MASTER_HOST=master_host_name
spark-defaults.conf
spark.master spark://master_host_name:6066
spark.io.compression.codec lz4
I run it with spark-submit --deploy-mode cluster --supervise
Thanks a lot

Apache Spark Multi Node Clustering

I am currently working on log analysis using Apache Spark. I am new to Apache Spark. I have tried Apache Spark standalone mode. I can run my code by submitting the jar in client deploy mode, but I cannot run it on a multi-node cluster. The worker nodes are on different machines.
sh spark-submit --class Spark.LogAnalyzer.App --deploy-mode cluster --master spark://rishon.server21:7077 /home/rishon/loganalyzer.jar "/home/rishon/apache-tomcat-7.0.63/LogAnalysisBackup/"
When I run this command, it shows the following error:
15/10/20 18:04:23 ERROR ClientEndpoint: Exception from cluster was: java.io.FileNotFoundException: /home/rishon/loganalyzer.jar (No such file or directory)
java.io.FileNotFoundException: /home/rishon/loganalyzer.jar (No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:146)
at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124)
at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114)
at org.spark-project.guava.io.ByteSource.copyTo(ByteSource.java:202)
at org.spark-project.guava.io.Files.copy(Files.java:436)
at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$copyRecursive(Utils.scala:514)
at org.apache.spark.util.Utils$.copyFile(Utils.scala:485)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:562)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:369)
at org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:150)
at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:79)
As I understand it, the driver program sends the data and application code to the worker nodes. I don't know whether my understanding is correct or not, so please help me run the application on a cluster.
I have tried to run the jar on the cluster, and now there is no exception, but why is the task not assigned to a worker node?
I have tried without clustering and it works fine, as shown in the following figure.
The above image shows the task assigned to the worker nodes. But I have one more problem analysing the log file: I have the log files on the master node in a folder (e.g. '/home/visva/log'), but the worker nodes search for the file on their own file systems.
I met the same problem.
My solution was to upload my .jar file to HDFS.
Enter the command line like this:
spark-submit --class com.example.RunRecommender --master spark://Hadoop-NameNode:7077 --deploy-mode cluster --executor-memory 6g --executor-cores 3 hdfs://Hadoop-NameNode:9000/spark-practise-assembly-1.0.jar
application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.
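A rough sketch of that approach (the NameNode host/port and target directories are illustrative): upload the application jar, and any input files, to HDFS so every node resolves the same path, then submit with hdfs:// URLs.
hdfs dfs -mkdir -p /apps /data/logs
hdfs dfs -put /home/rishon/loganalyzer.jar /apps/
hdfs dfs -put /home/rishon/apache-tomcat-7.0.63/LogAnalysisBackup/* /data/logs/
spark-submit --class Spark.LogAnalyzer.App --deploy-mode cluster \
  --master spark://rishon.server21:7077 \
  hdfs://rishon.server21:9000/apps/loganalyzer.jar "hdfs://rishon.server21:9000/data/logs/"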
If you use cluster mode in spark-submit, you need to use port 6066 (the default REST port in Spark):
spark-submit --class Spark.LogAnalyzer.App --deploy-mode cluster --master spark://rishon.server21:6066 /home/rishon/loganalyzer.jar "/home/rishon/apache-tomcat-7.0.63/LogAnalysisBackup/"
In my case, I uploaded the application jar to every node in the cluster, because I do not know how spark-submit transfers the app automatically, and I don't know how to specify a node as the driver node.
Note: the application jar path must be valid on any node of the cluster.
There are two deploy modes in Spark for running a script.
1. client (default): In client mode, the driver is launched directly within the spark-submit process, which acts as a client to the cluster (on the master node).
2. cluster: If your application is submitted from a machine far from the worker machines, it is common to use cluster mode to minimize network latency between the driver and the executors (see the sketch below).
Reference: the Spark documentation for submitting applications.
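For illustration, the difference in where the driver runs (a sketch; class name and master URL reused from the question, jar locations illustrative):
# client mode: driver runs inside the spark-submit process on the submitting machine
spark-submit --class Spark.LogAnalyzer.App --master spark://rishon.server21:7077 --deploy-mode client /home/rishon/loganalyzer.jar
# cluster mode: driver is launched on one of the worker nodes, so the jar path
# must be readable from every node (e.g. an hdfs:// URL, as shown earlier)
spark-submit --class Spark.LogAnalyzer.App --master spark://rishon.server21:6066 --deploy-mode cluster hdfs://rishon.server21:9000/apps/loganalyzer.jar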

Resources