spark (pyspark) UnicodeEncodeError on yarn cluster mode - apache-spark

I'm running some simple pyspark code that tries to print a "shrug" at the end. I'm running pyspark with Yarn, and when I run my code in client mode everything works properly. When I run my code in cluster mode (--deploy-mode cluster), I get a UnicodeEncodeError.
Obviously I could simply not print the "shrug", but I feel like I'm missing something major here. Should I be worried that my code isn't handling unicode properly?
Attempted solution:
I've set PYTHONIOENCODING=utf8 and PYSPARK_PYTHON=python3.
Code:
#!/usr/bin/env python3
# Spark Imports
from pyspark import SparkContext
from pyspark.sql import SparkSession
...
if __name__ == '__main__':
    sc = SparkContext()
    session = SparkSession\
        .builder\
        .appName("myApp")\
        .getOrCreate()
    ...
    print("¯\_(ツ)_/¯")
No error: ./bin/spark-submit --master yarn myApp.py
Throws error: ./bin/spark-submit --master yarn --deploy-mode cluster myApp.py
Error:
Traceback (most recent call last):
File "myApp.py", line 435, in <module>
print("\xaf\_(\u30c4)_/\xaf")
UnicodeEncodeError: 'ascii' codec can't encode character '\xaf' in position 46: ordinal not in range(128)
Settings:
I have the following in my .zshrc:
alias sparkclient="/opt/mapr/spark/spark-2.0.1/bin/spark-submit --master yarn"
alias sparkcluster="/opt/mapr/spark/spark-2.0.1/bin/spark-submit --master yarn --deploy-mode cluster --driver-memory 6g"
export SPARK_HOME=/opt/mapr/spark/spark-2.0.1
export PYSPARK_PYTHON=python3
export PYTHONHASHSEED=0
export PYTHONIOENCODING=utf8
export SPARK_YARN_USER_ENV=PYTHONHASHSEED=0
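The traceback names the 'ascii' codec, which suggests that stdout in the cluster-mode driver was not opened as UTF-8. A plausible reason, stated here as an assumption rather than something the logs confirm, is that in cluster mode the driver runs inside a YARN container that never sees the exports from a local .zshrc. The sketch below reproduces the same failure without Spark by printing to an in-memory stream with a chosen encoding; the helper name print_with_encoding is hypothetical.

```python
import io


def print_with_encoding(text, encoding):
    """Print `text` to an in-memory stream that uses `encoding`; return the raw bytes written."""
    buffer = io.BytesIO()
    # newline="\n" keeps the output identical across platforms.
    stream = io.TextIOWrapper(buffer, encoding=encoding, newline="\n")
    print(text, file=stream)
    stream.flush()
    return buffer.getvalue()


shrug = "¯\\_(ツ)_/¯"

# A UTF-8 stream accepts the string.
utf8_bytes = print_with_encoding(shrug, "utf-8")

# An ASCII stream fails the same way the cluster-mode driver did.
try:
    print_with_encoding(shrug, "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```

If this diagnosis holds, the code itself handles unicode fine and only the output stream is misconfigured. On Python 3.7 or later, sys.stdout.reconfigure(encoding="utf-8") at the top of the script is one way to force the encoding from inside the job; whether setting PYTHONIOENCODING for the YARN application master would also fix it is untested here.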

Related

Getting error while setting spark.ext.h2o.backend.cluster.mode=external in pysparkling standalone cluster

Code:
import pandas as pd
from pyspark.sql import SparkSession
from pysparkling import *
import h2o
from pysparkling.ml import H2OAutoML
spark = SparkSession.builder.appName('SparkApplication').getOrCreate()
hc = H2OContext.getOrCreate()
Spark-submit Command:
spark-submit --master spark://local:7077 \
--py-files sparkling-water-3.36.1.3-1-3.2/py/h2o_pysparkling_3.2-3.36.1.3-1-3.2.zip \
--conf "spark.ext.h2o.backend.cluster.mode=external" \
--conf spark.ext.h2o.external.start.mode="auto" \
--conf spark.ext.h2o.external.h2o.driver="/home/whiz/spark/h2odriver-3.36.1.3.jar" \
--conf spark.ext.h2o.external.cluster.size=2 \
spark_h20/h2o_script.py
Error Logs:
py4j.protocol.Py4JJavaError: An error occurred while calling o58.getOrCreate.
: java.io.IOException: Cannot run program "hadoop": error=2, No such file or directory
The automatic start of the SW external backend is only supported in Hadoop or K8s environments. In a standalone deployment, you need to deploy the external backend manually according to the tutorial in the SW documentation.

spark.yarn.jars - py4j.protocol.Py4JError: An error occurred while calling None.None. Trace:

I am trying to run a Spark job using the spark2-submit command. The version of Spark installed on the cluster is Cloudera's Spark 2.1.0, and I am specifying my jars for version 2.4.0 using the conf spark.yarn.jars as shown below -
spark2-submit \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/virtualenv/path/bin/python \
--conf spark.yarn.jars=hdfs:///some/path/spark24/* \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.task.cpus=2 \
--executor-cores 2 \
--executor-memory 4g \
--driver-memory 4g \
--archives /virtualenv/path \
--files /etc/hive/conf/hive-site.xml \
--name my_app \
test.py
This is the code I have in test.py -
import os
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
print("Spark Session created")
On running the submit command, I see messages like below -
yarn.Client: Source and destination file systems are the same. Not copying hdfs:///some/path/spark24/some.jar
And then I get this error on the line where spark session is being created -
spark = SparkSession.builder.getOrCreate()
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/pyspark.zip/pyspark/sql/session.py", line 169, in getOrCreate
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/pyspark.zip/pyspark/context.py", line 310, in getOrCreate
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/pyspark.zip/pyspark/context.py", line 115, in __init__
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/pyspark.zip/pyspark/context.py", line 259, in _ensure_initialized
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/pyspark.zip/pyspark/java_gateway.py", line 117, in launch_gateway
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 175, in java_import
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 323, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling None.None. Trace:
Authentication error: unexpected command.
The py4j in the error is coming from the existing Spark installation and not from the version in my jars. Were my spark24 jars not picked up? The same code runs fine if I remove the conf for jars, but then it presumably runs on the existing Spark version 2.1.0. Any clues on how to fix this?
Thanks.
The problem turned out to be that Python was running from the wrong place. I had to submit with the correct PYTHONPATH, this way -
PYTHONPATH=./${virtualenv}/venv/lib/python3.6/site-packages/ spark2-submit
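A quick way to check which copy of a library Python would actually pick up, before or after adjusting PYTHONPATH, is to ask the import system for the module's origin. This is a minimal sketch using only the standard library; the helper name module_origin is hypothetical, and "json" is included only so the output shows at least one module that is always present.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None if it cannot be found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# Modules that are not installed in the current environment report None.
for name in ("py4j", "pyspark", "json"):
    print(name, "->", module_origin(name))
```

Run with the same interpreter and PYTHONPATH that the job uses, this shows whether py4j and pyspark resolve to the cluster's Spark 2.1.0 parcel or to the virtualenv.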

Spark-submit in cluster mode

I'm facing a problem launching a Spark Application in cluster mode
This is the .sh :
export SPARK_MAJOR_VERSION=2
spark-submit \
--master yarn \
--deploy-mode cluster \
--driver-memory 8G \
--executor-memory 8G \
--total-executor-cores 4 \
--num-executors 4 \
/home/hdfs/spark_scripts/ETL.py &> /home/hdfs/spark_scripts/log_spark.txt
In the YARN logs, I found an ImportError related to a .py file that "ETL.py" needs. In other words, "ETL.py" contains this line:
import AppUtility
AppUtility.py is in the same directory as ETL.py.
In local mode, it works.
This is the YARN log:
20/04/28 10:59:59 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
Container: container_e64_1584554814241_22431_02_000001 on ftpandbit02.carte.local_45454
LogAggregationType: AGGREGATED
LogType:stdout
LogLastModifiedTime:Tue Apr 28 10:57:10 +0200 2020
LogLength:138
LogContents:
Traceback (most recent call last):
File "ETL.py", line 8, in <module>
import AppUtility
ImportError: No module named AppUtility
End of LogType:stdout
End of LogType:prelaunch.err
It depends on whether you use client mode or cluster mode.
If you use Spark in Yarn client mode, you'll need to install any dependencies on the machines on which Yarn starts the executors. That's the only surefire way to make this work.
Using Spark in Yarn cluster mode is a different story. You can distribute Python dependencies with
./bin/spark-submit --py-files AppUtility.py /home/hdfs/spark_scripts/ETL.py
The --py-files directive sends the file to the Spark workers but does not add it to the PYTHONPATH.
To add the dependencies to the PYTHONPATH and fix the ImportError, add the following line to the Spark job, ETL.py:
sc.addPyFile(PATH)
where PATH is the path to AppUtility.py (it can be a local file, a file in HDFS, a zip, or an HTTP, HTTPS or FTP URI).
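The ImportError in the YARN log is ordinary Python behavior: a module can only be imported if its directory is on sys.path. The sketch below shows that mechanism without Spark, as a rough analogy for what shipping the file and adding it to the Python path achieves on a cluster node. The module name app_utility_demo and the helper import_after_adding_path are hypothetical.

```python
import importlib
import os
import sys
import tempfile


def import_after_adding_path(module_name, source):
    """Write a module to a temp dir; report whether the import failed before the dir was on sys.path."""
    workdir = tempfile.mkdtemp()
    with open(os.path.join(workdir, module_name + ".py"), "w") as handle:
        handle.write(source)

    # The file exists on disk, but its directory is not on sys.path yet.
    try:
        importlib.import_module(module_name)
        failed_before = False
    except ImportError:
        failed_before = True

    # Once the directory is on sys.path, the same import succeeds.
    sys.path.insert(0, workdir)
    importlib.invalidate_caches()
    module = importlib.import_module(module_name)
    return failed_before, module


failed_before, module = import_after_adding_path(
    "app_utility_demo", "def greet():\n    return 'hello'\n"
)
```

In local mode the script's own directory is on sys.path, which is why the import works there; in cluster mode the driver runs in a container that only contains what was shipped with the job.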

Not able to access the local file in pyspark

I am trying to read a local file in client mode on the Yarn framework. I am not able to access the local file even in client mode.
import os
import pyspark.sql.functions as F
from os import listdir, path
from pyspark import SparkConf, SparkContext
import argparse
from pyspark import SparkFiles
from pyspark.sql import SparkSession
def main():
    spark = SparkSession \
        .builder \
        .appName("Spark File load example") \
        .config("spark.jars","/u/user/someuser/sqljdbc4.jar") \
        .config("spark.dynamicAllocation.enabled","true") \
        .config("spark.shuffle.service.enabled","true") \
        .config("hive.exec.dynamic.partition", "true") \
        .config("hive.exec.dynamic.partition.mode", "nonstrict") \
        .config("spark.sql.shuffle.partitions","50") \
        .config("hive.metastore.uris", "thrift://******.hpc.****.com:9083") \
        .enableHiveSupport() \
        .getOrCreate()
    spark.sparkContext.addFile("/u/user/vikrant/testdata/EMPFILE1.csv")
    inputfilename=getinputfile(spark)
    print("input file path is:",inputfilename)
    data = processfiledata(spark,inputfilename)
    data.show()
    spark.stop()
def getinputfile(spark):
    spark_files_dir = SparkFiles.getRootDirectory()
    print("spark_files_dir:",spark_files_dir)
    inputfile = [filename
                 for filename in listdir(spark_files_dir)
                 if filename.endswith('EMPFILE1.csv')]
    if len(inputfile) != 0:
        path_to_input_file = path.join(spark_files_dir, inputfile[0])
    else:
        print("file path not found",path_to_input_file)
    print("inputfile name:",inputfile)
    return path_to_input_file
def processfiledata(spark,inputfilename):
    dataframe= spark.read.format("csv").option("header","false").load(inputfilename)
    return dataframe
if __name__ == "__main__":
    main()
Below is my shell script-->
spark-submit --master yarn --deploy-mode client PysparkMainModulenew.py --files /u/user/vikrant/testdata/EMPFILE1.csv
Below is the error message-->
('spark_files_dir:',
u'/h/tmp/spark-76bdbd48-cbb4-4e8f-971a-383b899f79b0/userFiles-ee6dcdec-b320-433b-8491-311927c75fe2')
('inputfile name:', [u'EMPFILE1.csv'])
('input file path is:', u'/h/tmp/spark-76bdbd48-cbb4-4e8f-971a-383b899f79b0/userFiles-ee6dcdec-b320-433b-8491-311927c75fe2/EMPFILE1.csv')
Traceback (most recent call last):
File "/u/user/vikrant/testdata/PysparkMainModulenew.py", line 57, in <module>
main()
File "/u/user/vikrant/testdata/PysparkMainModulenew.py", line 31, in main
data = processfiledata(spark,inputfilename)
File "/u/user/vikrant/testdata/PysparkMainModulenew.py", line 53, in processfiledata
dataframe = spark.read.format("csv").option("header","false").load(inputfilename)
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 166, in load
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: u'Path does not exist: hdfs://hdd2cluster/h/tmp/spark-76bdbd48-cbb4-4e8f-971a-383b899f79b0/userFiles-ee6dcdec-b320-433b-8491-311927c75fe2/EMPFILE1.csv;'
Your command looks like the one below. It won't work, because spark-submit options such as --files have to come before PysparkMainModulenew.py. So, this
spark-submit --master yarn --deploy-mode client PysparkMainModulenew.py --files /u/user/vikrant/testdata/EMPFILE1.csv
should be:
spark-submit --master yarn --deploy-mode client --files /u/user/vikrant/testdata/EMPFILE1.csv PysparkMainModulenew.py
Also, there is no need to use addFile in that case. You can copy both PysparkMainModulenew.py and EMPFILE1.csv to the same folder; the script path should still come after the --files option:
spark-submit --master yarn --deploy-mode client --files /u/user/vikrant/testdata/EMPFILE1.csv /u/user/vikrant/testdata/PysparkMainModulenew.py
Alternatively, you can use the --py-files option too.
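The ordering rule above (Spark options first, then the script, then the script's own arguments) is easy to get wrong when a command is assembled by hand. The sketch below builds the argument list in a fixed order; the function build_submit_command and its parameters are hypothetical and cover only the options that appear in this question.

```python
def build_submit_command(script, files=(), py_files=(), app_args=(),
                         master="yarn", deploy_mode="client"):
    """Return a spark-submit argument list with every Spark option placed before the script."""
    command = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    if files:
        # --files takes a comma-separated list.
        command += ["--files", ",".join(files)]
    if py_files:
        command += ["--py-files", ",".join(py_files)]
    # Everything after the script path is passed to the script itself.
    command.append(script)
    command += list(app_args)
    return command


command = build_submit_command(
    "PysparkMainModulenew.py",
    files=["/u/user/vikrant/testdata/EMPFILE1.csv"],
)
print(" ".join(command))
```

The resulting list can be passed to subprocess.run as is, which also avoids shell quoting problems with paths that contain spaces.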
You can read a local file only in "local" mode. If you want to read a local file in "yarn" mode, then that file has to be present on all data nodes, so that when a container is started on any data node, the file is available to the container on that node.
IMHO it's always better to mention the technology stack version(s) and the Hadoop distribution you are using in order to get swift help.
Your default path might be the HDFS home path, so to read a file from the local machine you have to add file:// to the path.
df=spark.read.format("csv").option("header","false").load("file:///home/inputfilename")
df= sqlContext.read.format("csv").option("header","true").load("file:///home/inputfilename")
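The error message in the question shows what happens to a path without a scheme: it was resolved against the default file system and became hdfs://hdd2cluster/h/tmp/.... A small helper can make the intent explicit before the path is handed to spark.read. This is a sketch assuming POSIX-style paths; the name as_local_uri is hypothetical.

```python
import os
from urllib.parse import urlparse


def as_local_uri(path):
    """Prefix a scheme-less path with file:// and leave paths that already have a scheme untouched."""
    if urlparse(path).scheme:
        return path
    # abspath makes relative paths unambiguous before the scheme is added.
    return "file://" + os.path.abspath(path)


print(as_local_uri("/home/inputfilename"))
print(as_local_uri("hdfs://hdd2cluster/h/tmp/EMPFILE1.csv"))
```

The scheme only tells Spark which file system to look at; as noted above, the file still has to exist at that path on every node that reads it.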

--files option in pyspark not working

I tried sc.addFile option (working without any issues) and --files option from the command line (failed).
Run 1 : spark_distro.py
from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles
def import_my_special_package(x):
    from external_package import external
    ext = external()
    return ext.fun(x)
conf = SparkConf().setAppName("Using External Library")
sc = SparkContext(conf=conf)
sc.addFile("/local-path/readme.txt")
with open(SparkFiles.get('readme.txt')) as test_file:
    lines = [line.strip() for line in test_file]
print(lines)
int_rdd = sc.parallelize([1, 2, 4, 3])
mod_rdd = sorted(int_rdd.filter(lambda z: z%2 == 1).map(lambda x:import_my_special_package(x)))
external package: external_package.py
class external(object):
    def __init__(self):
        pass
    def fun(self,input):
        return input*2
readme.txt
MY TEXT HERE
spark-submit command
spark-submit \
--master yarn-client \
--py-files /path to local codelib/external_package.py \
/local-pgm-path/spark_distro.py \
1000
Output: Working as expected
['MY TEXT HERE']
But if I try to pass the file (readme.txt) from the command line using the --files option (instead of sc.addFile), it fails, as shown below.
Run 2 : spark_distro.py
from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles
def import_my_special_package(x):
    from external_package import external
    ext = external()
    return ext.fun(x)
conf = SparkConf().setAppName("Using External Library")
sc = SparkContext(conf=conf)
with open(SparkFiles.get('readme.txt')) as test_file:
    lines = [line.strip() for line in test_file]
print(lines)
int_rdd = sc.parallelize([1, 2, 4, 3])
mod_rdd = sorted(int_rdd.filter(lambda z: z%2 == 1).map(lambda x: import_my_special_package(x)))
external_package.py Same as above
spark submit
spark-submit \
--master yarn-client \
--py-files /path to local codelib/external_package.py \
--files /local-path/readme.txt#readme.txt \
/local-pgm-path/spark_distro.py \
1000
Output:
Traceback (most recent call last):
File "/local-pgm-path/spark_distro.py", line 31, in <module>
with open(SparkFiles.get('readme.txt')) as test_file:
IOError: [Errno 2] No such file or directory: u'/tmp/spark-42dff0d7-c52f-46a8-8323-08bccb412cd6/userFiles-8bd16297-1291-4a37-b080-bbc3836cb512/readme.txt'
Are sc.addFile and --files used for the same purpose? Can someone please share their thoughts?
I have finally figured out the issue, and it is a very subtle one indeed.
As suspected, the two options (sc.addFile and --files) are not equivalent, and this is (admittedly very subtly) hinted at in the documentation (compare "on every node" with "of each executor"):
addFile(path, recursive=False)
Add a file to be downloaded with this Spark job on every node.
--files FILES
Comma-separated list of files to be placed in the working
directory of each executor.
In plain English, while files added with sc.addFile are available to both the executors and the driver, files added with --files are available only to the executors; hence, when trying to access them from the driver (as is the case in the OP), we get a No such file or directory error.
Let's confirm this (getting rid of all the irrelevant --py-files and 1000 stuff in the OP):
test_fail.py:
from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles
conf = SparkConf().setAppName("Use External File")
sc = SparkContext(conf=conf)
with open(SparkFiles.get('readme.txt')) as test_file:
    lines = [line.strip() for line in test_file]
print(lines)
Test:
spark-submit --master yarn \
--deploy-mode client \
--files /home/ctsats/readme.txt \
/home/ctsats/scripts/SO/test_fail.py
Result:
[...]
17/11/10 15:05:39 INFO yarn.Client: Uploading resource file:/home/ctsats/readme.txt -> hdfs://host-hd-01.corp.nodalpoint.com:8020/user/ctsats/.sparkStaging/application_1507295423401_0047/readme.txt
[...]
Traceback (most recent call last):
File "/home/ctsats/scripts/SO/test_fail.py", line 6, in <module>
with open(SparkFiles.get('readme.txt')) as test_file:
IOError: [Errno 2] No such file or directory: u'/tmp/spark-8715b4d9-a23b-4002-a1f0-63a1e9d3e00e/userFiles-60053a41-472e-4844-a587-6d10ed769e1a/readme.txt'
In the above script test_fail.py, it is the driver program that requests access to the file readme.txt; let's change the script, so that access is requested for the executors (test_success.py):
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("Use External File")
sc = SparkContext(conf=conf)
lines = sc.textFile("readme.txt") # run in the executors
print(lines.collect())
Test:
spark-submit --master yarn \
--deploy-mode client \
--files /home/ctsats/readme.txt \
/home/ctsats/scripts/SO/test_success.py
Result:
[...]
17/11/10 15:16:05 INFO yarn.Client: Uploading resource file:/home/ctsats/readme.txt -> hdfs://host-hd-01.corp.nodalpoint.com:8020/user/ctsats/.sparkStaging/application_1507295423401_0049/readme.txt
[...]
[u'MY TEXT HERE']
Notice also that here we don't need SparkFiles.get - the file is readily accessible.
As said above, sc.addFile will work in both cases, i.e. when access is requested either by the driver or by the executors (tested but not shown here).
Regarding the order of the command line options: as I have argued elsewhere, all Spark-related arguments must be before the script to be executed; arguably, the relative order of --files and --py-files is irrelevant (leaving it as an exercise).
Tested with both Spark 1.6.0 & 2.2.0.
UPDATE (after the comments): Seems that my fs.defaultFS setting points to HDFS, too:
$ hdfs getconf -confKey fs.defaultFS
hdfs://host-hd-01.corp.nodalpoint.com:8020
But let me focus on the forest here (instead of the trees, that is), and explain why this whole discussion is of academic interest only:
Passing files to be processed with the --files flag is bad practice; in hindsight, I can now see why I could find almost no use references online - probably nobody uses it in practice, and with good reason.
(Notice that I am not talking about --py-files, which serves a different, legitimate role.)
Since Spark is a distributed processing framework, running over a cluster and a distributed file system (HDFS), the best thing to do is to have all files to be processed in HDFS already - period. The "natural" place for files to be processed by Spark is HDFS, not the local FS - although there are some toy examples using the local FS for demonstration purposes only. What's more, if at some point in the future you want to change the deploy mode to cluster, you'll discover that the cluster, by default, knows nothing of local paths and files, and rightfully so...
