--files option in pyspark not working - apache-spark

I tried the sc.addFile option (which works without any issues) and the --files option from the command line (which fails).
Run 1 : spark_distro.py
from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles

def import_my_special_package(x):
    from external_package import external
    ext = external()
    return ext.fun(x)

conf = SparkConf().setAppName("Using External Library")
sc = SparkContext(conf=conf)
sc.addFile("/local-path/readme.txt")

with open(SparkFiles.get('readme.txt')) as test_file:
    lines = [line.strip() for line in test_file]
print(lines)

int_rdd = sc.parallelize([1, 2, 4, 3])
mod_rdd = sorted(int_rdd.filter(lambda z: z % 2 == 1).map(lambda x: import_my_special_package(x)))
external package: external_package.py
class external(object):
    def __init__(self):
        pass

    def fun(self, input):
        return input * 2
readme.txt
MY TEXT HERE
spark-submit command
spark-submit \
--master yarn-client \
--py-files /path to local codelib/external_package.py \
/local-pgm-path/spark_distro.py \
1000
Output: Working as expected
['MY TEXT HERE']
But if I try to pass the file (readme.txt) from the command line using the --files option (instead of sc.addFile), it fails, as shown below.
Run 2 : spark_distro.py
from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles

def import_my_special_package(x):
    from external_package import external
    ext = external()
    return ext.fun(x)

conf = SparkConf().setAppName("Using External Library")
sc = SparkContext(conf=conf)

with open(SparkFiles.get('readme.txt')) as test_file:
    lines = [line.strip() for line in test_file]
print(lines)

int_rdd = sc.parallelize([1, 2, 4, 3])
mod_rdd = sorted(int_rdd.filter(lambda z: z % 2 == 1).map(lambda x: import_my_special_package(x)))
external_package.py: same as above
spark-submit command
spark-submit \
--master yarn-client \
--py-files /path to local codelib/external_package.py \
--files /local-path/readme.txt#readme.txt \
/local-pgm-path/spark_distro.py \
1000
Output:
Traceback (most recent call last):
File "/local-pgm-path/spark_distro.py", line 31, in <module>
with open(SparkFiles.get('readme.txt')) as test_file:
IOError: [Errno 2] No such file or directory: u'/tmp/spark-42dff0d7-c52f-46a8-8323-08bccb412cd6/userFiles-8bd16297-1291-4a37-b080-bbc3836cb512/readme.txt'
Are sc.addFile and --files used for the same purpose? Can someone please share your thoughts?

I have finally figured out the issue, and it is a very subtle one indeed.
As suspected, the two options (sc.addFile and --files) are not equivalent, and this is (admittedly very subtly) hinted at in the documentation (emphasis added):
addFile(path, recursive=False)
Add a file to be downloaded with this Spark job on every node.
--files FILES
Comma-separated list of files to be placed in the working
directory of each executor.
In plain English, while files added with sc.addFile are available to both the executors and the driver, files added with --files are available only to the executors; hence, when trying to access them from the driver (as is the case in the OP), we get a No such file or directory error.
Let's confirm this (getting rid of all the irrelevant --py-files and 1000 stuff in the OP):
test_fail.py:
from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles

conf = SparkConf().setAppName("Use External File")
sc = SparkContext(conf=conf)

with open(SparkFiles.get('readme.txt')) as test_file:
    lines = [line.strip() for line in test_file]
print(lines)
Test:
spark-submit --master yarn \
--deploy-mode client \
--files /home/ctsats/readme.txt \
/home/ctsats/scripts/SO/test_fail.py
Result:
[...]
17/11/10 15:05:39 INFO yarn.Client: Uploading resource file:/home/ctsats/readme.txt -> hdfs://host-hd-01.corp.nodalpoint.com:8020/user/ctsats/.sparkStaging/application_1507295423401_0047/readme.txt
[...]
Traceback (most recent call last):
File "/home/ctsats/scripts/SO/test_fail.py", line 6, in <module>
with open(SparkFiles.get('readme.txt')) as test_file:
IOError: [Errno 2] No such file or directory: u'/tmp/spark-8715b4d9-a23b-4002-a1f0-63a1e9d3e00e/userFiles-60053a41-472e-4844-a587-6d10ed769e1a/readme.txt'
In the above script test_fail.py, it is the driver program that requests access to the file readme.txt; let's change the script, so that access is requested for the executors (test_success.py):
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("Use External File")
sc = SparkContext(conf=conf)
lines = sc.textFile("readme.txt") # run in the executors
print(lines.collect())
Test:
spark-submit --master yarn \
--deploy-mode client \
--files /home/ctsats/readme.txt \
/home/ctsats/scripts/SO/test_success.py
Result:
[...]
17/11/10 15:16:05 INFO yarn.Client: Uploading resource file:/home/ctsats/readme.txt -> hdfs://host-hd-01.corp.nodalpoint.com:8020/user/ctsats/.sparkStaging/application_1507295423401_0049/readme.txt
[...]
[u'MY TEXT HERE']
Notice also that here we don't need SparkFiles.get - the file is readily accessible.
As said above, sc.addFile will work in both cases, i.e. when access is requested either by the driver or by the executors (tested but not shown here).
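For reference, here is a minimal sketch of that case (the script name and app name are just illustrative): with sc.addFile, the same file can be opened both in the driver and inside executor code via SparkFiles.get.
test_addfile.py:
from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles

conf = SparkConf().setAppName("addFile Demo")
sc = SparkContext(conf=conf)
sc.addFile("/home/ctsats/readme.txt")

# driver side - works, as in Run 1 of the OP
with open(SparkFiles.get('readme.txt')) as test_file:
    print([line.strip() for line in test_file])

# executor side - SparkFiles.get resolves to the copy shipped to each executor
def read_first_line(_):
    with open(SparkFiles.get('readme.txt')) as f:
        return f.readline().strip()

print(sc.parallelize([1, 2, 3]).map(read_first_line).collect())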
Regarding the order of the command line options: as I have argued elsewhere, all Spark-related arguments must be before the script to be executed; arguably, the relative order of --files and --py-files is irrelevant (leaving it as an exercise).
Tested with both Spark 1.6.0 & 2.2.0.
UPDATE (after the comments): Seems that my fs.defaultFS setting points to HDFS, too:
$ hdfs getconf -confKey fs.defaultFS
hdfs://host-hd-01.corp.nodalpoint.com:8020
But let me focus on the forest here (instead of the trees, that is), and explain why this whole discussion is of academic interest only:
Passing files to be processed with the --files flag is bad practice; in hindsight, I can now see why I could find almost no usage references online - probably nobody uses it in practice, and with good reason.
(Notice that I am not talking for --py-files, which serves a different, legitimate role.)
Since Spark is a distributed processing framework, running over a cluster and a distributed file system (HDFS), the best thing to do is to have all the files to be processed already in HDFS - period. The "natural" place for files to be processed by Spark is HDFS, not the local FS - although there are some toy examples using the local FS for demonstration purposes only. What's more, if at some point in the future you want to change the deploy mode to cluster, you'll discover that the cluster, by default, knows nothing of local paths and files, and rightfully so...
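For completeness, a minimal sketch of that recommended setup (it assumes readme.txt has already been uploaded to HDFS; the HDFS path below is hypothetical):
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("Use File From HDFS")
sc = SparkContext(conf=conf)

# no --files and no SparkFiles.get needed: the executors read straight from HDFS
lines = sc.textFile("hdfs:///user/ctsats/readme.txt")
print(lines.collect())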

Related

Pyspark version 3.x, repartition not working as expected for large JSON data

We have a Hadoop cluster of two nodes with around 40 cores and 80 GB RAM. We simply have to ingest a large multiline JSON file into an Elasticsearch (ES) cluster. The size of the JSON is 120 GB; after bz2 compression it is reduced to only 2 GB. We have set up the following code for indexing the data into ES:
....
def start_job():
    warehouse_location = abspath('spark-warehouse')

    # Create a spark session
    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL Hive integration example") \
        .config("spark.sql.warehouse.dir", warehouse_location) \
        .enableHiveSupport() \
        .getOrCreate()

    # Configurations
    spark.conf.set("spark.sql.caseSensitive", "true")

    df = spark.read.option("multiline", "true").json(data_path)
    df = df.repartition(20)

    # Transformations
    df = df.drop("_id")

    df.write.format(
        'org.elasticsearch.spark.sql'
    ).option(
        'es.nodes', ES_Nodes
    ).option(
        'es.port', ES_PORT
    ).option(
        'es.resource', ES_RESOURCE,
    ).save()

if __name__ == '__main__':
    # ES Setting
    ES_Nodes = "hadoop-master"
    ES_PORT = 9200
    ES_RESOURCE = "myIndex/type"

    # Data absolute path
    data_path = "/dss_data/mydata.bz2"

    start_job()
    print("Job has been finished")
The problem is that only one executor is running, as the total number of tasks is one. I was expecting there to be 20 tasks, since I repartitioned the data to 20. The Spark UI image is given below. Where is the problem? I am running the following command to run the job on the cluster:
spark-submit --class org.apache.spark.examples.SparkPi --jars elasticsearch-spark-30_2.12-7.14.1.jar --master yarn --deploy-mode cluster --driver-memory 10g --executor-memory 4g --num-executors 20 --executor-cores 2 myscript.py
We are using Hadoop and Spark version 3.x.
Further, we are also getting the following trace in the Hadoop logs:
df.write.format(
File "/usr/local/leads/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 1107, in save
File "/usr/local/leads/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
File "/usr/local/leads/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
pyspark.sql.utils.AnalysisException: Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
(named _corrupt_record by default). For example:
spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).json(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).json(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().

Not able to access the local file in pyspark

I am trying to read a local file in client mode on the YARN framework. I was not able to access the local file in client mode either.
import os
import pyspark.sql.functions as F
from os import listdir, path
from pyspark import SparkConf, SparkContext
import argparse
from pyspark import SparkFiles
from pyspark.sql import SparkSession

def main():
    spark = SparkSession \
        .builder \
        .appName("Spark File load example") \
        .config("spark.jars", "/u/user/someuser/sqljdbc4.jar") \
        .config("spark.dynamicAllocation.enabled", "true") \
        .config("spark.shuffle.service.enabled", "true") \
        .config("hive.exec.dynamic.partition", "true") \
        .config("hive.exec.dynamic.partition.mode", "nonstrict") \
        .config("spark.sql.shuffle.partitions", "50") \
        .config("hive.metastore.uris", "thrift://******.hpc.****.com:9083") \
        .enableHiveSupport() \
        .getOrCreate()
    spark.sparkContext.addFile("/u/user/vikrant/testdata/EMPFILE1.csv")
    inputfilename = getinputfile(spark)
    print("input file path is:", inputfilename)
    data = processfiledata(spark, inputfilename)
    data.show()
    spark.stop()

def getinputfile(spark):
    spark_files_dir = SparkFiles.getRootDirectory()
    print("spark_files_dir:", spark_files_dir)
    inputfile = [filename
                 for filename in listdir(spark_files_dir)
                 if filename.endswith('EMPFILE1.csv')]
    if len(inputfile) != 0:
        path_to_input_file = path.join(spark_files_dir, inputfile[0])
    else:
        print("file path not found", path_to_input_file)
    print("inputfile name:", inputfile)
    return path_to_input_file

def processfiledata(spark, inputfilename):
    dataframe = spark.read.format("csv").option("header", "false").load(inputfilename)
    return dataframe

if __name__ == "__main__":
    main()
Below is my shell script:
spark-submit --master yarn --deploy-mode client PysparkMainModulenew.py --files /u/user/vikrant/testdata/EMPFILE1.csv
Below is the error message:
('spark_files_dir:', u'/h/tmp/spark-76bdbd48-cbb4-4e8f-971a-383b899f79b0/userFiles-ee6dcdec-b320-433b-8491-311927c75fe2')
('inputfile name:', [u'EMPFILE1.csv'])
('input file path is:', u'/h/tmp/spark-76bdbd48-cbb4-4e8f-971a-383b899f79b0/userFiles-ee6dcdec-b320-433b-8491-311927c75fe2/EMPFILE1.csv')
Traceback (most recent call last):
  File "/u/user/vikrant/testdata/PysparkMainModulenew.py", line 57, in <module>
    main()
  File "/u/user/vikrant/testdata/PysparkMainModulenew.py", line 31, in main
    data = processfiledata(spark, inputfilename)
  File "/u/user/vikrant/testdata/PysparkMainModulenew.py", line 53, in processfiledata
    dataframe = spark.read.format("csv").option("header","false").load(inputfilename)
  File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 166, in load
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
  File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: u'Path does not exist: hdfs://hdd2cluster/h/tmp/spark-76bdbd48-cbb4-4e8f-971a-383b899f79b0/userFiles-ee6dcdec-b320-433b-8491-311927c75fe2/EMPFILE1.csv;'
You have something like this. This won't work, because you need to put PysparkMainModulenew.py after the --files option. So this:
spark-submit --master yarn --deploy-mode client PysparkMainModulenew.py --files /u/user/vikrant/testdata/EMPFILE1.csv
Should be,
spark-submit --master yarn --deploy-mode client --files /u/user/vikrant/testdata/EMPFILE1.csv PysparkMainModulenew.py
Also, there is no need to use addFile in that case. You can copy both PysparkMainModulenew.py and EMPFILE1.csv to the same folder; everything should come after the --files option.
spark-submit --master yarn --deploy-mode client --files /u/user/vikrant/testdata/EMPFILE1.csv /u/user/vikrant/testdata/PysparkMainModulenew.py
Alternatively, you can use --py-files option too.
You can read a local file only in "local" mode. If you want to read a local file in "yarn" mode, then that file has to be present on all data nodes, so that when a container gets initiated on any data node the file is available to the container on that node.
IMHO it's always better to mention the technology stack version(s) and the Hadoop distribution you are using in order to get swift help.
Your default path is probably the HDFS home path, so to read a file from the local machine you have to add the file:// prefix to the path.
df = spark.read.format("csv").option("header", "false").load("file:///home/inputfilename")
df = sqlContext.read.format("csv").option("header", "true").load("file:///home/inputfilename")

spark (pyspark) UnicodeEncodeError on yarn cluster mode

I'm running some simple pyspark code that tries to print a "shrug" at the end. I'm running pyspark with YARN, and when I run my code in client mode everything works properly. When I run my code in cluster mode (--deploy-mode cluster), I get a UnicodeEncodeError.
Obviously I cannot print the "shrug", but I feel like I'm missing something major here. Should I be worried that my code isn't handling Unicode properly?
Attempted solution:
I've set PYTHONIOENCODING=utf8 and PYSPARK_PYTHON=python3.
Code:
#!/usr/bin/env python3
# Spark Imports
from pyspark import SparkContext
from pyspark.sql import SparkSession
...
if __name__ == '__main__':
    sc = SparkContext()
    session = SparkSession\
        .builder\
        .appName("myApp")\
        .getOrCreate()
    ...
    print("¯\_(ツ)_/¯")
No error: ./bin/spark-submit --master yarn myApp.py
Throws error: ./bin/spark-submit --master yarn --deploy-mode cluster myApp.py
Error:
Traceback (most recent call last):
File "myApp.py", line 435, in <module>
print("\xaf\_(\u30c4)_/\xaf")
UnicodeEncodeError: 'ascii' codec can't encode character '\xaf' in position 46: ordinal not in range(128)
Settings:
I have the following in my .zshrc:
alias sparkclient="/opt/mapr/spark/spark-2.0.1/bin/spark-submit --master yarn"
alias sparkcluster="/opt/mapr/spark/spark-2.0.1/bin/spark-submit --master yarn --deploy-mode cluster --driver-memory 6g"
export SPARK_HOME=/opt/mapr/spark/spark-2.0.1
export PYSPARK_PYTHON=python3
export PYTHONHASHSEED=0
export PYTHONIOENCODING=utf8
export SPARK_YARN_USER_ENV=PYTHONHASHSEED=0

Spark output when submitting via Yarn cluster vs. client

I am new to Spark and just got it running on my cluster (Spark 2.0.1 on a 9-node cluster running the Community version of MapR). I submit the wordcount example via
./bin/spark-submit --master yarn --jars ~/hadoopPERMA/jars/hadoop-lzo-0.4.21-SNAPSHOT.jar examples/src/main/python/wordcount.py ./README.md
and get the following output
17/04/07 13:21:34 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
: 68
help: 1
when: 1
Hadoop: 3
...
Looks like everything is working properly. When I add --deploy-mode cluster I get the following output:
17/04/07 13:23:52 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
So no errors but I am not seeing the wordcount results. What am I missing? I see the job in my History Server and it says it completed successfully. Also I checked my user directory in DFS but no new files were written except for this empty directory: /user/myuser/.sparkStaging
Code (wordcount.py example shipped with Spark):
from __future__ import print_function
import sys
from operator import add
from pyspark.sql import SparkSession

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: wordcount <file>", file=sys.stderr)
        exit(-1)

    spark = SparkSession\
        .builder\
        .appName("PythonWordCount")\
        .getOrCreate()

    lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add)
    output = counts.collect()
    for (word, count) in output:
        print("%s: %i" % (word, count))

    spark.stop()
The reason your output is not printing is:
When you run in spark client mode, the node from which you initiate the job is the DRIVER, and when you collect the result it is collected on that node and printed there.
In yarn-cluster mode your driver is some other node, not the one from which you initiated the job. So when you call .collect, the result is collected and printed on that other node; you can find it in the stdout of the driver.
A better approach would be to write the output somewhere in HDFS.
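For instance, a minimal sketch of that approach (the HDFS output path below is hypothetical, and the directory must not already exist): replace the collect-and-print loop of wordcount.py with
# write the counts directly to HDFS instead of collecting them on the (remote) driver
counts.saveAsTextFile("hdfs:///user/myuser/wordcount_output")
The results can then be inspected with hdfs dfs -cat regardless of where the driver ran.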
The reason for your spark.yarn.jars warning is:
In order to run a Spark job, YARN needs certain binaries available on all the nodes of the cluster. If these binaries are not available, then as part of job preparation Spark will create a zip file with all jars under $SPARK_HOME/jars and upload it to the distributed cache.
To solve this:
By default, Spark on YARN will use Spark jars installed locally, but the Spark jars can also be in a world-readable(chmod 777) location on HDFS. This allows YARN to cache it on nodes so that it doesn't need to be distributed each time an application runs. To point to jars on HDFS, for example, set spark.yarn.jars to hdfs:///some/path.
After placing your jars, run your code with the --conf option placed before the application script (Spark-related arguments must precede the script):
./bin/spark-submit --master yarn --conf spark.yarn.jars="hdfs:///some/path" --jars ~/hadoopPERMA/jars/hadoop-lzo-0.4.21-SNAPSHOT.jar examples/src/main/python/wordcount.py ./README.md
Source : http://spark.apache.org/docs/latest/running-on-yarn.html

Spark SQL RDD loads in pyspark but not in spark-submit: "JDBCRDD: closed connection"

I have the following simple code for loading a table from my Postgres database into an RDD.
# this setup is just for spark-submit, will be ignored in pyspark
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
conf = SparkConf().setAppName("GA")#.setMaster("localhost")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
# func for loading table
def get_db_rdd(table):
url = "jdbc:postgresql://localhost:5432/harvest?user=postgres"
print(url)
lower = 0
upper = 1000
ret = sqlContext \
.read \
.format("jdbc") \
.option("url", url) \
.option("dbtable", table) \
.option("partitionColumn", "id") \
.option("numPartitions", 1024) \
.option("lowerBound", lower) \
.option("upperBound", upper) \
.option("password", "password") \
.load()
ret = ret.rdd
return ret
# load table, and print results
print(get_db_rdd("mytable").collect())
I run ./bin/pyspark then paste that into the interpreter, and it prints out the data from my table as expected.
Now, if I save that code to a file named test.py then do ./bin/spark-submit test.py, it starts to run, but then I see these messages spam my console forever:
17/02/16 02:24:21 INFO Executor: Running task 45.0 in stage 0.0 (TID 45)
17/02/16 02:24:21 INFO JDBCRDD: closed connection
17/02/16 02:24:21 INFO Executor: Finished task 45.0 in stage 0.0 (TID 45). 1673 bytes result sent to driver
Edit: This is on a single machine. I haven't started any masters or slaves; spark-submit is the only command I run after system start. I tried with the master/slave setup with the same results.
My spark-env.sh file looks like this:
export SPARK_WORKER_INSTANCES=2
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=800m
export SPARK_EXECUTOR_MEMORY=800m
export SPARK_EXECUTOR_CORES=2
export SPARK_CLASSPATH=/home/ubuntu/spark/pg_driver.jar # Postgres driver I need for SQLContext
export PYTHONHASHSEED=1337 # have to make workers use same seed in Python3
It works if I spark-submit a Python file that just creates an RDD from a list or something. I only have problems when I try to use a JDBC RDD. What piece am I missing?
When using spark-submit you should supply the jar to the executors.
As mentioned in the Spark 2.1 JDBC documentation:
To get started you will need to include the JDBC driver for your particular database on the spark classpath. For example, to connect to postgres from the Spark Shell you would run the following command:
bin/spark-shell --driver-class-path postgresql-9.4.1207.jar --jars postgresql-9.4.1207.jar
Note: the same should apply to the spark-submit command.
Troubleshooting
The JDBC driver class must be visible to the primordial class loader
on the client session and on all executors. This is because Java’s
DriverManager class does a security check that results in it ignoring
all drivers not visible to the primordial class loader when one goes
to open a connection. One convenient way to do this is to modify
compute_classpath.sh on all worker nodes to include your driver JARs.
This is a horrible hack. I'm not considering this the answer, but it does work.
Alright, only pyspark works? Fine, then we'll use it. I wrote this Bash script:
cat $1 | $SPARK_HOME/bin/pyspark # pipe the Python file into pyspark
I run that script from my Python script that submits jobs. Also, I'm including the code I use to pass arguments between the processes, in case it helps someone:
new_env = os.environ.copy()
new_env["pyspark_argument_1"] = "some param I need in my Spark script" # etc...
p = subprocess.Popen(["pyspark_wrapper.sh {}".format(py_fname)], shell=True, env=new_env)
In my Spark script:
something_passed_from_submitter = os.environ["pyspark_argument_1"]
# do stuff in Spark...
I feel like Spark is better supported and (if this is a bug) less buggy with Scala than with Python 3, so that might be the better solution for now. But my script uses some files we wrote in Python 3, so...
