I use Spark in YARN mode, and I have a problem when I run
pyspark --master yarn
under Python 3.5. When I run code like this:
user_data = sc.textFile("/testdata/u.user")
user_fields = user_data.map(lambda line: line.split("|"))
num_genders = user_fields.map(lambda fields: fields[2]).distinct().count()
the result shows:
File "/data/opt/spark-2.1.0-bin-hadoop2.6/python/pyspark/rdd.py", line 1753, in add_shuffle_key
File "/data/opt/hadoop-2.6.0/tmp/nm-local-dir/usercache/jsdxadm/appcache/application_1494985561557_0005/container_1494985561557_0005_01_000002/pyspark.zip/pyspark/rdd.py", line 74, in portable_hash
raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED environ=")
I tried but could not resolve it. Can you help me?

Include spark.executorEnv.PYTHONHASHSEED 0 in your spark-defaults.conf (in your Spark ./conf directory). That should work!
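For background: PySpark raises this exception because Python 3 randomizes string hashes per interpreter process unless PYTHONHASHSEED is set, so the driver and the executors could otherwise disagree on which partition a key belongs to. The following is a minimal sketch using only the standard library (no Spark involved; the helper name is made up for illustration). It shows a fixed seed giving the same string hash in separate interpreter processes:

```python
import os
import subprocess
import sys


def string_hash_in_new_interpreter(text, seed):
    """Return hash(text) as computed by a fresh Python process.

    Hypothetical helper for illustration: it starts a new interpreter
    with PYTHONHASHSEED set to `seed` and reads back the hash value.
    """
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    output = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
    )
    return int(output.decode().strip())


# Two separate processes, same fixed seed: the hashes agree.
first = string_hash_in_new_interpreter("M", 0)
second = string_hash_in_new_interpreter("M", 0)
print(first == second)
```

With the seed left unset (or set to "random"), the two values would usually differ, and that is the situation the exception guards against.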

This is a problem in Spark 2.1 that is resolved in 2.2. If you are not able to upgrade or do not have access to spark-defaults.conf, you can use
before you submit your job.


PySpark config through airflow

I'm trying to pass the package org.apache.spark:spark-avro_2.12:2.4.3 through SparkSubmitOperator in the conf, as described here: https://spark.apache.org/docs/2.4.3/sql-data-sources-avro.html, because I want to use Spark to read Avro files.
This is what I did in my Airflow DAG, but it didn't work. Could someone please point out what I did wrong? Many thanks.
conf = Variable.get("spark_conf", deserialize_json = True)
conf_sp = conf.update({"spark.jars.packages":"org.apache.spark:spark-avro_2.12:2.4.3"})
op = SparkSubmitOperator(
    application = "my_app",
    conf = conf_sp
)
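One thing worth checking in the snippet above, independent of how the package is passed: dict.update() modifies the dictionary in place and returns None, so conf_sp would be None when it reaches the operator. A small sketch of the difference, where base_conf and its contents are an assumed stand-in for the Airflow Variable:

```python
# Assumed stand-in for Variable.get("spark_conf", deserialize_json=True).
base_conf = {"spark.executor.memory": "2g"}

# dict.update() mutates its target and returns None.
result = dict(base_conf).update(
    {"spark.jars.packages": "org.apache.spark:spark-avro_2.12:2.4.3"}
)
print(result)  # None

# Building a new dict keeps the merged settings in a usable variable.
conf_sp = {
    **base_conf,
    "spark.jars.packages": "org.apache.spark:spark-avro_2.12:2.4.3",
}
print(conf_sp)
```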
The SparkSubmitOperator relies on the SparkSubmitHook, which ultimately composes a spark-submit CLI command to be executed.
In the CLI form, you need to declare package dependencies with the packages option, not in the configuration option, so that they can be fetched from Maven.
op = SparkSubmitOperator(
    application = "my_app",
    packages = "org.apache.spark:spark-avro_2.12:2.4.3"
)
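To make the distinction concrete, here is a rough sketch of how such a command line might be assembled. This is not the actual SparkSubmitHook code: build_spark_submit_command is a hypothetical helper and the conf entry is an assumed example value. Only the --conf and --packages flags of spark-submit are taken as given.

```python
def build_spark_submit_command(application, packages=None, conf=None):
    """Hypothetical helper: assemble a spark-submit argument list."""
    command = ["spark-submit"]
    # Each configuration entry becomes its own --conf key=value pair.
    for key, value in sorted((conf or {}).items()):
        command += ["--conf", "{}={}".format(key, value)]
    # Maven coordinates go through the dedicated --packages flag.
    if packages:
        command += ["--packages", packages]
    command.append(application)
    return command


command = build_spark_submit_command(
    "my_app",
    packages="org.apache.spark:spark-avro_2.12:2.4.3",
    conf={"spark.executor.memory": "2g"},  # assumed example setting
)
print(" ".join(command))
```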

How to get the SparkSession to find added python files

After running pip install BigDL==0.8.0, running from bigdl.util.common import * in Python completed without issue.
However, with either of the following SparkSessions:
spark = (SparkSession.builder.master('yarn')
    .config("spark.jars", "/BigDL/spark/dl/target/bigdl-0.8.0-jar-with-dependencies-and-spark.jar")
    .config('spark.submit.pyFiles', '/BigDL/pyspark/bigdl/util.zip')
    .getOrCreate())
spark = (SparkSession.builder.master('local')
    .config("spark.jars", "/BigDL/spark/dl/target/bigdl-0.8.0-jar-with-dependencies-and-spark.jar")
    .config('spark.submit.pyFiles', '/BigDL/pyspark/bigdl/util.zip')
    .getOrCreate())
I get the following error:
ImportError: ('No module named bigdl.util.common', <function subimport at 0x7fd442a36aa0>, ('bigdl.util.common',))
In addition to the 'spark.submit.pyFiles' config above, after the SparkSession successfully starts, I have tried spark.sparkContext.addPyFile("util.zip"), where "util.zip" contains all of the Python files in https://github.com/intel-analytics/BigDL/tree/master/pyspark/bigdl/util .
I have also zipped all of the contents of this folder https://github.com/intel-analytics/BigDL/tree/master/pyspark/bigdl (branch-0.8) and pointed to that file with .config('spark.submit.pyFiles', '/path/to/bigdl.zip'), but this does not work either.
How do I get the SparkSession to see these files?
Figured it out. The only thing that worked was calling spark.sparkContext.addPyFile("bigdl.zip") after the SparkSession has started, where "bigdl.zip" contains all of the files in https://github.com/intel-analytics/BigDL/tree/master/pyspark/bigdl (branch-0.8).
I am not sure why .config('spark.submit.pyFiles', 'bigdl.zip') would not work.
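One possible contributing factor, offered as a guess rather than a confirmed diagnosis: a zip placed on the Python path is only importable as bigdl.util.common if the archive contains the package root (bigdl/util/common.py), not just the contents of util. The sketch below uses only the standard library and a made-up package name, demo_pkg, to show the difference between the two archive layouts:

```python
import importlib
import os
import sys
import tempfile
import zipfile


def make_zip(zip_path, files):
    """Write a zip whose entries are given as {archive_name: source_text}."""
    with zipfile.ZipFile(zip_path, "w") as archive:
        for archive_name, source in files.items():
            archive.writestr(archive_name, source)


def can_import(zip_path, module_name):
    """Try importing module_name with zip_path temporarily on sys.path."""
    sys.path.insert(0, zip_path)
    try:
        importlib.invalidate_caches()
        importlib.import_module(module_name)
        return True
    except ImportError:
        return False
    finally:
        sys.path.remove(zip_path)
        # Forget anything imported so each check starts clean.
        for name in [n for n in sys.modules if n.split(".")[0] == "demo_pkg"]:
            del sys.modules[name]


workdir = tempfile.mkdtemp()

# Layout 1: only the contents of the util folder, no package root.
flat_zip = os.path.join(workdir, "util_only.zip")
make_zip(flat_zip, {"__init__.py": "", "common.py": "VALUE = 1\n"})

# Layout 2: the full package tree, starting at the top-level package.
rooted_zip = os.path.join(workdir, "demo_pkg.zip")
make_zip(rooted_zip, {
    "demo_pkg/__init__.py": "",
    "demo_pkg/util/__init__.py": "",
    "demo_pkg/util/common.py": "VALUE = 1\n",
})

flat_ok = can_import(flat_zip, "demo_pkg.util.common")
rooted_ok = can_import(rooted_zip, "demo_pkg.util.common")
print(flat_ok, rooted_ok)
```

Listing the archive with zipfile.ZipFile(path).namelist() is a quick way to check which layout a given zip has.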

Command line is too long ... Error when migrating from spark 1.6.1 to 2.3.0

In IntelliJ I have this Gradle config:
ext.sparkVersion = '1.6.1'
ext.scalaVersion = '2.11'
And my script works fine.
But as soon as I change it to:
ext.sparkVersion = '2.3.0'
ext.scalaVersion = '2.11'
I get:
Error running 'SpaceWalk2': Command line is too long. Shorten command line for
SpaceWalk2 or also for Application default configuration
Any ideas?

pySpark local mode - loading text file with file:/// vs relative path

I am just getting started with Spark and I am trying out examples in local mode.
I noticed that in some examples the relative path to the file is used when creating the RDD, and in others the path starts with "file:///". The second option did not work for me at all: "Input path does not exist".
Can anyone explain the difference between using the plain file path and putting "file:///" in front of it?
I am using Spark 2.2 on a Mac, running in local mode.
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("test")
sc = SparkContext(conf = conf)
#This will work providing the relative path
lines = sc.textFile("code/test.csv")
#This will not work
lines = sc.textFile("file:///code/test.csv")
sc.textFile("code/test.csv") means test.csv in /<hive.metastore.warehouse.dir>/code/test.csv on HDFS.
sc.textFile("hdfs:///<hive.metastore.warehouse.dir>/code/test.csv") is equal to above.
sc.textFile("file:///code/test.csv") means test.csv in /code/test.csv on local file system.
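A short standard-library sketch of the URI side of this (no Spark involved, and assuming a POSIX-style filesystem such as macOS or Linux): "file:///code/test.csv" parses to the absolute path /code/test.csv at the filesystem root, so a file that lives in a code folder under the current working directory would need its full absolute path in the URI.

```python
from pathlib import Path
from urllib.parse import urlparse

# The part after "file://" is an absolute path on the local filesystem.
parsed = urlparse("file:///code/test.csv")
print(parsed.scheme, parsed.path)

# A relative path is resolved against the current working directory;
# as_uri() turns the resulting absolute path into a file:// URI.
absolute_uri = Path("code/test.csv").resolve().as_uri()
print(absolute_uri)
```

How Spark itself resolves a path without a scheme depends on the configured default filesystem, which this sketch does not model.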

Pyspark - FileInputDStream: Error finding new files

Hi, I'm new to PySpark and I'm trying out this example from the Spark GitHub repository, which counts words in new text files created in a given directory:
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: hdfs_wordcount.py <directory>", file=sys.stderr)
    sc = SparkContext(appName="PythonStreamingHDFSWordCount")
    ssc = StreamingContext(sc, 1)
    lines = ssc.textFileStream("hdfs:///home/my-logs/")
    counts = lines.flatMap(lambda line: line.split(" "))\
                  .map(lambda x: (x, 1))\
                  .reduceByKey(lambda a, b: a+b)
And this is what I get, a warning saying:
WARN FileInputDStream: Error finding new files
and I get empty results even though I am adding files to this directory.
Any suggestions for a solution?
The issue is that Spark Streaming will not read old files from the directory, and all of your log files existed before your streaming job started.
So what you need to do is start your streaming job first, and then put or copy the input files into the HDFS directory, either manually or with a script.
I think you are referring to this example. Are you able to run it without modification? I see you are hard-coding the directory to "hdfs:///" in the program. You can run the example as shown below.
For example, say Spark is at /opt/spark-2.0.2-bin-hadoop2.7. You can run hdfs_wordcount.py from the examples directory as follows, passing /tmp as the directory argument to the program:
user1@user1:/opt/spark-2.0.2-bin-hadoop2.7$ bin/spark-submit examples/src/main/python/streaming/hdfs_wordcount.py /tmp
Now, while this program is running, open another terminal and copy a file to the /tmp folder:
user1@user1:~$ cp test.txt /tmp
You will see the word count in the first terminal.
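If you want to feed files in with a script instead of copying them by hand, one approach is to write each file somewhere else first and then move it into the watched directory in a single step, so the stream never sees a half-written file. Below is a minimal local-filesystem sketch; the function name, the staging directory, and the file contents are all made up for illustration, and for HDFS you would use the HDFS tools rather than os.replace.

```python
import os
import tempfile


def drop_file_atomically(staging_dir, watched_dir, file_name, text):
    """Write text to staging_dir, then move it into watched_dir.

    Hypothetical helper: os.replace is atomic when both directories
    are on the same filesystem, so the file appears fully written.
    """
    staged_path = os.path.join(staging_dir, file_name)
    with open(staged_path, "w") as handle:
        handle.write(text)
    final_path = os.path.join(watched_dir, file_name)
    os.replace(staged_path, final_path)
    return final_path


# Demo with temporary directories standing in for the real ones.
root = tempfile.mkdtemp()
staging = os.path.join(root, "staging")
watched = os.path.join(root, "watched")
os.makedirs(staging)
os.makedirs(watched)

new_file = drop_file_atomically(staging, watched, "test.txt", "hello spark hello\n")
print(os.listdir(watched))
```

Run something like this only after the streaming job has started, since files already present at startup are not picked up.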
The issue was the build. I used to build with Maven like this, following the README file on GitHub:
build/mvn -DskipTests clean package
I have now built it this way, following their documentation:
build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
Does anyone know what those parameters are?
