PySpark config through airflow - apache-spark

I'm trying to pass the package org.apache.spark:spark-avro_2.12:2.4.3 to the config through SparkSubmitOperator, as described here: https://spark.apache.org/docs/2.4.3/sql-data-sources-avro.html, since I'm trying to use Spark to read Avro files.
This is what I did in my Airflow DAG, but it didn't work. Could someone please point out what I did wrong? Many thanks.
conf = Variable.get("spark_conf", deserialize_json=True)
conf_sp = conf.update({"spark.jars.packages": "org.apache.spark:spark-avro_2.12:2.4.3"})

op = SparkSubmitOperator(
    application="my_app",
    conf=conf_sp,
    ....
)

The SparkSubmitOperator relies on the SparkSubmitHook which at the end composes a spark-submit CLI command to be executed.
In that CLI form, dependencies that should be fetched from Maven have to be passed through the packages option, not through the configuration option.
op = SparkSubmitOperator(
    application="my_app",
    packages="org.apache.spark:spark-avro_2.12:2.4.3"
)
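Note also that dict.update() returns None in Python, so conf_sp in the question ends up as None rather than the merged dictionary. If you still want to pass the JSON conf stored in the Airflow Variable, a minimal sketch (assuming the same Variable and application as in the question) keeps it in conf and moves the dependency to packages:

conf = Variable.get("spark_conf", deserialize_json=True)

op = SparkSubmitOperator(
    application="my_app",
    conf=conf,  # the existing conf from the Variable, unchanged
    packages="org.apache.spark:spark-avro_2.12:2.4.3",  # fetched from Maven by spark-submit
    # ... other task arguments as before
)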

Related

Importing package in client mode PYSPARK

I need to use a virtual environment in a PySpark EMR cluster.
I am launching the application with spark-submit using the following configuration.
spark-submit --deploy-mode client --archives path_to/environment.tar.gz#environment --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python
Environment variables are set in the Python script. The code works when importing packages inside the function run on the Spark context, but I want to import outside the function. What's wrong?
import os

from pyspark import SparkConf
from pyspark import SparkContext

import pendulum  # IMPORT ERROR!!!

os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"
os.environ["PYSPARK_DRIVER_PYTHON"] = "which python"

conf = SparkConf()
conf.setAppName('spark-yarn')
sc = SparkContext(conf=conf)

def some_function(x):
    import pendulum  # IMPORT CORRECT
    dur = pendulum.duration(days=x)
    # More properties
    # Use the libraries to do work
    return dur.weeks

rdd = (sc.parallelize(range(1000))
       .map(some_function)
       .take(10))
print(rdd)

import pendulum  # IMPORT ERROR
From the looks of it, you are using the correct environment for the workers ("PYSPARK_PYTHON") but overriding the environment for the driver ("PYSPARK_DRIVER_PYTHON"), so the driver's Python does not see the package you want to import. The code in some_function is executed by the workers (never by the driver), which is why it can see the imported package, while the last import is executed by the driver and fails.
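A minimal diagnostic sketch (hypothetical, assuming the same archive setup as in the question) that makes the mismatch visible by printing the interpreter used by the driver next to the one used by an executor:

import sys
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName('env-check'))

# The driver's interpreter (the one that fails to import pendulum) ...
print("driver python:", sys.executable)
# ... versus the executors' interpreter from the shipped archive.
print("executor python:",
      sc.parallelize([0], 1).map(lambda _: __import__("sys").executable).first())

In client mode the driver runs on the machine you launch from, so for the top-level import to work the driver has to use (or be pointed at) an environment that actually has pendulum installed.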

Cloud Dataflow Python 3 job not resolving dependencies

I have a simple Apache Beam project using Python 3 to transform some data and write to BigQuery. It uses a package called textstat; if I run it locally everything works, but when I run it on Dataflow I get the following error:
NameError: name 'textstat' is not defined [while running 'generatedPtransform-441']
This is my current setup.py file:
import setuptools

REQUIRED_PACKAGES = ['textstat==0.5.6']
PACKAGE_NAME = 'my_package'
PACKAGE_VERSION = '0.0.1'

setuptools.setup(
    name=PACKAGE_NAME,
    version=PACKAGE_VERSION,
    description='Example project',
    install_requires=REQUIRED_PACKAGES,
    packages=setuptools.find_packages(),
)
and these are my pipeline args
pipeline_args = [
    '--project={}'.format('etl-example'),
    '--runner={}'.format('Dataflow'),
    '--temp_location=gs://dataflowtemporal/',
    '--setup_file=./setup.py',
]
and I run it like this
pipeline_options = PipelineOptions(pipeline_args)
pipeline_options.view_as(StandardOptions).streaming = True
pipeline = beam.Pipeline(options=pipeline_options)
...
pipeline.run()
I also tried running this in the terminal before running the job:
python setup.py sdist --formats=gztar
but I get the same result of textstat not being found.
Another thing I tried was without setup.py and only with the argument
--requirements_file=./requirements.txt
But again, textstat is not found.
At this point I don't know what else to try.
Normally this happens because the library is not imported locally in your DoFn.
Alternatively, you can try the --save_main_session option, as mentioned in https://cloud.google.com/dataflow/docs/resources/faq
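As a minimal sketch of the first suggestion (a hypothetical DoFn, not the asker's code), the import can be moved inside the DoFn so that it is resolved on the Dataflow workers:

import apache_beam as beam

class ComputeReadability(beam.DoFn):
    def process(self, element):
        # Imported here so each worker resolves textstat itself instead of
        # relying on objects pickled from the main session.
        import textstat
        yield textstat.flesch_reading_ease(element)

The --save_main_session flag, if you go that route, can simply be appended to pipeline_args alongside the other options.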

Turn off pyspark logging through python script

How can I turn off PySpark logging from a Python script?
Please note: I do not want to make any changes to the Spark logger properties file.
To remove (or modify) logging from a Python script:
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.set('spark.logConf', 'true')  # necessary in order to be able to change the log level
...  # other stuff and configuration

# create the session
spark = SparkSession.builder \
    .config(conf=conf) \
    .appName(app_name) \
    .getOrCreate()

# set the log level to one of ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN
spark.sparkContext.setLogLevel("OFF")
See the Spark docs on configuration and on setLogLevel.
Hope this helps, good luck!
Edit: For earlier versions, e.g. 1.6, you can try something like the following, taken from here
logger = sc._jvm.org.apache.log4j
logger.LogManager.getLogger("org").setLevel(logger.Level.OFF)
# or
logger.LogManager.getRootLogger().setLevel(logger.Level.OFF)
I haven't tested it, unfortunately; please let me know if it works.

Randomness of hash of string should be disabled via PYTHONHASHSEED

I use Spark in YARN mode, and I have a problem when I run
pyspark --master yarn
under Python 3.5. When I run code like this
user_data = sc.textFile("/testdata/u.user")
user_fields = user_data.map(lambda line: line.split("|"))
num_genders = user_fields.map(lambda fields: fields[2]).distinct().count()
the result shows
File "/data/opt/spark-2.1.0-bin-hadoop2.6/python/pyspark/rdd.py", line 1753, in add_shuffle_key
File "/data/opt/hadoop-2.6.0/tmp/nm-local-dir/usercache/jsdxadm/appcache/application_1494985561557_0005/container_1494985561557_0005_01_000002/pyspark.zip/pyspark/rdd.py", line 74, in portable_hash
raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED environ=")
I have tried but cannot resolve this. Can you help me?
Include spark.executorEnv.PYTHONHASHSEED 0 in your spark-defaults.conf (in your Spark ./conf directory). That should work!
This is a problem in Spark 2.1 that is resolved in 2.2. If you are not able to upgrade or do not have access to spark-defaults.conf, you can use
export SPARK_YARN_USER_ENV=PYTHONHASHSEED=0
before you submit your job.
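The same property can also be passed per application, for example with --conf spark.executorEnv.PYTHONHASHSEED=0 on the command line. As a minimal sketch in code (assuming the Spark 2.x SparkSession API and a hypothetical app name):

from pyspark.sql import SparkSession

# Build a session that sets PYTHONHASHSEED for the executors before they start.
spark = (SparkSession.builder
         .appName("hashseed-example")
         .config("spark.executorEnv.PYTHONHASHSEED", "0")
         .getOrCreate())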

Pyspark running external program using subprocess can't read files from hdfs

I'm trying to run an external program (such as bwa) within PySpark. My code looks like this.
import sys
import subprocess

from pyspark import SparkContext

def bwaRun(args):
    a = ['/home/hd_spark/tool/bwa-0.7.13/bwa', 'mem', ref, args]
    result = subprocess.check_output(a)
    return result

sc = SparkContext(appName='sub')

ref = 'hdfs://Master:9000/user/hd_spark/spark/ref/human_g1k_v37_chr13_26577411_30674729.fasta'
input = 'hdfs://Master:9000/user/hd_spark/spark/chunk_interleaved.fastq'

chunk_name = []
chunk_name.append(input)
data = sc.parallelize(chunk_name, 1)

print data.map(bwaRun).collect()
I'm running Spark on a YARN cluster with 6 slave nodes, and each node has the bwa program installed. When I run the code, the bwaRun function can't read the input files from HDFS. It's kind of obvious this doesn't work, because when I tried to run the bwa program locally by giving
bwa mem hdfs://Master:9000/user/hd_spark/spark/ref/human_g1k_v37_chr13_26577411_30674729.fasta hdfs://Master:9000/user/hd_spark/spark/chunk_interleaved.fastq
on the shell, it didn't work because bwa can't read files from HDFS.
Can anyone give me an idea how I could solve this?
Thanks in advance!
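A minimal sketch of one possible direction, assuming the files are shipped to each executor's local disk with SparkContext.addFile (which accepts HDFS paths); note that bwa mem also needs the reference index files, which would have to be added the same way:

import subprocess

from pyspark import SparkContext, SparkFiles

sc = SparkContext(appName='sub')

ref = 'hdfs://Master:9000/user/hd_spark/spark/ref/human_g1k_v37_chr13_26577411_30674729.fasta'
reads = 'hdfs://Master:9000/user/hd_spark/spark/chunk_interleaved.fastq'

# Download both HDFS files into every executor's local working directory.
sc.addFile(ref)
sc.addFile(reads)

def bwaRun(reads_name):
    # Resolve the local copies that Spark placed on this executor.
    local_ref = SparkFiles.get('human_g1k_v37_chr13_26577411_30674729.fasta')
    local_reads = SparkFiles.get(reads_name)
    return subprocess.check_output(
        ['/home/hd_spark/tool/bwa-0.7.13/bwa', 'mem', local_ref, local_reads])

print(sc.parallelize(['chunk_interleaved.fastq'], 1).map(bwaRun).collect())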
