Pyspark running external program using subprocess can't read files from hdfs - apache-spark

I'm trying to run an external program (such as bwa) from within PySpark. My code looks like this:
import sys
import subprocess
from pyspark import SparkContext
def bwaRun(args):
    a = ['/home/hd_spark/tool/bwa-0.7.13/bwa', 'mem', ref, args]
    result = subprocess.check_output(a)
    return result
sc = SparkContext(appName = 'sub')
ref = 'hdfs://Master:9000/user/hd_spark/spark/ref/human_g1k_v37_chr13_26577411_30674729.fasta'
input = 'hdfs://Master:9000/user/hd_spark/spark/chunk_interleaved.fastq'
chunk_name = []
chunk_name.append(input)
data = sc.parallelize(chunk_name,1)
print data.map(bwaRun).collect()
I'm running Spark on a YARN cluster with 6 worker nodes, and each node has the bwa program installed. When I run the code, the bwaRun function can't read the input files from HDFS. In hindsight this is expected, because running bwa directly from the shell with
bwa mem hdfs://Master:9000/user/hd_spark/spark/ref/human_g1k_v37_chr13_26577411_30674729.fasta hdfs://Master:9000/user/hd_spark/spark/chunk_interleaved.fastq
fails in the same way: bwa can't read files from HDFS.
Can anyone give me an idea of how I could solve this?
Thanks in advance!
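One possible direction, sketched here under assumptions rather than taken from an answer in the thread: copy the HDFS files to a scratch directory on the worker with hdfs dfs -get, then hand the local paths to bwa. The helper names hdfs_get_cmd and bwa_run_local and the run parameter are hypothetical, and the sketch assumes the hdfs command is on the PATH of every worker. Note that bwa mem also expects the index files produced by bwa index next to the reference, so those would need to be copied as well.

```python
import os
import shutil
import subprocess
import tempfile

BWA = '/home/hd_spark/tool/bwa-0.7.13/bwa'  # path taken from the question


def hdfs_get_cmd(hdfs_path, local_dir):
    """Return (command, local_path) for copying one HDFS file into local_dir."""
    local_path = os.path.join(local_dir, os.path.basename(hdfs_path))
    return ['hdfs', 'dfs', '-get', hdfs_path, local_path], local_path


def bwa_run_local(ref_hdfs, reads_hdfs, run=subprocess.check_output):
    """Copy both inputs to a scratch directory on the worker, then call bwa.

    `run` is injectable (hypothetical parameter) so the logic can be
    exercised without a cluster.
    """
    scratch = tempfile.mkdtemp()
    try:
        local_paths = []
        for hdfs_path in (ref_hdfs, reads_hdfs):
            cmd, local_path = hdfs_get_cmd(hdfs_path, scratch)
            run(cmd)  # fetch the file from HDFS onto the local disk
            local_paths.append(local_path)
        # bwa now only sees ordinary local file paths
        return run([BWA, 'mem'] + local_paths)
    finally:
        # Remove the local copies whether or not bwa succeeded
        shutil.rmtree(scratch, ignore_errors=True)
```

With the names from the question, the map step could then become data.map(lambda chunk: bwa_run_local(ref, chunk)).collect(), provided ref is defined before the job runs.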

Related

Importing package in client mode PYSPARK

I need to use a virtual environment on a PySpark EMR cluster.
I am launching the application with spark-submit using the following configuration.
spark-submit --deploy-mode client --archives path_to/environment.tar.gz#environment --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python
The environment variables are set in the Python script. Importing packages works inside a function that runs through the Spark context, but I want to import them outside the function as well. What's wrong?
import os
from pyspark import SparkConf
from pyspark import SparkContext
import pendulum #IMPORT ERROR!!!!!!!!!!!!!!!!!!!!!!!
os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"
os.environ["PYSPARK_DRIVER_PYTHON"] = "which python"
conf = SparkConf()
conf.setAppName('spark-yarn')
sc = SparkContext(conf=conf)
def some_function(x):
    import pendulum #IMPORT CORRECT
    dur = pendulum.duration(days=x)
    # More properties
    # Use the libraries to do work
    return dur.weeks
rdd = (sc.parallelize(range(1000))
       .map(some_function)
       .take(10))
print(rdd)
import pendulum #IMPORT ERROR
From the looks of it, you are using the correct environment for the workers (PYSPARK_PYTHON) but overriding the environment for the driver (PYSPARK_DRIVER_PYTHON), so the driver's Python does not see the package you want to import. The code in some_function is executed by the workers (never by the driver), so it can see the package, while the module-level imports are executed by the driver and fail.
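To check this, it can help to print which interpreter is actually running and whether it can see the package. The following is a small diagnostic sketch; describe_interpreter is a hypothetical helper, not part of PySpark.

```python
import importlib.util
import sys


def describe_interpreter(module_name):
    """Report which Python is running and whether it can see module_name.

    module_name is expected to be a top-level name such as "pendulum".
    """
    found = importlib.util.find_spec(module_name) is not None
    return {"python": sys.executable, "module": module_name, "importable": found}


# On the driver this shows the interpreter that executes the module-level imports.
print(describe_interpreter("pendulum"))
```

Calling the same function inside a map, for example sc.parallelize([0]).map(lambda _: describe_interpreter("pendulum")).collect(), should report the worker side, so the two results can be compared.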

Pyspark module not found error even after passing module as zip file

I have a PySpark code repo which I am sending to the Spark session as a zip file through the --py-files parameter. I am doing this because one of the Python files in the module defines a UDF, and that UDF is not available when we run the code because the module is not present on the workers.
Even though all the Python files are present inside the zip file, I still get the module not found error.
|-Module
|----test1.py
|----test2.py
|----test3.py
When I try from Module.test2 import foo in test3.py, I get an error that Module.test2 is not found. test2 contains a PySpark UDF.
Any help would be greatly appreciated.
from pyspark.sql import SparkSession
Initiate a Spark session:
spark = SparkSession.builder\
    .appName(self.app_name)\
    .master(self.master)\
    .config(conf=conf)\
    .getOrCreate()
Remove the config call if it is not needed.
Make sure that Module.zip is in the same directory as main.py:
spark.sparkContext.addPyFile("Module.zip")
Import the dependency only after adding the package zip to the SparkSession:
from Module.test2 import foo
Finally, the layout and the complete code:
|-main.py - Core code to be executed.
|-Module.zip - Dependency module zipped.
from pyspark.sql import SparkSession
spark = SparkSession.builder\
    .appName(self.app_name)\
    .master(self.master)\
    .config(conf=conf)\
    .getOrCreate()
spark.sparkContext.addPyFile("Module.zip")
from Module.test2 import foo
IMPORTANT - Make sure you zip your files correctly. If the problem persists, try another zip method.
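What "correctly" means is not spelled out above. One plausible reading, offered here as an assumption, is that the archive root must contain the Module/ directory itself (not just its files) so that from Module.test2 import foo can resolve. The sketch below builds such an archive with the standard zipfile module; zip_package is a hypothetical helper, the package contents are throwaway stand-ins, and the __init__.py file is an assumption about the package layout.

```python
import os
import sys
import tempfile
import zipfile


def zip_package(package_dir, zip_path):
    """Zip package_dir so the archive root contains the package folder itself."""
    package_dir = os.path.abspath(package_dir)
    parent = os.path.dirname(package_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(package_dir):
            for name in files:
                full = os.path.join(root, name)
                # Store paths relative to the parent, e.g. "Module/test2.py"
                zf.write(full, os.path.relpath(full, parent))
    return zip_path


# Build a throwaway package that mirrors the layout in the question.
work = tempfile.mkdtemp()
pkg = os.path.join(work, "Module")
os.makedirs(pkg)
with open(os.path.join(pkg, "__init__.py"), "w") as f:
    f.write("")
with open(os.path.join(pkg, "test2.py"), "w") as f:
    f.write("def foo():\n    return 'foo from Module.test2'\n")

archive = zip_package(pkg, os.path.join(work, "Module.zip"))
with zipfile.ZipFile(archive) as zf:
    names = sorted(zf.namelist())
print(names)

# Putting the archive on sys.path is a local stand-in for addPyFile.
sys.path.insert(0, archive)
from Module.test2 import foo
print(foo())
```

If the listing shows test2.py at the top level instead of Module/test2.py, the import path Module.test2 cannot be found inside the archive.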

pySpark local mode - loading text file with file:/// vs relative path

I am just getting started with Spark and I am trying out examples in local mode.
I noticed that in some examples the RDD is created with a relative path to the file, while in others the path starts with "file:///". The second option did not work for me at all: "Input path does not exist".
Can anyone explain the difference between using the plain file path and putting 'file:///' in front of it?
I am using Spark 2.2 on a Mac, running in local mode.
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("test")
sc = SparkContext(conf = conf)
#This will work providing the relative path
lines = sc.textFile("code/test.csv")
#This will not work
lines = sc.textFile("file:///code/test.csv")
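sc.textFile("code/test.csv") has no scheme, so the path is resolved against the default file system (fs.defaultFS) and its working directory. On a cluster configured for HDFS that means a path on HDFS; in local mode with no Hadoop configuration it means the local directory the job was started from, which is why the relative path works here.
Writing the scheme explicitly, as in sc.textFile("hdfs:///<path>/code/test.csv"), refers to HDFS regardless of the default.
sc.textFile("file:///code/test.csv") means /code/test.csv on the local file system, that is, an absolute path starting at the root directory. It fails because the file is not there; to use file://, give the full absolute path to the file.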
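As a small illustration of the last point, an absolute file:// URI can be derived from a relative path with the standard pathlib module. to_file_uri is a hypothetical helper name, and the result depends on the directory the script is started from.

```python
from pathlib import Path


def to_file_uri(path):
    """Turn a relative or absolute local path into an absolute file:// URI."""
    return Path(path).resolve().as_uri()


uri = to_file_uri("code/test.csv")
print(uri)
```

The printed value is the kind of string that could be passed to sc.textFile in place of "file:///code/test.csv".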

Spark No module named found

I have a simple spark program and I get the following error -
Error:-
ImportError: No module named add_num
Command used to run :-
./bin/spark-submit /Users/workflow/test_task.py
Code:-
from __future__ import print_function
from pyspark.sql import SparkSession
from add_num import add_two_nos
def map_func(x):
    print(add_two_nos(5))
    return x*x
def main():
    spark = SparkSession\
        .builder\
        .appName("test-task")\
        .master("local[*]")\
        .getOrCreate()
    rdd = spark.sparkContext.parallelize([1,2,3,4,5]) # distribute the list as an RDD
    rdd = rdd.map(map_func) # apply map_func to every element
    print(rdd.collect())
    spark.stop()
if __name__ == "__main__":
    main()
function code:-
def add_two_nos(x):
    return x*x
You can specify the .py file from which you wish to import in the code itself by adding the statement sc.addPyFile(path). The path passed can be either a local file, a file in HDFS (or another Hadoop-supported filesystem), or an HTTP, HTTPS or FTP URI.
Then use from add_num import add_two_nos.
You need to include a zip containing add_num.py in your spark-submit command.
./bin/spark-submit --py-files sources.zip /Users/workflow/test_task.py
When submitting a Python application to Spark, all the source files imported by the main file (here test_task.py) should be packed in an egg or zip and supplied to Spark using the --py-files option. If the main file needs only one other file, you can supply it directly without zipping it.
./bin/spark-submit --py-files add_num.py /Users/workflow/test_task.py
The above command should also work, since only one other Python source file is required.
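For the zipped variant, a sources.zip with add_num.py at the archive root can be built with the standard zipfile module. This is only a sketch: the temporary directory is a stand-in for the real project folder, and adding the archive to sys.path is a local approximation of what --py-files arranges, not the Spark mechanism itself.

```python
import os
import sys
import tempfile
import zipfile

# Recreate add_num.py from the question in a scratch directory.
work = tempfile.mkdtemp()
module_path = os.path.join(work, "add_num.py")
with open(module_path, "w") as f:
    f.write("def add_two_nos(x):\n    return x*x\n")

# arcname keeps add_num.py at the archive root, so "import add_num" resolves.
zip_path = os.path.join(work, "sources.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.write(module_path, arcname="add_num.py")

# Local check that the module is importable from the archive.
sys.path.insert(0, zip_path)
from add_num import add_two_nos
print(add_two_nos(5))
```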

Pyspark - FileInputDStream: Error finding new files

Hi, I'm new to PySpark and I'm trying out this example from the Spark GitHub repository, which counts words in new text files created in the given directory:
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: hdfs_wordcount.py <directory>", file=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="PythonStreamingHDFSWordCount")
    ssc = StreamingContext(sc, 1)
    lines = ssc.textFileStream("hdfs:///home/my-logs/")
    counts = lines.flatMap(lambda line: line.split(" "))\
                  .map(lambda x: (x, 1))\
                  .reduceByKey(lambda a, b: a+b)
    counts.pprint()
    ssc.start()
    ssc.awaitTermination()
And this is what I get:
a warning saying: WARN FileInputDStream: Error finding new files
and I get empty results even though I'm adding files to this directory :/
Any suggested solution for this?
Thanks.
The issue is that Spark Streaming will not read files that were already in the directory, and all your log files existed before the streaming job started.
So once the streaming job is running, put or copy the input files into the HDFS directory, either manually or with a script.
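The "with a script" part can be sketched for a local directory as follows. This is a minimal illustration, not part of the original answer: drop_file_atomically and the file names are hypothetical, and it follows the Spark Streaming guidance that files should appear in the monitored directory by being moved or renamed into it. For HDFS the equivalent would be done with the hdfs command line tools rather than os.replace.

```python
import os
import tempfile


def drop_file_atomically(content, watched_dir, name):
    """Write content outside watched_dir first, then rename it into place."""
    # Stage next to the watched directory so the rename stays on one filesystem.
    parent = os.path.dirname(os.path.abspath(watched_dir))
    staging_dir = tempfile.mkdtemp(dir=parent)
    staged = os.path.join(staging_dir, name)
    with open(staged, "w") as f:
        f.write(content)
    final = os.path.join(watched_dir, name)
    os.replace(staged, final)  # the file appears in watched_dir in one step
    os.rmdir(staging_dir)
    return final


# Demo with a temporary directory standing in for the monitored directory.
base = tempfile.mkdtemp()
watched = os.path.join(base, "my-logs")
os.makedirs(watched)
path = drop_file_atomically("hello spark hello\n", watched, "test.txt")
print(os.listdir(watched))
```

Running such a script after ssc.start() has been called means the file is new from the streaming job's point of view.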
I think you are referring to this example. Are you able to run it unmodified? I see you are hard-coding the directory to "hdfs:///" in the program. You can run the example as shown below.
For example, if Spark is at /opt/spark-2.0.2-bin-hadoop2.7, you can run hdfs_wordcount.py from the examples directory as follows. Here /tmp is the directory passed as the argument to the program.
user1@user1:/opt/spark-2.0.2-bin-hadoop2.7$ bin/spark-submit examples/src/main/python/streaming/hdfs_wordcount.py /tmp
Now, while this program is running, open another terminal and copy a file to the /tmp folder:
user1@user1:~$ cp test.txt /tmp
You will see the word count in the first terminal.
Solved!
The issue was the build. I used to build with Maven like this, following the README file on GitHub:
build/mvn -DskipTests clean package
Now I build this way, following the documentation:
build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
Does anyone know what those parameters are?
