Unable to run simple pyspark program - apache-spark

I am trying to create an RDD from a file on my local file system. I am using the Eclipse IDE on Windows. Below is my code:
from pyspark import SparkConf
from pyspark import SparkContext
conf = SparkConf().setAppName("FirstProgram").setMaster("Local")
sc = SparkContext("local")
load_data=sc.textFile("E://words.txt")
load_data.collect()
Below is my config:
1) Spark 2.4.4
2) Python 3.7.4
I tried variations of the file path but no luck. Below are the contents of the project directory; the file is stored in the source folder, yet I am still unable to read it. However, I am able to read that file via the same path, i.e. E:/words.txt. I think there is some problem with the SparkContext object.
Directory of E:\workspacewa\FirstSparkProject\Sample
10/12/2019 07:33 PM <DIR> .
10/12/2019 07:33 PM <DIR> ..
10/12/2019 07:34 PM 119 FileRead.py
10/12/2019 06:21 PM 269 FirstSpark.py
02/02/2019 09:22 PM 82 words.txt
10/12/2019 01:22 PM 0 __init__.py
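On the SparkContext suspicion above: in the snippet, the conf built on the first line is never passed to the SparkContext, and master names are conventionally lowercase. A minimal sketch of the usual wiring, with an explicit file:/// URI for the local Windows path (the path itself is the one from the question):
from pyspark import SparkConf, SparkContext

# Build the conf once and hand it to the context (lowercase "local[*]" master)
conf = SparkConf().setAppName("FirstProgram").setMaster("local[*]")
sc = SparkContext(conf=conf)

# file:/// makes it explicit that the path is on the local Windows filesystem
load_data = sc.textFile("file:///E:/words.txt")
print(load_data.collect())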
I reinstalled everything and am now facing a new error, shown below:
Exception ignored in: <function Popen.__del__ at 0x000001924C5434C8>
Traceback (most recent call last):
File "C:\Users\siddh\AppData\Local\Programs\Python\Python37\lib\subprocess.py", line 860, in __del__
self._internal_poll(_deadstate=_maxsize)
File "C:\Users\siddh\AppData\Local\Programs\Python\Python37\lib\subprocess.py", line 1216, in _internal_poll
if _WaitForSingleObject(self._handle, 0) == _WAIT_OBJECT_0:
OSError: [WinError 6] The handle is invalid

I cleaned up all the temp files, reinstalled everything, and tried once again with the code below; it works like a charm.
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from pyspark import SparkConf
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))
load_data=sc.textFile("E://long_sample.txt")
load_data.foreach(print)
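Note that foreach(print) runs on the executors, so outside local mode its output lands in the executor logs rather than the driver console. For a quick local check, pulling a small sample back to the driver is usually clearer (a minimal sketch, continuing from the snippet above):
# take() ships only the first few elements back to the driver
for line in load_data.take(10):
    print(line)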

Related

How could I resolve Pyspark error on SparkSession?

Please help me resolve the problem I'm facing.
Here is the code I run in a Jupyter notebook.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('abc').getOrCreate()
I got the following error:
RuntimeError Traceback (most recent call last)
C:\Users\MOHANA~1\AppData\Local\Temp/ipykernel_18536/4172048100.py in <module>
1 from pyspark.sql import SparkSession
----> 2 spark = SparkSession.builder.appName('abc').getOrCreate()
RuntimeError: Java gateway process exited before sending its port number
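This error generally means PySpark could not launch the JVM at all, most often because Java is not installed or JAVA_HOME points at the wrong place. A minimal sketch of one common fix; the JDK path below is a placeholder, not something from the question:
import os

# Point JAVA_HOME at an actual JDK install before importing pyspark
# (hypothetical path; adjust to your machine)
os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk1.8.0_202"
os.environ["PATH"] = os.path.join(os.environ["JAVA_HOME"], "bin") + os.pathsep + os.environ["PATH"]

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('abc').getOrCreate()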

Read XML file in Jupyter notebook

I tried to read an XML file in a Jupyter notebook and am getting the error below.
import os
from os import environ
environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.13:0.4.1 pyspark-shell'
df = spark.read.format('xml').option('rowTag', 'row').load('test.xml')
The error:
Py4JJavaError: An error occurred while calling o251.load.
: java.lang.ClassNotFoundException: Failed to find data source: xml. Please find packages at http://spark.apache.org/third-party-projects.html
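Two usual suspects here: PYSPARK_SUBMIT_ARGS was set after the SparkSession already existed (so the package was never downloaded), or the artifact's Scala suffix does not match the Scala version of the Spark build. A minimal sketch, with illustrative (not verified) package coordinates, setting the arguments before any session is created:
import os

# Must happen before the SparkSession/SparkContext is created; the _2.12 suffix
# and version below are illustrative; match them to your Spark build.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages com.databricks:spark-xml_2.12:0.14.0 pyspark-shell'
)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('xml-read').getOrCreate()
df = spark.read.format('xml').option('rowTag', 'row').load('test.xml')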

Module masked in PySpark, not accessible

I am a novice at PySpark. I was trying to run some PySpark code. I ran a script named "time.py", because of which PySpark is now unable to run. I get the error below.
Traceback (most recent call last):
File "/home/VAL_CODE/test.py", line 1, in <module>
from pyspark import SparkContext,HiveContext,SparkConf
File "/opt/cloudera/parcels/CDH-6.3.3-1.cdh6.3.3.p0.1796617/lib/spark/python/lib/pyspark.zip/pyspark/__init__.py", line 51, in <module>
File "/opt/cloudera/parcels/CDH-6.3.3-1.cdh6.3.3.p0.1796617/lib/spark/python/lib/pyspark.zip/pyspark/context.py", line 24, in <module>
File "/usr/lib64/python2.7/threading.py", line 14, in <module>
from time import time as _time, sleep as _sleep
File "/home/VAL_CODE/time.py", line 1, in <module>
ImportError: cannot import name SparkContext
20/08/13 19:04:16 INFO util.ShutdownHookManager: Shutdown hook called
20/08/13 19:04:16 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-f104da1f-ba70-4c45-8a19-6ffc55b609aa
The error is that the standard-library script "/usr/lib64/python2.7/threading.py" ends up importing the local script "/home/VAL_CODE/time.py" which I created. I have since deleted "/home/VAL_CODE/time.py", but I still face the issue when I run a new script, "/home/VAL_CODE/test.py". Please help me resolve this.
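One plausible culprit, offered as a guess rather than a confirmed diagnosis: Python 2.7 writes time.pyc next to time.py, and that leftover bytecode keeps shadowing the standard-library time module even after the .py file is deleted. A minimal cleanup sketch:
import os

# Remove any stale bytecode left behind by the deleted time.py
for leftover in ("/home/VAL_CODE/time.pyc", "/home/VAL_CODE/time.pyo"):
    if os.path.exists(leftover):
        os.remove(leftover)
It is also worth running test.py from a directory that does not contain a time.py or time.pyc, since the script's own directory is placed at the front of sys.path.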

PySpark custom UDF ModuleNotFoundError: No module named

I am testing existing code with Python 3.6, but somehow a UDF that used to work with Python 2.7 is not working as is, and I couldn't figure out where the issue lies. Is anyone facing a similar issue, locally or in distributed mode? It looks similar to https://github.com/mlflow/mlflow/issues/797
Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 202, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 219, in main
func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 139, in read_udfs
arg_offsets, udf = read_single_udf(pickleSer, infile, eval_type)
File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 119, in read_single_udf
f, return_type = read_command(pickleSer, infile)
File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 59, in read_command
command = serializer._read_with_length(file)
File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/lib/pyspark.zip/pyspark/serializers.py", line 170, in _read_with_length
return self.loads(obj)
File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/lib/pyspark.zip/pyspark/serializers.py", line 559, in loads
return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'project'
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:83)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:66)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1126)
at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1132)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:213)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.writeIteratorToStream(PythonUDFRunner.scala:52)
at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:215)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1992)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:170)
Driver stacktrace:
1. My project has sub-packages nested inside a package:
pkg
    subpkg1
        subpkg2
            .py
2. From my Main.py I am calling a UDF, which in turn calls a function in a subpkg2 (.py) file.
3. Due to the deep nesting and the UDFs calling many other functions, the Spark job somehow couldn't find the subpkg2 files.
Solution: create an egg file of the pkg and ship it via --py-files, as in the sketch below.
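A minimal sketch of that approach, assuming the package root is ./pkg; a zip built with shutil behaves like an egg for this purpose when shipped to the executors:
import shutil

from pyspark.sql import SparkSession

# Bundle the whole pkg/ package into pkg.zip
shutil.make_archive("pkg", "zip", root_dir=".", base_dir="pkg")

spark = SparkSession.builder.master("local[*]").appName("udf-demo").getOrCreate()

# Ship the bundle to the executors; on the command line the equivalent is:
#   spark-submit --py-files pkg.zip Main.py
spark.sparkContext.addPyFile("pkg.zip")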
I had a similar situation and #Avinash's answer worked for me. If the sub-package is nested below other packages and is referenced in the code directly, I had to create a separate zip file for the sub-package module (subpkg2 in this case) and add it to the Spark context using addPyFile.
scripts
|__ analysis.py
pkg
|__ __init__.py
|__ subpkg1
    |__ __init__.py
    |__ subpkg2
        |__ __init__.py
        |__ file1.py
#########################
## scripts/analysis.py ##
#########################
import os
import sys

from pyspark.sql import SparkSession

# Add pkg to path
path = os.path.join(os.path.dirname(__file__), os.pardir)
sys.path.append(path)

# Sub package referenced directly
from subpkg2 import file1
...
...
spark = (
    SparkSession
    .builder
    .master("local[*]")
    .appName("some app")
    .getOrCreate()
)
# Need to add this, else references to the sub package do not work when using a UDF
spark.sparkContext.addPyFile("subpkg2.zip")
...
...
# Some code here that uses Pandas UDF with PySpark
I also noticed that in Cloudera Data Science Workbench (I am not sure if this is a generic finding or specific to CDSW), if subpkg2 is at the root level (i.e. it is a package, not a sub-package nested within pkg and subpkg1), then I do not have to zip up subpkg2 and the UDF is able to recognize all the custom modules directly. I am not sure why this is the case; I am still looking for an answer to this question.
scripts
|__ analysis.py
subpkg2
|__ __init__.py
|__ file1.py
#########################
## scripts/analysis.py ##
#########################
# Everything is the same as in the original example, except that there is
# no need for this line. For some reason, UDFs recognize module
# references at the top level but not sub-module references.
# spark.sparkContext.addPyFile("subpkg2.zip")
This brings me to the final debugging step I tried on the original example. If we change the references in the file to start with pkg.subpkg1, then we don't have to pass subpkg2.zip to the Spark context.
#########################
## scripts/analysis.py ##
#########################
import os
import sys

from pyspark.sql import SparkSession

# Add pkg to path
path = os.path.join(os.path.dirname(__file__), os.pardir)
sys.path.append(path)

# Specify the full path here
from pkg.subpkg1.subpkg2 import file1
...
...
spark = (
    SparkSession
    .builder
    .master("local[*]")
    .appName("some app")
    .getOrCreate()
)
# Don't need to add the zip file anymore since we changed the imports to use the full path
# spark.sparkContext.addPyFile("subpkg2.zip")
...
...
# Some code here that uses Pandas UDF with PySpark
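Continuing the snippet above (spark and file1 are the ones already defined there), a minimal Pandas UDF sketch to stand in for that last comment; the transform function inside file1 is a hypothetical example:
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")  # scalar Pandas UDF, available on Spark 2.3+
def transform_udf(values):
    # values is a pandas Series; file1.transform is a made-up example function
    return values.apply(file1.transform)

df = spark.range(10).withColumn("out", transform_udf("id"))
df.show()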

pyspark 1.6.0 write to parquet gives "path exists" error

I read in parquet file(s) from different folders, e.g. for February of this year (one folder = one day):
indata = sqlContext.read.parquet('/data/myfolder/201602*')
do some very simple grouping and aggregation
outdata = indata.groupby(...).agg()
and want to store again.
outdata.write.parquet(outloc)
Here is how I run the script from bash:
spark-submit
--master yarn-cluster
--num-executors 16
--executor-cores 4
--driver-memory 8g
--executor-memory 16g
--files /etc/hive/conf/hive-site.xml
--driver-java-options
-XX:MaxPermSize=512m
spark_script.py
This generates multiple jobs (is that the right term?). The first job runs successfully, but subsequent jobs fail with the following error:
Traceback (most recent call last):
File "spark_generate_maps.py", line 184, in <module>
outdata.write.parquet(outloc)
File "/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 471, in parquet
File "/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 51, in deco
pyspark.sql.utils.AnalysisException: u'path OBFUSCATED_PATH_THAT_I_CLEANED_BEFORE_SUBMIT already exists.;'
When I give only one folder as input, this works fine.
So it seems the first job creates the folder and all subsequent jobs fail to write into that folder. Why?
Just in case this could help anybody:
imports:
from pyspark import SparkContext, SparkConf, SQLContext
from pyspark.sql import HiveContext
from pyspark.sql.types import *
from pyspark.sql.functions import udf, collect_list, countDistinct, count
import pyspark.sql.functions as func
from pyspark.sql.functions import lit
import numpy as np
import sys
import math
config:
conf = SparkConf().setAppName('spark-compute-maps').setMaster('yarn-cluster')
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)
Your question is essentially "why does Spark iterate over the input folders, but apply the default write mode, which does not make sense in that context?"
Quoting the Spark 1.6 Python API:
mode(saveMode)
Specifies the behavior when data or table already exists. Options include:
append: Append contents of this DataFrame to existing data.
overwrite: Overwrite existing data.
error: Throw an exception if data already exists.
ignore: Silently ignore this operation if data already exists.
I think outdata.write.mode('append').parquet(outloc) is worth a try.
You should add the mode option in your code:
outdata.write.mode('append').parquet(outloc)
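A minimal sketch of the two variants; which one fits depends on whether each run should add to or replace whatever is already under outloc:
outdata.write.mode('append').parquet(outloc)       # keep existing data, add the new rows
# outdata.write.mode('overwrite').parquet(outloc)  # or: replace the existing folder on every run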

Resources