How can I resolve a PySpark error on SparkSession? - apache-spark

Please help me resolve the problem I'm facing.
Here is the code that I run in a Jupyter notebook:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('abc').getOrCreate()
I get the following error:
RuntimeError Traceback (most recent call last)
C:\Users\MOHANA~1\AppData\Local\Temp/ipykernel_18536/4172048100.py in <module>
1 from pyspark.sql import SparkSession
----> 2 spark = SparkSession.builder.appName('abc').getOrCreate()
RuntimeError: Java gateway process exited before sending its port number
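This error usually means PySpark could not launch a JVM at all, most often because no Java is installed or JAVA_HOME points somewhere that does not exist. A minimal sketch of checking the environment before building the session; the Java path below is only an example and should be replaced with your own install:

import os
from pyspark.sql import SparkSession

# Example path only; point JAVA_HOME at your actual JDK/JRE install.
java_home = r"C:\Program Files\Java\jdk1.8.0_202"
if not os.path.isdir(java_home):
    # Fail early with a readable message instead of the opaque gateway error.
    raise EnvironmentError("No Java installation found at " + java_home)
os.environ["JAVA_HOME"] = java_home

spark = SparkSession.builder.appName('abc').getOrCreate()
print(spark.version)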

Related

Read XML file in Jupyter notebook

I tried to read an XML file in a Jupyter notebook and got the error below.
import os
from os import environ
environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.13:0.4.1 pyspark-shell'
df = spark.read.format('xml').option('rowTag', 'row').load('test.xml')
The error:
Py4JJavaError: An error occurred while calling o251.load.
: java.lang.ClassNotFoundException: Failed to find data source: xml. Please find packages at http://spark.apache.org/third-party-projects.html
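PYSPARK_SUBMIT_ARGS only takes effect if it is set before the JVM is launched, i.e. before the SparkSession is created, and the spark-xml artifact has to match the Scala build of your Spark. A minimal sketch that declares the package at session-build time instead; the coordinates are an example and must match your Spark's Scala version:

from pyspark.sql import SparkSession

# Declare the spark-xml dependency when the session is built, before any read.
# Example coordinates; pick the artifact matching your Spark's Scala version.
spark = (SparkSession.builder
         .appName('xml-read')
         .config('spark.jars.packages', 'com.databricks:spark-xml_2.12:0.14.0')
         .getOrCreate())

df = spark.read.format('xml').option('rowTag', 'row').load('test.xml')
df.printSchema()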

Unable to run simple pyspark program

I am trying to create an RDD from a file located on my local system. I am using the Eclipse IDE on Windows. Below is my code:
from pyspark import SparkConf
from pyspark import SparkContext
conf = SparkConf().setAppName("FirstProgram").setMaster("Local")
sc = SparkContext("local")
load_data=sc.textFile("E://words.txt")
load_data.collect()
Below is my config:
1) Spark 2.4.4
2) Python 3.7.4
I tried variations of the file path but had no luck. Below are the contents of the project directory where the file is stored, yet I am still unable to read it. However, I can read that file outside Spark via the same path, i.e. E:/words.txt, so I think there is some problem with the SparkContext object.
Directory of E:\workspacewa\FirstSparkProject\Sample
10/12/2019 07:33 PM <DIR> .
10/12/2019 07:33 PM <DIR> ..
10/12/2019 07:34 PM 119 FileRead.py
10/12/2019 06:21 PM 269 FirstSpark.py
02/02/2019 09:22 PM 82 words.txt
10/12/2019 01:22 PM 0 __init__.py
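Two details in the code above commonly cause exactly this kind of failure: the master URL is case-sensitive ("Local" is not a valid master; use "local" or "local[*]"), and on Windows an explicit file:/// URI removes any ambiguity about the filesystem scheme. A minimal sketch under those assumptions, using the path from the question:

from pyspark import SparkConf, SparkContext

# Master URLs are lowercase and case-sensitive; "local[*]" uses all local cores.
conf = SparkConf().setAppName("FirstProgram").setMaster("local[*]")
sc = SparkContext(conf=conf)

# An explicit file:/// URI makes the local filesystem scheme unambiguous on Windows.
load_data = sc.textFile("file:///E:/words.txt")
print(load_data.collect())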
I reinstalled everything and am now facing a new error, shown below:
Exception ignored in: <function Popen.__del__ at 0x000001924C5434C8>
Traceback (most recent call last):
File "C:\Users\siddh\AppData\Local\Programs\Python\Python37\lib\subprocess.py", line 860, in __del__
self._internal_poll(_deadstate=_maxsize)
File "C:\Users\siddh\AppData\Local\Programs\Python\Python37\lib\subprocess.py", line 1216, in _internal_poll
if _WaitForSingleObject(self._handle, 0) == _WAIT_OBJECT_0:
OSError: [WinError 6] The handle is invalid
I cleaned up all the temp files, reinstalled everything, and tried once again with the code below, and it works like a charm.
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from pyspark import SparkConf
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))
load_data=sc.textFile("E://long_sample.txt")
load_data.foreach(print)
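One note on that last line: foreach runs on the executors, so outside local mode its print output lands in the executor logs rather than the driver console. A small driver-side alternative, assuming the same file:

from pyspark import SparkConf, SparkContext

# getOrCreate reuses an existing context if one is already running.
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))
load_data = sc.textFile("E://long_sample.txt")

# take(n) pulls a small sample back to the driver, so print output is visible there.
for line in load_data.take(10):
    print(line)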

Using xgboost in azure ML environment

I cannot manage to use the xgboost package in the Azure Machine Learning Studio interpreter. I am trying to import an xgboost model that I trained, in order to deploy it here. But it seems that my package is not set up correctly, because I cannot access certain functions, particularly "xgboost.sklearn".
My model of course uses xgboost.sklearn.something to do the classification.
I tried to install the package following two different methods:
from a tar.gz archive, as here:
How can certain python libraries be imported in azure ML?Like the line import humanfriendly gives error
and also as a clean package with the sandbox, as here:
Uploading xgboost to azure machine learning: %1 is not a valid Win32 application\r\nProcess returned with non-zero exit code 1
import sys
import sklearn
import pandas as pd
import pickle
import xgboost

def azureml_main(dataframe1 = None, dataframe2 = None):
    sys.path.insert(0, ".\\Script Bundle")
    model = pickle.load(open(".\\Script Bundle\\xgboost\\XGBv1.pkl", 'rb'))
    dataframe1, dftrue = filterdata(dataframe1)
    # one processing step
    pred = predictorV1(dataframe1, dftrue)
    dataframe1['Y'] = pred
    return dataframe1
Here is the error I get:
Error 0085: The following error occurred during script evaluation, please view the output log for more information:
---------- Start of error message from Python interpreter ----------
Caught exception while executing function: Traceback (most recent call last):
File "C:\server\invokepy.py", line 199, in batch
odfs = mod.azureml_main(*idfs)
File "C:\temp\1098d8754a52467181a9509ed16de8ac.py", line 89, in azureml_main
model = pickle.load(open(".\\Script Bundle\\xgboost\\XGBv1.pkl", 'rb'))
ImportError: No module named 'xgboost.sklearn'
Process returned with non-zero exit code 1
---------- End of error message from Python interpreter ----------
Start time: UTC 05/22/2019 13:11:08
End time: UTC 05/22/2019 13:11:49
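The ImportError is raised while unpickling: pickle has to import xgboost.sklearn to rebuild the model object, and it can only find the bundled copy if the Script Bundle directory is on sys.path before xgboost is imported. A hedged sketch of that ordering, not the exact script from the question; the filterdata/predictorV1 helpers are replaced by a placeholder model.predict call:

import sys
import pickle

def azureml_main(dataframe1=None, dataframe2=None):
    # Make the bundled xgboost importable *before* the import, so pickle can
    # resolve xgboost.sklearn when it reconstructs the model object.
    sys.path.insert(0, ".\\Script Bundle")
    import xgboost  # imported after the path fix on purpose

    with open(".\\Script Bundle\\xgboost\\XGBv1.pkl", 'rb') as f:
        model = pickle.load(f)

    # Placeholder for the question's filterdata/predictorV1 steps.
    dataframe1['Y'] = model.predict(dataframe1)
    return dataframe1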

pyspark 1.6.0 write to parquet gives "path exists" error

I read in parquet file(s) from different folders, taking e.g. February of this year (one folder = one day):
indata = sqlContext.read.parquet('/data/myfolder/201602*')
do some very simple grouping and aggregation
outdata = indata.groupby(...).agg()
and want to store the result again:
outdata.write.parquet(outloc)
Here is how I run the script from bash:
spark-submit
--master yarn-cluster
--num-executors 16
--executor-cores 4
--driver-memory 8g
--executor-memory 16g
--files /etc/hive/conf/hive-site.xml
--driver-java-options
-XX:MaxPermSize=512m
spark_script.py
This generates multiple jobs (is that the right term?). The first job runs successfully. Subsequent jobs fail with the following error:
Traceback (most recent call last):
File "spark_generate_maps.py", line 184, in <module>
outdata.write.parquet(outloc)
File "/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 471, in parquet
File "/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 51, in deco
pyspark.sql.utils.AnalysisException: u'path OBFUSCATED_PATH_THAT_I_CLEANED_BEFORE_SUBMIT already exists.;'
When I give only one folder as input, this works fine.
So it seems the first job creates the folder, and all subsequent jobs fail to write into that folder. Why?
Just in case this could help anybody, here are my imports:
from pyspark import SparkContext, SparkConf, SQLContext
from pyspark.sql import HiveContext
from pyspark.sql.types import *
from pyspark.sql.functions import udf, collect_list, countDistinct, count
import pyspark.sql.functions as func
from pyspark.sql.functions import lit
import numpy as np
import sys
import math
config:
conf = SparkConf().setAppName('spark-compute-maps').setMaster('yarn-cluster')
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)
Your question amounts to: "why does Spark iterate over the input folders, but apply the default write mode, which does not make sense in that context?"
Quoting the Spark 1.6 Python API:
mode(saveMode) - Specifies the behavior when data or table already exists.
Options include:
append: Append contents of this DataFrame to existing data.
overwrite: Overwrite existing data.
error: Throw an exception if data already exists.
ignore: Silently ignore this operation if data already exists.
I think outdata.write.mode('append').parquet(outloc) is worth a try.
You should add the mode option to your code:
outdata.write.mode('append').parquet(outloc)
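For reference, the mode choices look like this in code; outdata and outloc are the names from the question, and whether 'append' or 'overwrite' is right depends on whether later jobs should add to or replace the first job's output:

# The default is mode('error'): fail if outloc already exists.
outdata.write.mode('append').parquet(outloc)       # add new files alongside existing ones
# outdata.write.mode('overwrite').parquet(outloc)  # replace whatever is already there
# outdata.write.mode('ignore').parquet(outloc)     # silently skip the write if outloc exists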

PySpark - The system cannot find the path specified

Hi,
I have run Spark multiple times before (Spyder IDE).
Today I got this error (the code is the same):
import os
import sys
from py4j.java_gateway import JavaGateway
gateway = JavaGateway()
os.environ['SPARK_HOME']="C:/Apache/spark-1.6.0"
os.environ['JAVA_HOME']="C:/Program Files/Java/jre1.8.0_71"
sys.path.append("C:/Apache/spark-1.6.0/python/")
os.environ['HADOOP_HOME']="C:/Apache/spark-1.6.0/winutils/"
from pyspark import SparkContext
from pyspark import SparkConf
conf = SparkConf()
The system cannot find the path specified.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Apache\spark-1.6.0\python\pyspark\conf.py", line 104, in __init__
SparkContext._ensure_initialized()
File "C:\Apache\spark-1.6.0\python\pyspark\context.py", line 245, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway()
File "C:\Apache\spark-1.6.0\python\pyspark\java_gateway.py", line 94, in launch_gateway
raise Exception("Java gateway process exited before sending the driver its port number")
Exception: Java gateway process exited before sending the driver its port number
What is going wrong?
Thanks for your time.
OK... Someone installed a new Java version in the virtual machine. I only changed this:
os.environ['JAVA_HOME']="C:/Program Files/Java/jre1.8.0_91"
and it works again.
Thanks for your time.
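Since the root cause was JAVA_HOME pointing at a Java install that had been replaced, a small sanity check before touching SparkConf makes this failure obvious the next time Java is upgraded; the path is the one from the answer above:

import os

# Check that JAVA_HOME actually exists before PySpark tries to launch the JVM.
java_home = "C:/Program Files/Java/jre1.8.0_91"
if not os.path.isdir(java_home):
    raise EnvironmentError("JAVA_HOME not found: " + java_home)
os.environ['JAVA_HOME'] = java_home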
