Creating an Apache Spark RDD of a Class in PySpark - apache-spark

I have to convert some Scala code to Python.
The Scala code converts an RDD of strings into an RDD of a case class. The code is as follows:
case class Stock(
  stockName: String,
  dt: String,
  openPrice: Double,
  highPrice: Double,
  lowPrice: Double,
  closePrice: Double,
  adjClosePrice: Double,
  volume: Double
)
def parseStock(inputRecord: String, stockName: String): Stock = {
  val column = inputRecord.split(",")
  Stock(
    stockName,
    column(0),
    column(1).toDouble,
    column(2).toDouble,
    column(3).toDouble,
    column(4).toDouble,
    column(5).toDouble,
    column(6).toDouble)
}
def parseRDD(rdd: RDD[String], stockName: String): RDD[Stock] = {
  val header = rdd.first
  rdd.filter((data) => {
    data(0) != header(0) && !data.contains("null")
  })
    .map(data => parseStock(data, stockName))
}
Is it possible to implement this in PySpark? I tried the following code, and it raised an error:
from dataclasses import dataclass
@dataclass(eq=True, frozen=True)
class Stock:
    stockName: str
    dt: str
    openPrice: float
    highPrice: float
    lowPrice: float
    closePrice: float
    adjClosePrice: float
    volume: float
def parseStock(inputRecord, stockName):
    column = inputRecord.split(",")
    return Stock(stockName,
                 column[0],
                 column[1],
                 column[2],
                 column[3],
                 column[4],
                 column[5],
                 column[6])
def parseRDD(rdd, stockName):
    header = rdd.first()
    res = rdd.filter(lambda data: data != header).map(lambda data: parseStock(data, stockName))
    return res
Error
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 21.0 failed 1 times, most recent failure: Lost task 0.0 in stage 21.0 (TID 31, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/content/spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 364, in main
func, profiler, deserializer, serializer = read_command(pickleSer, infile)
File "/content/spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 69, in read_command
command = serializer._read_with_length(file)
File "/content/spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 173, in _read_with_length
return self.loads(obj)
File "/content/spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 587, in loads
return pickle.loads(obj, encoding=encoding)
AttributeError: Can't get attribute 'main' on <module 'builtins' (built-in)>

The Dataset API is not available for Python.
"A Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.). The Dataset API is available in Scala and Java. Python does not have the support for the Dataset API. But due to Python’s dynamic nature, many of the benefits of the Dataset API are already available (i.e. you can access the field of a row by name naturally row.columnName). The case for R is similar."
https://spark.apache.org/docs/latest/sql-programming-guide.html
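As a rough PySpark counterpart to the Scala parseRDD in the question, the sketch below uses a typing.NamedTuple in place of the case class, so fields stay accessible by name as the quoted documentation describes. The snake_case function names are illustrative choices, and the sketch assumes every data row has at least seven comma-separated fields. Like the asker's Python attempt (and unlike the Scala version, which compares only the first character), it drops rows equal to the whole header line. It is not claimed to resolve the pickling error shown in the question.

```python
from typing import NamedTuple


class Stock(NamedTuple):
    """Python stand-in for the Scala case class."""
    stockName: str
    dt: str
    openPrice: float
    highPrice: float
    lowPrice: float
    closePrice: float
    adjClosePrice: float
    volume: float


def parse_stock(input_record: str, stock_name: str) -> Stock:
    """Split one CSV line and convert the six numeric columns to float."""
    column = input_record.split(",")
    return Stock(stock_name, column[0], *(float(value) for value in column[1:7]))


def parse_rdd(rdd, stock_name: str):
    """Drop the header row and rows containing "null", then parse the rest.

    `rdd` is assumed to be an RDD of strings, e.g. one read from a text file.
    """
    header = rdd.first()
    return (
        rdd.filter(lambda line: line != header and "null" not in line)
           .map(lambda line: parse_stock(line, stock_name))
    )
```

Given such an RDD of lines, parse_rdd(rdd, "AAPL").take(5) would return the first five parsed records (the stock name here is only an example).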

Related

Getting java.lang.IllegalArgumentException after running Pandas UDF on a Spark DF

I'm new to PySpark and I've been trying to run inference from a saved model. I've tried almost every way to print the output returned by the pandas UDF, but it always gives me an error, most often java.lang.IllegalArgumentException. I don't understand what I'm doing wrong, and it would be great if someone could help me debug the pandas UDF code, if that is where the problem lies.
Code:
saved_xgb.load_model("baseline_xgb.json")
sc = spark.sparkContext
broadcast_model = sc.broadcast(saved_xgb)
prediction_set = spark.sql("select *, floor(rand()*100) as prediction_group from test_dataset_view")
flattened_schema = StructType(prediction_set.schema.fields + [StructField('pred_label', FloatType(), nullable=True),
                                                              StructField('score', FloatType(), nullable=True)])
@pandas_udf(flattened_schema, PandasUDFType.GROUPED_MAP)
def model_scoring(pdf):
    pdf = pd.DataFrame(pdf)
    pdf = pdf.replace('\\N', np.nan)
    y_test = pdf['label']
    X_test = pdf.drop(['label', 'prediction_group'], axis=1)
    y_pred = broadcast_model.value.predict(X_test)
    auc_test = roc_auc_score(y_test, broadcast_model.value.predict_proba(X_test)[:, 1])
    pdf['pred_label'] = y_pred
    pdf['score'] = auc_test
    return pdf
prediction_set.groupby(F.col('prediction_group')).apply(model_scoring).show()
Error Stack from stdout on Spark Submit:
XGBoost Version1.5.2
PySpark Version :2.3.2.3.1.0.0-78
PySpark Version :2.3.2.3.1.0.0-78
Traceback (most recent call last):
File "baseline_h2o.py", line 130, in <module>
prediction_set.groupby(F.col('prediction_group')).apply(model_scoring).show()
File ".../pyspark.zip/pyspark/sql/dataframe.py", line 350, in show
File ".../py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File ".../pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File ".../py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o1241.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 603, executor 26): java.lang.IllegalArgumentException
at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNextMessage(MessageChannelReader.java:64)
I have tried printing the Spark DataFrame in multiple ways, including displaying only the 'score' column, but all of them lead to errors. I even tried writing the DataFrame out directly, and that gave the same error. Can anyone guide me to a solution?
Edit: Solved
I had to set the environment variable inside the pandas_udf function for PyArrow to work properly on the executors.
import os
os.environ['ARROW_PRE_0_15_IPC_FORMAT']='1'
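To illustrate where that assignment goes, here is a minimal sketch with a stand-in function body. The scoring logic is elided, and in the real job this function would be the one decorated with pandas_udf. Spark's documentation describes this variable as a compatibility setting for PyArrow 0.15.0 and later with Spark 2.3.x and 2.4.x.

```python
import os


def model_scoring(pdf):
    # Set inside the function so the assignment runs in the Python worker
    # on the executor, not only in the driver process.
    os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
    # ... the scoring logic from the question would follow here ...
    return pdf
```

Setting spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT=1 in the Spark configuration may be an alternative worth trying, though that is not what the fix above did.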

Working with hdf files in Databricks cluster

I am trying to create a simple .hdf file in the Databricks environment. I can create the file on the driver, but when the same code is executed with rdd.map(), it throws the following exception.
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 287.0 failed 4 times, most recent failure: Lost task 3.3 in stage 287.0 (TID 1080) (10.67.238.26 executor 1): org.apache.spark.api.python.PythonException: **'RuntimeError: Can't decrement id ref count (file write failed:** time = Tue Nov 1 11:38:44 2022
, filename = '/dbfs/mnt/demo.hdf', file descriptor = 7, errno = 95, error message = '**Operation not supported**', buf = 0x30ab998, total write size = 40, bytes this sub-write = 40, bytes actually written = 18446744073709551615, offset = 0)', from <command-1134257524229717>, line 13. Full traceback below:
I can write the same file on the worker and copy it back to /dbfs/mnt. However, I am looking for a way to write or modify .hdf files stored in the /dbfs/mnt location directly from the worker nodes.
def create_hdf_file_tmp(x):
    import numpy as np
    import h5py, os, subprocess
    import pandas as pd
    dummy_data = [1,2,3,4,5]
    df_data = pd.DataFrame(dummy_data, columns=['Numbers'])
    with h5py.File('/dbfs/mnt/demo.hdf', 'w') as f:
        dset = f.create_dataset('default', data = df_data) # write to .hdf file
    return True
def read_hdf_file(file_name, dname):
    import numpy as np
    import h5py, os, subprocess
    import pandas as pd
    with h5py.File(file_name, 'r') as f:
        data = f[dname][:] # copy into memory; the dataset handle is invalid once the file closes
        print(data[:5])
    return data
#driver code
rdd = spark.sparkContext.parallelize(['/dbfs/mnt/demo.hdf'])
result = rdd.map(lambda x: create_hdf_file_tmp(x)).collect()
Above is the minimal code that I am trying to run in a Databricks notebook with 1 driver and 2 worker nodes.
Judging by the error message (Operation not supported), writing an HDF file most probably uses random writes, which DBFS does not support (see the DBFS local API limitations in the docs). You will need to write the file to a local disk and then move it to the DBFS mount. But it will work only on the driver node...
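The write-locally-then-move pattern can be sketched as follows. The helper name write_via_local_disk and its write_fn callback are hypothetical; with h5py, the callback would open the given local path with h5py.File(path, 'w'). The sketch uses only the standard library, and demo_writer is a stand-in that performs a backwards seek to mimic a random write.

```python
import os
import shutil
import tempfile


def write_via_local_disk(dest_path, write_fn, suffix=".hdf"):
    """Let `write_fn` write to a local temp file, then copy it to `dest_path`.

    Random-access writes happen on local disk; the destination only
    receives one sequential copy of the finished file.
    """
    fd, tmp_path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # write_fn reopens the path itself
    try:
        write_fn(tmp_path)
        shutil.copyfile(tmp_path, dest_path)
    finally:
        os.remove(tmp_path)
    return dest_path


def demo_writer(path):
    """Stand-in for an HDF5 writer: writes, seeks back, and overwrites."""
    with open(path, "wb") as f:
        f.write(b"0000data")
        f.seek(0)
        f.write(b"HEAD")
```

In the question's setting, dest_path would be something like '/dbfs/mnt/demo.hdf', and the callback would contain the create_dataset call.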

Spark: KMeans - ValueError: could not convert string to float: '0\x00\x00'

I'm trying to run k-means on the MNIST dataset. I have a way that works, but it is the dirtiest hack.
My input is a CSV file with 784 (=28*28) values between 0 and 255 per row.
My first attempt was to just read my CSV input, convert it to sparse arrays, and fit the model with the data. However, the code below throws an error:
data = spark.read.csv("datasets/mnist_test.csv").rdd\
    .map(lambda x : [float(value) for value in x])\
    .toDF()
features = VectorAssembler(inputCols=data.columns, outputCol="features").transform(data).select("features")
kmeans = KMeans().setK(10).setSeed(1)
model = kmeans.fit(features)
Output:
22/01/25 10:44:41 ERROR Executor: Exception in task 4.0 in stage 113.0 (TID 131)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 619, in main
process()
File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 611, in process
serializer.dump_stream(out_iter, outfile)
File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 259, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/opt/spark/python/lib/pyspark.zip/pyspark/util.py", line 74, in wrapper
return f(*args, **kwargs)
File "/tmp/ipykernel_74/2701217925.py", line 2, in <lambda>
File "/tmp/ipykernel_74/2701217925.py", line 2, in <listcomp>
ValueError: could not convert string to float: '0\x00\x00'
...
My next attempt was to save the DataFrame in LIBSVM format and then load it again:
MLUtils.saveAsLibSVMFile(features.rdd.map(lambda x: LabeledPoint(0, MLLibVectors.fromML(x.features))), './libsvm')
data2 = MLUtils.loadLibSVMFile(spark.sparkContext, './libsvm').toDF()
kmeans = KMeans().setK(10).setSeed(1)
model = kmeans.fit(features)
Output:
22/01/25 10:47:06 ERROR Instrumentation: java.lang.IllegalArgumentException: requirement failed: Column features must be of type equal to one of the following types: [struct<type:tinyint,size:int,indices:array<int>,values:array<double>>, array<double>, array<float>] but was actually of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.
My final (working) attempt was to load the exported partitions with the spark.read.format("libsvm").load(...) method:
data3 = spark.read.format("libsvm").load("libsvm/part-00000").select("features")
data3arr = list()
for i in range(5):
    data3arr.append(spark.read.format("libsvm").load("libsvm/part-0000"+str(i)).select("features"))
data3cpl = data3arr[0]
for i in data3arr[1:]:
    data3cpl = data3cpl.union(i) # union() returns a new DataFrame, so reassign it
kmeans = KMeans().setK(10).setSeed(1)
model = kmeans.fit(data3cpl)
When I inspect them, the DataFrames look quite similar in structure. Only features gives me an error on .show():
features.printSchema()
features.show(1,False)
data2.printSchema()
data2.show(1,False)
data3cpl.printSchema()
data3cpl.show(1,False)
Output:
root
|-- features: vector (nullable = true)
Traceback (most recent call last):
File "/opt/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 186, in manager
File "/opt/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 74, in worker
File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 663, in main
if read_int(infile) == SpecialLengths.END_OF_STREAM:
File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 564, in read_int
raise EOFError
EOFError

|features |

|(784,[202,203,204,205,206,207,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,258,259,260,261,262,263,264,265,266,267,268,269,270,271,272,273,291,292,293,294,295,296,297,298,299,300,301,326,327,328,329,353,354,355,356,381,382,383,384,408,409,410,411,436,437,438,439,463,464,465,466,491,492,493,518,519,520,521,545,546,547,548,572,573,574,575,576,600,601,602,603,627,628,629,630,631,655,656,657,658,682,683,684,685,686,710,711,712,713,714,738,739,740,741],[84.0,185.0,159.0,151.0,60.0,36.0,222.0,254.0,254.0,254.0,254.0,241.0,198.0,198.0,198.0,198.0,198.0,198.0,198.0,198.0,170.0,52.0,67.0,114.0,72.0,114.0,163.0,227.0,254.0,225.0,254.0,254.0,254.0,250.0,229.0,254.0,254.0,140.0,17.0,66.0,14.0,67.0,67.0,67.0,59.0,21.0,236.0,254.0,106.0,83.0,253.0,209.0,18.0,22.0,233.0,255.0,83.0,129.0,254.0,238.0,44.0,59.0,249.0,254.0,62.0,133.0,254.0,187.0,5.0,9.0,205.0,248.0,58.0,126.0,254.0,182.0,75.0,251.0,240.0,57.0,19.0,221.0,254.0,166.0,3.0,203.0,254.0,219.0,35.0,38.0,254.0,254.0,77.0,31.0,224.0,254.0,115.0,1.0,133.0,254.0,254.0,52.0,61.0,242.0,254.0,254.0,52.0,121.0,254.0,254.0,219.0,40.0,121.0,254.0,207.0,18.0])|

only showing top 1 row
root
|-- features: vector (nullable = true)
|-- label: double (nullable = true)

|features |label|

|(778,[202,203,204,205,206,207,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,258,259,260,261,262,263,264,265,266,267,268,269,270,271,272,273,291,292,293,294,295,296,297,298,299,300,301,326,327,328,329,353,354,355,356,381,382,383,384,408,409,410,411,436,437,438,439,463,464,465,466,491,492,493,518,519,520,521,545,546,547,548,572,573,574,575,576,600,601,602,603,627,628,629,630,631,655,656,657,658,682,683,684,685,686,710,711,712,713,714,738,739,740,741],[84.0,185.0,159.0,151.0,60.0,36.0,222.0,254.0,254.0,254.0,254.0,241.0,198.0,198.0,198.0,198.0,198.0,198.0,198.0,198.0,170.0,52.0,67.0,114.0,72.0,114.0,163.0,227.0,254.0,225.0,254.0,254.0,254.0,250.0,229.0,254.0,254.0,140.0,17.0,66.0,14.0,67.0,67.0,67.0,59.0,21.0,236.0,254.0,106.0,83.0,253.0,209.0,18.0,22.0,233.0,255.0,83.0,129.0,254.0,238.0,44.0,59.0,249.0,254.0,62.0,133.0,254.0,187.0,5.0,9.0,205.0,248.0,58.0,126.0,254.0,182.0,75.0,251.0,240.0,57.0,19.0,221.0,254.0,166.0,3.0,203.0,254.0,219.0,35.0,38.0,254.0,254.0,77.0,31.0,224.0,254.0,115.0,1.0,133.0,254.0,254.0,52.0,61.0,242.0,254.0,254.0,52.0,121.0,254.0,254.0,219.0,40.0,121.0,254.0,207.0,18.0])|0.0 |

only showing top 1 row
root
|-- features: vector (nullable = true)

|features |

|(776,[202,203,204,205,206,207,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,258,259,260,261,262,263,264,265,266,267,268,269,270,271,272,273,291,292,293,294,295,296,297,298,299,300,301,326,327,328,329,353,354,355,356,381,382,383,384,408,409,410,411,436,437,438,439,463,464,465,466,491,492,493,518,519,520,521,545,546,547,548,572,573,574,575,576,600,601,602,603,627,628,629,630,631,655,656,657,658,682,683,684,685,686,710,711,712,713,714,738,739,740,741],[84.0,185.0,159.0,151.0,60.0,36.0,222.0,254.0,254.0,254.0,254.0,241.0,198.0,198.0,198.0,198.0,198.0,198.0,198.0,198.0,170.0,52.0,67.0,114.0,72.0,114.0,163.0,227.0,254.0,225.0,254.0,254.0,254.0,250.0,229.0,254.0,254.0,140.0,17.0,66.0,14.0,67.0,67.0,67.0,59.0,21.0,236.0,254.0,106.0,83.0,253.0,209.0,18.0,22.0,233.0,255.0,83.0,129.0,254.0,238.0,44.0,59.0,249.0,254.0,62.0,133.0,254.0,187.0,5.0,9.0,205.0,248.0,58.0,126.0,254.0,182.0,75.0,251.0,240.0,57.0,19.0,221.0,254.0,166.0,3.0,203.0,254.0,219.0,35.0,38.0,254.0,254.0,77.0,31.0,224.0,254.0,115.0,1.0,133.0,254.0,254.0,52.0,61.0,242.0,254.0,254.0,52.0,121.0,254.0,254.0,219.0,40.0,121.0,254.0,207.0,18.0])|

only showing top 1 row
Can anyone tell me how to properly convert my data so I can feed it into my kmeans fit?
I'm answering because I had a similar issue and didn't find a solution in this post.
In my case, the problem was caused by a filter operation on a DataFrame.
I solved it by calling cache() on that DataFrame.
In this case, then, one possible solution is to try caching the RDD:
data = spark.read.csv("datasets/mnist_test.csv").rdd\
    .map(lambda x : [float(value) for value in x]).cache()\
    .toDF()
features = VectorAssembler(inputCols=data.columns, outputCol="features").transform(data).select("features")
kmeans = KMeans().setK(10).setSeed(1)
model = kmeans.fit(features)
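Independently of caching, the error message shows that the failing value is the string '0\x00\x00', that is, a digit followed by NUL characters. If the input really contains such padding (an assumption; the cause is not established in this thread), a conversion helper that strips NULs before calling float could be sketched as below. The names to_float and parse_row are hypothetical.

```python
def to_float(raw: str) -> float:
    """Convert a CSV field to float, ignoring NUL characters and whitespace."""
    return float(raw.replace("\x00", "").strip())


def parse_row(row):
    """Convert every field of one CSV row (any iterable of strings)."""
    return [to_float(field) for field in row]
```

In the pipeline above, parse_row would take the place of the lambda passed to .map().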

Facing issue with integrating code with Aws glue code, ray and pyspark

I am facing the following exception. I have tried various approaches, but it is still not resolved.
The exception is raised during parallel distributed processing with the Ray library: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
Traceback (most recent call last):
File "etl_engine_ray.py", line 148, in <module>
print(perform_etl(**requested_data))
File "etl_engine_ray.py", line 138, in perform_etl
futures = [process_etl.remote(each, uid, integration_id, time_stamp) for each in data]
File "etl_engine_ray.py", line 138, in <listcomp>
futures = [process_etl.remote(each, uid, integration_id, time_stamp) for each in data]
File "/home/glue_user/.local/lib/python3.7/site-packages/ray/remote_function.py", line 124, in _remote_proxy
return self._remote(args=args, kwargs=kwargs)
File "/home/glue_user/.local/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py", line 295, in _invocation_remote_span
return method(self, args, kwargs, *_args, **_kwargs)
File "/home/glue_user/.local/lib/python3.7/site-packages/ray/remote_function.py", line 263, in _remote
self._pickled_function = pickle.dumps(self._function)
File "/home/glue_user/.local/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 73, in dumps
cp.dump(obj)
File "/home/glue_user/.local/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 620, in dump
return Pickler.dump(self, obj)
File "/home/glue_user/spark/python/pyspark/context.py", line 362, in __getnewargs__
"It appears that you are attempting to reference SparkContext from a broadcast "
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
import os
import time
from pyspark import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import SelectFields
import ray
import settings
sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
@ray.remote
def process_etl(path:str, uid: str, integration_id: str, time_stamp: int):
    parquet_path = None  # stays None if an exception occurs before the path is built
    try:
        dynamic_df = glue_context.create_dynamic_frame_from_options(
            connection_type = settings.CONNECTION_TYPE,
            connection_options={
                'paths':[path],
                'recurse':True,
                'groupFiles': settings.S3_GROUP_FILES,
                'groupSize': settings.S3_GROUP_SIZE},
            format='json',
            format_options={"jsonPath": "*"}
        )
        # Select only the required columns
        selected_data = SelectFields.apply(
            frame = dynamic_df,
            paths=['partner_script_id', 'details', 'registered_installation_id', 'type']
        )
        # Build the output file path
        file_name = os.path.basename(path).split('.')[0]
        parquet_path = f'{settings.S3_BUCKET_PATH}/{integration_id}/{uid}/{time_stamp}/{file_name}.parquet'
        # If a custom pipeline is available for this file, apply it
        if file_name in settings.CUSTOM_ETL_PIPELINE:
            selected_data = settings.CUSTOM_ETL_PIPELINE.get(file_name)(selected_data)
        # Write the data into the bucket in Parquet format
        glue_context.write_dynamic_frame_from_options(
            selected_data,
            connection_type=settings.CONNECTION_TYPE,
            connection_options={'path': parquet_path},
            format='parquet',
            format_options = {
                "compression": "snappy",
                'blockSize': settings.BLOCK_SIZE,
                'pageSize': settings.PAGE_SIZE}
        )
    except Exception as error:
        print(f'Exception in perform_etl is {error}')
    return parquet_path
def perform_etl(uid: str, integration_id: str, type: str, data: list) -> dict:
    start_time = time.time()  # used for the elapsed-time printout below
    time_stamp = int(start_time)
    futures = [process_etl.remote(each, uid, integration_id, time_stamp) for each in data]
    # a = sc.parallelize(data)
    # d = a.map(lambda each: process_etl.remote(each, uid, integration_id, time_stamp)).collect()
    # print(d)
    final_data = ray.get(futures)
    print(time.time() - start_time)
    return final_data
if __name__ == '__main__':
    print(perform_etl(**requested_data))
I have done a lot of research but still have not found the root cause. Please help me resolve this.

Zeppelin/Spark: org.apache.spark.SparkException: Cannot run program "/usr/bin/": error=13, no permission

I am trying to get a basic regression to run with Zeppelin 0.7.2 and Spark 2.1.1 on Debian 9. Both are "installed" in /usr/local/, that is, /usr/local/zeppelin/ and /usr/local/spark. Zeppelin also knows the correct SPARK_HOME. First I load the data:
%spark.pyspark
from sqlalchemy import create_engine #sql query
import pandas as pd #sql query
from pyspark import SparkContext #Spark DataFrame
from pyspark.sql import SQLContext #Spark DataFrame
# database connection and sql query
pdf = pd.read_sql("select col1, col2, col3 from table", create_engine('mysql+mysqldb://user:pass@host:3306/db').connect())
print(pdf.size) # size of pandas dataFrame
# convert pandas dataFrame into spark dataFrame
sdf = SQLContext(SparkContext.getOrCreate()).createDataFrame(pdf)
sdf.printSchema()# what does the spark dataFrame look like?
Fine, it works, and I get the output: a size of 46977 and three columns:
46977
root
|-- col1: double (nullable = true)
|-- col2: double (nullable = true)
|-- col3: date (nullable = true)
Ok, now I want to do the regression:
%spark.pyspark
# do a linear regression with sparks ml libs
# https://community.intersystems.com/post/machine-learning-spark-and-cach%C3%A9
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler
# choose several inputCols and transform the "Features" column(s) into the correct vector format
vectorAssembler = VectorAssembler(inputCols=["col1"], outputCol="features")
data=vectorAssembler.transform(sdf)
print(data)
# Split the data into 70% training and 30% test sets.
trainingData,testData = data.randomSplit([0.7, 0.3], 0.0)
print(trainingData)
# Configure the model.
lr = LinearRegression().setFeaturesCol("features").setLabelCol("col2").setMaxIter(10)
## Train the model using the training data.
lrm = lr.fit(trainingData)
## Run the test data through the model and display its predictions for PetalLength.
#predictions = lrm.transform(testData)
#predictions.show()
But when calling lr.fit(trainingData), I get errors in the console (and in Zeppelin's log files). The error seems to occur while starting Spark: Cannot run program "/usr/bin/": error=13, Keine Berechtigung (German for "permission denied"). I wonder what is supposed to be started in /usr/bin/, since I only use the path /usr/local/.
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-4001144784380663394.py", line 367, in <module>
raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-4001144784380663394.py", line 360, in <module>
exec(code, _zcUserQueryNameSpace)
File "<stdin>", line 9, in <module>
File "/usr/local/spark/python/pyspark/ml/base.py", line 64, in fit
return self._fit(dataset)
File "/usr/local/spark/python/pyspark/ml/wrapper.py", line 236, in _fit
java_model = self._fit_java(dataset)
File "/usr/local/spark/python/pyspark/ml/wrapper.py", line 233, in _fit_java
return self._java_obj.fit(dataset._jdf)
File "/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/local/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o70.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): **java.io.IOException: Cannot run program "/usr/bin/": error=13, Keine Berechtigung**
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:65)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:116)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:128)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
It was a configuration error in Zeppelin's conf/zeppelin-env.sh. I had the following line uncommented, which caused the error; after commenting it out, it works:
#export PYSPARK_PYTHON=/usr/bin/ # path to the python command. must be the same path on the driver(Zeppelin) and all workers.
So the problem was that PYSPARK_PYTHON was not set correctly: it pointed to the directory /usr/bin/ rather than to a Python binary. Now the default Python binary is used. I found the solution by searching for the string /usr/bin/ with grep -R "/usr/bin/" in the Zeppelin base directory and checking the matching files.
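A quick sanity check for this kind of misconfiguration can be sketched in Python. The function names below are hypothetical; the check only tests whether a configured value, such as the one in PYSPARK_PYTHON, resolves to an executable file rather than a directory, using shutil.which from the standard library.

```python
import os
import shutil


def resolves_to_executable(value: str) -> bool:
    """Return True if `value` names an executable file.

    Accepts either a path or a bare command name looked up on PATH;
    a directory such as "/usr/bin/" does not qualify.
    """
    return shutil.which(value) is not None


def check_pyspark_python() -> str:
    """Describe the current PYSPARK_PYTHON setting, if any."""
    value = os.environ.get("PYSPARK_PYTHON")
    if value is None:
        return "PYSPARK_PYTHON is not set; the default Python is used"
    if resolves_to_executable(value):
        return f"PYSPARK_PYTHON={value} looks usable"
    return f"PYSPARK_PYTHON={value} is not an executable file"
```

Running check_pyspark_python() in the same environment as the interpreter process reports whether the variable points at something that can actually be started.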

Resources