Creating an Apache Spark RDD of a Class in PySpark - apache-spark
I have to convert Scala code to Python. The Scala code converts an RDD of strings into an RDD of a case class. The code is as follows:
case class Stock(
  stockName: String,
  dt: String,
  openPrice: Double,
  highPrice: Double,
  lowPrice: Double,
  closePrice: Double,
  adjClosePrice: Double,
  volume: Double
)

def parseStock(inputRecord: String, stockName: String): Stock = {
  val column = inputRecord.split(",")
  Stock(
    stockName,
    column(0),
    column(1).toDouble,
    column(2).toDouble,
    column(3).toDouble,
    column(4).toDouble,
    column(5).toDouble,
    column(6).toDouble)
}

def parseRDD(rdd: RDD[String], stockName: String): RDD[Stock] = {
  val header = rdd.first
  rdd.filter((data) => {
    data(0) != header(0) && !data.contains("null")
  })
    .map(data => parseStock(data, stockName))
}
Is it possible to implement this in PySpark? I tried the following code and it gave an error.
from dataclasses import dataclass

@dataclass(eq=True, frozen=True)
class Stock:
    stockName: str
    dt: str
    openPrice: float
    highPrice: float
    lowPrice: float
    closePrice: float
    adjClosePrice: float
    volume: float

def parseStock(inputRecord, stockName):
    column = inputRecord.split(",")
    return Stock(stockName,
                 column[0],
                 column[1],
                 column[2],
                 column[3],
                 column[4],
                 column[5],
                 column[6])

def parseRDD(rdd, stockName):
    header = rdd.first()
    res = rdd.filter(lambda data: data != header).map(lambda data: parseStock(data, stockName))
    return res
Error
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 21.0 failed 1 times, most recent failure: Lost task 0.0 in stage 21.0 (TID 31, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/content/spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 364, in main
func, profiler, deserializer, serializer = read_command(pickleSer, infile)
File "/content/spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 69, in read_command
command = serializer._read_with_length(file)
File "/content/spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 173, in _read_with_length
return self.loads(obj)
File "/content/spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 587, in loads
return pickle.loads(obj, encoding=encoding)
AttributeError: Can't get attribute 'main' on <module 'builtins' (built-in)>
The Dataset API is not available for Python.
"A Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.). The Dataset API is available in Scala and Java. Python does not have the support for the Dataset API. But due to Python’s dynamic nature, many of the benefits of the Dataset API are already available (i.e. you can access the field of a row by name naturally row.columnName). The case for R is similar."
https://spark.apache.org/docs/latest/sql-programming-guide.html
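Although there is no typed Dataset in PySpark, the same parsing can be expressed with plain RDD transformations and pyspark.sql.Row objects, which can then be turned into a DataFrame so fields are accessible by name. The following is a minimal sketch, not the poster's fixed code: it uses Row instead of the dataclass, converts the numeric columns to float as the Scala version does, and the file name and ticker in the commented usage lines are placeholders.

# A minimal PySpark sketch: parse each CSV line into a Row, skip the header
# and records containing "null", and build a DataFrame from the result.
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

def parse_stock(input_record, stock_name):
    column = input_record.split(",")
    return Row(
        stockName=stock_name,
        dt=column[0],
        openPrice=float(column[1]),
        highPrice=float(column[2]),
        lowPrice=float(column[3]),
        closePrice=float(column[4]),
        adjClosePrice=float(column[5]),
        volume=float(column[6]),
    )

def parse_rdd(rdd, stock_name):
    header = rdd.first()
    return (rdd.filter(lambda data: data != header and "null" not in data)
               .map(lambda data: parse_stock(data, stock_name)))

# Hypothetical usage; "stocks.csv" and "AAPL" are placeholders:
# stocks = parse_rdd(spark.sparkContext.textFile("stocks.csv"), "AAPL")
# stocks_df = spark.createDataFrame(stocks)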
Related
Getting java.lang.IllegalArgumentException after running Pandas UDF on a Spark DF
I'm new to PySpark and I'm trying to do inference from a saved model here. I've tried almost every way to print the returned output from the pandas UDF, but it always gives me an error, most often java.lang.IllegalArgumentException. I don't understand what I'm doing wrong here, and it would be great if someone could help me debug the pandas UDF code, if that is where the problem lies.
Code:
saved_xgb.load_model("baseline_xgb.json")
sc = spark.sparkContext
broadcast_model = sc.broadcast(saved_xgb)
prediction_set = spark.sql("select *, floor(rand()*100) as prediction_group from test_dataset_view")
flattened_schema = StructType(prediction_set.schema.fields +
                              [StructField('pred_label', FloatType(), nullable=True),
                               StructField('score', FloatType(), nullable=True)])

@pandas_udf(flattened_schema, PandasUDFType.GROUPED_MAP)
def model_scoring(pdf):
    pdf = pd.DataFrame(pdf)
    pdf = pdf.replace('\\N', np.nan)
    y_test = pdf['label']
    X_test = pdf.drop(['label', 'prediction_group'], axis=1)
    y_pred = broadcast_model.value.predict(X_test)
    auc_test = roc_auc_score(y_test, broadcast_model.value.predict_proba(X_test)[:, 1])
    pdf['pred_label'] = y_pred
    pdf['score'] = auc_test
    return pdf

prediction_set.groupby(F.col('prediction_group')).apply(model_scoring).show()
Error stack from stdout on spark-submit:
XGBoost Version1.5.2
PySpark Version :2.3.2.3.1.0.0-78
Traceback (most recent call last):
  File "baseline_h2o.py", line 130, in <module>
    prediction_set.groupby(F.col('prediction_group')).apply(model_scoring).show()
  File ".../pyspark.zip/pyspark/sql/dataframe.py", line 350, in show
  File ".../py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File ".../pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File ".../py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o1241.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 603, executor 26): java.lang.IllegalArgumentException
  at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
  at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNextMessage(MessageChannelReader.java:64)
I have tried printing out the Spark DF in multiple ways and displaying just the 'score' column alone, but all of them lead to errors. I even tried saving and writing that DataFrame directly, and that gave the same error. Can anyone guide me to solve this?
Edit: Solved. I had to set the environment variable inside the pandas_udf function for PyArrow to work properly on the executors:
import os
os.environ['ARROW_PRE_0_15_IPC_FORMAT'] = '1'
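Following the poster's own edit, a minimal sketch of where that variable would be set is shown below, assuming Spark 2.3 with PyArrow 0.15 or newer. The schema and the UDF body are placeholders rather than the real scoring logic from the question; the point is only that the flag is set inside the UDF so it takes effect in each executor's Python worker, not just on the driver.

# Illustrative grouped-map pandas UDF with the executor-side Arrow flag from the edit.
import os
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, FloatType

# Minimal stand-in for the question's flattened_schema.
schema = StructType([StructField('prediction_group', FloatType()),
                     StructField('pred_label', FloatType()),
                     StructField('score', FloatType())])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def model_scoring(pdf):
    # Set inside the UDF, as described in the edit above, so each worker picks it up.
    os.environ['ARROW_PRE_0_15_IPC_FORMAT'] = '1'
    pdf['pred_label'] = 0.0   # placeholder for the real XGBoost scoring
    pdf['score'] = 0.0
    # The returned frame must match the declared schema columns.
    return pdf[['prediction_group', 'pred_label', 'score']]

# df.groupby('prediction_group').apply(model_scoring).show()  # as in the question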
Working with hdf files in Databricks cluster
I am trying to create a simple .hdf file in the Databricks environment. I can create the file on the driver, but when the same code is executed with rdd.map(), it throws the following exception.
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 287.0 failed 4 times, most recent failure: Lost task 3.3 in stage 287.0 (TID 1080) (10.67.238.26 executor 1): org.apache.spark.api.python.PythonException: 'RuntimeError: Can't decrement id ref count (file write failed: time = Tue Nov 1 11:38:44 2022, filename = '/dbfs/mnt/demo.hdf', file descriptor = 7, errno = 95, error message = 'Operation not supported', buf = 0x30ab998, total write size = 40, bytes this sub-write = 40, bytes actually written = 18446744073709551615, offset = 0)', from <command-1134257524229717>, line 13. Full traceback below:
I can write the same file on the worker and copy the file back to /dbfs/mnt. However, I was looking for a way to write/modify the .hdf files stored in the /dbfs/mnt location directly from the worker nodes.
def create_hdf_file_tmp(x):
    import numpy as np
    import h5py, os, subprocess
    import pandas as pd
    dummy_data = [1, 2, 3, 4, 5]
    df_data = pd.DataFrame(dummy_data, columns=['Numbers'])
    with h5py.File('/dbfs/mnt/demo.hdf', 'w') as f:
        dset = f.create_dataset('default', data=df_data)  # write to .hdf file
    return True

def read_hdf_file(file_name, dname):
    import numpy as np
    import h5py, os, subprocess
    import pandas as pd
    with h5py.File(file_name, 'r') as f:
        data = f[dname]
        print(data[:5])
    return data

# driver code
rdd = spark.sparkContext.parallelize(['/dbfs/mnt/demo.hdf'])
result = rdd.map(lambda x: create_hdf_file_tmp(x)).collect()
Above is the minimal code that I am trying to run in the Databricks notebook with 1 driver and 2 worker nodes.
From the error message Operation not supported: most probably, when writing the HDF file, the library uses something like random writes, which aren't supported by DBFS (see the DBFS local API limitations in the docs). You will need to write the file to a local disk and then move it to the DBFS mount. But that will work only on the driver node...
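A minimal sketch of that workaround, assuming the /dbfs/mnt/demo.hdf target from the question; the local temporary path and the shutil-based copy are illustrative choices rather than part of the original answer.

# Write the HDF5 file to the driver's local disk first, then copy it onto the
# DBFS mount; h5py only ever sees a normal local filesystem this way.
import shutil
import h5py
import pandas as pd

def create_hdf_file_via_local_tmp(local_path="/tmp/demo.hdf",
                                  dbfs_path="/dbfs/mnt/demo.hdf"):
    df_data = pd.DataFrame([1, 2, 3, 4, 5], columns=['Numbers'])
    with h5py.File(local_path, 'w') as f:
        f.create_dataset('default', data=df_data)   # random writes happen locally
    shutil.copy(local_path, dbfs_path)               # sequential copy to the mount
    return dbfs_path

# create_hdf_file_via_local_tmp()  # runs on the driver only, per the answer above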
Spark: KMeans - ValueError: could not convert string to float: '0\x00\x00'
I'm trying to create a KMeans model for the MNIST dataset. I have a way that works, but it is the dirtiest hack. My input is a CSV file with 784 (= 28*28) values between 0 and 255 per row.
My first attempt was to just read my CSV input, convert it to sparse arrays and fit the model with the data. However, the code below throws an error:
data = spark.read.csv("datasets/mnist_test.csv").rdd\
    .map(lambda x : [float(str) for str in x])\
    .toDF()
features = VectorAssembler(inputCols=data.columns, outputCol="features").transform(data).select("features")
kmeans = KMeans().setK(10).setSeed(1)
model = kmeans.fit(features)
Output:
22/01/25 10:44:41 ERROR Executor: Exception in task 4.0 in stage 113.0 (TID 131)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 619, in main
    process()
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 611, in process
    serializer.dump_stream(out_iter, outfile)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 259, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/opt/spark/python/lib/pyspark.zip/pyspark/util.py", line 74, in wrapper
    return f(*args, **kwargs)
  File "/tmp/ipykernel_74/2701217925.py", line 2, in <lambda>
  File "/tmp/ipykernel_74/2701217925.py", line 2, in <listcomp>
ValueError: could not convert string to float: '0\x00\x00'
...
My next attempt was to save the dataframe in libsvm format and then load it again:
MLUtils.saveAsLibSVMFile(features.rdd.map(lambda x: LabeledPoint(0, MLLibVectors.fromML(x.features))), './libsvm')
data2 = MLUtils.loadLibSVMFile(spark.sparkContext, './libsvm').toDF()
kmeans = KMeans().setK(10).setSeed(1)
model = kmeans.fit(features)
Output:
22/01/25 10:47:06 ERROR Instrumentation: java.lang.IllegalArgumentException: requirement failed: Column features must be of type equal to one of the following types: [struct<type:tinyint,size:int,indices:array<int>,values:array<double>>, array<double>, array<float>] but was actually of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.
My final (working) attempt was to load the exported partitions with the spark.read.format("libsvm").load(...) method:
data3 = spark.read.format("libsvm").load("libsvm/part-00000").select("features")
data3arr = list()
for i in range(5):
    data3arr.append(spark.read.format("libsvm").load("libsvm/part-0000"+str(i)).select("features"))
data3cpl = data3arr[0]
for i in data3arr[1:]:
    data3cpl.union(i)
kmeans = KMeans().setK(10).setSeed(1)
model = kmeans.fit(data3cpl)
If I look at the structure, the dataframes look quite similar in their structure.
Only that features gives me an error on .show():
features.printSchema()
features.show(1, False)
data2.printSchema()
data2.show(1, False)
data3cpl.printSchema()
data3cpl.show(1, False)
Output:
root
 |-- features: vector (nullable = true)
Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 186, in manager
  File "/opt/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 74, in worker
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 663, in main
    if read_int(infile) == SpecialLengths.END_OF_STREAM:
  File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 564, in read_int
    raise EOFError
EOFError
|features|
|(784,[202,203,204,...,740,741],[84.0,185.0,159.0,...,207.0,18.0])|
only showing top 1 row
root
 |-- features: vector (nullable = true)
 |-- label: double (nullable = true)
|features|label|
|(778,[202,203,204,...,740,741],[84.0,185.0,159.0,...,207.0,18.0])|0.0|
only showing top 1 row
root
 |-- features: vector (nullable = true)
|features|
|(776,[202,203,204,...,740,741],[84.0,185.0,159.0,...,207.0,18.0])|
only showing top 1 row
Can anyone tell me how to properly convert my data so I can feed it into my kmeans fit?
I'm answering because I had a similar issue and this post didn't have any solution. In my case, the problem was caused by a filter operation on a DataFrame, and I solved it by calling cache() on that DataFrame. So one possible solution here is to try to cache the RDD:
data = spark.read.csv("datasets/mnist_test.csv").rdd\
    .map(lambda x : [float(str) for str in x]).cache()\
    .toDF()
features = VectorAssembler(inputCols=data.columns, outputCol="features").transform(data).select("features")
kmeans = KMeans().setK(10).setSeed(1)
model = kmeans.fit(features)
Facing an issue integrating code with AWS Glue, Ray and PySpark
I am facing the following exception. I have tried various ways but it is not resolved. The exception occurs during parallel distributed processing with the Ray library.
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
Traceback (most recent call last):
  File "etl_engine_ray.py", line 148, in <module>
    print(perform_etl(**requested_data))
  File "etl_engine_ray.py", line 138, in perform_etl
    futures = [process_etl.remote(each, uid, integration_id, time_stamp) for each in data]
  File "etl_engine_ray.py", line 138, in <listcomp>
    futures = [process_etl.remote(each, uid, integration_id, time_stamp) for each in data]
  File "/home/glue_user/.local/lib/python3.7/site-packages/ray/remote_function.py", line 124, in _remote_proxy
    return self._remote(args=args, kwargs=kwargs)
  File "/home/glue_user/.local/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py", line 295, in _invocation_remote_span
    return method(self, args, kwargs, *_args, **_kwargs)
  File "/home/glue_user/.local/lib/python3.7/site-packages/ray/remote_function.py", line 263, in _remote
    self._pickled_function = pickle.dumps(self._function)
  File "/home/glue_user/.local/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 73, in dumps
    cp.dump(obj)
  File "/home/glue_user/.local/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 620, in dump
    return Pickler.dump(self, obj)
  File "/home/glue_user/spark/python/pyspark/context.py", line 362, in __getnewargs__
    "It appears that you are attempting to reference SparkContext from a broadcast "
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
from pyspark import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import SelectFields
import ray
import settings

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)

@ray.remote
def process_etl(path: str, uid: str, integration_id: str, time_stamp: int):
    try:
        dynamic_df = glue_context.create_dynamic_frame_from_options(
            connection_type=settings.CONNECTION_TYPE,
            connection_options={
                'paths': [path],
                'recurse': True,
                'groupFiles': settings.S3_GROUP_FILES,
                'groupSize': settings.S3_GROUP_SIZE},
            format='json',
            format_options={"jsonPath": "*"}
        )
        # select only those column names that are required
        selected_data = SelectFields.apply(
            frame=dynamic_df,
            paths=['partner_script_id', 'details', 'registered_installation_id', 'type']
        )
        # Build the output file name and path
        file_name = os.path.basename(path).split('.')[0]
        parquet_path = f'{settings.S3_BUCKET_PATH}/{integration_id}/{uid}/{time_stamp}/{file_name}.parquet'
        # If a custom pipeline is available for this file, use it
        if file_name in settings.CUSTOM_ETL_PIPELINE:
            selected_data = settings.CUSTOM_ETL_PIPELINE.get(file_name)(selected_data)
        # Write data into the bucket in parquet format
        glue_context.write_dynamic_frame_from_options(
            selected_data,
            connection_type=settings.CONNECTION_TYPE,
            connection_options={'path': parquet_path},
            format='parquet',
            format_options={
                "compression": "snappy",
                'blockSize': settings.BLOCK_SIZE,
                'pageSize': settings.PAGE_SIZE}
        )
    except Exception as error:
        print(f'Exception in perform_etl is {error}')
    return parquet_path

def perform_etl(uid: str, integration_id: str, type: str, data: list) -> dict:
    time_stamp = int(time.time())
    futures = [process_etl.remote(each, uid, integration_id, time_stamp) for each in data]
    # a = sc.parallelize(data)
    # d = a.map(lambda each: process_etl.remote(each, uid, integration_id, time_stamp)).collect()
    # print(d)
    final_data = ray.get(futures)
    print(time.time() - start_time)
    return final_data

if __name__ == '__main__':
    print(perform_etl(**requested_data))

I have done lots of R&D but still have not found the root cause. It is not resolved; please help me out with this.
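The exception points at SPARK-5063: anything that wraps a SparkContext (sc, glue_context, or frames created from them) cannot be captured by a function that Ray pickles and ships to workers. The sketch below only illustrates that constraint with plain data and hypothetical paths; it is not a verified fix for the Glue job above.

# Illustrative Ray remote task that closes over no Spark/Glue objects, so it can
# be pickled and dispatched; any Spark work has to stay on the driver side.
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
def process_path(path: str) -> str:
    # Only plain Python data enters this function; objects holding a
    # SparkContext must not be referenced here.
    return path.upper()

paths = ["s3://bucket/a.json", "s3://bucket/b.json"]   # hypothetical inputs
print(ray.get([process_path.remote(p) for p in paths]))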
Zeppelin/Spark: org.apache.spark.SparkException: Cannot run program "/usr/bin/": error=13, no permission
I am trying to get a basic regression to run with Zeppelin 0.7.2 and Spark 2.1.1 on Debian 9. Both are "installed" in /usr/local/, that means /usr/local/zeppelin/ and /usr/local/spark. Zeppelin also knows the correct SPARK_HOME. First I load the data:
%spark.pyspark
from sqlalchemy import create_engine  # sql query
import pandas as pd  # sql query
from pyspark import SparkContext  # Spark DataFrame
from pyspark.sql import SQLContext  # Spark DataFrame

# database connection and sql query
pdf = pd.read_sql("select col1, col2, col3 from table", create_engine('mysql+mysqldb://user:pass@host:3306/db').connect())
print(pdf.size)  # size of pandas dataFrame

# convert pandas dataFrame into spark dataFrame
sdf = SQLContext(SparkContext.getOrCreate()).createDataFrame(pdf)
sdf.printSchema()  # what does the spark dataFrame look like?
Fine, it works and I get the output with 46977 rows and three cols:
46977
root
 |-- col1: double (nullable = true)
 |-- col2: double (nullable = true)
 |-- col3: date (nullable = true)
OK, now I want to do the regression:
%spark.pyspark
# do a linear regression with Spark's ml libs
# https://community.intersystems.com/post/machine-learning-spark-and-cach%C3%A9
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler

# choose several inputCols and transform the "Features" column(s) into the correct vector format
vectorAssembler = VectorAssembler(inputCols=["col1"], outputCol="features")
data = vectorAssembler.transform(sdf)
print(data)

# Split the data into 70% training and 30% test sets.
trainingData, testData = data.randomSplit([0.7, 0.3], 0.0)
print(trainingData)

# Configure the model.
lr = LinearRegression().setFeaturesCol("features").setLabelCol("col2").setMaxIter(10)

## Train the model using the training data.
lrm = lr.fit(trainingData)

## Run the test data through the model and display its predictions for PetalLength.
#predictions = lrm.transform(testData)
#predictions.show()
But while doing lr.fit(trainingData), I get errors in the console (and in Zeppelin's log files). The error seems to occur while starting Spark: Cannot run program "/usr/bin/": error=13, Keine Berechtigung (permission denied). I wonder what is supposed to be started from /usr/bin/, since I only use the path /usr/local/.
Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-4001144784380663394.py", line 367, in <module>
    raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-4001144784380663394.py", line 360, in <module>
    exec(code, _zcUserQueryNameSpace)
  File "<stdin>", line 9, in <module>
  File "/usr/local/spark/python/pyspark/ml/base.py", line 64, in fit
    return self._fit(dataset)
  File "/usr/local/spark/python/pyspark/ml/wrapper.py", line 236, in _fit
    java_model = self._fit_java(dataset)
  File "/usr/local/spark/python/pyspark/ml/wrapper.py", line 233, in _fit_java
    return self._java_obj.fit(dataset._jdf)
  File "/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/local/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
    format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o70.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.io.IOException: Cannot run program "/usr/bin/": error=13, Keine Berechtigung
  at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
  at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
  at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89)
  at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:65)
  at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:116)
  at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:128)
  at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
It was a configuration error in Zeppelin's conf/zeppelin-env.sh. I had the following line uncommented, which caused the error; I have now commented it out again and it works:
#export PYSPARK_PYTHON=/usr/bin/  # path to the python command. must be the same path on the driver(Zeppelin) and all workers.
So the problem was that PYSPARK_PYTHON was not set to a valid path; now the default python binary is used. I found the cause by searching for the string /usr/bin/ with grep -R "/usr/bin/" in the Zeppelin base directory and checking the matching files.
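The commented line notes that PYSPARK_PYTHON must be the same path on the driver and all workers. A small diagnostic sketch along those lines (not part of the original answer, and only usable once the interpreter starts again) can confirm which Python executable each side actually uses:

# Compare the Python executable used by the driver with the ones the executors
# launch; a mismatched or invalid PYSPARK_PYTHON shows up as "Cannot run program".
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

driver_python = sys.executable
worker_pythons = (sc.parallelize(range(sc.defaultParallelism), sc.defaultParallelism)
                    .map(lambda _: sys.executable)
                    .distinct()
                    .collect())

print("driver :", driver_python)
print("workers:", worker_pythons)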