Getting java.lang.IllegalArgumentException after running Pandas UDF on a Spark DF - python-3.x

I'm new to PySpark and I've been trying to run inference from a saved model here. I've tried almost every way to print the returned output from the pandas UDF, but it always gives me an error, most often java.lang.IllegalArgumentException. I don't understand what I'm doing wrong here, and it would be great if someone could help me debug the pandas UDF code, if that is where the problem lies.
Code:
saved_xgb.load_model("baseline_xgb.json")
sc = spark.sparkContext
broadcast_model = sc.broadcast(saved_xgb)
prediction_set = spark.sql("select *, floor(rand()*100) as prediction_group from test_dataset_view")

flattened_schema = StructType(prediction_set.schema.fields +
                              [StructField('pred_label', FloatType(), nullable=True),
                               StructField('score', FloatType(), nullable=True)])

@pandas_udf(flattened_schema, PandasUDFType.GROUPED_MAP)
def model_scoring(pdf):
    pdf = pd.DataFrame(pdf)
    pdf = pdf.replace('\\N', np.nan)
    y_test = pdf['label']
    X_test = pdf.drop(['label', 'prediction_group'], axis=1)
    y_pred = broadcast_model.value.predict(X_test)
    auc_test = roc_auc_score(y_test, broadcast_model.value.predict_proba(X_test)[:, 1])
    pdf['pred_label'] = y_pred
    pdf['score'] = auc_test
    return pdf

prediction_set.groupby(F.col('prediction_group')).apply(model_scoring).show()
Error Stack from stdout on Spark Submit:
XGBoost Version: 1.5.2
PySpark Version :2.3.2.3.1.0.0-78
PySpark Version :2.3.2.3.1.0.0-78
Traceback (most recent call last):
File "baseline_h2o.py", line 130, in <module>
prediction_set.groupby(F.col('prediction_group')).apply(model_scoring).show()
File ".../pyspark.zip/pyspark/sql/dataframe.py", line 350, in show
File ".../py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File ".../pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File ".../py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o1241.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 603, executor 26): java.lang.IllegalArgumentException
at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNextMessage(MessageChannelReader.java:64)
I have tried printing out the Spark DataFrame in multiple ways and displaying just the 'score' column alone, but all of them lead to errors. I even tried writing that DataFrame out directly, and that gave the same error. Can anyone guide me on how to solve this?
Edit: Solved
I had to set the environment variable inside the pandas_udf function for PyArrow to work properly on the executors.
import os
os.environ['ARROW_PRE_0_15_IPC_FORMAT']='1'
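For reference, a minimal sketch of how the workaround looks when applied to the UDF above (same names as in the question; setting the variable inside the function body means it takes effect in the Python worker process on each executor, not only on the driver):

import os

@pandas_udf(flattened_schema, PandasUDFType.GROUPED_MAP)
def model_scoring(pdf):
    # pyarrow >= 0.15 changed the Arrow IPC format; Spark 2.3.x / 2.4.x
    # still expects the old one, so ask pyarrow to keep using it.
    os.environ['ARROW_PRE_0_15_IPC_FORMAT'] = '1'
    ...  # scoring logic unchanged from the question
    return pdf

An alternative that may also work is setting the same variable through the Spark configuration (e.g. spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT=1), so every executor gets it without modifying the UDF.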

Related

Creating an Apache Spark RDD of a Class in PySpark

I have to convert Scala code to Python.
The Scala code converts an RDD of strings into an RDD of a case class. The code is as follows:
case class Stock(
  stockName: String,
  dt: String,
  openPrice: Double,
  highPrice: Double,
  lowPrice: Double,
  closePrice: Double,
  adjClosePrice: Double,
  volume: Double
)

def parseStock(inputRecord: String, stockName: String): Stock = {
  val column = inputRecord.split(",")
  Stock(
    stockName,
    column(0),
    column(1).toDouble,
    column(2).toDouble,
    column(3).toDouble,
    column(4).toDouble,
    column(5).toDouble,
    column(6).toDouble)
}

def parseRDD(rdd: RDD[String], stockName: String): RDD[Stock] = {
  val header = rdd.first
  rdd.filter((data) => {
    data(0) != header(0) && !data.contains("null")
  })
    .map(data => parseStock(data, stockName))
}
Is it possible to implement this in PySpark? I tried to use the following code and it gave an error:
from dataclasses import dataclass

@dataclass(eq=True, frozen=True)
class Stock:
    stockName: str
    dt: str
    openPrice: float
    highPrice: float
    lowPrice: float
    closePrice: float
    adjClosePrice: float
    volume: float

def parseStock(inputRecord, stockName):
    column = inputRecord.split(",")
    return Stock(stockName,
                 column[0],
                 column[1],
                 column[2],
                 column[3],
                 column[4],
                 column[5],
                 column[6])

def parseRDD(rdd, stockName):
    header = rdd.first()
    res = rdd.filter(lambda data: data != header).map(lambda data: parseStock(data, stockName))
    return res
Error
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 21.0 failed 1 times, most recent failure: Lost task 0.0 in stage 21.0 (TID 31, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/content/spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 364, in main
func, profiler, deserializer, serializer = read_command(pickleSer, infile)
File "/content/spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 69, in read_command
command = serializer._read_with_length(file)
File "/content/spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 173, in _read_with_length
return self.loads(obj)
File "/content/spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 587, in loads
return pickle.loads(obj, encoding=encoding)
AttributeError: Can't get attribute 'main' on <module 'builtins' (built-in)>
The Dataset API is not available for Python.
"A Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.). The Dataset API is available in Scala and Java. Python does not have the support for the Dataset API. But due to Python’s dynamic nature, many of the benefits of the Dataset API are already available (i.e. you can access the field of a row by name naturally row.columnName). The case for R is similar."
https://spark.apache.org/docs/latest/sql-programming-guide.html
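For illustration only (not a fix for the AttributeError above), here is a minimal PySpark sketch of the same parsing logic using pyspark.sql.Row in place of the case class; the field names and the float conversions are assumptions carried over from the Scala version:

from pyspark.sql import Row

def parse_stock(input_record, stock_name):
    # Split one CSV line and convert the numeric fields, mirroring the Scala code.
    column = input_record.split(",")
    return Row(stockName=stock_name,
               dt=column[0],
               openPrice=float(column[1]),
               highPrice=float(column[2]),
               lowPrice=float(column[3]),
               closePrice=float(column[4]),
               adjClosePrice=float(column[5]),
               volume=float(column[6]))

def parse_rdd(rdd, stock_name):
    # Drop the header line and records containing "null", then parse the rest.
    header = rdd.first()
    return (rdd.filter(lambda line: line != header and "null" not in line)
               .map(lambda line: parse_stock(line, stock_name)))

An RDD of Row objects can then be turned into a DataFrame with spark.createDataFrame(parsed_rdd) if a tabular view is needed.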

KeyError when training a model with pyspark.ml on AWS EMR with data from s3 bucket

I am training a machine learning model with pyspark.ml on .json data from an s3 bucket on AWS EMR, in a JupyterLab notebook. The bucket is not mine, but access seems fine, since data preprocessing, feature engineering, etc. all work. But when I call cv.fit(training_data), the training process runs until it is almost finished (according to the status bar) and then throws an error:
Exception in thread cell_monitor-64:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/threading.py", line 917, in _bootstrap_inner
self.run()
File "/opt/conda/lib/python3.7/threading.py", line 865, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.7/site-packages/awseditorssparkmonitoringwidget-1.0-py3.7.egg/awseditorssparkmonitoringwidget/cellmonitor.py", line 178, in cell_monitor
job_binned_stages[job_id][stage_id] = all_stages[stage_id]
KeyError: 6571
I could not find any information on this error yet. What is going on?
This is my pipeline:
train, test = clean_df.randomSplit([0.8, 0.2], seed=42)

va1 = VectorAssembler(inputCols="vars", outputCol="vars")
scaler = StandardScaler(inputCol="to_scale", outputCol="scaled_features")
va2 = VectorAssembler(inputCols=["more_vars", "scaled_features"], outputCol="features")
gbt = GBTClassifier()
pipeline = Pipeline(stages=[va1, scaler, va2, gbt])

paramGrid = ParamGridBuilder()\
    .addGrid(gbt.maxDepth, [2, 5])\
    .addGrid(gbt.maxIter, [10, 100])\
    .build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(metricName='f1'),
                          numFolds=3)

cvModel = crossval.fit(train)
Second, I have a hunch that this might be resolved in Python 3.8; can I install Python 3.8 on EMR?
We are having the same problem. We are using Hyperopt, and we just added a try/except to avoid this issue. The error keeps appearing, but the job keeps running.
It seems the error only affects the progress bars shown for the Spark jobs in the EMR notebook; the pipeline itself completes.
# Defining the hyperopt objective
def objective(params):
    try:
        # Pipeline here with VectorAssembler and GBT
        return {'loss': -metrics_val.areaUnderPR,
                'status': STATUS_OK,
                'output_dict': output_dict}
    except Exception as e:
        print("## Exception", e)
        return {'loss': 0,
                'status': STATUS_FAIL,
                'except': e,
                'output_dict': {'params': params}}
The Exception we got is the following:
Exception in thread cell_monitor-18:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/opt/conda/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.7/site-packages/awseditorssparkmonitoringwidget-1.0-py3.7.egg/awseditorssparkmonitoringwidget/cellmonitor.py", line 178, in cell_monitor
job_binned_stages[job_id][stage_id] = all_stages[stage_id]
KeyError: 2395
But we still got the output from all the Hyperopt trials:
100%|##########| 5/5 [34:36<00:00, 415.32s/trial, best loss: -0.3907675279893325]
This is not an error in your code. The KeyError references a stage number that does not exist; it is some kind of internal Spark monitoring error. However, it does not seem to affect the running of the application.

spark java.lang.stackoverflow logistic regression fit with large dataset

I am trying to fit a logistic regression model for a data set with 470 features and 10 million training instances. Here is a snippet of my code.
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import RFormula
formula = RFormula(formula = "label ~ .-classWeight")
bestregLambdaVal = 0.005
bestregAlphaVal = 0.01
lr = LogisticRegression(maxIter=1000, regParam=bestregLambdaVal, elasticNetParam=bestregAlphaVal,weightCol="classWeight")
pipeLineLr = Pipeline(stages = [formula, lr])
pipeLineFit = pipeLineLr.fit(mySparkDataFrame[featureColumnNameList + ['classWeight','label']])
I have also created a checkpoint directory,
sc.setCheckpointDir('checkpoint/')
as suggested here:
Spark gives a StackOverflowError when training using ALS
However, I get an error; here is a partial trace:
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 64, in fit
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 108, in _fit
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 64, in fit
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 265, in _fit
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 262, in _fit_java
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o383361.fit.
: java.lang.StackOverflowError
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468)
at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
I would also like to note that the 470 feature columns were iteratively added to the Spark DataFrame using withColumn().
So the mistake I was making is that, when checkpointing the DataFrame, I would only do:
mySparkDataFrame.checkpoint(eager=True)
The right way to do it is:
mySparkDataFrame = mySparkDataFrame.checkpoint(eager=True)
This is based on another question I had asked (and got an answer for) here:
pyspark rdd isCheckPointed() is false
Also, it is recommended to persist() the DataFrame before the checkpoint and to count() it after the checkpoint.
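Putting the pieces together, a minimal sketch of the pattern described above (assuming a SparkContext named sc and a writable checkpoint directory):

sc.setCheckpointDir('checkpoint/')

# persist() first so checkpointing does not recompute the whole lineage.
mySparkDataFrame = mySparkDataFrame.persist()

# checkpoint() returns a new DataFrame with a truncated lineage;
# it must be reassigned, it does not modify the original in place.
mySparkDataFrame = mySparkDataFrame.checkpoint(eager=True)

# A count() afterwards forces the checkpointed data to be materialized.
mySparkDataFrame.count()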

Zeppelin/Spark: org.apache.spark.SparkException: Cannot run program "/usr/bin/": error=13, no permission

I am trying to get a basic regression to run with Zeppelin 0.7.2 and Spark 2.1.1 on Debian 9. Both are installed under /usr/local/, that is /usr/local/zeppelin/ and /usr/local/spark/. Zeppelin also knows the correct SPARK_HOME. First I load the data:
%spark.pyspark
from sqlalchemy import create_engine #sql query
import pandas as pd #sql query
from pyspark import SparkContext #Spark DataFrame
from pyspark.sql import SQLContext #Spark DataFrame
# database connection and sql query
pdf = pd.read_sql("select col1, col2, col3 from table", create_engine('mysql+mysqldb://user:pass@host:3306/db').connect())
print(pdf.size) # size of pandas dataFrame
# convert pandas dataFrame into spark dataFrame
sdf = SQLContext(SparkContext.getOrCreate()).createDataFrame(pdf)
sdf.printSchema()# what does the spark dataFrame look like?
Fine, it works and I get the output with 46977 rows and three columns:
46977
root
|-- col1: double (nullable = true)
|-- col2: double (nullable = true)
|-- col3: date (nullable = true)
Ok, now I want to do the regression:
%spark.pyspark
# do a linear regression with sparks ml libs
# https://community.intersystems.com/post/machine-learning-spark-and-cach%C3%A9
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler
# choose several inputCols and transform the "Features" column(s) into the correct vector format
vectorAssembler = VectorAssembler(inputCols=["col1"], outputCol="features")
data=vectorAssembler.transform(sdf)
print(data)
# Split the data into 70% training and 30% test sets.
trainingData,testData = data.randomSplit([0.7, 0.3], 0.0)
print(trainingData)
# Configure the model.
lr = LinearRegression().setFeaturesCol("features").setLabelCol("col2").setMaxIter(10)
## Train the model using the training data.
lrm = lr.fit(trainingData)
## Run the test data through the model and display its predictions for PetalLength.
#predictions = lrm.transform(testData)
#predictions.show()
But while running lr.fit(trainingData), I get errors in the console (and in Zeppelin's log files). The error seems to occur while starting Spark: Cannot run program "/usr/bin/": error=13, Keine Berechtigung (permission denied). I wonder what is supposed to be started in /usr/bin/, since I only use paths under /usr/local/.
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-4001144784380663394.py", line 367, in <module>
raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-4001144784380663394.py", line 360, in <module>
exec(code, _zcUserQueryNameSpace)
File "<stdin>", line 9, in <module>
File "/usr/local/spark/python/pyspark/ml/base.py", line 64, in fit
return self._fit(dataset)
File "/usr/local/spark/python/pyspark/ml/wrapper.py", line 236, in _fit
java_model = self._fit_java(dataset)
File "/usr/local/spark/python/pyspark/ml/wrapper.py", line 233, in _fit_java
return self._java_obj.fit(dataset._jdf)
File "/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/local/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o70.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): **java.io.IOException: Cannot run program "/usr/bin/": error=13, Keine Berechtigung**
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:65)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:116)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:128)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
It was a configuration error in Zeppelin's conf/zeppelin-env.sh. There, I had the following line uncommented, which caused the error; I have now commented it out and it works:
#export PYSPARK_PYTHON=/usr/bin/ # path to the python command. must be the same path on the driver(Zeppelin) and all workers.
So the problem was that PYSPARK_PYTHON was not set to a correct path; with the line commented out, the default python binary is used. I found the cause by searching for the string /usr/bin/ with grep -R "/usr/bin/" in the Zeppelin base directory and checking the matching files.
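If a specific interpreter is wanted instead of the default, the variable has to point at a Python executable rather than a directory. A sketch of what the corrected line in conf/zeppelin-env.sh could look like (the exact path is an assumption and depends on the system):

# Full path to the python command; must be the same on the driver (Zeppelin) and all workers.
export PYSPARK_PYTHON=/usr/bin/python3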

reading json file in pyspark

I'm new to PySpark. Below is my JSON file format from Kafka.
{
    "header": {
        "platform":"atm",
        "version":"2.0"
    }
    "details":[
        {
            "abc":"3",
            "def":"4"
        },
        {
            "abc":"5",
            "def":"6"
        },
        {
            "abc":"7",
            "def":"8"
        }
    ]
}
How can I read the values of all "abc" and "def" entries in details and add them to a new list like [(1,2),(3,4),(5,6),(7,8)]? The new list will be used to create a Spark data frame. How can I do this in PySpark? I tried the code below.
parsed = messages.map(lambda (k,v): json.loads(v))
list = []
summed = parsed.map(lambda detail:list.append((String(['mcc']), String(['mid']), String(['dsrc']))))
output = summed.collect()
print output
It produces the error 'too many values to unpack'.
The error message below appears at the summed.collect() statement:
16/09/12 12:46:10 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
16/09/12 12:46:10 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
16/09/12 12:46:10 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
16/09/12 12:46:10 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/hdp/2.3.4.0-3485/spark/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/usr/hdp/2.3.4.0-3485/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/hdp/2.3.4.0-3485/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "", line 1, in
ValueError: too many values to unpack
First of all, the JSON is invalid: a comma is missing after the header object.
That being said, let's take this JSON:
{"header":{"platform":"atm","version":"2.0"},"details":[{"abc":"3","def":"4"},{"abc":"5","def":"6"},{"abc":"7","def":"8"}]}
This can be processed by:
>>> df = sqlContext.jsonFile('test.json')
>>> df.first()
Row(details=[Row(abc='3', def='4'), Row(abc='5', def='6'), Row(abc='7', def='8')], header=Row(platform='atm', version='2.0'))
>>> df = df.flatMap(lambda row: row['details'])
PythonRDD[38] at RDD at PythonRDD.scala:43
>>> df.collect()
[Row(abc='3', def='4'), Row(abc='5', def='6'), Row(abc='7', def='8')]
>>> df.map(lambda entry: (int(entry['abc']), int(entry['def']))).collect()
[(3, 4), (5, 6), (7, 8)]
Hope this helps!
import pyspark
from pyspark import SparkConf, SparkContext

# You can configure the SparkContext
conf = SparkConf()
conf.set('spark.local.dir', '/remote/data/match/spark')
conf.set('spark.sql.shuffle.partitions', '2100')
SparkContext.setSystemProperty('spark.executor.memory', '10g')
SparkContext.setSystemProperty('spark.driver.memory', '10g')
sc = SparkContext(appName='mm_exp', conf=conf)
sqlContext = pyspark.SQLContext(sc)
data = sqlContext.read.json('file.json')
I feel that he missed an important part of the read sequence. You have to initialize a SparkContext.
When you start a SparkContext, it also spins up a webUI on port 4040. The webUI can be accessed using http://localhost:4040. That is a useful place to check progress of all calculations.
Try this with the latest Spark version:
df = spark.read.json('test.json')
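Note that spark.read.json expects one JSON document per line by default; for a pretty-printed file like the one in the question (and with the missing comma fixed), the multiLine option may be needed:

# Assumption: the file contains a single JSON document spread over several lines.
df = spark.read.option("multiLine", "true").json("test.json")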
According to the information in the comments, each row in the messages RDD holds one line from the JSON file:
u'{',
u' "header": {',
u' "platform":"atm",'
Your code is failing in the following line:
parsed = messages.map(lambda (k,v): json.loads(v))
Your code takes a line like '{' and tries to convert it into a key and a value, and then execute json.loads(value).
It is clear that Python/Spark cannot split the single character '{' into a key-value pair.
The json.loads() call should be executed on a complete JSON data object.
This specific task might be accomplished more easily with pure Python, as shown in the sketch below.
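As an illustration of that last point, a minimal pure-Python sketch (assuming the whole document is available as one file, with the missing comma fixed, and using the field names from the question):

import json

# Parse the complete document at once so the parser sees a full JSON object.
with open("test.json") as f:
    doc = json.load(f)

# Collect the (abc, def) pairs from the "details" array as integers.
pairs = [(int(d["abc"]), int(d["def"])) for d in doc["details"]]
print(pairs)  # [(3, 4), (5, 6), (7, 8)]

# If needed, the list of tuples can then become a Spark DataFrame:
# spark.createDataFrame(pairs, ["abc", "def"])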
