Py4JError: An error occurred while calling o90.fit - apache-spark

I want to apply the random forest algorithm to a dataframe consisting of three columns: JournalID, an indexed label column (obtained using Spark's StringIndexer), and a feature vector. I used the code below to read the dataframe from a parquet file and apply a StringIndexer to the JournalID column to convert it to a categorical type.
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors
from pyspark.ml.linalg import VectorUDT
df = spark.read.parquet('JouID-UBTFIDFVectors-server22.parquet')
labelIndexer = StringIndexer(inputCol="journalid", outputCol="indexedLabel")
labelsDF = labelIndexer.fit(df)
df1 = labelsDF.transform(df)
# Convert each sparse vector in the raw features column to a dense vector (VectorUDT)
parse_ = udf(lambda l: Vectors.dense(l), VectorUDT())
df2 = df1.withColumn("featuresNew", parse_(df1["features"])).drop('features')
The new dataframe (df2) schema is as follows:
root
|-- journalid: string (nullable = true)
|-- indexedLabel: double (nullable = false)
|-- featuresNew: vector (nullable = true)
Then I split df2 into training and test sets and create a RandomForestClassifier object as below:
(trainingData, testData) = df2.randomSplit([0.8, 0.2])
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="featuresNew", numTrees=2)
Finally, I apply the fit() method to the trainingData obtained above.
rfModel = rf.fit(trainingData)
With this I am able to train the model on 100 instances of the input dataframe. However, over the whole training data, this line gives the following error.
----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 53652)
Traceback (most recent call last):
File "/data/sntps/code/conda3/lib/python3.6/socketserver.py", line 317, in _handle_request_noblock
self.process_request(request, client_address)
File "/data/sntps/code/conda3/lib/python3.6/socketserver.py", line 348, in process_request
self.finish_request(request, client_address)
File "/data/sntps/code/conda3/lib/python3.6/socketserver.py", line 361, in finish_request
self.RequestHandlerClass(request, client_address, self)
File "/data/sntps/code/conda3/lib/python3.6/socketserver.py", line 696, in __init__
self.handle()
File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/pyspark/accumulators.py", line 235, in handle
num_updates = read_int(self.rfile)
File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/pyspark/serializers.py", line 685, in read_int
raise EOFError
EOFError
----------------------------------------
ERROR:root:Exception while sending command.
Traceback (most recent call last):
File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1159, in send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 985, in send_command
response = connection.send_command(command)
File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1164, in send_command
"Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:41060)
Traceback (most recent call last):
File "/data/sntps/code/conda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2910, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-10-46d7488961c7>", line 1, in <module>
rfModel=rf.fit(trainingData)
File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/pyspark/ml/base.py", line 132, in fit
return self._fit(dataset)
File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/pyspark/ml/wrapper.py", line 288, in _fit
java_model = self._fit_java(dataset)
File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/pyspark/ml/wrapper.py", line 285, in _fit_java
return self._java_obj.fit(dataset._jdf)
File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 336, in get_return_value
format(target_id, ".", name))
py4j.protocol.Py4JError: An error occurred while calling o90.fit
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/data/sntps/code/conda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 1828, in showtraceback
stb = value._render_traceback_()
AttributeError: 'Py4JError' object has no attribute '_render_traceback_'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 929, in _get_connection
connection = self.deque.pop()
IndexError: pop from an empty deque
During handling of the above exception, another exception occurred:
...(remaining traceback omitted due to space constraints)
Py4JError: An error occurred while calling o90.fit
This error is not very descriptive, and hence it has become difficult for me to identify where I am going wrong. Any help would be greatly appreciated.
Input description:
The input dataframe contains 2,696,512 rows, and each row's feature vector has length 262,144.
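For a sense of scale, a back-of-the-envelope sketch (assuming float64 features once densified; Spark's actual object overhead would be higher):
rows = 2696512
features = 262144
total_bytes = rows * features * 8  # 8 bytes per float64
print(total_bytes / 1024**4)  # roughly 5.1 TiB of dense vectors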

After going through a lot of related questions on Stack Overflow, I thought this might be happening because I was running the code in a Jupyter notebook. So I later ran it from the command line using the spark-submit script, and I no longer get this error. I still don't know, though, why this error pops up when I run it in a Jupyter notebook.
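For reference, a minimal sketch of the command-line route, assuming the code above is saved as train_rf.py (a hypothetical filename) with illustrative memory settings; "Answer from Java side is empty" often indicates the driver JVM died, frequently under memory pressure, which explicit spark-submit settings can mitigate:
spark-submit --master local[*] --driver-memory 8g --executor-memory 8g train_rf.py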

Related

I am getting TypeError: '(slice(None, None, None), None)' is an invalid key when I use pandas on my new computer

I am running code on a new computer (the code works perfectly on my old computer) and I get the following errors:
Traceback (most recent call last):
File ~\anaconda3\lib\site-packages\pandas\core\indexes\base.py:3621 in get_loc
return self._engine.get_loc(casted_key)
File pandas\_libs\index.pyx:136 in pandas._libs.index.IndexEngine.get_loc
File pandas\_libs\index.pyx:142 in pandas._libs.index.IndexEngine.get_loc
TypeError: '(slice(None, None, None), None)' is an invalid key
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File ~...\temp.py:364 in <module>
ax.plot(pd.to_datetime(hodls.index, unit="s"), hodls)
File ~\anaconda3\lib\site-packages\matplotlib\axes\_axes.py:1632 in plot
lines = [*self._get_lines(*args, data=data, **kwargs)]
File ~\anaconda3\lib\site-packages\matplotlib\axes\_base.py:312 in __call__
yield from self._plot_args(this, kwargs)
File ~\anaconda3\lib\site-packages\matplotlib\axes\_base.py:488 in _plot_args
y = _check_1d(xy[1])
File ~\anaconda3\lib\site-packages\matplotlib\cbook\__init__.py:1327 in _check_1d
ndim = x[:, None].ndim
File ~\anaconda3\lib\site-packages\pandas\core\frame.py:3505 in __getitem__
indexer = self.columns.get_loc(key)
File ~\anaconda3\lib\site-packages\pandas\core\indexes\base.py:3628 in get_loc
self._check_indexing_error(key)
File ~\anaconda3\lib\site-packages\pandas\core\indexes\base.py:5637 in _check_indexing_error
raise InvalidIndexError(key)
InvalidIndexError: (slice(None, None, None), None)
Can anybody help me?

ValueError while using Roboflow object detection YOLOv4 PyTorch model on custom dataset

We are using Roboflow for object detection with the YOLOv4 PyTorch model on our custom dataset. During training, we get the following error.
Traceback (most recent call last):
File "./pytorch-YOLOv4/train.py", line 447, in <module>
device=device, )
File "./pytorch-YOLOv4/train.py", line 310, in train
for i, batch in enumerate(train_loader):
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 345, in __next__
data = self._next_data()
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 856, in _next_data
return self._process_data(data)
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 881, in _process_data
data.reraise()
File "/usr/local/lib/python3.7/dist-packages/torch/_utils.py", line 394, in reraise
raise self.exc_type(msg)
ValueError: Caught ValueError in DataLoader worker process 7.
Original Traceback (most recent call last):
File "/content/pytorch-YOLOv4/dataset.py", line 382, in __getitem__
out_bboxes1[:min(out_bboxes.shape[0], self.cfg.boxes)] = out_bboxes[:min(out_bboxes.shape[0], self.cfg.boxes)]
AttributeError: 'list' object has no attribute 'shape'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/content/pytorch-YOLOv4/dataset.py", line 385, in __getitem__
out_bboxes1[:min(out_bboxes.shape[0], self.cfg.boxes)] = out_bboxes[:min(out_bboxes.shape[0], self.cfg.boxes)]
ValueError: could not broadcast input array from shape (0) into shape (0,5)
I don't know the details of your parameters, but the log points to something wrong in this code:
out_bboxes1[:min(out_bboxes.shape[0], self.cfg.boxes)] \
= out_bboxes[:min(out_bboxes.shape[0], self.cfg.boxes)]
The first error says your parameter out_bboxes has no attribute 'shape' because it is a list object, so consider converting its datatype as needed.
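A minimal sketch of that kind of conversion, assuming out_bboxes is a (possibly empty) Python list of 5-element boxes, and using cfg_boxes as a hypothetical stand-in for self.cfg.boxes:
import numpy as np

out_bboxes = []   # hypothetical: list of [x1, y1, x2, y2, cls] boxes, possibly empty
cfg_boxes = 60    # hypothetical stand-in for self.cfg.boxes

# Converting the list to an array restores .shape, and reshape(-1, 5)
# keeps a (0, 5) shape even with zero boxes, so the broadcast succeeds.
out_bboxes = np.array(out_bboxes, dtype=np.float32).reshape(-1, 5)
out_bboxes1 = np.zeros((cfg_boxes, 5), dtype=np.float32)
n = min(out_bboxes.shape[0], cfg_boxes)
out_bboxes1[:n] = out_bboxes[:n]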

Python3 Pandas.DataFrame.info() Error Key: 30

So I was digging around some datasets and trying to use pandas to analyze them, and I stumbled across the following error, and my brain froze :(
Here is the snippet where the exception is raised:
import pandas as pd
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
data = pd.DataFrame(X)
data['class'] = y
data.head()
data.tail()
data.columns
print('length of data is', len(data))
data.shape
data.info()
Here's the error traceback:
C:\Users\97150\PycharmProjects\EmbeddedLinux\venv\Scripts\python.exe C:/Users/97150/PycharmProjects/EmbeddedLinux/AI/project.py
length of data is 569
Traceback (most recent call last):
File "C:\Users\97150\PycharmProjects\EmbeddedLinux\venv\lib\site-packages\pandas\core\indexes\base.py", line 2889, in get_loc
return self._engine.get_loc(casted_key)
File "pandas\_libs\index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 97, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 30
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:/Users/97150/PycharmProjects/EmbeddedLinux/AI/project.py", line 42, in <module>
data.info()
File "C:\Users\97150\PycharmProjects\EmbeddedLinux\venv\lib\site-packages\pandas\core\frame.py", line 2587, in info
self, verbose, buf, max_cols, memory_usage, null_counts
File "C:\Users\97150\PycharmProjects\EmbeddedLinux\venv\lib\site-packages\pandas\io\formats\info.py", line 250, in info
self._verbose_repr(lines, ids, dtypes, show_counts)
File "C:\Users\97150\PycharmProjects\EmbeddedLinux\venv\lib\site-packages\pandas\io\formats\info.py", line 335, in _verbose_repr
dtype = dtypes[i]
File "C:\Users\97150\PycharmProjects\EmbeddedLinux\venv\lib\site-packages\pandas\core\series.py", line 882, in __getitem__
return self._get_value(key)
File "C:\Users\97150\PycharmProjects\EmbeddedLinux\venv\lib\site-packages\pandas\core\series.py", line 991, in _get_value
loc = self.index.get_loc(label)
File "C:\Users\97150\PycharmProjects\EmbeddedLinux\venv\lib\site-packages\pandas\core\indexes\base.py", line 2891, in get_loc
raise KeyError(key) from err
KeyError: 30
Process finished with exit code 1
Note: I'm using PyCharm Community 2020.2; I checked for updates and such, and nothing changed.
So, it turned out pandas was straight up acting weird.
Removing the () from data.info fixed the issue :) (without the parentheses the method is never actually invoked, so the failing code path is simply skipped).
Alternatively, you can try passing the verbose=True and null_counts=True arguments to the .info() method to display the result (use just the verbose argument if you don't want to count null values):
data.info(verbose=True, null_counts=True)
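As a side note, if your pandas version complains about null_counts, newer releases renamed that argument to show_counts, so the equivalent there would be:
data.info(verbose=True, show_counts=True)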
Let me know if things work out for you.

Yahoo_Finance Package for Python - Share() does not work anymore

Since today, I have been experiencing errors caused by the yahoo_finance package, version 1.4.
Here is a code example that is causing the error:
from yahoo_finance import Share
Apple = Share("AAPL")
This results in the following error:
Traceback (most recent call last):
File "C:\Users\Julian\Anaconda3\lib\site-packages\yahoo_finance\__init__.py", line 120, in _request
_, results = response['query']['results'].popitem()
AttributeError: 'NoneType' object has no attribute 'popitem'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\Julian\Anaconda3\lib\site-packages\yahoo_finance\__init__.py", line 123, in _request
raise YQLQueryError(response['error']['description'])
KeyError: 'error'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\Julian\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2881, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-3-44fc1f59aa21>", line 1, in <module>
Apple = Share("AAPL")
File "C:\Users\Julian\Anaconda3\lib\site-packages\yahoo_finance\__init__.py", line 178, in __init__
self.refresh()
File "C:\Users\Julian\Anaconda3\lib\site-packages\yahoo_finance\__init__.py", line 142, in refresh
self.data_set = self._fetch()
File "C:\Users\Julian\Anaconda3\lib\site-packages\yahoo_finance\__init__.py", line 181, in _fetch
data = super(Share, self)._fetch()
File "C:\Users\Julian\Anaconda3\lib\site-packages\yahoo_finance\__init__.py", line 134, in _fetch
data = self._request(query)
File "C:\Users\Julian\Anaconda3\lib\site-packages\yahoo_finance\__init__.py", line 125, in _request
raise YQLResponseMalformedError()
yahoo_finance.YQLResponseMalformedError: Response malformed.
Do you experience similar issues, or is this only an issue for me personally? If yes, what are potential fixes for this issue?
Thank you for your replies.

PySpark error while querying Cassandra to convert into dataframes

I am getting the following error while executing the command:
user = sc.cassandraTable("DB NAME", "TABLE NAME").toDF()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/src/spark/spark-1.4.1/python/pyspark/sql/context.py", line 60, in toDF
return sqlContext.createDataFrame(self, schema, sampleRatio)
File "/usr/local/src/spark/spark-1.4.1/python/pyspark/sql/context.py", line 333, in createDataFrame
schema = self._inferSchema(rdd, samplingRatio)
File "/usr/local/src/spark/spark-1.4.1/python/pyspark/sql/context.py", line 220, in _inferSchema
raise ValueError("Some of types cannot be determined by the "
ValueError: Some of types cannot be determined by the first 100 rows, please try again with sampling
Load into a DataFrame directly; this also avoids any Python-level code for inferring types:
sqlContext.read.format("org.apache.spark.sql.cassandra").options(keyspace="ks",table="tb").load()
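A fuller usage sketch, assuming the spark-cassandra-connector package is on the classpath and using placeholder keyspace/table names:
df = sqlContext.read.format("org.apache.spark.sql.cassandra") \
    .options(keyspace="ks", table="tb") \
    .load()
df.printSchema()  # schema comes from Cassandra metadata, so no sampling is needed
df.show(5)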
