PySpark error while querying Cassandra to convert into DataFrames

I am getting the following error while executing the command:
user = sc.cassandraTable("DB NAME", "TABLE NAME").toDF()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/src/spark/spark-1.4.1/python/pyspark/sql/context.py", line 60, in toDF
return sqlContext.createDataFrame(self, schema, sampleRatio)
File "/usr/local/src/spark/spark-1.4.1/python/pyspark/sql/context.py", line 333, in createDataFrame
schema = self._inferSchema(rdd, samplingRatio)
File "/usr/local/src/spark/spark-1.4.1/python/pyspark/sql/context.py", line 220, in _inferSchema
raise ValueError("Some of types cannot be determined by the "
ValueError: Some of types cannot be determined by the first 100 rows, please try again with sampling

Load into a DataFrame directly; this also avoids any Python-level code for interpreting types.
sqlContext.read.format("org.apache.spark.sql.cassandra").options(keyspace="ks",table="tb").load()
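A minimal sketch of that direct read (keyspace and table names are placeholders, sc is the existing SparkContext from the shell, and the Cassandra connector package is assumed to be on the classpath):
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # sc: the SparkContext already created by the shell

# Read the Cassandra table straight into a DataFrame; the connector
# supplies the schema, so no Python-side type inference is needed.
users = (sqlContext.read
         .format("org.apache.spark.sql.cassandra")
         .options(keyspace="my_keyspace", table="my_table")  # placeholders
         .load())

users.printSchema()
users.show(5)
Alternatively, judging by the createDataFrame(self, schema, sampleRatio) call in the traceback, you could follow the error message's suggestion and pass a sampling ratio, e.g. sc.cassandraTable("DB NAME", "TABLE NAME").toDF(sampleRatio=0.5), so more rows are scanned when inferring types.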

Related

Value error while using roboflow object detection Yolov4 pytorch model on custom dataset

We are using Roboflow for object detection with a YOLOv4 PyTorch model on our custom dataset. During training, we get the following error.
Traceback (most recent call last):
File "./pytorch-YOLOv4/train.py", line 447, in <module>
device=device, )
File "./pytorch-YOLOv4/train.py", line 310, in train
for i, batch in enumerate(train_loader):
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 345, in __next__
data = self._next_data()
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 856, in _next_data
return self._process_data(data)
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 881, in _process_data
data.reraise()
File "/usr/local/lib/python3.7/dist-packages/torch/_utils.py", line 394, in reraise
raise self.exc_type(msg)
ValueError: Caught ValueError in DataLoader worker process 7.
Original Traceback (most recent call last):
File "/content/pytorch-YOLOv4/dataset.py", line 382, in __getitem__
out_bboxes1[:min(out_bboxes.shape[0], self.cfg.boxes)] = out_bboxes[:min(out_bboxes.shape[0], self.cfg.boxes)]
AttributeError: 'list' object has no attribute 'shape'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/content/pytorch-YOLOv4/dataset.py", line 385, in __getitem__
out_bboxes1[:min(out_bboxes.shape[0], self.cfg.boxes)] = out_bboxes[:min(out_bboxes.shape[0], self.cfg.boxes)]
ValueError: could not broadcast input array from shape (0) into shape (0,5)
I don't know the details of your parameters, but the log points to a problem in this line of your code:
out_bboxes1[:min(out_bboxes.shape[0], self.cfg.boxes)] \
= out_bboxes[:min(out_bboxes.shape[0], self.cfg.boxes)]
The first error says your parameter out_bboxes has no attribute 'shape' because it is a plain list. Consider converting it to a NumPy array (or whatever type you need) before this line.
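A minimal sketch of such a fix (the helper name pad_bboxes is hypothetical; the 5-column [x1, y1, x2, y2, cls] layout and the self.cfg.boxes limit are taken from the traceback lines above):
import numpy as np

def pad_bboxes(out_bboxes, max_boxes):
    # Pad or truncate a possibly empty list of [x1, y1, x2, y2, cls]
    # boxes to a fixed (max_boxes, 5) array. Unlike the original
    # dataset.py lines, this tolerates a plain Python list, and the
    # reshape gives an empty input the (0, 5) shape, avoiding the
    # "could not broadcast (0) into (0,5)" error.
    out_bboxes = np.asarray(out_bboxes, dtype=np.float32).reshape(-1, 5)
    out_bboxes1 = np.zeros((max_boxes, 5), dtype=np.float32)
    n = min(out_bboxes.shape[0], max_boxes)
    out_bboxes1[:n] = out_bboxes[:n]
    return out_bboxes1

print(pad_bboxes([], 3).shape)              # (3, 5)
print(pad_bboxes([[0, 0, 10, 10, 1]], 3))   # first row filled, rest zeros
Inside __getitem__, something like out_bboxes1 = pad_bboxes(out_bboxes, self.cfg.boxes) would then replace the two failing lines.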

Brightway2: SyntaxError: at expr='ecoinvent 3.7 cut-off' when bw.write_database()

I'm using an LCI dataset from an Excel file.
I used it several times to conduct LCA with Brightway2.
I created a new product in that same Excel file, and the first steps of the import went fine, namely:
imp = bw.ExcelImporter(os.path.join(ROOT_DIR, "LCI_CW.xlsx"))
imp.apply_strategies()
imp.match_database(fields=('name', 'unit', 'location'))
imp.match_database("ecoinvent 3.7 cut-off",
fields=('name', 'unit', 'location'))
imp.statistics()
When checking with imp.write_excel(), the activities match, etc.
But when using imp.write_database(), I get this error:
SyntaxError: at expr='ecoinvent 3.7 cut-off'
Any idea where this mistake could be hidden? I checked my use of expr='ecoinvent 3.7 cut-off', etc.
More details below:
Traceback (most recent call last):
File "C:\...\asteval\asteval.py", line 254, in parse
out = ast.parse(text)
File "C:\...\lib\ast.py", line 50, in parse
return compile(source, filename, mode, flags,
File "<unknown>", line 1
ecoinvent 3.7 cut-off
^
SyntaxError: invalid syntax
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\...\IPython\core\interactiveshell.py", line 3441, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "C:\...\Local\Temp/ipykernel_14440/3637353444.py", line 1, in <module>
imp.write_database()
File "C:\...\bw2io\importers\excel.py", line 277, in write_database
super(ExcelImporter, self).write_database(**kwargs)
File "C:\...\bw2io\importers\base_lci.py", line 266, in write_database
self.write_database_parameters(activate_parameters, delete_existing)
File "C:\...\bw2io\importers\excel.py", line 270, in write_database_parameters
super(ExcelImporter, self).write_database_parameters(
File "C:\...\bw2io\importers\base_lci.py", line 118, in write_database_parameters
parameters.new_database_parameters(
File "C:\...\bw2data\parameters.py", line 1319, in new_database_parameters
DatabaseParameter.recalculate(database)
File "C:\...\bw2data\parameters.py", line 348, in recalculate
new_symbols = get_new_symbols(data.values(), set(data))
File "C:\...\bw2data\parameters.py", line 1526, in get_new_symbols
nf.generic_visit(interpreter.parse(formula))
File "C:\...\asteval\asteval.py", line 256, in parse
self.raise_exception(None, msg='Syntax Error', expr=text)
File "C:\...\asteval\asteval.py", line 244, in raise_exception
raise exc(self.error_msg)
File "<string>", line unknown
SyntaxError: at expr='ecoinvent 3.7 cut-off'
The phrase 'ecoinvent 3.7 cut-off' is showing up in the definition of a database parameter, where it isn't a valid Python expression; it is the same as typing this at a Python prompt:
In [1]: ecoinvent 3.7 cut-off
File "<ipython-input-1-db6705818daf>", line 1
ecoinvent 3.7 cut-off
^
SyntaxError: invalid syntax
We could help debug the Excel file if you are comfortable sharing it. Otherwise I don't see a way to help understand the specific formatting error.
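As a rough way to locate the offending cell yourself, here is a sketch using the standard-library ast module, which performs the same parse check asteval does inside write_database() (collecting the formula strings from your spreadsheet's parameter rows is left as an assumption):
import ast

def invalid_formulas(formulas):
    # Return the formula strings that fail to parse as Python
    # expressions -- the same check asteval runs during import.
    bad = []
    for f in formulas:
        try:
            ast.parse(f)
        except SyntaxError:
            bad.append(f)
    return bad

# A database name accidentally placed in a formula column fails:
print(invalid_formulas(["2 * 3.7", "ecoinvent 3.7 cut-off"]))
# -> ['ecoinvent 3.7 cut-off']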

Python3 Pandas.DataFrame.info() Error Key: 30

So I was digging around some datasets and trying to use pandas to analyze them when I stumbled across the following error... and my brain froze :(
Here is the snippet where the exception is raised:
import pandas as pd
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
data = pd.DataFrame(X)
data['class'] = y
data.head()
data.tail()
data.columns
print('length of data is', len(data))
data.shape
data.info()
Here's the error traceback:
C:\Users\97150\PycharmProjects\EmbeddedLinux\venv\Scripts\python.exe C:/Users/97150/PycharmProjects/EmbeddedLinux/AI/project.py
length of data is 569
Traceback (most recent call last):
File "C:\Users\97150\PycharmProjects\EmbeddedLinux\venv\lib\site-packages\pandas\core\indexes\base.py", line 2889, in get_loc
return self._engine.get_loc(casted_key)
File "pandas\_libs\index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 97, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 30
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:/Users/97150/PycharmProjects/EmbeddedLinux/AI/project.py", line 42, in <module>
data.info()
File "C:\Users\97150\PycharmProjects\EmbeddedLinux\venv\lib\site-packages\pandas\core\frame.py", line 2587, in info
self, verbose, buf, max_cols, memory_usage, null_counts
File "C:\Users\97150\PycharmProjects\EmbeddedLinux\venv\lib\site-packages\pandas\io\formats\info.py", line 250, in info
self._verbose_repr(lines, ids, dtypes, show_counts)
File "C:\Users\97150\PycharmProjects\EmbeddedLinux\venv\lib\site-packages\pandas\io\formats\info.py", line 335, in _verbose_repr
dtype = dtypes[i]
File "C:\Users\97150\PycharmProjects\EmbeddedLinux\venv\lib\site-packages\pandas\core\series.py", line 882, in __getitem__
return self._get_value(key)
File "C:\Users\97150\PycharmProjects\EmbeddedLinux\venv\lib\site-packages\pandas\core\series.py", line 991, in _get_value
loc = self.index.get_loc(label)
File "C:\Users\97150\PycharmProjects\EmbeddedLinux\venv\lib\site-packages\pandas\core\indexes\base.py", line 2891, in get_loc
raise KeyError(key) from err
KeyError: 30
Process finished with exit code 1
Note: I'm using PyCharm Community 2020.2; I checked for updates and such, and nothing changed.
So, it turned out pandas was just acting up. Removing the () from data.info() made the error go away :)
Alternatively, you can try passing the verbose=True and null_counts=True arguments to the .info() method to display the result (use just the verbose argument if you don't want to count null values):
data.info(verbose=True, null_counts=True)
Let me know if things work out for you.

How to read .hql file (to run hive query) in pyspark

I have a .hql file with a huge number of queries. It runs slowly in Hive, so I want to read and run the .hql file using PySpark/Spark SQL.
I tried count = sqlContext.sql(open("file.hql").read()).count(), but it gives the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/data/CDH-5.7.1-1.cdh5.7.1/lib/spark/python/pyspark/sql/context.py", line 580, in sql
return DataFrame(self._ssql_ctx.sql(sqlQuery), self)
File "/data/CDH-5.7.1-1.cdh5.7.1/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/data/CDH-5.7.1-1.cdh5.7.1/lib/spark/python/pyspark/sql/utils.py", line 51, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"missing EOF at ';' near 'db'; line 1 pos 36"
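sqlContext.sql() executes a single statement at a time, which is what produces the "missing EOF at ';'" error on a multi-statement file. A common workaround is to split the file and run the statements one by one; this is a sketch that assumes no ';' appears inside string literals or comments:
# Split the .hql file into individual statements and run them in order.
with open("file.hql") as f:
    statements = [s.strip() for s in f.read().split(";") if s.strip()]

result = None
for stmt in statements:
    result = sqlContext.sql(stmt)

# Count the rows of the final query's result, as in the original attempt.
if result is not None:
    print(result.count())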

Py4JError: An error occurred while calling o90.fit

I want to apply a random forest algorithm to a DataFrame consisting of three columns, namely JournalID, IndexedJournalID (obtained using Spark's StringIndexer), and a feature vector. I used the code below to read the DataFrame from a Parquet file and apply a StringIndexer to the JournalID column to convert it to a categorical type.
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors
from pyspark.ml.linalg import VectorUDT
df=spark.read.parquet('JouID-UBTFIDFVectors-server22.parquet')
labelIndexer = StringIndexer(inputCol="journalid", outputCol="IndexedJournalID")
labelsDF=labelIndexer.fit(df)
df1=labelsDF.transform(df)
# Convert sparse vectors to dense vectors; applied to the raw 'features' column to produce VectorUDT values.
parse_ = udf(lambda l: Vectors.dense(l), VectorUDT())
df2 = df1.withColumn("featuresNew", parse_(df1["features"])).drop('features')
The new DataFrame schema (df2) is as follows:
root
|-- journalid: string (nullable = true)
|-- indexedLabel: double (nullable = false)
|-- featuresNew: vector (nullable = true)
Then I split df2 into training and test sets and create a random forest classifier object as below:
(trainingData, testData) = df2.randomSplit([0.8, 0.2])
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="featuresNew", numTrees=2 )
Finally, I apply the fit() method to the trainingData obtained above.
rfModel=rf.fit(trainingData)
With this I am able to train the model on 100 instances of the input DataFrame. However, over the whole training data, this line gives the following error.
----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 53652)
Traceback (most recent call last):
File "/data/sntps/code/conda3/lib/python3.6/socketserver.py", line 317, in _handle_request_noblock
self.process_request(request, client_address)
File "/data/sntps/code/conda3/lib/python3.6/socketserver.py", line 348, in process_request
self.finish_request(request, client_address)
File "/data/sntps/code/conda3/lib/python3.6/socketserver.py", line 361, in finish_request
self.RequestHandlerClass(request, client_address, self)
File "/data/sntps/code/conda3/lib/python3.6/socketserver.py", line 696, in __init__
self.handle()
File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/pyspark/accumulators.py", line 235, in handle
num_updates = read_int(self.rfile)
File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/pyspark/serializers.py", line 685, in read_int
raise EOFError
EOFError
----------------------------------------
ERROR:root:Exception while sending command.
Traceback (most recent call last):
File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1159, in send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 985, in send_command
response = connection.send_command(command)
File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1164, in send_command
"Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:41060)
Traceback (most recent call last):
File "/data/sntps/code/conda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2910, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-10-46d7488961c7>", line 1, in <module>
rfModel=rf.fit(trainingData)
File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/pyspark/ml/base.py", line 132, in fit
return self._fit(dataset)
File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/pyspark/ml/wrapper.py", line 288, in _fit
java_model = self._fit_java(dataset)
File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/pyspark/ml/wrapper.py", line 285, in _fit_java
return self._java_obj.fit(dataset._jdf)
File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 336, in get_return_value
format(target_id, ".", name))
py4j.protocol.Py4JError: An error occurred while calling o90.fit
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/data/sntps/code/conda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 1828, in showtraceback
stb = value._render_traceback_()
AttributeError: 'Py4JError' object has no attribute '_render_traceback_'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 929, in _get_connection
connection = self.deque.pop()
IndexError: pop from an empty deque
During handling of the above exception, another exception occurred:
.(traceback...not writing due to space issue)
.
.
Py4JError: An error occurred while calling o90.fit
This error is not very descriptive, and hence it is difficult for me to identify where I am going wrong. Any help would be much appreciated.
Input description:
The input DataFrame contains 2,696,512 rows, and each row's feature vector has length 262,144.
After going through a lot of related questions on Stack Overflow, I suspected this might be happening because I was running it in a Jupyter notebook. So I later ran it from the command line using a spark-submit script, and I no longer get this error. I still don't know why the error pops up when I run it in a Jupyter notebook, though.
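For reference, the command-line route looks roughly like this (a sketch; the script name train_rf.py and the resource flags are assumptions to adapt, since 262,144-dimensional feature vectors are memory-hungry and the dropped Py4J connection above is a typical symptom of an overloaded driver):
spark-submit \
  --master yarn \
  --driver-memory 8g \
  --executor-memory 8g \
  train_rf.py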
