Pyspark mllib + count or collect method throws ArrayIndexOutOfBounds exception - python-3.x

I'm learning pyspark and mllib.
After predicting the test data using A RF model, I'm assigning the result in a variable called 'predictions' which is a RDD.
If I call predictions.count() or prediction.collect(), then it is failing with the following exception.
Can you please share your thoughts? Already spent quite some time, but didn't find what is missing.
predictions = predict(training_data, test_data)
File "/mp5/part_d_poc.py", line 36, in predict
print(predictions.count())
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1055, in count
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1046, in sum
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 917, in fold
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 816, in collect
File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 15.0 failed 1 times, most recent failure: Lost task 0.0 in stage 15.0 (TID 28, localhost, executor driver): java.lang.ArrayIndexOutOfBoundsException: 7
I constructed the training data in the following way.
raw_training_data.map(lambda row: LabeledPoint(row.split(',')[-1], Vectors.dense(row.split(',')[0:-1])))

It seems like this error is caused when there's a mismatch between the schema and data. Please refer to these -
ArrayIndexOutOfBoundsException with Spark, Spark-Avro and Google Analytics Data
https://github.com/Azure/spark-cdm-connector/issues/46#issuecomment-717543025
https://forums.couchbase.com/t/arrayindexoutofboundsexception/10311/3

Related

Error using MultiWorkerMirroredStrategy to train object detection research model ssd_mobilenet_v1_fpn_640x640_coco17_tpu-8

I'm trying to train research model ssd_mobilenet_v1_fpn_640x640_coco17_tpu-8 using the MultiWorkerMirroredStrategy (by setting --num_workers=2 in the invocation of model_main_tf2.py). I'm trying to train across two workers (0 and 1), each with a single GPU. However, when I attempt this I get the following error, always on worker 1:
Traceback (most recent call last):
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\distribute\input_lib.py", line 553, in __next__
return self.get_next()
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\distribute\input_lib.py", line 610, in get_next
return self._get_next_no_partial_batch_handling(name)
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\distribute\input_lib.py", line 642, in _get_next_no_partial_batch_handling
replicas.extend(self._iterators[i].get_next_as_list(new_name))
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\distribute\input_lib.py", line 1594, in get_next_as_list
return self._format_data_list_with_options(self._iterator.get_next())
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\data\ops\multi_device_iterator_ops.py", line 580, in get_next
result.append(self._device_iterators[i].get_next())
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\data\ops\iterator_ops.py", line 889, in get_next
return self._next_internal()
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\data\ops\iterator_ops.py", line 819, in _next_internal
ret = gen_dataset_ops.iterator_get_next(
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\ops\gen_dataset_ops.py", line 2922, in iterator_get_next
_ops.raise_from_not_ok_status(e, name)
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\framework\ops.py", line 7186, in raise_from_not_ok_status
raise core._status_to_exception(e) from None # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence [Op:IteratorGetNext]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\JS\Desktop\Tensorflow\models\research\object_detection\model_main_tf2.py", line 114, in <module>
tf.compat.v1.app.run()
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\platform\app.py", line 36, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\absl\app.py", line 312, in run
_run_main(main, args)
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\absl\app.py", line 258, in _run_main
sys.exit(main(argv))
File "C:\Users\JS\Desktop\Tensorflow\models\research\object_detection\model_main_tf2.py", line 105, in main
model_lib_v2.train_loop(
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\object_detection\model_lib_v2.py", line 605, in train_loop
load_fine_tune_checkpoint(
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\object_detection\model_lib_v2.py", line 401, in load_fine_tune_checkpoint
_ensure_model_is_built(model, input_dataset, unpad_groundtruth_tensors)
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\object_detection\model_lib_v2.py", line 161, in _ensure_model_is_built
features, labels = iter(input_dataset).next()
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\distribute\input_lib.py", line 549, in next
return self.__next__()
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\distribute\input_lib.py", line 555, in __next__
raise StopIteration
StopIteration
Worker 0 eventually fails after detecting that worker 1 has gone down.
This error happens regardless of the physical machines on which the two workers run. In other words I see it if I'm running both workers on a single machine (using localhost) OR different machines on the same network.
Based on the trace in the error messages, the error appears to be occurring whenever the training loop attempts to iterate over the training data generated by strategy.experimental_distribute_datasets_from_function. Note that if I change the strategy to MirroredStrategy it runs fine on a single machine (no other changes made). I'm not sure if I'm doing something wrong or if there is a bug in the object detection API.
My setup on both machines is identical (I basically followed the setup instructions on the object detection web-site):
Windows 10
Tensorflow 2.8.0
Cuda Toolkit 11.2
cudnn 8.1
Has anyone ever seen this error before? If so, is there a way around it?
Ok, I think I understand the issue. In the object detection library there is a file called dataset_builder.py that builds the training dataset from the TFRecord stored in the file specified in the pipeline.config file (in the input_path item of the tf_record_input_reader). The function that actually reads the TFRecord file is _read_dataset_internal. This function treats the input_path of the pipeline config as a LIST OF FILES and then applies a sharding function (passed as an argument) to divide the files between the replicas doing the training (one replica per worker). Since my input_path only specified a single TFRecord file it was assigned to the first replica and the other replicas were given empty filenames!! Thus only the first replica actually had an input dataset to work with, hence the crash.
The solution was to split the training data across two files (two TFRecords) and then set the input_path in pipeline.config to be a list of paths rather than a single path. Once I did this it appears as though the model trained successfully (at least it didn't crash).
I'm not sure if this is a bug in the object detection code or not. I assumed that if I only had one training record (visible to both workers) that both workers would use it and just batch the data accordingly. I'm just not sure if the assumption itself is wrong or if the assumption is correct and the code is wrong.
Anyway, I this helps anyone who might be wrestling with the same issue.

Custom delimited text file from dataframe

I'm using spark 1.6 and trying to create a delimited file from the dataframe.
The field delimiter is '|^', so I'm concatenating the columns from the dataframe while selecting from the temp table
Now the below code fails everytime with this error
ERROR scheduler.TaskSetManager: Task 172 in stage 9.0 failed 4 times; aborting job
19/03/01 09:10:15 ERROR datasources.InsertIntoHadoopFsRelation: Aborting job.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 172 in stage 9.0 failed 4 times, most recent failure: Lost task 172.3 in stage 9.0 (TID 1397, tplhc01d104.iuser.iroot.adidom.com, executor 7): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:272
The Code I'm using is this..
tempDF.registerTempTable("BNUC_TEMP")
context.sql("select concat('VALID','|^', RECORD_ID,'|^', DATA_COL1,'|^', DATA_COL2,'|^','P','|^', DATA_COL4,'|^', DATA_COL5,'|^', DATA_COL6,'GBP','|^',from_unixtime(unix_timestamp( ACTION_DATE)),'|^',from_unixtime(unix_timestamp( UPDATED_DATE))) from BNUC_TEMP")
.write.mode("overwrite")
.text("/user/USERNAME/landing/staging/BNU/temp/")

Can't write to local Hive using JDBC

I am running a small Amazon EMR cluster and wish to write to its Hive database from a remote connection via JDBC. I am running into an error that also appears if I execute everything locally on that EMR cluster, which is why I think the fault is not the remote connection but something directly on EMR.
The error appears when running this minimal example:
connectionProperties = {
"user" : "aengelhardt",
"password" : "doot",
"driver" : "org.apache.hive.jdbc.HiveDriver"
}
from pyspark.sql import DataFrame, Row
test_df = sqlContext.createDataFrame([
Row(name=1)
])
test_df.write.jdbc(url= "jdbc:hive2://127.0.0.1:10000", table = "test_df", properties=connectionProperties, mode="overwrite")
I then get a lot of Java error messages, but I think the important lines are these:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 940, in jdbc
self.mode(mode)._jwrite.jdbc(url, table, jprop)
File "/usr/lib/spark/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/lib/spark/python/lib/py4j-0.10.6-src.zip/py4j/protocol.py", line 320, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o351.jdbc.
: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: ParseException line 1:23 cannot recognize input near '"name"' 'BIGINT' ')' in column name or primary key or foreign key
The last line hints that something came up while creating the table, since he tries to specifiy the 'name' column as a 'BIGINT' there.
I found this question which has a similar problem, and the issue was that the SQL query was wrongly specified. But here, I don't specify a query, so I don't know where that happened or how to fix it.
As of now, I have no idea how to dive in deeper to find the cause of this. Does anyone have a solution or an idea of how to search further for the cause?

PySpark groupBy count fails with show method

I have a problem with my df, running Spark 2.1.0, that has several string columns created as an SQL query from a Hive DB that gives this .summary():
DataFrame[summary: string, visitorid: string, eventtype: string, ..., target: string].
If I only run df.groupBy("eventtype").count(), it works and I get DataFrame[eventtype: string, count: bigint]
When running with show df.groupBy('eventtype').count().show(), I keep getting :
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-9040214714346906648.py", line 267, in <module>
raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-9040214714346906648.py", line 265, in <module>
exec(code)
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 318, in show
print(self._jdf.showString(n, 20))
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o4636.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 633.0 failed 4 times, most recent failure: Lost task 0.3 in stage 633.0 (TID 19944, ip-172-31-28-173.eu-west-1.compute.internal, executor 440): java.lang.NullPointerException
I have no clue what is wrong with the show method (neither of the other columns works either, not event column target which I created). The admin of the cluster could not help me either.
Many thanks for any pointers
There is some problem, currently we know the issue if your DataFrame contain some limit. If yes, you probably went into https://issues.apache.org/jira/browse/SPARK-18528
That means, you must upgrade Spark version to 2.1.1 or you can use repartition as a workaround to avoid this problem
As #AssafMendelson said, the count() only creates new DataFrame, but it doesn't start calculation. Performing show or i.e. head will start the calculation.
If the Jira ticket and upgrade don't help you, please post logs of workers
When you run
df.groupBy("eventtype").count()
You are actually defining a lazy transformation on HOW to calculate the result. This would return a new dataframe almost immediately regardless of the data size. When you call show you are performing an action, this is when the actual calculation begins.
If you look at the bottom of your error log:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 633.0 failed 4 times, most recent failure: Lost task 0.3 in stage 633.0 (TID 19944, ip-172-31-28-173.eu-west-1.compute.internal, executor 440): java.lang.NullPointerException
You can see that one of the task failed due to a null pointer exception. I would go and check the definition of df to see what happened before (maybe even see if simply doing df.count() causes the exception).

why cassandra throws exception in select query?

I am using cassandra db ,while i use select at some times i get this exception?
Traceback (most recent call last):
File "bin/cqlsh", line 1001, in perform_statement_untraced
self.cursor.execute(statement, decoder=decoder)
File "bin/../lib/cql-internal-only-1.4.0.zip/cql-1.4.0/cql/cursor.py", line 81, in execute
return self.process_execution_results(response, decoder=decoder)
File "bin/../lib/cql-internal-only-1.4.0.zip/cql-1.4.0/cql/thrifteries.py", line 131, in process_execution_results
raise Exception('unknown result type %s' % response.type)
Exception: unknown result type None
can any one explain why this exceptions occur and also i get Internal application error.
what this error message actually means?
EDIT: I get this error for the first time, next time onwards its running correctly.I dont get why it is so?
//cql query via cqlsh
select * from event_logging limit 5;

Resources