PySpark groupBy count fails with show method - apache-spark

I have a problem with my df (running Spark 2.1.0), which has several string columns and was created from an SQL query against a Hive DB. Its .summary() gives:
DataFrame[summary: string, visitorid: string, eventtype: string, ..., target: string].
If I only run df.groupBy("eventtype").count(), it works and I get DataFrame[eventtype: string, count: bigint]
When I run it with show, df.groupBy('eventtype').count().show(), I keep getting:
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-9040214714346906648.py", line 267, in <module>
raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-9040214714346906648.py", line 265, in <module>
exec(code)
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 318, in show
print(self._jdf.showString(n, 20))
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o4636.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 633.0 failed 4 times, most recent failure: Lost task 0.3 in stage 633.0 (TID 19944, ip-172-31-28-173.eu-west-1.compute.internal, executor 440): java.lang.NullPointerException
I have no clue what is wrong with the show method (none of the other columns works either, not even the target column, which I created myself). The cluster admin could not help me either.
Many thanks for any pointers

There is a known problem if your DataFrame contains a limit somewhere in its lineage. If so, you probably ran into https://issues.apache.org/jira/browse/SPARK-18528
That means you need to upgrade to Spark 2.1.1, or you can use repartition as a workaround to avoid the problem.
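A minimal sketch of that workaround, assuming the limit that triggers SPARK-18528 sits somewhere upstream of df (the limit(1000) and the partition count below are hypothetical, not taken from your code):
limited = df.limit(1000)                  # hypothetical upstream limit
repartitioned = limited.repartition(200)  # arbitrary partition count; breaks the limit/groupBy pattern
repartitioned.groupBy("eventtype").count().show()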
As @AssafMendelson said, count() only creates a new DataFrame; it doesn't start the calculation. Performing show (or head, etc.) is what starts it.
If the Jira ticket and the upgrade don't help you, please post the worker logs.

When you run
df.groupBy("eventtype").count()
you are actually defining a lazy transformation that describes HOW to calculate the result. This returns a new dataframe almost immediately, regardless of the data size. When you call show you are performing an action; that is when the actual calculation begins.
If you look at the bottom of your error log:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 633.0 failed 4 times, most recent failure: Lost task 0.3 in stage 633.0 (TID 19944, ip-172-31-28-173.eu-west-1.compute.internal, executor 440): java.lang.NullPointerException
You can see that one of the tasks failed due to a null pointer exception. I would go and check how df is defined to see what happened earlier (maybe even check whether simply running df.count() raises the same exception).
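To make the lazy/eager split concrete, here is a minimal sketch with a hypothetical two-row DataFrame standing in for yours (in Zeppelin the spark session already exists):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("view",), ("click",)], ["eventtype"])  # stand-in for the real df

counts = df.groupBy("eventtype").count()  # lazy transformation: returns instantly, reads nothing
counts.show()                             # action: only now are the source rows actually scanned
df.count()                                # another action; if this also fails, the problem is in
                                          # how df was built, not in groupBy/count/show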

Related

Failed to get followers count of 'b'metapodcode'' ~empty list

With the June 2022 update, there have been some changes in Instagram's APIs. There was a discussion about changing or updating this code; you can find the discussion here. I then did some research on this topic and found the fix here, but that code is written in JavaScript, a different language. If that code were integrated into InstaPy, it seems the whole problem would be solved. InstaPy is an application that I love, but I've been looking for a solution to this problem for days and I'm not good at programming languages. I'm trying to get help here as a last resort. I'm waiting for your help.
INFO [2022-07-22 03:20:06] [metapodcod] Failed to get following count of 'b'metapodcod'' ~empty list
WARNING [2022-07-22 03:20:06] [metapodcod] Unable to save account progress, skipping data update
b"'NoneType' object has no attribute 'get'"
INFO [2022-07-22 03:20:07] [metapodcod] Sessional Live Report:
|> No any statistics to show
[Session lasted 2.5 minutes]
OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO
INFO [2022-07-22 03:20:07] [metapodcod] Session ended!
ooooooooooooooooooooooooooooooooooooooooooooooooooooooooo
Traceback (most recent call last):
File "C:\Users\metapodcod\InstaPy\yeni.py", line 41, in <module>
with smart_run(session):
File "C:\Users\metapodcod\AppData\Local\Programs\Python\Python310\lib\contextlib.py", line 135, in __enter__
return next(self.gen)
File "C:\Users\metapodcod\InstaPy\instapy\util.py", line 1983, in smart_run
session.login()
File "C:\Users\metapodcod\InstaPy\instapy\instapy.py", line 475, in login
self.followed_by = log_follower_num(self.browser, self.username, self.logfolder)
File "C:\Users\metapodcod\InstaPy\instapy\print_log_writer.py", line 21, in log_follower_num
followed_by = getUserData("graphql.user.edge_followed_by.count", browser)
File "C:\Users\metapodcod\InstaPy\instapy\util.py", line 501, in getUserData
get_key = shared_data.get("entry_data").get("ProfilePage")
AttributeError: 'NoneType' object has no attribute 'get'
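For what it's worth, the AttributeError in the traceback happens because util.py assumes shared_data is always a dict, but after the API change it comes back as None. A guard like the following (a sketch only; it avoids the crash but does not restore the follower count, and safe_profile_page is a hypothetical helper, not part of InstaPy) shows the failure mode:
def safe_profile_page(shared_data):
    # shared_data is None when Instagram no longer exposes the old
    # window._sharedData payload, which is what triggers
    # "'NoneType' object has no attribute 'get'" above.
    if not shared_data:
        return None
    entry_data = shared_data.get("entry_data") or {}
    return entry_data.get("ProfilePage")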

ete3 error : could not be translated into taxids! - Bioinformatics

I am using the ete3 (http://etetoolkit.org/) package in Python within a bioinformatics pipeline I wrote myself.
While running this script, I get the following error. I have used this script a lot on other datasets that don't have any issues, and it has never given any errors. I am using Python 3.5 and Miniconda. Any fixes/insights to resolve this error would be appreciated.
[Error]
Traceback (most recent call last):
File "/Users/d/miniconda2/envs/py35/bin/ete3", line 11, in <module>
load_entry_point('ete3==3.1.1', 'console_scripts', 'ete3')()
File "/Users/d/miniconda2/envs/py35/lib/python3.5/site-packages/ete3/tools/ete.py", line 95, in main
_main(sys.argv)
File "/Users/d/miniconda2/envs/py35/lib/python3.5/site-packages/ete3/tools/ete.py", line 268, in _main
args.func(args)
File "/Users/d/miniconda2/envs/py35/lib/python3.5/site-packages/ete3/tools/ete_ncbiquery.py", line 168, in run
collapse_subspecies=args.collapse_subspecies)
File "/Users/d/miniconda2/envs/py35/lib/python3.5/site-packages/ete3/ncbi_taxonomy/ncbiquery.py", line 434, in get_topology
lineage = id2lineage[sp]
KeyError: 3
Continuing from the comment section for better formatting.
Assuming that sp contains 3, as suggested by the error message (do check this yourself), you can inspect the ete3 code (current version): following its definition, you can trace it to this function:
def get_lineage_translator(self, taxids):
"""Given a valid taxid number, return its corresponding lineage track as a
hierarchically sorted list of parent taxids.
So I went to https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi
and checked whether 3 is a valid taxid, and it appears that it is not.
# relevant section from ncbi taxonomy browser
No result found in the Taxonomy database for taxonomy id
3
It appears to me that your only option is to trace how the 3 gets computed, because the root cause is simply that 3 is not a valid taxid number as required by the function.
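One way to locate the bad value before it reaches get_topology is to ask ete3 which of your input taxids it can translate (a sketch; the taxids list below is hypothetical, substitute whatever your pipeline passes in):
from ete3 import NCBITaxa

ncbi = NCBITaxa()                  # first use downloads/builds the local NCBI taxonomy DB
taxids = [9606, 10090, 3]          # hypothetical input list from the pipeline
known = ncbi.get_taxid_translator(taxids)       # only valid taxids appear as keys
missing = [t for t in taxids if t not in known]
print("not translatable:", missing)             # for this list you would see [3]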

Pyspark mllib + count or collect method throws ArrayIndexOutOfBounds exception

I'm learning pyspark and mllib.
After predicting on the test data using an RF model, I assign the result to a variable called 'predictions', which is an RDD.
If I call predictions.count() or predictions.collect(), it fails with the following exception.
Can you please share your thoughts? I have already spent quite some time on this but haven't found what is missing.
predictions = predict(training_data, test_data)
File "/mp5/part_d_poc.py", line 36, in predict
print(predictions.count())
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1055, in count
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1046, in sum
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 917, in fold
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 816, in collect
File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 15.0 failed 1 times, most recent failure: Lost task 0.0 in stage 15.0 (TID 28, localhost, executor driver): java.lang.ArrayIndexOutOfBoundsException: 7
I constructed the training data in the following way.
raw_training_data.map(lambda row: LabeledPoint(row.split(',')[-1], Vectors.dense(row.split(',')[0:-1])))
It seems this error occurs when there is a mismatch between the schema and the data. Please refer to these:
ArrayIndexOutOfBoundsException with Spark, Spark-Avro and Google Analytics Data
https://github.com/Azure/spark-cdm-connector/issues/46#issuecomment-717543025
https://forums.couchbase.com/t/arrayindexoutofboundsexception/10311/3
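One quick way to test that hypothesis is to look for input lines whose field count differs from the rest before building the LabeledPoints (a sketch, assuming raw_training_data is the RDD of comma-separated lines shown above; a row with fewer fields than the model expects would explain the ArrayIndexOutOfBoundsException: 7):
expected = len(raw_training_data.first().split(','))                              # field count of the first line
bad_rows = raw_training_data.filter(lambda row: len(row.split(',')) != expected)  # lines with a different field count
print(bad_rows.take(5))                                                           # any output here points to malformed lines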

Can't write to local Hive using JDBC

I am running a small Amazon EMR cluster and wish to write to its Hive database from a remote connection via JDBC. I am running into an error that also appears if I execute everything locally on that EMR cluster, which is why I think the fault is not the remote connection but something directly on EMR.
The error appears when running this minimal example:
connectionProperties = {
"user" : "aengelhardt",
"password" : "doot",
"driver" : "org.apache.hive.jdbc.HiveDriver"
}
from pyspark.sql import DataFrame, Row
test_df = sqlContext.createDataFrame([
Row(name=1)
])
test_df.write.jdbc(url= "jdbc:hive2://127.0.0.1:10000", table = "test_df", properties=connectionProperties, mode="overwrite")
I then get a lot of Java error messages, but I think the important lines are these:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 940, in jdbc
self.mode(mode)._jwrite.jdbc(url, table, jprop)
File "/usr/lib/spark/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/lib/spark/python/lib/py4j-0.10.6-src.zip/py4j/protocol.py", line 320, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o351.jdbc.
: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: ParseException line 1:23 cannot recognize input near '"name"' 'BIGINT' ')' in column name or primary key or foreign key
The last line hints that something went wrong while creating the table, since it tries to specify the 'name' column as a 'BIGINT' there.
I found this question which has a similar problem, and the issue was that the SQL query was wrongly specified. But here, I don't specify a query, so I don't know where that happened or how to fix it.
As of now, I have no idea how to dive in deeper to find the cause of this. Does anyone have a solution or an idea of how to search further for the cause?
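For reference, the ParseException suggests that the DDL Spark generates over JDBC uses double-quoted column names ("name"), which Hive's parser does not accept. A common alternative on EMR is to skip the Hive JDBC driver and let Spark write through the Hive metastore instead; a minimal sketch, assuming the Spark session is started with Hive support (the EMR default):
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
test_df = spark.createDataFrame([Row(name=1)])
# Writes via the metastore, so Spark emits HiveQL it knows Hive can parse,
# instead of ANSI-quoted CREATE TABLE statements over JDBC.
test_df.write.mode("overwrite").saveAsTable("test_df")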

why cassandra throws exception in select query?

I am using the Cassandra DB. When I use select, I sometimes get this exception:
Traceback (most recent call last):
File "bin/cqlsh", line 1001, in perform_statement_untraced
self.cursor.execute(statement, decoder=decoder)
File "bin/../lib/cql-internal-only-1.4.0.zip/cql-1.4.0/cql/cursor.py", line 81, in execute
return self.process_execution_results(response, decoder=decoder)
File "bin/../lib/cql-internal-only-1.4.0.zip/cql-1.4.0/cql/thrifteries.py", line 131, in process_execution_results
raise Exception('unknown result type %s' % response.type)
Exception: unknown result type None
Can anyone explain why this exception occurs? I also get an Internal application error.
What does this error message actually mean?
EDIT: I get this error the first time; from the next time onwards it runs correctly. I don't understand why that is.
//cql query via cqlsh
select * from event_logging limit 5;
