Weird error when selecting more than 100 spark udf columns - apache-spark

Starting with a simple spark dataframe with only one value, I create N simple udf columns.
import pyspark.sql.functions
N = 100
df = sqlContext.createDataFrame([{'value': 0}])
# N identical UDF columns, each returning the constant 0
udf_columns = [pyspark.sql.functions.udf(lambda x: 0)('value') for _ in range(N)]
df.select(udf_columns).take(1)
For N <= 100 this code works perfectly.
But as soon as N >= 101, I get the following error:
Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 34.0 failed 1 times, most recent failure: Lost task 0.0 in stage 34.0 (TID 50, localhost): java.lang.UnsupportedOperationException: Cannot evaluate expression: PythonUDF#<lambda>(input[0, LongType])
at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.genCode(Expression.scala:239)
at org.apache.spark.sql.execution.PythonUDF.genCode(python.scala:44)
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$gen$2.apply(Expression.scala:104)

Related

NetworkX find_cliques error using PySpark

I'm trying to use find_cliques to locate the maximal cliques in each subgroup.
My implementation uses a pandas_udf applied to each connected component:
import networkx as nx
import pandas as pd
from networkx import find_cliques

def pd_create_subgroups(pdf):
    index = pdf.component.unique()[0]
    try:
        # building the graph
        gnx = nx.from_pandas_edgelist(pdf, "src", "dst")
        bic = list(find_cliques(gnx))
        if len(bic) <= 1:
            return pd.DataFrame(data={"cliques": [[f"issue_{index}"]]})
        bic_sorted = sorted(map(sorted, bic))
        # keep only cliques with at least 3 nodes
        bic_sorted = [b for b in bic_sorted if len(b) >= 3]
        if len(bic_sorted) == 0:
            return pd.DataFrame(data={"cliques": [[f"issue_{index}"]]})
        return pd.DataFrame([bic_sorted]).transpose().rename(columns={0: "cliques"})
    except Exception:
        # fall back to a placeholder row on any failure
        return pd.DataFrame(data={"cliques": [[f"issue_{index}"]]})
pdf is a pandas dataframe containing the fields src, dst and component.
It has around 200M-300M undirected edges.
The job fails with the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in stage 12.0 (TID 331) (executor 9): java.lang.IndexOutOfBoundsException: index: 2147483628, length: 36 (expected: range(0, 2147483648))
When I run it on smaller graphs, it works properly.

Python Pandas indexing provides KeyError: (slice(None, None, None), )

I am indexing and slicing my data with pandas in Python 3 to calculate spatial statistics.
When I run a for loop over the range of latitudes and longitudes using .loc, I get KeyError: (slice(None, None, None), ) for any latitude/longitude combination that has no values in the input file. Instead of skipping those combinations, the code raises the error and stops. My code follows.
import numpy as np
import pandas as pd
from scipy import stats

filename = 'input.txt'
df = pd.read_csv(filename, delim_whitespace=True, header=None, names=['year','month','lat','lon','aod'], index_col=['year','month','lat','lon'])
idx = pd.IndexSlice
for i in range(1, 13):
    for lat0 in np.arange(0., 40.25, 0.25, dtype=float):
        for lon0 in np.arange(20.0, 75.25, 0.25, dtype=float):
            # select every year for this month/lat/lon combination
            tmp = df.loc[idx[:,i,lat0,lon0],:]
            if (len(tmp) <= 0):
                continue
            tmp2 = tmp.index.tolist()
In the code above, if I run tmp = df.loc[idx[:,1,0.0,34.0],:], it works well and gives the following output, which I use for further calculation.
                          aod
year month lat lon
2003 1     0.0 34.0  0.032000
2006 1     0.0 34.0  0.114000
2007 1     0.0 34.0  0.035000
2008 1     0.0 34.0  0.026000
2011 1     0.0 34.0  0.097000
2012 1     0.0 34.0  0.106333
2013 1     0.0 34.0  0.081000
2014 1     0.0 34.0  0.038000
2015 1     0.0 34.0  0.278500
2016 1     0.0 34.0  0.033000
2017 1     0.0 34.0  0.036333
2019 1     0.0 34.0  0.064333
2020 1     0.0 34.0  0.109500
But when I run the same code with tmp = df.loc[idx[:,1,0.0,32.75],:], a latitude/longitude pair that has no values in the input file, it does not skip the pair. It gives me the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 925, in __getitem__
return self._getitem_tuple(key)
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 1100, in _getitem_tuple
return self._getitem_lowerdim(tup)
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 822, in _getitem_lowerdim
return self._getitem_nested_tuple(tup)
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 906, in _getitem_nested_tuple
obj = getattr(obj, self.name)._getitem_axis(key, axis=axis)
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 1157, in _getitem_axis
locs = labels.get_locs(key)
File "/usr/lib/python3/dist-packages/pandas/core/indexes/multi.py", line 3347, in get_locs
indexer = _update_indexer(
File "/usr/lib/python3/dist-packages/pandas/core/indexes/multi.py", line 3296, in _update_indexer
raise KeyError(key)
KeyError: (slice(None, None, None), 1, 0.0, 32.75)
I tried replacing .loc with .iloc, but that raised a "too many indexers" error. I also tried solutions from the internet using .to_numpy(), .values and .as_matrix(), but nothing worked.
The idiomatic Pandas solution would be to write this as a groupby. Example:
# split df into groups by the keys month, lat, and lon
for index, tmp in df.groupby(['month','lat','lon']):
    # tmp is a dataframe where all rows have identical month, lat, and lon values
    # ... do something with the tmp dataframe ...
This has three benefits.
Speed. A groupby will be faster because it only needs to pass over the dataframe once, rather than searching the whole dataframe for everything matching the first group, then searching again for the second group, and so on.
Simplicity. The three nested loops over month, latitude, and longitude collapse into a single loop.
Robustness. If the dataframe doesn't have any rows matching, for example, "month=1,lat=0.0,lon=32.75", groupby simply does not create that group, so there is nothing to skip and no KeyError.
More information: User guide on grouping
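To make the robustness point concrete, here is a minimal sketch on a toy dataframe. The values are hypothetical stand-ins for input.txt, and the per-group mean is only an illustration, not the statistic from the question. The combination month=1, lat=0.0, lon=32.75 is absent from the data, so it never shows up as a group.

```python
import pandas as pd

# Hypothetical toy data standing in for input.txt
df = pd.DataFrame(
    {
        "year": [2003, 2006, 2003],
        "month": [1, 1, 2],
        "lat": [0.0, 0.0, 0.0],
        "lon": [34.0, 34.0, 34.0],
        "aod": [0.032, 0.114, 0.050],
    }
).set_index(["year", "month", "lat", "lon"])

group_means = {}
# Group on the index levels; only combinations present in the data are visited
for key, tmp in df.groupby(level=["month", "lat", "lon"]):
    group_means[key] = tmp["aod"].mean()

# The missing combination was never created, so no KeyError can occur
print((1, 0.0, 32.75) in group_means)
print(len(group_means))
```

Because the loop only sees groups that exist, the `len(tmp) <= 0` check from the question is no longer needed.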
Remark about groupby aggregation functions
You'll also sometimes see groupby used with aggregation functions. For example, suppose you wanted to get the sum of each column within each group.
>>> l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
>>> df = pd.DataFrame(l, columns=["a", "b", "c"])
>>> df.groupby(by=["b"]).sum()
     a  c
b
1.0  2  3
2.0  2  5
These aggregation functions are faster and easier to use, but sometimes I need something that is custom and unusual, so I'll write a loop. But if you're doing something common, like getting the average of a group, consider looking for an aggregation function.

About "spark.sql.objectHashAggregate.sortBased.fallbackThreshold"

While running one of my Spark stages,
I see the unexpected log messages below on the executors, and the stage stops making progress.
Could anyone tell me what it means when this message appears?
Also, is there any limit on the value of this setting (spark.sql.objectHashAggregate.sortBased.fallbackThreshold)?
# encountered message on executor
21/04/29 07:25:58 INFO ObjectAggregationIterator: Aggregation hash map size 128 reaches threshold capacity (128 entries), spilling and falling back to sort based aggregation. You may change the threshold by adjust option spark.sql.objectHashAggregate.sortBased.fallbackThreshold
21/04/29 07:26:51 INFO PythonUDFRunner: Times: total = 87019, boot = -361, init = 380, finish = 87000
21/04/29 07:26:51 INFO MemoryStore: Block rdd_36_1765 stored as values in memory (estimated size 19.6 MB, free 5.2 GB)
21/04/29 07:26:53 INFO PythonRunner: Times: total = 2154, boot = 6, init = 1, finish = 2147
21/04/29 07:26:53 INFO Executor: Finished task 1765.0 in stage 6.0 (TID 11172). 5310 bytes result sent to driver
21/04/29 07:27:33 INFO PythonUDFRunner: Times: total = 93086, boot = -461, init = 480, finish = 93067
21/04/29 07:27:33 INFO MemoryStore: Block rdd_36_1792 stored as values in memory (estimated size 19.7 MB, free 5.2 GB)
21/04/29 07:27:35 INFO PythonRunner: Times: total = 1999, boot = -40047, init = 40051, finish = 1995
21/04/29 07:27:35 INFO Executor: Finished task 1792.0 in stage 6.0 (TID 11199). 5267 bytes result sent to driver
21/04/29 07:27:35 INFO PythonUDFRunner: Times: total = 97305, boot = -313, init = 350, finish = 97268
21/04/29 07:27:35 INFO MemoryStore: Block rdd_36_1789 stored as values in memory (estimated size 19.7 MB, free 5.3 GB)
21/04/29 07:27:37 INFO PythonRunner: Times: total = 1928, boot = -2217, init = 2220, finish = 1925
21/04/29 07:27:37 INFO Executor: Finished task 1789.0 in stage 6.0 (TID 11196). 5310 bytes result sent to driver
# about Spark Stage I did
#
# given dataframe is
# (uid is given by monotonically_increasing_id)
# |-------|-------|
# | uids | score |
# |-------|-------|
# |[1,2,3]| 50 |
# |[1,2] | 70 |
# |[1] | 90 |
#
# expected result
# |-------|-------|
# | uid | score |
# |-------|-------|
# | 1 | 90 |
# | 2 | 70 |
# | 3 | 50 |
rdd = df.select(F.explode('uids').alias('uid'), 'score') \
    .rdd.map(lambda x: (x['uid'], x)) \
    .reduceByKey(func=max, numPartitions=1800) \
    .cache()
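For readers who want to check the intended result without a cluster, here is a plain-Python sketch (no Spark involved) of what the stage above is meant to compute on the sample data from the comments. The helper name best_score_per_uid is hypothetical; it only mirrors the explode and max-per-key steps and says nothing about the log message itself.

```python
# Sample rows from the question: (uids, score)
rows = [([1, 2, 3], 50), ([1, 2], 70), ([1], 90)]

def best_score_per_uid(rows):
    """Hypothetical helper: highest score seen for each uid."""
    best = {}
    for uids, score in rows:
        for uid in uids:  # "explode": one (uid, score) pair per array element
            if uid not in best or score > best[uid]:
                best[uid] = score  # "reduce by key with max": keep the top score
    return best

result = best_score_per_uid(rows)
print(sorted(result.items()))  # [(1, 90), (2, 70), (3, 50)]
```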

calibration_and_holdout_data: AttributeError: 'int' object has no attribute 'n'

I'm trying to run a BG/NBD model using the lifetimes library.
My analysis is based on the following example, but with my own data:
https://towardsdatascience.com/whats-a-customer-worth-8daf183f8a4f
Somehow I receive the following error, and after reading 50+ Stack Overflow posts without finding an answer, I'd like to ask my own question:
What am I doing wrong? :(
Thanks in advance! :)
I tried changing the type of every column in my dataframe, but nothing changed.
df2 = df
df2.head()
   person_id effective_date  accounting_sales_total
0     219333     2018-08-04                 1049.89
1     333219     2018-12-21                 4738.97
2     344405     2018-07-16                  253.99
3     455599     2017-07-14                 2199.96
4     766665     2017-08-15                 1245.00
from lifetimes.utils import calibration_and_holdout_data
summary_cal_holdout = calibration_and_holdout_data(df2, 'person_id', 'effective_date',
                                                   calibration_period_end='2017-12-31',
                                                   observation_period_end='2018-12-31')
print(summary_cal_holdout.head())
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-85-cdcb400098dc> in <module>()
7 summary_cal_holdout = calibration_and_holdout_data(df2, 'person_id', 'effective_date',
8 calibration_period_end='2017-12-31',
----> 9 observation_period_end='2018-12-31')
10
11 print(summary_cal_holdout.head())
/usr/local/envs/py3env/lib/python3.5/site-packages/lifetimes/utils.py in calibration_and_holdout_data(transactions, customer_id_col, datetime_col, calibration_period_end, observation_period_end, freq, datetime_format, monetary_value_col)
122 combined_data.fillna(0, inplace=True)
123
--> 124 delta_time = (to_period(observation_period_end) - to_period(calibration_period_end)).n
125 combined_data["duration_holdout"] = delta_time
126
AttributeError: 'int' object has no attribute 'n'
This actually runs fine as it is :)
import pandas as pd
data = {'person_id': [219333, 333219, 344405, 455599, 766665],
        'effective_date': ['2018-08-04', '2018-12-21', '2018-07-16', '2017-07-14', '2017-08-15'],
        'accounting_sales_total': [1049.89, 4738.97, 253.99, 2199.96, 1245.00]}
df2 = pd.DataFrame(data)
from lifetimes.utils import calibration_and_holdout_data
summary_cal_holdout = calibration_and_holdout_data(df2, 'person_id', 'effective_date',
                                                   calibration_period_end='2017-12-31',
                                                   observation_period_end='2018-12-31')
print(summary_cal_holdout.head())
Returns:
           frequency_cal  recency_cal  T_cal  frequency_holdout  \
person_id
455599               0.0          0.0  170.0                0.0
766665               0.0          0.0  138.0                0.0

           duration_holdout
person_id
455599                  365
766665                  365
This suggests your issue is probably a package version mismatch. Try:
pip install lifetimes --upgrade
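The traceback points at the subtraction of two periods followed by .n. As far as I know, older pandas releases returned a plain int from that subtraction, while newer ones (around 0.24 and later) return an offset object that has an .n attribute, which would explain why the same code works in one environment and fails in another. The sketch below reproduces only that one step; period_difference is a hypothetical helper, not part of lifetimes, and it tolerates both return types.

```python
import pandas as pd

def period_difference(start, end, freq="D"):
    """Hypothetical helper: number of `freq` periods between two dates."""
    delta = pd.to_datetime(end).to_period(freq) - pd.to_datetime(start).to_period(freq)
    # Newer pandas returns an offset with .n; older pandas returned a plain int
    return getattr(delta, "n", delta)

days = period_difference("2017-12-31", "2018-12-31")
print(days)
```

If this prints the same 365 as the duration_holdout column above but the lifetimes call still fails, checking the installed pandas version alongside lifetimes is a reasonable next step.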

PySpark join two RDD results in an empty RDD

I'm a Spark newbie trying to adapt this movie recommendation tutorial (https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html) to my own dataset, but it keeps throwing this error:
ValueError: Can not reduce() empty RDD
This is the function that computes the Root Mean Squared Error of the model :
from math import sqrt
from operator import add

def computeRmse(model, data, n):
    """
    Compute RMSE (Root Mean Squared Error).
    """
    predictions = model.predictAll(data.map(lambda x: (x[0], x[1])))
    print(predictions.count())
    print(predictions.first())
    print("predictions above")
    print(data.count())
    print(data.first())
    print("validation data above")
    # LINE56 (the .join call below)
    predictionsAndRatings = predictions.map(lambda x: ((x[0], x[1]), x[2])) \
        .join(data.map(lambda line: line.split(',')).map(lambda x: ((x[0], x[1]), x[2]))) \
        .values()
    print(predictionsAndRatings.count())
    print("predictions And Ratings above")
    # LINE63
    return sqrt(predictionsAndRatings.map(lambda x: (x[0] - x[1]) ** 2).reduce(add) / float(n))
Here model = ALS.train(training, rank, numIter, lambda), and data is the validation data set.
The training and validation sets originally come from a ratings.txt file in the format: userID,productID,rating,ratingopID
These are parts of the output :
879
...
Rating(user=0, product=656, rating=4.122132631144641)
predictions above
...
1164
...
(u'640085', u'1590', u'5')
validation data above
...
16/08/26 12:47:18 INFO DAGScheduler: Registering RDD 259 (join at /path/myapp/MyappALS.py:56)
16/08/26 12:47:18 INFO DAGScheduler: Got job 20 (count at /path/myapp/MyappALS.py:59) with 12 output partitions
16/08/26 12:47:18 INFO DAGScheduler: Final stage: ResultStage 238 (count at /path/myapp/MyappALS.py:59)
16/08/26 12:47:18 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 237)
16/08/26 12:47:18 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 237)
16/08/26 12:47:18 INFO DAGScheduler: Submitting ShuffleMapStage 237 (PairwiseRDD[259] at join at /path/myapp/MyappALS.py:56), which has no missing parents
....
0
predictions And Ratings above
...
Traceback (most recent call last):
File "/path/myapp/MyappALS.py", line 130, in <module>
validationRmse = computeRmse(model, validation, numValidation)
File "/path/myapp/MyappALS.py", line 63, in computeRmse
return sqrt(predictionsAndRatings.map(lambda x: (x[0] - x[1]) ** 2).reduce(add) / float(n))
File "/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 805, in reduce
ValueError: Can not reduce() empty RDD
So from the count() output I'm sure the initial RDDs are not empty.
Then there is the INFO log Registering RDD 259 (join at /path/myapp/MyappALS.py:56). Does it mean that the join job is launched?
Is there something I'm missing?
Thank you.
The error disappeared when I added int() to the join:
predictionsAndRatings = predictions.map(lambda x: ((x[0], x[1]), x[2])) \
    .join(data.map(lambda x: ((int(x[0]), int(x[1])), int(x[2])))) \
    .values()
We think it's because the predictions come from predictAll, which returns Rating tuples with integer user and product IDs, whereas the other data was parsed manually and its fields were still strings, so the join keys never matched.
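The key mismatch can be seen without Spark. The following plain-Python sketch models an inner join on (user, product) keys with dictionaries; the sample rows and the inner_join helper are hypothetical and only illustrate why string keys and integer keys never match.

```python
# Keys as predictAll produces them: integers
predictions = {(0, 656): 4.12}
# Hypothetical manually parsed validation row: every field is still a string
validation_raw = [(u"0", u"656", u"5")]

def inner_join(left, right):
    """Hypothetical helper: keep only keys present on both sides."""
    return {k: (left[k], right[k]) for k in left if k in right}

as_strings = {(u, p): r for u, p, r in validation_raw}
as_ints = {(int(u), int(p)): int(r) for u, p, r in validation_raw}

print(len(inner_join(predictions, as_strings)))  # 0, because (0, 656) != ('0', '656')
print(len(inner_join(predictions, as_ints)))     # 1, keys match after int()
```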
