PySpark join of two RDDs results in an empty RDD - apache-spark

I'm a Spark newbie trying to adapt this movie recommendation tutorial (https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html) to my own dataset, but it keeps throwing this error:
ValueError: Can not reduce() empty RDD
This is the function that computes the Root Mean Squared Error (RMSE) of the model:
from math import sqrt        # used by computeRmse below
from operator import add     # used by computeRmse below

def computeRmse(model, data, n):
    """
    Compute RMSE (Root Mean Squared Error).
    """
    predictions = model.predictAll(data.map(lambda x: (x[0], x[1])))
    print predictions.count()
    print predictions.first()
    print "predictions above"
    print data.count()
    print data.first()
    print "validation data above"
    # the .join() below is LINE 56 of MyappALS.py
    predictionsAndRatings = predictions.map(lambda x: ((x[0], x[1]), x[2])) \
        .join(data.map(lambda line: line.split(',')).map(lambda x: ((x[0], x[1]), x[2]))) \
        .values()
    print predictionsAndRatings.count()
    print "predictions And Ratings above"
    # LINE 63 of MyappALS.py
    return sqrt(predictionsAndRatings.map(lambda x: (x[0] - x[1]) ** 2).reduce(add) / float(n))
Here model = ALS.train(training, rank, numIter, lmbda) (the fourth argument is the regularization parameter; lambda itself is a reserved word in Python), and data is the validation data set.
The training and validation sets originally come from a ratings.txt file whose lines are in the format: userID,productID,rating,ratingopID
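For reference, a minimal sketch of how a line of such a file is typically parsed (illustrative only; note that split(',') yields unicode strings rather than numbers, which turns out to matter below):

# illustrative parsing sketch; the field order is taken from the format above
ratings_raw = sc.textFile("ratings.txt")
ratings = ratings_raw.map(lambda line: line.split(',')) \
    .map(lambda f: (f[0], f[1], f[2]))   # userID, productID, rating -- all still strings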
These are parts of the output:
879
...
Rating(user=0, product=656, rating=4.122132631144641)
predictions above
...
1164
...
(u'640085', u'1590', u'5')
validation data above
...
16/08/26 12:47:18 INFO DAGScheduler: Registering RDD 259 (join at /path/myapp/MyappALS.py:56)
16/08/26 12:47:18 INFO DAGScheduler: Got job 20 (count at /path/myapp/MyappALS.py:59) with 12 output partitions
16/08/26 12:47:18 INFO DAGScheduler: Final stage: ResultStage 238 (count at /path/myapp/MyappALS.py:59)
16/08/26 12:47:18 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 237)
16/08/26 12:47:18 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 237)
16/08/26 12:47:18 INFO DAGScheduler: Submitting ShuffleMapStage 237 (PairwiseRDD[259] at join at /path/myapp/MyappALS.py:56), which has no missing parents
....
0
predictions And Ratings above
...
Traceback (most recent call last):
File "/path/myapp/MyappALS.py", line 130, in <module>
validationRmse = computeRmse(model, validation, numValidation)
File "/path/myapp/MyappALS.py", line 63, in computeRmse
return sqrt(predictionsAndRatings.map(lambda x: (x[0] - x[1]) ** 2).reduce(add) / float(n))
File "/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 805, in reduce
ValueError: Can not reduce() empty RDD
So from count() I am sure the initial RDDs are not empty.
And the INFO log Registering RDD 259 (join at /path/myapp/MyappALS.py:56): does it mean that the join job is launched?
Is there something I am missing?
Thank you.

That error disappeared when I added int() to:
predictionsAndRatings = predictions.map(lambda x: ((x[0], x[1]), x[2])) \
    .join(data.map(lambda x: ((int(x[0]), int(x[1])), int(x[2])))) \
    .values()
We think it is because predictions comes out of predictAll, which yields Rating tuples with integer user and product IDs, while the other RDD (data) was parsed manually and therefore still had string keys; a join on keys of mismatched types matches nothing.
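A minimal sketch of the underlying issue (the values below are made up, not the asker's data): join only matches keys that compare equal, and a unicode key like u'656' is never equal to the integer 656, so every pair drops out until both sides use int keys.

# toy RDDs illustrating the key-type mismatch (values are made up)
preds = sc.parallelize([((0, 656), 4.12)])             # int keys, as predictAll returns
truth_str = sc.parallelize([((u'0', u'656'), u'5')])   # string keys from the manually parsed file
truth_int = sc.parallelize([((0, 656), 5)])            # the same keys cast with int()

print(preds.join(truth_str).count())   # 0 -- no keys match across types
print(preds.join(truth_int).count())   # 1 -- keys now compare equal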

Related

NetworkX find_cliques error using PySpark

I'm trying to use the find_cliques functionality to locate the maximal cliques of each subgroup.
I'm using the implementation below, a pandas_udf grouped by each connected component.
import pandas as pd
import networkx as nx
from networkx import find_cliques

def pd_create_subgroups(pdf):
    index = pdf.component.unique()[0]
    try:
        # building the graph for this connected component
        gnx = nx.from_pandas_edgelist(pdf, "src", "dst")
        bic = list(find_cliques(gnx))
        if len(bic) <= 1:
            return pd.DataFrame(data={"cliques": [[f"issue_{index}"]]})
        bic_sorted = sorted(map(sorted, bic))
        bic_sorted = [b for b in bic_sorted if len(b) >= 3]
        if len(bic_sorted) == 0:
            return pd.DataFrame(data={"cliques": [[f"issue_{index}"]]})
        return pd.DataFrame([bic_sorted]).transpose().rename(columns={0: "cliques"})
    except Exception:
        return pd.DataFrame(data={"cliques": [[f"issue_{index}"]]})
pdf is a pandas DataFrame containing the fields src, dst and component;
the full edge list has around 200M-300M undirected edges,
and the job returns the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in stage 12.0 (TID 331) (executor 9): java.lang.IndexOutOfBoundsException: index: 2147483628, length: 36 (expected: range(0, 2147483648))
When running on smaller graphs it works properly.
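For context, a minimal sketch of how such a grouped pandas UDF is typically wired up on Spark 3.x (edges_df, the output schema and the column type below are assumptions, not the asker's exact code):

from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# assumed output schema: one column holding each clique as an array of node ids
out_schema = StructType([StructField("cliques", ArrayType(StringType()))])

# edges_df is assumed to have the columns src, dst and component
cliques_df = (
    edges_df
    .groupBy("component")
    .applyInPandas(pd_create_subgroups, schema=out_schema)
)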

SUOD model gives ValueError: Input contains NaN

I am running SUOD from pyod, which is an ensemble method, and received this error.
The models that I am running are IForest, COPOD and ECOD.
Running these models individually does not complain that the data has NaN values in it. I have also already verified whether any of the columns has NaN values, and none does. The data is one-hot encoded.
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done 2 out of 2 | elapsed: 1.0min remaining: 0.0s
[Parallel(n_jobs=2)]: Done 2 out of 2 | elapsed: 1.0min finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done 2 out of 2 | elapsed: 5.8s remaining: 0.0s
[Parallel(n_jobs=2)]: Done 2 out of 2 | elapsed: 5.8s finished
Traceback (most recent call last):
File "ensemble.py", line 76, in <module>
clf.fit(x_train_scaled)
File "/home/ubuntu/thesis/lib/python3.8/site-packages/pyod/models/suod.py", line 220, in fit
decision_score_mat, self.score_scalar_ = standardizer(
File "/home/ubuntu/thesis/lib/python3.8/site-packages/pyod/utils/utility.py", line 152, in standardizer
X = check_array(X)
File "/home/ubuntu/thesis/lib/python3.8/site-packages/sklearn/utils/validation.py", line 919, in check_array
_assert_all_finite(
File "/home/ubuntu/thesis/lib/python3.8/site-packages/sklearn/utils/validation.py", line 161, in _assert_all_finite
raise ValueError(msg_err)
ValueError: Input contains NaN.
and this is my code:
train_data.dropna(axis=0)
test1_data.dropna(axis=0)
test2_data.dropna(axis=0)

mm_scaler = MinMaxScaler()
x_train_scaled = mm_scaler.fit_transform(train_data)
x_test2_scaled = mm_scaler.transform(test2_data)
x_test1_scaled = mm_scaler.transform(test1_data)

detector_list = [COPOD(),
                 IForest(n_estimators=100, max_samples=10000, max_features=10,
                         bootstrap=True, n_jobs=-1, random_state=42),
                 IForest(n_estimators=200, max_samples=10000, max_features=10,
                         bootstrap=True, n_jobs=-1, random_state=42),
                 ECOD(contamination=0.001)]

clf = SUOD(base_estimators=detector_list, n_jobs=2, combination='average',
           verbose=False)

clf.fit(x_train_scaled)
train_pred = clf.predict(x_train_scaled)
test_pred1 = clf.predict(x_test1_scaled)
test_pred2 = clf.predict(x_test2_scaled)
Things that I have tried:
SimpleImputer
dropping NaN rows
adding the mock patch
As the error output says, you need to handle the NaN values. The dropna method returns a new DataFrame and leaves the original untouched. If you want to modify it in place, set the parameter inplace to True and the operation is done in place (the call then returns None):
inplace : boolean, default False
So to modify it in place: data.dropna(axis=0, how='any', inplace=True)
Another possible way to handle NaN values (optional, and only if applicable to your problem or data) is to fill NaN inputs with a column statistic such as the mean: df = df.fillna(df.mean()).
Another (uncommon) case is that your DataFrame contains NaN values represented as strings such as "NaN"; then the functions that manage NaN values won't work, and you need something like df.replace("NaN", numpy.nan) before dropping.
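Applied to the code in the question, a minimal sketch of the dropna fix described above (either keep the returned DataFrame, as shown here, or pass inplace=True):

from sklearn.preprocessing import MinMaxScaler

# dropna returns a new DataFrame, so keep the result (or use inplace=True)
train_data = train_data.dropna(axis=0)
test1_data = test1_data.dropna(axis=0)
test2_data = test2_data.dropna(axis=0)

mm_scaler = MinMaxScaler()
x_train_scaled = mm_scaler.fit_transform(train_data)
x_test1_scaled = mm_scaler.transform(test1_data)
x_test2_scaled = mm_scaler.transform(test2_data)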

Retrieve checkpointed DataFrame in pySpark

Context
Assume I have a DataFrame I checkpointed.
# spark = SparkSession
# sc = SparkContext
sc.setCheckpointDir("hdfs:/some_path/")
df = spark.createDataFrame(
    [(1, "A"), (2, "B")],
    ["A", "B"]
)
df = df.checkpoint(eager=True)
# Do whatever you want to do
Now assume the process crashed, the SparkContext is gone, and you want to recover the data you checkpointed. It should be retrievable like this:
import pyspark.serializers as s

recovered_df = sc._checkpointFile(
    "hdfs:/some_path/checkpoint/",
    s.PickleSerializer
)
result = recovered_df.collect()
An analogous example of how to do this in Scala can be found here.
Problem
recovered_df is returned as a ReliableCheckpointRDD. Expressed as a Scala class, this should be a ReliableCheckpointRDD[InternalRow], but it is a ReliableCheckpointRDD[UnsafeRow].
The error from recovered_df.collect() shows the problem:
Exception in thread "serve RDD 8" org.apache.spark.SparkException:
Unexpected element type class
org.apache.spark.sql.catalyst.expressions.UnsafeRow
The exception is thrown at this point:
TypeError                                 Traceback (most recent call last)
in engine
----> 1 rdd.collect()

/opt/cloudera/parcels/CDH/lib/spark/python/pyspark/rdd.py in collect(self)
    815         with SCCallSiteSync(self.context) as css:
    816             sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
--> 817         return list(_load_from_socket(sock_info, self._jrdd_deserializer))
    818
    819     def reduce(self, f):

/opt/cloudera/parcels/CDH/lib/spark/python/pyspark/rdd.py in _load_from_socket(sock_info, serializer)
    147     sock.settimeout(None)
    148     # The socket will be automatically closed when garbage-collected.
--> 149     return serializer.load_stream(sockfile)
    150
    151

TypeError: load_stream() missing 1 required positional argument: 'stream'
Question
How can I successfully recover my DataFrame (or a readable RDD) in pySpark?
Optionally, if it is required for the solution: how can I convert the UnsafeRow into an InternalRow?

About "spark.sql.objectHashAggregate.sortBased.fallbackThreshold"

While running one of my Spark stages,
I see the unexpected log message below on the executors, and the stage stops making progress.
Could anyone tell me what is happening when this message occurs?
And are there any limitations on setting this configuration value (spark.sql.objectHashAggregate.sortBased.fallbackThreshold)?
# encountered message on executor
21/04/29 07:25:58 INFO ObjectAggregationIterator: Aggregation hash map size 128 reaches threshold capacity (128 entries), spilling and falling back to sort based aggregation. You may change the threshold by adjust option spark.sql.objectHashAggregate.sortBased.fallbackThreshold
21/04/29 07:26:51 INFO PythonUDFRunner: Times: total = 87019, boot = -361, init = 380, finish = 87000
21/04/29 07:26:51 INFO MemoryStore: Block rdd_36_1765 stored as values in memory (estimated size 19.6 MB, free 5.2 GB)
21/04/29 07:26:53 INFO PythonRunner: Times: total = 2154, boot = 6, init = 1, finish = 2147
21/04/29 07:26:53 INFO Executor: Finished task 1765.0 in stage 6.0 (TID 11172). 5310 bytes result sent to driver
21/04/29 07:27:33 INFO PythonUDFRunner: Times: total = 93086, boot = -461, init = 480, finish = 93067
21/04/29 07:27:33 INFO MemoryStore: Block rdd_36_1792 stored as values in memory (estimated size 19.7 MB, free 5.2 GB)
21/04/29 07:27:35 INFO PythonRunner: Times: total = 1999, boot = -40047, init = 40051, finish = 1995
21/04/29 07:27:35 INFO Executor: Finished task 1792.0 in stage 6.0 (TID 11199). 5267 bytes result sent to driver
21/04/29 07:27:35 INFO PythonUDFRunner: Times: total = 97305, boot = -313, init = 350, finish = 97268
21/04/29 07:27:35 INFO MemoryStore: Block rdd_36_1789 stored as values in memory (estimated size 19.7 MB, free 5.3 GB)
21/04/29 07:27:37 INFO PythonRunner: Times: total = 1928, boot = -2217, init = 2220, finish = 1925
21/04/29 07:27:37 INFO Executor: Finished task 1789.0 in stage 6.0 (TID 11196). 5310 bytes result sent to driver
# about Spark Stage I did
#
# given dataframe is
# (uid is given by monotonically_increasing_id)
# |-------|-------|
# | uids | score |
# |-------|-------|
# |[1,2,3]| 50 |
# |[1,2] | 70 |
# |[1] | 90 |
#
# expected result
# |-------|-------|
# | uid | score |
# |-------|-------|
# | 1 | 90 |
# | 2 | 70 |
# | 3 | 50 |
rdd = df.select(F.explode('uids').alias('uid'), 'score') \
    .rdd.map(lambda x: (x['uid'], x)) \
    .reduceByKey(func=max, numPartitions=1800) \
    .cache()
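For reference, the quoted INFO message itself describes what happened: the object-hash aggregation map reached its configured capacity of 128 entries, so Spark spilled and fell back to sort-based aggregation. The threshold it names is a runtime SQL configuration; a minimal sketch of raising it, assuming your Spark version allows setting it at runtime (the value 10000 is only an illustration, and a larger value trades executor memory for fewer spills):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# raise the in-memory entry limit before the sort-based fallback kicks in
spark.conf.set("spark.sql.objectHashAggregate.sortBased.fallbackThreshold", 10000)

The same value can also be passed at submit time via --conf.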

Issue with using .take() function with spark 2+ pyspark

This is the code I am using. It runs fine without data.take, but gives an error when using it in PySpark:
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

data = sc.textFile("re_u.data")
pData = data.take(2000)
ratings = pData.map(lambda l: l.split(',')) \
    .map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))
Gives error:
AttributeError Traceback (most recent call last)
<ipython-input-12-c9c51af1b2e9> in <module>
2 data = sc.textFile("re_u.data")
3 pData=data.take(2000)
----> 4 ratings = pData.map(lambda l: l.split(','))\
5 .map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))
AttributeError: 'list' object has no attribute 'map'
Update:
After using your change, @Hristo Iliev, it helped, but I ran into another issue that follows from ratings being a list. Thank you for your help!
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

data = sc.textFile("re_u.data")
ratings = data.map(lambda l: l.split(',')) \
    .map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2]))) \
    .take(2000)

rank = 20
numIterations = 20
model = ALS.train(ratings, rank, numIterations)
Gives error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-24-7e35afff970b> in <module>
1 rank = 20
2 numIterations = 20
----> 3 model = ALS.train(ratings, rank, numIterations)
C:\spark\spark-3.0.0-preview2-bin-hadoop2.7\python\pyspark\mllib\recommendation.py in train(cls, ratings, rank, iterations, lambda_, blocks, nonnegative, seed)
271 (default: None)
272 """
--> 273 model = callMLlibFunc("trainALSModel", cls._prepare(ratings), rank, iterations,
274 lambda_, blocks, nonnegative, seed)
275 return MatrixFactorizationModel(model)
C:\spark\spark-3.0.0-preview2-bin-hadoop2.7\python\pyspark\mllib\recommendation.py in _prepare(cls, ratings)
227 else:
228 raise TypeError("Ratings should be represented by either an RDD or a DataFrame, "
--> 229 "but got %s." % type(ratings))
230 first = ratings.first()
231 if isinstance(first, Rating):
TypeError: Ratings should be represented by either an RDD or a DataFrame, but got <class 'list'>.
Please help!
take() is an action that takes the specified number of elements from the top of an RDD and transfers them to the driver program. What you get from it is a Python list with the requested elements, which:
is local to the driver, hence you should not take too many elements
doesn't have a map() method, simply because the Python list class has no map() method
What you most likely want to do is to first apply the transformations to the data RDD and take() from the transformed RDD:
data = sc.textFile("re_u.data")
ratings = data.map(lambda l: l.split(',')) \
    .map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2]))) \
    .take(2000)
You'll get a list of Rating instances.
Since you pass the data further down to ALS, which takes distributed data, i.e. an RDD, and not a driver-local list, you have three choices:
Parallelise the list again, turning it into an RDD:
data = sc.textFile("re_u.data")
ratings = data.map(lambda l: l.split(',')) \
    .map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2]))) \
    .take(2000)
ratingsRDD = sc.parallelize(ratings)

rank = 20
numIterations = 20
model = ALS.train(ratingsRDD, rank, numIterations)
Use the sample() method to sample a subset of the data in the RDD:
data = sc.textFile("re_u.data")
ratings = data.map(lambda l: l.split(',')) \
    .map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2]))) \
    .sample(False, 0.1, 42)

rank = 20
numIterations = 20
model = ALS.train(ratings, rank, numIterations)
Here sample(False, 0.1, 42) means take approximately 10% of the original data and use 42 as the seed of the pseudorandom number generator. Fixing the seed will allow for reproducibility while testing. You should adjust 0.1 to the proper value so you get about 2000 samples. Notice that those samples will be taken from random places inside the RDD and will most likely not be the first 2000.
Emulate take() while staying within the realm of RDDs:
data = sc.textFile("re_u.data")
ratings = data.map(lambda l: l.split(',')) \
    .map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2]))) \
    .zipWithIndex() \
    .filter(lambda l: l[1] < 2000) \
    .map(lambda l: l[0])

rank = 20
numIterations = 20
model = ALS.train(ratings, rank, numIterations)
zipWithIndex() creates tuples of the RDD content where the first element comes from the RDD and the second one is the index in the RDD (essentially, the line number). You can then filter only elements with index less than 2000 and then get rid of the index using map(lambda l: l[0]).
Method 2 is probably the best one.
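If you go with method 2, one way to pick the fraction instead of hard-coding 0.1 is to derive it from the RDD's size; a small sketch of that (note it adds an extra count() pass over the data, and the sample size is still only approximate):

# derive a sampling fraction that yields roughly 2000 rows
target = 2000
total = data.count()
fraction = min(1.0, float(target) / total)

ratings = data.map(lambda l: l.split(',')) \
    .map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2]))) \
    .sample(False, fraction, 42)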
