How to efficiently deal with nested data in PySpark? - python-3.x

I ran into a situation where I found that collect_list in Spark is not efficient when the items being collected are already lists.
Basically, I am trying to calculate the mean of nested lists (the size of each list is guaranteed to be the same). When the data set grows to, for example, 10 M rows, it produces out-of-memory errors. Originally, I thought it had something to do with the UDF (used to calculate the mean), but I found that the aggregation part (collect_list of lists) is the real problem.
What I am doing now is to divide the 10 M rows into multiple blocks (by 'user'), aggregate each block individually, and then union them at the end. Are there any better suggestions for efficiently dealing with nested data?
A toy example looks like this:
from pyspark.sql import functions as f
from pyspark.sql.types import ArrayType, DoubleType

data = [('user1', 'place1', ['place1', 'place2', 'place3'], [0.0, 0.5, 0.4], [0.0, 0.4, 0.3]),
        ('user1', 'place2', ['place1', 'place2', 'place3'], [0.7, 0.0, 0.4], [0.6, 0.0, 0.3]),
        ('user2', 'place1', ['place1', 'place2', 'place3'], [0.0, 0.4, 0.3], [0.0, 0.3, 0.4]),
        ('user2', 'place3', ['place1', 'place2', 'place3'], [0.1, 0.2, 0.0], [0.3, 0.1, 0.0]),
        ('user3', 'place2', ['place1', 'place2', 'place3'], [0.3, 0.0, 0.4], [0.2, 0.0, 0.4]),
        ]
data_df = sparkApp.sparkSession.createDataFrame(data, ['user', 'place', 'places', 'data1', 'data2'])
data_agg = data_df.groupBy('user') \
    .agg(f.collect_list('place').alias('place_list'),
         f.first('places').alias('places'),
         f.collect_list('data1').alias('data1'),
         f.collect_list('data2').alias('data2'),
         )
import numpy as np

def average_values(sim_vectors):
    # Element-wise mean of a list of equal-length vectors, rounded to 3 decimals.
    if len(sim_vectors) == 1:
        return sim_vectors[0]
    mat = np.array(sim_vectors)
    mean_vector = np.mean(mat, axis=0)
    return np.round(mean_vector, 3).tolist()

avg_vectors_udf = f.udf(average_values, ArrayType(DoubleType()))
data_agg_ave = data_agg.withColumn('data1', avg_vectors_udf('data1')) \
    .withColumn('data2', avg_vectors_udf('data2'))
The result would be:
+-----+----------------+--------------------+-----------------+----------------+
| user|      place_list|              places|            data1|           data2|
+-----+----------------+--------------------+-----------------+----------------+
|user1|[place1, place2]|[place1, place2, ...|[0.35, 0.25, 0.4]| [0.3, 0.2, 0.3]|
|user3|        [place2]|[place1, place2, ...|  [0.3, 0.0, 0.4]| [0.2, 0.0, 0.4]|
|user2|[place1, place3]|[place1, place2, ...|[0.05, 0.3, 0.15]|[0.15, 0.2, 0.2]|
+-----+----------------+--------------------+-----------------+----------------+
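One possible alternative (a sketch, not from the original post) is to avoid collecting whole vectors per user at all: explode the arrays to one row per (user, position), let Spark average the scalar values natively, and then reassemble the per-position means into arrays. This assumes Spark 2.4+ (for arrays_zip and array_sort) and reuses the column names from the toy example; the other columns (place_list, places) can still be aggregated as before.
exploded = data_df.select(
    'user',
    f.posexplode(f.arrays_zip('data1', 'data2')).alias('pos', 'vals'))

data_agg_alt = (exploded
    .groupBy('user', 'pos')
    .agg(f.avg(f.col('vals.data1')).alias('m1'),
         f.avg(f.col('vals.data2')).alias('m2'))
    .groupBy('user')
    # sort the collected structs by position so the rebuilt arrays keep their order
    .agg(f.array_sort(f.collect_list(f.struct('pos', 'm1', 'm2'))).alias('s'))
    .select('user',
            f.col('s.m1').alias('data1'),
            f.col('s.m2').alias('data2')))
This keeps each shuffled value scalar-sized instead of list-of-lists-sized, which is usually gentler on memory, though I have not benchmarked it on 10 M rows.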

Related

I am getting ValueError: The estimator Sequential should be a classifier. how can I solve it?

I am using VotingClassifier with 31 pre-trained models. When I tried to do voting with VotingClassifier, I got this error: ValueError: The estimator Sequential should be a classifier. The code is shown below:
estimators = [("EfficientNetB0_model",EfficientNetB0_model),("EfficientNetB1_model",EfficientNetB1_model),("DenseNet121_model",DenseNet121_model),
("DenseNet169_model",DenseNet169_model),("DenseNet201_model",DenseNet201_model),("EfficientNetB2_model",EfficientNetB2_model),
("EfficientNetB3_model",EfficientNetB3_model),("EfficientNetB4_model",EfficientNetB4_model),("EfficientNetB5_model",EfficientNetB5_model),
("EfficientNetB6_model",EfficientNetB6_model),("EfficientNetB7_model",EfficientNetB7_model),("EfficientNetV2B0_model",EfficientNetV2B0_model),
("EfficientNetV2B1_model",EfficientNetV2B1_model),("EfficientNetV2B2_model",EfficientNetV2B2_model),("EfficientNetV2B3_model",EfficientNetV2B3_model),
("EfficientNetV2L_model",EfficientNetV2L_model),("EfficientNetV2M_model",EfficientNetV2M_model),("EfficientNetV2S_model",EfficientNetV2S_model),
("InceptionResNetV2_model",InceptionResNetV2_model),("InceptionV3_model",InceptionV3_model),
("ResNet50_model",ResNet50_model),("ResNet50V2_model",ResNet50V2_model),("ResNet101_model",ResNet101_model),
("ResNet101V2_model",ResNet101V2_model),("ResNet152_model",ResNet152_model),("ResNet152V2_model",ResNet152V2_model),
("VGG16_model",VGG16_model),("VGG19_model",VGG19_model),("Xception_model",Xception_model),
("MobileNet_model",MobileNet_model),("MobileNetV2_model",MobileNetV2_model)]
weights = [0.2, 0.3, 0.0, 0.1, 0.0,
0.3, 0.2, 0.1, 0.0, 0.3,
0.1, 0.3, 0.3, 0.1, 0.0,
0.1, 0.2, 0.1, 0.1, 0.1,
0.4, 0.0, 0.2, 0.1, 0.4,
0.0, 0.0, 0.1, 0.1, 0.0, 0.0
]
ensemble = VotingClassifier(estimators, weights=weights, voting='soft')
ensemble._estimator_type = "classifier"
ensemble = ensemble.fit(X_train, y_train)
print(ensemble.predict(X_test))
Could you help me? I could not find any solution for this. Thank you.
Are there any other ways to do voting?
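Not part of the original post, but one possible workaround (a sketch, assuming each entry in estimators holds a trained Keras model whose predict() returns class probabilities, and that weights matches the model order) is to do the weighted soft vote manually with NumPy instead of scikit-learn's VotingClassifier:
import numpy as np

def soft_vote(models, weights, X):
    # Weighted average of each model's predicted probability matrix,
    # then pick the class with the highest averaged probability.
    probs = np.average([m.predict(X) for m in models], axis=0, weights=weights)
    return np.argmax(probs, axis=1)

# y_pred = soft_vote([m for _, m in estimators], weights, X_test)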

How to use fit_transform with an array?

Example of array content:
[
[4.9, 3.0, 1.4, 0.2, 0.0, 2.0],
[4.7, 3.2, 1.3, 0.2, 0.0, 2.0],
[4.6, 3.1, 1.5, 0.2, 0.0, 2.0],
...
]
model = TSNE(learning_rate=100)
transformed = model.fit_transform(data)
I'm trying to apply tSNE to a float array, but I get an error. What should I change?
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (149,) + inhomogeneous part.
The error means the rows of your array do not all have the same length, so NumPy cannot build a homogeneous 2-D array from it, which TSNE requires. Try this example:
from sklearn.manifold import TSNE
import numpy as np
X = np.array([[4.9, 3.0, 1.4, 0.2, 0.0, 2.0], [4.7, 3.2, 1.3, 0.2, 0.0, 2.0]])
model = TSNE(learning_rate=100)
transformed = model.fit_transform(X)
print(transformed)
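If the error persists, the original list is likely ragged. A quick check (a sketch, assuming data is the list of lists from the question) is to look at the set of row lengths; more than one distinct length means some rows must be fixed or dropped before calling fit_transform:
lengths = {len(row) for row in data}
print(lengths)  # more than one value here means the rows are not all the same length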

removing nested list elements with all zero entries

In a nested list (list of lists), how can I remove the elements whose entries are all zero?
For instance: values =
[[1.1, 3.0], [2.5, 5.2], [4.7, 8.2], [69.2, 36.6], [0.7, 0.0], [0.0, 0.0], [0.4, 17.9], [14.7, 29.1], [6.8, 0.0], [0.0, 0.0]]
should change to
[[1.1, 3.0], [2.5, 5.2], [4.7, 8.2], [69.2, 36.6], [0.7, 0.0], [0.4, 17.9], [14.7, 29.1], [6.8, 0.0]]
Note: the nested lists could have n elements each, not just 2.
I am trying to use this to crop two other lists as well, something like:
for label, color, value in zip(labels, colors, values):
    if any(value) in values:  # this check needs updating
        new_labels.append(label)
        new_colors.append(color)
Take advantage of the fact that 0.0 is falsy and filter using any():
result = [sublist for sublist in values if any(sublist)]
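To also crop the other two lists, as in the loop above, the same any() test can drive all three appends (a sketch, assuming labels, colors, and values are defined as in the question):
new_labels, new_colors, new_values = [], [], []
for label, color, value in zip(labels, colors, values):
    if any(value):  # keep the triple if the value row has any nonzero entry
        new_labels.append(label)
        new_colors.append(color)
        new_values.append(value)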

Change colors in colormap based on range of values

Is it possible to set the lower and/or upper parts of a colorbar based on ranges of values? For example, given the ROYGBIV colormap below and optionally an offset and a range value, I'd like to change the colors below offset and/or above range. In other words, with offset = 20 and range = 72, I'd like to color all values less than or equal to 20 in black and all values greater than or equal to 72 in white. I'm aware of the methods set_under and set_over, but (as far as I know) they require changing the parameters vmin and vmax, which is not what I want. I want to keep the original minimum and maximum values (e.g., vmin = 0 and vmax = 100) and only (optionally) change the colors of the extremities.
import matplotlib.colors
import numpy as np

ROYGBIV = {
    "blue": ((0.0, 1.0, 1.0),
             (0.167, 1.0, 1.0),
             (0.333, 1.0, 1.0),
             (0.5, 0.0, 0.0),
             (0.667, 0.0, 0.0),
             (0.833, 0.0, 0.0),
             (1.0, 0.0, 0.0)),
    "green": ((0.0, 0.0, 0.0),
              (0.167, 0.0, 0.0),
              (0.333, 0.0, 0.0),
              (0.5, 1.0, 1.0),
              (0.667, 1.0, 1.0),
              (0.833, 0.498, 0.498),
              (1.0, 0.0, 0.0)),
    "red": ((0.0, 0.5608, 0.5608),
            (0.167, 0.4353, 0.4353),
            (0.333, 0.0, 0.0),
            (0.5, 0.0, 0.0),
            (0.667, 1.0, 1.0),
            (0.833, 1.0, 1.0),
            (1.0, 1.0, 1.0))
}
rainbow_mod = matplotlib.colors.LinearSegmentedColormap("rainbow_mod", ROYGBIV, 256)
I found one way to do it using ListedColormap as explained here. The basic idea is to obtain the RGBA lists/tuples of the colors in the LinearSegmentedColormap object (numpy array) and replace the first or last few lists with replicates of the desired color.
It looks something like this:
under_color = [0.0, 0.0, 0.0, 1.0] # black (alpha = 1.0)
over_color = [1.0, 1.0, 1.0, 1.0] # white (alpha = 1.0)
all_colors = rainbow_mod(np.linspace(0, 1, 256))
vmin = 0.0
vmax = 100.0
all_colors[:int(np.round((20.0 - vmin) / (vmax - vmin) * 256)), :] = under_color
all_colors[int(np.round((72.0 - vmin) / (vmax - vmin) * 256)):, :] = over_color
rainbow_mod_list = matplotlib.colors.ListedColormap(all_colors.tolist())
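A quick usage sketch (not from the original post; the random data and plotting calls are only for illustration) showing that the original data limits are kept while the extremities render black and white:
import matplotlib.pyplot as plt

data = np.random.uniform(0, 100, (10, 10))  # made-up data for illustration
plt.imshow(data, cmap=rainbow_mod_list, vmin=0.0, vmax=100.0)
plt.colorbar()
plt.show()
Values at or below 20 map to black and values at or above 72 map to white, while the colorbar still spans 0 to 100.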

Optimization for faster numpy 'where' with boolean condition

I generate a bunch of 5-element vectors with:
def beam(n):
    # For performance considerations, see
    # https://software.intel.com/en-us/blogs/2016/06/15/faster-random-number-generation-in-intel-distribution-for-python
    try:
        import numpy.random_intel
        generator = numpy.random_intel.multivariate_normal
    except ModuleNotFoundError:
        import numpy.random
        generator = numpy.random.multivariate_normal
    return generator(
        [0.0, 0.0, 0.0, 0.0, 0.0],
        numpy.array([
            [1.0, 0.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 0.0, 0.2]
        ]),
        int(n)
    )
These vectors will be multiplied by 5x5 matrices (element-wise) and checked against boundaries. I use this:
b = beam(1e5)
bound = 1000
s = (b[:, 0]**2 + b[:, 3]**2) < bound**2
#b[np.where(s)] (equivalent performances)
b[s] # <= returned value from a function
This operation on 100k elements is quite time consuming (3 ms on my machine). Is there an obvious (or less obvious) way to speed it up (the where part; the random generation is only there to give an example)?
As your components are uncorrelated, one obvious speedup would be to use the univariate normal instead of the multivariate:
>>> from timeit import repeat
>>> import numpy as np
>>>
>>> kwds = dict(globals=globals(), number=100)
>>>
>>> repeat('np.random.multivariate_normal(np.zeros((5,)), np.diag((1,1,1,1,0.2)), (100,))', **kwds)
[0.01475344318896532, 0.01471381587907672, 0.013099645031616092]
>>> repeat('np.random.normal((0,0,0,0,0), (1,1,1,1,np.sqrt(0.2)), (100, 5))', **kwds)
[0.003930734936147928, 0.004097769036889076, 0.004246715921908617]
Further, as it stands, your condition is extremely unlikely to fail (the components are standard normal while the bound is 1000), so you can check s.all() first and skip the filtering entirely when it is True.
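A sketch of that shortcut, using the names from the question (the else branch only pays for the boolean-indexing copy in the rare case where something is actually out of bounds):
b = beam(1e5)
bound = 1000
s = (b[:, 0]**2 + b[:, 3]**2) < bound**2
filtered = b if s.all() else b[s]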
