How to use fit_transform with an array?

Example of array content:
[
[4.9, 3.0, 1.4, 0.2, 0.0, 2.0],
[4.7, 3.2, 1.3, 0.2, 0.0, 2.0],
[4.6, 3.1, 1.5, 0.2, 0.0, 2.0],
...
]
model = TSNE(learning_rate=100)
transformed = model.fit_transform(data)
I'm trying to apply t-SNE to a float array, but I get an error. What should I change?
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (149,) + inhomogeneous part.

The error means the rows of your data do not all have the same length, so NumPy cannot build a 2-D array from them (the shape (149,) plus an "inhomogeneous part" points at a ragged row). With rows of equal length, fit_transform works as expected. Try this example:
from sklearn.manifold import TSNE
import numpy as np
X = np.array([[4.9, 3.0, 1.4, 0.2, 0.0, 2.0], [4.7, 3.2, 1.3, 0.2, 0.0, 2.0]])
model = TSNE(learning_rate=100)
transformed = model.fit_transform(X)
print(transformed)
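If the error persists, the input almost certainly contains a ragged row. A quick check, as a minimal sketch, assuming data is your list of rows and 6 is the expected row length:
from collections import Counter
# a clean 2-D dataset has exactly one row length
print(Counter(len(row) for row in data))
# indices of the rows that deviate from the expected length
print([i for i, row in enumerate(data) if len(row) != 6])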

Related

I am getting ValueError: The estimator Sequential should be a classifier. How can I solve it?

I am using VotingClassifier for 31 pre-trained models. When I tried to do voting with VotingClassifier, I got this error: ValueError: The estimator Sequential should be a classifier. The code is shown below:
estimators = [("EfficientNetB0_model",EfficientNetB0_model),("EfficientNetB1_model",EfficientNetB1_model),("DenseNet121_model",DenseNet121_model),
("DenseNet169_model",DenseNet169_model),("DenseNet201_model",DenseNet201_model),("EfficientNetB2_model",EfficientNetB2_model),
("EfficientNetB3_model",EfficientNetB3_model),("EfficientNetB4_model",EfficientNetB4_model),("EfficientNetB5_model",EfficientNetB5_model),
("EfficientNetB6_model",EfficientNetB6_model),("EfficientNetB7_model",EfficientNetB7_model),("EfficientNetV2B0_model",EfficientNetV2B0_model),
("EfficientNetV2B1_model",EfficientNetV2B1_model),("EfficientNetV2B2_model",EfficientNetV2B2_model),("EfficientNetV2B3_model",EfficientNetV2B3_model),
("EfficientNetV2L_model",EfficientNetV2L_model),("EfficientNetV2M_model",EfficientNetV2M_model),("EfficientNetV2S_model",EfficientNetV2S_model),
("InceptionResNetV2_model",InceptionResNetV2_model),("InceptionV3_model",InceptionV3_model),
("ResNet50_model",ResNet50_model),("ResNet50V2_model",ResNet50V2_model),("ResNet101_model",ResNet101_model),
("ResNet101V2_model",ResNet101V2_model),("ResNet152_model",ResNet152_model),("ResNet152V2_model",ResNet152V2_model),
("VGG16_model",VGG16_model),("VGG19_model",VGG19_model),("Xception_model",Xception_model),
("MobileNet_model",MobileNet_model),("MobileNetV2_model",MobileNetV2_model)]
weights = [0.2, 0.3, 0.0, 0.1, 0.0,
0.3, 0.2, 0.1, 0.0, 0.3,
0.1, 0.3, 0.3, 0.1, 0.0,
0.1, 0.2, 0.1, 0.1, 0.1,
0.4, 0.0, 0.2, 0.1, 0.4,
0.0, 0.0, 0.1, 0.1, 0.0, 0.0
]
ensemble = VotingClassifier(estimators, weights=weights, voting='soft')
ensemble._estimator_type = "classifier"
ensemble = ensemble.fit(X_train, y_train)
print(ensemble.predict(X_test))
Could you help me? I could not find any solution for this. Thank you.
Is there any other way to do voting?
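For background (not from the original thread): scikit-learn's VotingClassifier validates each estimator with its is_classifier check, which a Keras Sequential model fails because it is not a scikit-learn estimator, and setting _estimator_type on the ensemble afterwards does not change that per-estimator check. One workaround is to compute the weighted soft vote by hand. A minimal sketch, assuming every model returns class probabilities from predict (e.g. a softmax output layer) and that estimators, weights, and X_test are as defined above:
import numpy as np

def soft_vote(models, weights, X):
    # weighted average of each model's class-probability output
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize the weights so they sum to 1
    avg = sum(wi * m.predict(X) for m, wi in zip(models, w))
    return np.argmax(avg, axis=1)  # most probable class per sample

models = [m for _, m in estimators]
y_pred = soft_vote(models, weights, X_test)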

Python GEKKO: Value of parameter changes while solving the model

I face the following problem with GEKKO: some parameters (.Param) change (while others do not) when solving a model, and I cannot determine why.
Background: I am currently trying to translate code from EViews (see gennaro.zezza.it) to Python. I use GEKKO to simulate a system consisting of 11 equations (for now). I want to use parameters (instead of constants, which seem to work perfectly fine) because I need to change their values 'exogenously' over time (and thus need an array).
Example: In the following example, an 'economic system' reacts to new government expenditures. I particularly face problems with m.alpha1 and m.alpha2: if they are introduced as .Param, their values change to 1.0 (instead of 0.6 and 0.4) when solving the model. How can I stop GEKKO from doing this? (Again, I want to be able to change, e.g., alpha1 to 0.7 after time x, so lower and upper bounds won't help here.)
Thanks for your help!!
Code:
from gekko import GEKKO
import numpy as np
import matplotlib.pyplot as plt
import plotly.graph_objects as go
# Initialize model
m = GEKKO(remote=False)
tstart = 1945
tend = 2000
tdur = tend-tstart+1
m.time = np.linspace(0, tend-tstart, tdur)
# Model parameters
m.t = m.Param(value=m.time)
# Exogenous parameters
alpha1_ex = 0.6
alpha2_ex = 0.4
theta_ex = 0.2
w_ex = 1
# -as .Const
m.alpha1 = m.Const(value=alpha1_ex, name='Propensity to consume out of income')
m.alpha2 = m.Const(value=alpha2_ex, name='Propensity to consume out of wealth')
#m.theta = m.Const(value=theta_ex, name='Tax rate')
#m.w = m.Const(value=w_ex, name='Wage rate')
# -as .Param: issues with alpha1 & alpha2
#m.alpha1 = m.Param(value=np.full(tdur,alpha1_ex), name='Propensity to consume out of income')
#m.alpha2 = m.Param(value=np.full(tdur,alpha2_ex), name='Propensity to consume out of wealth')
m.theta = m.Param(value=np.full(tdur,theta_ex), name='Tax rate')
m.w = m.Param(value=np.ones(tdur), name='Wage rate')
# no issues with g_d
m.g_d = m.Param(value=np.zeros(tdur), name='Government goods, demand')
m.g_d[1:] = 20
# Endogenous variables
m.c_d = m.Var(value=0, name='Consumption goods demand by households')
m.c_s = m.Var(value=0, name='Consumption goods supply')
m.g_s = m.Var(value=0, name='Government goods, supply')
m.h_h = m.Var(value=0, name='Cash money held by households')
m.h_s = m.Var(value=0, name='Cash money supplied by government')
m.n_d = m.Var(value=0, name='Demand for labor')
m.n_s = m.Var(value=0, name='Supply for labor')
m.t_d = m.Var(value=0, name='Taxes, "demand"')
m.t_s = m.Var(value=0, name='Taxes, "supply"')
m.y = m.Var(value=0, name='Income (=GDP)')
m.yd = m.Var(value=0, name='Disposable income of households')
# Lag variables
m.h_h_lag = m.Var(value=0, name='Cash money held by households (t-1)')
m.delay(m.h_h,m.h_h_lag,1) # m.h_h_lag = m.h_h(t-1)
m.h_s_lag = m.Var(value=0, name='Cash money supplied by government (t-1)')
m.delay(m.h_s,m.h_s_lag,1)
# Equations
m.Equation(m.c_s == m.c_d)
m.Equation(m.g_s == m.g_d)
m.Equation(m.t_s == m.t_d)
m.Equation(m.n_s == m.n_d)
m.Equation(m.yd == m.w*m.n_s - m.t_s)
m.Equation(m.t_d == m.theta*m.w*m.n_s)
m.Equation(m.c_d == m.alpha1*m.yd + m.alpha2*m.h_h_lag)
m.Equation(m.h_s == m.h_s_lag + m.g_d - m.t_d)
m.Equation(m.h_h == m.h_h_lag + m.yd - m.c_d)
m.Equation(m.y == m.c_s + m.g_s)
m.Equation(m.n_d == m.y/m.w)
# Solve
m.options.IMODE = 4
m.solve(disp=False)
print("Alpha1 = ", m.alpha1.value)
print("Alpha2 = ", m.alpha2.value)
print("Theta = ", m.theta.value)
print("w = ", m.w.value)
# Plot results
fig, axes = plt.subplots(2, 2, sharex=True, figsize=(8, 7))
fig.canvas.manager.set_window_title('Figures Chapter 3')
fig.suptitle('SIM Model - basic')
x_major_ticks = np.arange(0,tdur,5)
axes[0,0].plot(m.time, m.g_d.value, '-', color='black', linewidth=1)
axes[0,0].legend([m.g_d.name],loc=4,fontsize=7)
axes[0,0].grid()
axes[0,0].set_xticks(x_major_ticks)
axes[1,0].plot(m.time, m.y.value, '-', color='red', linewidth=1)
axes[1,0].legend([m.y.name],loc=4,fontsize=7)
axes[1,0].grid()
axes[1,0].set_xlabel('Time (years)')
axes[1,0].set_xticks(x_major_ticks)
axes[0,1].plot(m.time, m.c_d.value, '-', color='blue', linewidth=0.75)
axes[0,1].plot(m.time, m.yd.value, '-', color='green', linewidth=0.75)
axes[0,1].legend([m.c_d.name,m.yd.name],loc=4,fontsize=7)
axes[0,1].grid()
axes[0,1].set_xticks(x_major_ticks)
ln1 = axes[1,1].plot(m.time, m.h_h.value, '-', color='purple', linewidth=0.75)
axes[1,1].tick_params(axis='y', labelcolor='purple')
ax2 = axes[1,1].twinx()
ln2 = ax2.plot(m.time, [a_i - b_i for a_i, b_i in zip(m.h_h, m.h_h_lag)], '-', color='orange', linewidth=0.75)
ax2.tick_params(axis='y', labelcolor='orange')
lns = ln1+ln2
axes[1,1].legend(lns,[m.h_h.name,'Household savings'],loc=4,fontsize=7)
axes[1,1].grid()
axes[1,1].set_xticks(x_major_ticks)
axes[1,1].set_xlabel('Time (years)')
plt.show()
Output #1: with m.alpha1 and m.alpha2 as .Const
Alpha1 = 0.6
Alpha2 = 0.4
Theta = [0.2, 0.2, 0.2, ..., 0.2] (56 entries, all 0.2)
w = [1.0, 1.0, 1.0, ..., 1.0] (56 entries, all 1.0)
Output #2: with m.alpha1 as .Param
Alpha1 = [1.0, 1.0, 1.0, ..., 1.0] (56 entries, all 1.0 instead of 0.6)
Alpha2 = 0.4
Theta = [0.2, 0.2, 0.2, ..., 0.2] (56 entries, all 0.2)
w = [1.0, 1.0, 1.0, ..., 1.0] (56 entries, all 1.0)
The problem is that the variable name name='Propensity to consume out of income' is over 25 characters long.
m.alpha1 = m.Param(value=np.full(tdur,alpha1_ex), name='Propensity to consume out of income')
m.alpha2 = m.Param(value=np.full(tdur,alpha2_ex), name='Propensity to consume out of wealth')
The model file is produced correctly (gk_model0.apm) but the data file (gk_model0.csv) header is truncated to 25 characters. The files are accessible with m.open_folder(). The bug is in this line of gk_write_files.py where numbers are output as strings of length 25.
np.savetxt(os.path.join(self._path,file_name), csv_data.T, delimiter=",", fmt='%1.25s')
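The format string %1.25s prints each value as a string truncated to 25 characters, which is what clips the long header. A quick check in a Python shell:
>>> '%1.25s' % 'Propensity to consume out of income'
'Propensity to consume out'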
I've added this as a bug report with tracking on GitHub. One work-around is to use shorter variable names or leave off the variable names.
m.alpha1 = m.Param(value=np.full(tdur,alpha1_ex)) # Propensity to consume out of income
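With a short (or omitted) name, the time-varying use case from the question works as intended. A minimal sketch, where the switch year is an arbitrary illustration:
alpha1_vals = np.full(tdur, alpha1_ex)
alpha1_vals[30:] = 0.7  # e.g. raise the propensity to consume from year 30 onward
m.alpha1 = m.Param(value=alpha1_vals)  # short name stays within the 25-character limit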

Dense Vector Column to Sparse Vector Column

I have a unique situation where I need to convert a DenseVector column to a SparseVector column.
I am trying to implement the SMOTE technique I found here: https://github.com/Angkirat/Smote-for-Spark/blob/master/PythonCode.py
But on line 44 I had to change it from min_Array[neigh][0] - min_Array[i][0] to DenseVector(min_Array[neigh][0]) - DenseVector(min_Array[i][0]) due to an error.
Once I have the DenseVector column, I need to convert it back to a SparseVector column to union my data.
I have tried the following:
df = sc.parallelize([
    (1, DenseVector([0.0, 1.0, 1.0, 2.0, 1.0, 3.0, 0.0, 0.0, 0.0, 0.0])),
    (2, DenseVector([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 100.0])),
    (3, DenseVector([0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])),
]).toDF(["row_num", "features"])
list_to_vector_udf = udf(lambda l: Vectors.sparse(l), VectorUDT())
df = df.withColumn('features', list_to_vector_udf(df["features"]))
"int() argument must be a string, a bytes-like object or a number, not 'DenseVector'"
assembler = VectorAssembler(inputCols=['features'],outputCol='features')
df = assembler.transform(df)
"Data type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> of column features is not supported."
It usually doesn't make much sense to convert a dense vector to a sparse one, since the dense vector has already taken up the memory. If you really need to do this, look at the SparseVector API: it accepts either a list of (index, value) pairs, or the nonzero indices and values passed directly to the constructor. Something like the following:
from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors, VectorUDT, DenseVector

df = sc.parallelize([
    (1, DenseVector([0.0, 1.0, 1.0, 2.0, 1.0, 3.0, 0.0, 0.0, 0.0, 0.0])),
    (2, DenseVector([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 100.0])),
    (3, DenseVector([0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])),
]).toDF(["row_num", "features"])

def to_sparse(dense_vector):
    # keep only the nonzero entries as (index, value) pairs
    size = len(dense_vector)
    pairs = [(i, v) for i, v in enumerate(dense_vector.values.tolist()) if v != 0]
    return Vectors.sparse(size, pairs)

dense_to_sparse_udf = udf(to_sparse, VectorUDT())
df = df.withColumn('features', dense_to_sparse_udf(df["features"]))
df.show()
+-------+--------------------+
|row_num| features|
+-------+--------------------+
| 1|(10,[1,2,3,4,5],[...|
| 2| (10,[9],[100.0])|
| 3| (10,[1],[1.0])|
+-------+--------------------+

How to efficiently deal with nested data in PySpark?

I had a situation here and found that collect_list in Spark is not efficient when the items are already lists.
Basically, I am trying to calculate the mean of nested lists (the size of each list is guaranteed to be the same). When the data set grows to, say, 10M rows, it may produce out-of-memory errors. Originally, I thought it had something to do with the UDF (used to calculate the mean), but actually I found that the aggregation part (collect_list of lists) is the real problem.
What I am doing now is to divide the 10M rows into multiple blocks (by 'user'), aggregate each block individually, and then union them at the end. Any better suggestions for efficiently dealing with nested data?
A toy example looks like this:
import numpy as np
from pyspark.sql import functions as f
from pyspark.sql.types import ArrayType, DoubleType

data = [('user1', 'place1', ['place1', 'place2', 'place3'], [0.0, 0.5, 0.4], [0.0, 0.4, 0.3]),
        ('user1', 'place2', ['place1', 'place2', 'place3'], [0.7, 0.0, 0.4], [0.6, 0.0, 0.3]),
        ('user2', 'place1', ['place1', 'place2', 'place3'], [0.0, 0.4, 0.3], [0.0, 0.3, 0.4]),
        ('user2', 'place3', ['place1', 'place2', 'place3'], [0.1, 0.2, 0.0], [0.3, 0.1, 0.0]),
        ('user3', 'place2', ['place1', 'place2', 'place3'], [0.3, 0.0, 0.4], [0.2, 0.0, 0.4]),
        ]
data_df = sparkApp.sparkSession.createDataFrame(data, ['user', 'place', 'places', 'data1', 'data2'])
data_agg = data_df.groupBy('user') \
    .agg(f.collect_list('place').alias('place_list'),
         f.first('places').alias('places'),
         f.collect_list('data1').alias('data1'),
         f.collect_list('data2').alias('data2'),
         )

def average_values(sim_vectors):
    # element-wise mean of equally sized vectors
    if len(sim_vectors) == 1:
        return sim_vectors[0]
    mat = np.array(sim_vectors)
    mean_vector = np.mean(mat, axis=0)
    return np.round(mean_vector, 3).tolist()

avg_vectors_udf = f.udf(average_values, ArrayType(DoubleType()))
data_agg_ave = data_agg.withColumn('data1', avg_vectors_udf('data1')) \
    .withColumn('data2', avg_vectors_udf('data2'))
The result would be:
+-----+----------------+--------------------+-----------------+----------------+
| user|      place_list|              places|            data1|           data2|
+-----+----------------+--------------------+-----------------+----------------+
|user1|[place1, place2]|[place1, place2, ...|[0.35, 0.25, 0.4]| [0.3, 0.2, 0.3]|
|user3|        [place2]|[place1, place2, ...|  [0.3, 0.0, 0.4]| [0.2, 0.0, 0.4]|
|user2|[place1, place3]|[place1, place2, ...|[0.05, 0.3, 0.15]|[0.15, 0.2, 0.2]|
+-----+----------------+--------------------+-----------------+----------------+
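Not from the original thread, but one way to sidestep collecting lists of lists entirely is to explode each vector into (position, value) rows and let Spark compute the means natively, reassembling the vectors at the end. A sketch against the same toy data_df, shown for data1 (data2 is analogous):
from pyspark.sql import functions as f

# one row per (user, vector position, value) -- no nested lists to aggregate
exploded = data_df.select('user', f.posexplode('data1').alias('pos', 'val'))
means = (exploded.groupBy('user', 'pos')
         .agg(f.avg('val').alias('mean_val'))
         .groupBy('user')
         .agg(f.sort_array(f.collect_list(f.struct('pos', 'mean_val'))).alias('pairs'))
         # pulling a field out of an array of structs yields an array of values
         .select('user', f.col('pairs.mean_val').alias('data1_mean')))
means.show(truncate=False)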

Optimization for faster numpy 'where' with boolean condition

I generate a bunch of 5-element vectors with
def beam(n):
    # For performance considerations, see
    # https://software.intel.com/en-us/blogs/2016/06/15/faster-random-number-generation-in-intel-distribution-for-python
    try:
        import numpy.random_intel
        generator = numpy.random_intel.multivariate_normal
    except ModuleNotFoundError:
        import numpy.random
        generator = numpy.random.multivariate_normal
    return generator(
        [0.0, 0.0, 0.0, 0.0, 0.0],
        numpy.array([
            [1.0, 0.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 0.0, 0.2]
        ]),
        int(n)
    )
These vectors will be multiplied by 5x5 matrices (element-wise) and checked against boundaries. I use this:
b = beam(1e5)
bound = 1000
s = (b[:, 0]**2 + b[:, 3]**2) < bound**2
#b[np.where(s)] (equivalent performances)
b[s] # <= returned value from a function
It seems that this operation with 100k elements is quite time-consuming (3 ms on my machine).
Would there be an obvious (or less obvious) way to speed up this operation (the where part; the random generation is only there to give an example)?
As your components are uncorrelated, one obvious speedup would be to use the univariate normal instead of the multivariate one:
>>> from timeit import repeat
>>> import numpy as np
>>>
>>> kwds = dict(globals=globals(), number=100)
>>>
>>> repeat('np.random.multivariate_normal(np.zeros((5,)), np.diag((1,1,1,1,0.2)), (100,))', **kwds)
[0.01475344318896532, 0.01471381587907672, 0.013099645031616092]
>>> repeat('np.random.normal((0,0,0,0,0), (1,1,1,1,np.sqrt(0.2)), (100, 5))', **kwds)
[0.003930734936147928, 0.004097769036889076, 0.004246715921908617]
Further, as it stands, your condition is extremely unlikely to fail. So just check s.all() and, if it is True, skip the indexing entirely.
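Putting both ideas together, a sketch of the filtering step with that early-out (assuming callers accept getting the input array back unchanged when nothing is out of bounds):
def within_bound(b, bound=1000):
    # boolean mask: samples whose components 0 and 3 stay within the radius
    s = (b[:, 0]**2 + b[:, 3]**2) < bound**2
    if s.all():
        return b  # common case: nothing filtered, skip the fancy-indexing copy
    return b[s]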
