I have a very large dataframe (millions of rows), and each time I receive a 1-row dataframe with the same columns.
For example:
df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,-1], 'c': [-1,0.4,31]})
input = pd.DataFrame([[11, -0.44, 4]], columns=list('abc'))
I would like to calculate cosine similarity between the input and the whole df.
I am using the following:
from scipy.spatial.distance import cosine
df.apply(lambda row: 1 - cosine(row, input), axis=1)
But it's a bit slow. Tried with swifter package, and it seems to run faster.
Please advise what is the best practice for such a task, do it like this or change to another method?
I usually don't do matrix manipulation with DataFrame but with numpy.array. So I will first convert them
import numpy as np
df_npy = df.values
input_npy = input.values
And then I don't want to use scipy.spatial.distance.cosine so I will take care of the calculation myself, which is to first normalize each of the vectors
df_npy = df_npy / np.linalg.norm(df_npy, axis=1, keepdims=True)
input_npy = input_npy / np.linalg.norm(input_npy, axis=1, keepdims=True)
And then matrix multiply them together
df_npy @ input_npy.T
which will give you
array([[0.213],
[0.524],
[0.431]])
The reason I don't want to use scipy.spatial.distance.cosine is that it handles only one pair of vectors at a time, whereas the approach shown here handles all rows at once.
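The normalize-and-multiply steps above can be wrapped into one helper. This is only a minimal sketch: the function name cosine_similarity_to_row and the variable name query (used instead of input, to avoid shadowing the Python builtin) are made up for illustration, and rows with zero norm are not handled.

```python
import numpy as np
import pandas as pd

def cosine_similarity_to_row(df: pd.DataFrame, query: pd.DataFrame) -> pd.Series:
    """Cosine similarity between every row of `df` and the single row in `query`.

    Hypothetical helper; assumes `query` has exactly one row and the same
    columns as `df`. Rows whose norm is zero would produce NaN.
    """
    rows = df.to_numpy(dtype=float)                          # shape (n_rows, n_cols)
    vec = query[df.columns].to_numpy(dtype=float).ravel()    # shape (n_cols,)
    row_norms = np.linalg.norm(rows, axis=1)                 # one norm per row
    vec_norm = np.linalg.norm(vec)
    sims = (rows @ vec) / (row_norms * vec_norm)             # all rows at once
    return pd.Series(sims, index=df.index, name="cosine_similarity")

df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, -1], 'c': [-1, 0.4, 31]})
query = pd.DataFrame([[11, -0.44, 4]], columns=list('abc'))
similarity = cosine_similarity_to_row(df, query)
print(similarity)
```

Returning a Series indexed like df makes it easy to attach the result as a new column or to sort by similarity.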
I want to do the following:
Fill NaN values in a single column using values within a specific range.
The range I want to use is the mean of the non-NaN values in the column +/- one standard deviation of those values.
NOTE: If possible, I would like to be able to use multiples of the standard deviation by simply multiplying it by a constant.
I thought I had it (see full code below), but the output from print(df['C'].describe()) shows that I am generating values well outside my desired range. In fact, I am generating numbers outside the original min and max of the column, which is definitely not what I want.
import pandas as pd
import numpy as np
import sys
print('Python: {}'.format(sys.version))
print('NumPy: {}'.format(np.__version__))
print('Pandas: {}'.format(pd.__version__))
print('\033[1;31m' + '--------------' + '\033[0m') # Bold red
display_settings = {
    'max_columns': 15,
    'max_colwidth': 60,
    'expand_frame_repr': False, # Wrap to multiple pages
    'max_rows': 50,
    'precision': 6,
    'show_dimensions': False
}
# pd.options.display.float_format = '{:,.2f}'.format
for op, value in display_settings.items():
    pd.set_option("display.{}".format(op), value)
df = pd.DataFrame(np.random.randint(0, 1000, size=(200, 10)), columns=list('ABCDEFGHIJ'))
# df = pd.DataFrame(np.random.randint(0, 100, size=(20, 4)), columns=list(['AA','BB','C2','D2']))
print(df, '\n')
# https://stackoverflow.com/questions/55149738/pandas-replace-values-with-nan-at-random
df['C'] = df['C'].sample(frac=0.65) # The percentage of non-NaN values.
df['H'] = df['H'].sample(frac=0.75) # The percentage of non-NaN values.
print(df, '\n')
print(df.isnull().sum(), '\n')
print(df['C'].describe(), '\n')
def fillNaN_with_unifrand(col):
    a = col.values
    m = np.isnan(a) # mask of NaNs
    mu, sigma = col.mean(), col.std()
    a[m] = np.random.normal(mu, sigma, size=m.sum())
    return col
# https://stackoverflow.com/questions/46543060/how-to-replace-every-nan-in-a-column-with-different-random-values-using-pandas?rq=1
fillNaN_with_unifrand(df['C'])
pd.options.display.float_format = '{:.0f}'.format
print(df, '\n')
print(df.isnull().sum(), '\n')
print(df['C'].describe())
Output of print(df['C'].describe()):
Starting:
count 130.000000
mean 462.446154
std 290.760432
min 7.000000
25% 187.500000
50% 433.000000
75% 671.250000
max 992.000000
Name: C, dtype: float64
Ending:
count 200
mean 517
std 298
min -187
25% 281
50% 544
75% 763
max 1218
Name: C, dtype: float64
Note the min and max. All of my fill values (in this instance) should have been 462 +/- 290.
Well, this is not how statistics works. A Gaussian (normal) distribution has a mean and a std, but values can be drawn far away from mean +/- std; they are just less likely. By the definition of a normal distribution, about 68% of all values lie within +/- 1*std, about 95% within +/- 2*std, and so on. The question is: what do you want to do with outliers? Set them to mean +/- std, or draw again?
Case 1: Set outliers to min/max
This is usually unwanted, as this changes your distribution and puts more weight on the lower and upper boundary.
import numpy as np
from matplotlib import pyplot as plt
mu = 100
sigma = 7
a = np.random.normal(mu, sigma, size=2000) # I used a size of 2000 as an example
a[a<(mu-sigma)] = mu-sigma
a[a>(mu+sigma)] = mu+sigma
plt.hist(a, bins=12, edgecolor='black')
plt.show()
Case 2: Truncated Normal Distribution
What you usually want is the truncated normal distribution. It creates a distribution with an upper and a lower boundary. You can find this function in the scipy.stats module. It works a bit differently, though: you first create the distribution by passing the lower and upper clip values normalized by mu and sigma, and then you draw a number of random variates (rvs) from it, like this:
from matplotlib import pyplot as plt
import scipy.stats as stats
mu = 100
sigma = 7
lower_clip = mu-sigma
upper_clip = mu+sigma
a = stats.truncnorm((lower_clip - mu) / sigma, (upper_clip - mu) / sigma, loc=mu, scale=sigma)
plt.hist(a.rvs(2000), bins=12, edgecolor='black')
plt.show()
The constant of multiples of sigma is easily implemented. You can just change your lower and upper clip like
lower_clip = mu-x*sigma
with x being your constant.
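Applied to the original question, the truncated normal distribution can be used to fill the NaNs of a single column. The following is a minimal sketch, not a definitive implementation: the function name fill_nan_truncnorm and its parameters n_std and seed are assumptions made for illustration, and the example column is random data rather than the asker's dataframe.

```python
import numpy as np
import pandas as pd
from scipy import stats

def fill_nan_truncnorm(col: pd.Series, n_std: float = 1.0, seed=None) -> pd.Series:
    """Return a copy of `col` whose NaNs are replaced by draws from a normal
    distribution truncated to mean +/- n_std * std of the non-NaN values.

    Hypothetical helper; `n_std` plays the role of the constant x above.
    """
    mu, sigma = col.mean(), col.std()        # pandas skips NaN by default
    missing = col.isna()
    # truncnorm expects its bounds in units of `scale` relative to `loc`,
    # so (lower_clip - mu) / sigma is simply -n_std, and likewise for the top.
    dist = stats.truncnorm(-n_std, n_std, loc=mu, scale=sigma)
    filled = col.astype(float).copy()
    filled.loc[missing] = dist.rvs(size=int(missing.sum()), random_state=seed)
    return filled

# Example column: 200 random values, 70 of them replaced by NaN.
rng = np.random.default_rng(0)
col = pd.Series(rng.integers(0, 1000, size=200).astype(float))
col.iloc[rng.choice(200, size=70, replace=False)] = np.nan

filled = fill_nan_truncnorm(col, n_std=1.0, seed=0)
print(filled.describe())
```

Because the function returns a new Series instead of modifying the column in place, the result would be assigned back explicitly, for example df['C'] = fill_nan_truncnorm(df['C']).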
I have a Multi-index dataframe with multiple test result values.
For further data analysis I want to add the derivative to the dataframe.
I tried to calculate it via a lambda function directly after grouping the dataframe. Grouping (mean values) is required due to the noise in the sampling.
Later I want to delete the rows from my dataframe where the derivative is <= 0.
The simplified Multi-index dataframe looks like this:
arrays = [['LS13', 'LS13', 'LS13', 'LS13','LS14','LS14','LS14','LS14','LS14','LS14','LS14','LS14'],[0, 2, 2.5, 3,0,2,5,5.5,6,6.5,7,7.5]]
index = pd.MultiIndex.from_arrays(arrays, names=('File', 'Flow Rate Setpoint [l/s]'))
df = pd.DataFrame({('Flow Rate [l/s]','mean') : [-0.057,2.089,2.496,3.011,0.056,2.070,4.995,5.519,6.011,6.511,7.030,7.499],('Time [s]','mean') : [42.225,104.909,165.676,226.446,42.225,104.918,469.560,530.328,591.100,651.864,712.660,773.034],('Shear Stress [Pa]','mean') : [-0.698,5.621,7.946,11.278,-0.774,6.557,40.610,48.370,54.685,58.414,58.356,56.254]},index=index)
if I run my code:
import numpy as np
xls = ['LS13', 'LS14']
gradient = [pd.Series(np.gradient(df.loc[(i),('Shear Stress [Pa]','mean')],df.loc[(i),('Time [s]','mean')])) for i in xls]
Now I want to concatenate gradient to df on axis=1; the column title could be df[('Gradient', 'values')].
So my pd.Series looks like:
Gradient
values
0 0.100808
1 0.069048
2 0.04654
3 0.054801
0 0.116941
1 0.087431
2 0.149521
3 0.115805
4 0.082639
5 0.030213
6 -0.017938
7 -0.034806
next step would be to remove/drop the rows where ['Gradient','values'] <= 0, in my example ['LS14','7':'7.5']
When I tried to concatenate both Dataframe df and Series gradient (I'm aware that the indexes are different)
merged = pd.concat([pd.DataFrame(df),pd.Series(gradient)], axis=1 , ignore_index = True)
Errors are usually one of the following:
ValueError: Buffer dtype mismatch, expected 'Python object' but got
'long long'
TypeError: cannot concatenate object of type "<class 'list'>"; only
pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid
I would also assume there is an easier way to get this done with a lambda function, applying it in place.
merged = pd.concat([df, pd.Series([gradient], name=('Gradient','value'))], axis=1)
I would have expected that to work, but I also get a mismatch error:
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long long'
when I try:
df[("Gradient","value")] =pd.Series([pd.Series(np.gradient(df.loc[(i),('Shear Stress [Pa]','mean')],df.loc[(i),('Time [s]','mean')])) for i in xls])
The 'Gradient','value' column gets correctly added to the dataframe but the values are again NaN.
You can try groupby().apply():
def get_gradients(x):
    gradients = np.gradient(x[('Shear Stress [Pa]', 'mean')], x[('Time [s]', 'mean')])
    return pd.Series(gradients, index=x.index)
df[('Gradient','Value')] = (df.groupby('File', group_keys=False)
                              .apply(get_gradients)
                           )
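The question also asks how to drop the rows where the gradient is <= 0. The sketch below rebuilds the example data (without the flow-rate column, to keep it short), computes the per-file gradient with an explicit loop over the 'File' level instead of groupby().apply(), and then keeps only rows with a positive gradient. The loop variant and the variable names parts and filtered are illustrative choices, not part of the original answer.

```python
import numpy as np
import pandas as pd

arrays = [['LS13'] * 4 + ['LS14'] * 8,
          [0, 2, 2.5, 3, 0, 2, 5, 5.5, 6, 6.5, 7, 7.5]]
index = pd.MultiIndex.from_arrays(arrays, names=('File', 'Flow Rate Setpoint [l/s]'))
df = pd.DataFrame({
    ('Time [s]', 'mean'): [42.225, 104.909, 165.676, 226.446, 42.225, 104.918,
                           469.560, 530.328, 591.100, 651.864, 712.660, 773.034],
    ('Shear Stress [Pa]', 'mean'): [-0.698, 5.621, 7.946, 11.278, -0.774, 6.557,
                                    40.610, 48.370, 54.685, 58.414, 58.356, 56.254],
}, index=index)

# Compute d(shear stress)/d(time) separately for each file.
parts = []
for _, group in df.groupby(level='File'):
    grad = np.gradient(group[('Shear Stress [Pa]', 'mean')].to_numpy(),
                       group[('Time [s]', 'mean')].to_numpy())
    # Reuse the group's own index so the values align with df on assignment.
    parts.append(pd.Series(grad, index=group.index))
df[('Gradient', 'Value')] = pd.concat(parts)

# Keep only rows with a strictly positive gradient.
filtered = df[df[('Gradient', 'Value')] > 0]
print(filtered)
```

With this data the two LS14 rows at setpoints 7 and 7.5 have negative gradients, so they are the ones removed.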
I am running an experiment with three time-series datasets with different characteristics, whose format is as follows.
0.086206438,10
0.086425551,12
0.089227066,20
0.089262508,24
0.089744425,30
0.090036815,40
0.090054172,28
0.090377569,28
0.090514071,28
0.090762872,28
0.090912691,27
The first column is a timestamp. For reproducibility reasons, I am sharing the data here. From column 2, I wanted to read the current row and compare it with the value of the previous row. If it is greater, I keep comparing. If the current value is smaller than the previous row's value, I want to divide the current value (smaller) by the previous value (larger). Accordingly, here is the code:
import numpy as np
import matplotlib.pyplot as plt
protocols = {}
types = {"data1": "data1.csv", "data2": "data2.csv", "data3": "data3.csv"}
for protname, fname in types.items():
    col_time,col_window = np.loadtxt(fname,delimiter=',').T
    trailing_window = col_window[:-1] # "past" values at a given index
    leading_window = col_window[1:] # "current" values at a given index
    decreasing_inds = np.where(leading_window < trailing_window)[0]
    quotient = leading_window[decreasing_inds]/trailing_window[decreasing_inds]
    quotient_times = col_time[decreasing_inds]
    protocols[protname] = {
        "col_time": col_time,
        "col_window": col_window,
        "quotient_times": quotient_times,
        "quotient": quotient,
    }
    plt.figure(); plt.clf()
    plt.plot(quotient_times,quotient, ".", label=protname, color="blue")
    plt.ylim(0, 1.0001)
    plt.title(protname)
    plt.xlabel("time")
    plt.ylabel("quotient")
    plt.legend()
    plt.show()
And this produces the following three plots, one for each dataset I shared.
As we can see from the points in the plots produced by the code above, data1 is pretty consistent, with values around 1; data2 has two groups of quotients (concentrated around either 0.5 or 0.8); and the values of data3 are also concentrated around two values (either 0.5 or 0.7). Given a new data point (with quotient and quotient_times), I want to know which cluster it belongs to, building each dataset by stacking the two transformed features quotient and quotient_times. I am trying this with KMeans clustering as follows:
from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=3, random_state=0)
k_means.fit(quotient)
But this is giving me an error: ValueError: n_samples=1 should be >= n_clusters=3. How can we fix this error?
Update: sample quotient data = array([ 0.7 , 0.7 , 0.4973262 , 0.7008547 , 0.71287129,
0.704 , 0.49723757, 0.49723757, 0.70676692, 0.5 ,
0.5 , 0.70754717, 0.5 , 0.49723757, 0.70322581,
0.5 , 0.49723757, 0.49723757, 0.5 , 0.49723757])
As is, your quotient variable is now one single sample; here I get a different error message, probably due to different Python/scikit-learn version, but the essence is the same:
import numpy as np
quotient = np.array([ 0.7 , 0.7 , 0.4973262 , 0.7008547 , 0.71287129, 0.704 , 0.49723757, 0.49723757, 0.70676692, 0.5 , 0.5 , 0.70754717, 0.5 , 0.49723757, 0.70322581, 0.5 , 0.49723757, 0.49723757, 0.5 , 0.49723757])
quotient.shape
# (20,)
from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=3, random_state=0)
k_means.fit(quotient)
This gives the following error:
ValueError: Expected 2D array, got 1D array instead:
array=[0.7 0.7 0.4973262 0.7008547 0.71287129 0.704
0.49723757 0.49723757 0.70676692 0.5 0.5 0.70754717
0.5 0.49723757 0.70322581 0.5 0.49723757 0.49723757
0.5 0.49723757].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
which, despite the different wording, is not different from yours: essentially it says that your data looks like a single sample.
Following the first piece of advice (i.e. treating quotient as containing a single feature, or column) resolves the issue:
k_means.fit(quotient.reshape(-1,1))
# result
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
random_state=0, tol=0.0001, verbose=0)
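Once the model is fitted on the reshaped array, new quotient values can be assigned to a cluster with predict(), which expects the same 2D shape. The sketch below is an illustration only: it uses n_clusters=2 (an assumption, chosen because the sample values group around 0.5 and 0.7), and the two new points are made up.

```python
import numpy as np
from sklearn.cluster import KMeans

quotient = np.array([0.7, 0.7, 0.4973262, 0.7008547, 0.71287129,
                     0.704, 0.49723757, 0.49723757, 0.70676692, 0.5,
                     0.5, 0.70754717, 0.5, 0.49723757, 0.70322581,
                     0.5, 0.49723757, 0.49723757, 0.5, 0.49723757])

# One feature per sample: shape (20, 1) instead of (20,).
X = quotient.reshape(-1, 1)

# n_clusters=2 is an assumption for this sample, not the asker's setting.
k_means = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Hypothetical new observations, reshaped the same way before predicting.
new_points = np.array([0.5, 0.71]).reshape(-1, 1)
new_labels = k_means.predict(new_points)

print(k_means.cluster_centers_.ravel())
print(new_labels)
```

The numeric label given to each cluster is arbitrary, so it is safer to compare labels with each other, or look at cluster_centers_, than to rely on a specific label value.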
Please try the code below. A brief explanation on what I've done:
First I built the dataset sample = np.vstack((quotient_times, quotient)).T and standardized it so that it would be easier to cluster. Next, I applied DBSCAN with several hyperparameter combinations (eps and min_samples) until I found the one that separated the points best. Finally, I plotted the data with their respective labels; since you are working with 2-dimensional data, it is easy to visualize how good the clustering is.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
types = {"data1": "data1.csv", "data2": "data2.csv", "data3": "data3.csv"}
dataset = np.empty((0, 2))
for protname, fname in types.items():
    col_time,col_window = np.loadtxt(fname,delimiter=',').T
    trailing_window = col_window[:-1] # "past" values at a given index
    leading_window = col_window[1:] # "current" values at a given index
    decreasing_inds = np.where(leading_window < trailing_window)[0]
    quotient = leading_window[decreasing_inds]/trailing_window[decreasing_inds]
    quotient_times = col_time[decreasing_inds]
    sample = np.vstack((quotient_times, quotient)).T
    dataset = np.append(dataset, sample, axis=0)
scaler = StandardScaler()
dataset = scaler.fit_transform(dataset)
k_means = DBSCAN(eps=0.6, min_samples=1)
k_means.fit(dataset)
colors = [i for i in k_means.labels_]
plt.figure();
plt.title('Dataset 1,2,3')
plt.xlabel("time")
plt.ylabel("quotient")
plt.scatter(dataset[:, 0], dataset[:, 1], c=colors)
plt.legend()
plt.show()
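Since the CSV files are not available here, the standardize-then-DBSCAN idea can be tried on synthetic data. The sketch below is an illustration under stated assumptions: the three blob centers, the noise scale, and the eps and min_samples values are all made up, and real data would need its own tuning.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the stacked (time, quotient) samples:
# three tight blobs of 30 points each around made-up centers.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.1, size=(30, 2)) for c in centers])

# Standardize so both features contribute equally to the distances.
scaled = StandardScaler().fit_transform(points)

# DBSCAN labels noise points as -1; every other label is a cluster id.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(scaled)
n_clusters = len(set(labels) - {-1})
n_noise = int(np.sum(labels == -1))

print(n_clusters, n_noise)
```

Unlike KMeans, DBSCAN has no predict() method, so assigning a new point to a cluster would need an extra step such as finding its nearest already-labelled neighbour.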
You are trying to form 3 clusters, but you have only 1 np.array, i.e. n_samples=1. You could try one of the following:
Increase the number of arrays (samples).
Decrease the number of clusters.
Reshape the array (not sure).
I have a dataframe called 'games':
Game_id Goals P_value
1 2 0.4
2 3 0.321
45 0 0.64
I need to split the P value into steps of 0.05, bin the rows per P value, and then create a line graph that shows the sum per P value.
What I currently have:
games.set_index('P_value', inplace=True)
games.sort_index()
np.cumsum(games['Goals']).plot()
But I get this:
No matter what I tried, I couldn't group the P values and show the sum of goals per P value.
I also tried to use matplotlib.pyplot, but then I couldn't use the cumsum function.
If I understood you correctly, you want to have discrete steps in the p-value of width 0.05 and show the cumulative sum?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# create some random example data
df = pd.DataFrame({
    'goals': np.random.poisson(3, size=1000),
    'p_value': np.random.uniform(0, 1, size=1000)
})
# define binning in p-value
bin_edges = np.arange(0, 1.025, 0.05)
bin_center = 0.5 * (bin_edges[:-1] + bin_edges[1:])
bin_width = np.diff(bin_edges)
# find the p_value bin, each row belongs to
# 0 is underflow, len(edges) is overflow bin
df['bin'] = np.digitize(df['p_value'], bins=bin_edges)
# get the number of goals per p_value bin
goals_per_bin = df.groupby('bin')['goals'].sum()
print(goals_per_bin)
# not every bin might be filled, so we will use pandas index
# matching t
binned = pd.DataFrame({
    'center': bin_center,
    'width': bin_width,
    'goals': np.zeros(len(bin_center))
}, index=np.arange(1, len(bin_edges)))
binned['goals'] = goals_per_bin
plt.step(
    binned['center'],
    binned['goals'],
    where='mid',
)
plt.xlabel('p-value')
plt.ylabel('goals')
plt.show()
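If the cumulative sum is what is wanted, an alternative to np.digitize is pd.cut followed by groupby and cumsum. This is a minimal sketch using the three rows from the question; the column name p_bin and the variable names are illustrative, and bins are closed on the right (pd.cut's default).

```python
import numpy as np
import pandas as pd

games = pd.DataFrame({
    "Game_id": [1, 2, 45],
    "Goals": [2, 3, 0],
    "P_value": [0.4, 0.321, 0.64],
})

# 21 edges give 20 bins of width 0.05 covering [0, 1].
bin_edges = np.linspace(0.0, 1.0, 21)
games["p_bin"] = pd.cut(games["P_value"], bins=bin_edges, include_lowest=True)

# observed=False keeps empty bins, so every 0.05 step appears in the result.
goals_per_bin = games.groupby("p_bin", observed=False)["Goals"].sum()

# Running total of goals as the P value increases.
cumulative_goals = goals_per_bin.cumsum()
print(cumulative_goals)
```

Either goals_per_bin or cumulative_goals can then be plotted, for example with cumulative_goals.plot(drawstyle="steps-post") if matplotlib is installed.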