Vectorizing a for loop with a pandas dataframe


I am trying to do a project for my physics class where we are supposed to simulate motion of charged particles. We are supposed to randomly generate their positions and charges but we have to have positively charged particles in one region and negatively charged ones anywhere else. Right now, as a proof of concept, I am trying to do only 10 particles but the final project will have at least 1000.
My thought process is to create a dataframe with the first column containing the randomly generated charges and run a loop to see what value I get and place in the same dataframe as the next three columns their generated positions.
I have tried to do a simple for loop going over the rows and inputting the data as I go, but I run into an IndexingError: too many indexers. I also want this to run as efficiently as possible so that if I scale up the number of particles, it doesn't slow as much.
I also want to vectorize the operations of calculating the motion of each particle since it is based on position of every other particle which, through normal loops would take a lot of computational time.
Any vectorization optimization or offloading to GPU would be very helpful, thanks.
# In[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
# In[2]:
num_points=10
df_position = pd.DataFrame(np.empty((num_points,4)),columns=['Charge','X','Y','Z'])
# In[3]:
charge = np.array([np.random.choice(2,num_points)])
df_position.iloc[:,0]=np.where(df_position["Charge"]==0,-1,1)
# In[4]:
def positive():
    return np.random.uniform(low=0, high=5)
def negative():
    return np.random.uniform(low=5, high=10)
# In[5]:
for row in df_position.itertuples(index=True,name='Charge'):
    if(getattr(row,"Charge")==-1):
        df_position.iloc[row,1]=positive()
        df_position.iloc[row,2]=positive()
        df_position.iloc[row,3]=positive()
    else:
        df_position.iloc[row,1]=negative()
        #this is where I would get the IndexingError and would like to optimize this portion
        df_position.iloc[row,2]=negative()
        df_position.iloc[row,3]=negative()
df_position.iloc[:,0]=np.where(df_position["Charge"]==0,-1,1)
# In[6]:
ax=plt.axes(projection='3d')
ax.set_xlim(0, 10); ax.set_ylim(0, 10); ax.set_zlim(0,10);
xdata=df_position.iloc[:,1]
ydata=df_position.iloc[:,2]
zdata=df_position.iloc[:,3]
chargedata=df_position.iloc[:11,0]
colors = np.where(df_position["Charge"]==1,'r','b')
ax.scatter3D(xdata,ydata,zdata,c=colors,alpha=1)
EDIT:
The dataframe that I want the results in would be something like this
Charge X Y Z
-1
1
-1
-1
1
With the initial coordinates of each charge listed afterwards in their respective columns. It will be a 3D dataframe, as I will need to keep track of all their new positions after each time step so that I can animate the motion. Each layer will have exactly the same format.

Some code for creating your dataframe:
import numpy as np
import pandas as pd
num_points = 1_000
# uniform distribution of int, not sure it is the best one for your problem
# positive_point = np.random.randint(0, num_points)
positive_point = int(num_points / 100 * np.random.randn() + num_points / 2)
negative_point = num_points - positive_point
# the charge (+1 or -1) is stored in the index of each frame
positive_df = pd.DataFrame(
    np.random.uniform(0.0, 5.0, size=[positive_point, 3]), index=[1] * positive_point, columns=['X', 'Y', 'Z']
)
negative_df = pd.DataFrame(
    np.random.uniform(5.0, 10.0, size=[negative_point, 3]), index=[-1] * negative_point, columns=['X', 'Y', 'Z']
)
df = pd.concat([positive_df, negative_df])
It is quite fast for 1,000 or 1,000,000 points.
Edit: in my first answer, I completely missed a big part of the question. This new one should fit better.
Second edit: I now use a better distribution for the number of positive points than a uniform distribution of ints.
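If the charge should live in its own column, as in the layout sketched in the question, the same idea can be written with np.where, and the pairwise interaction can be computed by broadcasting instead of nested loops. This is only a sketch under assumptions that are not in the question: the function name coulomb_forces, the constant k=1.0 (arbitrary units), the fixed seed, and the convention that +1 charges sit in [0, 5) and -1 charges in [5, 10).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # fixed seed only to make the sketch repeatable
num_points = 10

# One charge per particle, either -1 or +1
charge = rng.choice([-1, 1], size=num_points)

# Draw candidate positions for both regions, then pick one per row by charge sign
pos_region = rng.uniform(0.0, 5.0, size=(num_points, 3))
neg_region = rng.uniform(5.0, 10.0, size=(num_points, 3))
xyz = np.where(charge[:, None] == 1, pos_region, neg_region)

df_position = pd.DataFrame(xyz, columns=["X", "Y", "Z"])
df_position.insert(0, "Charge", charge)

def coulomb_forces(q, r, k=1.0):
    """Net force on every particle from all others (hypothetical helper, k in arbitrary units)."""
    diff = r[:, None, :] - r[None, :, :]      # diff[i, j] = r_i - r_j, shape (n, n, 3)
    dist = np.linalg.norm(diff, axis=-1)      # pairwise distances, shape (n, n)
    np.fill_diagonal(dist, np.inf)            # a particle exerts no force on itself
    qq = q[:, None] * q[None, :]              # products q_i * q_j
    return k * ((qq / dist**3)[:, :, None] * diff).sum(axis=1)

forces = coulomb_forces(
    df_position["Charge"].to_numpy(),
    df_position[["X", "Y", "Z"]].to_numpy(),
)
print(df_position)
print(forces)
```

The (n, n, 3) intermediate array grows quadratically with the number of particles, which is fine for a few thousand particles but would need chunking for much larger systems.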

Related

Why does kmeans give exactly the same results everytime?

I have re-run kmeans 4 times and get
From other answers, I got that
Everytime K-Means initializes the centroid, it is generated randomly.
Could you please explain why the results are exactly the same each time?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
%config InlineBackend.figure_format = 'svg' # Change the image format to svg for better quality
don = pd.read_csv('https://raw.githubusercontent.com/leanhdung1994/Deep-Learning/main/donclassif.txt.gz', sep=';')
fig, ax = plt.subplots(nrows=2, ncols=2, figsize= 2 * np.array(plt.rcParams['figure.figsize']))
for row in ax:
    for col in row:
        kmeans = KMeans(n_clusters = 4)
        kmeans.fit(don)
        y_kmeans = kmeans.predict(don)
        col.scatter(don['V1'], don['V2'], c = y_kmeans, cmap = 'viridis')
        centers = kmeans.cluster_centers_
        col.scatter(centers[:, 0], centers[:, 1], c = 'red', s = 200, alpha = 0.5);
plt.show()

They are not the same; they are similar. K-means iteratively moves the centroids so that they split the data better and better. Each iteration is deterministic, but the initial centroids have to be chosen somehow, and this is usually done at random. A random start does not mean that the final centroids will be random: they converge to something relatively good, and often to similar positions.
Have a look at your code with this simple modification:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
%config InlineBackend.figure_format = 'svg' # Change the image format to svg for better quality
don = pd.read_csv('https://raw.githubusercontent.com/leanhdung1994/Deep-Learning/main/donclassif.txt.gz', sep=';')
fig, ax = plt.subplots(nrows=2, ncols=2, figsize= 2 * np.array(plt.rcParams['figure.figsize']))
cc = []
for row in ax:
    for col in row:
        kmeans = KMeans(n_clusters = 4)
        kmeans.fit(don)
        cc.append(kmeans.cluster_centers_)
        y_kmeans = kmeans.predict(don)
        col.scatter(don['V1'], don['V2'], c = y_kmeans, cmap = 'viridis')
        centers = kmeans.cluster_centers_
        col.scatter(centers[:, 0], centers[:, 1], c = 'red', s = 200, alpha = 0.5);
plt.show()
cc
If you look at the exact values of those centroids, they look like this:
[array([[ 4.97975722, 4.93316461],
[ 5.21715504, -0.18757547],
[ 0.31141141, 0.06726803],
[ 0.00747797, 5.00534801]]),
array([[ 5.21374245, -0.18608103],
[ 0.00747797, 5.00534801],
[ 0.30592308, 0.06549162],
[ 4.97975722, 4.93316461]]),
array([[ 0.30066361, 0.06804847],
[ 4.97975722, 4.93316461],
[ 5.21017831, -0.18735444],
[ 0.00747797, 5.00534801]]),
array([[ 5.21374245, -0.18608103],
[ 4.97975722, 4.93316461],
[ 0.00747797, 5.00534801],
[ 0.30592308, 0.06549162]])]
Similar, but different sets of values.
Also:
Have a look at default arguments to KMeans. There is one called n_init:
Number of time the k-means algorithm will be run with different
centroid seeds. The final results will be the best output of
n_init consecutive runs in terms of inertia.
By default it is equal to 10, which means that every time you run k-means it actually runs 10 times and picks the best result. Those best results will be even more similar to each other than the results of single k-means runs.

I am posting @AEF's comment as an answer to remove this question from the unanswered list.
Random initialization does not necessarily mean a random result. The easiest example: k-means with k=1 always finds the mean in one step, regardless of where the center is initialised.
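The k=1 case can be checked directly with a small NumPy sketch of one assignment-and-update step. The helper name lloyd_step and the synthetic data are illustrative assumptions, not part of scikit-learn.

```python
import numpy as np

def lloyd_step(X, centers):
    """One k-means iteration: assign points to the nearest center, then recompute means."""
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
    labels = dists.argmin(axis=1)
    # New center = mean of the assigned points (an empty cluster keeps its old center)
    return np.array([
        X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
        for j in range(len(centers))
    ])

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))

# Five very different random starting positions for a single centroid (k=1)
starts = rng.uniform(-100, 100, size=(5, 1, 2))
results = [lloyd_step(X, start) for start in starts]

for start, result in zip(starts, results):
    print(start[0], "->", result[0])
print("data mean:", X.mean(axis=0))
```

With a single centroid every point is assigned to it, so the update step returns the data mean no matter where the centroid started.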

Whenever randomization is part of a Scikit-learn algorithm, a random_state parameter may be provided to control the random number generator used. Note that the mere presence of random_state doesn’t mean that randomization is always used, as it may be dependent on another parameter, e.g. shuffle, being set.
The passed value will have an effect on the reproducibility of the results returned by the function (fit, split, or any other function like k_means). The possible values of random_state are listed in the glossary entry:
https://scikit-learn.org/stable/glossary.html#term-random_state
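As a minimal sketch of what random_state does for KMeans, two fits with the same seed can be compared against a fit with a different seed. The synthetic blobs from make_blobs and the helper name fit_centers are assumptions made for the illustration; the question's own data set is not used here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, well-separated data (an assumption; stands in for the question's data)
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.5, random_state=42)

def fit_centers(seed):
    """Fit KMeans with a fixed seed and return the centroids (hypothetical helper)."""
    km = KMeans(n_clusters=4, n_init=10, random_state=seed).fit(X)
    return km.cluster_centers_

a = fit_centers(0)
b = fit_centers(0)   # same seed: same initialisations, same result
c = fit_centers(1)   # different seed: possibly a different order or slightly different values

same_seed_identical = np.allclose(a, b)
print("same seed gives same centroids:", same_seed_identical)
print(np.round(a, 3))
print(np.round(c, 3))
```

Leaving random_state unset makes each call draw fresh initialisations, which is why the centroid arrays in the answer above differ slightly and appear in a different order from run to run.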

How to bin a netcdf data using xarray

I have some spatiotemporal data derived from the CHIRPS Database. It is a NetCDF that contains daily precipitation for all over the world with a spatial resolution of 1x1km2. The DataSet possesses 3 dimensions ('time', 'longitude', 'latitude').
I would like to bin this precipitation data according to each pixel's coordinate ('latitude' & 'longitude') temporal distribution. Therefore, the dimension I wish to apply the binnarization is the 'time' domain.
This is a similar question already discussed in StackOverflow (see in here). The difference between their Issue and mine is that, in my case, I need to binnarize the data according to each specific pixel's temporal distribution, instead of applying a single range of values for binnarization for all my coordinates (pixels). As a consequence, I expect to have different binning thresholds ('n' sets of thresholds), one for each of the 'n' pixels in my dataset.
As far as I understand, the simplest and fastest way to apply a function over each of the coordinates (pixels) of a Xarray's DataArray/DataSet is to use the xarray.apply_ufunc.
For the binnarization, I am using the pandas qcut method, which only requires an array of values and some given relative frequency (i.e.: [0.1%, 0.5%, 25%, 99%]) in order for it to work.
Since pandas binning function requires an array of data, and it also returns another array of binnarized data, I understand that I have to use the argument "vectorize"=True in the U_function (described in here).
Finally, when I run the analysis, the resulting Xarray DataSet ends up losing the 'time' dimension after the processing. I am also unsure whether that processing truly returned an Xarray DataSet with properly classified data.
Here is a reproducible snippet code. Notice that the 'time' dimension of the "ds_binned" is lost. Therefore, I have to later insert the binned data back to the original xarray dataset (ds). Also notice that the dimensions are not set in proper order. That also is causing problems for my analysis.
import pandas as pd
pd.set_option('display.width', 50000)
pd.set_option('display.max_rows', 50000)
pd.set_option('display.max_columns', 5000)
import numpy as np
import xarray as xr
from dask.diagnostics import ProgressBar
ds = xr.tutorial.open_dataset('rasm').load()
def parse_datetime(time):
    return pd.to_datetime([str(x) for x in time])
ds.coords['time'] = parse_datetime(ds.coords['time'].values)
def binning_function(x, distribution_type='Positive', b=False):
    y = np.where(np.abs(x)==np.inf, 0, x)
    y = np.where(np.isnan(y), 0, y)
    if np.all(y) == 0:
        return x
    else:
        Classified = pd.qcut(y, np.linspace(0.01, 1, 10))
        return Classified.codes
def xarray_parse_extremes(ds, dim=['time'], dask='allowed', new_dim_name=['classes'], kwargs={'b': False, 'distribution_type':'Positive'}):
    filtered = xr.apply_ufunc(binning_function,
                              ds,
                              dask=dask,
                              vectorize=True,
                              input_core_dims=[dim],
                              #exclude_dims = [dim],
                              output_core_dims=[new_dim_name],
                              kwargs=kwargs,
                              output_dtypes=[float],
                              join='outer',
                              dataset_fill_value=np.nan,
                              ).compute()
    return filtered
with ProgressBar():
    da_binned = xarray_parse_extremes(ds['Tair'] ,
                                      ['time'],
                                      dask='allowed')
da_binned.name = 'classes'
ds_binned = da_binned.to_dataset()
ds['classes'] = (('y', 'x', 'time'), ds_binned['classes'].values)
mask = (ds['classes'] >= 5) & (ds['classes'] != 0)
ds.where(mask, drop=True).resample({'time':'Y'}).count('time')['Tair'].isel({'time':-1}).plot()
print(ds)
(ds.where(mask, drop=True).resample({'time':'Y'}).count('time')['Tair']
.to_dataframe().dropna().sort_values('Tair', ascending=False)
)
delayed_to_netcdf = ds.to_netcdf(r'F:\Philipe\temp\teste_tutorial.nc',
engine='netcdf4',
compute =False)
print('saving data classified')
with ProgressBar():
    delayed_to_netcdf.compute()
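One possible way to keep the 'time' dimension, sketched here on a small synthetic DataArray rather than the 'rasm' tutorial data, is to declare 'time' as both the input and the output core dimension, so that each pixel's series goes in and a series of bin codes of the same length comes out. The helper name bin_along_time, the quartile edges, and the use of -1 as a "not classified" code are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

def bin_along_time(x):
    """Quantile-bin one pixel's time series; returns one code per time step (hypothetical helper)."""
    codes = np.full(x.shape, -1.0)            # -1 marks values that could not be classified
    valid = np.isfinite(x)
    if valid.sum() < 2 or np.ptp(x[valid]) == 0:
        return codes                          # constant or empty series: nothing to bin
    edges = np.linspace(0, 1, 5)              # quartiles, computed per pixel
    codes[valid] = pd.qcut(x[valid], edges, labels=False, duplicates="drop")
    return codes

rng = np.random.default_rng(0)
da = xr.DataArray(
    rng.normal(size=(36, 4, 5)),
    dims=("time", "y", "x"),
    name="Tair",
)

classes = xr.apply_ufunc(
    bin_along_time,
    da,
    input_core_dims=[["time"]],
    output_core_dims=[["time"]],   # same name in and out, so 'time' is preserved
    vectorize=True,
).transpose(*da.dims)              # core dims are moved last; restore the original order

print(classes.dims, classes.shape)
```

Because apply_ufunc moves core dimensions to the end of the result, the final transpose is what restores the original dimension order.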

Statistical tests: how do (perception; actual results; and next) interact?

What is the interaction between perception, outcome, and outlook?
I've brought them into categorical variables to [potentially] simplify things.
import pandas as pd
import numpy as np
high, size = 100, 20
df = pd.DataFrame({'perception': np.random.randint(0, high, size),
'age': np.random.randint(0, high, size),
'smokes_cat': pd.Categorical(np.tile(['lots', 'little', 'not'],
size//3+1)[:size]),
'outcome': np.random.randint(0, high, size),
'outlook_cat': pd.Categorical(np.tile(['positive', 'neutral',
'negative'],
size//3+1)[:size])
})
df.insert(2, 'age_cat', pd.Categorical(pd.cut(df.age, range(0, high+5, size//2),
right=False, labels=[
"{0} - {1}".format(i, i + 9)
for i in range(0, high, size//2)])))
def tierify(i):
    if i <= 25:
        return 'lowest'
    elif i <= 50:
        return 'low'
    elif i <= 75:
        return 'med'
    return 'high'
df.insert(1, 'perception_cat', df['perception'].map(tierify))
df.insert(6, 'outcome_cat', df['outcome'].map(tierify))
np.random.shuffle(df['smokes_cat'])
Run online: http://ideone.com/fftuSv or https://repl.it/repls/MicroLeftSequences
This is fake data, but it should convey the idea. Each individual has a perceived view (perception), is then presented with the actual outcome, and from that can decide their outlook.
Using Python (pandas, or anything open-source really), how do I show the probability—and p-value—of the interaction between these 3 dependent columns (possibly using the age, smokes_cat as potential confounders)?

You can use an interaction plot for this purpose; it fits your case well. I tried it on the dummy data generated in the question, and the code is shown below. Treat it as pseudo-code, though: you will need to tailor it to your needs.
In its simplest form:
If the lines in the plot intersect, or are likely to intersect for other values, then you may assume that there is an interaction effect.
If the lines are parallel, or are not likely to intersect, then you may assume that there is no interaction effect.
For a deeper understanding, I have added some links below that you can check out.
Code
... # The rest of the code in the question.
# Interaction plot
import matplotlib.pyplot as plt
from statsmodels.graphics.factorplots import interaction_plot
p = interaction_plot(
x = df['perception'],
trace=df['outlook_cat'],
response= df['outcome']
)
plt.savefig('./my_interaction_plot.png') # or plt.show()
You can find the documentation of interaction_plot() here. Besides, I also suggest you run an ANOVA.
Further reading
You can check out these links:
(A paper) titled Interaction Effects in ANOVA.
(A case) in practice case.
(Another case) in practice case.

One option is a Multinomial logit model:
# Create an integer-encoded (label-encoded) version of the categorical variables
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
all_enc_df = pd.DataFrame({column: enc.fit_transform(df[column])
for column in ('perception_cat', 'age_cat',
'smokes_cat', 'outlook_cat')})
# Regression
from sklearn.linear_model import LogisticRegression
X, y = (all_enc_df[['age_cat', 'smokes_cat', 'outlook_cat']],
all_enc_df[['perception_cat']])
#clf = LogisticRegression(random_state=0, solver='lbfgs',
# multi_class='multinomial').fit(X, y)
import statsmodels.api as sm
mullogit = sm.MNLogit(y,X)
mulfit = mullogit.fit(method='bfgs', maxiter=100)
print(mulfit.summary())
https://repl.it/repls/MicroLeftSequences

Multi Label Text Data Visualization

I have multi-label text data. I want to visualize it in Python with a suitable graph, to get an idea of how much overlap exists between classes and whether the overlap follows any pattern, for example that class_40 also occurs in 40% of the cases where class_1 occurs.
Data is in this form:
paragraph_1 class_1
paragraph_11 class_2
paragraph_1 class_2
paragraph_1 class_3
paragraph_13 class_3
What is the best way to visualize such data? Which library can help in this case seaborn, matplotlib etc?

You can try this:
%matplotlib inline
import matplotlib.pylab as plt
from collections import Counter
x = ['paragraph1', 'paragraph1','paragraph1','paragraph1','paragraph2', 'paragraph2','paragraph3','paragraph1','paragraph4']
y = ['class1','class1','class1', 'class2','class3','class3', 'class1', 'class3','class4']
# count the occurrences of each point
c = Counter(zip(x,y))
# create a list of the sizes, here multiplied by 10 for scale
s = [10*c[(xx,yy)] for xx,yy in zip(x,y)]
plt.grid()
# plot it
plt.scatter(x, y, s=s)
plt.show()
The more often a pair occurs, the bigger its marker.
A different question with the same answer, proposed by @James, can be found here: How to have scatter points become larger for higher density using matplotlib?
Edit1 (if you have bigger dataset)
Different approach using heatmaps:
import numpy as np
from collections import Counter
import seaborn as sns
import pandas as pd
x = ['paragraph1', 'paragraph1','paragraph1','paragraph1','paragraph2', 'paragraph2','paragraph3','paragraph1','paragraph4']
y = ['class1','class1','class1', 'class2','class3','class3', 'class1', 'class3','class4']
# count the occurrences of each point
c = Counter(zip(x,y))
# fill pandas DataFrame with zeros
dff = pd.DataFrame(0,columns =np.unique(x) , index =np.unique(y))
# count occurrences and prepare data for heatmap
for k,v in c.items():
    dff.loc[k[1], k[0]] = v  # row = class, column = paragraph; avoids chained assignment
sns.heatmap(dff,annot=True, fmt="d")
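To look at overlap between classes rather than paragraph-by-class counts, a class co-occurrence matrix can be built with pandas. This is a sketch on the five rows shown in the question; the column names paragraph and label and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# The sample rows from the question (column names are assumptions)
data = pd.DataFrame({
    "paragraph": ["paragraph_1", "paragraph_11", "paragraph_1", "paragraph_1", "paragraph_13"],
    "label": ["class_1", "class_2", "class_2", "class_3", "class_3"],
})

# Paragraph x class indicator matrix (1 if the paragraph carries the class)
indicator = pd.crosstab(data["paragraph"], data["label"]).clip(upper=1)

# Class x class co-occurrence counts; the diagonal holds each class's own frequency
cooc = indicator.T.dot(indicator)

# Row-normalise: entry (a, b) is the share of paragraphs with class a that also have class b
conditional = cooc.div(np.diag(cooc), axis=0)

print(cooc)
print(conditional)
```

Either matrix can be passed to sns.heatmap in the same way as dff above; the conditional matrix directly answers questions of the form "how often does class b appear when class a appears".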

How to highlight the area with maximum number of changes in a time series plot?

I am trying to play with some time series data. I would like to plot the area with maximum numbers of changes based on some interval.
I have written some sample code but I am not able to move forward in highlighting the region.
import pandas as pd
import numpy as np
import seaborn as sns
f = pd.DataFrame(np.random.randint(0,50,size=(300, 1)))
sns.tsplot(f[0])
I want to highlight the region with maximum changes say with window size 30.

Here is one approach that performs most of the operations in numpy, and then displays the region with matplotlib's axvspan:
import matplotlib.pyplot as plt
f = pd.DataFrame(np.random.randint(0,50,size=(300, 1))) # dataframe
y = f[0].values # working vector in numpy
thr = 5 # criterion for counting as a change
chunk_size = 30 # window length
chunks = np.array_split(y, y.shape[0] // chunk_size) # split into 30-element chunks (array_split needs an integer count)
# compute how many elements differ from one element to the next
diffs_by_chunk = [(np.abs(np.ediff1d(chunk)) > thr).sum() for chunk in chunks]
ix = np.argmax(diffs_by_chunk) # chunk with most differences
sns.tsplot(f[0])
plt.axvspan(ix * chunk_size, (ix+1) * chunk_size, alpha=0.5)
With a baseline of uniform random data, it is difficult to relate this to a use case, but alternative criteria for what to maximise over might be useful, e.g. just looking at the sum of absolute changes, rather than the number that exceed a threshold:
diffs_by_chunk = [(np.abs(np.ediff1d(chunk))).sum() for chunk in chunks] # criterion #2
It would also be possible to show multiple regions that all have enough differences:
for i, df in enumerate(diffs_by_chunk):
    if df >= 25:
        sns.mpl.pyplot.axvspan(i*chunk_size, (i+1)*chunk_size, alpha=0.5)
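The chunks above start at fixed positions (0, 30, 60, ...), so a busy region that straddles two chunks can be missed. A sliding window is one alternative, sketched below on a deterministic test signal; the signal and the names changes, counts and start are assumptions, and the window is measured in consecutive differences rather than in samples.

```python
import numpy as np

thr = 5          # a step larger than this counts as a change
window = 30      # number of consecutive differences per window

# Test signal (assumption): flat, with an alternating burst between samples 100 and 128
y = np.zeros(300)
y[100:130:2] = 20

# changes[i] is True when |y[i+1] - y[i]| exceeds the threshold
changes = np.abs(np.diff(y)) > thr

# counts[s] = number of changes in the window starting at difference index s
counts = np.convolve(changes.astype(int), np.ones(window, dtype=int), mode="valid")

start = int(np.argmax(counts))   # first window with the maximum number of changes
print(start, start + window, counts[start])
```

The resulting region can be highlighted with plt.axvspan(start, start + window, alpha=0.5), in the same way as in the chunk-based version.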
