Multi Label Text Data Visualization - python-3.x

I have multi-label text data. I want to visualize this data in python in some good graph to get an idea how much overlapping exist in my data and also wanted to know if there is any pattern exist in overlapping like when 40% of times class_1 is coming then also class_40 is coming too.
Data is in this form:
paragraph_1 class_1
paragraph_11 class_2
paragraph_1 class_2
paragraph_1 class_3
paragraph_13 class_3
What is the best way to visualize such data? Which library can help in this case seaborn, matplotlib etc?

You can try this:
%matplotlib inline
import matplotlib.pylab as plt
from collections import Counter
x = ['paragraph1', 'paragraph1','paragraph1','paragraph1','paragraph2', 'paragraph2','paragraph3','paragraph1','paragraph4']
y = ['class1','class1','class1', 'class2','class3','class3', 'class1', 'class3','class4']
# count the occurrences of each point
c = Counter(zip(x,y))
# create a list of the sizes, here multiplied by 10 for scale
s = [10*c[(xx,yy)] for xx,yy in zip(x,y)]
plt.grid()
# plot it
plt.scatter(x, y, s=s)
plt.show()
The higher is the occurence, the bigger is the marker.
Different question, but same answer proposed by #James can be found here: How to have scatter points become larger for higher density using matplotlib?
Edit1 (if you have bigger dataset)
Different approach using heatmaps:
import numpy as np
from collections import Counter
import seaborn as sns
import pandas as pd
x = ['paragraph1', 'paragraph1','paragraph1','paragraph1','paragraph2', 'paragraph2','paragraph3','paragraph1','paragraph4']
y = ['class1','class1','class1', 'class2','class3','class3', 'class1', 'class3','class4']
# count the occurrences of each point
c = Counter(zip(x,y))
# fill pandas DataFrame with zeros
dff = pd.DataFrame(0,columns =np.unique(x) , index =np.unique(y))
# count occurencies and prepare data for heatmap
for k,v in c.items():
dff[k[0]][k[1]] = v
sns.heatmap(dff,annot=True, fmt="d")

Related

Seaborn / Matplotlib: Subplots depending on one column

I have a Dataframe and based on its data, I draw lineplots for it.
The code currently looks as simple as that:
ax = sns.lineplot(x='datapoints', y='mean', hue='index', data=df)
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))
Now, there actually is a column, called "klinger", which has 8 different values and I would like to get a plot consisting of eight subplots (4x2) for it, all sharing just one legend.
Is that an easy thing to do?
Currently, I generate sub-dfs by filtering and just draw eight diagrams and cut them together with a graphic tool, but this can't be the solution
You can get what you are looking for with sns.relplot and kind='line'.
Use col='klinger' to plot subplots as many as you need, col_wrap=4 will help to obtain 4x2 shape, and col_order=klinger_categories will select which categories you want to plot.
import numpy as np
import pandas as pd
import seaborn as sns
number = 100
klinger_categories = ['a','b','c','d','e','f','g','h']
data = {'datapoints': np.arange(number),
'mean': np.random.normal(0,1,size=number),
'index': np.random.choice(np.arange(2),size=number),
'klinger': np.random.choice(klinger_categories,size=number),
}
df = pd.DataFrame(data)
sns.relplot(
data=df, x='datapoints', y='mean', hue='index', kind='line',
col='klinger', col_wrap=4, col_order=klinger_categories
)

Is it possible to switch X axis in Python matplotlib.pyplot.hist from bin edges to exact values?

Is it possible to switch X axis in Python matplotlib.pyplot.hist from bin edges to exact values?
In other words this is what I get:
dataset = [0,1,1,1,2,2,3,3,4]
plt.hist(dataset, 5, rwidth=0.9)
and this what I need:
You can first compute the frequencies and then use a bar plot
from collections import Counter
import matplotlib.pyplot as plt
dataset = [0,1,1,1,2,2,3,3,4]
freqs = Counter(dataset)
plt.bar(freqs.keys(), freqs.values(), width=0.9)
plt.show()

Vectorizing a for loop with a pandas dataframe

I am trying to do a project for my physics class where we are supposed to simulate motion of charged particles. We are supposed to randomly generate their positions and charges but we have to have positively charged particles in one region and negatively charged ones anywhere else. Right now, as a proof of concept, I am trying to do only 10 particles but the final project will have at least 1000.
My thought process is to create a dataframe with the first column containing the randomly generated charges and run a loop to see what value I get and place in the same dataframe as the next three columns their generated positions.
I have tried to do a simple for loop going over the rows and inputting the data as I go, but I run into an IndexingError: too many indexers. I also want this to run as efficiently as possible so that if I scale up the number of particles, it doesn't slow as much.
I also want to vectorize the operations of calculating the motion of each particle since it is based on position of every other particle which, through normal loops would take a lot of computational time.
Any vectorization optimization or offloading to GPU would be very helpful, thanks.
# In[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
# In[2]:
num_points=10
df_position = pd.DataFrame(pd,np.empty((num_points,4)),columns=['Charge','X','Y','Z'])
# In[3]:
charge = np.array([np.random.choice(2,num_points)])
df_position.iloc[:,0]=np.where(df_position["Charge"]==0,-1,1)
# In[4]:
def positive():
return np.random.uniform(low=0, high=5)
def negative():
return np.random.uniform(low=5, high=10)
# In[5]:
for row in df_position.itertuples(index=True,name='Charge'):
if(getattr(row,"Charge")==-1):
df_position.iloc[row,1]=positive()
df_position.iloc[row,2]=positive()
df_position.iloc[row,3]=positive()
else:
df_position.iloc[row,1]=negative()
#this is where I would get the IndexingError and would like to optimize this portion
df_position.iloc[row,2]=negative()
df_position.iloc[row,3]=negative()
df_position.iloc[:,0]=np.where(df_position["Charge"]==0,-1,1)
# In[6]:
ax=plt.axes(projection='3d')
ax.set_xlim(0, 10); ax.set_ylim(0, 10); ax.set_zlim(0,10);
xdata=df_position.iloc[:,1]
ydata=df_position.iloc[:,2]
zdata=df_position.iloc[:,3]
chargedata=df_position.iloc[:11,0]
colors = np.where(df_position["Charge"]==1,'r','b')
ax.scatter3D(xdata,ydata,zdata,c=colors,alpha=1)
EDIT:
The dataframe that I want the results in would be something like this
Charge X Y Z
-1
1
-1
-1
1
With the inital coordinates of each charge listed after in their respective columns. It will be a 3D dataframe as I will need to track of all their new positions after each time step so that I can do animations of the motion. Each layer will be exactly the same format.
Some code for creating your dataframe:
import numpy as np
import pandas as pd
num_points = 1_000
# uniform distribution of int, not sure it is the best one for your problem
# positive_point = np.random.randint(0, num_points)
positive_point = int(num_points / 100 * np.random.randn() + num_points / 2)
negavite_point = num_points - positive_point
positive_df = pd.DataFrame(
np.random.uniform(0.0, 5.0, size=[positive_point, 3]), index=[1] * positive_point, columns=['X', 'Y', 'Z']
)
negative_df = pd.DataFrame(
np.random.uniform(5.0, 10.0, size=[negavite_point, 3]), index=[-1] *negavite_point, columns=['X', 'Y', 'Z']
)
df = pd.concat([positive_df, negative_df])
It is quite fast for 1,000 or 1,000,000.
Edit: with my first answer, I totally miss a big part of the question. This new one should fit better.
Second edit: I use a better distribution for the number of positive point than a uniform distribution of int.

Plotting a chart a plot in which the Y text data and X numeric data from dictionary. Matplotlib & Python 3 [duplicate]

I can create a simple columnar diagram in a matplotlib according to the 'simple' dictionary:
import matplotlib.pyplot as plt
D = {u'Label1':26, u'Label2': 17, u'Label3':30}
plt.bar(range(len(D)), D.values(), align='center')
plt.xticks(range(len(D)), D.keys())
plt.show()
But, how do I create curved line on the text and numeric data of this dictionarie, I do not know?
ΠΆ_OLD = {'10': 'need1', '11': 'need2', '12': 'need1', '13': 'need2', '14': 'need1'}
Like the picture below
You may use numpy to convert the dictionary to an array with two columns, which can be plotted.
import matplotlib.pyplot as plt
import numpy as np
T_OLD = {'10' : 'need1', '11':'need2', '12':'need1', '13':'need2','14':'need1'}
x = list(zip(*T_OLD.items()))
# sort array, since dictionary is unsorted
x = np.array(x)[:,np.argsort(x[0])].T
# let second column be "True" if "need2", else be "False
x[:,1] = (x[:,1] == "need2").astype(int)
# plot the two columns of the array
plt.plot(x[:,0], x[:,1])
#set the labels accordinly
plt.gca().set_yticks([0,1])
plt.gca().set_yticklabels(['need1', 'need2'])
plt.show()
The following would be a version, which is independent on the actual content of the dictionary; only assumption is that the keys can be converted to floats.
import matplotlib.pyplot as plt
import numpy as np
T_OLD = {'10': 'run', '11': 'tea', '12': 'mathematics', '13': 'run', '14' :'chemistry'}
x = np.array(list(zip(*T_OLD.items())))
u, ind = np.unique(x[1,:], return_inverse=True)
x[1,:] = ind
x = x.astype(float)[:,np.argsort(x[0])].T
# plot the two columns of the array
plt.plot(x[:,0], x[:,1])
#set the labels accordinly
plt.gca().set_yticks(range(len(u)))
plt.gca().set_yticklabels(u)
plt.show()
Use numeric values for your y-axis ticks, and then map them to desired strings with plt.yticks():
import matplotlib.pyplot as plt
import pandas as pd
# example data
times = pd.date_range(start='2017-10-17 00:00', end='2017-10-17 5:00', freq='H')
data = np.random.choice([0,1], size=len(times))
data_labels = ['need1','need2']
fig, ax = plt.subplots()
ax.plot(times, data, marker='o', linestyle="None")
plt.yticks(data, data_labels)
plt.xlabel("time")
Note: It's generally not a good idea to use a line graph to represent categorical changes in time (e.g. from need1 to need2). Doing that gives the visual impression of a continuum between time points, which may not be accurate. Here, I changed the plotting style to points instead of lines. If for some reason you need the lines, just remove linestyle="None" from the call to plt.plot().
UPDATE
(per comments)
To make this work with a y-axis category set of arbitrary length, use ax.set_yticks() and ax.set_yticklabels() to map to y-axis values.
For example, given a set of potential y-axis values labels, let N be the size of a subset of labels (here we'll set it to 4, but it could be any size).
Then draw a random sample data of y values and plot against time, labeling the y-axis ticks based on the full set labels. Note that we still use set_yticks() first with numerical markers, and then replace with our category labels with set_yticklabels().
labels = np.array(['A','B','C','D','E','F','G'])
N = 4
# example data
times = pd.date_range(start='2017-10-17 00:00', end='2017-10-17 5:00', freq='H')
data = np.random.choice(np.arange(len(labels)), size=len(times))
fig, ax = plt.subplots(figsize=(15,10))
ax.plot(times, data, marker='o', linestyle="None")
ax.set_yticks(np.arange(len(labels)))
ax.set_yticklabels(labels)
plt.xlabel("time")
This gives the exact desired plot:
import matplotlib.pyplot as plt
from collections import OrderedDict
T_OLD = {'10' : 'need1', '11':'need2', '12':'need1', '13':'need2','14':'need1'}
T_SRT = OrderedDict(sorted(T_OLD.items(), key=lambda t: t[0]))
plt.plot(map(int, T_SRT.keys()), map(lambda x: int(x[-1]), T_SRT.values()),'r')
plt.ylim([0.9,2.1])
ax = plt.gca()
ax.set_yticks([1,2])
ax.set_yticklabels(['need1', 'need2'])
plt.title('T_OLD')
plt.xlabel('time')
plt.ylabel('need')
plt.show()
For Python 3.X the plotting lines needs to explicitly convert the map() output to lists:
plt.plot(list(map(int, T_SRT.keys())), list(map(lambda x: int(x[-1]), T_SRT.values())),'r')
as in Python 3.X map() returns an iterator as opposed to a list in Python 2.7.
The plot uses the dictionary keys converted to ints and last elements of need1 or need2, also converted to ints. This relies on the particular structure of your data, if the values where need1 and need3 it would need a couple more operations.
After plotting and changing the axes limits, the program simply modifies the tick labels at y positions 1 and 2. It then also adds the title and the x and y axis labels.
Important part is that the dictionary/input data has to be sorted. One way to do it is to use OrderedDict. Here T_SRT is an OrderedDict object sorted by keys in T_OLD.
The output is:
This is a more general case for more values/labels in T_OLD. It assumes that the label is always 'needX' where X is any number. This can readily be done for a general case of any string preceding the number though it would require more processing,
import matplotlib.pyplot as plt
from collections import OrderedDict
import re
T_OLD = {'10' : 'need1', '11':'need8', '12':'need11', '13':'need1','14':'need3'}
T_SRT = OrderedDict(sorted(T_OLD.items(), key=lambda t: t[0]))
x_val = list(map(int, T_SRT.keys()))
y_val = list(map(lambda x: int(re.findall(r'\d+', x)[-1]), T_SRT.values()))
plt.plot(x_val, y_val,'r')
plt.ylim([0.9*min(y_val),1.1*max(y_val)])
ax = plt.gca()
y_axis = list(set(y_val))
ax.set_yticks(y_axis)
ax.set_yticklabels(['need' + str(i) for i in y_axis])
plt.title('T_OLD')
plt.xlabel('time')
plt.ylabel('need')
plt.show()
This solution finds the number at the end of the label using re.findall to accommodate for the possibility of multi-digit numbers. Previous solution just took the last component of the string because numbers were single digit. It still assumes that the number for plotting position is the last number in the string, hence the [-1]. Again for Python 3.X map output is explicitly converted to list, step not necessary in Python 2.7.
The labels are now generated by first selecting unique y-values using set and then renaming their labels through concatenation of the strings 'need' with its corresponding integer.
The limits of y-axis are set as 0.9 of the minimum value and 1.1 of the maximum value. Rest of the formatting is as before.
The result for this test case is:

Apply power fit to data by using levenberg-marquardt algorithm in python

Hy everybody!
I am a beginer in python and data analysis, and meet with a problem, during fitting a power function to my data.
Here I plotted my dataset as a scatterplot
I want to plot a power function with expontent arround -1 , but after I apply the levenberg-marquardt method, using lmfit library in python, I get the following faulty image. I tried to modify the initial parameters, but it didn't help.
Here is my code:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from lmfit import minimize, Parameters, Parameter, report_fit
be = pd.read_table('...',
skipinitialspace=True,
names = ["CoM", "slope", "slope2"])
x=be["CoM"]
data=be["slope"]
def fcn2min(params, x, data):
n2 = params['n2'].value
n1 = params['n1'].value
model = n1 * x ** n2
return model - data #that's what you want to minimize
# create a set of Parameters
# 'value' is the initial condition
params = Parameters()
params.add('n2', value= -1.00)
params.add('n1',value= 23.0)
# do fit, here with leastsq model
result = minimize(fcn2min, params, args=(be["CoM"],be["slope"]))
#calculate final result
final = data + result.residual
resid = result.residual
# write error report
report_fit(result)
#plot results
xplot = x
yplot = result.params['n1'].value * x ** result.params['n2'].value
plt.figure(figsize=(15,6))
plt.ylabel('OD-slope',fontsize=18, color='blue')
plt.xlabel('CoM height_Sz [m]',fontsize=18, color='blue')
plt.plot(be["CoM"],be["slope"],"o", label="slope_flat")
plt.plot(be["CoM"],be["slope2"],"+",color='r', label="slope_curv")
plt.plot(xplot,yplot)
plt.legend()
plt.savefig('plot2')
plt.show()
I don't quite understand what is the problem with this, so if you have any observations, thank you very much.
It's a little hard to tell what the question is. t looks to me like the fit completed and gave a reasonably good fit, but you don't provide the fit statistics or report of the parameters.
If you're asking about all the green lines for the "COM" array (the best fit?), this is almost certainly because the starting x axis "height_Sz" data was not sorted to be strictly increasing. That's OK for the fit, but plotting an X-Y trace with a line expects the data to be in order.

Resources