trapz integration on dataframe index after grouping - python-3.x

I have some data and I want to first group the Target column by some interval and then integrate the Target column using the index spacing.
import numpy as np
import pandas as pd
from scipy import integrate
df = pd.DataFrame({'A': np.array([100, 105.4, 108.3, 111.1, 113, 114.7, 120, 125, 129, 130, 131, 133,135,140, 141, 142]),
'B': np.array([11, 11.8, 12.3, 12.8, 13.1,13.6, 13.9, 14.4, 15, 15.1, 15.2, 15.3, 15.5, 16, 16.5, 17]),
'C': np.array([55, 56.3, 57, 58, 59.5, 60.4, 61, 61.5, 62, 62.1, 62.2, 62.3, 62.5, 63, 63.5, 64]),
'Target': np.array([4000, 4200.34, 4700, 5300, 5800, 6400, 6800, 7200, 7500, 7510, 7530, 7540, 7590,
8000, 8200, 8300])})
df['y'] = df.groupby(pd.cut(df.iloc[:, 3], np.arange(0, max(df.iloc[:, 3]) + 100, 100))).sum().apply(lambda g: integrate.trapz(g.Target, x = g.index))
Above code gives me:
AttributeError: ("'Series' object has no attribute 'Target'", 'occurred at index A')
If I try this:
colNames = ['A', 'B', 'C', 'Target']
df['z'] = df.groupby(pd.cut(df.iloc[:, 3], np.arange(0, max(df.iloc[:, 3]) + 100, 100))).sum().apply(lambda g: integrate.trapz(g[colNames[3]], x = g.index))
I get:
TypeError: 'str' object cannot be interpreted as an integer
During handling of the above exception, another exception occurred:
....
KeyError: ('Target', 'occurred at index A')

You have several problems in your code:
You have created a copy of your dataframe with a categorical (interval) index, which I think integrate.trapz cannot deal with.
With apply, you are applying integrate.trapz to each column of the summed dataframe (that is why the error says 'occurred at index A'), which makes no sense. For that reason I asked in my comment whether you need, in each row, the integral from 0 up to the Target value in that row.
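If that per-row running integral is in fact what you need, here is a minimal sketch of my own (not part of the original answer), assuming the original df with its default RangeIndex, using SciPy's cumulative trapezoid routine:
from scipy import integrate
# Running integral of 'Target' over the row index, one value per row;
# cumulative_trapezoid is named cumtrapz in SciPy versions before 1.6.
df['y'] = integrate.cumulative_trapezoid(df['Target'], x=df.index, initial=0)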
If you want to transform your data by intervals of width 100 in the column 'Target' starting from 0, as you have done, first you can get the sum over each 100-wide 'Target' interval:
>>>i_df = df.groupby(pd.cut(df.iloc[:, 3], np.arange(0, max(df.iloc[:, 3]) + 100, 100))).sum()
Then you get the trapezoidal integral of column 'Target' with intervals of 100
>>>integrate.trapz(i_df['Target'], dx=100)
10242034.0
You cannot use x=i_df.index because the subtraction that trapz performs internally is not defined for intervals, and you have created an interval index.
If you need to use the dataframe index you must reset it.
>>>i_df = df.groupby(pd.cut(df.iloc[:, 3], np.arange(0, max(df.iloc[:, 3]) + 100, 100))).sum().reset_index(drop=True)
>>>integrate.trapz(i_df['Target'], x=100*i_df.index)
10242034.0
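As a side note (my addition, not part of the original answer): in SciPy 1.6+ the same routine is also exposed as integrate.trapezoid, and recent releases deprecate the old trapz name, so the call above can equivalently be written as
>>>integrate.trapezoid(i_df['Target'], dx=100)
10242034.0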

Related

Python: Plot histograms with customized bins

I am using matplotlib.pyplot to make a histogram. Due to the distribution of the data, I want to set up the bins manually. The details are as follows:
Any value = 0 goes in one bin;
Any value > 60 goes in the last bin;
Any value > 0 and <= 60 goes in the bins in between, with a bin size of 5.
Could you please give me some help? Thank you.
I'm not sure what you mean by "the bin size is 5". You can either plot a histogram by specifying the bins with a sequence:
import matplotlib.pyplot as plt
data = [0, 0, 1, 2, 3, 4, 5, 6, 35, 60, 61, 82, -5] # your data here
plt.hist(data, bins=[0, 0.5, 60, max(data)])
plt.show()
But the bin size will match the corresponding interval, meaning, in this example, that the "0-case" will be barely visible in the resulting plot.
(Note that 60 ends up in the last bin when specifying bins as a sequence, because only the last bin includes its right edge; changing the sequence to [0, 0.5, 60.5, max(data)], i.e. any middle edge just above 60, would fix that.)
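A quick way to verify this (my own check, using numpy.histogram, which follows the same binning rules as plt.hist):
import numpy as np
data = [0, 0, 1, 2, 3, 4, 5, 6, 35, 60, 61, 82, -5]  # your data here
# Only the last bin is closed on its right edge, so 60 falls into the last bin:
print(np.histogram(data, bins=[0, 0.5, 60, max(data)])[0])    # [2 7 3]
# Raising the middle edge just above 60 keeps 60 in the middle bin:
print(np.histogram(data, bins=[0, 0.5, 60.5, max(data)])[0])  # [2 8 2]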
What you (probably) need is first to categorize your data and then plot a bar chart of the categories:
import matplotlib.pyplot as plt
import pandas as pd
data = [0, 0, 1, 2, 3, 4, 5, 6, 35, 60, 61, 82, -5] # your data here
df = pd.DataFrame()
df['data'] = data
def find_cat(x):
    if x == 0:
        return "0"
    elif x > 60:
        return "> 60"
    elif x > 0:
        return "> 0 and <= 60"
df['category'] = df['data'].apply(find_cat)
df.groupby('category', as_index=False).count().plot.bar(x='category', y='data', rot=0, width=0.8)
plt.show()
Output:
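A slightly shorter route to the same chart (my own variant, not from the original answer) is to count the categories directly:
# Count each category (the None returned for negative values is dropped)
# and plot the counts as bars; assumes the df built above.
df['category'].value_counts().sort_index().plot.bar(rot=0, width=0.8)
plt.show()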
Building off Tranbi's answer, you could specify the bin edges as detailed in the link they shared.
import matplotlib.pyplot as plt
import pandas as pd
data = [0, 0, 1, 2, 3, 4, 5, 6, 35, 60, 61, 82, -6] # your data here
df = pd.DataFrame()
df['data'] = data
bin_edges = [-5, 0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
bin_edges_offset = [x+0.000001 for x in bin_edges]
plt.figure()
plt.hist(df['data'], bins=bin_edges_offset)
plt.show()
IIUC you want a classic histogram for values between 0 (not included) and 60 (included), plus two extra bins for 0 and >60 on the sides.
In that case I would recommend plotting the 3 regions separately:
import matplotlib.pyplot as plt
data = [0, 0, 1, 2, 3, 4, 5, 6, 35, 60, 61, 82, -3] # your data here
fig, axes = plt.subplots(1,3, sharey=True, width_ratios=[1, 12, 1])
fig.subplots_adjust(wspace=0)
# counting 0 values and drawing a bar between -5 and 0
axes[0].bar(-5, data.count(0), width=5, align='edge')
axes[0].xaxis.set_visible(False)
axes[0].spines['right'].set_visible(False)
axes[0].set_xlim((-5, 0))
# histogram between (0, 60]
axes[1].hist(data, bins=12, range=(0.0001, 60.0001))
axes[1].yaxis.set_visible(False)
axes[1].spines['left'].set_visible(False)
axes[1].spines['right'].set_visible(False)
axes[1].set_xlim((0, 60))
# counting values > 60 and drawing a bar between 60 and 65
axes[2].bar(60, len([x for x in data if x > 60]), width=5, align='edge')
axes[2].xaxis.set_visible(False)
axes[2].yaxis.set_visible(False)
axes[2].spines['left'].set_visible(False)
axes[2].set_xlim((60, 65))
plt.show()
Output:
Edit: If you want to plot a probability density, I would edit the data and simply use hist:
import matplotlib.pyplot as plt
data = [0, 0, 1, 2, 3, 4, 5, 6, 35, 60, 61, 82, -3] # your data here
data2 = []
for el in data:
    if el < 0:
        pass
    elif el > 60:
        data2.append(61)
    else:
        data2.append(el)
plt.hist(data2, bins=14, density=True, range=(-4.99,65.01))
plt.show()

Row sum, column sum & diagonal sum of a matrix using list comprehension. Also get rid of error: "TypeError: 'int' object is not iterable"

I want to do a column sum using list comprehension. Below is my code and the corresponding error. If someone can help me fix the problem, that would be great. I just want to use list comprehensions only; I don't want to use the 'zip' function.
Code:
from pprint import pprint
from random import randint
r=int(input('Enter number of rows :'))
c=int(input('Enter number of columns:'))
l=[[randint(1,50) for i in range(r)] for j in range(c)]
print('The 2D Matrix is: ')
pprint(l)
lr_sum = [sum(l[i]) for i in range(r)]
print('Row Sum: ', lr_sum)
lc_sum=[[sum(l[j][i]) for i in range(r)] for j in range(c)]
print('Column sum: ', lc_sum)
Output:
Enter number of rows :5
Enter number of columns:5
The 2D Matrix is:
[[27, 3, 1, 28, 9],
[18, 20, 9, 50, 48],
[2, 44, 45, 14, 39],
[48, 12, 2, 38, 39],
[2, 37, 46, 26, 23]]
Row Sum: [68, 145, 144, 139, 134]
Traceback (most recent call last):
File "19.py", line 10, in <module>
lc_sum=[[sum(l[j][i]) for i in range(r)] for j in range(c)]
File "19.py", line 10, in <listcomp>
lc_sum=[[sum(l[j][i]) for i in range(r)] for j in range(c)]
File "19.py", line 10, in <listcomp>
lc_sum=[[sum(l[j][i]) for i in range(r)] for j in range(c)]
TypeError: 'int' object is not iterable
Thanks,
Sudip Ray
Python - Beginner
You made a simple mistake in your lc_sum definition. l[j][i] refers to a single (integer) value of your matrix. You cannot sum over a single value. Try it out yourself. E.g. sum(1) raises the same TypeError: 'int' object is not iterable.
In order for your code to work, you need to sum over the whole nested list comprehension:
from pprint import pprint
from random import randint
r = int(input('Enter number of rows :'))
c = int(input('Enter number of columns:'))
l = [[randint(1, 50) for i in range(r)] for j in range(c)]
print('The 2D Matrix is: ')
pprint(l)
lr_sum = [sum(l[i]) for i in range(r)]
print('Row Sum: ', lr_sum)
lc_sum = [sum([l[j][i] for i in range(r)]) for j in range(c)]
print('Column sum: ', lc_sum)
Be aware, however, that this is not the correct way to calculate the column sums. For the correct calculation, you also need to swap your list comprehensions:
lc_sum = [sum([l[j][i] for j in range(c)]) for i in range(r)]
Output1:
Enter number of rows :5
Enter number of columns:4
The 2D Matrix is:
[[10, 39, 40, 28],
[35, 34, 14, 1],
[27, 32, 20, 1],
[11, 13, 13, 8],
[28, 10, 16, 13]]
Row Sum : [117, 84, 80, 45, 67]
Column sum : [111, 128, 103, 51]
Please make sure that matrix is square
The diagonal sums are not possible
Output2:
Enter number of rows :5
Enter number of columns:5
The 2D Matrix is:
[[29, 16, 6, 48, 14],
[18, 41, 4, 37, 4],
[4, 41, 29, 44, 48],
[19, 8, 46, 2, 50],
[38, 49, 46, 1, 43]]
Row Sum : [113, 104, 166, 125, 177]
Column sum : [108, 155, 131, 132, 159]
Forward Diagonal sum: [144]
Reverse Diagonal sum: [126]
##################################################################################
from pprint import pprint
from random import randint
r=int(input('Enter number of rows :'))
c=int(input('Enter number of columns:'))
#l=[[int(input('Enter the numbers: ')) for j in range(c)] for i in range(r)]
l=[[randint(1,50) for j in range(c)] for i in range(r)]
print('The 2D Matrix is: ')
pprint(l)
##################################################################################
if(r!=c):
    lr_sum = [sum(l[i]) for i in range(r)]
    print('Row Sum :', lr_sum)
    lc_sum=[sum(l[j][i] for j in range(r)) for i in range(c)]
    print('Column sum :', lc_sum)
    print('Please make sure that matrix is square')
    print('The diagonal sums are not possible')
##################################################################################
else:
    lr_sum = [sum(l[i]) for i in range(r)]
    print('Row Sum :', lr_sum)
    lc_sum=[sum(l[j][i] for j in range(r)) for i in range(c)]
    print('Column sum :', lc_sum)
    ld1_sum=[sum(l[i][j] for i in range(r) for j in range(c) if i==j)]
    ld2_sum=[sum(l[i][j] for i in range(r) for j in range(c-1,-1,-1) if (i+j==r-1))]
    print('Forward Diagonal sum:',ld1_sum)
    print('Reverse Diagonal sum:',ld2_sum)
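As a side note (my own sketch, not part of the original post): for a square matrix the diagonal sums do not need the double loop with an if, because the row and column indices coincide on the forward diagonal and add up to r-1 on the reverse diagonal:
# Assumes l is an r x r (square) matrix, as in the else branch above.
ld1_sum = sum(l[i][i] for i in range(r))          # forward (main) diagonal
ld2_sum = sum(l[i][r - 1 - i] for i in range(r))  # reverse (anti) diagonal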

How to find the index position of items in a pandas list which satisfy a certain condition?

How can I find the index position of items in a list which satisfy a certain condition?
Like suppose, I have a list like:
myList = [0, 100, 335, 240, 300, 450, 80, 500, 200]
And the condition is to find out the position of all elements within myList which lie between 0 and 300 (both inclusive).
I am expecting the output as:
output = [0, 1, 3, 4, 6, 8]
How can I do this in pandas?
Also, how do I find out the index of the maximum element within the subset of elements which satisfy the condition? In the above case, out of the elements which satisfy the given condition, 300 is the maximum and its index is 4, so I need to retrieve that index.
I have been trying many ways but not getting the desired result. Please help, I am new to the programming world.
You can try this,
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [0, 100, 335, 240, 300, 450, 80, 500, 200]})
>>> index = list(df[(df.a >= 0) & (df.a <= 300)].index)
>>> df.loc[index,].idxmax()
a 4
dtype: int64
or using the list,
>>> l = [0, 100, 335, 240, 300, 450, 80, 500, 200]
>>> index = [(i, v) for i, v in enumerate(l) if v >= 0 and v <= 300]
>>> [t[0] for t in index]
[0, 1, 3, 4, 6, 8]
>>> sorted(index, key=lambda x: x[1])[-1][0]
4
As Grzegorz Skibinski says, we can use numpy to get rid of many of these computations:
>>> import numpy as np
>>> l = [0, 100, 335, 240, 300, 450, 80, 500, 200]
>>> index = np.array([[i, v] for i, v in enumerate(l) if v >= 0 and v <= 300])
>>> index[:,0]
array([0, 1, 3, 4, 6, 8])
>>> index[index.argmax(0)[1]][0]
4
You can use numpy for that purpose:
import numpy as np
myList =np.array( [0, 100, 335, 240, 300, 450, 80, 500, 200])
res=np.where((myList>=0)&(myList<=300))[0]
print(res)
###and to get maximum:
res2=res[myList[res].argmax()]
print(res2)
Output:
[0 1 3 4 6 8]
4
This is the between approach in pandas:
myList = [0, 100, 335, 240, 300, 450, 80, 500, 200]
s= pd.Series(myList)
s.index[s.between(0,300)]
Output:
Int64Index([0, 1, 3, 4, 6, 8], dtype='int64')
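To also cover the second part of the question (the index of the maximum among the matching values) with the between approach, a small sketch of my own, reusing the same s:
mask = s.between(0, 300)
print(s.index[mask].tolist())  # [0, 1, 3, 4, 6, 8]
print(s[mask].idxmax())        # 4, the index label of the largest in-range value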

how to plot a single line with different types of line dash using bokeh?

I am trying to plot a line through a set of points. Currently, I have the points as a data frame with the columns X, Y and Type. Whenever the type is 1, I would like to plot the points as a dashed line, and whenever the type is 2, as a solid line.
Currently, I am using a for loop to iterate over all points and plot each point using plt.dash. However, this is slowing down my run time since I want to plot more than 40000 points.
So, is there an easy way to plot the line over all points with different line dash types?
You could realize it by drawing multiple line segments like this
(Bokeh v1.1.0)
import pandas as pd
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, Range1d, LinearAxis
line_style = {1: 'solid', 2: 'dashed'}
data = {'name': [1, 1, 1, 2, 2, 2, 1, 1, 1, 1],
'counter': [1, 2, 3, 3, 4, 5, 5, 6, 7, 8],
'score': [150, 150, 150, 150, 150, 150, 150, 150, 150, 150],
'age': [20, 21, 22, 22, 23, 24, 24, 25, 26, 27]}
df = pd.DataFrame(data)
plot = figure(y_range = (100, 200))
plot.extra_y_ranges = {"Age": Range1d(19, 28)}
plot.add_layout(LinearAxis(y_range_name = "Age"), 'right')
for i, g in df.groupby([(df.name != df.name.shift()).cumsum()]):
    source = ColumnDataSource(g)
    plot.line(x = 'counter', y = 'score', line_dash = line_style[g.name.unique()[0]], source = source)
    plot.circle(x = 'counter', y = 'age', color = "blue", size = 10, y_range_name = "Age", source = source)
show(plot)
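The key trick in the loop above is the shift/cumsum expression, which labels consecutive runs of equal 'name' values so that each run becomes one line segment. A small standalone illustration of my own, using the same toy column:
import pandas as pd
name = pd.Series([1, 1, 1, 2, 2, 2, 1, 1, 1, 1])
# A new run starts wherever the value differs from the previous row;
# the cumulative sum then gives every consecutive run its own id.
run_id = (name != name.shift()).cumsum()
print(run_id.tolist())  # [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]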

Get list of rows with same name from dataframe using pandas

I was looking for a way to get, for each Name, the lists of the remaining row values.
Name x y r
a 9 81 63
a 98 5 89
b 51 50 73
b 41 22 14
c 6 18 1
c 1 93 55
d 57 2 90
d 58 24 20
So I was trying to get a dictionary as follows:
di = {a:{0: [9,81,63], 1: [98,5,89]},
b:{0:[51,50,73], 1:[41,22,14]},
c:{0:[6,18,1], 1:[1,93,55]},
d:{0:[57,2,90], 1:[58,24,20]}}
Use groupby with a custom function that builds a dict of the row lists, then convert the output Series with to_dict:
di = (df.groupby('Name')['x','y','r']
.apply(lambda x: dict(zip(range(len(x)),x.values.tolist())))
.to_dict())
print (di)
{'b': {0: [51, 50, 73], 1: [41, 22, 14]},
'a': {0: [9, 81, 63], 1: [98, 5, 89]},
'c': {0: [6, 18, 1], 1: [1, 93, 55]},
'd': {0: [57, 2, 90], 1: [58, 24, 20]}}
Detail:
print (df.groupby('Name')['x','y','r']
.apply(lambda x: dict(zip(range(len(x)),x.values.tolist()))))
Name
a {0: [9, 81, 63], 1: [98, 5, 89]}
b {0: [51, 50, 73], 1: [41, 22, 14]}
c {0: [6, 18, 1], 1: [1, 93, 55]}
d {0: [57, 2, 90], 1: [58, 24, 20]}
dtype: object
Thank you volcano for the suggestion to use enumerate:
di = (df.groupby('Name')['x','y','r']
.apply(lambda x: dict(enumerate(x.values.tolist())))
.to_dict())
For better testing it is possible to use a custom function:
def f(x):
    #print (x)
    a = range(len(x))
    b = x.values.tolist()
    print (a)
    print (b)
    return dict(zip(a,b))
range(0, 2)
[[9, 81, 63], [98, 5, 89]]
range(0, 2)
[[9, 81, 63], [98, 5, 89]]
range(0, 2)
[[51, 50, 73], [41, 22, 14]]
range(0, 2)
[[6, 18, 1], [1, 93, 55]]
range(0, 2)
[[57, 2, 90], [58, 24, 20]]
di = df.groupby('Name')['x','y','r'].apply(f).to_dict()
print (di)
Sometimes it is best to minimize the footprint and overhead.
Using itertools.count and collections.defaultdict:
from itertools import count
from collections import defaultdict
counts = {k: count(0) for k in df.Name.unique()}
d = defaultdict(dict)
for k, *v in df.values.tolist():
    d[k][next(counts[k])] = v
dict(d)
{'a': {0: [9, 81, 63], 1: [98, 5, 89]},
'b': {0: [51, 50, 73], 1: [41, 22, 14]},
'c': {0: [6, 18, 1], 1: [1, 93, 55]},
'd': {0: [57, 2, 90], 1: [58, 24, 20]}}
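For completeness, a compact variant of my own (not from the original answers) that builds the same nested structure with a dict comprehension over the groupby object:
# One enumerated dict of row lists per name; same output as above.
di = {name: dict(enumerate(g[['x', 'y', 'r']].values.tolist()))
      for name, g in df.groupby('Name')}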
