How to calculate distances between rows in dataframe and create a matrix - python-3.x

I have a dataframe like this
import pandas as pd
sample = pd.DataFrame({'Col1': ['1','0','1','0'],'Col2':['0','0','1','1'],'Col3':['0','0','1','0'],'Class':['A','B','A','B']},index=['Item1','Item2','Item3','Item4'])
In [32]: print(sample)
Out [32]:
      Col1 Col2 Col3 Class
Item1    1    0    0     A
Item2    0    0    0     B
Item3    1    1    1     A
Item4    0    1    0     B
And I want to calculate pairwise distances between rows, split by class. First of all, I would like to calculate the distances between the rows of class A:
      Item1 Item3
Item1     0  0.67
Item3  0.67     0
Secondly, the distances between the rows of class B:
      Item2 Item4
Item2     0     1
Item4     1     0
And lastly, the distances between rows of different classes:
      Item2 Item4
Item1     1     1
Item3     1  0.67
I have tried calculating distances with DistanceMetric one by one:
from sklearn.neighbors import DistanceMetric
dist = DistanceMetric.get_metric('jaccard')
But I don't know how to do this by iterating over the rows of a large dataframe to create these 3 different distance matrices.

To find distances within class A and class B, you can use DataFrame.groupby (Euclidean distance is used here, so select the numeric columns and cast them to float first):
dist = DistanceMetric.get_metric('euclidean')
dist_cols = ['Col1', 'Col2', 'Col3']

def find_distance(group):
    return pd.DataFrame(dist.pairwise(group[dist_cols].astype(float)))

df.groupby('Class').apply(find_distance)
                0         1
Class
A     0  0.000000  1.414214
      1  1.414214  0.000000
B     0  0.000000  1.000000
      1  1.000000  0.000000
If you only have two classes, you can separate them into two dataframes and then calculate the distances between them:
dist_cols = ['Col1', 'Col2','Col3']
df_a = df[df['Class']=='A']
df_b = df[df['Class']=='B']
distances = dist.pairwise(df_a[dist_cols].values, df_b[dist_cols].values)
distances
array([[1.        , 1.41421356],
       [1.73205081, 1.41421356]])
pd.DataFrame(distances, columns = df_b.index, index = df_a.index)
          Item2     Item4
Item1  1.000000  1.414214
Item3  1.732051  1.414214
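Since the question actually asks for Jaccard distances on 0/1 data (note the metric name is 'jaccard', not 'jacquard'), the same split-by-class idea can be sketched with scipy's cdist; the columns are cast to int because the sample stores them as strings:

```python
import pandas as pd
from scipy.spatial.distance import cdist

sample = pd.DataFrame({'Col1': ['1', '0', '1', '0'], 'Col2': ['0', '0', '1', '1'],
                       'Col3': ['0', '0', '1', '0'], 'Class': ['A', 'B', 'A', 'B']},
                      index=['Item1', 'Item2', 'Item3', 'Item4'])

cols = ['Col1', 'Col2', 'Col3']
a = sample.loc[sample['Class'] == 'A', cols].astype(int)
b = sample.loc[sample['Class'] == 'B', cols].astype(int)

# Within-class and between-class Jaccard distance matrices,
# labelled with the original item names
within_a = pd.DataFrame(cdist(a, a, metric='jaccard'), index=a.index, columns=a.index)
within_b = pd.DataFrame(cdist(b, b, metric='jaccard'), index=b.index, columns=b.index)
between  = pd.DataFrame(cdist(a, b, metric='jaccard'), index=a.index, columns=b.index)

print(between.round(2))
```

This reproduces the three matrices from the question, e.g. a distance of 0.67 between Item1 and Item3.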

Related

Extract a text out of a column using pattern in python

I'm trying to extract text from a column into another column using a pattern in Python, but I'm missing some results; at the same time I need to keep the unextracted strings as they are.
My code is:
import pandas as pd
df = pd.DataFrame({
    'col': ['item1 (30-10)', 'item2 (200-100)', 'item3 (100 FS)', 'item4 (100+)', 'item1 (1000-2000)']
})
pattern = r'(\d+(\,[0-9]+)?\-\d+(\,[a-zA-Z])?\d+)'
df['result'] = df['col'].str.extract(pattern)[0]
print(df)
My output is:
                 col     result
0      item1 (30-10)      30-10
1    item2 (200-100)    200-100
2     item3 (100 FS)        NaN
3       item4 (100+)        NaN
4  item1 (1000-2000)  1000-2000
My output should be:
                 col result   newcolumn
0      item1 (30-10)  item1     (30-10)
1    item2 (200-100)  item2   (200-100)
2     item3 (100 FS)  item3    (100 FS)
3       item4 (100+)  item4      (100+)
4  item1 (1000-2000)  item1 (1000-2000)
You can use this:
df['newcolumn'] = df.col.str.extract(r'(\(.+\))')
df['result'] = df['col'].str.extract(r'(\w+)')
Output:
                 col   newcolumn result
0      item1 (30-10)     (30-10)  item1
1    item2 (200-100)   (200-100)  item2
2     item3 (100 FS)    (100 FS)  item3
3       item4 (100+)      (100+)  item4
4  item1 (1000-2000) (1000-2000)  item1
Explanation:
The first expression gets the content within parenthesis (including the parenthesis themselves). The second gets the first word.
You can also do this with .str.split in a single line:
df[['result', 'newcolumn']] = df['col'].str.split(' ', n=1, expand=True)
Output:
                 col result   newcolumn
0      item1 (30-10)  item1     (30-10)
1    item2 (200-100)  item2   (200-100)
2     item3 (100 FS)  item3    (100 FS)
3       item4 (100+)  item4      (100+)
4  item1 (1000-2000)  item1 (1000-2000)
You must use expand=True if your strings have a non-uniform number of splits (see also How to split a dataframe string column into two columns?).
EDIT: If you want to 'drop' the old column, you can also overwrite it and rename it:
df[['col', 'newcolumn']] = df['col'].str.split(' ', n=1, expand=True)
df = df.rename(columns={"col": "result"})
which exactly gives you the result you specified was intended:
  result   newcolumn
0  item1     (30-10)
1  item2   (200-100)
2  item3    (100 FS)
3  item4      (100+)
4  item1 (1000-2000)
You can extract the parts of interest by grouping them within one regular expression. The regex pattern now matches item\d as first group and anything inside the brackets with \(.*\) as the second one.
import pandas as pd
df = pd.DataFrame({
    'col': ['item1 (30-10)', 'item2 (200-100)', 'item3 (100 FS)', 'item4 (100+)', 'item1 (1000-2000)']
})
pattern = r"(item\d*)\s(\(.*\))"
df['items'] = df['col'].str.extract(pattern)[0]
df['result'] = df['col'].str.extract(pattern)[1]
print(df)
Output:
                 col  items      result
0      item1 (30-10)  item1     (30-10)
1    item2 (200-100)  item2   (200-100)
2     item3 (100 FS)  item3    (100 FS)
3       item4 (100+)  item4      (100+)
4  item1 (1000-2000)  item1 (1000-2000)
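A variant of the same idea, using named groups so one extract call fills both columns at once (the column names here are just the ones the question used):

```python
import pandas as pd

df = pd.DataFrame({
    'col': ['item1 (30-10)', 'item2 (200-100)', 'item3 (100 FS)', 'item4 (100+)', 'item1 (1000-2000)']
})

# Named groups become the column names of the extracted frame
extracted = df['col'].str.extract(r'(?P<items>item\d*)\s(?P<result>\(.*\))')
df = df.join(extracted)
```

This avoids running the regex twice over the column.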

Combining multi-row excel records into table using VBA

I'm quite new to VBA and trying to combine multiple row records in a large data dump into a single row record with multiple headers.
The data is exported into an excel file from another program and takes the form of:
Order Item Qty
1 Item1 2
1 Item2 5
1 Item4 1
2 Item1 1
2 Item2 2
2 Item3 5
3 Item1 4
3 Item2 5
3 Item3 1
4 Item2 2
4 Item3 1
5 Item1 1
5 Item2 1
5 Item3 1
6 Item1 4
6 Item2 4
6 Item4 2
Which would then be sorted into:
Order Item1 Item2 Item3 Item4
1     2     5           1
2     1     2     5
3     4     5     1
4           2     1
5     1     1     1
6     4     4           2
I'm not expecting anyone to write my code but any pointers as to an overall approach would be much appreciated. Thanks!
I created a helper column and a formula to get the counts. Hope this can help.
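The question asks for VBA, but as a cross-check of the target layout, the reshape being described is exactly a pivot; a minimal pandas sketch with the first two orders of the sample data:

```python
import pandas as pd

# A few rows of the sample data from the question
data = pd.DataFrame({'Order': [1, 1, 1, 2, 2, 2],
                     'Item':  ['Item1', 'Item2', 'Item4', 'Item1', 'Item2', 'Item3'],
                     'Qty':   [2, 5, 1, 1, 2, 5]})

# One row per Order, one column per Item, quantities as values;
# missing Order/Item combinations come out as NaN (blank cells)
out = data.pivot_table(index='Order', columns='Item', values='Qty', aggfunc='sum')
```

In VBA the equivalent approach is to collect distinct Items for the header row and distinct Orders for the rows, then fill each cell by matching (Order, Item) pairs.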

Equal Levels in Pandas Group By Object

I want to make levels in each group equal even if the values in the levels are not equal between the groups. Below is the example of what I want to achieve:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['foo']*3 + ['bar']*4,
                   'B': [0, 1, 2, 0, 1, 2, 3],
                   'C': np.random.randn(7)})
Now, if I group by columns A and B, the output will be as follows:
>> print(df.groupby(['A', 'B']).sum())
              C
A   B
bar 0 -1.452272
    1  0.331986
    2  0.764295
    3  1.863472
foo 0 -1.066971
    1 -0.411573
    2  0.158449
I want to achieve as follows:
              C
A   B
bar 0 -1.452272
    1  0.331986
    2  0.764295
    3  1.863472
foo 0 -1.066971
    1 -0.411573
    2  0.158449
    3  0.000000
I searched a lot about this, but not able to figure it out.
You can add unstack and stack after your code:
df.groupby(['A', 'B']).sum().unstack(fill_value=0).stack()
Out[372]:
C
A B
bar 0 -0.243351
    1 -0.568541
    2  1.529810
    3 -0.327521
foo 0 -2.380512
    1  1.088617
    2 -0.125879
    3  0.000000
Another option is to use pd.crosstab and stack:
pd.crosstab(df['A'], df['B'], df['C'], aggfunc='sum').stack(dropna=False).fillna(0)
Output:
A B
bar 0 0.553563
1 0.357182
2 -0.294756
3 1.176766
foo 0 -0.514786
1 1.841072
2 0.792337
3 0.000000
dtype: float64
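A third option, assuming you want explicit control over which levels appear, is to build the full index with MultiIndex.from_product and reindex (fixed values are used here instead of randn so the numbers are reproducible):

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo']*3 + ['bar']*4,
                   'B': [0, 1, 2, 0, 1, 2, 3],
                   'C': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]})

# Every (A, B) combination, including ones absent from the data
full = pd.MultiIndex.from_product([sorted(df['A'].unique()), sorted(df['B'].unique())],
                                  names=['A', 'B'])
out = df.groupby(['A', 'B'])['C'].sum().reindex(full, fill_value=0)
```

The missing ('foo', 3) combination is created and filled with 0, which is the behaviour the question asks for.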

Sorting pivot table (multi index)

I'm trying to sort a pivot table's values in descending order after putting two "row labels" (Excel term) on the pivot.
sample data:
x = pd.DataFrame({'col1':['a','a','b','c','c', 'a','b','c', 'a','b','c'],
'col2':[ 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
'col3':[ 1,.67,0.5, 2,.65, .75,2.25,2.5, .5, 2,2.75]})
print(x)
col1 col2 col3
0 a 1 1.00
1 a 1 0.67
2 b 1 0.50
3 c 1 2.00
4 c 1 0.65
5 a 2 0.75
6 b 2 2.25
7 c 2 2.50
8 a 3 0.50
9 b 3 2.00
10 c 3 2.75
To create the pivot, I'm using the following function:
pt = pd.pivot_table(x, index=['col1', 'col2'], values='col3', aggfunc='sum')
print(pt)
col3
col1 col2
a 1 1.67
2 0.75
3 0.50
b 1 0.50
2 2.25
3 2.00
c 1 2.65
2 2.50
3 2.75
In words, pt is first sorted by col1, then by col2 within col1. This is great, but I would like to sort by col3 (the values) within each col1 group, letting col2 fall in whatever order that produces.
The target output would look something like this (col3 in descending order with any order in col2 with that group of col1):
col3
col1 col2
a 1 1.67
2 0.75
3 0.50
b 2 2.25
3 2.00
1 0.50
c 3 2.75
1 2.65
2 2.50
I have tried the code below, but this just sorts the entire pivot table values and loses the grouping (I'm looking for sorting within the group).
pt.sort_values(by = 'col3', ascending = False)
For guidance, a similar question was asked (and answered) here, but I was unable to get a successful output with the provided answer:
Pandas: Sort pivot table
The error I get from that answer is ValueError: all keys need to be the same shape
You need reset_index on the pivot, then sort_values by col1 and col3, and finally set_index to restore the MultiIndex:
pt = (pt.reset_index()
        .sort_values(['col1', 'col3'], ascending=[True, False])
        .set_index(['col1', 'col2']))
print(pt)
col3
col1 col2
a 1 1.67
2 0.75
3 0.50
b 2 2.25
3 2.00
1 0.50
c 3 2.75
1 2.65
2 2.50
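In pandas 0.23 and later, sort_values also accepts index level names, so the round trip through reset_index can be skipped; a sketch on the pivot from the question:

```python
import pandas as pd

x = pd.DataFrame({'col1': ['a', 'a', 'b', 'c', 'c', 'a', 'b', 'c', 'a', 'b', 'c'],
                  'col2': [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
                  'col3': [1, .67, .5, 2, .65, .75, 2.25, 2.5, .5, 2, 2.75]})
pt = pd.pivot_table(x, index=['col1', 'col2'], values='col3', aggfunc='sum')

# 'col1' is an index level and 'col3' a column; sort_values allows mixing them
out = pt.sort_values(by=['col1', 'col3'], ascending=[True, False])
```

This keeps the col1 groups intact while ordering col3 descending within each group.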

Pandas create percentile field based on groupby with level 1

Given the following data frame:
import pandas as pd
df = pd.DataFrame({
    ('Group', 'group'): ['a', 'a', 'a', 'b', 'b', 'b'],
    ('sum', 'sum'): [234, 234, 544, 7, 332, 766]
})
I'd like to create a new field which calculates the percentile of each value of "sum" per group in "group". The trouble is, I have 2 header columns and cannot figure out how to avoid getting the error:
ValueError: level > 0 only valid with MultiIndex
when I run this:
df=df.groupby('Group',level=1).sum.rank(pct=True, ascending=False)
I need to keep the headers in the same structure.
Thanks in advance!
To group by the first column, ('Group', 'group'), and compute the rank for the ('sum', 'sum') column use:
In [106]: df['rank'] = (df[('sum', 'sum')].groupby(df[('Group', 'group')]).rank(pct=True, ascending=False))
In [107]: df
Out[107]:
Group sum rank
group sum
0 a 234 0.833333
1 a 234 0.833333
2 a 544 0.333333
3 b 7 1.000000
4 b 332 0.666667
5 b 766 0.333333
Note that .rank(pct=True) computes a percentage rank, not a percentile. To compute a percentile you could use scipy.stats.percentileofscore.
import scipy.stats as stats
df['percentile'] = (df[('sum', 'sum')].groupby(df[('Group', 'group')])
                      .apply(lambda ser: 100 - pd.Series(
                          [stats.percentileofscore(ser, x, kind='rank') for x in ser],
                          index=ser.index)))
yields
Group sum rank percentile
group sum
0 a 234 0.833333 50.000000
1 a 234 0.833333 50.000000
2 a 544 0.333333 0.000000
3 b 7 1.000000 66.666667
4 b 332 0.666667 33.333333
5 b 766 0.333333 0.000000
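If the two-level column structure must be preserved exactly, the rank can also be assigned under a two-level name instead of a flat 'rank' (the ('rank', 'rank') label is an assumption about what the OP would want):

```python
import pandas as pd

df = pd.DataFrame({
    ('Group', 'group'): ['a', 'a', 'a', 'b', 'b', 'b'],
    ('sum', 'sum'): [234, 234, 544, 7, 332, 766]
})

# Assign under a two-level column name so every header keeps two levels
df[('rank', 'rank')] = (df[('sum', 'sum')]
                        .groupby(df[('Group', 'group')])
                        .rank(pct=True, ascending=False))
```

This keeps all headers in the same MultiIndex structure, which the question says is required.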
