pandas custom sorting multilevel index - python-3.x

I have the following example dataset, and I'd like to sort the index columns by a custom order that is not contained within the dataframe. So far looking on SO I haven't been able to solve this. Example:
import pandas as pd

data = {'s': [1, 1, 1, 1],
        'am': ['cap', 'cap', 'sea', 'sea'],
        'cat': ['i', 'o', 'i', 'o'],
        'col1': [.55, .44, .33, .22],
        'col2': [.77, .66, .55, .44]}
df = pd.DataFrame(data=data)
df.set_index(['s', 'am', 'cat'], inplace=True)
Out[1]:
           col1  col2
s am  cat
1 cap i    0.55  0.77
      o    0.44  0.66
  sea i    0.33  0.55
      o    0.22  0.44
What I would like is the following:
Out[2]:
           col1  col2
s am  cat
1 sea i    0.33  0.55
      o    0.22  0.44
  cap i    0.55  0.77
      o    0.44  0.66
and I might also want to sort 'cat' with the order ['o', 'i'].

Use sort_values and sort_index
df.sort_values(df.columns.tolist()).sort_index(level=1, ascending=False,
                                               sort_remaining=False)

           col1  col2
s am  cat
1 sea i    0.33  0.55
      o    0.22  0.44
  cap i    0.55  0.77
      o    0.44  0.66
Convert the index to categorical to get the custom order.
data = {'s': [1, 1, 1, 1],
        'am': ['cap', 'cap', 'sea', 'sea'],
        'cat': ['i', 'j', 'k', 'l'],
        'col1': [.55, .44, .33, .22],
        'col2': [.77, .66, .55, .44]}
df = pd.DataFrame(data=data)
df.set_index(['s', 'am', 'cat'], inplace=True)

idx = pd.Categorical(df.index.get_level_values(2).values,
                     categories=['j', 'i', 'k', 'l'],
                     ordered=True)
# set_levels returns a new index (the inplace= argument is deprecated in newer pandas)
df.index = df.index.set_levels(idx, level='cat')
df.reset_index().sort_values('cat').set_index(['s', 'am', 'cat'])
           col1  col2
s am  cat
1 cap j    0.44  0.66
      i    0.55  0.77
  sea k    0.33  0.55
      l    0.22  0.44
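To tie this back to the frame in the question, the same trick can order both levels at once. Here is a minimal sketch, assuming the original df (am values 'cap'/'sea', cat values 'i'/'o') and a pandas version where set_levels returns a new index:
# Make 'am' and 'cat' ordered Categorical levels, then let sort_index follow the category order.
df.index = df.index.set_levels(
    pd.CategoricalIndex(df.index.levels[1], categories=['sea', 'cap'], ordered=True),
    level='am')
df.index = df.index.set_levels(
    pd.CategoricalIndex(df.index.levels[2], categories=['o', 'i'], ordered=True),
    level='cat')
df.sort_index()   # sea before cap, and o before i within each group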

As of Pandas 1.1 there is another option with the key param of sort_values.
SORT_VALS = {"am": ["sea", "cap"]}

def sorter(column):
    if column.name not in SORT_VALS:
        return column
    mapper = {val: order for order, val in enumerate(SORT_VALS[column.name])}
    return column.map(mapper)

new_df = df.sort_values(by=["s", "am", "cat"], key=sorter)
#            col1  col2
# s am  cat
# 1 sea i    0.33  0.55
#       o    0.22  0.44
#   cap i    0.55  0.77
#       o    0.44  0.66
You can also use pd.Categorical in the sorter and return a categorical Series for custom sort columns, which may have different performance implications depending on your scenario; note, however, that there is a soon-to-be-fixed bug in pandas that can prevent multi-column sorts with Categorical sorting.
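For reference, a sketch of that Categorical-based key function might look like the following (the helper name cat_sorter is just illustrative, and on affected pandas versions the multi-column bug mentioned above may still bite):
SORT_VALS = {"am": ["sea", "cap"], "cat": ["o", "i"]}

def cat_sorter(column):
    if column.name not in SORT_VALS:
        return column
    # An ordered Categorical sorts by the declared category order instead of alphabetically.
    return pd.Series(pd.Categorical(column, categories=SORT_VALS[column.name], ordered=True),
                     index=column.index)

new_df = df.sort_values(by=["s", "am", "cat"], key=cat_sorter)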

Related

Combine two columns into a column in dictionary format after performing a groupby operation

I have a data frame as shown below
df:
cust_id  products         rec_product  conf  sup
1        ['phone', 'tv']  ball         0.68  0.12
1        ['phone', 'tv']  bat          0.21  0.34
1        ['phone', 'tv']  book         0.02  0.25
2        ['bat']          ball         0.97  0.18
2        ['bat']          book         0.65  0.65
2        ['bat']          phone        0.23  0.36
2        ['bat']          tv           0.03  0.48
I want to combine the rec_product and conf columns into a dictionary after performing a groupby.
Expected output:
cust_id  products         prod_conf                                            prod_sup
1        ['phone', 'tv']  {'ball':0.68, 'bat':0.21, 'book':0.02}               {'ball':0.12, 'bat':0.34, 'book':0.25}
2        ['bat']          {'ball':0.97, 'book':0.65, 'phone':0.23, 'tv':0.03}  {'ball':0.18, 'book':0.65, 'phone':0.36, 'tv':0.48}
I tried the code below and it worked, but I would like to know whether there is any method that consumes less memory and executes faster.
Combine rec_product and conf into one column:
prod_conf_df = df.sort_values(['cust_id', 'conf'], ascending=[True, False]) \
                 .set_index('rec_product') \
                 .groupby(['cust_id', 'products']) \
                 .apply(lambda x: x['conf'].to_dict()).reset_index(name='prod_conf')
Combine rec_product and sup into one column:
prod_sup_df = df.sort_values(['cust_id', 'conf'], ascending=[True, False]) \
                .set_index('rec_product') \
                .groupby(['cust_id']) \
                .apply(lambda x: x['sup'].to_dict()).reset_index(name='prod_sup')
Combine both of the above dfs into one:
combined_df = pd.merge(prod_conf_df, prod_sup_df, on='cust_id', how='inner')
Instead of using multiple groupbys + applys, I would suggest doing all the aggregations with a single groupby inside a comprehension:
def dictify(k, g):
    return {
        'cust_id': k,
        'products': g['products'].iat[0],
        'prod_conf': dict(zip(g['rec_product'], g['conf'])),
        'prod_sup': dict(zip(g['rec_product'], g['sup']))
    }

s = df.sort_values(['cust_id', 'conf'], ascending=[True, False])
s = pd.DataFrame(dictify(k, g) for k, g in s.groupby('cust_id', sort=False))
Result
cust_id products prod_conf prod_sup
0 1 ['phone', 'tv'] {'ball': 0.68, 'bat': 0.21, 'book': 0.02} {'ball': 0.12, 'bat': 0.34, 'book': 0.25}
1 2 ['bat'] {'ball': 0.97, 'book': 0.65, 'phone': 0.23, 'tv': 0.03} {'ball': 0.18, 'book': 0.65, 'phone': 0.36, 'tv': 0.48}
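For comparison, the same output can also be built with a single groupby plus an apply that returns a Series per group; this is a sketch (not benchmarked here) and is typically slower than the comprehension above, but it stays entirely inside pandas:
s = df.sort_values(['cust_id', 'conf'], ascending=[True, False])
combined = (s.groupby('cust_id', sort=False)
             .apply(lambda g: pd.Series({
                 'products': g['products'].iat[0],
                 'prod_conf': dict(zip(g['rec_product'], g['conf'])),
                 'prod_sup': dict(zip(g['rec_product'], g['sup'])),
             }))
             .reset_index())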

I want to find how many employees were associated with the company code per month and year [duplicate]

I have a data frame df and I use several columns from it to groupby:
df[['col1','col2','col3','col4']].groupby(['col1','col2']).mean()
In the above way I almost get the table (data frame) that I need. What is missing is an additional column that contains the number of rows in each group. In other words, I have the means, but I also would like to know how many numbers were used to get them. For example, in the first group there are 8 values and in the second one 10, and so on.
In short: How do I get group-wise statistics for a dataframe?
Quick Answer:
The simplest way to get row counts per group is by calling .size(), which returns a Series:
df.groupby(['col1','col2']).size()
Usually you want this result as a DataFrame (instead of a Series) so you can do:
df.groupby(['col1', 'col2']).size().reset_index(name='counts')
If you want to find out how to calculate the row counts and other statistics for each group continue reading below.
Detailed example:
Consider the following example dataframe:
In [2]: df
Out[2]:
  col1 col2  col3  col4  col5  col6
0    A    B  0.20 -0.61 -0.49  1.49
1    A    B -1.53 -1.01 -0.39  1.82
2    A    B -0.44  0.27  0.72  0.11
3    A    B  0.28 -1.32  0.38  0.18
4    C    D  0.12  0.59  0.81  0.66
5    C    D -0.13 -1.65 -1.64  0.50
6    C    D -1.42 -0.11 -0.18 -0.44
7    E    F -0.00  1.42 -0.26  1.17
8    E    F  0.91 -0.47  1.35 -0.34
9    G    H  1.48 -0.63 -1.14  0.17
First let's use .size() to get the row counts:
In [3]: df.groupby(['col1', 'col2']).size()
Out[3]:
col1  col2
A     B       4
C     D       3
E     F       2
G     H       1
dtype: int64
Then let's use .size().reset_index(name='counts') to get the row counts:
In [4]: df.groupby(['col1', 'col2']).size().reset_index(name='counts')
Out[4]:
  col1 col2  counts
0    A    B       4
1    C    D       3
2    E    F       2
3    G    H       1
Including results for more statistics
When you want to calculate statistics on grouped data, it usually looks like this:
In [5]: (df
   ...:  .groupby(['col1', 'col2'])
   ...:  .agg({
   ...:      'col3': ['mean', 'count'],
   ...:      'col4': ['median', 'min', 'count']
   ...:  }))
Out[5]:
             col4              col3
           median   min count      mean count
col1 col2
A    B     -0.810 -1.32     4 -0.372500     4
C    D     -0.110 -1.65     3 -0.476667     3
E    F      0.475 -0.47     2  0.455000     2
G    H     -0.630 -0.63     1  1.480000     1
The result above is a little annoying to deal with because of the nested column labels, and also because row counts are on a per column basis.
To gain more control over the output I usually split the statistics into individual aggregations that I then combine using join. It looks like this:
In [6]: gb = df.groupby(['col1', 'col2'])
   ...: counts = gb.size().to_frame(name='counts')
   ...: (counts
   ...:  .join(gb.agg({'col3': 'mean'}).rename(columns={'col3': 'col3_mean'}))
   ...:  .join(gb.agg({'col4': 'median'}).rename(columns={'col4': 'col4_median'}))
   ...:  .join(gb.agg({'col4': 'min'}).rename(columns={'col4': 'col4_min'}))
   ...:  .reset_index()
   ...: )
   ...:
Out[6]:
  col1 col2  counts  col3_mean  col4_median  col4_min
0    A    B       4  -0.372500       -0.810     -1.32
1    C    D       3  -0.476667       -0.110     -1.65
2    E    F       2   0.455000        0.475     -0.47
3    G    H       1   1.480000       -0.630     -0.63
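If you are on pandas 0.25 or later, named aggregation gets you the same flat, custom-named columns in a single agg call; here is a sketch of the equivalent (using 'size' for the row count):
out = (df.groupby(['col1', 'col2'])
         .agg(counts=('col3', 'size'),          # row count per group (includes NaN rows)
              col3_mean=('col3', 'mean'),
              col4_median=('col4', 'median'),
              col4_min=('col4', 'min'))
         .reset_index())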
Footnotes
The code used to generate the test data is shown below:
In [1]: import numpy as np
   ...: import pandas as pd
   ...:
   ...: keys = np.array([
   ...:     ['A', 'B'],
   ...:     ['A', 'B'],
   ...:     ['A', 'B'],
   ...:     ['A', 'B'],
   ...:     ['C', 'D'],
   ...:     ['C', 'D'],
   ...:     ['C', 'D'],
   ...:     ['E', 'F'],
   ...:     ['E', 'F'],
   ...:     ['G', 'H']
   ...: ])
   ...:
   ...: df = pd.DataFrame(
   ...:     np.hstack([keys, np.random.randn(10, 4).round(2)]),
   ...:     columns=['col1', 'col2', 'col3', 'col4', 'col5', 'col6']
   ...: )
   ...:
   ...: df[['col3', 'col4', 'col5', 'col6']] = \
   ...:     df[['col3', 'col4', 'col5', 'col6']].astype(float)
   ...:
Disclaimer:
If some of the columns that you are aggregating have null values, then you really want to be looking at the group row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean because pandas will drop NaN entries in the mean calculation without telling you about it.
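A tiny sketch of that difference, using a made-up frame with a single NaN: 'count' drops the missing value while 'size' does not.
import numpy as np
import pandas as pd

toy = pd.DataFrame({'g': ['x', 'x', 'y'], 'v': [1.0, np.nan, 3.0]})
toy.groupby('g')['v'].agg(['mean', 'count', 'size'])
#    mean  count  size
# g
# x   1.0      1     2   # 'count' excludes the NaN, 'size' counts every row
# y   3.0      1     1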
On a groupby object, the agg function can take a list to apply several aggregation methods at once. This should give you the result you need:
df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).agg(['mean', 'count'])
Swiss Army Knife: GroupBy.describe
Returns count, mean, std, and other useful statistics per-group.
df.groupby(['A', 'B'])['C'].describe()
           count  mean   std   min   25%   50%   75%   max
A   B
bar one      1.0  0.40   NaN  0.40  0.40  0.40  0.40  0.40
    three    1.0  2.24   NaN  2.24  2.24  2.24  2.24  2.24
    two      1.0 -0.98   NaN -0.98 -0.98 -0.98 -0.98 -0.98
foo one      2.0  1.36  0.58  0.95  1.15  1.36  1.56  1.76
    three    1.0 -0.15   NaN -0.15 -0.15 -0.15 -0.15 -0.15
    two      2.0  1.42  0.63  0.98  1.20  1.42  1.65  1.87
To get specific statistics, just select them,
df.groupby(['A', 'B'])['C'].describe()[['count', 'mean']]

           count      mean
A   B
bar one      1.0  0.400157
    three    1.0  2.240893
    two      1.0 -0.977278
foo one      2.0  1.357070
    three    1.0 -0.151357
    two      2.0  1.423148
Note: if you only need to compute 1 or 2 stats then it might be faster to use groupby.agg and just compute those columns, otherwise you are performing wasteful computation.
describe works for multiple columns (change ['C'] to ['C', 'D'], or remove it altogether, and see what happens; the result is a dataframe with MultiIndexed columns).
You also get different statistics for string data. Here's an example,
df2 = df.assign(D=list('aaabbccc')).sample(n=100, replace=True)

with pd.option_context('precision', 2):
    display(df2.groupby(['A', 'B'])
               .describe(include='all')
               .dropna(how='all', axis=1))
C D
count mean std min 25% 50% 75% max count unique top freq
A B
bar one 14.0 0.40 5.76e-17 0.40 0.40 0.40 0.40 0.40 14 1 a 14
three 14.0 2.24 4.61e-16 2.24 2.24 2.24 2.24 2.24 14 1 b 14
two 9.0 -0.98 0.00e+00 -0.98 -0.98 -0.98 -0.98 -0.98 9 1 c 9
foo one 22.0 1.43 4.10e-01 0.95 0.95 1.76 1.76 1.76 22 2 a 13
three 15.0 -0.15 0.00e+00 -0.15 -0.15 -0.15 -0.15 -0.15 15 1 c 15
two 26.0 1.49 4.48e-01 0.98 0.98 1.87 1.87 1.87 26 2 b 15
For more information, see the documentation.
pandas >= 1.1: DataFrame.value_counts
This is available from pandas 1.1. If you just want to capture the size of every group, it cuts out the GroupBy and is faster:
df.value_counts(subset=['col1', 'col2'])
Minimal Example
# Setup
np.random.seed(0)
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
df.value_counts(['A', 'B'])

A    B
foo  two      2
     one      2
     three    1
bar  two      1
     three    1
     one      1
dtype: int64
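value_counts also takes normalize and sort arguments, so a couple of close variants look like this (a sketch, using the same toy frame as above):
df.value_counts(['A', 'B'], normalize=True)   # proportions instead of raw counts
df.value_counts(['A', 'B'], sort=False)       # skip the sort by count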
Other Statistical Analysis Tools
If you didn't find what you were looking for above, the User Guide has a comprehensive listing of supported statistical analysis, correlation, and regression tools.
To get multiple stats, collapse the index, and retain column names:
df = df.groupby(['col1', 'col2']).agg(['mean', 'count'])
df.columns = [' '.join(str(i) for i in col) for col in df.columns]
df.reset_index(inplace=True)
df
This produces a flat dataframe with columns like 'col3 mean', 'col3 count', 'col4 mean', 'col4 count'.
We can easily do it by using groupby and count, but we should remember to use reset_index():
df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).count().reset_index()
Please try this code:
df['count_it'] = df.groupby(['col1', 'col2'])['col3'].transform('count')  # broadcasts the per-group count back to every row
df
This adds a column called 'count_it' holding the count of each group.
Create a groupby object and call methods on it, as in the example below:
grp = df.groupby(['col1', 'col2', 'col3'])
grp.max()
grp.mean()
grp.describe()
If you are familiar with tidyverse R packages, here is a way to do it in Python:
from datar.all import tibble, rnorm, f, group_by, summarise, mean, n, rep

df = tibble(
    col1=rep(['A', 'B'], 5),
    col2=rep(['C', 'D'], each=5),
    col3=rnorm(10),
    col4=rnorm(10)
)

df >> group_by(f.col1, f.col2) >> summarise(
    count=n(),
    col3_mean=mean(f.col3),
    col4_mean=mean(f.col4)
)
  col1 col2  count  col3_mean  col4_mean
0    A    C      3  -0.516402   0.468454
1    A    D      2  -0.248848   0.979655
2    B    C      2   0.545518  -0.966536
3    B    D      3  -0.349836  -0.915293
[Groups: ['col1'] (n=2)]
I am the author of the datar package. Please feel free to submit issues if you have any questions about using it.
Another alternative:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
df

     A      B         C         D
0  foo    one  0.808197  2.057923
1  bar    one  0.330835 -0.815545
2  foo    two -1.664960 -2.372025
3  bar  three  0.034224  0.825633
4  foo    two  1.131271 -0.984838
5  bar    two  2.961694 -1.122788
6  foo    one -0.054695  0.503555
7  foo  three  0.018052 -0.746912
pd.crosstab(df.A, df.B).stack().reset_index(name='count')
Output:
     A      B  count
0  bar    one      1
1  bar  three      1
2  bar    two      1
3  foo    one      2
4  foo  three      1
5  foo    two      2
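One behavioural note on this approach: crosstab builds the full A/B grid, so in general a pair that never occurs shows up with a count of 0 (every pair happens to occur in this sample), whereas groupby(...).size() only lists combinations that exist. A small sketch if you want to drop the zero rows:
counts = pd.crosstab(df.A, df.B).stack().reset_index(name='count')
counts = counts[counts['count'] > 0]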

Splitting the values of column in csv and write in new column using Pandas

I have an Excel column as shown below:
FileName  Coordinates
abc.text  0 0.41, 0.42, 0.43, 0.44
I want the output to be in this fashion:
FileName  Coordinates                Label  X-1   Y-1   X-3   X-4
abc.txt   0, 0.41, 0.42, 0.43, 0.44  0      0.41  0.42  0.43  0.44
I have written the code below, and it worked for me before; I don't know what I am missing in this specific case:
import pandas as pd
df = pd.read_csv('path/to/Coordinates_v3_updated.csv')
df[['Label', 'x1','y1', 'x2', 'y2']] = df['Coordinates'].str.split(" ",expand=True)
print(df)
df.to_csv('path/to/save/to/Coordinates_v3_updated_v1.csv', index=False)
print("Done")
Replace
df[['Label', 'x1', 'y1', 'x2', 'y2']] = df['Coordinates'].str.split(" ", expand=True)
with
df[['Label', 'x1', 'y1', 'x2', 'y2']] = df['Coordinates'].str.split(", ", expand=True)
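Putting that fix into the full script, it might look like the sketch below (paths are the placeholders from the question; if the separator ever varies, splitting on a regular expression would be more forgiving):
import pandas as pd

df = pd.read_csv('path/to/Coordinates_v3_updated.csv')
# The Coordinates field is ', '-separated, so split on ', ' rather than a single space.
df[['Label', 'x1', 'y1', 'x2', 'y2']] = df['Coordinates'].str.split(', ', expand=True)
print(df)
df.to_csv('path/to/save/to/Coordinates_v3_updated_v1.csv', index=False)
print("Done")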

Sorting numerically but not alphabetically when numbers are equal

I have a file like this:
A 0.77
C 0.98
B 0.77
Z 0.77
G 0.65
I want to sort the file numerically in descending order. I used this code:
sort -gr -k2,2 file.txt
I obtain this:
C 0.98
Z 0.77
B 0.77
A 0.77
G 0.65
In my real file I have several lines with the same number, and they end up ordered alphabetically. What I want is to sort numerically but not alphabetically when the numbers are equal; I want those lines left unsorted alphabetically:
C 0.98
B 0.77
Z 0.77
A 0.77
G 0.65
But any random order is fine.
You can use this sort:
sort -k2rn -k1R file
C 0.98
B 0.77
Z 0.77
A 0.77
G 0.65
There are 2 sort options used:
-k2rn: First sort key is column 2; numerical, reverse
-k1R: Second sort key is column 1; random
One in GNU awk that preserves the order of the first field (random in, equally random out):
$ awk '{
    a[$2] = a[$2] (a[$2]=="" ? "" : FS) $1    # append $1 values to a hash indexed on $2
}
END {
    PROCINFO["sorted_in"] = "#ind_num_desc"   # traverse the hash in descending numeric index order...
    for (i in a) {                            # ... and use that order here
        n = split(a[i], b)
        for (j=1; j<=n; j++)                  # preserve the input order within equal values
            print b[j], i                     # output
    }
}' file
C 0.98
A 0.77
B 0.77
Z 0.77
G 0.65
Testing reverse order:
$ tac file | awk '# above awk script'
C 0.98
Z 0.77
B 0.77
A 0.77
G 0.65
