Mann-Whitney U test using pandas groupby()

I have a dataframe similar to this one but with more samples and more data columns:
data
group ADAM28 ADAM28.1 ADAM28.2 ADAM7
sample1 del 36.53930 -23.25650 -4.13008 -36.380600
sample2 del -2.71788 29.80030 6.40632 1.939350
sample3 del 10.09880 -4.10111 6.82952 47.033500
sample4 del 5.19362 4.52109 -2.55278 134.387000
sample5 del -5.86248 -4.22071 9.70282 -45.357800
sample6 del 26.75510 27.02110 -4.62898 111.514000
sample7 nodel -24.26460 11.76370 -19.17070 0.709847
sample8 nodel -23.24770 -3.52641 -14.89390 -46.733900
sample9 nodel -6.77488 -22.98740 -2.09688 -11.281900
sample10 nodel -3.98377 -4.27557 4.44730 -23.672000
sample11 nodel -1.10222 -5.25087 -5.98834 -25.495100
sample12 nodel -4.62463 -6.83067 -3.20250 -23.887400
I would like to do Mann-Whitney U tests for all columns based on "group", so something like:
data.groupby(["group"]).mannwhitneyu()
How do I call the data columns?
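There is no mannwhitneyu method on a groupby object, so one common approach is to split the frame by group and loop over the data columns with scipy.stats.mannwhitneyu. A minimal sketch, assuming scipy is available and data is the frame shown above:

import pandas as pd
from scipy.stats import mannwhitneyu  # assumes scipy is installed

# Split the samples by group, then test each data column separately.
del_rows = data.loc[data["group"] == "del"]
nodel_rows = data.loc[data["group"] == "nodel"]

pvalues = {
    col: mannwhitneyu(del_rows[col], nodel_rows[col],
                      alternative="two-sided").pvalue
    for col in data.columns.drop("group")
}
print(pd.Series(pvalues, name="p-value"))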

Related

How can I convert a MongoDB cursor result to a DataFrame in Python?

This is my code:
x = list(coll.find(
    {"activities.flowCenterInfo": {'$exists': True}},
    {'activities.activityId': 1, 'activities.flowCenterInfo': 1, '_id': 0}
).limit(5))
for row in x:
    print(row)
This is the result of x for one sample:
{'activities': [{'activityId': 'B83F36898FE444309757FBEB6DF0685D', 'flowCenterInfo': {'processId': '178888', 'demandComplaintSubject': 'İkna Görüşmesi', 'demandComplaintDetailSubject': 'Hayat Sigortadan Ayrılma', 'demandComplaintId': '178888'}}]}
I want to convert it to a DataFrame so I can write it to an Oracle table. How can I convert it properly? I can't find a way.
This image shows the MongoDB structure of one sample.
Assuming that the activities key contains a list with a single dict, each field within the flowCenterInfo key is prefixed with fcinfo_:
# sample list
l = [{'activities': [{'activityId': 'B83F36898FE444309757FBEB6DF0685D', 'flowCenterInfo': {'processId': '178888', 'demandComplaintSubject': 'İkna Görüşmesi', 'demandComplaintDetailSubject': 'Hayat Sigortadan Ayrılma', 'demandComplaintId': '178888'}}]},
{'activities': [{'activityId': 'B83F36898FE444309757FBEB6DF0685D', 'flowCenterInfo': {'processId': '178888', 'demandComplaintSubject': 'İkna Görüşmesi', 'demandComplaintDetailSubject': 'Hayat Sigortadan Ayrılma', 'demandComplaintId': '178888'}}]},
{'activities': [{'activityId': 'B83F36898FE444309757FBEB6DF0685D', 'flowCenterInfo': {'processId': '178888', 'demandComplaintSubject': 'İkna Görüşmesi', 'demandComplaintDetailSubject': 'Hayat Sigortadan Ayrılma', 'demandComplaintId': '178888'}}]}]
df = pd.DataFrame.from_records(
    [dict(**{'activityId': r['activities'][0]['activityId']},
          **dict(zip(map('fcinfo_{}'.format, r['activities'][0]['flowCenterInfo'].keys()),
                     r['activities'][0]['flowCenterInfo'].values())))
     for r in l])
print(df)
activityId fcinfo_processId ... fcinfo_demandComplaintDetailSubject fcinfo_demandComplaintId
0 B83F36898FE444309757FBEB6DF0685D 178888 ... Hayat Sigortadan Ayrılma 178888
1 B83F36898FE444309757FBEB6DF0685D 178888 ... Hayat Sigortadan Ayrılma 178888
2 B83F36898FE444309757FBEB6DF0685D 178888 ... Hayat Sigortadan Ayrılma 178888
[3 rows x 5 columns]
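As a possible alternative (a sketch, assuming pandas >= 1.0 where json_normalize is a top-level function), the nested records can be flattened directly and the flowCenterInfo.* columns renamed to the fcinfo_ prefix:

import pandas as pd

# Explode each element of the 'activities' list into its own row;
# nested flowCenterInfo fields come out as 'flowCenterInfo.<field>' columns.
df = pd.json_normalize(l, record_path='activities')
df.columns = [c.replace('flowCenterInfo.', 'fcinfo_') for c in df.columns]
print(df.head())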

How to create multiple dataframes using multiple functions

I quite often write a function to return different dataframes based on the parameters I enter. Here's an example dataframe:
import numpy as np
import pandas as pd

np.random.seed(1111)
df = pd.DataFrame({
    'Category': np.random.choice(['Group A', 'Group B', 'Group C', 'Group D'], 10000),
    'Sub-Category': np.random.choice(['X', 'Y', 'Z'], 10000),
    'Sub-Category-2': np.random.choice(['G', 'F', 'I'], 10000),
    'Product': np.random.choice(['Product 1', 'Product 2', 'Product 3'], 10000),
    'Units_Sold': np.random.randint(1, 100, size=10000),
    'Dollars_Sold': np.random.randint(100, 1000, size=10000),
    'Customer': np.random.choice(pd.util.testing.rands_array(10, 25, dtype='str'), 10000),
    'Date': np.random.choice(pd.date_range('1/1/2016', '12/31/2018', freq='M'), 10000)
})
I then created a function to perform sub-totals for me like this:
def some_fun(DF1, agg_column, myList=[], *args):
    y = pd.concat([
        DF1.assign(**{x: '[Total]' for x in myList[i:]})
           .groupby(myList).agg(sumz=(agg_column, 'sum'))
        for i in range(1, len(myList) + 1)
    ]).sort_index().unstack(0)
    return y
I then write out lists that I'll pass as arguments to the function:
list_one = [pd.Grouper(key='Date',freq='A'),'Category','Product']
list_two = [pd.Grouper(key='Date',freq='A'),'Category','Sub-Category','Sub-Category-2']
list_three = [pd.Grouper(key='Date',freq='A'),'Sub-Category','Product']
I then have to run each list through my function creating new dataframes:
df1 = some_fun(df,'Units_Sold',list_one)
df2 = some_fun(df,'Dollars_Sold',list_two)
df3 = some_fun(df,'Units_Sold',list_three)
I then use a function to write each of these dataframes to an Excel worksheet. This is just an example - I perform this same exercise 10+ times.
My question - is there a better way to perform this task than writing out df1, df2, df3 with the function information applied? Should I be looking at using a dictionary or some other data type to do this more pythonically with a function?
A dictionary would be my first choice:
variations = [('Units_Sold', list_one), ('Dollars_Sold', list_two),
              ..., ('Title', some_list)]
df_variations = {}
for i, v in enumerate(variations):
    name = v[0]
    data = v[1]
    df_variations[i] = some_fun(df, name, data)
You might further consider setting the keys to unique, helpful titles for the variations, going beyond something like 'Units_Sold', which isn't unique in your case.
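For illustration, a sketch of that variant with made-up titles as keys (the titles here are assumptions, not names from the question):

# Hypothetical descriptive titles used as dictionary keys.
variations = {
    'units_by_category_product': ('Units_Sold', list_one),
    'dollars_by_subcategories': ('Dollars_Sold', list_two),
    'units_by_subcategory_product': ('Units_Sold', list_three),
}
df_variations = {title: some_fun(df, name, data)
                 for title, (name, data) in variations.items()}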
IIUC,
as Thomas has suggested, we can use a dictionary to parse through your data, but with some minor modifications to your function the dictionary can hold all the required data and then be passed through to your function.
The idea is to pass two types of keys: the lists of columns, and the arguments to your pd.Grouper call.
data_dict = {
    "Units_Sold": {"key": "Date", "freq": "A"},
    "Dollars_Sold": {"key": "Date", "freq": "A"},
    "col_list_1": ["Category", "Product"],
    "col_list_2": ["Category", "Sub-Category", "Sub-Category-2"],
    "col_list_3": ["Sub-Category", "Product"],
}
def some_fun(dataframe, agg_col, dictionary, column_list, *args):
    key = dictionary[agg_col]["key"]
    frequency = dictionary[agg_col]["freq"]
    myList = [pd.Grouper(key=key, freq=frequency), *dictionary[column_list]]
    y = (
        pd.concat(
            [
                dataframe.assign(**{x: "[Total]" for x in myList[i:]})
                .groupby(myList)
                .agg(sumz=(agg_col, "sum"))
                for i in range(1, len(myList) + 1)
            ]
        )
        .sort_index()
        .unstack(0)
    )
    return y
Test.
df1 = some_fun(df,'Units_Sold',data_dict,'col_list_3')
print(df1)
sumz
Date 2016-12-31 2017-12-31 2018-12-31
Sub-Category Product
X Product 1 18308 17839 18776
Product 2 18067 19309 18077
Product 3 17943 19121 17675
[Total] 54318 56269 54528
Y Product 1 20699 18593 18103
Product 2 18642 19712 17122
Product 3 17701 19263 20123
[Total] 57042 57568 55348
Z Product 1 19077 17401 19138
Product 2 17207 21434 18817
Product 3 18405 17300 17462
[Total] 54689 56135 55417
[Total] [Total] 166049 169972 165293
As you want to automate writing the 10+ worksheets, we can again do that with a dictionary mapped over your function:
matches = {'Units_Sold': ['col_list_1', 'col_list_3'],
           'Dollars_Sold': ['col_list_2']}
Then a simple for loop writes each dataframe to its own sheet of a single Excel file; change this to match your required behavior.
writer = pd.ExcelWriter('finished_excel_file.xlsx')
for key, value in matches.items():
    for items in value:
        dataframe = some_fun(df, key, data_dict, items)
        dataframe.to_excel(writer, f'{key}_{items}')
writer.save()
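One caveat worth noting: ExcelWriter.save() was removed in pandas 2.0, so on newer versions the same loop can be written with a context manager (a sketch):

# Same loop with a with-block, which closes and saves the file
# automatically; needed on pandas 2.x, where writer.save() no longer exists.
with pd.ExcelWriter('finished_excel_file.xlsx') as writer:
    for key, value in matches.items():
        for items in value:
            some_fun(df, key, data_dict, items).to_excel(
                writer, sheet_name=f'{key}_{items}')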

How to apply a function with multiple arguments to a specific column in Pandas?

I'm trying to apply a function to a specific column in this dataframe:
datetime PM2.5 PM10 SO2 NO2
0 2013-03-01 7.125000 10.750000 11.708333 22.583333
1 2013-03-02 30.750000 42.083333 36.625000 66.666667
2 2013-03-03 76.916667 120.541667 61.291667 81.000000
3 2013-03-04 22.708333 44.583333 22.854167 46.187500
4 2013-03-06 223.250000 265.166667 116.236700 142.059383
5 2013-03-07 263.375000 316.083333 97.541667 147.750000
6 2013-03-08 221.458333 297.958333 69.060400 120.092788
I'm trying to apply this function (below) to a specific column (PM10) of the above dataframe:
range1 = [list(range(0, 50)), list(range(51, 100)), list(range(101, 200)),
          list(range(201, 300)), list(range(301, 400)), list(range(401, 2000))]

def c1_c2(x, y):
    for a in y:
        if x in a:
            min_val = min(a)
            max_val = max(a) + 1
            return max_val - min_val
Where "x" can be any column and "y" = Range1
Available Options
df.PM10.apply(c1_c2,args(df.PM10,range1),axis=1)
df.PM10.apply(c1_c2)
I've tried these couple of available options and none of them seems to be working. Any suggestions?
Not sure what the expected output from the function is, but to get the function called you can try the following:
from functools import partial
df.PM10.apply(partial(c1_c2, y=range1))
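Equivalently, Series.apply can forward extra positional arguments itself, without functools:

# apply passes each value as the first argument and forwards args= after it.
df.PM10.apply(c1_c2, args=(range1,))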
Update:
Ok, I think I understand a little better. This should work, but 'range1' is a list of lists of integers; your data doesn't contain integers, so the new column comes up empty. I created another list, based on your initial data, that works. See below:
df = pd.read_csv('pm_data.txt', header=0)

range1 = [[7.125000, 10.750000, 11.708333, 22.583333],
          list(range(0, 50)), list(range(51, 100)), list(range(101, 200)),
          list(range(201, 300)), list(range(301, 400)), list(range(401, 2000))]

def c1_c2(x, y):
    for a in y:
        if x in a:
            min_val = min(a)
            max_val = max(a) + 1
            return max_val - min_val

df['function'] = df.PM10.apply(lambda x: c1_c2(x, range1))
print(df.head(10))
datetime PM2.5 PM10 SO2 NO2 new_column function
0 2013-03-01 7.125000 10.750000 11.708333 22.583333 25.750000 16.458333
1 2013-03-02 30.750000 42.083333 36.625000 66.666667 2.104167 NaN
2 2013-03-03 76.916667 120.541667 61.291667 81.000000 6.027083 NaN
3 2013-03-04 22.708333 44.583333 22.854167 46.187500 2.229167 NaN
4 2013-03-06 223.250000 265.166667 116.236700 142.059383 13.258333 NaN
5 2013-03-07 263.375000 316.083333 97.541667 147.750000 15.804167 NaN
6 2013-03-08 221.458333 297.958333 69.060400 120.092788 14.897917 NaN
Only the first row of 'function' has a match, because that value comes from your initial data and therefore satisfies 'if x in a'.
Old Code:
I'm also not sure what you are doing. But you can use a lambda to modify columns or create new ones.
Like this,
import pandas as pd
I created a data file to import from the data you posted above:
datetime,PM2.5,PM10,SO2,NO2
2013-03-01,7.125000,10.750000,11.708333,22.583333
2013-03-02,30.750000,42.083333,36.625000,66.666667
2013-03-03,76.916667,120.541667,61.291667,81.000000
2013-03-04,22.708333,44.583333,22.854167,46.187500
2013-03-06,223.250000,265.166667,116.236700,142.059383
2013-03-07,263.375000,316.083333,97.541667,147.750000
2013-03-08,221.458333,297.958333,69.060400,120.092788
Here is how I import it,
df = pd.read_csv('pm_data.txt', header=0)
and create a new column and apply a function to the data in 'PM10'
df['new_column'] = df['PM10'].apply(lambda x: x+15 if x < 30 else x/20)
which yields,
datetime PM2.5 PM10 SO2 NO2 new_column
0 2013-03-01 7.125000 10.750000 11.708333 22.583333 25.750000
1 2013-03-02 30.750000 42.083333 36.625000 66.666667 2.104167
2 2013-03-03 76.916667 120.541667 61.291667 81.000000 6.027083
3 2013-03-04 22.708333 44.583333 22.854167 46.187500 2.229167
4 2013-03-06 223.250000 265.166667 116.236700 142.059383 13.258333
5 2013-03-07 263.375000 316.083333 97.541667 147.750000 15.804167
6 2013-03-08 221.458333 297.958333 69.060400 120.092788 14.897917
Let me know if this helps.
"I've tried these couple of available options and none of them seems to be working..."
What do you mean by this? What's your output, are you getting errors or what?
I see a couple of problems:
range1's lists contain ints while your column values are floats, so c1_c2() will return None.
even if the data types matched between range1 and the columns, c1_c2() would still return None when a value is not in range1.
Below is how I would do it, assuming the data-types match:
def c1_c2(x):
    range1 = [list of lists]  # placeholder: same ranges as above
    for a in range1:
        if x in a:
            min_val = min(a)
            max_val = max(a) + 1
            return max_val - min_val
    return x  # returns the original value if not in range1
df.PM10.apply(c1_c2)
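If the integer lists were actually meant as numeric ranges (an assumption, since the question doesn't say), an interval check would also match the float values. A sketch that preserves the same max(a) + 1 - min(a) width:

# Half-open intervals [lo, hi) mirror list(range(lo, hi)): the width is hi - lo.
bins = [(0, 50), (51, 100), (101, 200), (201, 300), (301, 400), (401, 2000)]

def c1_c2_float(x):
    for lo, hi in bins:
        if lo <= x < hi:
            return hi - lo  # same as max(range(lo, hi)) + 1 - min(range(lo, hi))
    return x  # fall back to the original value, as above

df['function'] = df.PM10.apply(c1_c2_float)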

Python pandas: How could I read an Excel file without column labels and then insert column labels?

I have a list which I want to insert as the column labels.
But when I use pandas' read_excel, it always treats the 0th row as the column labels.
How can I read the file as a pandas dataframe and then set my list as the column labels?
orig_index = pd.read_excel(basic_info, sheetname = 'KI12E00')
0.619159 0.264191 0.438849 0.465287 0.445819 0.412582 0.397366 \
0 0.601379 0.303953 0.457524 0.432335 0.415333 0.382093 0.382361
1 0.579914 0.343715 0.418294 0.401129 0.385508 0.355392 0.355123
Here is my personal list of column names:
print set_index
[20140109, 20140213, 20140313, 20140410, 20140508, 20140612]
And I want to make the dataframe look like this:
20140109 20140213 20140313 20140410 20140508 20140612
0 0.619159 0.264191 0.438849 0.465287 0.445819 0.412582 0.397366 \
1 0.601379 0.303953 0.457524 0.432335 0.415333 0.382093 0.382361
2 0.579914 0.343715 0.418294 0.401129 0.385508 0.355392 0.355123
Pass header=None to tell it there isn't a header, and you can pass a list in names to tell it what you want to use at the same time. (Note that you're missing a column name in your example; I'm assuming that's accidental.)
For example:
>>> df = pd.read_excel("out.xlsx", header=None)
>>> df
0 1 2 3 4 5 6
0 0.619159 0.264191 0.438849 0.465287 0.445819 0.412582 0.397366
1 0.601379 0.303953 0.457524 0.432335 0.415333 0.382093 0.382361
2 0.579914 0.343715 0.418294 0.401129 0.385508 0.355392 0.355123
or
>>> names = [20140109, 20140213, 20140313, 20140410, 20140508, 20140612, 20140714]
>>> df = pd.read_excel("out.xlsx", header=None, names=names)
>>> df
20140109 20140213 20140313 20140410 20140508 20140612 20140714
0 0.619159 0.264191 0.438849 0.465287 0.445819 0.412582 0.397366
1 0.601379 0.303953 0.457524 0.432335 0.415333 0.382093 0.382361
2 0.579914 0.343715 0.418294 0.401129 0.385508 0.355392 0.355123
And you can always set the column names after the fact by assigning to df.columns.
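For example, a sketch of that after-the-fact assignment, assuming set_index holds exactly one label per column:

df = pd.read_excel("out.xlsx", header=None)
df.columns = set_index  # pandas raises if the list length doesn't match the column count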

Combining several dataframe results with a for loop in Python Pandas

Let's say I have these functions:
def query():
    dict = (
        {"NO": 1, "PART": "ALPHA"},
        {"NO": 2, "PART": "BETA"}
    )
    finalqueryresult = pandas.DataFrame()
    # request a query for each record in dict; twice in this example
    for info in dict:
        finalqueryresult.append(sendquery(info["NO"], info["PART"]))

def sendquery(no, part):
    # *some code to request query to server and save it under reqresult variable*
    # .....
    # .....
    return reqresult
For the example above, sending the first query (the record with "NO" = 1) will return this (let's say it's df1):
NAME COUNTRY
1 RYO JPN
2 JON NZ
and the last query (the record with "NO" = 2) will return this (let's say it's df2):
NAME COUNTRY
1 TING CN
2 ASHYU INA
and what I want is for finalqueryresult to look like this (df1 combined with df2):
NAME COUNTRY
1 RYO JPN
2 JON NZ
3 TING CN
4 ASHYU INA
But I failed: finalqueryresult is always empty. I suppose something is wrong with this:
for info in dict:
    finalqueryresult.append(sendquery(info["NO"], info["PART"]))
I think you need to first append all the DataFrames to a list dfs and then use concat:
dfs = []
for info in dict:
    # sendquery(info["NO"], info["PART"]) returns a DataFrame
    dfs.append(sendquery(info["NO"], info["PART"]))
finalqueryresult = pd.concat(dfs, ignore_index=True)
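For context on why the original loop stayed empty: DataFrame.append returned a new DataFrame rather than modifying the caller in place (and it was removed entirely in pandas 2.0 in favor of concat), so each return value was silently discarded. A minimal sketch of the failure mode on older pandas:

import pandas as pd

final = pd.DataFrame()
df1 = pd.DataFrame({'NAME': ['RYO'], 'COUNTRY': ['JPN']})
final.append(df1)   # returns a new frame; the return value is discarded here
print(final.empty)  # True: `final` itself was never modified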
