I have the following df,
code y_m count
101 2017-11 86
101 2017-12 32
102 2017-11 11
102 2017-12 34
102 2018-01 46
103 2017-11 56
103 2017-12 89
Now I want to convert this df into a matrix that transposes column y_m to rows, making count the matrix cell values, like:
0 1 2 3 4
0 -1 0 2017-11 2017-12 2018-01
1 0 354 153 155 46
2 101 118 86 32 -1
3 102 91 11 34 46
4 103 145 56 89 -1
Specifically, -1 is a dummy value that indicates either that a value doesn't exist for a given (code, y_m) pair or that it pads the matrix shape; 0 represents 'all' values, aggregating over code, y_m, or both. E.g. cell (1, 1) sums the count values over all y_m and code; (1, 2) sums the count for 2017-11.
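For reference, a minimal construction of the sample frame (a sketch built from the data shown above):
import pandas as pd

# sample frame from the question
df = pd.DataFrame({'code': [101, 101, 102, 102, 102, 103, 103],
                   'y_m': ['2017-11', '2017-12', '2017-11', '2017-12',
                           '2018-01', '2017-11', '2017-12'],
                   'count': [86, 32, 11, 34, 46, 56, 89]})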
You can first use pivot_table:
df1 = (df.pivot_table(index='code',
                      columns='y_m',
                      values='count',
                      margins=True,
                      aggfunc='sum',
                      fill_value=-1,
                      margins_name='0'))
print (df1)
y_m 2017-11 2017-12 2018-01 0
code
101 86 32 -1 118
102 11 34 46 91
103 56 89 -1 145
0 153 155 46 354
And then reshape to the final format (note that this produces mixed values, numeric with strings):
#change order of index and columns values for reindex
idx = df1.index[-1:].tolist() + df1.index[:-1].tolist()
cols = df1.columns[-1:].tolist() + df1.columns[:-1].tolist()
df2 = (df1.reindex(index=idx, columns=cols)
          .reset_index()
          .rename(columns={'code':-1})
          .rename_axis(None, axis=1))
#add columns as the first row (DataFrame.append was removed in pandas 2.0, so use pd.concat)
df3 = pd.concat([df2.columns.to_frame().T, df2]).reset_index(drop=True)
#reset column names to a range
df3.columns = range(len(df3.columns))
print (df3)
0 1 2 3 4
0 -1 0 2017-11 2017-12 2018-01
1 0 354 153 155 46
2 101 118 86 32 -1
3 102 91 11 34 46
4 103 145 56 89 -1
I have a list (lst) of 2000 dataframes. I want to combine all of these dataframes into one. Each dataframe has two columns, and the first column of each dataframe is the same. For example:
#First dataframe
>>lst[0]
0 1
11 6363
21 737
34 0
43 0
#Second dataframe
>>lst[1]
0 1
11 33
21 0
34 937
43 0
#third dataframe
>>lst[2]
0 1
11 73
21 18
34 27
43 77
Final dataframe will look like:
0 1 2 3
11 6363 33 73
21 737 0 18
34 0 937 27
43 0 0 77
How can I achieve that? Insights will be appreciated.
First we can set the first column as the index so it is ignored in the concatenation:
lst = [df.set_index(0) for df in lst]
Then we concatenate along the columns and move column 0 from the index back to a regular column:
df_out = pd.concat(lst, axis=1).reset_index()
And we rename the columns:
df_out.columns = range(df_out.shape[1])
Result is:
>> df_out
0 1 2 3 ...
0 11 6363 33 73 ...
1 21 737 0 18 ...
2 34 0 937 27 ...
3 43 0 0 77 ...
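To reproduce with the sample data above, the three frames can be rebuilt like this (a sketch; column 0 holds the shared keys):
import pandas as pd

# rebuild the three sample frames from the question
lst = [pd.DataFrame({0: [11, 21, 34, 43], 1: [6363, 737, 0, 0]}),
       pd.DataFrame({0: [11, 21, 34, 43], 1: [33, 0, 937, 0]}),
       pd.DataFrame({0: [11, 21, 34, 43], 1: [73, 18, 27, 77]})]

lst = [df.set_index(0) for df in lst]
df_out = pd.concat(lst, axis=1).reset_index()
df_out.columns = range(df_out.shape[1])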
You could try this:
lst = [df0, df1, df2, ...]
# Merge dataframes
df_all = lst[0]
for df in lst[1:]:
    df_all = df_all.merge(df, how="outer", on=0)
# Rename columns of final dataframe
df_all.columns = list(range(df_all.shape[1]))
print(df_all)
# Outputs
0 1 2 3 ...
0 11 6363 33 73 ...
1 21 737 0 18 ...
2 34 0 937 27 ...
3 43 0 0 77 ...
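The same fold can also be written with functools.reduce; a sketch, assuming every frame shares key column 0, and renaming each value column up front so successive merges never collide on the column named 1:
from functools import reduce

# give each value column a unique name, then fold with outer merges on column 0
renamed = [df.rename(columns={1: i + 1}) for i, df in enumerate(lst)]
df_all = reduce(lambda left, right: left.merge(right, how="outer", on=0), renamed)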
I have a dataframe that looks similar to the one below:
Wave A B C
340 77 70 15
341 80 73 15
342 83 76 16
343 86 78 17
I want to generate columns that will have all the possible combinations of the existing columns. I showed 3 columns here, but in my actual data I have 7 columns and therefore 127 total combinations. The desired output is as follows:
Wave A B C AB AC BC ... ABC
340 77 70 15 147 92 ...
341 80 73 15 153 95 ...
342 83 76 16 159 99 ...
I implemented a quite inefficient version where the user inputs the combinations (AB, AC, etc.) and a new column is created with the row-wise sum. This seems almost impossible to do for 127 combinations, especially with descriptive column names.
Create a list of all combinations with chain + combinations from itertools, then sum the appropriate columns:
from itertools import combinations, chain
cols = [*df.iloc[:,1:]]
l = list(chain.from_iterable(combinations(cols, n+2) for n in range(len(cols))))
#[('A', 'B'), ('A', 'C'), ('B', 'C'), ('A', 'B', 'C')]
for items in l:
    df[''.join(items)] = df.loc[:, list(items)].sum(axis=1)
Wave A B C AB AC BC ABC
0 340 77 70 15 147 92 85 162
1 341 80 73 15 153 95 88 168
2 342 83 76 16 159 99 92 175
3 343 86 78 17 164 103 95 181
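As a sanity check for the 7-column case mentioned in the question, the loop above creates every combination of size 2 through 7, i.e. 2**7 - 7 - 1 = 120 new columns (127 subsets in total once the 7 single columns are counted). A quick check (a sketch with hypothetical column names):
from itertools import combinations, chain

cols = list('ABCDEFG')  # hypothetical 7 column names
l = list(chain.from_iterable(combinations(cols, n + 2) for n in range(len(cols))))
print(len(l))  # 120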
You need to get all the combinations first, then create a mapping from each original column to each combined column name, as a dict or Series:
import itertools

l=df.columns[1:].tolist()
l1=[list(map(list, itertools.combinations(l, i))) for i in range(len(l) + 1)]
d=[dict.fromkeys(y,''.join(y))for x in l1 for y in x ]
maps=pd.Series(d).apply(pd.Series).stack()
df.set_index('Wave',inplace=True)
df=df.reindex(columns=maps.index.get_level_values(1))
#reindex repeats the columns in the order of the map's keys
df.columns=maps.tolist()
#assign the combined names back to the columns; the order matches thanks to the reindex above
df.sum(level=0,axis=1)
Out[303]:
A B C AB AC BC ABC
Wave
340 77 70 15 147 92 85 162
341 80 73 15 153 95 88 168
342 83 76 16 159 99 92 175
343 86 78 17 164 103 95 181
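Note that df.sum(level=0, axis=1) was removed in pandas 2.0; an equivalent for newer versions is a transpose-groupby-transpose (a sketch, assuming the duplicate column names created above; the column order of the result may differ):
df.T.groupby(level=0).sum().T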
When I transpose a dataframe, the headers are treated as the "index" by default. But I want them to be a column and not an index. How do I achieve this?
import pandas as pd
data = {'col-a': [97, 98, 99],
        'col-b': [34, 35, 36],
        'col-c': [24, 25, 26]}
df = pd.DataFrame(data)
print(df.T)
0 1 2
col-a 97 98 99
col-b 34 35 36
col-c 24 25 26
Desired Output:
0 1 2 3
0 col-a 97 98 99
1 col-b 34 35 36
2 col-c 24 25 26
Try T with reset_index:
df=df.T.reset_index()
print(df)
Or, with inplace=True (note that df.T returns a new object, so assign it first):
df = df.T
df.reset_index(inplace=True)
print(df)
Both Output:
index 0 1 2
0 col-a 97 98 99
1 col-b 34 35 36
2 col-c 24 25 26
If you care about the column names, add this to the code:
df.columns=range(4)
Or:
it=iter(range(4))
df=df.rename(columns=lambda x: next(it))
Or if you don't know the number of columns:
df.columns=range(len(df.columns))
Or:
it=iter(range(len(df.columns)))
df=df.rename(columns=lambda x: next(it))
All Output:
0 1 2 3
0 col-a 97 98 99
1 col-b 34 35 36
2 col-c 24 25 26
Use reset_index and then set default column names:
import numpy as np

df1 = df.T.reset_index()
df1.columns = np.arange(len(df1.columns))
print (df1)
0 1 2 3
0 col-a 97 98 99
1 col-b 34 35 36
2 col-c 24 25 26
Another solution:
print (df.rename_axis(0, axis=1).rename(lambda x: x + 1).T.reset_index())
#alternative
#print (df.T.rename_axis(0).rename(columns = lambda x: x + 1).reset_index())
0 1 2 3
0 col-a 97 98 99
1 col-b 34 35 36
2 col-c 24 25 26
I need to find the max of two columns (p_1_logreg, p_2_logreg), where the comparison should be limited to blocks of 14 rows at a time.
My csv file is myProb.csv (loaded in the answer below).
I tried to slice my index into:
int1_str1_str2_int2_str3_int4
The max should be found among rows where int1, str1, str2, int2 and str3 are fixed, and only int4 changes (from index 0 to index 13, and so on).
I tried to fix one element at a time and use groupby, but I couldn't iterate over only the int4 value.
Here is the code to find the max for column p_1_label, but the result is not what I am looking for.
max_1_row=raw_prob.loc[raw_prob.groupby(raw_prob['id'].str.split('_').str[1])['p_1_'+label].idxmax()]
max_1_row=max_1_row.loc[raw_prob.groupby(raw_prob['id'].str.split('_').str[3])['p_1_'+label].idxmax()]
max_1_row=max_1_row.loc[raw_prob.groupby(raw_prob['id'].str.split('_').str[5])['p_1_'+label].idxmax()]
Any ideas?
I think you need DataFrameGroupBy.idxmax, grouping by the id with the trailing _<int4> suffix replaced by an empty string, and then selecting rows by loc:
df = pd.read_csv('myProb.csv', index_col=[0])
idx = df.drop('id', axis=1).groupby(df['id'].str.replace(r'_\d+$', '', regex=True)).idxmax()
print (idx.head(15))
p_0_logreg p_1_logreg p_2_logreg
id
6_PanaCleanerJune_sub_12_ICA 2 9 6
6_PanaCleanerJune_sub_13_ICA 17 19 23
6_PanaCleanerJune_sub_14_ICA 34 37 33
6_PanaCleanerJune_sub_15_ICA 52 51 43
6_PanaCleanerJune_sub_17_ICA 66 67 69
6_PanaCleanerJune_sub_18_ICA 82 79 76
6_PanaCleanerJune_sub_19_ICA 89 87 90
6_PanaCleanerJune_sub_20_ICA 98 103 104
6_PanaCleanerJune_sub_21_ICA 114 117 112
6_PanaCleanerJune_sub_22_ICA 129 133 127
6_PanaCleanerJune_sub_23_ICA 145 146 143
6_PanaCleanerJune_sub_24_ICA 155 166 161
6_PanaCleanerJune_sub_25_ICA 176 173 174
6_PanaCleanerJune_sub_26_ICA 186 191 189
6_PanaCleanerJune_sub_27_ICA 202 203 209
df1 = df.loc[idx['p_1_logreg']]
print (df1.head(15))
id p_0_logreg p_1_logreg p_2_logreg
9 6_PanaCleanerJune_sub_12_ICA_10 0.013452 0.985195 0.001353
19 6_PanaCleanerJune_sub_13_ICA_6 0.051184 0.948816 0.000000
37 6_PanaCleanerJune_sub_14_ICA_10 0.013758 0.979351 0.006890
51 6_PanaCleanerJune_sub_15_ICA_10 0.076056 0.923944 0.000000
67 6_PanaCleanerJune_sub_17_ICA_12 0.051060 0.947660 0.001280
79 6_PanaCleanerJune_sub_18_ICA_10 0.051184 0.948816 0.000000
87 6_PanaCleanerJune_sub_19_ICA_4 0.078162 0.917751 0.004087
103 6_PanaCleanerJune_sub_20_ICA_6 0.076400 0.921263 0.002337
117 6_PanaCleanerJune_sub_21_ICA_6 0.155002 0.791753 0.053245
133 6_PanaCleanerJune_sub_22_ICA_8 0.000000 0.998623 0.001377
146 6_PanaCleanerJune_sub_23_ICA_7 0.017549 0.973995 0.008457
166 6_PanaCleanerJune_sub_24_ICA_13 0.025215 0.974785 0.000000
173 6_PanaCleanerJune_sub_25_ICA_6 0.025656 0.960220 0.014124
191 6_PanaCleanerJune_sub_26_ICA_10 0.098872 0.895526 0.005602
203 6_PanaCleanerJune_sub_27_ICA_8 0.066493 0.932470 0.001037
df2 = df.loc[idx['p_2_logreg']]
print (df2.head(15))
id p_0_logreg p_1_logreg p_2_logreg
6 6_PanaCleanerJune_sub_12_ICA_7 0.000000 0.000351 0.999649
23 6_PanaCleanerJune_sub_13_ICA_10 0.000000 0.000351 0.999649
33 6_PanaCleanerJune_sub_14_ICA_6 0.080748 0.000352 0.918900
43 6_PanaCleanerJune_sub_15_ICA_2 0.017643 0.000360 0.981996
69 6_PanaCleanerJune_sub_17_ICA_14 0.882449 0.000290 0.117261
76 6_PanaCleanerJune_sub_18_ICA_7 0.010929 0.000360 0.988711
90 6_PanaCleanerJune_sub_19_ICA_7 0.010929 0.000351 0.988720
104 6_PanaCleanerJune_sub_20_ICA_7 0.006714 0.000360 0.992925
112 6_PanaCleanerJune_sub_21_ICA_1 0.869393 0.000339 0.130269
127 6_PanaCleanerJune_sub_22_ICA_2 0.000000 0.000351 0.999649
143 6_PanaCleanerJune_sub_23_ICA_4 0.017218 0.000360 0.982421
161 6_PanaCleanerJune_sub_24_ICA_8 0.369685 0.000712 0.629603
174 6_PanaCleanerJune_sub_25_ICA_7 0.307056 0.000496 0.692448
189 6_PanaCleanerJune_sub_26_ICA_8 0.850195 0.000368 0.149437
209 6_PanaCleanerJune_sub_27_ICA_14 0.000000 0.000351 0.999649
Detail:
print (df['id'].str.replace(r'_\d+$', '', regex=True).head(15))
0 6_PanaCleanerJune_sub_12_ICA
1 6_PanaCleanerJune_sub_12_ICA
2 6_PanaCleanerJune_sub_12_ICA
3 6_PanaCleanerJune_sub_12_ICA
4 6_PanaCleanerJune_sub_12_ICA
5 6_PanaCleanerJune_sub_12_ICA
6 6_PanaCleanerJune_sub_12_ICA
7 6_PanaCleanerJune_sub_12_ICA
8 6_PanaCleanerJune_sub_12_ICA
9 6_PanaCleanerJune_sub_12_ICA
10 6_PanaCleanerJune_sub_12_ICA
11 6_PanaCleanerJune_sub_12_ICA
12 6_PanaCleanerJune_sub_12_ICA
13 6_PanaCleanerJune_sub_12_ICA
14 6_PanaCleanerJune_sub_13_ICA
Name: id, dtype: object
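If only the rows for a single probability column are needed, the same grouper can be applied directly to that Series (a sketch, assuming the same id pattern and the p_1_logreg column shown above):
# group key: id without the trailing _<int4> suffix
key = df['id'].str.replace(r'_\d+$', '', regex=True)
max_1_row = df.loc[df.groupby(key)['p_1_logreg'].idxmax()]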
I have two pandas dataframes: matches with columns (match_id, team_id, date, ...) and teams_att with columns (id, team_id, date, overall_rating, ...).
I want to join the two dataframes on matches.team_id = teams_att.team_id, with teams_att.date closest to matches.date.
Example
matches
match_id team_id date
1 101 2012-05-17
2 101 2014-07-11
3 102 2010-05-21
4 102 2017-10-24
teams_att
id team_id date overall_rating
1 101 2010-02-22 67
2 101 2011-02-22 69
3 101 2012-02-20 73
4 101 2013-09-17 79
5 101 2014-09-10 74
6 101 2015-08-30 82
7 102 2015-03-21 42
8 102 2016-03-22 44
Desired results
match_id team_id matches.date teams_att.date overall_rating
1 101 2012-05-17 2012-02-20 73
2 101 2014-07-11 2014-09-10 74
3 102 2010-05-21 2015-03-21 42
4 102 2017-10-24 2016-03-22 44
You can use merge_asof with by and direction parameters:
pd.merge_asof(matches.sort_values('date'),
              teams_att.sort_values('date'),
              on='date', by='team_id',
              direction='nearest')
Output:
match_id team_id date id overall_rating
0 3 102 2010-05-21 7 42
1 1 101 2012-05-17 3 73
2 2 101 2014-07-11 5 74
3 4 102 2017-10-24 8 44
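Note that merge_asof requires both frames to be sorted on the on key and the key to be an orderable dtype; if the date columns were read in as strings, convert them first (a sketch):
matches['date'] = pd.to_datetime(matches['date'])
teams_att['date'] = pd.to_datetime(teams_att['date'])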
We are using merge_asof per group (please check Scott's answer, that is the right way of solving this type of problem :-) cheers)
#assuming df is matches and df1 is teams_att
g1=df1.groupby('team_id')
g=df.groupby('team_id')
l=[]
for x in [101,102]:
    l.append(pd.merge_asof(g.get_group(x),g1.get_group(x),on='date',direction='nearest'))
pd.concat(l)
Out[405]:
match_id team_id_x date id team_id_y overall_rating
0 1 101 2012-05-17 3 101 73
1 2 101 2014-07-11 5 101 74
0 3 102 2010-05-21 7 102 42
1 4 102 2017-10-24 8 102 44
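The hard-coded [101, 102] can be avoided by iterating over the team ids directly (a sketch, reusing g and g1 from above):
l = []
for team in df['team_id'].unique():
    l.append(pd.merge_asof(g.get_group(team), g1.get_group(team),
                           on='date', direction='nearest'))
pd.concat(l)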