When I transpose a dataframe, the headers become the index by default. I want them to be a regular column instead of the index. How do I achieve this?
import pandas as pd
dict = {'col-a': [97, 98, 99],
'col-b': [34, 35, 36],
'col-c': [24, 25, 26]}
df = pd.DataFrame(dict)
print(df.T)
0 1 2
col-a 97 98 99
col-b 34 35 36
col-c 24 25 26
Desired Output:
0 1 2 3
0 col-a 97 98 99
1 col-b 34 35 36
2 col-c 24 25 26
Try T with reset_index:
df=df.T.reset_index()
print(df)
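Or, in place (transpose first: df.T returns a new object, so calling reset_index(inplace=True) directly on it would leave df unchanged):
df=df.T
df.reset_index(inplace=True)
print(df)
Both Output: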
index 0 1 2
0 col-a 97 98 99
1 col-b 34 35 36
2 col-c 24 25 26
If you care about the column names, add this to the code:
df.columns=range(4)
Or:
it=iter(range(4))
df=df.rename(columns=lambda x: next(it))
Or, if you don't know the number of columns:
df.columns=range(len(df.columns))
Or:
it=iter(range(len(df.columns)))
df=df.rename(columns=lambda x: next(it))
All Output:
0 1 2 3
0 col-a 97 98 99
1 col-b 34 35 36
2 col-c 24 25 26
Use reset_index and then set default columns names:
import numpy as np
df1 = df.T.reset_index()
df1.columns = np.arange(len(df1.columns))
print (df1)
0 1 2 3
0 col-a 97 98 99
1 col-b 34 35 36
2 col-c 24 25 26
Another solution:
print (df.rename_axis(0, axis=1).rename(lambda x: x + 1).T.reset_index())
#alternative
#print (df.T.rename_axis(0).rename(columns = lambda x: x + 1).reset_index())
0 1 2 3
0 col-a 97 98 99
1 col-b 34 35 36
2 col-c 24 25 26
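The answers above all come down to the same three steps: transpose, reset the index, renumber the columns. A minimal self-contained sketch, using the data from the question (the name `out` is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'col-a': [97, 98, 99],
                   'col-b': [34, 35, 36],
                   'col-c': [24, 25, 26]})

# Transpose, move the old headers out of the index into a regular column,
# then renumber the columns 0..n-1.
out = df.T.reset_index()
out.columns = range(out.shape[1])
print(out)
```

Assigning to a new name keeps the original df intact, which avoids the inplace pitfall on a transposed copy.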
Related
I have a list (lst) of 2000 data frames and I want to combine all of them into one. Each dataframe has two columns and the first column of each dataframe is the same. For example:
#First dataframe
>>lst[0]
0 1
11 6363
21 737
34 0
43 0
#Second dataframe
>>lst[1]
0 1
11 33
21 0
34 937
43 0
#third dataframe
>>lst[2]
0 1
11 73
21 18
34 27
43 77
Final dataframe will look like:
0 1 2 3
11 6363 33 73
21 737 0 18
34 0 937 27
43 0 0 77
How can I achieve that? Insights will be appreciated.
First, set the first column as the index so that it is used for alignment instead of being concatenated as data:
lst = [df.set_index(0) for df in lst]
Then concatenate along the columns and call reset_index to turn column 0 back into a regular column instead of the index:
df_out = pd.concat(lst, axis=1).reset_index()
And we rename the columns:
df_out.columns = range(df_out.shape[1])
Result is:
>> df_out
0 1 2 3 ...
0 11 6363 33 73 ...
1 21 737 0 18 ...
2 34 0 937 27 ...
3 43 0 0 77 ...
You could try this:
lst = [df0, df1, df2, ...]
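# Merge dataframes; give each value column a unique name first, because
# repeated merges of identically named columns produce clashing _x/_y suffixes
df_all = lst[0]
for i, df in enumerate(lst[1:], start=2):
    df_all = df_all.merge(df.rename(columns={1: i}), how="outer", on=0)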
# Rename columns of final dataframe
df_all.columns = list(range(df_all.shape[1]))
print(df_all)
# Outputs
0 1 2 3 ...
0 11 6363 33 73 ...
1 21 737 0 18 ...
2 34 0 937 27 ...
3 43 0 0 77 ...
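For reference, the concat-based approach can be tried end to end on the three example frames from the question. This is only a sketch: the frames are rebuilt by hand here, and it assumes the key column holds the same values in every frame.

```python
import pandas as pd

# Three small frames shaped like the ones in the question:
# column 0 is the shared key, column 1 holds the values.
keys = [11, 21, 34, 43]
lst = [pd.DataFrame({0: keys, 1: vals})
       for vals in ([6363, 737, 0, 0], [33, 0, 937, 0], [73, 18, 27, 77])]

# Index every frame by the key, glue the value columns side by side,
# then turn the key back into a column and renumber the headers.
df_out = pd.concat([d.set_index(0) for d in lst], axis=1).reset_index()
df_out.columns = range(df_out.shape[1])
print(df_out)
```

A single concat call aligns all frames in one pass, which is likely to scale better to 2000 frames than merging them one at a time.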
I want to join two data frames on specific row indices according to a map (dictionary) I have created. What is an efficient way to do this?
Data:
df = pd.DataFrame({"a":[10, 34, 24, 40, 56, 44],
"b":[95, 63, 74, 85, 56, 43]})
print(df)
a b
0 10 95
1 34 63
2 24 74
3 40 85
4 56 56
5 44 43
df1 = pd.DataFrame({"c":[1, 2, 3, 4],
"d":[5, 6, 7, 8]})
print(df1)
c d
0 1 5
1 2 6
2 3 7
3 4 8
d = {
(1,0):0.67,
(1,2):0.9,
(2,1):0.2,
(2,3):0.34,
(4,0):0.7,
(4,2):0.5
}
Desired Output:
a b c d ratio
0 34 63 1 5 0.67
1 34 63 3 7 0.9
...
5 56 56 3 7 0.5
I'm able to achieve this but it takes a lot of time since my original data frames' map has about 4.7M rows to map. I'd love to know if there is a way to MERGE, JOIN or CONCAT these data frames on different indices.
My Approach:
matched_rows = []
for key in d.keys():
    s = df.iloc[key[0]].tolist() + df1.iloc[key[1]].tolist() + [d[key]]
    matched_rows.append(s)
df_matched = pd.DataFrame(matched_rows, columns = df.columns.tolist() + df1.columns.tolist() + ['ratio'])
I would highly appreciate your help. Thanks a lot in advance.
Create a Series from the dictionary and turn it into a DataFrame, DataFrame.join both frames to it, and finally remove the first 2 columns by position:
df = (pd.Series(d).reset_index(name='ratio')
.join(df, on='level_0')
.join(df1, on='level_1')
.iloc[:, 2:])
print (df)
ratio a b c d
0 0.67 34 63 1 5
1 0.90 34 63 3 7
2 0.20 24 74 2 6
3 0.34 24 74 4 8
4 0.70 56 56 1 5
5 0.50 56 56 3 7
And then if necessary reorder columns:
df = df[df.columns[1:].tolist() + df.columns[:1].tolist()]
print (df)
a b c d ratio
0 34 63 1 5 0.67
1 34 63 3 7 0.90
2 24 74 2 6 0.20
3 24 74 4 8 0.34
4 56 56 1 5 0.70
5 56 56 3 7 0.50
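The same idea can be written with the helper columns dropped by name and the final column order stated explicitly. This is a sketch on the data from the question; `pairs` and `out` are illustrative names, and it assumes the dictionary keys are row labels of df and df1.

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 34, 24, 40, 56, 44],
                   "b": [95, 63, 74, 85, 56, 43]})
df1 = pd.DataFrame({"c": [1, 2, 3, 4],
                    "d": [5, 6, 7, 8]})
d = {(1, 0): 0.67, (1, 2): 0.9, (2, 1): 0.2,
     (2, 3): 0.34, (4, 0): 0.7, (4, 2): 0.5}

# A Series built from tuple keys gets a two-level MultiIndex;
# reset_index exposes the levels as columns level_0 and level_1.
pairs = pd.Series(d).reset_index(name="ratio")

# Look up each side by row label, drop the helper columns, put "ratio" last.
out = (pairs.join(df, on="level_0")
            .join(df1, on="level_1")
            .drop(columns=["level_0", "level_1"]))
out = out[["a", "b", "c", "d", "ratio"]]
print(out)
```

Both joins are vectorized index lookups, so no Python-level loop over the map is needed.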
I have the following df,
code y_m count
101 2017-11 86
101 2017-12 32
102 2017-11 11
102 2017-12 34
102 2018-01 46
103 2017-11 56
103 2017-12 89
Now I want to convert this df into a matrix that turns the y_m values into columns and uses count as the cell values, like:
0 1 2 3 4
0 -1 0 2017-11 2017-12 2018-01
1 0 354 153 155 46
2 101 118 86 32 -1
3 102 91 11 34 46
4 103 145 56 89 -1
Specifically, -1 is a dummy value that either marks a code with no value for a given y_m or pads the matrix to keep its shape; 0 stands for 'all' and labels the aggregates over code, over y_m, or over both. For example, cell (1, 1) sums count over all y_m and code values, and cell (1, 2) sums count for 2017-11.
You can first use pivot_table:
df1 = (df.pivot_table(index='code',
columns='y_m',
values='count',
margins=True,
aggfunc='sum',
fill_value=-1,
margins_name='0'))
print (df1)
y_m 2017-11 2017-12 2018-01 0
code
101 86 32 -1 118
102 11 34 46 91
103 56 89 -1 145
0 153 155 46 354
Then build the final format; note that the result mixes numeric and string values:
#change order of index and columns values for reindex
idx = df1.index[-1:].tolist() + df1.index[:-1].tolist()
cols = df1.columns[-1:].tolist() + df1.columns[:-1].tolist()
df2 = (df1.reindex(index=idx, columns=cols)
.reset_index()
.rename(columns={'code':-1})
.rename_axis(None, axis=1))
#add columns to first row
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
df3 = pd.concat([df2.columns.to_frame().T, df2]).reset_index(drop=True)
#reset columns names to range
df3.columns = range(len(df3.columns))
print (df3)
0 1 2 3 4
0 -1 0 2017-11 2017-12 2018-01
1 0 354 153 155 46
2 101 118 86 32 -1
3 102 91 11 34 46
4 103 145 56 89 -1
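To see how margins and fill_value interact in the pivot_table step, here is a small sketch on the data from the question (the name `totals` is only illustrative). It checks that the '0' totals are computed from the original counts, so the -1 placeholders do not leak into the sums.

```python
import pandas as pd

df = pd.DataFrame({"code": [101, 101, 102, 102, 102, 103, 103],
                   "y_m": ["2017-11", "2017-12", "2017-11", "2017-12",
                           "2018-01", "2017-11", "2017-12"],
                   "count": [86, 32, 11, 34, 46, 56, 89]})

# margins=True appends a totals row and column labelled by margins_name;
# fill_value replaces the missing (code, y_m) combinations.
totals = df.pivot_table(index="code", columns="y_m", values="count",
                        aggfunc="sum", margins=True, margins_name="0",
                        fill_value=-1)
print(totals)

# Missing cell is the placeholder, but the row total ignores it: 86 + 32.
print(totals.loc[101, "2018-01"], totals.loc[101, "0"])
```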
I have an extension to this question. I have lists of lists in my columns and I need to expand the rows one step further. If I just repeat the steps it splits my strings into letters. Could you suggest a smart way around? Thanks!
d1 = pd.DataFrame({'column1': [['ana','bob',[1,2,3]],['dona','elf',[4,5,6]],['gear','hope',[7,8,9]]],
'column2':[10,20,30],
'column3':[44,55,66]})
d2 = pd.DataFrame.from_records(d1.column1.tolist()).stack().reset_index(level=1, drop=True).rename('column1')
d1_d2 = d1.drop('column1', axis=1).join(d2).reset_index(drop=True)[['column1','column2', 'column3']]
d1_d2
It seems you need to flatten the nested lists:
from collections.abc import Iterable
def flatten(coll):
    for i in coll:
        # recurse into nested iterables, but treat strings as atoms
        if isinstance(i, Iterable) and not isinstance(i, str):
            for subc in flatten(i):
                yield subc
        else:
            yield i
d1['column1'] = d1['column1'].apply(lambda x: list(flatten(x)))
print (d1)
column1 column2 column3
0 [ana, bob, 1, 2, 3] 10 44
1 [dona, elf, 4, 5, 6] 20 55
2 [gear, hope, 7, 8, 9] 30 66
And then use your solution:
d2 = (pd.DataFrame(d1.column1.tolist())
.stack()
.reset_index(level=1, drop=True)
.rename('column1'))
d1_d2 = (d1.drop('column1', axis=1)
.join(d2)
.reset_index(drop=True)[['column1','column2', 'column3']])
print (d1_d2)
column1 column2 column3
0 ana 10 44
1 bob 10 44
2 1 10 44
3 2 10 44
4 3 10 44
5 dona 20 55
6 elf 20 55
7 4 20 55
8 5 20 55
9 6 20 55
10 gear 30 66
11 hope 30 66
12 7 30 66
13 8 30 66
14 9 30 66
Assuming the expected result is the same as jezrael's.
pandas >= 0.25.0
d1 = d1.explode('column1').explode('column1').reset_index(drop=True)
d1:
column1 column2 column3
0 ana 10 44
1 bob 10 44
2 1 10 44
3 2 10 44
4 3 10 44
5 dona 20 55
6 elf 20 55
7 4 20 55
8 5 20 55
9 6 20 55
10 gear 30 66
11 hope 30 66
12 7 30 66
13 8 30 66
14 9 30 66
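A short sketch of why two explode calls are enough here: explode only unpacks list-like cells and leaves strings whole, so nothing is split into letters. The data is a two-row cut of the example above, and `once` and `twice` are illustrative names. The index is reset between the calls as a precaution, so the sketch does not depend on how a given pandas version treats duplicate index labels.

```python
import pandas as pd

d1 = pd.DataFrame({'column1': [['ana', 'bob', [1, 2, 3]],
                               ['dona', 'elf', [4, 5, 6]]],
                   'column2': [10, 20],
                   'column3': [44, 55]})

# First pass: one row per top-level element; the inner lists are still lists.
once = d1.explode('column1').reset_index(drop=True)

# Second pass: the inner lists are unpacked, strings and numbers stay as they are.
twice = once.explode('column1').reset_index(drop=True)
print(twice)
```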
I have two pandas dataframes: matches, with columns (match_id, team_id, date, ...), and teams_att, with columns (id, team_id, date, overall_rating, ...).
I want to join the two dataframes on matches.team_id = teams_att.team_id and teams_att.date closest to matches.date
Example
matches
match_id team_id date
1 101 2012-05-17
2 101 2014-07-11
3 102 2010-05-21
4 102 2017-10-24
teams_att
id team_id date overall_rating
1 101 2010-02-22 67
2 101 2011-02-22 69
3 101 2012-02-20 73
4 101 2013-09-17 79
5 101 2014-09-10 74
6 101 2015-08-30 82
7 102 2015-03-21 42
8 102 2016-03-22 44
Desired results
match_id team_id matches.date teams_att.date overall_rating
1 101 2012-05-17 2012-02-20 73
2 101 2014-07-11 2014-09-10 74
3 102 2010-05-21 2015-03-21 42
4 102 2017-10-24 2016-03-22 44
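You can use merge_asof with the by and direction parameters (both frames must be sorted by the on key, and date must be a datetime column rather than strings):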
pd.merge_asof(matches.sort_values('date'),
teams_att.sort_values('date'),
on='date', by='team_id',
direction='nearest')
Output:
match_id team_id date id overall_rating
0 3 102 2010-05-21 7 42
1 1 101 2012-05-17 3 73
2 2 101 2014-07-11 5 74
3 4 102 2017-10-24 8 44
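The output above keeps only the date from matches, while the desired result also shows the matched teams_att date. One way to keep both, sketched here on the example data, is to copy the right-hand date into an extra column before merging; the column name `att_date` is made up for this illustration.

```python
import pandas as pd

matches = pd.DataFrame({
    "match_id": [1, 2, 3, 4],
    "team_id": [101, 101, 102, 102],
    "date": pd.to_datetime(["2012-05-17", "2014-07-11",
                            "2010-05-21", "2017-10-24"]),
})
teams_att = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8],
    "team_id": [101, 101, 101, 101, 101, 101, 102, 102],
    "date": pd.to_datetime(["2010-02-22", "2011-02-22", "2012-02-20",
                            "2013-09-17", "2014-09-10", "2015-08-30",
                            "2015-03-21", "2016-03-22"]),
    "overall_rating": [67, 69, 73, 79, 74, 82, 42, 44],
})

# merge_asof keeps a single "date" column (the left one), so copy the
# right-hand date into a hypothetical "att_date" column to preserve it.
right = teams_att.assign(att_date=teams_att["date"]).sort_values("date")
merged = pd.merge_asof(matches.sort_values("date"), right,
                       on="date", by="team_id", direction="nearest")

# Restore the original match order.
merged = merged.sort_values("match_id").reset_index(drop=True)
print(merged[["match_id", "team_id", "date", "att_date", "overall_rating"]])
```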
Using merge_asof group by group, where df is matches and df1 is teams_att (please check Scott's answer, that is the right way to solve this type of problem :-) cheers):
g1=df1.groupby('team_id')
g=df.groupby('team_id')
l=[]
for x in [101,102]:
    l.append(pd.merge_asof(g.get_group(x),g1.get_group(x),on='date',direction ='nearest'))
pd.concat(l)
Out[405]:
match_id team_id_x date id team_id_y overall_rating
0 1 101 2012-05-17 3 101 73
1 2 101 2014-07-11 5 101 74
0 3 102 2010-05-21 7 102 42
1 4 102 2017-10-24 8 102 44