Cannot turn off "sort" function in pandas.concat - python-3.x

import pandas as pd
import numpy as np
df1 = pd.DataFrame({'col3':np.random.randint(1,10,5),'col1':np.random.randint(30,80,5)})
df2 = pd.DataFrame({'col4':np.random.randint(30,80,5),'col5':np.random.randint(100,130,5)})
df3 = pd.DataFrame({'col9':np.random.randint(1,10,5),'col8':np.random.randint(30,80,5)})
x1 = pd.concat([df1,df2,df3],axis=1,sort=False)
x1.columns = pd.MultiIndex.from_product([['I2'],x1.columns])
x2 = pd.concat([df1,df2,df3],axis=1,sort=False)
x2.columns = pd.MultiIndex.from_product([['I3'],x2.columns])
x3 = pd.concat([df1,df2,df3],axis=1,sort=False)
x3.columns = pd.MultiIndex.from_product([['I1'],x3.columns])
pd.concat([x1,x2,x3],axis=0,sort=False)
I was trying to get an aggregated dataframe with exactly the same column order as those of x1, x2 and x3 (which are already the same) as figure 1 shows below:
Figure 1: I was trying to get this
But actually the above codes created a dataframe presented in figure 2 below:
Figure 2: The code actually created this
I am wondering why the sort=False parameter in pandas.concat() did not prevent the reordering, in either the first or the second level of the columns?
Is there any other way that I can get the dataframe that I want?
Many thanks for your time!

You could use join instead of concat:
x1.join(x2,how='outer').join(x3,how='outer')
Result:
I2 I3 I1
col3 col1 col4 col5 col9 col8 col3 col1 col4 col5 col9 col8 col3 col1 col4 col5 col9 col8
0 7 54 42 128 8 79 7 54 42 128 8 79 7 54 42 128 8 79
1 1 56 56 102 1 77 1 56 56 102 1 77 1 56 56 102 1 77
2 9 34 52 108 4 68 9 34 52 108 4 68 9 34 52 108 4 68
3 3 42 51 108 8 75 3 42 51 108 8 75 3 42 51 108 8 75
4 3 34 70 100 5 78 3 34 70 100 5 78 3 34 70 100 5 78
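If you want to stay with pd.concat along axis=0 (on pandas versions where sort=False is not honored for MultiIndex columns), one workaround, sketched here rather than taken from the thread, is to reindex the result back to the columns' order of appearance:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'col3': np.random.randint(1, 10, 5),
                    'col1': np.random.randint(30, 80, 5)})
df2 = pd.DataFrame({'col4': np.random.randint(30, 80, 5),
                    'col5': np.random.randint(100, 130, 5)})
df3 = pd.DataFrame({'col9': np.random.randint(1, 10, 5),
                    'col8': np.random.randint(30, 80, 5)})

x1 = pd.concat([df1, df2, df3], axis=1, sort=False)
x1.columns = pd.MultiIndex.from_product([['I2'], x1.columns])
x2 = pd.concat([df1, df2, df3], axis=1, sort=False)
x2.columns = pd.MultiIndex.from_product([['I3'], x2.columns])
x3 = pd.concat([df1, df2, df3], axis=1, sort=False)
x3.columns = pd.MultiIndex.from_product([['I1'], x3.columns])

# Row-wise concat, then force the columns back into order of appearance.
out = pd.concat([x1, x2, x3], axis=0, sort=False)
ordered = x1.columns.append(x2.columns).append(x3.columns)
out = out.reindex(columns=ordered)
print(out.columns.get_level_values(0).unique().tolist())
```

Since the three frames have disjoint first-level labels, the appended column index has no duplicates and the reindex is unambiguous.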

Related

Joining multiple Pandas data frames into one

I have a list (lst) of 2000 dataframes and I want to combine them all into one. Each dataframe has two columns, and the first column is the same in every dataframe. For example:
#First dataframe
>>lst[0]
0 1
11 6363
21 737
34 0
43 0
#Second dataframe
>>lst[1]
0 1
11 33
21 0
34 937
43 0
#third dataframe
>>lst[2]
0 1
11 73
21 18
34 27
43 77
Final dataframe will look like:
0 1 2 3
11 6363 33 73
21 737 0 18
34 0 937 27
43 0 0 77
How can I achieve that? Insights will be appreciated.
First we can set the first column as an index to ignore it in concatenation
lst = [df.set_index(0) for df in lst]
Then we concatenate the columns and reset the index so that column 0 becomes a regular column again
df_out = pd.concat(lst, axis=1).reset_index()
And we rename the columns:
df_out.columns = range(df_out.shape[1])
Result is:
>> df_out
0 1 2 3 ...
0 11 6363 33 73 ...
1 21 737 0 18 ...
2 34 0 937 27 ...
3 43 0 0 77 ...
You could try this:
lst = [df0, df1, df2, ...]
# Merge dataframes
df_all = lst[0]
for df in lst[1:]:
    df_all = df_all.merge(df, how="outer", on=0)
# Rename columns of final dataframe
df_all.columns = list(range(df_all.shape[1]))
print(df_all)
# Outputs
0 1 2 3 ...
0 11 6363 33 73 ...
1 21 737 0 18 ...
2 34 0 937 27 ...
3 43 0 0 77 ...
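The merge loop above can also be written with functools.reduce. A minimal sketch, using a hypothetical three-frame stand-in for the 2000-frame list:

```python
from functools import reduce
import pandas as pd

# Small stand-in for the real 2000-frame list.
lst = [
    pd.DataFrame({0: [11, 21, 34, 43], 1: [6363, 737, 0, 0]}),
    pd.DataFrame({0: [11, 21, 34, 43], 1: [33, 0, 937, 0]}),
    pd.DataFrame({0: [11, 21, 34, 43], 1: [73, 18, 27, 77]}),
]

# Fold the list into one frame with successive outer merges on column 0.
df_all = reduce(lambda left, right: left.merge(right, how="outer", on=0), lst)
df_all.columns = range(df_all.shape[1])
print(df_all)
```

Note that for 2000 frames the set_index + concat approach in the first answer is likely faster, since it concatenates once rather than merging pairwise.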

How to write from string to pd dataframe when columns repeat?

I read a PDF file with PDFMiner and get a string with the following structure:
text
text
text
col1
1
2
3
4
5
col2
(1)
(2)
(3)
(7)
(4)
col3
name1
name2
name3
name4
name5
col4
name
5
45
7
87
8
col5
FAE
EFD
SDE
FEF
RGE
col6
name
45
7
54
4
130
# col7
16
18
22
17
25
col8
col9
55
30
60
1
185
col10
name
1
7
1
8
text1
text1
text1
col1
6
7
8
9
10
col2
(1)
(2)
(3)
(7)
(4)
col3
name6
name7
name8
name9
name10
col4
name
54
4
78
8
86
col5
SDE
FFF
EEF
GFE
JHG
col6
name
6
65
65
45
78
# col7
16
18
22
17
25
col8
col9
55
30
60
1
185
col10
name
1
4
1
54
I have 10 columns named: col1, col2, col3, col4 name, col5, col6 name, # col7, col8, col9, col10 name.
But as I have those 10 columns on each page, the structure is repeated. The names will always be the same on each page, and I am not sure how to pull it all into one dataframe.
For example for col1 I would have in the dataframe:
1
2
3
4
5
6
7
8
9
10
I also have some empty columns (col8 in my example) and I am not sure how to deal with it.
Any idea? thanks!
You can use regex to parse the document (regex101), for example (txt is your string from the question):
import re
d = {}
for col_name, cols in re.findall(r'\n^((?:#\s)?col\d+(?:\n\s*name\n+)?)(.*?)(?=\n\n|^(?:#\s)?col\d+|\Z)', txt, flags=re.M|re.S):
    d.setdefault(col_name.strip(), []).extend(cols.strip().split('\n'))
df = pd.DataFrame.from_dict(d, orient='index').T
print(df)
Prints:
col1 col2 col3 col4\n name col5 col6\n name # col7 col8 col9 col10\nname
0 1 (1) name1 5 FAE 45 16 55 1
1 2 (2) name2 45 EFD 7 18 30 7
2 3 (3) name3 7 SDE 54 22 None 60 1
3 4 (7) name4 87 FEF 4 17 None 1 8
4 5 (4) name5 8 RGE 130 25 None 185 1
5 6 (1) name6 54 SDE 6 16 None 55 4
6 7 (2) name7 4 FFF 65 18 None 30 1
7 8 (3) name8 78 EEF 65 22 None 60 54
8 9 (7) name9 8 GFE 45 17 None 1 None
9 10 (4) name10 86 JHG 78 25 None 185 None
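The full pattern is hard to reproduce without the original PDF text, but the idea can be demonstrated on a simplified fragment. The regex below is a reduced, hypothetical version of the one above (plain colN headers only, no name sub-lines):

```python
import re
import pandas as pd

# Hypothetical sample mimicking two "pages" of repeated column blocks.
txt = "col1\n1\n2\ncol2\n(1)\n(2)\ncol1\n3\n4\ncol2\n(3)\n(7)"

d = {}
# Each match: a header line, then everything up to the next header (or end).
for col_name, cols in re.findall(r'^(col\d+)\n(.*?)(?=^col\d+|\Z)',
                                 txt, flags=re.M | re.S):
    # Repeated headers extend the same list, stitching pages together.
    d.setdefault(col_name, []).extend(cols.strip().split('\n'))

df = pd.DataFrame.from_dict(d, orient='index').T
print(df)
```

Because setdefault + extend appends to an existing key, the second occurrence of col1 lands under the first one, which is exactly the "stack the pages" behaviour the question asks for.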

Pandas JOIN/MERGE/CONCAT Data Frame On Specific Indices

I want to join two data frames on specific indices, as per the map (dictionary) I have created. What is an efficient way to do this?
Data:
df = pd.DataFrame({"a":[10, 34, 24, 40, 56, 44],
"b":[95, 63, 74, 85, 56, 43]})
print(df)
a b
0 10 95
1 34 63
2 24 74
3 40 85
4 56 56
5 44 43
df1 = pd.DataFrame({"c":[1, 2, 3, 4],
"d":[5, 6, 7, 8]})
print(df1)
c d
0 1 5
1 2 6
2 3 7
3 4 8
d = {
    (1, 0): 0.67,
    (1, 2): 0.9,
    (2, 1): 0.2,
    (2, 3): 0.34,
    (4, 0): 0.7,
    (4, 2): 0.5
}
Desired Output:
a b c d ratio
0 34 63 1 5 0.67
1 34 63 3 7 0.9
...
5 56 56 3 7 0.5
I'm able to achieve this, but it takes a lot of time since the map for my original data frames has about 4.7M entries. I'd love to know if there is a way to MERGE, JOIN or CONCAT these data frames on different indices.
My Approach:
matched_rows = []
for key in d.keys():
    s = df.iloc[key[0]].tolist() + df1.iloc[key[1]].tolist() + [d[key]]
    matched_rows.append(s)
df_matched = pd.DataFrame(matched_rows, columns=df.columns.tolist() + df1.columns.tolist() + ['ratio'])
I would highly appreciate your help. Thanks a lot in advance.
Create a Series from the dictionary, reset the index to get a DataFrame, DataFrame.join both frames on the index levels, and finally remove the first two columns by position:
df = (pd.Series(d).reset_index(name='ratio')
        .join(df, on='level_0')
        .join(df1, on='level_1')
        .iloc[:, 2:])
print (df)
ratio a b c d
0 0.67 34 63 1 5
1 0.90 34 63 3 7
2 0.20 24 74 2 6
3 0.34 24 74 4 8
4 0.70 56 56 1 5
5 0.50 56 56 3 7
And then if necessary reorder columns:
df = df[df.columns[1:].tolist() + df.columns[:1].tolist()]
print (df)
a b c d ratio
0 34 63 1 5 0.67
1 34 63 3 7 0.90
2 24 74 2 6 0.20
3 24 74 4 8 0.34
4 56 56 1 5 0.70
5 56 56 3 7 0.50
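For 4.7M pairs, a purely positional take may also be worth trying. A sketch (not from the original answers): split the dict keys into two index lists and pull all rows from each frame in one vectorized iloc call.

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 34, 24, 40, 56, 44],
                   "b": [95, 63, 74, 85, 56, 43]})
df1 = pd.DataFrame({"c": [1, 2, 3, 4],
                    "d": [5, 6, 7, 8]})
d = {(1, 0): 0.67, (1, 2): 0.9, (2, 1): 0.2,
     (2, 3): 0.34, (4, 0): 0.7, (4, 2): 0.5}

# Unzip the dict keys into two positional index lists, then take rows
# in bulk, avoiding a Python-level loop over millions of entries.
left_idx, right_idx = map(list, zip(*d.keys()))
out = pd.concat([df.iloc[left_idx].reset_index(drop=True),
                 df1.iloc[right_idx].reset_index(drop=True)], axis=1)
out['ratio'] = list(d.values())
print(out)
```

This relies on dict insertion order (Python 3.7+) so that the rows and the ratio values stay aligned.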

pandas how to convert a dataframe to a matrix using transpose

I have the following df,
code y_m count
101 2017-11 86
101 2017-12 32
102 2017-11 11
102 2017-12 34
102 2018-01 46
103 2017-11 56
103 2017-12 89
Now I want to convert this df into a matrix that pivots column y_m into a row and uses count as the cell values, like:
0 1 2 3 4
0 -1 0 2017-11 2017-12 2018-01
1 0 354 153 155 46
2 101 118 86 32 -1
3 102 91 11 34 46
4 103 145 56 89 -1
Specifically, -1 is a dummy value indicating either that no value exists for a given (code, y_m) pair or padding to maintain the matrix shape; 0 represents the 'all' aggregate over code, y_m, or both: e.g. cell (1, 1) sums count over all y_m and code, and (1, 2) sums count for 2017-11.
You can first use pivot_table:
df1 = df.pivot_table(index='code',
                     columns='y_m',
                     values='count',
                     margins=True,
                     aggfunc='sum',
                     fill_value=-1,
                     margins_name='0')
print (df1)
y_m 2017-11 2017-12 2018-01 0
code
101 86 32 -1 118
102 11 34 46 91
103 56 89 -1 145
0 153 155 46 354
And then for the final format (which mixes numeric values and strings):
#change order of index and columns values for reindex
idx = df1.index[-1:].tolist() + df1.index[:-1].tolist()
cols = df1.columns[-1:].tolist() + df1.columns[:-1].tolist()
df2 = (df1.reindex(index=idx, columns=cols)
          .reset_index()
          .rename(columns={'code': -1})
          .rename_axis(None, axis=1))
#prepend the column names as the first data row
#(DataFrame.append was removed in pandas 2, so use pd.concat)
df3 = pd.concat([df2.columns.to_frame().T, df2], ignore_index=True)
#reset column names to a plain range
df3.columns = range(len(df3.columns))
print (df3)
0 1 2 3 4
0 -1 0 2017-11 2017-12 2018-01
1 0 354 153 155 46
2 101 118 86 32 -1
3 102 91 11 34 46
4 103 145 56 89 -1
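As a side note (not from the original answer), pd.crosstab builds the same margins table in one call; crosstab has no fill_value parameter, so missing cells are filled afterwards:

```python
import pandas as pd

df = pd.DataFrame({'code': [101, 101, 102, 102, 102, 103, 103],
                   'y_m': ['2017-11', '2017-12', '2017-11', '2017-12',
                           '2018-01', '2017-11', '2017-12'],
                   'count': [86, 32, 11, 34, 46, 56, 89]})

# margins_name='0' labels the totals row/column, matching the answer above.
tab = (pd.crosstab(df['code'], df['y_m'], values=df['count'],
                   aggfunc='sum', margins=True, margins_name='0')
         .fillna(-1).astype(int))
print(tab)
```

This produces the intermediate df1 table; the row/column reordering steps above still apply if you need the final all-in-one matrix.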

Conditional date join in python Pandas

I have two pandas dataframes matches with columns (match_id, team_id,date, ...) and teams_att with columns (id, team_id, date, overall_rating, ...).
I want to join the two dataframes on matches.team_id = teams_att.team_id, taking the teams_att.date closest to matches.date.
Example
matches
match_id team_id date
1 101 2012-05-17
2 101 2014-07-11
3 102 2010-05-21
4 102 2017-10-24
teams_att
id team_id date overall_rating
1 101 2010-02-22 67
2 101 2011-02-22 69
3 101 2012-02-20 73
4 101 2013-09-17 79
5 101 2014-09-10 74
6 101 2015-08-30 82
7 102 2015-03-21 42
8 102 2016-03-22 44
Desired results
match_id team_id matches.date teams_att.date overall_rating
1 101 2012-05-17 2012-02-20 73
2 101 2014-07-11 2014-09-10 74
3 102 2010-05-21 2015-03-21 42
4 102 2017-10-24 2016-03-22 44
You can use merge_asof with by and direction parameters:
pd.merge_asof(matches.sort_values('date'),
              teams_att.sort_values('date'),
              on='date', by='team_id',
              direction='nearest')
Output:
match_id team_id date id overall_rating
0 3 102 2010-05-21 7 42
1 1 101 2012-05-17 3 73
2 2 101 2014-07-11 5 74
3 4 102 2017-10-24 8 44
We are using merge_asof per group (please check Scott's answer, that is the right way to solve this type of problem :-) cheers):
g1 = df1.groupby('team_id')
g = df.groupby('team_id')
l = []
for x in [101, 102]:
    l.append(pd.merge_asof(g.get_group(x), g1.get_group(x),
                           on='date', direction='nearest'))
pd.concat(l)
Out[405]:
match_id team_id_x date id team_id_y overall_rating
0 1 101 2012-05-17 3 101 73
1 2 101 2014-07-11 5 101 74
0 3 102 2010-05-21 7 102 42
1 4 102 2017-10-24 8 102 44
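If you also need the teams_att date in the result (as in the desired output), you can carry a copy of it through the merge. A sketch along the lines of the accepted answer, with a hypothetical att_date column name:

```python
import pandas as pd

matches = pd.DataFrame({'match_id': [1, 2, 3, 4],
                        'team_id': [101, 101, 102, 102],
                        'date': pd.to_datetime(['2012-05-17', '2014-07-11',
                                                '2010-05-21', '2017-10-24'])})
teams_att = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8],
                          'team_id': [101]*6 + [102]*2,
                          'date': pd.to_datetime(['2010-02-22', '2011-02-22',
                                                  '2012-02-20', '2013-09-17',
                                                  '2014-09-10', '2015-08-30',
                                                  '2015-03-21', '2016-03-22']),
                          'overall_rating': [67, 69, 73, 79, 74, 82, 42, 44]})

# Duplicate the ratings' date so it survives the asof merge,
# where 'date' itself becomes the (shared) merge key.
right = teams_att.assign(att_date=teams_att['date'])
out = pd.merge_asof(matches.sort_values('date'), right.sort_values('date'),
                    on='date', by='team_id', direction='nearest')
print(out[['match_id', 'team_id', 'date', 'att_date', 'overall_rating']])
```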
