Joining multiple Pandas data frames into one

I have a list (lst) of 2000 data frames, and I want to combine all of them into one. Each dataframe has two columns, and the first column is the same in every dataframe. For example:
#First dataframe
>>lst[0]
0 1
11 6363
21 737
34 0
43 0
#Second dataframe
>>lst[1]
0 1
11 33
21 0
34 937
43 0
#third dataframe
>>lst[2]
0 1
11 73
21 18
34 27
43 77
Final dataframe will look like:
0 1 2 3
11 6363 33 73
21 737 0 18
34 0 937 27
43 0 0 77
How can I achieve that? Insights will be appreciated.

First, set the first column as the index so that it is used to align rows instead of being concatenated as data:
lst = [df.set_index(0) for df in lst]
Then concatenate along the columns and turn the index back into a regular column:
df_out = pd.concat(lst, axis=1).reset_index()
And we rename the columns:
df_out.columns = range(df_out.shape[1])
Result is:
>> df_out
0 1 2 3 ...
0 11 6363 33 73 ...
1 21 737 0 18 ...
2 34 0 937 27 ...
3 43 0 0 77 ...
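For reference, the three steps above can be put together into one runnable sketch. The sample frames below are reconstructed from the question's example; the name `indexed` is only illustrative.

```python
import pandas as pd

# Three small frames mirroring the question's sample data.
keys = [11, 21, 34, 43]
lst = [
    pd.DataFrame({0: keys, 1: [6363, 737, 0, 0]}),
    pd.DataFrame({0: keys, 1: [33, 0, 937, 0]}),
    pd.DataFrame({0: keys, 1: [73, 18, 27, 77]}),
]

# Use the shared key column as the index so concat aligns rows on it.
indexed = [df.set_index(0) for df in lst]

# Place the value columns side by side, then restore the key as a column.
df_out = pd.concat(indexed, axis=1).reset_index()

# Relabel the columns 0..n.
df_out.columns = range(df_out.shape[1])
print(df_out)
```

Because the rows are aligned on the index, this should also cope with frames whose keys appear in a different order, as long as the keys are unique within each frame.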

You could try this:
lst = [df0, df1, df2, ...]
# Merge dataframes
df_all = lst[0]
for df in lst[1:]:
    df_all = df_all.merge(df, how="outer", on=0)
# Rename columns of final dataframe
df_all.columns = list(range(df_all.shape[1]))
print(df_all)
# Outputs
0 1 2 3 ...
0 11 6363 33 73 ...
1 21 737 0 18 ...
2 34 0 937 27 ...
3 43 0 0 77 ...
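One caveat with repeated merging: every frame contributes a value column labeled 1, so merge has to disambiguate the names with suffixes, and over many merges those suffixed names can repeat, which newer pandas versions may reject. A possible workaround, sketched here under the assumption that the value column is always labeled 1, is to give each value column a unique label before merging. The name `renamed` is illustrative.

```python
from functools import reduce

import pandas as pd

# Sample frames reconstructed from the question.
keys = [11, 21, 34, 43]
lst = [
    pd.DataFrame({0: keys, 1: [6363, 737, 0, 0]}),
    pd.DataFrame({0: keys, 1: [33, 0, 937, 0]}),
    pd.DataFrame({0: keys, 1: [73, 18, 27, 77]}),
]

# Give each frame's value column a unique label (1, 2, 3, ...) so that
# merge never needs to add suffixes.
renamed = [df.rename(columns={1: i}) for i, df in enumerate(lst, start=1)]

# Fold the list into a single frame with successive outer merges on the key.
df_all = reduce(
    lambda left, right: left.merge(right, how="outer", on=0),
    renamed,
)
print(df_all)
```

For a list of 2000 frames, a single concat call is likely to be faster than 1999 successive merges, so this variant is mainly useful when merge semantics are specifically needed.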

Related

How to append the multiple columns of a data-frame to a new empty data-frame

I have a dataset with multiple columns, and I also have a list of its column names:
columns_list = ['A1','A2','B1','B2']
df
A1 A2 B1 B2
0 1 11 21 31
1 2 12 22 32
2 3 13 23 33
3 4 14 24 34
Based on the columns list, how do I transform the DataFrame df into new_df, as below:
new_df
0 1
0 1 11
1 2 12
2 3 13
3 4 14
4 21 31
5 22 32
6 23 33
7 24 34
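I tried to append the columns but I'm getting an error. How do I create the new DataFrame? Thank you.
Convert each pair of columns to a NumPy array so that the column labels are dropped, then concatenate the pieces:
df1 = pd.DataFrame(df[columns_list[0:2]].to_numpy())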
df2 = pd.DataFrame(df[columns_list[2:]].to_numpy())
new_df = pd.concat([df1, df2]).reset_index(drop=True)
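The same idea can be generalized to any even number of columns. The following is a minimal sketch, assuming the columns in `columns_list` are meant to be consumed in consecutive pairs; the names `pairs` and `blocks` are illustrative.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "A1": [1, 2, 3, 4],
    "A2": [11, 12, 13, 14],
    "B1": [21, 22, 23, 24],
    "B2": [31, 32, 33, 34],
})
columns_list = ["A1", "A2", "B1", "B2"]

# Split the column list into consecutive pairs: [A1, A2], [B1, B2].
pairs = [columns_list[i:i + 2] for i in range(0, len(columns_list), 2)]

# Converting to a NumPy array drops the labels, so every block ends up
# with the default column names 0 and 1 and the blocks stack cleanly.
blocks = [pd.DataFrame(df[pair].to_numpy()) for pair in pairs]

# Stack the blocks vertically and renumber the rows.
new_df = pd.concat(blocks, ignore_index=True)
print(new_df)
```

Dropping the labels is the important step: concatenating the labeled slices directly would align on column names and produce four columns padded with NaN.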

Filter Values in Python of a Pandas Dataframe of a large array with multiple conditions

I have a dataset that I need to filter within each group of a second column (via groupby()): rows before a threshold is first exceeded should be dropped, and every row from that point on should be kept. Here is my attempt, which does not work:
df2 = df.groupby(['UWI']).[df.DIP > 85].reset_index(drop = True)
The dataframe looks like this:
UWI DIP
0 17 70
1 17 80
2 17 90
3 17 80
4 17 83
5 2 62
6 2 75
7 2 87
8 2 91
I want the returned dataframe to look like this:
UWI DIP
0 17 90
1 17 80
2 17 83
3 2 87
4 2 91
This is a large dataframe so efficiency would be appreciated.
If I understand correctly, you can use cummax on the boolean condition within each group:
df[df.DIP.gt(85).groupby(df['UWI']).cummax()]
UWI DIP
2 17 90
3 17 80
4 17 83
7 2 87
8 2 91
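To spell out why this works: `gt(85)` marks the rows that exceed the threshold, and a cumulative maximum within each group keeps the flag switched on for every later row of that group. Below is a self-contained sketch using the question's data; the integer cast is an extra precaution added here, not part of the original answer, and the names `exceeded` and `keep` are illustrative.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "UWI": [17, 17, 17, 17, 17, 2, 2, 2, 2],
    "DIP": [70, 80, 90, 80, 83, 62, 75, 87, 91],
})

# True where DIP exceeds the threshold.
exceeded = df["DIP"].gt(85)

# Within each UWI group, the running maximum stays at 1 from the first
# exceeding row onward. Casting to int and back keeps the dtype explicit.
keep = exceeded.astype(int).groupby(df["UWI"]).cummax().astype(bool)

# Apply the mask and renumber the rows to match the desired output.
result = df[keep].reset_index(drop=True)
print(result)
```

The whole computation is vectorized, so it should scale reasonably well to large frames.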

Transpose a pandas dataframe with headers as column and not index

When I transpose a dataframe, the headers become the index by default. But I want them to be a column and not an index. How do I achieve this?
import pandas as pd
# named "data" to avoid shadowing the built-in dict
data = {'col-a': [97, 98, 99],
        'col-b': [34, 35, 36],
        'col-c': [24, 25, 26]}
df = pd.DataFrame(data)
print(df.T)
0 1 2
col-a 97 98 99
col-b 34 35 36
col-c 24 25 26
Desired Output:
0 1 2 3
0 col-a 97 98 99
1 col-b 34 35 36
2 col-c 24 25 26
Try T with reset_index:
df=df.T.reset_index()
print(df)
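Or, in place (assign the transpose first, because df.T returns a new object and resetting its index in place would not change df):
df = df.T
df.reset_index(inplace=True)
print(df)
Both output: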
index 0 1 2
0 col-a 97 98 99
1 col-b 34 35 36
2 col-c 24 25 26
If you care about the column names, add this to the code:
df.columns=range(4)
Or:
it=iter(range(4))
df=df.rename(columns=lambda x: next(it))
Or, if you don't know the number of columns:
df.columns=range(len(df.columns))
Or:
it=iter(range(len(df.columns)))
df=df.rename(columns=lambda x: next(it))
All Output:
0 1 2 3
0 col-a 97 98 99
1 col-b 34 35 36
2 col-c 24 25 26
Use reset_index and then set default column names:
import numpy as np
df1 = df.T.reset_index()
df1.columns = np.arange(len(df1.columns))
print (df1)
0 1 2 3
0 col-a 97 98 99
1 col-b 34 35 36
2 col-c 24 25 26
Another solution:
print (df.rename_axis(0, axis=1).rename(lambda x: x + 1).T.reset_index())
#alternative
#print (df.T.rename_axis(0).rename(columns = lambda x: x + 1).reset_index())
0 1 2 3
0 col-a 97 98 99
1 col-b 34 35 36
2 col-c 24 25 26

pandas how to convert a dataframe to a matrix using transpose

I have the following df,
code y_m count
101 2017-11 86
101 2017-12 32
102 2017-11 11
102 2017-12 34
102 2018-01 46
103 2017-11 56
103 2017-12 89
Now I want to convert this df into a matrix in which the y_m values become the columns and the count values become the cells, like this:
0 1 2 3 4
0 -1 0 2017-11 2017-12 2018-01
1 0 354 153 155 46
2 101 118 86 32 -1
3 102 91 11 34 46
4 103 145 56 89 -1
Specifically, -1 is a dummy value that indicates either that no value exists for a given y_m and code, or that the cell only exists to maintain the matrix shape; 0 represents 'all' values, i.e. an aggregate over code, over y_m, or over both. For example, cell (1, 1) sums the count values for all y_m and code, and cell (1, 2) sums the count for 2017-11.
You can first use pivot_table:
df1 = (df.pivot_table(index='code',
                      columns='y_m',
                      values='count',
                      margins=True,
                      aggfunc='sum',
                      fill_value=-1,
                      margins_name='0'))
print (df1)
y_m 2017-11 2017-12 2018-01 0
code
101 86 32 -1 118
102 11 34 46 91
103 56 89 -1 145
0 153 155 46 354
Then build the final layout. Note that the result mixes numeric values with strings:
#change order of index and columns values for reindex
idx = df1.index[-1:].tolist() + df1.index[:-1].tolist()
cols = df1.columns[-1:].tolist() + df1.columns[:-1].tolist()
df2 = (df1.reindex(index=idx, columns=cols)
          .reset_index()
          .rename(columns={'code':-1})
          .rename_axis(None, axis=1))
#add columns to first row (pd.concat is used because DataFrame.append was removed in pandas 2.0)
df3 = pd.concat([df2.columns.to_frame().T, df2]).reset_index(drop=True)
#reset columns names to range
df3.columns = range(len(df3.columns))
print (df3)
0 1 2 3 4
0 -1 0 2017-11 2017-12 2018-01
1 0 354 153 155 46
2 101 118 86 32 -1
3 102 91 11 34 46
4 103 145 56 89 -1
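For readers who want to try this end to end, here is a self-contained sketch that builds the sample frame from the question and produces the same layout. It is a slight variation on the code above rather than a copy: the header row is built directly as a one-row frame, and the names `totals`, `body`, `header` and `matrix` are illustrative.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "code": [101, 101, 102, 102, 102, 103, 103],
    "y_m": ["2017-11", "2017-12", "2017-11", "2017-12",
            "2018-01", "2017-11", "2017-12"],
    "count": [86, 32, 11, 34, 46, 56, 89],
})

# Pivot with totals; missing combinations become -1, totals are labeled '0'.
totals = df.pivot_table(index="code", columns="y_m", values="count",
                        aggfunc="sum", fill_value=-1,
                        margins=True, margins_name="0")

# Move the totals row and the totals column to the front.
idx = totals.index[-1:].tolist() + totals.index[:-1].tolist()
cols = totals.columns[-1:].tolist() + totals.columns[:-1].tolist()
body = totals.reindex(index=idx, columns=cols).reset_index()
body.columns = range(body.shape[1])

# Header row: -1 in the corner, followed by the column labels.
header = pd.DataFrame([[-1] + cols])

# Stack the header on top of the body and renumber the rows.
matrix = pd.concat([header, body], ignore_index=True)
print(matrix)
```

As in the answer above, the 'all' label is the string '0' rather than the integer 0, so the resulting frame holds mixed types.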

How to divide 1 column into 5 segments with pandas and python?

I have a list with 1 column and 50 rows.
I want to divide it into 5 segments, each of which becomes a column of a dataframe. I do not want NaN values to appear (figure 2). How can I solve that?
Like this:
df = pd.DataFrame(result_list)
AWA=df[:10]
REM=df[10:20]
S1=df[20:30]
S2=df[30:40]
SWS=df[40:50]
result = pd.concat([AWA, REM, S1, S2, SWS], axis=1)
result
Figure2
You can use NumPy's reshape function:
import numpy as np
import pandas as pd
result_list = list(range(50))
pd.DataFrame(np.reshape(result_list, (10, 5), order='F'))
Out:
0 1 2 3 4
0 0 10 20 30 40
1 1 11 21 31 41
2 2 12 22 32 42
3 3 13 23 33 43
4 4 14 24 34 44
5 5 15 25 35 45
6 6 16 26 36 46
7 7 17 27 37 47
8 8 18 28 38 48
9 9 19 29 39 49
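The NaN values in the original attempt come from index alignment: each slice keeps its original row labels (0-9, 10-19, and so on), and concat with axis=1 aligns on those labels instead of stacking the slices side by side. Reshaping avoids the issue because the array carries no labels. The sketch below adds the question's segment names as column labels; `n_segments` and `segment_len` are illustrative names.

```python
import numpy as np
import pandas as pd

# Stand-in for the question's 50-element list.
result_list = list(range(50))

n_segments = 5
segment_len = len(result_list) // n_segments  # 10 rows per segment

# order='F' fills the array column by column, so the first 10 values
# form column 0, the next 10 form column 1, and so on.
by_column = np.reshape(result_list, (segment_len, n_segments), order="F")

# Segment names taken from the variable names in the question.
result = pd.DataFrame(by_column, columns=["AWA", "REM", "S1", "S2", "SWS"])
print(result)
```

With the default order='C' the values would instead be filled row by row, which would interleave the segments.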
