How to rearrange a pandas dataframe having N columns and append N columns together in python?

I have a dataframe df as shown below. A index, B Index and C Index appear as headers,
and each of them has the sub-headers Date and Last Price.
Input
A index B Index C Index
Date Last Price Date Last Price Date Last Price
1/10/2021 12 1/11/2021 46 2/9/2021 67
2/10/2021 13 2/11/2021 51 3/9/2021 70
3/10/2021 14 3/11/2021 62 4/9/2021 73
4/10/2021 15 4/11/2021 47 5/9/2021 76
5/10/2021 16 5/11/2021 51 6/9/2021 79
6/10/2021 17 6/11/2021 22 7/9/2021 82
7/10/2021 18 7/11/2021 29 8/9/2021 85
I want to transform it into the dataframe below.
Expected Output
Date Index Name Last Price
1/10/2021 A index 12
2/10/2021 A index 13
3/10/2021 A index 14
4/10/2021 A index 15
5/10/2021 A index 16
6/10/2021 A index 17
7/10/2021 A index 18
1/11/2021 B Index 46
2/11/2021 B Index 51
3/11/2021 B Index 62
4/11/2021 B Index 47
5/11/2021 B Index 51
6/11/2021 B Index 22
7/11/2021 B Index 29
2/9/2021 C Index 67
3/9/2021 C Index 70
4/9/2021 C Index 73
5/9/2021 C Index 76
6/9/2021 C Index 79
7/9/2021 C Index 82
8/9/2021 C Index 85
How can this be done with a pandas dataframe?

The structure of your df is not clear from your output. It would be useful if you provided Python code that creates an example, or at the very least the output of df.columns. For now, let us assume it has a 2-level MultiIndex on the columns, created like this:
import pandas as pd
columns = pd.MultiIndex.from_tuples([('A index','Date'), ('A index','Last Price'),('B index','Date'), ('B index','Last Price'),('C index','Date'), ('C index','Last Price')])
data = [
['1/10/2021', 12, '1/11/2021', 46, '2/9/2021', 67],
['2/10/2021', 13, '2/11/2021', 51, '3/9/2021', 70],
['3/10/2021', 14, '3/11/2021', 62, '4/9/2021', 73],
['4/10/2021', 15, '4/11/2021', 47, '5/9/2021', 76],
['5/10/2021', 16, '5/11/2021', 51, '6/9/2021', 79],
['6/10/2021', 17, '6/11/2021', 22, '7/9/2021', 82],
['7/10/2021', 18, '7/11/2021', 29, '8/9/2021', 85],
]
df = pd.DataFrame(columns = columns, data = data)
Then what you are trying to do is basically an application of .stack on the top column level, followed by some rearrangement:
(df.stack(level = 0)
.reset_index(level=1)
.rename(columns = {'level_1':'Index Name'})
.sort_values(['Index Name','Date'])
)
This produces:
Index Name Date Last Price
0 A index 1/10/2021 12
1 A index 2/10/2021 13
2 A index 3/10/2021 14
3 A index 4/10/2021 15
4 A index 5/10/2021 16
5 A index 6/10/2021 17
6 A index 7/10/2021 18
0 B index 1/11/2021 46
1 B index 2/11/2021 51
2 B index 3/11/2021 62
3 B index 4/11/2021 47
4 B index 5/11/2021 51
5 B index 6/11/2021 22
6 B index 7/11/2021 29
0 C index 2/9/2021 67
1 C index 3/9/2021 70
2 C index 4/9/2021 73
3 C index 5/9/2021 76
4 C index 6/9/2021 79
5 C index 7/9/2021 82
6 C index 8/9/2021 85
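An alternative sketch, under the same assumption of a 2-level MultiIndex on the columns: select each top-level group, tag it with its name, and concatenate the pieces. The variable names `parts` and `result` are illustrative only. This keeps the rows in the original column order and gives the column order shown in the expected output.

```python
import pandas as pd

columns = pd.MultiIndex.from_tuples([
    ('A index', 'Date'), ('A index', 'Last Price'),
    ('B index', 'Date'), ('B index', 'Last Price'),
    ('C index', 'Date'), ('C index', 'Last Price'),
])
data = [
    ['1/10/2021', 12, '1/11/2021', 46, '2/9/2021', 67],
    ['2/10/2021', 13, '2/11/2021', 51, '3/9/2021', 70],
    ['3/10/2021', 14, '3/11/2021', 62, '4/9/2021', 73],
]
df = pd.DataFrame(data, columns=columns)

# One small frame per top-level header, each tagged with that header's name
parts = [
    df[name].assign(**{'Index Name': name})
    for name in df.columns.get_level_values(0).unique()
]

# Append the pieces one below the other and fix the column order
result = pd.concat(parts, ignore_index=True)[['Date', 'Index Name', 'Last Price']]
print(result)
```

Because nothing is sorted, the dates stay in the order in which they appear in the input, which avoids sorting date strings lexicographically.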

Related

Creating a list from series of pandas

I'm trying to create a list from 3 different series, with entries of the shape "({A} {B} {C})", where A is the 1st element of series 1, B is the 1st element of series 2, and C is the 1st element of series 3, and so on, so that the resulting list contains 600 elements.
List 1 List 2 List 3
u_p0 1 v_p0 2 w_p0 7
u_p1 21 v_p1 11 w_p1 45
u_p2 32 v_p2 25 w_p2 32
u_p3 45 v_p3 76 w_p3 49
... .... ....
u_p599 56 v_p599 78 w_599 98
Now I want the output list as follows
(1 2 7)
(21 11 45)
(32 25 32)
(45 76 49)
.....
These are the 3 series I created from a dataframe
r1=turb_1.iloc[qw1] #List1
r2=turb_1.iloc[qw2] #List2
r3=turb_1.iloc[qw3] #List3
For the output, I think Python's string formatting will be useful, but I'm not quite sure how to proceed.
turb_3= ["({A} {B} {C})".format(A=i,B=j,C=k) for i in r1 for j in r2 for k in r3]
Any kind of help will be useful.
Use pandas.DataFrame.itertuples with str.format:
# Sample data
print(df)
col1 col2 col3
0 1 2 7
1 21 11 45
2 32 25 32
3 45 76 49
fmt = "({} {} {})"
[fmt.format(*tup) for tup in df[["col1", "col2", "col3"]].itertuples(False, None)]
Output:
['(1 2 7)', '(21 11 45)', '(32 25 32)', '(45 76 49)']
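If the three series are kept separate rather than combined into a DataFrame, a minimal sketch with the built-in zip pairs the elements position by position. The values of `r1`, `r2`, and `r3` below are stand-ins taken from the first four rows of the question. Note that the attempt in the question uses three nested `for` clauses, which builds every combination of the three series instead of one entry per position.

```python
import pandas as pd

# Stand-in data: the first four values of each series from the question
r1 = pd.Series([1, 21, 32, 45])
r2 = pd.Series([2, 11, 25, 76])
r3 = pd.Series([7, 45, 32, 49])

# zip walks the three series in lockstep, one element from each per step
turb_3 = ["({} {} {})".format(a, b, c) for a, b, c in zip(r1, r2, r3)]
print(turb_3)
```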

Most frequently occurring numbers across multiple columns using pandas

I have a data frame with numbers in multiple columns, listed by date. I'm trying to find the most frequently occurring numbers across the whole data set, and also grouped by date.
import pandas as pd
import glob
def lotnorm(pdobject):
    # clean up special characters in the column names and make the date column the index as a date type.
    pdobject["Date"] = pd.to_datetime(pdobject["Date"])
    pdobject = pdobject.set_index('Date')
    for column in pdobject:
        if '#' in column:
            pdobject = pdobject.rename(columns={column:column.replace('#','')})
    return pdobject
def lotimport():
    lotret = {}
    # list files in data directory with csv filename
    for lotpath in [f for f in glob.glob("data/*.csv")]:
        lotname = lotpath.split('\\')[1].split('.')[0]
        lotret[lotname] = lotnorm(pd.read_csv(lotpath))
    return lotret
print(lotimport()['ozlotto'])
------------- Output ---------------------
1 2 3 4 5 6 7 8 9
Date
2020-07-07 4 5 7 9 12 13 32 19 35
2020-06-30 1 17 26 28 38 39 44 14 41
2020-06-23 1 3 9 13 17 20 41 28 45
2020-06-16 1 2 13 21 22 27 38 24 33
2020-06-09 8 11 26 27 31 38 39 3 36
... .. .. .. .. .. .. .. .. ..
2005-11-15 7 10 13 17 30 32 41 20 14
2005-11-08 12 18 22 28 33 43 45 23 13
2005-11-01 1 3 11 17 24 34 43 39 4
2005-10-25 7 16 23 29 36 39 42 19 43
2005-10-18 5 9 12 30 33 39 45 7 19
The output I am aiming for is
Number frequency
45 201
32 195
24 187
14 160
48 154
--------------- Updated with append experiment -----------
I tried using append to create a single series from the dataframe, which worked for individual lines of code but got a really odd result when I ran it inside a for loop.
temp = lotimport()['ozlotto']['1']
print(temp)
temp = temp.append(lotimport()['ozlotto']['2'], ignore_index=True, verify_integrity=True)
print(temp)
temp = temp.append(lotimport()['ozlotto']['3'], ignore_index=True, verify_integrity=True)
print(temp)
lotcomb = pd.DataFrame()
for i in (lotimport()['ozlotto'].columns.tolist()):
    print(f"{i} - {type(i)}")
    lotcomb = lotcomb.append(lotimport()['ozlotto'][i], ignore_index=True, verify_integrity=True)
print(lotcomb)
This solution might be the one you are looking for.
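import numpy as np
# np.unique returns the sorted unique values and, with return_counts=True, how often each occurs
freqvalues = np.unique(df.to_numpy(), return_counts=True)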
df2 = pd.DataFrame(index=freqvalues[0], data=freqvalues[1], columns=["Frequency"])
df2.index.name = "Numbers"
df2
Output:
Frequency
Numbers
1 6
2 5
3 5
5 8
6 4
7 7
8 2
9 7
10 3
11 4
12 2
13 8
14 1
15 4
16 4
17 6
18 4
19 5
20 9
21 3
22 4
23 2
24 4
25 5
26 4
27 6
28 1
29 6
30 3
31 3
... ...
70 6
71 6
72 5
73 5
74 2
75 8
76 5
77 3
78 3
79 2
80 3
81 4
82 6
83 9
84 5
85 4
86 1
87 3
88 4
89 3
90 4
91 4
92 3
93 5
94 1
95 4
96 6
97 6
98 1
99 6
97 rows × 1 columns
df.max(axis=0)
for columns
df.max(axis=1)
for index
OK, so the final answer I came up with is a mix of a few things, including some of the great input from people in this thread. Essentially I do the following:
Pull in the CSV file and clean up the dates and the column names, then convert it to a pandas dataframe.
Then create a new pandas series and append each column to it ignoring dates to prevent conflicts.
Once I have the series, I use Vioxini's suggestion to use numpy to get counts of unique values and then turn the values into the index, after that sort the column by count in descending order and return the top 10 values.
Below is the resulting code, I hope it helps someone else.
import pandas as pd
import glob
import numpy as np
def lotnorm(pdobject):
    # clean up special characters in the column names and make the date column the index as a date type.
    pdobject["Date"] = pd.to_datetime(pdobject["Date"])
    pdobject = pdobject.set_index('Date')
    for column in pdobject:
        if '#' in column:
            pdobject = pdobject.rename(columns={column:column.replace('#','')})
    return pdobject
def lotimport():
    lotret = {}
    # list files in data directory with csv filename
    for lotpath in [f for f in glob.glob("data/*.csv")]:
        lotname = lotpath.split('\\')[1].split('.')[0]
        lotret[lotname] = lotnorm(pd.read_csv(lotpath))
    return lotret
lotcomb = pd.Series([],dtype=object)
for i in (lotimport()['ozlotto'].columns.tolist()):
    lotcomb = lotcomb.append(lotimport()['ozlotto'][i], ignore_index=True, verify_integrity=True)
freqvalues = np.unique(lotcomb.to_numpy(), return_counts=True)
lotop = pd.DataFrame(index=freqvalues[0], data=freqvalues[1], columns=["Frequency"])
lotop.index.name = "Numbers"
lotop.sort_values(by=['Frequency'],ascending=False).head(10)
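Series.append was removed in pandas 2.0, so the loop above fails on current versions. A shorter sketch that avoids appending altogether flattens all value columns into one Series and lets value_counts do the counting and sorting. The small `df` below is a hypothetical stand-in for `lotimport()['ozlotto']`, and the names `counts` and `top` are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for the lottery frame: dates as index, one column per ball
df = pd.DataFrame(
    {"1": [4, 1, 1], "2": [5, 17, 3], "3": [7, 26, 9]},
    index=pd.to_datetime(["2020-07-07", "2020-06-30", "2020-06-23"]),
)
df.index.name = "Date"

# Flatten every cell into one Series, then count; value_counts sorts by count, descending
counts = pd.Series(df.to_numpy().ravel()).value_counts()
top = counts.rename_axis("Numbers").rename("Frequency").head(10)
print(top)
```

The date index is ignored here because only the cell values are flattened, which matches the "ignoring dates" step described above.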

Add columns - specify name - row value based on other

I have a Dataframe like this:
data = {'TYPE':['X', 'Y', 'Z'],'A': [11,12,13], 'B':[21,22,23], 'C':[31,32,34]}
df = pd.DataFrame(data)
TYPE A B C
0 X 11 21 31
1 Y 12 22 32
2 Z 13 23 34
I like to get the following DataFrame:
TYPE A A_added B B_added C C_added
0 X 11 15 21 25 31 35
1 Y 12 18 22 28 32 38
2 Z 13 20 23 30 34 41
For each column (next to TYPE column), here A,B,C:
add a new column with the name column_name_added
if TYPE = X add 4, if TYPE = Y add 6, if Z add 7
The idea is to build a helper Series with Series.map and a dictionary, add it to the value columns with DataFrame.add, join the result to the original with DataFrame.join, and finally reorder the columns with DataFrame.reindex:
d = {'X':4,'Y':6, 'Z':7}
cols = df.columns[:1].tolist() + [i for x in df.columns[1:] for i in (x, x + '_added')]
df1 = df.iloc[:, 1:].add(df['TYPE'].map(d), axis=0).add_suffix('_added')
df2 = df.join(df1).reindex(cols, axis=1)
print (df2)
TYPE A A_added B B_added C C_added
0 X 11 15 21 25 31 35
1 Y 12 18 22 28 32 38
2 Z 13 20 23 30 34 41
EDIT: Values that are not matched by the dictionary become missing values after Series.map, so adding Series.fillna(7) applies the value 7 to all other types:
d = {'X':4,'Y':6}
cols = df.columns[:1].tolist() + [i for x in df.columns[1:] for i in (x, x + '_added')]
df1 = df.iloc[:, 1:].add(df['TYPE'].map(d).fillna(7).astype(int), axis=0).add_suffix('_added')
df2 = df.join(df1).reindex(cols, axis=1)
print (df2)
TYPE A A_added B B_added C C_added
0 X 11 15 21 25 31 35
1 Y 12 18 22 28 32 38
2 Z 13 20 23 30 34 41
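The same result can also be built column by column, which some readers may find easier to follow than the join and reindex steps. This is only a sketch; the names `offsets` and `out` are illustrative, and the fallback of 7 for unmatched types follows the EDIT above.

```python
import pandas as pd

data = {'TYPE': ['X', 'Y', 'Z'], 'A': [11, 12, 13], 'B': [21, 22, 23], 'C': [31, 32, 34]}
df = pd.DataFrame(data)

# Per-row amount to add: 4 for X, 6 for Y, 7 for anything else
offsets = df['TYPE'].map({'X': 4, 'Y': 6}).fillna(7).astype(int)

# Start from TYPE and place each original column next to its "_added" twin
out = df[['TYPE']].copy()
for col in df.columns[1:]:
    out[col] = df[col]
    out[col + '_added'] = df[col] + offsets

print(out)
```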

Replace values in Columns

I want to replace values in columns based on a condition:
If the value in column [D] is not the same as any of the values in [A, B, C], then replace the first NaN in that row with the value from D. If there is no NaN in the row, create a new column [E] and put the value from column [D] in column [E].
ID A B C D
0 22 32 NaN 22
1 25 13 NaN 15
2 27 NaN NaN 20
3 29 10 16 29
4 12 92 33 55
I want output to be:
ID A B C D E
0 22 32 NaN 22
1 25 13 15 15
2 27 20 NaN 20
3 29 10 16 29
4 12 92 33 55 55
List = [[22 , 32 , None , 22],
[25 , 13 , None , 15],
[27 , None , None , 20],
[29 , 10 , 16 , 29],
[12 , 92 , 33 , 55]]
for Row in List:
    Target_C = Row[3]
    if Row.count(Target_C) < 2:  # D does not match any of A, B, C
        None_Found = False  # Flag to check later whether a None was found
        for enumerate_Column in enumerate(Row):  # (index, value) pairs of the row
            if(None in enumerate_Column):  # the value in this position is None
                Row[enumerate_Column[0]] = Target_C  # replace None with the value of column D
                None_Found = True  # Change None_Found to True
            if(None_Found):  # Stop after the first None has been replaced
                break
        if(None_Found == False):  # if no None was found, add a new column
            Row.append(Target_C)
My Code example
You can do it this way
a = df.isnull()
b = (a[a.any(axis=1)].idxmax(axis=1))
nanindex = b.index
check = (df.A!=df.D) & (df.B!=df.D) & (df.C!=df.D)
commonind = check[~check].index
replace_ind_list = list(nanindex.difference(commonind))
new_col_list = df.index.difference(list(set(commonind.tolist()+nanindex.tolist()))).tolist()
df['E']=''
for index, row in df.iterrows():
    for val in new_col_list:
        if index == val:
            df.at[index,'E'] = df['D'][index]
    for val in replace_ind_list:
        if index == val:
            df.at[index,b[val]] = df['D'][index]
df
Output
ID A B C D E
0 0 22 32.0 NaN 22
1 1 25 13.0 15.0 15
2 2 27 20.0 NaN 20
3 3 29 10.0 16.0 29
4 4 12 92.0 33.0 55 55
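The rule from the question can also be written as one row-wise function, which keeps the three cases (D already present, a NaN to fill, no NaN left) visible in one place. This is a sketch under the assumption that the frame has exactly the columns A, B, C and D; the helper name `place_d` is made up for illustration. Because rows are processed as float Series, all numeric columns come back as floats.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [22, 25, 27, 29, 12],
    "B": [32, 13, np.nan, 10, 92],
    "C": [np.nan, np.nan, np.nan, 16, 33],
    "D": [22, 15, 20, 29, 55],
})

def place_d(row):
    """Apply the replacement rule to a single row (hypothetical helper)."""
    row = row.copy()
    targets = ["A", "B", "C"]
    if (row[targets] == row["D"]).any():
        return row                          # D already appears in A, B or C
    missing = row[targets].isna()
    if missing.any():
        row[missing.idxmax()] = row["D"]    # fill the first NaN with D
    else:
        row["E"] = row["D"]                 # no NaN left: put D into E
    return row

df["E"] = np.nan
out = df.apply(place_d, axis=1)
print(out)
```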

A vectorized solution producing a new column in DataFrame that depends on conditions of existing columns and also the new column itself

My current dataframe data is as follows:
df=pd.DataFrame([[1.4,3.5,4.6],[2.8,5.4,6.4],[7.8,6.5,5.8]],columns=['t','i','m'])
t i m
0 14 35 46
1 28 54 64
2 28 34 64
3 78 65 58
My goal is to apply a vectorized operation on the df with conditions as follows (pseudo code):
New column of answer starts with value of 1.
For row in df.itertuples():
    if (m > i) & (answer in row-1 is an odd number):
        answer in row = answer in row-1 + m
    elif (m > i):
        answer in row = answer in row-1 - m
    else:
        answer in row = answer in row-1
The desired output is as follows:
t i m answer
0 14 35 46 1
1 28 54 59 60
2 78 12 58 2
3 78 91 48 2
Any elegant solution would be appreciated.
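Since each answer depends on the previous answer, the rule is a recurrence and does not map directly onto column-wise operations. A plain loop is one straightforward way to express it. The sketch below is not vectorized; it uses the four-row integer table printed above and assumes that the first row simply receives the starting value 1, with the rule applied from the second row on. The function name `running_answer` is illustrative.

```python
import pandas as pd

# The four-row table printed in the question
df = pd.DataFrame(
    [[14, 35, 46], [28, 54, 64], [28, 34, 64], [78, 65, 58]],
    columns=['t', 'i', 'm'],
)

def running_answer(frame, start=1):
    """Carry the answer forward row by row (assumption: row 0 gets `start`)."""
    answers = [start]
    prev = start
    for i_val, m_val in zip(frame['i'].iloc[1:], frame['m'].iloc[1:]):
        if m_val > i_val:
            # odd previous answer: add m; even previous answer: subtract m
            prev = prev + m_val if prev % 2 == 1 else prev - m_val
        answers.append(int(prev))
    return answers

df['answer'] = running_answer(df)
print(df)
```

On this input the loop yields 1, 65, 129, 129, which follows from the pseudo code but differs from the desired output listed in the question, whose rows do not match the input table.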
