grouping in pandas dataframe - python-3.x

I have this input dataframe:
Test1 Test2 Test3 Subject
0 45 NaN NaN Python
1 50 NaN NaN Python
2 NaN 30 NaN Python
3 NaN 35 NaN OS
4 NaN 38 NaN OS
5 NaN 43 NaN Java
6 NaN 32 NaN DS
7 NaN NaN 49 DS
8 NaN 25 NaN DS
9 NaN 34 NaN DS
Expected output is (DataFrame):
Subject  Test1  Test2     Test3
Python   45,50  30
OS              35,38
Java            43
DS              32,25,34  49
I've tried this code:
(df.groupby(['subject'])
   .sum()
   .reset_index()
   .assign(subject=lambda x: x['subject'].where(~x['subject'].duplicated(), ''))
   .to_csv('filename.csv', index=False))
It's not giving the desired output.
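For reproducibility, the sample frame can be rebuilt like this (a minimal sketch; values taken from the table above):
import numpy as np
import pandas as pd

# Rebuild the sample frame shown above; NaN marks a missing score
df = pd.DataFrame({
    'Test1': [45, 50, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
    'Test2': [np.nan, np.nan, 30, 35, 38, 43, 32, np.nan, 25, 34],
    'Test3': [np.nan] * 7 + [49, np.nan, np.nan],
    'Subject': ['Python', 'Python', 'Python', 'OS', 'OS', 'Java', 'DS', 'DS', 'DS', 'DS'],
})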

Use a custom function: remove missing values with Series.dropna, convert to integer if necessary, then cast the numeric values to strings and join them. (The original attempt fails because the groupby key is 'subject' while the column is named 'Subject', and sum() adds the numbers per group instead of joining them.)
f = lambda x: ','.join(x.dropna().astype(int).astype(str))
df = df.groupby('Subject', sort=False).agg(f).reset_index()
print (df)
  Subject  Test1  Test2     Test3
0  Python  45,50  30
1      OS         35,38
2    Java         43
3      DS         32,25,34  49
Another idea, without converting to integers, which works if the values come in many different formats (e.g. some columns numeric, some strings):
f = lambda x: ','.join(x.dropna().astype(str))
df = df.groupby('Subject', sort=False).agg(f).reset_index()
print (df)
  Subject  Test1      Test2           Test3
0  Python  45.0,50.0  30.0
1      OS             35.0,38.0
2    Java             43.0
3      DS             32.0,25.0,34.0  49.0
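To finish with the asker's goal of writing the result to CSV, the aggregated frame can be exported directly (a sketch reusing the filename from the question):
# Export the aggregated frame; empty strings become empty cells in the CSV
df.to_csv('filename.csv', index=False)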

Related

How to read unmerged column in Pandas and transpose them

I have an Excel file with multiple sheets in the format below. I need to create a single dataframe by concatenating all the sheets, unmerging the cells, and then transposing them into a column based on the sheet.
Sheet 1 and Sheet 2 were shown as screenshots in the original post; the expected result is the same data combined into one frame, with an extra column identifying the sheet.
Code so far:
Reading the file:
df = pd.concat(pd.read_excel('/Users/john/Desktop/Internal/Raw Files/Med/Dig/File_2017_2022.xlsx', sheet_name=None, skiprows=1))
Creating the column:
df_1 = pd.concat([df.assign(name=n) for n,df in dfs.items()])
Use read_excel with header=[0,1] to build a MultiIndex from the first two header rows and index_col=[0,1] for a MultiIndex from the first two columns. Then reshape each sheet in a loop with DataFrame.stack, add a new column, join everything with concat, and finally set the index names with DataFrame.rename_axis and convert them to columns with DataFrame.reset_index:
dfs = pd.read_excel('Input_V1.xlsx', sheet_name=None, header=[0,1], index_col=[0,1])

df_1 = (pd.concat([df.stack(0).assign(name=n) for n, df in dfs.items()])
          .rename_axis(index=['Date','WK','Brand'], columns=None)
          .reset_index())
df_1.insert(len(df_1.columns) - 2, 'Campaign', df_1.pop('Campaign'))
print (df_1)
Date WK Brand A B C D E F G \
0 2017-10-02 Week 40 ABC NaN NaN NaN NaN 56892.800000 83431.664000 NaN
1 2017-10-09 Week 41 ABC NaN NaN NaN NaN 0.713716 0.474025 NaN
2 2017-10-16 Week 42 ABC NaN NaN NaN NaN 0.025936 0.072500 NaN
3 2017-10-23 Week 43 ABC NaN NaN NaN NaN 0.182677 0.926731 NaN
4 2017-10-30 Week 44 ABC NaN NaN NaN NaN 0.755607 0.686115 NaN
.. ... ... ... .. .. .. .. ... ... ..
99 2018-03-26 Week 13 PQR NaN NaN NaN NaN 47702.000000 12246.000000 NaN
100 2018-04-02 Week 14 PQR NaN NaN NaN NaN 38768.000000 46498.000000 NaN
101 2018-04-09 Week 15 PQR NaN NaN NaN NaN 35917.000000 45329.000000 NaN
102 2018-04-16 Week 16 PQR NaN NaN NaN NaN 39639.000000 51343.000000 NaN
103 2018-04-23 Week 17 PQR NaN NaN NaN NaN 50867.000000 30119.000000 NaN
H I J K L Campaign name
0 NaN NaN NaN 0.017888 0.697324 NaN ABC
1 NaN NaN NaN 0.457963 0.810985 NaN ABC
2 NaN NaN NaN 0.743030 0.253668 NaN ABC
3 NaN NaN NaN 0.038683 0.050028 NaN ABC
4 NaN NaN NaN 0.885567 0.712333 NaN ABC
.. .. .. .. ... ... ... ...
99 NaN NaN NaN 9433.000000 17108.000000 WX PQR
100 NaN NaN NaN 12529.000000 23557.000000 WX PQR
101 NaN NaN NaN 20395.000000 44228.000000 WX PQR
102 NaN NaN NaN 55077.000000 45149.000000 WX PQR
103 NaN NaN NaN 45815.000000 35761.000000 WX PQR
[104 rows x 17 columns]
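For intuition, each sheet parsed this way carries a two-level column index (brand on top, metric below) and a two-level row index (Date, WK); stack(0) moves the brand level down into the row index. A quick inspection sketch, assuming a sheet named 'ABC' as in the output above:
# Peek at one parsed sheet before stacking ('ABC' sheet name assumed)
one = dfs['ABC']
print(one.columns.nlevels, one.index.nlevels)  # 2 and 2
print(one.stack(0).head())  # brand level moved into the row index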
I created my own version of your Excel file (shown as a screenshot in the original post). The code below is far from perfect, but it should do fine as long as you do not have millions of sheets:
# First, obtain all sheet names
full_df = pd.read_excel(r'C:\Users\.\Downloads\test.xlsx',
                        sheet_name=None, skiprows=0)
# Store them in a list
sheet_names = list(full_df.keys())
# Create an empty DataFrame to store the contents of each sheet
final_df = pd.DataFrame()
for sheet in sheet_names:
    df = pd.read_excel(r'C:\Users\.\Downloads\test.xlsx', sheet_name=sheet, skiprows=0)
    # Get the brand name
    brand = df.columns[1]
    # Remove the header rows and keep the numerical values only
    df.columns = df.iloc[0]
    df = df[1:]
    df = df.iloc[:, 1:]
    # Set the brand name in a new column
    df['Brand'] = brand
    # Append to the final dataframe
    final_df = pd.concat([final_df, df])
Your final_df should look like this once exported back to Excel (screenshot in the original post).
EDIT: You might need to drop the dataframe's index when saving, via final_df.reset_index(drop=True), to remove the extra first column that otherwise appears in the export.
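A sketch of that saving step (the output filename is a placeholder):
# Reset the concatenated index so the exported sheet gets a clean 0..n index,
# or suppress the index entirely with index=False
final_df = final_df.reset_index(drop=True)
final_df.to_excel('final.xlsx', index=False)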

I am new to pandas DataFrames. I am getting this error: TypeError: can only concatenate str (not "int") to str

I am getting an error from this code:
TypeError: can only concatenate str (not "int") to str
The code:
import pandas as pd
import numpy as np
df = pd.read_excel("test123.xlsx")
print(df)
subone = df["F1"] + df["I1"]
print(subone)
The Excel file mentioned:
sl no name F1 F2 Unnamed: 4 Unnamed: 5 I1 I2
0 1 abc 0 95 NaN NaN 10 54
1 2 def 10 88 NaN NaN 22 21
2 3 ghi 52 44 NaN NaN 33 21
3 4 jkl 65 55 NaN NaN 54 21
4 5 bgm **AB** 25 NaN NaN 65 23
The error occurs because F1 contains a non-numeric entry (**AB**), so that column holds strings mixed with integers. If you want to concatenate the two columns as strings:
subone = df["F1"].astype(str) + df["I1"].astype(str)
print(subone)
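If the goal was instead a numeric sum, the stray string must first be coerced to NaN. A sketch using pd.to_numeric, assuming non-numeric entries should be treated as missing:
import pandas as pd

df = pd.read_excel("test123.xlsx")
# Coerce non-numeric cells such as **AB** to NaN, then add the columns
subone = pd.to_numeric(df["F1"], errors="coerce") + pd.to_numeric(df["I1"], errors="coerce")
print(subone)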

Python pandas shift null in certain columns Only

I have a DataFrame like this:
A B C D E F G H I J
0 A.1 Data Data 223 52
1 A.2 Data Data Data 12 6
2 A.4 Data 32 365
3 A.5 Data 100 88
4 A.6 Data 654 98
5 A.7 Data 356 56
And my desired output like this:
A B C D E F G H I J
0 A.1 Data Data 223 52
1 A.2 Data Data Data 12 6
2 A.4 Data 32 365
3 A.5 Data 100 88
4 A.6 Data 654 98
5 A.7 Data 356 56
Only columns A to E should have their nulls shifted. I have a current script using a lambda, but it shifts the nulls to the last columns across the whole dataframe. I need it for certain columns only. Can anyone help? Thank you!
def shift_null(arr):
    return [x for x in arr if x == x] + [np.nan for x in arr if x != x]

df = df.T.apply(lambda arr: shift_null(arr)).T
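For reference, a runnable sample consistent with the outputs below; the exact NaN positions in the question's tables were lost in formatting, so they are assumed here:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['A.1', 'A.2', 'A.4', 'A.5', 'A.6', 'A.7'],
    'B': ['Data', 'Data', np.nan, 'Data', np.nan, 'Data'],
    'C': [np.nan, 'Data', 'Data', np.nan, 'Data', np.nan],
    'D': ['Data', 'Data', np.nan, np.nan, np.nan, np.nan],
    'E': [np.nan] * 6,
    'F': [223, 12, 32, 100, 654, 356],
    'G': [52, 6, 365, 88, 98, 56],
})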
You can remove the missing values per row with Series.dropna, add back the possibly all-missing columns with DataFrame.reindex, and then set the column names from a list with DataFrame.set_axis:
cols = ['A','B','C','D','E']
df[cols] = (df[cols].apply(lambda x: pd.Series(x.dropna().tolist()), axis=1)
                    .reindex(range(len(cols)), axis=1)
                    .set_axis(cols, axis=1))
print (df)
A B C D E F G
0 A.1 Data Data NaN NaN 223 52
1 A.2 Data Data Data NaN 12 6
2 A.4 Data NaN NaN NaN 32 365
3 A.5 Data NaN NaN NaN 100 88
4 A.6 Data NaN NaN NaN 654 98
5 A.7 Data NaN NaN NaN 356 56
Your solution can be kept by removing the transposing and passing result_type='expand' to DataFrame.apply:
cols = ['A','B','C','D','E']

def shift_null(arr):
    return [x for x in arr if x == x] + [np.nan for x in arr if x != x]

df[cols] = df[cols].apply(lambda arr: shift_null(arr), axis=1, result_type='expand')
print (df)
A B C D E F G
0 A.1 Data Data NaN NaN 223 52
1 A.2 Data Data Data NaN 12 6
2 A.4 Data NaN NaN NaN 32 365
3 A.5 Data NaN NaN NaN 100 88
4 A.6 Data NaN NaN NaN 654 98
5 A.7 Data NaN NaN NaN 356 56
Another idea is sorting with the key parameter so that NaNs sort last:
cols = ['A','B','C','D','E']
df[cols] = df[cols].apply(lambda x: x.sort_values(key=lambda x: x.isna()).tolist(),
                          axis=1, result_type='expand')
print (df)
A B C D E F G
0 A.1 Data Data NaN NaN 223 52
1 A.2 Data Data Data NaN 12 6
2 A.4 Data NaN NaN NaN 32 365
3 A.5 Data NaN NaN NaN 100 88
4 A.6 Data NaN NaN NaN 654 98
5 A.7 Data NaN NaN NaN 356 56
A solution that reshapes with DataFrame.stack (which drops the NaNs), adds a counter for the new column names, and finally reshapes back with Series.unstack:
s = df[cols].stack().droplevel(1)
s.index = [s.index, s.groupby(level=0).cumcount()]
df[cols] = s.unstack().rename(dict(enumerate(cols)), axis=1).reindex(cols, axis=1)
print (df)
A B C D E F G
0 A.1 Data Data NaN NaN 223 52
1 A.2 Data Data Data NaN 12 6
2 A.4 Data NaN NaN NaN 32 365
3 A.5 Data NaN NaN NaN 100 88
4 A.6 Data NaN NaN NaN 654 98
5 A.7 Data NaN NaN NaN 356 56

Error while performing operation on DatetimeIndexResampler type

I have a time-series data frame and want to find the difference between the date in each record and the last (maximum) date within that data frame, but I get the error TypeError: unsupported operand type(s) for -: 'DatetimeIndex' and 'SeriesGroupBy'. It seems from the error that the data frame is not of the 'right' type for these operations. How can I avoid this, or cast the data to some other format that allows the operation? Below is sample code which reproduces the error:
import pandas as pd

df = pd.DataFrame([[54.7,36.3,'2010-07-20'],[54.7,36.3,'2010-07-21'],
                   [52.3,38.7,'2010-07-26'],[52.3,38.7,'2010-07-30']],
                  columns=['col1','col2','date'])
df.date = pd.to_datetime(df.date)
df.index = df.date
df = df.resample('D')
print(type(df))
diff = (df.date.max() - df.date).values
I think you need to create a DatetimeIndex first with DataFrame.set_index. Note that resample alone returns a lazy DatetimeIndexResampler (which is where the error comes from), so aggregate by max to get consecutive daily values:
df = pd.DataFrame([[54.7,36.3,'2010-07-20'],
                   [54.7,36.3,'2010-07-21'],
                   [52.3,38.7,'2010-07-26'],
                   [52.3,38.7,'2010-07-30']],
                  columns=['col1','col2','date'])
df.date = pd.to_datetime(df.date)
df1 = df.set_index('date').resample('D').max()
#alternative if there are no duplicated datetimes
#df1 = df.set_index('date').asfreq('D')
print (df1)
print (df1)
col1 col2
date
2010-07-20 54.7 36.3
2010-07-21 54.7 36.3
2010-07-22 NaN NaN
2010-07-23 NaN NaN
2010-07-24 NaN NaN
2010-07-25 NaN NaN
2010-07-26 52.3 38.7
2010-07-27 NaN NaN
2010-07-28 NaN NaN
2010-07-29 NaN NaN
2010-07-30 52.3 38.7
Then subtract the index from its maximum value and convert the timedeltas to days with TimedeltaIndex.days:
df1['diff'] = (df1.index.max() - df1.index).days
print (df1)
col1 col2 diff
date
2010-07-20 54.7 36.3 10
2010-07-21 54.7 36.3 9
2010-07-22 NaN NaN 8
2010-07-23 NaN NaN 7
2010-07-24 NaN NaN 6
2010-07-25 NaN NaN 5
2010-07-26 52.3 38.7 4
2010-07-27 NaN NaN 3
2010-07-28 NaN NaN 2
2010-07-29 NaN NaN 1
2010-07-30 52.3 38.7 0
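If the differences are wanted on the original, unresampled frame instead, the same subtraction works directly on the date column (a sketch):
# Days between each row's date and the latest date, without resampling
df['diff'] = (df['date'].max() - df['date']).dt.days
print(df)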

Import csv file with sep=';' to python by columns - Pandas Dataset

I have a csv file like this:
ATTRIBUTE_1;.....;ATTRIBUTE_N
null;01;M;N;;N;1108;1;F205;;N;F;13;;N;S;2;N;6000000;;A010;40;B;2;10;42;N;;61;MI;01;N;N;S;;-1;N;N;01;;;;;;;;;;;;;;;;;;;;;;;;;;778,69
null;01;M;N;;N;1108;1;F205;;N;F;13;;N;S;2;N;6000000;;A010;40;B;2;10;42;N;;61;MI;01;N;N;S;;-1;N;N;01;;;;;;;;;;;;;;;;;;;;;;;;;;778,71
null;01;M;N;;N;1108;1;F205;;N;F;13;;N;S;2;N;6000000;;A010;40;B;2;10;42;N;;61;MI;01;N;N;S;;-1;N;N;01;;;;;;;;;;;;;;;;;;;;;;;;;;778,72
When I try to import it in Python with this command:
data = pd.read_csv(r"C:\...\file.csv")
My output is this:
0 null;01;M;N;;N;1108;1;F205;;N;F;13;;N;S;2;N;60...
How can I import the csv by columns? Like this:
ATTRIBUTE_1 ATTRIBUTE_2 .... ATTRIBUTE_N
NULL 01 778,69
NULL 01 778,71
...
NULL 03 775,33
The problem is that each row starts and ends with ", so the parameter quoting=3 (i.e. QUOTE_NONE) is necessary:
df = pd.read_csv('file.csv', sep=';', quoting=3)
#strip " from first and last column
df.iloc[:,0] = df.iloc[:,0].str.strip('"')
df.iloc[:,-1] = df.iloc[:,-1].str.strip('"')
#strip " from columns names
df.columns = df.columns.str.strip('"')
print (df.head())
SIGLA TARGA CATEGORIA TARIFFARIA - LIVELLO 3 SESSO \
0 null 1 M
1 null 1 M
2 null 1 M
3 null 1 M
4 null 1 M
RCA - PATTO PER I GIOVANI VALORE FRANCHIGIA TIPO TARGA CILINDRATA \
0 N NaN N 1108
1 N NaN N 1108
2 N NaN N 1108
3 N NaN N 1108
4 N NaN N 1108
CODICE FORMA CONTRATTUALE RCA - RECUPERO COMUNE PRA \
0 1 F205
1 1 F205
2 1 F205
3 1 F205
4 1 F205
CODICE WORKSITE MARKETING ... Unnamed: 55 Unnamed: 56 \
0 NaN ... NaN NaN
1 NaN ... NaN NaN
2 NaN ... NaN NaN
3 NaN ... NaN NaN
4 NaN ... NaN NaN
Unnamed: 57 Unnamed: 58 Unnamed: 59 Unnamed: 60 Unnamed: 61 Unnamed: 62 \
0 NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN
Unnamed: 63 PREMIO FINALE
0 NaN 778,69
1 NaN 778,70
2 NaN 778,71
3 NaN 778,72
4 NaN 778,73
[5 rows x 65 columns]
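Also note the last column uses a decimal comma (778,69). After the quotes are stripped as above, it can be converted to float; a sketch, assuming PREMIO FINALE is that column as in the output:
# Convert the decimal-comma column to float once the quotes are gone
df['PREMIO FINALE'] = (df['PREMIO FINALE']
                       .str.replace(',', '.', regex=False)
                       .astype(float))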
