I have a DataFrame like this:
A B C D E F G H I J
0 A.1 Data Data 223 52
1 A.2 Data Data Data 12 6
2 A.4 Data 32 365
3 A.5 Data 100 88
4 A.6 Data 654 98
5 A.7 Data 356 56
And my desired output is like this:
A B C D E F G H I J
0 A.1 Data Data 223 52
1 A.2 Data Data Data 12 6
2 A.4 Data 32 365
3 A.5 Data 100 88
4 A.6 Data 654 98
5 A.7 Data 356 56
Only columns A through E should have their nulls shifted. I have a script using a lambda, but it shifts the null values to the last columns across the whole dataframe. I need it for certain columns only; can anyone help me? Thank you!
import numpy as np

def shift_null(arr):
    # x == x is False only for NaN, so non-missing values come first, NaNs last
    return [x for x in arr if x == x] + [np.nan for x in arr if x != x]

df = df.T.apply(lambda arr: shift_null(arr)).T
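For reference, here is a minimal sample DataFrame to test against. The exact NaN positions in the tables above did not survive formatting, so their placement below is an assumption, chosen only to be consistent with the shifted outputs printed in the answers that follow:

import numpy as np
import pandas as pd

# hypothetical reconstruction of the sample data; the NaN placement is assumed
df = pd.DataFrame({
    'A': ['A.1', 'A.2', 'A.4', 'A.5', 'A.6', 'A.7'],
    'B': ['Data', 'Data', np.nan, np.nan, np.nan, np.nan],
    'C': [np.nan, 'Data', 'Data', 'Data', 'Data', 'Data'],
    'D': ['Data', 'Data', np.nan, np.nan, np.nan, np.nan],
    'E': [np.nan] * 6,
    'F': [223, 12, 32, 100, 654, 356],
    'G': [52, 6, 365, 88, 98, 56],
})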
You can remove the missing values per row with Series.dropna, re-add any columns that end up entirely missing with DataFrame.reindex, and then set the column names with DataFrame.set_axis:
cols = ['A','B','C','D','E']
df[cols] = (df[cols].apply(lambda x: pd.Series(x.dropna().tolist()), axis=1)
                    .reindex(range(len(cols)), axis=1)
                    .set_axis(cols, axis=1))
print (df)
A B C D E F G
0 A.1 Data Data NaN NaN 223 52
1 A.2 Data Data Data NaN 12 6
2 A.4 Data NaN NaN NaN 32 365
3 A.5 Data NaN NaN NaN 100 88
4 A.6 Data NaN NaN NaN 654 98
5 A.7 Data NaN NaN NaN 356 56
Your solution can be kept with two changes: remove the transposing and pass result_type='expand' to DataFrame.apply:
cols = ['A','B','C','D','E']
def shift_null(arr):
    return [x for x in arr if x == x] + [np.nan for x in arr if x != x]

df[cols] = df[cols].apply(lambda arr: shift_null(arr), axis=1, result_type='expand')
print (df)
A B C D E F G
0 A.1 Data Data NaN NaN 223 52
1 A.2 Data Data Data NaN 12 6
2 A.4 Data NaN NaN NaN 32 365
3 A.5 Data NaN NaN NaN 100 88
4 A.6 Data NaN NaN NaN 654 98
5 A.7 Data NaN NaN NaN 356 56
Another idea is sorting with the key parameter of Series.sort_values:
cols = ['A','B','C','D','E']
# a stable sort keeps the original order of the non-missing values in each row
df[cols] = df[cols].apply(lambda x: x.sort_values(key=lambda s: s.isna(), kind='stable').tolist(),
                          axis=1, result_type='expand')
print (df)
A B C D E F G
0 A.1 Data Data NaN NaN 223 52
1 A.2 Data Data Data NaN 12 6
2 A.4 Data NaN NaN NaN 32 365
3 A.5 Data NaN NaN NaN 100 88
4 A.6 Data NaN NaN NaN 654 98
5 A.7 Data NaN NaN NaN 356 56
Another solution reshapes with DataFrame.stack (which drops the missing values), adds a counter for the new column names, and then reshapes back with Series.unstack:
s = df[cols].stack().droplevel(1)                   # stack drops NaNs; drop the column level
s.index = [s.index, s.groupby(level=0).cumcount()]  # per-row counter for the new columns
df[cols] = s.unstack().rename(dict(enumerate(cols)), axis=1).reindex(cols, axis=1)
print (df)
A B C D E F G
0 A.1 Data Data NaN NaN 223 52
1 A.2 Data Data Data NaN 12 6
2 A.4 Data NaN NaN NaN 32 365
3 A.5 Data NaN NaN NaN 100 88
4 A.6 Data NaN NaN NaN 654 98
5 A.7 Data NaN NaN NaN 356 56
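For larger frames, a vectorized NumPy alternative (a sketch of the well-known "justify" approach; not part of the original answer) avoids the row-wise apply entirely:

import numpy as np

def justify_left(a):
    # a == a is False exactly where a holds NaN, so mask flags the non-missing cells
    mask = a == a
    # sorting the boolean mask per row and flipping moves the True slots to the left
    justified = np.sort(mask, axis=1)[:, ::-1]
    out = np.full(a.shape, np.nan, dtype=object)
    # boolean assignment runs in row-major order, so per-row value order is kept
    out[justified] = a[mask]
    return out

cols = ['A','B','C','D','E']
df[cols] = justify_left(df[cols].to_numpy(dtype=object))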
Related
I have an Excel file with multiple sheets in the format below. I need to create a single dataframe by concatenating all the sheets, unmerging the cells, and then transposing them into a column based on the sheet name.
Sheet 1: [screenshot in original post]
Sheet 2: [screenshot in original post]
The final dataframe should look like the expected result below: I need that format with an extra column identifying the sheet.
Code so far:
Reading the file:
dfs = pd.read_excel('/Users/john/Desktop/Internal/Raw Files/Med/Dig/File_2017_2022.xlsx', sheet_name=None, skiprows=1)
Creating the column:
df_1 = pd.concat([df.assign(name=n) for n,df in dfs.items()])
Use read_excel with header=[0,1] for a MultiIndex from the first two header rows and index_col=[0,1] for a MultiIndex from the first two columns. Then it is possible to reshape each sheet in a loop with DataFrame.stack, add a new column, join everything with concat, and finally set the index names with DataFrame.rename_axis and convert them to columns with DataFrame.reset_index:
dfs = pd.read_excel('Input_V1.xlsx', sheet_name=None, header=[0,1], index_col=[0,1])

df_1 = (pd.concat([df.stack(0).assign(name=n) for n, df in dfs.items()])
          .rename_axis(index=['Date','WK','Brand'], columns=None)
          .reset_index())
df_1.insert(len(df_1.columns) - 2, 'Campaign', df_1.pop('Campaign'))
print (df_1)
Date WK Brand A B C D E F G \
0 2017-10-02 Week 40 ABC NaN NaN NaN NaN 56892.800000 83431.664000 NaN
1 2017-10-09 Week 41 ABC NaN NaN NaN NaN 0.713716 0.474025 NaN
2 2017-10-16 Week 42 ABC NaN NaN NaN NaN 0.025936 0.072500 NaN
3 2017-10-23 Week 43 ABC NaN NaN NaN NaN 0.182677 0.926731 NaN
4 2017-10-30 Week 44 ABC NaN NaN NaN NaN 0.755607 0.686115 NaN
.. ... ... ... .. .. .. .. ... ... ..
99 2018-03-26 Week 13 PQR NaN NaN NaN NaN 47702.000000 12246.000000 NaN
100 2018-04-02 Week 14 PQR NaN NaN NaN NaN 38768.000000 46498.000000 NaN
101 2018-04-09 Week 15 PQR NaN NaN NaN NaN 35917.000000 45329.000000 NaN
102 2018-04-16 Week 16 PQR NaN NaN NaN NaN 39639.000000 51343.000000 NaN
103 2018-04-23 Week 17 PQR NaN NaN NaN NaN 50867.000000 30119.000000 NaN
H I J K L Campaign name
0 NaN NaN NaN 0.017888 0.697324 NaN ABC
1 NaN NaN NaN 0.457963 0.810985 NaN ABC
2 NaN NaN NaN 0.743030 0.253668 NaN ABC
3 NaN NaN NaN 0.038683 0.050028 NaN ABC
4 NaN NaN NaN 0.885567 0.712333 NaN ABC
.. .. .. .. ... ... ... ...
99 NaN NaN NaN 9433.000000 17108.000000 WX PQR
100 NaN NaN NaN 12529.000000 23557.000000 WX PQR
101 NaN NaN NaN 20395.000000 44228.000000 WX PQR
102 NaN NaN NaN 55077.000000 45149.000000 WX PQR
103 NaN NaN NaN 45815.000000 35761.000000 WX PQR
[104 rows x 17 columns]
I created my own version of your Excel file, which looks like this: [screenshot in original post]. The code below is far from perfect, but it should do fine as long as you do not have millions of sheets.
# First, obtain all sheet names
full_df = pd.read_excel(r'C:\Users\.\Downloads\test.xlsx',
                        sheet_name=None, skiprows=0)

# Store them into a list
sheet_names = list(full_df.keys())

# Create an empty Dataframe to store the contents from each sheet
final_df = pd.DataFrame()

for sheet in sheet_names:
    df = pd.read_excel(r'C:\Users\.\Downloads\test.xlsx', sheet_name=sheet, skiprows=0)
    # Get the brand name
    brand = df.columns[1]
    # Remove the header columns and keep the numerical values only
    df.columns = df.iloc[0]
    df = df[1:]
    df = df.iloc[:, 1:]
    # Set the brand name into a new column
    df['Brand'] = brand
    # Append into the final dataframe
    final_df = pd.concat([final_df, df])
Your final_df should look like this once exported back to Excel: [screenshot in original post]
EDIT: You might need to drop the dataframe's index before saving, using final_df.reset_index(drop=True), to remove the extra index column shown in the image right above.
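A minimal sketch of that save step (the output file name here is hypothetical):

# drop the leftover per-sheet indices before exporting
final_df = final_df.reset_index(drop=True)
final_df.to_excel('combined.xlsx', index=False)  # 'combined.xlsx' is a placeholder name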
I have this input dataframe:
Test1 Test2 Test3 Subject
0 45 NaN NaN Python
1 50 NaN NaN Python
2 NaN 30 NaN Python
3 NaN 35 NaN OS
4 NaN 38 NaN OS
5 NaN 43 NaN Java
6 NaN 32 NaN DS
7 NaN NaN 49 DS
8 NaN 25 NaN DS
9 NaN 34 NaN DS
Expected output is (Dataframe):
Subject  Test1  Test2     Test3
Python   45,50  30
OS              35,38
Java            43
DS              32,25,34  49
I've tried this code:
df.groupby(['subject']).sum().reset_index().assign(
    subject=lambda x: x['subject'].where(~x['subject'].duplicated(), '')
).to_csv('filename.csv', index=False)
It's not giving the desired output.
Use a custom function that removes the missing values with Series.dropna, converts to integer if necessary, and then, if there are numeric values, converts them to strings and joins them:
f = lambda x: ','.join(x.dropna().astype(int).astype(str))
df = df.groupby('Subject', sort=False).agg(f).reset_index()
print (df)
  Subject  Test1     Test2  Test3
0  Python  45,50        30
1      OS            35,38
2    Java               43
3      DS         32,25,34     49
Another idea, without converting to integers, works if the values come in many different formats (e.g. some columns numeric and some strings):
f = lambda x: ','.join(x.dropna().astype(str))
df = df.groupby('Subject', sort=False).agg(f).reset_index()
print (df)
  Subject      Test1           Test2  Test3
0  Python  45.0,50.0            30.0
1      OS                  35.0,38.0
2    Java                       43.0
3      DS             32.0,25.0,34.0   49.0
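If you want to support mixed types but still avoid the trailing .0 on whole-number floats, a hedged variant (an assumption on my part, not from the original answer) can normalize each value before joining:

# render whole-number floats like 45.0 as '45', leave everything else as plain strings
f = lambda x: ','.join(str(int(v)) if isinstance(v, float) and v.is_integer() else str(v)
                       for v in x.dropna())
df = df.groupby('Subject', sort=False).agg(f).reset_index()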
I have a df as shown below.
Date t_factor
2020-02-01 5
2020-02-03 23
2020-02-06 14
2020-02-09 23
2020-02-10 23
2020-02-11 23
2020-02-13 30
2020-02-20 29
2020-02-29 100
2020-03-01 38
2020-03-10 38
2020-03-11 38
2020-03-26 70
2020-03-29 70
From that, I would like to create a function that calculates a column called t_function based on the calculated values t1, t2 and t3, where the input parameters are stored in a dictionary as shown below.
d1 = {'b1': {'s': '2020-02-01', 'e': '2020-02-06', 'coef': [3, 1, 0]},
      'b2': {'s': '2020-02-13', 'e': '2020-02-29', 'coef': [2, 0, 1]},
      'b3': {'s': '2020-03-11', 'e': '2020-03-29', 'coef': [4, 0, 0]}}
Expected output:
Date t_factor t1 t2 t3 t_function
2020-02-01 5 4 NaN NaN 4
2020-02-03 23 6 NaN NaN 6
2020-02-06 14 9 NaN NaN 9
2020-02-09 23 NaN NaN NaN 0
2020-02-10 23 NaN NaN NaN 0
2020-02-11 23 NaN NaN NaN 0
2020-02-13 30 NaN 3 NaN 3
2020-02-20 29 NaN 66 NaN 66
2020-02-29 100 NaN 291 NaN 291
2020-03-01 38 NaN NaN NaN 0
2020-03-10 38 NaN NaN NaN 0
2020-03-11 38 NaN NaN 4 4
2020-03-26 70 NaN NaN 4 4
2020-03-29 70 NaN NaN 4 4
I tried the code below:
from datetime import datetime
import numpy as np

def fun(x, start="2020-02-01", end="2020-02-06", a0=3, a1=1, a2=0):
    start = datetime.strptime(start, "%Y-%m-%d")
    end = datetime.strptime(end, "%Y-%m-%d")
    if start <= x.Date <= end:
        # day offset inside the window, starting at 1
        t = (x.Date - start) / np.timedelta64(1, 'D') + 1
        diff = a0 + a1*t + a2*t**2
    else:
        diff = np.nan
    return diff

df["t1"] = df.apply(lambda x: fun(x), axis=1)
df["t2"] = df.apply(lambda x: fun(x, "2020-02-13", "2020-02-29", 2, 0, 1), axis=1)
df["t3"] = df.apply(lambda x: fun(x, "2020-03-11", "2020-03-29", 4, 0, 0), axis=1)
df["t_function"] = df['t1'].fillna(0) + df['t2'].fillna(0) + df['t3'].fillna(0)
I would like to change the code above to loop over the dictionary d1.
Note:
The dictionary d1 may have more than three keys, such as 'b1', 'b2', 'b3', 'b4'; then we have to create t1, t2, t3 and t4 columns. I would like to automate this by looping over the dictionary d1.
I would propose that you store the data as a list of tuples, like so:
params = [('2020-02-01', '2020-02-06', 3, 1, 0),
          ('2020-02-13', '2020-02-29', 2, 0, 1),
          ('2020-03-11', '2020-03-29', 4, 0, 0)]
Now all you need is to loop over params and add the columns to your dataframe df.
total = None
for i, param in enumerate(params):
    s, e, a0, a1, a2 = param
    df[f"t{i+1}"] = df.apply(lambda x: fun(x, s, e, a0, a1, a2), axis=1)
    if i == 0:
        total = df[f"t{i+1}"].fillna(0)
    else:
        total += df[f"t{i+1}"].fillna(0)
df["t_function"] = total
This gives the desired output:
Date t_factor t1 t2 t3 t_function
0 2020-02-01 5 4.0 NaN NaN 4.0
1 2020-02-03 23 6.0 NaN NaN 6.0
2 2020-02-06 14 9.0 NaN NaN 9.0
3 2020-02-09 23 NaN NaN NaN 0.0
4 2020-02-10 23 NaN NaN NaN 0.0
5 2020-02-11 23 NaN NaN NaN 0.0
6 2020-02-13 30 NaN 3.0 NaN 3.0
7 2020-02-20 29 NaN 66.0 NaN 66.0
8 2020-02-29 100 NaN 291.0 NaN 291.0
9 2020-03-01 38 NaN NaN NaN 0.0
10 2020-03-10 38 NaN NaN NaN 0.0
11 2020-03-11 38 NaN NaN 4.0 4.0
12 2020-03-26 70 NaN NaN 4.0 4.0
13 2020-03-29 70 NaN NaN 4.0 4.0
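If you prefer to keep the original dictionary d1 instead of converting it to tuples, a minimal sketch loops directly over its values (using the same fun as above; dicts preserve insertion order in Python 3.7+, so t1..tn follow b1..bn):

total = None
for i, v in enumerate(d1.values()):
    a0, a1, a2 = v['coef']
    df[f"t{i+1}"] = df.apply(lambda x: fun(x, v['s'], v['e'], a0, a1, a2), axis=1)
    # accumulate the per-window contributions, treating NaN as 0
    total = df[f"t{i+1}"].fillna(0) if total is None else total + df[f"t{i+1}"].fillna(0)
df["t_function"] = total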
I want to do a left join with two pandas dataframes, d1 and d2. However, after the join I want one column's values to replace the NULL values in another column. Here's my first dataset, d1:
vehicle_type vehicle_id sales margin
a 11 200 0.1
b 22 150 0.2
c NaN NaN NaN
d NaN NaN NaN
And my second dataset, d2:
vehicle_type vehicle_id sales alignment
c 33 210 x
d 44 300 y
I would like the final result to be like the following, where the left join replaces the null vehicle IDs and sales in d1:
vehicle_type vehicle_id sales margin alignment
a 11 200 0.1 NaN
b 22 150 0.2 NaN
c 33 210 NaN x
d 44 300 NaN y
I'm using the following code, but it is not working:
D3 = D1.merge(D2, on='vehicle_type', how='left')
Use DataFrame.combine_first with DataFrame.set_index to correctly align the DataFrames by the vehicle_type column:
df3 = (df1.set_index('vehicle_type')
          .combine_first(df2.set_index('vehicle_type'))
          .reset_index())
print (df3)
vehicle_type alignment margin sales vehicle_id
0 a NaN 0.1 200.0 11.0
1 b NaN 0.2 150.0 22.0
2 c x NaN 210.0 33.0
3 d y NaN 300.0 44.0
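Note that combine_first sorts the columns alphabetically, as the output above shows. If the original order matters, a quick reorder (column list taken from the desired output above) can follow:

# restore the column order from the expected result
df3 = df3[['vehicle_type', 'vehicle_id', 'sales', 'margin', 'alignment']]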
I have a csv file like this:
ATTRIBUTE_1;.....;ATTRIBUTE_N
null;01;M;N;;N;1108;1;F205;;N;F;13;;N;S;2;N;6000000;;A010;40;B;2;10;42;N;;61;MI;01;N;N;S;;-1;N;N;01;;;;;;;;;;;;;;;;;;;;;;;;;;778,69
null;01;M;N;;N;1108;1;F205;;N;F;13;;N;S;2;N;6000000;;A010;40;B;2;10;42;N;;61;MI;01;N;N;S;;-1;N;N;01;;;;;;;;;;;;;;;;;;;;;;;;;;778,71
null;01;M;N;;N;1108;1;F205;;N;F;13;;N;S;2;N;6000000;;A010;40;B;2;10;42;N;;61;MI;01;N;N;S;;-1;N;N;01;;;;;;;;;;;;;;;;;;;;;;;;;;778,72
When I try to import it in Python with this command:
data = pd.read_csv(r"C:\...\file.csv")
My output is this:
0 null;01;M;N;;N;1108;1;F205;;N;F;13;;N;S;2;N;60...
How can I import the CSV into separate columns? Like this:
ATTRIBUTE_1 ATTRIBUTE_2 .... ATTRIBUTE_N
NULL 01 778,69
NULL 01 778,71
...
NULL 03 775,33
The problem is that each of your rows starts and ends with a double quote ("), so the parameter quoting=3 is necessary; it means csv.QUOTE_NONE:
df = pd.read_csv('file.csv', sep=';', quoting=3)

# strip " from the first and last columns
df.iloc[:, 0] = df.iloc[:, 0].str.strip('"')
df.iloc[:, -1] = df.iloc[:, -1].str.strip('"')

# strip " from the column names
df.columns = df.columns.str.strip('"')

print (df.head())
SIGLA TARGA CATEGORIA TARIFFARIA - LIVELLO 3 SESSO \
0 null 1 M
1 null 1 M
2 null 1 M
3 null 1 M
4 null 1 M
RCA - PATTO PER I GIOVANI VALORE FRANCHIGIA TIPO TARGA CILINDRATA \
0 N NaN N 1108
1 N NaN N 1108
2 N NaN N 1108
3 N NaN N 1108
4 N NaN N 1108
CODICE FORMA CONTRATTUALE RCA - RECUPERO COMUNE PRA \
0 1 F205
1 1 F205
2 1 F205
3 1 F205
4 1 F205
CODICE WORKSITE MARKETING ... Unnamed: 55 Unnamed: 56 \
0 NaN ... NaN NaN
1 NaN ... NaN NaN
2 NaN ... NaN NaN
3 NaN ... NaN NaN
4 NaN ... NaN NaN
Unnamed: 57 Unnamed: 58 Unnamed: 59 Unnamed: 60 Unnamed: 61 Unnamed: 62 \
0 NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN
Unnamed: 63 PREMIO FINALE
0 NaN 778,69
1 NaN 778,70
2 NaN 778,71
3 NaN 778,72
4 NaN 778,73
[5 rows x 65 columns]
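If the last column should be numeric, a small follow-up sketch (assuming the column name PREMIO FINALE from the printed output above) converts the decimal comma after the quote stripping:

# parse the decimal-comma values, e.g. '778,69' -> 778.69
df['PREMIO FINALE'] = df['PREMIO FINALE'].str.replace(',', '.', regex=False).astype(float)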