I'm trying to join two dataframes. 'df' is my initial dataframe containing all the header information I require. 'row' is my first row of data that I want to append to 'df'.
df =
FName E1 E2 E3 E4 E5 E6
0 Nan 2 2 2 2 2 2
1 Nan 1 1 1 1 1 1
2 Nan 3 4 5 6 7 8
3 Nan 4 5 6 7 8 10
4 Nan 1002003004 1002004005 1002005006 1002006007 1002007008 1002008010
row =
0 1 2 3 4 5 6
0 501#_ZMB_2019-04-03_070528_reciprocals 30.0193 30.0193 30.0193 34.8858 34.8858 34.8858
I'm trying to create this:
FName E1 E2 E3 E4 E5 E6
0 Nan 2 2 2 2 2 2
1 Nan 1 1 1 1 1 1
2 Nan 3 4 5 6 7 8
3 Nan 4 5 6 7 8 10
4 Nan 1002003004 1002004005 1002005006 1002006007 1002007008 1002008010
5 501#_ZMB_2019-04-03_070528_reciprocals 30.0193 30.0193 30.0193 34.8858 34.8858 34.8858
I have tried the following:
df = df.append(row, ignore_index=True)
and
df = pd.concat([df, row], ignore_index=True)
Both of these result in the loss of all the data in the first df, which should contain all the header information.
0 1 2 3 4 5 6
0 Nan Nan Nan Nan Nan Nan Nan
1 Nan Nan Nan Nan Nan Nan Nan
2 Nan Nan Nan Nan Nan Nan Nan
3 Nan Nan Nan Nan Nan Nan Nan
4 Nan Nan Nan Nan Nan Nan Nan
5 501#_ZMB_2019-04-03_070528_reciprocals 30.0193 30.0193 30.0193 34.8858 34.8858 34.8858
I've also tried
df = pd.concat([df.reset_index(drop=True, inplace=True), row.reset_index(drop=True, inplace=True)])
Which produced the following Traceback
Traceback (most recent call last):
File "<ipython-input-146-3c1ecbd1987c>", line 1, in <module>
df = pd.concat([df.reset_index(drop=True, inplace=True), row.reset_index(drop=True, inplace=True)])
File "C:\Users\russells\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\reshape\concat.py", line 228, in concat
copy=copy, sort=sort)
File "C:\Users\russells\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\reshape\concat.py", line 280, in __init__
raise ValueError('All objects passed were None')
ValueError: All objects passed were None
Does anyone know what I'm doing wrong?
When you concatenate extra rows, pandas aligns on column labels, and df and row currently share no columns, so df's data ends up as all-NaN padding under row's integer column labels. A rename will get the job done. (As an aside, your last attempt raised ValueError because reset_index(..., inplace=True) returns None, so concat received [None, None].)
pd.concat([df, row.rename(columns=dict(zip(row.columns, df.columns)))],
          ignore_index=True)
FName E1 E2 E3 E4 E5 E6
0 Nan 2 2 2 2 2 2
1 Nan 1 1 1 1 1 1
2 Nan 3 4 5 6 7 8
3 Nan 4 5 6 7 8 10
4 Nan 1002003004 1002004005 1002005006 1002006007 1002007008 1002008010
5 501#_ZMB_2019-04-03_070528_reciprocals 30.0193 30.0193 30.0193 34.8858 34.8858 34.8858
Or if you just need to assign one row at the end and you have a RangeIndex on df:
df.loc[df.shape[0], :] = row.to_numpy()
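A minimal reproduction of both the failure mode and the fix, on toy data (set_axis is used here as an equivalent to the rename above):

```python
import pandas as pd

# df with named columns; row with default integer columns, as in the question
df = pd.DataFrame({'FName': [None, None], 'E1': [2, 1], 'E2': [2, 1]})
row = pd.DataFrame([['file_a', 30.0, 34.9]])  # columns are 0, 1, 2

# Plain concat aligns on column labels; with no overlap, every cell
# from df is NaN-padded under row's columns and vice versa
bad = pd.concat([df, row], ignore_index=True)   # 6 columns, mostly NaN

# Relabeling row's columns to match df first makes the rows line up
good = pd.concat([df, row.set_axis(df.columns, axis=1)], ignore_index=True)
```

The key point is that concat joins on labels, not positions, so the one-line relabel is what makes the data land in the right columns.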
I have an Excel file with multiple sheets in the format below. I need to create a single dataframe by concatenating all the sheets, unmerging the cells, and then transposing them into a column based on the sheet name.
Sheet 1:
Sheet 2:
Final Dataframe should look like below
Result expected - I need the below format, with an extra column as shown:
Code So far:
Reading File:
df = pd.concat(pd.read_excel('/Users/john/Desktop/Internal/Raw Files/Med/Dig/File_2017_2022.xlsx', sheet_name=None, skiprows=1))
Creating Column :
df_1 = pd.concat([df.assign(name=n) for n,df in dfs.items()])
Use read_excel with header=[0,1] for a MultiIndex built from the first two header rows, and index_col=[0,1] for a MultiIndex from the first two columns. Then, in a loop, reshape each sheet with DataFrame.stack and add a new column, combine everything with concat, and finally set the index names with DataFrame.rename_axis and convert them to columns with DataFrame.reset_index:
dfs = pd.read_excel('Input_V1.xlsx',sheet_name=None, header=[0,1], index_col=[0,1])
df_1 = (pd.concat([df.stack(0).assign(name=n) for n, df in dfs.items()])
          .rename_axis(index=['Date','WK','Brand'], columns=None)
          .reset_index())
df_1.insert(len(df_1.columns) - 2, 'Campaign', df_1.pop('Campaign'))
print (df_1)
Date WK Brand A B C D E F G \
0 2017-10-02 Week 40 ABC NaN NaN NaN NaN 56892.800000 83431.664000 NaN
1 2017-10-09 Week 41 ABC NaN NaN NaN NaN 0.713716 0.474025 NaN
2 2017-10-16 Week 42 ABC NaN NaN NaN NaN 0.025936 0.072500 NaN
3 2017-10-23 Week 43 ABC NaN NaN NaN NaN 0.182677 0.926731 NaN
4 2017-10-30 Week 44 ABC NaN NaN NaN NaN 0.755607 0.686115 NaN
.. ... ... ... .. .. .. .. ... ... ..
99 2018-03-26 Week 13 PQR NaN NaN NaN NaN 47702.000000 12246.000000 NaN
100 2018-04-02 Week 14 PQR NaN NaN NaN NaN 38768.000000 46498.000000 NaN
101 2018-04-09 Week 15 PQR NaN NaN NaN NaN 35917.000000 45329.000000 NaN
102 2018-04-16 Week 16 PQR NaN NaN NaN NaN 39639.000000 51343.000000 NaN
103 2018-04-23 Week 17 PQR NaN NaN NaN NaN 50867.000000 30119.000000 NaN
H I J K L Campaign name
0 NaN NaN NaN 0.017888 0.697324 NaN ABC
1 NaN NaN NaN 0.457963 0.810985 NaN ABC
2 NaN NaN NaN 0.743030 0.253668 NaN ABC
3 NaN NaN NaN 0.038683 0.050028 NaN ABC
4 NaN NaN NaN 0.885567 0.712333 NaN ABC
.. .. .. .. ... ... ... ...
99 NaN NaN NaN 9433.000000 17108.000000 WX PQR
100 NaN NaN NaN 12529.000000 23557.000000 WX PQR
101 NaN NaN NaN 20395.000000 44228.000000 WX PQR
102 NaN NaN NaN 55077.000000 45149.000000 WX PQR
103 NaN NaN NaN 45815.000000 35761.000000 WX PQR
[104 rows x 17 columns]
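The core reshape above is DataFrame.stack(0), which can be sketched on a toy frame with a two-level column header (brand and metric names invented for illustration):

```python
import pandas as pd

# Toy frame shaped like the result of header=[0, 1]:
# first column level = brand, second level = metric
cols = pd.MultiIndex.from_product([['ABC', 'PQR'], ['A', 'B']])
df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]],
                  index=['Week 40', 'Week 41'], columns=cols)

# stack(0) moves the first column level (brand) into the row index,
# leaving one column per metric
long_df = df.stack(0)
```

Each (week, brand) pair becomes one row, which is exactly the long format the answer builds per sheet before concatenating.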
I created my own version of your Excel file (screenshot omitted).
The code below is far from perfect, but it should do fine as long as you do not have millions of sheets.
# First, obtain all sheet names
full_df = pd.read_excel(r'C:\Users\.\Downloads\test.xlsx',
                        sheet_name=None, skiprows=0)
# Store them into a list
sheet_names = list(full_df.keys())
# Create an empty Dataframe to store the contents from each sheet
final_df = pd.DataFrame()
for sheet in sheet_names:
    df = pd.read_excel(r'C:\Users\.\Downloads\test.xlsx', sheet_name=sheet, skiprows=0)
    # Get the brand name
    brand = df.columns[1]
    # Remove the header columns and keep the numerical values only
    df.columns = df.iloc[0]
    df = df[1:]
    df = df.iloc[:, 1:]
    # Set the brand name into a new column
    df['Brand'] = brand
    # Append into the final dataframe
    final_df = pd.concat([final_df, df])
Your final_df should match the expected layout once exported back to Excel.
EDIT: You might need to drop the dataframe's index when saving, via final_df.reset_index(drop=True), so the old row labels are not written as an extra first column.
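An alternative to resetting the index afterwards, sketched here on hypothetical stand-in frames: pass ignore_index=True to concat so the combined frame gets a clean RangeIndex from the start.

```python
import pandas as pd

# Hypothetical per-sheet frames standing in for the Excel sheets
sheet_a = pd.DataFrame({'val': [1, 2], 'Brand': 'ABC'})
sheet_b = pd.DataFrame({'val': [3, 4], 'Brand': 'PQR'})

# ignore_index=True builds a fresh 0..n-1 index up front, so no
# reset_index(drop=True) is needed before exporting to Excel
final_df = pd.concat([sheet_a, sheet_b], ignore_index=True)
```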
I am trying to read a data file using pandas,
import pandas as pd
file_path = "/home/gopakumar/Downloads/test.DAT"
df = pd.read_csv(file_path, header=None, sep=';', engine='python',encoding="windows-1252")
and getting the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers.py", line 610, in read_csv
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers.py", line 468, in _read
return parser.read(nrows)
File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers.py", line 1057, in read
index, columns, col_dict = self._engine.read(nrows)
File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers.py", line 2496, in read
alldata = self._rows_to_cols(content)
File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers.py", line 3189, in _rows_to_cols
self._alert_malformed(msg, row_num + 1)
File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers.py", line 2948, in _alert_malformed
raise ParserError(msg)
pandas.errors.ParserError: Expected 5 fields in line 3, saw 6
From the error description, I understand that the file has a different number of columns in each row, but this is simply how the file is. Is there any way to read a file whose rows have different numbers of columns?
Following is a sample file:
0050;V2019.8.0.0;V2019.8.0.0;20200407;184821
0070;;7;0;7
0080;11;50;Abcd.pdf;Abcd;C:\Daten\Ablage\
0090;1;H;Holz;0;0;0;Holz;;;Holz
0090;1;Z;Abcdör;0;0;0;Abcd;;;Abcd
0090;1;N;Abcd;0;0;0;Abcd;;;Abcd
If you use header=None, pandas infers the number of columns from the first row, so the first row must have at least as many fields as any later row; here the first line is padded with trailing semicolons:
data = """
0050;V2019.8.0.0;V2019.8.0.0;20200407;184821;;;;;;;;;;;
0070;;7;0;7
0080;11;50;Abcd.pdf;Abcd;C:\Daten\Ablage\
0090;1;H;Holz;0;0;0;Holz;;;Holz
0090;1;Z;Abcdör;0;0;0;Abcd;;;Abcd
0090;1;N;Abcd;0;0;0;Abcd;;;Abcd
"""
from io import StringIO

df = pd.read_csv(StringIO(data), header=None, sep=';')
Output:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 50 V2019.8.0.0 V2019.8.0.0 20200407 184821 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 70 NaN 7 0 7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 80 11 50 Abcd.pdf Abcd C:\Daten\Ablage0090 1.0 H Holz 0.0 0 0.0 Holz NaN NaN Holz
3 90 1 Z Abcdör 0 0 0.0 Abcd NaN NaN Abcd NaN NaN NaN NaN NaN
4 90 1 N Abcd 0 0 0.0 Abcd NaN NaN Abcd NaN NaN NaN NaN NaN
Or, if you know how many columns the data has, you can pass explicit names:
cols = [f'col_{i}' for i in range(0,16)]
df = pd.read_csv(StringIO(data), names=cols, sep=';')
Output:
col_0 col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9 col_10 col_11 col_12 col_13 col_14 col_15
0 50 V2019.8.0.0 V2019.8.0.0 20200407 184821 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 70 NaN 7 0 7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 80 11 50 Abcd.pdf Abcd C:\Daten\Ablage0090 1.0 H Holz 0.0 0 0.0 Holz NaN NaN Holz
3 90 1 Z Abcdör 0 0 0.0 Abcd NaN NaN Abcd NaN NaN NaN NaN NaN
4 90 1 N Abcd 0 0 0.0 Abcd NaN NaN Abcd NaN NaN NaN NaN NaN
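The names= approach also works when reading straight from a buffer or file without padding the first row, which is the practical fix for the original file. A small sketch on three of the sample lines (the widest row has 11 fields, so 11 names suffice):

```python
import pandas as pd
from io import StringIO

data = """0050;V2019.8.0.0;V2019.8.0.0;20200407;184821
0070;;7;0;7
0090;1;H;Holz;0;0;0;Holz;;;Holz
"""

# Supplying names= fixes the column count explicitly, so shorter rows
# are padded with NaN instead of raising ParserError
cols = [f'col_{i}' for i in range(11)]
df = pd.read_csv(StringIO(data), names=cols, sep=';')
```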
This question already has answers here:
Python: Justifying NumPy array
(2 answers)
How to move Nan values to end in all columns
(2 answers)
Closed 1 year ago.
I have the following DF:
AA   BB   CC
1    1    1
NaN  3    NaN
4    4    6
NaN  NaN  3
NaN  NaN  NaN
NaN  NaN  NaN
NaN  NaN  4
The output should be:
AA BB CC
1 1 1
4 3 6
4 3
4
I've tried:
df = df.dropna(subset=['AA', 'BB', 'CC'])
AA BB CC
0 2 3 1
2 5 5 6
and this is the output I get.
Is there anything else I should be doing differently?
You can use:
df.apply(lambda x: x.dropna().reset_index(drop = True))
AA BB CC
0 1.0 1.0 1.0
1 4.0 3.0 6.0
2 NaN 4.0 3.0
3 NaN NaN 4.0
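A runnable sketch of this approach, on data abbreviated from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'AA': [1, np.nan, 4, np.nan],
                   'BB': [1, 3, 4, np.nan],
                   'CC': [1, np.nan, 6, 3]})

# Dropping NaNs per column and resetting each column's index shifts the
# surviving values to the top; shorter columns are re-padded with NaN
# when pandas realigns them into one frame
out = df.apply(lambda x: x.dropna().reset_index(drop=True))
```

Note the realignment step is what trims the all-NaN tail rows: the result is only as long as the longest column after dropping.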
I have a dataframe with several columns, some of which contain NaN values. For each row, I would like to create another column containing the total number of value columns minus the number of NaN values before the first non-NaN value (i.e., the count of columns from the first non-NaN value to the end, with 0 for an all-NaN row).
Original dataframe:
ID Value0 Value1 Value2 Value3
1 10 10 8 15
2 NaN 45 52 NaN
3 NaN NaN NaN NaN
4 NaN NaN 100 150
The extra column would look like:
ID NewColumn
1 4
2 3
3 0
4 2
Thanks in advance!
Set the index to ID
Attach a non-null column to stop/catch the argmax
Use argmax to find the first non-null value
Subtract those values from the length of the relevant columns
df.assign(
    NewColumn=df.shape[1] - 1 -
    df.set_index('ID').assign(notnull=1).notnull().values.argmax(1)
)
ID Value0 Value1 Value2 Value3 NewColumn
0 1 10.0 10.0 8.0 15.0 4
1 2 NaN 45.0 52.0 NaN 3
2 3 NaN NaN NaN NaN 0
3 4 NaN NaN 100.0 150.0 2
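An equivalent sketch without argmax, on the same data: a cumulative sum over the non-null mask is positive from the first non-null column onward, so the row sum counts exactly the columns from that point to the end.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'Value0': [10, np.nan, np.nan, np.nan],
                   'Value1': [10, 45, np.nan, np.nan],
                   'Value2': [8, 52, np.nan, 100],
                   'Value3': [15, np.nan, np.nan, 150]})

vals = df.set_index('ID')
# cumsum over the notna mask is > 0 from the first non-null column on;
# summing the booleans per row counts those columns (0 for all-NaN rows)
new_col = (vals.notna().cumsum(axis=1) > 0).sum(axis=1)
```

This variant needs no sentinel column, since an all-NaN row simply never crosses zero.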
I have a csv file like this:
ATTRIBUTE_1;.....;ATTRIBUTE_N
null;01;M;N;;N;1108;1;F205;;N;F;13;;N;S;2;N;6000000;;A010;40;B;2;10;42;N;;61;MI;01;N;N;S;;-1;N;N;01;;;;;;;;;;;;;;;;;;;;;;;;;;778,69
null;01;M;N;;N;1108;1;F205;;N;F;13;;N;S;2;N;6000000;;A010;40;B;2;10;42;N;;61;MI;01;N;N;S;;-1;N;N;01;;;;;;;;;;;;;;;;;;;;;;;;;;778,71
null;01;M;N;;N;1108;1;F205;;N;F;13;;N;S;2;N;6000000;;A010;40;B;2;10;42;N;;61;MI;01;N;N;S;;-1;N;N;01;;;;;;;;;;;;;;;;;;;;;;;;;;778,72
When I try to import it in Python with this command:
data = pd.read_csv(r"C:\...\file.csv")
My output is this:
0 null;01;M;N;;N;1108;1;F205;;N;F;13;;N;S;2;N;60...
How can I import the csv split into columns? Like this:
ATTRIBUTE_1 ATTRIBUTE_2 .... ATTRIBUTE_N
NULL 01 778,69
NULL 01 778,71
...
NULL 03 775,33
The problem is that each of your rows starts and ends with a " character, so the parameter quoting=3 is necessary; it means QUOTE_NONE:
df = pd.read_csv('file.csv', sep=';', quoting=3)
#strip " from first and last column
df.iloc[:,0] = df.iloc[:,0].str.strip('"')
df.iloc[:,-1] = df.iloc[:,-1].str.strip('"')
#strip " from columns names
df.columns = df.columns.str.strip('"')
print (df.head())
SIGLA TARGA CATEGORIA TARIFFARIA - LIVELLO 3 SESSO \
0 null 1 M
1 null 1 M
2 null 1 M
3 null 1 M
4 null 1 M
RCA - PATTO PER I GIOVANI VALORE FRANCHIGIA TIPO TARGA CILINDRATA \
0 N NaN N 1108
1 N NaN N 1108
2 N NaN N 1108
3 N NaN N 1108
4 N NaN N 1108
CODICE FORMA CONTRATTUALE RCA - RECUPERO COMUNE PRA \
0 1 F205
1 1 F205
2 1 F205
3 1 F205
4 1 F205
CODICE WORKSITE MARKETING ... Unnamed: 55 Unnamed: 56 \
0 NaN ... NaN NaN
1 NaN ... NaN NaN
2 NaN ... NaN NaN
3 NaN ... NaN NaN
4 NaN ... NaN NaN
Unnamed: 57 Unnamed: 58 Unnamed: 59 Unnamed: 60 Unnamed: 61 Unnamed: 62 \
0 NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN
Unnamed: 63 PREMIO FINALE
0 NaN 778,69
1 NaN 778,70
2 NaN 778,71
3 NaN 778,72
4 NaN 778,73
[5 rows x 65 columns]
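The effect of quoting=3 (csv.QUOTE_NONE) can be demonstrated on a tiny made-up sample in the same shape, each physical line wrapped in one pair of double quotes:

```python
import csv
import pandas as pd
from io import StringIO

# Each line is wrapped in a single pair of quotes, as in the question's file
data = '"A;B;C"\n"null;01;778,69"\n"null;01;778,71"\n'

# quoting=csv.QUOTE_NONE (the constant behind quoting=3) treats quotes as
# ordinary characters instead of turning each line into one quoted field
df = pd.read_csv(StringIO(data), sep=';', quoting=csv.QUOTE_NONE)

# The stray quotes survive only on the first and last fields of each line,
# so stripping them there (and from the header) cleans the frame
df.iloc[:, 0] = df.iloc[:, 0].str.strip('"')
df.iloc[:, -1] = df.iloc[:, -1].str.strip('"')
df.columns = df.columns.str.strip('"')
```

Without quoting=3, the default quoting would swallow the semicolons and produce the single-column result shown in the question.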