pandas read_csv create new column and usecols at the same time - python-3.x

I'm trying to load multiple csv files into a single dataframe df while:
adding column names
adding and populating a new column (Station)
excluding one of the columns (QD)
All of this works fine until I attempt to exclude a column with usecols, which throws the error Too many columns specified: expected 5 and found 4.
Is it possible to create a new column and pass usecols at the same time?
The reason I'm creating and populating a new 'Station' column during read_csv is that my dataframe will contain data from multiple stations. I can work around the error by doing read_csv in one statement and dropping the QD column in the next with df.drop('QD', axis=1, inplace=True), but I want to make sure I understand the most pandas-idiomatic way to do this.
Here's the code that throws the error:
df = pd.concat(pd.read_csv("http://lgdc.uml.edu/common/DIDBGetValues?ursiCode=" + row['StationCode'] + "&charName=MUFD&DMUF=3000",
                           skiprows=17,
                           delim_whitespace=True,
                           parse_dates=[0],
                           usecols=['Time', 'CS', 'MUFD', 'Station'],
                           names=['Time', 'CS', 'MUFD', 'QD', 'Station']
                           ).fillna(row['StationCode']
                           ).set_index(['Time', 'Station'])
               for index, row in stationdf.iterrows())
An example StationCode from stationdf is BC840.
Data sample: 2016-09-19T00:00:05.000Z 100 19.34 //

You can create a new column using operator chaining with assign:
df = pd.read_csv(...).assign(StationCode=row['StationCode'])
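Applied to the question's loop, a minimal sketch (untested against the live service, and assuming the same stationdf and URL pattern as above) would list only the columns the file actually contains in names, drop QD via usecols, and add the Station column with assign before setting the index:
import pandas as pd

frames = []
for index, row in stationdf.iterrows():
    url = ("http://lgdc.uml.edu/common/DIDBGetValues?ursiCode="
           + row['StationCode'] + "&charName=MUFD&DMUF=3000")
    frame = (pd.read_csv(url,
                         skiprows=17,
                         delim_whitespace=True,
                         names=['Time', 'CS', 'MUFD', 'QD'],   # the four columns in the file
                         usecols=['Time', 'CS', 'MUFD'],       # keep everything except QD
                         parse_dates=['Time'])
               .assign(Station=row['StationCode'])             # new column, one value per station
               .set_index(['Time', 'Station']))
    frames.append(frame)
df = pd.concat(frames)
Because the Station column is added after read_csv has parsed the file, names and usecols only have to describe what is actually on disk, which avoids the "expected 5 and found 4" error.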

Related

Pandas combining rows as header info

This is how I am reading and creating the dataframe with pandas
def get_sheet_data(sheet_name='SomeName'):
    df = pd.read_excel(f'{full_q_name}',
                       sheet_name=sheet_name,
                       header=[0, 1],
                       index_col=0)  # .fillna(method='ffill')
    df = df.swapaxes(axis1="index", axis2="columns")
    return df.set_index('Product Code')
Printing this in tabular form gives me the following (it could potentially have hundreds of columns):
I can't seem to add those first two rows into the header. I've tried:
python: pandas - How to combine first two rows of pandas dataframe to dataframe header? https://stackoverflow.com/questions/59837241/combine-first-row-and-header-with-pandas
and I'm failing at each point. I think it's because of the multiindex, not necessarily the axis swap? But using https://pandas.pydata.org/docs/reference/api/pandas.MultiIndex.html is kind of going over my head right now. Please help me add those two rows into the header.
The output of df.columns is massive, so I've cut it down a lot:
Index(['Product Code','Product Narrative\nHigh-level service description','Product Name','Huawei Product ID','Type','Bill Cycle Alignment',nan,'Stackable',nan,
and ends with:
nan], dtype='object')
Create new column names and assign them to df.columns. The new names are built by joining the two MultiIndex header levels with the first row of the DataFrame:
df.columns = ['_'.join(i) for i in zip(df.columns.get_level_values(0).tolist(),
                                       df.columns.get_level_values(1).tolist(),
                                       df.iloc[0, :].replace(np.nan, '').tolist())]
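As a self-contained illustration on a small made-up frame (the real sheet is not shown in the question, so the column names below are hypothetical); after flattening the header you would usually also drop the first data row that was absorbed into it:
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's sheet: a two-level header plus a
# first data row that still carries header text (with a NaN gap).
df = pd.DataFrame(
    [['High-level service description', 'Internal ID', np.nan],
     ['Broadband', 'HW-001', 'Yes'],
     ['Voice', 'HW-002', 'No']],
    columns=pd.MultiIndex.from_tuples([('Product', 'Narrative'),
                                       ('Huawei Product', 'ID'),
                                       ('Stackable', 'Flag')]))

# Join level 0, level 1 and the first data row into flat column names.
df.columns = ['_'.join(i) for i in zip(df.columns.get_level_values(0).tolist(),
                                       df.columns.get_level_values(1).tolist(),
                                       df.iloc[0, :].replace(np.nan, '').tolist())]

# Drop the row that has been absorbed into the header.
df = df.iloc[1:]
print(df.columns.tolist())
# ['Product_Narrative_High-level service description', 'Huawei Product_ID_Internal ID', 'Stackable_Flag_']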

Pandas - concatenating multiple excel files with different names but the same data type

I have about 50 Excel files with the '.xlsb' extension. I'd like to concatenate a specific worksheet from each into a single pandas DataFrame (all the worksheet names are the same). The problem is that the column names are not exactly the same in each worksheet. The code I wrote concatenates values into the same DataFrame column based on the column name. So, for example, sometimes the column is called FgsNr and sometimes FgNr; the datatype and the meaning of both columns are exactly the same and I would like them in one DataFrame column, but pandas creates two separate columns and stacks together only the values that come from columns with the same name.
files = glob(r'C:\Users\Folder\*xlsb')
for file in files:
    Datafile = pd.concat(pd.read_excel(file, engine='pyxlsb', sheet_name='Sheet1', usecols='A:F', header=0) for file in files)
How could I correct the code so it copies and concatenates all values based on their column position in Excel, ignoring the column names?
When concatenating multiple dataframes with the same format, you can use the below snippet for speed and efficiency.
The basic logic is that you put them into a list, and then concatenate at the final stage.
files = glob(r'C:\Users\Folder\*xlsb')
dfs = []
for file in files:
    df = pd.read_excel(file, engine='pyxlsb', sheet_name='Sheet1', usecols='A:F', header=0)
    dfs.append(df)
large_df = pd.concat(dfs, ignore_index=True)
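If the headers differ between files (FgsNr versus FgNr in the question), one possible refinement, sketched here under the assumption that every sheet has the same six columns in the same order, is to rename each frame's columns positionally before appending, so concat aligns them regardless of header text:
from glob import glob
import pandas as pd

files = glob(r'C:\Users\Folder\*xlsb')

canonical_columns = None
dfs = []
for file in files:
    df = pd.read_excel(file, engine='pyxlsb', sheet_name='Sheet1', usecols='A:F', header=0)
    if canonical_columns is None:
        canonical_columns = df.columns   # take the first file's headers as the reference
    df.columns = canonical_columns       # align by position, ignoring this sheet's own names
    dfs.append(df)

large_df = pd.concat(dfs, ignore_index=True)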
Also refer to the below:
Creating an empty Pandas DataFrame, then filling it?

Write Array and Variable to Dataframe

I have an array in the format [27.214 27.566] - there can be several numbers. Additionally I have a Datetime variable.
now=datetime.now()
datetime=now.strftime('%Y-%m-%d %H:%M:%S')
time.sleep(0.5)
agilent.write("MEAS:TEMP? (#101:102)")
values = np.fromstring(agilent.read(), dtype=float, sep=',')
The output from the array is [27.214 27.566]
Now I would like to write this into a dataframe with the following structure:
Datetime, FirstValueArray, SecondValueArray, ....
How to do this? In the dataframe every one minute a new array is added.
I will assume you want to append a row to an existing dataframe df with appropriate columns: value1, value2, ..., lastvalue, datetime.
We can easily convert the array to a series:
s = pd.Series(array)
What you want to do next is append the datetime value to the series. Note that append expects a series (not a bare scalar) and returns a new series rather than modifying s in place:
s = s.append(pd.Series([datetime]), ignore_index=True) cf Series.append
Now you have a series whose length matches df.columns. You want to convert that series to a dataframe to be able to use pd.concat:
df_to_append = s.to_frame().T
We need the transpose because Series.to_frame() returns a dataframe with the series as a single column, and we want a single row with multiple columns.
Before you concatenate, however, you need to make sure both dataframes' column names match, or the concat will create additional columns:
df_to_append.columns = df.columns
Now we can concatenate the two dataframes:
df = pd.concat([df, df_to_append], ignore_index=True) cf pandas.concat
For further details, see the documentation
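Putting the steps together, here is a minimal self-contained sketch. It assumes two measured values per reading and uses a literal array in place of the instrument read from the question; it also extends the series with pd.concat, which keeps it working on pandas 2.0+ where Series.append has been removed:
from datetime import datetime

import numpy as np
import pandas as pd

# Pre-existing frame with one column per array element plus the timestamp.
df = pd.DataFrame(columns=['value1', 'value2', 'datetime'])

values = np.array([27.214, 27.566])                       # stand-in for np.fromstring(agilent.read(), ...)
timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

s = pd.concat([pd.Series(values), pd.Series([timestamp])], ignore_index=True)
df_to_append = s.to_frame().T        # one row, len(df.columns) columns
df_to_append.columns = df.columns
df = pd.concat([df, df_to_append], ignore_index=True)
print(df)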

Get n rows based on column filter in a Dataframe pandas

I have a dataframe df as below.
I want the final dataframe to be as follows, i.e. for each unique Name only the last 2 rows must be present in the final output.
I tried the following snippet but it's not working.
df = df[df['Name']].tail(2)
Use GroupBy.tail:
df1 = df.groupby('Name').tail(2)
Just one more way to solve this using GroupBy.nth:
df1 = df.groupby('Name').nth([-1, -2])  # this will pick the last 2 rows
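A quick illustration on made-up data (the dataframe from the question is not shown, so the values here are hypothetical):
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'Value': [1, 2, 3, 4, 5, 6, 7]})

print(df.groupby('Name').tail(2))
#   Name  Value
# 1    A      2
# 2    A      3
# 5    B      6
# 6    B      7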

Only adding new rows from a dataframe to a csv file

Each day I get a pandas dataframe that has five columns called column1, column2, column3, column4, column5. I want to add rows that I previously did not receive to a file where I keep the unique rows, called known_data.csv. In order to do so, I wrote some code that should
Load the data from known_data.csv as a dataframe called existing_data
Add a new column called 'existing' to the existing_data df
Merge the old existing_data dataframe with the dataframe called new_data on the five columns
Check whether new_data contains new rows by looking at merge[merge.existing.isnull()] (the complement of the new data and the existing data)
Append the new rows to the known_data.csv file
My code looks like this
existing_data = pd.read_csv("known_data.csv")
existing_data['existing'] = 'yes'
merge_data = pd.merge(new_data, existing_data, on = ['column1', 'column2', 'column3', 'column4', 'column5'], how = 'left')
complement = merge_data[merge_data.existing.isnull()]
del complement['existing']
complement.to_csv("known_data.csv", mode='a', index=False, header=False)
Unfortunately, this code does not function as expected: the complement is never empty. Even when I receive data that has already been recorded in known_data.csv, some of the rows of new_data are appended to the file anyway.
Question: What am I doing wrong? How can I solve this problem? Does it have to do with the way I'm reading the file and write to the file?
Edit: Adding a new column called existing to the existing_data dataframe is probably not the best way of checking the complement between existing_data and new_data. If anyone has a better suggestion that would be hugely appreciated!
Edit2: The problem was that although the dataframes looked identical, there were some values that were of a different type. Somehow this error only showed when I tried to merge a subset of the new dataframe for which this was the case.
I think what you are looking for is a concat operation followed by drop_duplicates.
# Concat the two dataframes into a new dataframe holding all the data (memory intensive):
complement = pd.concat([existing_data, new_data], ignore_index=True)
# Remove all duplicates:
complement.drop_duplicates(inplace=True, keep=False)
This will first create a dataframe holding all the old and new data and in a second step will delete all duplicate entries. You can also specify certain columns on which to compare the duplicate values only!
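For example, to decide duplicates using only the five data columns named in the question, the second step could instead be written as:
# Only the five data columns determine whether two rows count as duplicates.
complement.drop_duplicates(subset=['column1', 'column2', 'column3', 'column4', 'column5'],
                           keep=False, inplace=True)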
See the documentation here:
concat: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
drop_duplicates: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
