Each day I receive a pandas DataFrame that has five columns called column1, column2, column3, column4, and column5. I want to append the rows I have not previously received to a file called known_data.csv, where I keep the unique rows. To do so, I wrote some code that should:
1. Load the data from known_data.csv as a dataframe called existing_data
2. Add a new column called 'existing' to the existing_data dataframe
3. Merge the existing_data dataframe with the dataframe called new_data on the five columns
4. Check whether new_data contains new rows by looking at merge_data[merge_data.existing.isnull()] (the rows of new_data that are not in existing_data)
5. Append the new rows to the known_data.csv file
My code looks like this
existing_data = pd.read_csv("known_data.csv")
existing_data['existing'] = 'yes'
merge_data = pd.merge(new_data, existing_data, on = ['column1', 'column2', 'column3', 'column4', 'column5'], how = 'left')
complement = merge_data[merge_data.existing.isnull()]
del complement['existing']
complement.to_csv("known_data.csv", mode='a', index=False, header=False)
Unfortunately, this code does not function as expected: the complement is never empty. Even when I receive data that has already been recorded in known_data.csv, some of the rows of new_data are appended to the file anyway.
Question: What am I doing wrong? How can I solve this problem? Does it have to do with the way I'm reading from and writing to the file?
Edit: Adding a new column called existing to the existing_data dataframe is probably not the best way of checking the complement between existing_data and new_data. If anyone has a better suggestion that would be hugely appreciated!
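For what it's worth, pandas' merge has a built-in indicator parameter that makes the helper column unnecessary. A minimal sketch, with two toy key columns standing in for the five in the question:

```python
import pandas as pd

existing_data = pd.DataFrame({"column1": [1, 2], "column2": ["a", "b"]})
new_data = pd.DataFrame({"column1": [2, 3], "column2": ["b", "c"]})

# indicator=True adds a _merge column marking where each row came from,
# so rows present only in new_data can be selected directly.
merged = new_data.merge(existing_data, how="left", indicator=True)
complement = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(complement.to_dict("records"))  # [{'column1': 3, 'column2': 'c'}]
```

The `_merge` value `"left_only"` plays the role of `existing.isnull()` in the original code, without mutating existing_data.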
Edit 2: The problem was that although the dataframes looked identical, some values were of a different type. Somehow the error only showed up when I tried to merge a subset of the new dataframe for which this was the case.
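For anyone hitting the same issue: casting the key columns to a common dtype before merging avoids this kind of silent mismatch. A toy sketch (two key columns standing in for the five, data invented for illustration):

```python
import pandas as pd

existing_data = pd.DataFrame({"column1": ["1", "2"], "column2": ["a", "b"]})
new_data = pd.DataFrame({"column1": [1, 2], "column2": ["a", "b"]})

# The key columns look identical when printed, but column1 is str on one
# side and int on the other, so a merge would find no matches.
# Cast both sides to a common dtype first.
keys = ["column1", "column2"]
existing_data[keys] = existing_data[keys].astype(str)
new_data[keys] = new_data[keys].astype(str)

merged = new_data.merge(existing_data, on=keys, how="left", indicator=True)
print(merged["_merge"].tolist())  # ['both', 'both'] -- every row now matches
```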
I think what you are looking for is a concat operation followed by drop_duplicates.
# Concat the two dataframes into a new dataframe holding all the data (memory intensive):
complement = pd.concat([existing_data, new_data], ignore_index=True)
# Remove all duplicates:
complement.drop_duplicates(inplace=True, keep=False)
This first creates a dataframe holding all the old and new data, and then deletes every duplicated row (keep=False removes all copies of a duplicated row, not just the extras). You can also pass subset= to compare the duplicate values on certain columns only!
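A small self-contained demonstration of both steps, using invented toy data. Note that because keep=False drops every copy of a duplicated row, what survives is the set of rows present in only one of the two frames:

```python
import pandas as pd

existing_data = pd.DataFrame({"column1": [1, 2], "column2": ["a", "b"]})
new_data = pd.DataFrame({"column1": [2, 3], "column2": ["b", "c"]})

# Rows appearing in both frames are removed entirely; only rows unique
# to one frame remain.
complement = pd.concat([existing_data, new_data], ignore_index=True)
complement.drop_duplicates(inplace=True, keep=False)
print(complement.to_dict("records"))
# [{'column1': 1, 'column2': 'a'}, {'column1': 3, 'column2': 'c'}]

# Comparing on a subset of columns only:
subset_dedup = pd.concat([existing_data, new_data], ignore_index=True)
subset_dedup.drop_duplicates(subset=["column1"], keep=False, inplace=True)
```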
See the documentation here:
concat: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
drop_duplicates: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
Related
I don't even know if groupby is the correct function to use for this. It's a bit hard to understand, so I'll include a screenshot of my dataframe: screenshot
Basically, this dataframe has way too many columns because each column is specific to only one or a few rows. You can see in the screenshot that the first few columns are specific to the first row and the last few columns are specific to the last row. I want to make it so that each row only has the columns that actually pertain to it. I've tried several methods using groupby('equipment name') and several using dropna, but none works the way I need it to. I'm also open to separating it into multiple dataframes.
Any method is acceptable, this bug has been driving me crazy. It took me a while to get to this point because this started out as an unintelligible 10,000 line json. I'm pretty new to programming as well.
This is a very cool answer that could be one option - and it does use groupby so sorry for dismissing!!! This will group your data into DataFrames where each DataFrame has a unique group of columns, and any row which only contains values for those columns will be in that DataFrame. If your data are such that there are multiple groups of rows which share the exact same columns, this solution is ideal I think.
Just to note, though, if your null values are more randomly spread out throughout the dataset, or if one row in a group of rows is missing a single entry (compared to related rows), you will end up with more combinations of unique non-null columns, and then more output DataFrames.
There are also (in my opinion) nice ways to search a DataFrame, even if it is very sparse. You can check the non-null values for a row:
df.loc[index_name].dropna()
Or for an index number:
df.iloc[index_number].dropna()
You could further store these values, say in a dictionary (this is a dictionary of Series, but it could be converted to a DataFrame):
row_dict = {row : df.loc[row].dropna() for row in df.index}
I could imagine some scenarios where something based off these options is more helpful for searching. But that linked answer is slick, I would try that.
EDIT: Expanding on the answer above based on comments with OP.
The dictionary created in the linked post contains the DataFrames. Basically you can use this dictionary to do comparisons with the original source data. My only issue with that answer was that it may be hard to search the dictionary if the column names are janky (as it looks like in your data), so here's a slight modification:
d = {}
for i, (name, group) in enumerate(df.groupby(df.isnull().dot(df.columns))):
    d['df' + str(i)] = group.dropna(axis=1)
Now the dictionary keys are "df#", and the values are the DataFrames. So if you wanted to inspect the contents of one DataFrame, you can call:
d['df1'].head()
#OR
print(d['df0'])
If you wanted to look at all the DataFrames, you could call
for df in d.values():
    print(df.head())  # you can also pass an integer to head to show more rows than 5
Or if you wanted to save each DataFrame you could call:
for name in sorted(d.keys()):
    d[name].to_csv('path/to/file/' + name + '.csv')
The point is, you've gotten to a data structure where you can look at the original data, separated into DataFrames without missing data. Joining these back into a single DataFrame would be redundant, as it would create a single DataFrame (equal to the original) or multiple with some amount of missing data.
I think it comes down to what you are looking for and how you need to search the data. You could rename the dictionary keys / output .CSV files based on the types of machinery inside, for example.
I thought your last comment might mean that objects of similar type might not share the same columns; say, for example, if not all "Exhaust Fans" have the same columns, they will end up in different DataFrames in the dictionary. This may be the type of case where it might be easier to just look at individual rows, rather than grouping them into weird categories:
df_dict = {row : pd.DataFrame(df.loc[row].dropna()).transpose() for row in df.index}
You could again then save these DataFrames as CSV files or look at them one by one (or e.g. search for Exhaust Fans by seeing if "Exhaust" is in the key). You could also print them all at once:
import pandas as pd
import numpy as np
import natsort
#making some randomly sparse data
columns = ['Column ' + str(i+1) for i in range(10)]
index = ['Row ' + str(i+1) for i in range(100)]
df = pd.DataFrame(np.random.rand(100,10), columns=columns,index=index)
df[df<.7] = np.nan
#creating the dictionary where each key is a row name
df_dict = {row : pd.DataFrame(df.loc[row].dropna()).transpose() for row in df.index}
#printing all the output
for key in natsort.natsorted(df_dict.keys())[:5]:  #using [:5] to limit output
    print(df_dict[key], '\n')
Out[1]:
Column 1 Column 4 Column 7 Column 9 Column 10
Row 1 0.790282 0.710857 0.949141 0.82537 0.998411
Column 5 Column 8 Column 10
Row 2 0.941822 0.722561 0.796324
Column 2 Column 4 Column 5 Column 6
Row 3 0.8187 0.894869 0.997043 0.987833
Column 1 Column 7
Row 4 0.832628 0.8349
Column 1 Column 4 Column 6
Row 5 0.863212 0.811487 0.924363
Instead of printing, you could write the output to a text file; maybe that's the type of document that you could look at (and search) to compare to the input tables. But note that even though the printed data are tabular, they can't be made into a DataFrame without accepting that there will be missing data for rows which don't have entries for all columns.
I have a DataFrame which contains 55000 rows and 3 columns
I want to return every row of this big DataFrame as its own DataFrame, to use as a parameter for a different function.
My idea was to iterate over the big DataFrame with iterrows() or iloc, but I can't get a DataFrame out; it gives a Series. How can I solve this?
I think it is usually not necessary, because the index of the Series is the same as the columns of the DataFrame.
But it is possible by:
df1 = s.to_frame().T
Or:
df1 = pd.DataFrame([s.to_numpy()], columns=s.index)
Also, try to avoid iterrows, because it is very slow.
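A quick self-contained check (toy data) that both conversions above give a proper one-row DataFrame rather than a Series:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
s = df.iloc[0]  # a Series: its index is df's columns

# Both conversions yield a one-row DataFrame with the original columns.
df1 = s.to_frame().T
df2 = pd.DataFrame([s.to_numpy()], columns=s.index)
print(df1.shape, list(df1.columns))  # (1, 2) ['a', 'b']
```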
I suspect you're doing something not optimal if you need what you describe. That said, if you need each row as a dataframe:
l = [df.iloc[[i]] for i in range(len(df))]  # double brackets keep each row as a one-row DataFrame
This makes a list with a one-row dataframe for each row in df.
With pandas, I can insert a new column to a specific location like below:
df_all.insert(loc=10, column="label", value=label_column, allow_duplicates=True)
How can I add a new column to a specific location with dask? (to a dask dataframe)
Naively, I would add a column
df["label"] = label_column
And then I would rearrange columns with another getitem call
df = df[[..., "label", ...]]
There might be a cleaner way to do this, but this should work fine.
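The reordering step could be wrapped in a small helper. The sketch below uses pandas with invented data and a made-up helper name (insert_at), but it only relies on assign and column-selection getitem, which dask DataFrames also support:

```python
import pandas as pd

def insert_at(df, loc, name, values):
    # Assign the new column, then rebuild the column order with a
    # single getitem call.
    df = df.assign(**{name: values})
    cols = [c for c in df.columns if c != name]
    cols.insert(loc, name)
    return df[cols]

df = pd.DataFrame({"a": [1], "b": [2], "c": [3]})
out = insert_at(df, 1, "label", [9])
print(list(out.columns))  # ['a', 'label', 'b', 'c']
```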
This question is kind of odd and complex, so bear with me, please.
I have several massive CSV files (GB size) that I am importing with pandas. These CSV files are dumps of data collected by a data acquisition system, and I don't need most of it, so I'm using the usecols parameter to filter out the relevant data. The issue is that not all of the CSV files have all of the columns I need (a property of the data system being used).
The problem is that, if the column doesn't exist in the file but is specified in usecols, read_csv throws an error.
Is there a straightforward way to force a specified column set in a dataframe and have pandas just return blank rows if the column doesn't exist? I thought about iterating over each column for each file and working the resulting series into the dataframe, but that seems inefficient and unwieldy.
Assuming some kind of master list all_cols_to_use, can you do something like:
def parse_big_csv(csvpath):
    with open(csvpath, 'r') as infile:
        header = infile.readline().strip().split(',')
    cols_to_use = sorted(set(header) & set(all_cols_to_use))
    missing_cols = sorted(set(all_cols_to_use) - set(header))
    df = pd.read_csv(csvpath, usecols=cols_to_use)
    for col in missing_cols:
        df[col] = np.nan
    return df
This assumes that you're okay with filling the missing columns with np.nan, but should work. (Also, if you’re concatenating the data frames, the missing columns will be in the final df and filled with np.nan as appropriate.)
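An alternative to assigning np.nan by hand is reindex, which adds any missing columns (NaN-filled) in one step. A sketch assuming the same all_cols_to_use master list, with toy column names:

```python
import pandas as pd

all_cols_to_use = ["a", "b", "c"]
df = pd.DataFrame({"a": [1], "c": [2]})  # "b" is missing from this file

# reindex adds the missing columns, filled with NaN, in the desired order.
df = df.reindex(columns=all_cols_to_use)
print(list(df.columns))  # ['a', 'b', 'c']
```

This also fixes the column order, which is convenient before concatenating many files.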
I'm trying to load multiple csv files into a single dataframe df while:
adding column names
adding and populating a new column (Station)
excluding one of the columns (QD)
All of this works fine until I attempt to exclude a column with usecols, which throws the error Too many columns specified: expected 5 and found 4.
Is it possible to create a new column and pass usecols at the same time?
The reason I'm creating & populating a new 'Station' column during read_csv is my dataframe will contain data from multiple stations. I can work around the error by doing read_csv in one statement and dropping the QD column in the next with df.drop('QD', axis=1, inplace=True) but want to make sure I understand how to do this the most pandas way possible.
Here's the code that throws the error:
df = pd.concat(pd.read_csv("http://lgdc.uml.edu/common/DIDBGetValues?ursiCode=" + row['StationCode'] + "&charName=MUFD&DMUF=3000",
                           skiprows=17,
                           delim_whitespace=True,
                           parse_dates=[0],
                           usecols=['Time','CS','MUFD','Station'],
                           names=['Time','CS','MUFD','QD','Station']
                           ).fillna(row['StationCode']
                           ).set_index(['Time', 'Station'])
               for index, row in stationdf.iterrows())
Example StationCode from stationdf: BC840.
Data sample: 2016-09-19T00:00:05.000Z 100 19.34 //
You can create a new column using operator chaining with assign:
df = pd.read_csv(...).assign(Station=row['StationCode'])
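A self-contained sketch of the pattern, with an in-memory toy CSV standing in for the live URL (the Station value "BC840" is taken from the question). The key point is that usecols only lists columns that exist in the file; Station is added afterwards by chaining .assign:

```python
import io
import pandas as pd

csv_text = "Time,CS,MUFD,QD\n2016-09-19T00:00:05.000Z,100,19.34,//\n"

# Read only existing columns (excluding QD), then add Station via .assign,
# so usecols never has to mention a column the file doesn't contain.
df = (pd.read_csv(io.StringIO(csv_text), usecols=["Time", "CS", "MUFD"])
        .assign(Station="BC840")
        .set_index(["Time", "Station"]))
print(df.columns.tolist())  # ['CS', 'MUFD']
```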