How would I copy over rows of a dataframe to another dataframe - python-3.x

I am trying to go through an Excel spreadsheet that contains lots of data and sort through it. The image following is just a short pic of what I have. I imported this Excel sheet into a dataframe. What I need to do is split up the data by Data Point Name into different dataframes.
The data points go from 1066 to 1070 in increments of 1. I need to split these into different dataframes so there's a dataframe for each one. Any help would be appreciated. I have already imported it into a dataframe which I called test_df_new; I just need to know how to go further.
Thank you

Since you asked in the comments to sort by a column, you can sort on that column like this:
import pandas as pd

_data = {
    'Toolset': ['Toolset', 'Toolset', 'Toolset', 'Toolset'],
    'Data Point Name': ['EN1068', 'EN1067', 'EN1068QR', 'EN1068QR'],
    'Toolset Start Date': [0, 0, 0, 0],
    'ToolsetCount': [1674, 1160, 977, 977],
    'ToolsetCap': [0, 0, 0, 0],
    'Toolset Cap Start Date': [0, 0, 0, 0],
    'Cap Count': [0, 0, 51, 42],
}
df = pd.DataFrame(data=_data)
# sort_values returns a new DataFrame, so assign the result if you want to keep it
df = df.sort_values(by=['Data Point Name'])
Note: since the values start with EN, pandas will take care of sorting them alphabetically. I have attached an image; is this what you are looking for?
You can also use the copy() function to copy a dataframe to a new dataframe, like this:
new_df = df.copy()
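The answer above covers sorting and copying; the original question, though, asks how to split the data into a separate dataframe per Data Point Name. One common way to do that, not shown in the answer, is groupby. A minimal sketch, assuming the imported dataframe is called test_df_new as in the question:
# Build one DataFrame per unique 'Data Point Name' value.
frames_by_name = {
    name: group.copy()
    for name, group in test_df_new.groupby('Data Point Name')
}

# Each value in the dict is its own DataFrame, e.g.:
en1067_df = frames_by_name['EN1067']  # 'EN1067' is one of the sample values above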

Related

Pandas - concatenating multiple excel files with different column names but the same data type

I have about 50 Excel sheets with the '.xlsb' extension. I'd like to concatenate a specific worksheet into a pandas DataFrame (all worksheet names are the same). The problem I have is that the column names are not exactly the same in each worksheet. I wrote some code using pandas, but the way it works is that it concatenates all values into the same column of the DataFrame based on the column name. So, for example, sometimes I have a column called FgsNr and sometimes FgNr - the datatype and the meaning of both columns are exactly the same and I would like to have them in the same column in the DataFrame, but pandas creates two separate columns and stacks together only those values that are listed under the same column name.
files = glob(r'C:\Users\Folder\*xlsb')
for file in files:
    Datafile = pd.concat(pd.read_excel(file, engine='pyxlsb', sheet_name='Sheet1', usecols='A:F', header=0) for file in files)
How could I correct the code so that it copies and concatenates all values based on column position, ignoring the column names?
When concatenating multiple dataframes with the same format, you can use the below snippet for speed and efficiency.
The basic logic is that you put them into a list, and then concatenate at the final stage.
from glob import glob
import pandas as pd

files = glob(r'C:\Users\Folder\*xlsb')
dfs = []
for file in files:
    df = pd.read_excel(file, engine='pyxlsb', sheet_name='Sheet1', usecols='A:F', header=0)
    dfs.append(df)
large_df = pd.concat(dfs, ignore_index=True)
Also refer to the below:
Creating an empty Pandas DataFrame, then filling it?
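The question also asks how to ignore the differing column names (e.g. FgsNr vs FgNr). One way to handle that, not covered in the answer above, is to rename the columns positionally after reading each file. A minimal sketch, assuming every sheet has the same six columns in the same order and using hypothetical canonical names:
canonical_cols = ['FgNr', 'col_b', 'col_c', 'col_d', 'col_e', 'col_f']  # hypothetical names

dfs = []
for file in files:
    df = pd.read_excel(file, engine='pyxlsb', sheet_name='Sheet1', usecols='A:F', header=0)
    df.columns = canonical_cols  # overwrite whatever header text this particular file uses
    dfs.append(df)
large_df = pd.concat(dfs, ignore_index=True)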

How to create a single-row pandas DataFrame with headers from a big pandas DataFrame

I have a DataFrame which contains 55000 rows and 3 columns.
I want to return every row as a DataFrame from this big DataFrame, to use it as a parameter for different functions.
My idea was to iterate over the big DataFrame with iterrows() or iloc, but I can't get each row as a DataFrame; it comes back as a Series. How can I solve this?
I think it is obviously not necessary, because the index of the Series is the same as the columns of the DataFrame.
But it is possible by:
df1 = s.to_frame().T
Or:
df1 = pd.DataFrame([s.to_numpy()], columns=s.index)
Also, try to avoid iterrows, because it is obviously very slow.
I suspect you're doing something not optimal if you need what you describe. That said, if you need each row as a dataframe:
l = [pd.DataFrame(df.iloc[i]) for i in range(len(df))]
This makes a list of DataFrames, one for each row in df.
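Another option, not mentioned in the answers above, is to select each row with a list of positions; that keeps the result as a one-row DataFrame whose headers are the original columns, rather than turning the row into a single-column frame. A minimal sketch:
# iloc with a list of positions returns a one-row DataFrame, not a Series.
single_row_frames = [df.iloc[[i]] for i in range(len(df))]
single_row_frames[0]  # first row, with the original column headers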

Only adding new rows from a dataframe to a csv file

Each day I get a pandas dataframe that has five columns, called column1, column2, column3, column4, and column5. I want to add rows that I previously did not receive to a file where I keep the unique rows, called known_data.csv. In order to do so, I wrote some code that should:
1. Load the data from known_data.csv as a dataframe called existing_data
2. Add a new column called 'existing' to the existing_data df
3. Merge the old existing_data dataframe with the dataframe called new_data on the five columns
4. Check whether new_data contains new rows by looking at merge[merge.existing.isnull()] (the complement of the new data and the existing data)
5. Append the new rows to the known_data.csv file
My code looks like this:
existing_data = pd.read_csv("known_data.csv")
existing_data['existing'] = 'yes'
merge_data = pd.merge(new_data, existing_data, on=['column1', 'column2', 'column3', 'column4', 'column5'], how='left')
complement = merge_data[merge_data.existing.isnull()]
del complement['existing']
complement.to_csv("known_data.csv", mode='a', index=False, header=False)
Unfortunately, this code does not work as expected: the complement is never empty. Even when I receive data that has already been recorded in known_data.csv, some of the rows of new_data are appended to the file anyway.
Question: What am I doing wrong? How can I solve this problem? Does it have to do with the way I'm reading the file and write to the file?
Edit: Adding a new column called existing to the existing_data dataframe is probably not the best way of checking the complement between existing_data and new_data. If anyone has a better suggestion that would be hugely appreciated!
Edit2: The problem was that although the dataframes looked identical, there were some values that were of a different type. Somehow this error only showed when I tried to merge a subset of the new dataframe for which this was the case.
I think what you are looking for is a concat operation followed by drop_duplicates.
# Concat the two dataframes into a new dataframe holding all the data (memory intensive):
complement = pd.concat([existing_data, new_data], ignore_index=True)
# Remove every row that occurs more than once (keep=False drops all copies):
complement.drop_duplicates(inplace=True, keep=False)
This will first create a dataframe holding all the old and new data, and in a second step delete all duplicate entries. You can also specify a subset of columns on which to compare for duplicates.
See the documentation here:
concat
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
drop_duplicates
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
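Putting the pieces together, here is a minimal sketch of how the concat/drop_duplicates idea could be used to append only the genuinely new rows. It assumes new_data is already in memory and known_data.csv exists; including existing_data twice is a deliberate trick (not part of the answer above) so every already-known row appears more than once and is removed by keep=False. As the question's second edit notes, mismatched dtypes between the CSV and the new dataframe can break exact comparisons, so the dtypes may need aligning first.
import pandas as pd

existing_data = pd.read_csv("known_data.csv")

# Deduplicate the incoming batch, then drop everything already known:
candidates = new_data.drop_duplicates()
combined = pd.concat([candidates, existing_data, existing_data], ignore_index=True)
new_only = combined.drop_duplicates(keep=False)

# Append only the rows that were not yet in known_data.csv.
new_only.to_csv("known_data.csv", mode='a', index=False, header=False)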

Error while selecting rows from pandas data frame

I have a pandas dataframe df with the column Name. I did:
for name in df['Name'].unique():
    X = df[df['Name'] == name]
    print(X.head())
but then X contains all kinds of different names, not just the unique name I want.
What did I do wrong?
Thanks a lot
You probably don't want to overwrite X with every iteration of your loop, keeping only the dataframe for the last value of df['Name'].unique().
Depending on your data and goal, you might want to use groupby as jezrael suggests, or maybe do something like df[~df['Name'].duplicated()].
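A minimal sketch of the two suggestions above, assuming df has a 'Name' column:
# 1) groupby: iterate over one sub-DataFrame per unique name
for name, group in df.groupby('Name'):
    print(name)
    print(group.head())

# 2) keep only the first row for each name
first_rows = df[~df['Name'].duplicated()]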

Appending blank rows to dataframe if column does not exist

This question is kind of odd and complex, so bear with me, please.
I have several massive CSV files (GB size) that I am importing with pandas. These CSV files are dumps of data collected by a data acquisition system, and I don't need most of it, so I'm using the usecols parameter to filter out the relevant data. The issue is that not all of the CSV files have all of the columns I need (a property of the data system being used).
The problem is that, if the column doesn't exist in the file but is specified in usecols, read_csv throws an error.
Is there a straightforward way to force a specified column set in a dataframe and have pandas just return blank rows if the column doesn't exist? I thought about iterating over each column for each file and working the resulting series into the dataframe, but that seems inefficient and unwieldy.
Assuming some kind of master list all_cols_to_use, can you do something like:
import numpy as np
import pandas as pd

def parse_big_csv(csvpath):
    # Read just the header line to see which of the desired columns exist.
    with open(csvpath, 'r') as infile:
        header = infile.readline().strip().split(',')
    cols_to_use = sorted(set(header) & set(all_cols_to_use))
    missing_cols = sorted(set(all_cols_to_use) - set(header))
    df = pd.read_csv(csvpath, usecols=cols_to_use)
    df.loc[:, missing_cols] = np.nan
    return df
This assumes that you're okay with filling the missing columns with np.nan, but it should work. (Also, if you're concatenating the data frames, the missing columns will be present in the final df and filled with np.nan as appropriate.)
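A usage sketch, with a hypothetical master column list and glob pattern, showing how the helper above could feed a single concatenated DataFrame:
from glob import glob

all_cols_to_use = ['timestamp', 'sensor_a', 'sensor_b']  # hypothetical master list
paths = glob(r'C:\data\*.csv')                           # hypothetical location of the CSV dumps

big_df = pd.concat([parse_big_csv(p) for p in paths], ignore_index=True)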
