I have not worked with Pandas before and am seeking guidance on the best course of action.
Currently, I have an Excel (.xlsx) spreadsheet that I am reading into a Pandas DataFrame. The spreadsheet contains account data: document control number, contract id, manufacturer contract id, series number, include exclude, start date, end date, and vendors customer id.
From that data, every account number needs to be copied to every row of the remaining columns (document key co, document control number, contract id, manufacturer contract id, series number, include exclude, start date, end date, and vendors customer id).
Here is a sample of the data:
I've read in the DataFrame and iterated over the DataFrame with the following code:
import pandas as pd

# Reads in template data. Keeps leading zeros in column B and prevents "NaN" from appearing in blank cells
df = pd.read_excel('Contracts.xlsx', converters={'document_key_co': str}, na_filter=False)

# Iterates over rows
for row in df.itertuples():
    print(row)
After doing those things, that is where I am stuck. The desired outcome is this:
As you can see, the three accounts are copied to each of the contract ids.
Reading through the Pandas documentation, I considered splitting each account into its own DataFrame and using concat/merge to combine it with another DataFrame containing document key co through vendors customer id, but that felt like a lot of extra code when there is likely a better solution.
I was able to accomplish the task utilizing this snippet of code:
concats = []
for x in df.account.values:
    concats.append(df.copy())
    concats[-1].account = x
result = pd.concat(concats)
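For reference, the same replication can be done without an explicit loop using a cross merge (available as `how='cross'` since pandas 1.2). This is only a sketch on invented data, since the original sample isn't shown; the column names are assumptions:

```python
import pandas as pd

# Invented sample standing in for the spreadsheet data
df = pd.DataFrame({
    'account': ['111', '222', '333'],
    'contract_id': ['C-1', 'C-2', 'C-3'],
    'start_date': ['2020-01-01', '2020-02-01', '2020-03-01'],
})

# Pair every account with every contract row
accounts = df[['account']]
details = df.drop(columns='account')
result = accounts.merge(details, how='cross')
print(result)  # 3 accounts x 3 contract rows = 9 rows
```

The cross merge builds the full cartesian product directly, so there is no list of copies to manage.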
Need help merging multiple rows of data with various datatypes for multiple columns
I have a dataframe that contains 14 columns and x number of rows of data. An example slice of the dataframe is linked below:
Current Example of my dataframe
I want to be able to merge all four rows of data into a single row based on the "work order" column. See linked image below. I am currently using pandas to take data from four different data sources and create a dataframe that has all the relevant data I want based on each work order number. I have tried various methods including groupby, merge, join, and others without any good results.
How I want my dataframe to look in the end
I essentially want to groupby the work order value, merge all the site names into a single value, then have all data essentially condense to a single row. If there is identical data in a column then I just want it to merge together. If there are values in a column that are different (such as in "Operator Ack Timestamp") then I don't mind the data being a continuous string of data (ex. one date after the next within the same cell).
example dataframe data:
df = pd.DataFrame({'Work Order': [10025, 10025, 10025, 10025],
                   'Site': ['SC1', 'SC1', 'SC1', 'SC1'],
                   'Description_1': ['', '', 'Inverter 10A-1 - No Comms', ''],
                   'Description_2': ['', '', 'Inverter 10A-1 - No Comms', ''],
                   'Description_3': ['Inverter 10A-1 has lost communications.', '', '', ''],
                   'Failure Type': ['', '', 'Communications', ''],
                   'Failure Class': ['', '', '2', ''],
                   'Start of Fault': ['', '', '2021-05-30 06:37:00', ''],
                   'Operator Ack Timestamp': ['2021-05-30 8:49:21', '', '2021-05-30 6:47:57', ''],
                   'Timestamp of Notification': ['2021-05-30 07:18:58', '', '', ''],
                   'Actual Start Date': ['', '2021-05-30 6:37:00', '', '2021-05-30 6:37:00'],
                   'Actual Start Time': ['', '06:37:00', '', '06:37:00'],
                   'Actual End Date': ['', '2021-05-30 08:24:00', '', ''],
                   'Actual End Time': ['', '08:24:00', '', '']})
df.head()
Four steps to get the expected output:
1. Replace empty values with pd.NA.
2. Group your data by the Work Order column, because it seems to be the index key.
3. For each group, fill NA values from the last valid observation and keep the last record.
4. Reset the index to have the same format as the input.
I chose to group by "Work Order" because it seems to be the index key of your dataframe, so set it as the index first:
df = df.set_index("Work Order")
out = df.replace({'': pd.NA}) \
        .groupby("Work Order", as_index=False) \
        .apply(lambda x: x.ffill().tail(1)) \
        .reset_index(level=0, drop=True)
>>> out.T  # transpose for better visualisation
Work Order 10025
Site SC1
Description_1 Inverter 10A-1 - No Comms
Description_2 Inverter 10A-1 - No Comms
Description_3 Inverter 10A-1 has lost communications.
Failure Type Communications
Failure Class 2
Start of Fault 2021-05-30 06:37:00
Operator Ack Timestamp 2021-05-30 6:47:57
Timestamp of Notification 2021-05-30 07:18:58
Actual Start Date 2021-05-30 6:37:00
Actual Start Time 06:37:00
Actual End Date 2021-05-30 08:24:00
Actual End Time 08:24:00
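A shorter variant of the same idea: GroupBy.first() returns the first non-null value per column, so replacing the empty strings with NA and taking first per group condenses each work order to one row in a single step. Note that it keeps the first rather than the last value wherever a column holds several different non-null entries (e.g. "Operator Ack Timestamp"), so it is not identical to the ffill/tail approach above:

```python
import pandas as pd

# Trimmed-down version of the example data
df = pd.DataFrame({
    'Work Order': [10025, 10025, 10025],
    'Site': ['SC1', 'SC1', 'SC1'],
    'Failure Type': ['', '', 'Communications'],
    'Failure Class': ['', '', '2'],
})

# Empty strings become NA, then first() picks the first
# non-null value per column within each work order
out = (df.replace({'': pd.NA})
         .groupby('Work Order', as_index=False)
         .first())
```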
I don't even know if groupby is the correct function to use for this. It's a bit hard to explain, so I'll include a screenshot of my dataframe: screenshot
Basically, this dataframe has way too many columns because each column is specific to only one or a few rows. You can see in the screenshot that the first few columns are specific towards the first row and the last few columns are specific to the last row. I want to make it so that each row only has the columns that actually pertain to it. I've tried several methods of using groupby('equipment name') and several methods using dropna but none work in the way I need it to. I'm also open to separating it into multiple dataframes.
Any method is acceptable, this bug has been driving me crazy. It took me a while to get to this point because this started out as an unintelligible 10,000 line json. I'm pretty new to programming as well.
This is a very cool answer that could be one option - and it does use groupby so sorry for dismissing!!! This will group your data into DataFrames where each DataFrame has a unique group of columns, and any row which only contains values for those columns will be in that DataFrame. If your data are such that there are multiple groups of rows which share the exact same columns, this solution is ideal I think.
Just to note, though, if your null values are more randomly spread out throughout the dataset, or if one row in a group of rows is missing a single entry (compared to related rows), you will end up with more combinations of unique non-null columns, and then more output DataFrames.
There are also (in my opinion) nice ways to search a DataFrame, even if it is very sparse. You can check the non-null values for a row:
df.loc[index_name].dropna()
Or for an index number:
df.iloc[index_number].dropna()
You could further store these values, say in a dictionary (this is a dictionary of Series, but each entry could be converted to a DataFrame):
row_dict = {row : df.loc[row].dropna() for row in df.index}
I could imagine some scenarios where something based off these options is more helpful for searching. But that linked answer is slick, I would try that.
EDIT: Expanding on the answer above based on comments with OP.
The dictionary created in the linked post contains the DataFrames. Basically, you can use this dictionary to do comparisons with the original source data. My only issue with that answer was that it may be hard to search the dictionary if the column names are janky (as it looks like in your data), so here's a slight modification:
d = {}
for i, (name, group) in enumerate(df.groupby(df.isnull().dot(df.columns))):
    d['df' + str(i)] = group.dropna(axis=1)
Now the dictionary keys are "df#", and the values are the DataFrames. So if you wanted to inspect the contents of one DataFrame, you can call:
d['df1'].head()
#OR
print(d['df0'])
If you wanted to look at all the DataFrames, you could call
for df in d.values():
    print(df.head())  # you can also pass an integer to head to show more rows than 5
Or if you wanted to save each DataFrame you could call:
for name in sorted(d.keys()):
    d[name].to_csv('path/to/file/' + name + '.csv')
The point is, you've gotten to a data structure where you can look at the original data, separated into DataFrames without missing data. Joining these back into a single DataFrame would be redundant, as it would create a single DataFrame (equal to the original) or multiple with some amount of missing data.
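To make the mechanism concrete, here is a self-contained sketch on invented toy data. `df.isnull().dot(df.columns)` builds, for each row, a string concatenating the names of that row's null columns, so rows sharing the same null pattern land in the same group:

```python
import numpy as np
import pandas as pd

# Invented sparse data: fans have a speed, pumps have a flow rate
df = pd.DataFrame({
    'equipment name': ['Fan 1', 'Fan 2', 'Pump 1'],
    'fan speed': [1200.0, 900.0, np.nan],
    'flow rate': [np.nan, np.nan, 55.0],
})

d = {}
for i, (pattern, group) in enumerate(df.groupby(df.isnull().dot(df.columns))):
    # drop the columns that are null for this group of rows
    d['df' + str(i)] = group.dropna(axis=1)
```

Each resulting DataFrame keeps only the columns that actually pertain to its rows, which is exactly the separation asked for.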
I think it comes down to what you are looking for and how you need to search the data. You could rename the dictionary keys / output .CSV files based on the types of machinery inside, for example.
I thought your last comment might mean that objects of similar type might not share the same columns; say, for example, if not all "Exhaust Fans" have the same columns, they will end up in different DataFrames in the dictionary. This may be the type of case where it is easier to just look at individual rows, rather than grouping them into weird categories:
df_dict = {row : pd.DataFrame(df.loc[row].dropna()).transpose() for row in df.index}
You could again then save these DataFrames as CSV files or look at them one by one (or e.g. search for Exhaust Fans by seeing if "Exhaust" is in the key). You could also print them all at once:
import pandas as pd
import numpy as np
import natsort
#making some randomly sparse data
columns = ['Column ' + str(i+1) for i in range(10)]
index = ['Row ' + str(i+1) for i in range(100)]
df = pd.DataFrame(np.random.rand(100,10), columns=columns,index=index)
df[df<.7] = np.nan
#creating the dictionary where each key is a row name
df_dict = {row : pd.DataFrame(df.loc[row].dropna()).transpose() for row in df.index}
#printing all the output
for key in natsort.natsorted(df_dict.keys())[:5]:  # using [:5] to limit output
    print(df_dict[key], '\n')
Out[1]:
Column 1 Column 4 Column 7 Column 9 Column 10
Row 1 0.790282 0.710857 0.949141 0.82537 0.998411
Column 5 Column 8 Column 10
Row 2 0.941822 0.722561 0.796324
Column 2 Column 4 Column 5 Column 6
Row 3 0.8187 0.894869 0.997043 0.987833
Column 1 Column 7
Row 4 0.832628 0.8349
Column 1 Column 4 Column 6
Row 5 0.863212 0.811487 0.924363
Instead of printing, you could write the output to a text file; maybe that's the type of document that you could look at (and search) to compare to the input tables. But note that even though the printed data are tabular, they can't be made into a single DataFrame without accepting that there will be missing data for rows which don't have entries for all columns.
We have students answer MCQs after each lesson on Socrative.
They enter their name first, then answer. For each lesson, we collect data from the Socrative platform, but we have issues normalizing the names: 'John Doe', 'johndoe' or 'John,Doe' should all be transformed into 'doe', as it is written in our main file.
Our main file for following up on students (treated as a dataframe with Python) initially has just one column, the name (as a string, 'doe' for Mr. John Doe).
I'd like to write a function that goes through the 'name' column of my lesson1 dataframe and, for each value of the name column, replaces the badly typed name with the reference name.
To lower the case and strip excess spaces and punctuation, I've used the following code:
import re

lesson1["name"] = lesson1["name"].str.lower()
lesson1["name"] = lesson1["name"].str.strip()
lesson1["name"] = lesson1["name"].apply(lambda x: re.sub('[^A-Za-z0-9]+', '', x))
Then I want to change the 'name' values to the reference names where necessary.
I've tried the following code on two lists:
bad = lesson1['name']
good = reference['name']

def changenames(lesson_list, reference_list):
    for i, name in enumerate(lesson_list):
        for j, ref in enumerate(reference_list):
            if ref in name:
                lesson_list[i] = ref

changenames(bad, good)
but 1) it's not working, due to a SettingWithCopyWarning, and 2) I fail to apply it to a column of the dataframe.
Could you help me?
Thanks,
L.
I've found a way.
I have two dataframes:
- the reference_list dataframe, with the names of the students; it has a column 'name'
- the lesson dataframe, with the names as the students type them when they answer the MCQs (not standardized) and the answers to the MCQs
To transform the names of the students in the lesson dataframe, based on the well-typed names in reference_list['name'], I used:
for i in lesson['name']:
    for ref in reference_list['name']:
        if ref in i:
            lesson.loc[lesson['name'] == i, 'name'] = ref
and it works fine.
After that, you can apply functions to treat duplicates, merge data, and so on.
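For what it's worth, the nested loop can also be folded into a plain function applied per value, which sidesteps the SettingWithCopyWarning because only one assignment back to the column happens. This is a sketch on invented names; normalize and its fallback behaviour (keeping the cleaned value when no reference matches) are assumptions:

```python
import re

import pandas as pd

# Invented stand-ins for the two dataframes
reference_list = pd.DataFrame({'name': ['doe', 'smith']})
lesson = pd.DataFrame({'name': ['John Doe', 'johndoe', 'Jane,Smith']})

def normalize(raw, refs):
    # strip punctuation/spaces, then lowercase
    cleaned = re.sub('[^A-Za-z0-9]+', '', raw).lower()
    for ref in refs:
        if ref in cleaned:
            return ref
    return cleaned  # no reference matched; keep the cleaned value

refs = reference_list['name'].tolist()
lesson['name'] = lesson['name'].apply(lambda x: normalize(x, refs))
```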
I've found help in this thread Replace single value in a pandas dataframe, when index is not known and values in column are unique
Hope it'll help some of you.
Louis
So I have a dataset of movie name, date, and revenue accumulated.
As you can see, there are multiple rows for the same movie, and revenue is accumulated.
I want to extract the last revenue accumulated value for each movie, make a new column, and add that extracted value to the first row of that movie. I know how to make a new column, but I don't know how to extract the data and add it to the FIRST ROW of a given movie. I'm using Python and pandas.
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values('Date')
df.groupby('Movie name')['Revenue accumulated'].last()
I now know how to collect the last revenue accumulated of a certain movie using groupby, but I still do not know how to insert that information into the first row of each movie in a new column.
Use
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values('Date')
df.groupby('Movie name')['Revenue accumulated'].last()
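One way to go from there (a sketch on invented data; the 'Final revenue' column name is an assumption) is to broadcast the last value back with transform('last') and then assign it only on each movie's first row:

```python
import pandas as pd

df = pd.DataFrame({
    'Movie name': ['A', 'B', 'A', 'B'],
    'Date': ['2021-01-01', '2021-01-01', '2021-01-02', '2021-01-03'],
    'Revenue accumulated': [10, 5, 25, 12],
})
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values('Date')

# Broadcast each movie's last accumulated value to all its rows...
last_rev = df.groupby('Movie name')['Revenue accumulated'].transform('last')
# ...then keep it only on the first occurrence of each movie
is_first = ~df.duplicated('Movie name')
df.loc[is_first, 'Final revenue'] = last_rev[is_first]
```

Unlike .last(), which collapses to one row per movie, transform('last') keeps the original shape, so the result aligns row-for-row with the dataframe and can be masked onto the first rows directly.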
I have been melting my brain trying to work out the formula I need for a multiple conditional lookup.
I have two data sets: one is job data and the other is contract data.
The job data contains customer name, location of job and date of job. I need to find out whether the job was under contract when it took place, and if it was, return a value from column N in the contract data.
The problem comes when I try to use the date ranges, as there is frequently more than one contract per customer.
So, for example, in my job data:
CUSTOMER | LOCATION | JOB DATE
Cust A | Port A | 01/01/2014
Cust A | Port B | 01/02/2014
Customer A had a contract in Port B that expired on 21st Feb 2014, so here I would want it to return the value from column N in my contract data, as the job was under contract.
Customer A did not have a contract in Port A at the time of the job, so I would want it to return 'no contract'.
The contract data has columns containing customer name, port name, and start and end date values, as well as my lookup category.
I think I need to be using INDEX/MATCH, but I can't seem to get them to work with my date ranges. Is there another type of lookup I can use to get this to work?
Please help, I'm losing the plot!
Thanks :)
You can use two approaches here:
In both result and source tables make a helper column that concatenates all three values like this: =A2&B2&C2. So that you get something like 'Cust APort A01/01/2014'. That is, you get a unique value by which you can identify the row. You can add delimiter if needed: =A2&"|"&B2&"|"&C2. Then you can perform VLOOKUP by this value.
You can add a helper column with the row number (1, 2, 3, ...) in the source table. Then you can use =SUMIFS(<row_number_column>,<source_condition_column_1>,<condition_1>,<source_condition_column_2>,<condition_2>,...) to return the row number of the source table that matches all three conditions. You can use this row number to perform INDEX or whatever is needed. But BE CAREFUL: check that there are only unique combinations of all three columns in the source table, otherwise this approach may return wrong results; i.e. if the matching conditions are met in rows 3 and 7, it will return 10, which is completely wrong.