Use pandas to make sense of malformed Excel data - python-3.x

My job has me doing some data analysis, and the exported spreadsheet I am given (the ONLY way it can be given to me) has data that looks like this:
But what I need it to look like, ideally, would be something like this:
I've tried other code and, to be honest, I made a mangled mess and got rid of it, as I only succeeded in jumbling the data. I've done several other pandas projects where I was able to sort and make sense of the data, but that data had a consistent structure and was easier to handle. At this point I just don't have the logic for how to go about fixing this data. I would do it manually, but it's over 48k lines. Any help you can provide would be greatly appreciated.
Edit: This is what the data looks like if we 'delete blanks and shift-up'

Try this:
import pandas as pd

df = pd.read_excel('your_excel_file.xlsx')

for col in df.columns[-4:]:
    if col == 'Subscription Name':
        df[col] = df[col].shift(-1)
    elif col == 'Resource Group':
        df[col] = df[col].shift(-2)
    else:
        df[col] = df[col].shift(-3)

out = df.ffill().drop_duplicates().reset_index(drop=True)
display(out)
Edit:
You can also use:
out = df[df['Resource Name'].notna()].ffill()
Or, for better efficiency (as per @Vladimir Fokow):
out = df.dropna(how='all').ffill()
instead of:
out = df.ffill().drop_duplicates().reset_index(drop=True)
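To see what the shift-then-clean trick does, here is a toy example; the column names and values are assumptions modeled on the answer's code, not the asker's real export:

import pandas as pd

# Toy frame mimicking the malformed export: each record spans three rows,
# with each later column's value sitting one row lower than the previous one.
df = pd.DataFrame({
    'Resource Name':     ['vm-01', None, None, 'vm-02', None, None],
    'Subscription Name': [None, 'Prod', None, None, 'Dev', None],
    'Resource Group':    [None, None, 'rg-core', None, None, 'rg-app'],
})

# Shift each trailing column up so a record's values share one row,
# then drop the rows left entirely empty.
df['Subscription Name'] = df['Subscription Name'].shift(-1)
df['Resource Group'] = df['Resource Group'].shift(-2)
out = df.dropna(how='all').ffill()
print(out)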

Related

Storing stock market data in a column

Could someone perhaps assist me in finding a solution to this problem? I'm currently learning how to code. I'm attempting to create a new column that displays the current price as it fluctuates in real time. I tried stock_info.get_live_price('NIO'); it works when only one ticker is inserted, but not when the variable 'stock_name' is inserted.
import pandas as pd  # the original "import pandas" did not match the pd alias used below
from yahoo_fin import stock_info

def My_portfolio1():
    df = pd.DataFrame({
        'stock_names': ['NIO', 'JMIA', 'SVRA'],
        'price': [1, 3, 4],
        'quantity': [200, 100, 400],
        'entry_price': [3, 4, 5],
        'current_price': [2, 3, 1]
    })
    df['new_value'] = df['current_price'] - df['entry_price']
    df['pnl'] = df['new_value'] * df['quantity']
    # Problem line: 'stock_name' is a literal string, not a ticker symbol
    df['live_update'] = stock_info.get_live_price('stock_name')
    return df

My_portfolio1()
Thank you so much, everyone! As a result, I decided to make a variable for each of the tickers and use the loc function to place them in the appropriate rows and columns.
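For readers landing here, a minimal sketch of that approach, assuming yahoo_fin is installed and using the frame from the question (with the failing live_update line dropped):

import pandas as pd
from yahoo_fin import stock_info

df = pd.DataFrame({
    'stock_names': ['NIO', 'JMIA', 'SVRA'],
    'quantity': [200, 100, 400],
    'entry_price': [3, 4, 5],
})

# Fetch each ticker's live price individually and place it on its own row.
for i, ticker in enumerate(df['stock_names']):
    df.loc[i, 'live_update'] = stock_info.get_live_price(ticker)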

Change one element of column heading in CSV using Pandas

I have created a CSV file which looks like this:
RigName,Date,DrillingMiles,TrippingMiles,CasingMiles,LinerMiles,JarringMiles,TotalMiles,Comments
0,08 July 2021,19.21,63.05,43.16,45.41,8.52,0,"Tested all totals. Edge cases for multiple clicks.
"
1,09 July 2021,19.21,63.05,43.16,45.41,8.52,0,"Test entry#2.
"
I wish to change the 'RigName' to something the user inputs. I have tried various ways of changing the word 'RigName' to user input. One of them is this:
df = pd.read_csv('ton_miles_record.csv')
user_input = 'Rig805'
df.columns = df.columns.str.replace('RigName', user_input)
df.to_csv('new_csv.csv', header=True, index=False)
However, no matter what I do, the header row in the resulting CSV file always comes out as:
Unnamed: 0,Date,DrillingMiles,TrippingMiles,CasingMiles,LinerMiles,JarringMiles,TotalMiles,Comments
Why am I getting 'Unnamed: 0' instead of the user input value?
Also, is there a way to change 'RigName' to something else by referring to its position, so I can make similar changes to any column name by position in the future?
Zubin, you would need to change the column name by treating the columns as a list. The code below should do the trick; it also shows how to access a column by position. (As for the 'Unnamed: 0' header: that usually means the file's first header cell is empty, e.g. because an earlier to_csv wrote the index, so str.replace never finds a 'RigName' to replace.)
import pandas as pd

df = pd.read_csv('ton_miles_record.csv')
user_input = 'Rig805'
df.columns.values[0] = user_input
df.to_csv('new_csv.csv', header=True, index=False)
After 3 hours of trial and error (and a lot of searching in vain), I solved it by doing this:
df = pd.read_csv('ton_miles_record.csv')
user_input = 'SD555'
df.rename(columns={df.columns[1]: user_input}, inplace=True)
df.to_csv('new_csv.csv', index=False)
I hope this helps someone else struggling as I was.
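For the "multiple changes by position" part of the question, a hedged sketch; the positions and replacement names below are made up for illustration:

import pandas as pd

df = pd.read_csv('ton_miles_record.csv')

# Map column positions to new names; both entries are illustrative.
new_names = {0: 'Rig805', 2: 'DrillMiles'}
df = df.rename(columns={df.columns[i]: name for i, name in new_names.items()})

df.to_csv('new_csv.csv', index=False)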

Sum all counts when their fuzz.WRatio > 90 otherwise leave intact

What I want to do is group all similar strings in one column and sum their corresponding counts when they are similar; otherwise, leave them as they are.
This is a little similar to the following post, but unfortunately I have not been able to apply it to my case:
How to group Pandas data frame by column with regex match
So far, I have ended up with the following approach:
I wrote a function to print out the fuzz.WRatio for each row's string, where each row does a linear search from the top to check whether there are other similar strings in the rest of the rows. If the WRatio > 90, I would like to sum those rows' corresponding counts; otherwise, leave them as they are.
I created test data that looks like this:
test_data = pd.DataFrame({
    'name': ['Apple.Inc.', 'apple.inc', 'APPLE.INC', 'OMEGA'],
    'count': [4, 3, 2, 6]
})
So what I want is a result DataFrame like:
result = pd.DataFrame({
    'Nname': ['Apple.Inc.', 'OMEGA'],
    'Ncount': [9, 6]
})
My function so far only gives me the fuzz ratio for each row, and my understanding is that each row is compared three times (since we have four rows here). So my function output looks like:
pd.DataFrame({
    'Nname': ['Apple.Inc.', 'Apple.Inc.', 'Apple.Inc.', 'apple.inc',
              'apple.inc', 'apple.inc'],
    'Ncount': [4, 4, 4, 3, 3, 3],
    'FRatio': [100, 100, 100, 100, 100, 100]
})
This is just one portion of the whole output from the function on this test data; the last row, "OMEGA", gives a fuzz ratio of about 18.
My function is like this:
from fuzzywuzzy import fuzz
import pandas as pd

def checkDupTitle2(data):
    Nname = []
    Ncount = []
    f_ratio = []
    for i in range(len(data)):
        current = 0
        for space in range(len(data) - 1 - current):
            ratio = fuzz.WRatio(str(data.loc[i]['name']).strip(),
                                str(data.loc[current + space]['name']).strip())
            Nname.append(str(data.loc[i]['name']).strip())
            Ncount.append(str(data.loc[i]['count']).strip())
            f_ratio.append(ratio)
    df = pd.DataFrame({
        'Nname': Nname,
        'Ncount': Ncount,
        'FRatio': f_ratio
    })
    return df
After running this function and getting the output, I tried to get what I eventually want. Here I tried a group by on the df created above:
output.groupby(output.FRatio > 90).sum()
But this way I still need a "name" in my DataFrame: how can I decide which name to use for the total count (say, 9 here)? "Apple.Inc." or "apple.inc" or "APPLE.INC"?
Or did I make this too complex? If there were a way to group by "name" from the very start and treat "Apple.Inc.", "apple.inc" and "APPLE.INC" as the same, my problem would be solved. I have been stumped for quite a while. Any help would be highly appreciated! Thanks!
The following code uses my library RapidFuzz instead of FuzzyWuzzy, since it is faster and has a process method, extractIndices, which helps here. This solution is quite a bit faster, but since I do not work with pandas regularly I am sure there are still some things that could be improved :)
import pandas as pd
from rapidfuzz import process, utils

def checkDupTitle(data):
    values = data.values.tolist()
    companies = [company for company, _ in values]
    pcompanies = [utils.default_process(company) for company in companies]
    counts = [count for _, count in values]

    results = []
    while companies:
        company = companies.pop(0)
        pcompany = pcompanies.pop(0)
        count = counts.pop(0)
        duplicates = process.extractIndices(
            pcompany, pcompanies,
            processor=None, score_cutoff=90, limit=None)
        for (i, _) in sorted(duplicates, reverse=True):
            count += counts.pop(i)
            del pcompanies[i]
            del companies[i]
        results.append([company, count])
    return pd.DataFrame(results, columns=['Nname', 'Ncount'])
test_data = pd.DataFrame({
    'name': ['Apple.Inc.', 'apple.inc', 'APPLE.INC', 'OMEGA'],
    'count': [4, 3, 2, 6]
})
checkDupTitle(test_data)
The result is:
pd.DataFrame({
    'Nname': ['Apple.Inc.', 'OMEGA'],
    'Ncount': [9, 6]
})
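One caution, as an assumption about later releases rather than a tested fact: newer RapidFuzz versions (2.x+) dropped process.extractIndices, and process.extract returns (match, score, index) triples instead, so the lookup inside the loop could be adapted roughly like this:

from rapidfuzz import process

# Assumed RapidFuzz >= 2.x equivalent of the extractIndices call above.
matches = process.extract(pcompany, pcompanies,
                          processor=None, score_cutoff=90, limit=None)
for i in sorted((idx for _, _, idx in matches), reverse=True):
    count += counts.pop(i)
    del pcompanies[i]
    del companies[i]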

How to save tuple output from a for loop to a DataFrame in Python

I have some data, 33k rows x 57 columns.
Some columns contain data that I want to translate using a dictionary.
I have done the translation, but now I want to write the translated data back to my data set.
I am having trouble saving the tuple output from the for loop.
I am using tuples to keep the translation intact; .join and .append are not working in my case. I have tried many approaches without any success.
Looking for any advice.
data = pd.read_csv(filepath, engine="python", sep=";", keep_default_na=False)

for index, row in data.iterrows():
    row["translated"] = tuple(slownik.get(znak) for znak in row["1st_service"])
I just want print(data["1st_service"]) to show the translated data, not the data from before the for loop.
First of all, if your csv doesn't already have a 'translated' column, you'll have to add it:
import numpy as np
data['translated'] = np.nan
The problem is that the row object you're writing to is only a view of the dataframe, not the dataframe itself. Plus, you're missing square brackets for your list comprehension, if I'm understanding what you're doing. So change your last line to:
data.loc[index, "translated"] = tuple([slownik.get(znak) for znak in row["1st_service"]])
and you'll get a tuple written into that one cell.
In future, posting the exact error message you're getting is very helpful!
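A minimal, self-contained illustration of the cell-assignment fix; the dictionary and values are invented for the demo, and .at (the single-cell variant of .loc) is used because it always treats the assigned tuple as one value:

import pandas as pd

slownik = {'a': 'A', 'b': 'B'}  # hypothetical translation dictionary
data = pd.DataFrame({'1st_service': ['ab', 'ba']})
data['translated'] = None  # object-dtype column, so a tuple fits in one cell

for index, row in data.iterrows():
    data.at[index, 'translated'] = tuple(slownik.get(znak) for znak in row['1st_service'])

print(data)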
I have managed it; below is the working code:
data = pd.read_csv(filepath, engine="python", sep=";", keep_default_na=False)
data.columns = []   # actual column names elided in the original post
slownik = dict([])  # actual translation dictionary elided in the original post
trans = ' '

for index, row in data.iterrows():
    trans += str(tuple([slownik.get(znak) for znak in row["1st_service"]]))

data['1st_service'] = trans.split(')(')
data.to_csv("out.csv", index=False)
Can you tell me if it is well done? Maybe there is a faster way to do it? I am doing it for 12 columns in one for loop, as shown above.
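As a hedged answer to that follow-up: the string concatenation and the trans.split(')(') round-trip can be skipped entirely by mapping each column with .apply (column name taken from the question; slownik assumed to exist):

# One .apply per column replaces the iterrows loop; repeat for the 12 columns.
data['1st_service'] = data['1st_service'].apply(
    lambda s: tuple(slownik.get(znak) for znak in s)
)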

Loading Pandas Data Frame into Excel using writer.save() and getting indexing error

I am aggregating a Pandas DataFrame using numpy's size function and then want to load the results into Excel using writer.save(). But I am getting the following error: NotImplementedError: Writing as Excel with a MultiIndex is not yet implemented.
My data looks something like this:
agt_id unique_id
abc123 ab12345
abc123 cd23456
abc123 de34567
xyz987 ef45678
xyz987 fg56789
My results should look like:
agt_id unique_id
abc123 3
xyz987 2
This is an example of my code:
df_agtvol = df_agt.groupby('agt_id').agg({'unique_id':[np.size]})
writer = pd.ExcelWriter(outfilepath, engine='xlsxwriter')
df_agtvol.to_excel(writer, sheet_name='agt_vols')
I have tried to reset the index by using:
df_agt_vol_final = df_agtvol.set_index([df_agtvol.index, 'agt_id'], inplace=True)  # note: returns None with inplace=True
based on some research, but am getting a completely different error.
I am relatively new to working with Pandas dataframes, so any help would be appreciated.
You don't need a MultiIndex. The reason you get one is because np.size is wrapped in a list.
Although not explicitly documented, Pandas interprets everything in the list as a subindex for 'unique_id'. This use case falls under the "nested dict of names -> dicts of functions" case in the linked documentation.
So:
df_agtvol = df_agt.groupby('agt_id').agg({'unique_id': [np.size]})
should be:
df_agtvol = df_agt.groupby('agt_id').agg({'unique_id': np.size})
This is still overly complicated and you can get the same results with a call to the count method.
df_agtvol = df_agt.groupby('agt_id').count()
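Putting the pieces together, a hedged end-to-end sketch; the sample data comes from the question, the file name is invented, and the context manager stands in for an explicit writer.save() call:

import pandas as pd

df_agt = pd.DataFrame({
    'agt_id': ['abc123', 'abc123', 'abc123', 'xyz987', 'xyz987'],
    'unique_id': ['ab12345', 'cd23456', 'de34567', 'ef45678', 'fg56789'],
})

# count() yields single-level columns, so to_excel no longer raises.
df_agtvol = df_agt.groupby('agt_id').count()

with pd.ExcelWriter('agt_vols.xlsx', engine='xlsxwriter') as writer:
    df_agtvol.to_excel(writer, sheet_name='agt_vols')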
