I am trying to use Python to sum the values in a column, but only for the period during which a certain condition is met.
The summation should begin when the conditions are met (runstat == 0 and oil > 0) and stop at the point where oil == 0.
I am new to Python, so I am not sure how to do this.
I connected the code to a spreadsheet for testing purposes, but the intent is to connect to live data. I figured a while loop combined with an if statement might work, but I am not getting anywhere.
Basically, I want the code to start summing when runstat is zero and oil is higher than 0. It should stop summing the values of oil when the oil value becomes zero, and then write the result to a SQL database (this I will figure out later; for now I just want to see if it can work).
This is the code I have tried so far.
import numpy as np
import pandas as pd
data = pd.read_excel('TagValues.xlsx')
df = pd.DataFrame(data)
df['oiltag'] = df['oiltag'].astype(float)
df['runstattag'] = df['runstattag'].astype(float)
oil = df['oiltag']
runstat = df['runstattag']
def startup(oil,runstat):
    while oil.all() > 0:
        if oil > 0 and runstat == 0:
            totaloil = sum(oil.all())
            print(totaloil)
        else:
            return None
    return
print(startup(oil.all(), runstat.all()))
It should sum the values in the column but it is returning: None
OK, so I think that what you want to do is get the subset of rows between the two conditions, then get a sum of those.
Method: Slice the dataframe to get the relevant rows and then sum.
import numpy as np
import pandas as pd
data = pd.read_excel('TagValues.xlsx')
df = pd.DataFrame(data)
df['oiltag'] = df['oiltag'].astype(float)
df['runstattag'] = df['runstattag'].astype(float)
def startup(dframe):
    start_row = dframe[(dframe.oiltag > 0) & (dframe.runstattag == 0)].index[0]
    end_row = dframe[(dframe.oiltag == 0) & (dframe.index > start_row)].index[0]
    subset = dframe[start_row:end_row+1] # +1 because the end slice is non-inclusive
    totaloil = subset.oiltag.sum()
    return totaloil
print(startup(df))
This code will raise an error if it can't find a subset of rows which match your criteria. If you need to handle that case, then we could add some exception handling.
EDIT: Please note that this assumes your criteria are met only once per Excel file. If there are multiple “chunks” that you want to sum, this will need tweaking.
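If the start and stop conditions can occur more than once in the data, one possible tweak is a single pass over the rows that keeps a running total per run. The sketch below is only an illustration under stated assumptions: the function name sum_oil_runs and the sample values are made up, and a run that never reaches oiltag == 0 is left out of the result. It returns an empty list instead of raising when no rows match.

```python
import pandas as pd


def sum_oil_runs(dframe):
    """Return one total per run (hypothetical helper, not from the original answer).

    A run starts at a row where oiltag > 0 and runstattag == 0,
    and ends at the next row where oiltag == 0.
    """
    totals = []
    running = False
    total = 0.0
    for oil, runstat in zip(dframe["oiltag"], dframe["runstattag"]):
        if not running:
            # Wait for the start condition.
            if oil > 0 and runstat == 0:
                running = True
                total = oil
        elif oil == 0:
            # Stop condition reached: store the finished run.
            totals.append(total)
            running = False
        else:
            total += oil
    return totals


# Made-up sample data: two complete runs and one run that never ends.
sample = pd.DataFrame({
    "oiltag":     [0.0, 2.0, 3.0, 0.0, 0.0, 5.0, 1.0, 0.0, 4.0],
    "runstattag": [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0],
})
print(sum_oil_runs(sample))
```

Whether an unfinished run should be reported or dropped depends on how the live data is consumed, so treat that choice as a placeholder.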
What I want to do is group all similar strings in one column and sum their
corresponding counts if they are similar; otherwise, leave them as they are.
This is a little similar to the following post, but I have not been able to apply it to my case:
How to group Pandas data frame by column with regex match
I ended up with the following steps:
I wrote a function that prints fuzz.WRatio for each row's string,
where each row does a linear search from the top to check whether there are other similar
strings in the remaining rows. If the WRatio > 90, I would like to sum those rows'
corresponding counts. Otherwise, leave them as they are.
I created test data that looks like this:
test_data=pd.DataFrame({
    'name':['Apple.Inc.','apple.inc','APPLE.INC','OMEGA'],
    'count':[4,3,2,6]
})
So what I want is a resulting dataframe like:
result=pd.DataFrame({
    'Nname':['Apple.Inc.','OMEGA'],
    'Ncount':[9,6]
})
My function so far only gives me the fuzz ratio for each row.
As I understand it, each row is compared three times (here we have four rows),
so my function's output looks like:
pd.DataFrame({
    'Nname':['Apple.Inc.','Apple.Inc.','Apple.Inc.','apple.inc',\
             'apple.inc','apple.inc'],
    'Ncount':[4,4,4,3,3,3],
    'FRatio': [100,100,100,100,100,100] })
This is just one portion of the whole output of the function I wrote, using this test data.
The last row, "OMEGA", gives a fuzz ratio of about 18.
My function looks like this:
def checkDupTitle2(data):
    Nname=[]
    Ncount=[]
    f_ratio=[]
    for i in range(0, len(data)):
        current=0
        count=0
        space=0
        for space in range(0, len(data)-1-current):
            ratio=fuzz.WRatio(str(data.loc[i]['name']).strip(), \
                              str(data.loc[current+space]['name']).strip())
            Nname.append(str(data.loc[i]['name']).strip())
            Ncount.append(str(data.loc[i]['count']).strip())
            f_ratio.append(ratio)
    df=pd.DataFrame({
        'Nname': Nname,
        'Ncount': Ncount,
        'FRatio': f_ratio
    })
    return df
After running this function and getting the output,
I tried to get to what I eventually want.
Here I tried a groupby on the df created above:
output.groupby(output.FRatio>90).sum()
But this way I still need a "name" in my dataframe.
How can I decide which name to use for the total count (say, 9 here):
"Apple.Inc", "apple.inc" or "APPLE.INC"?
Or did I make it too complex?
Is there a way to group by "name" from the very start and treat "Apple.Inc.", "apple.inc" and "APPLE.INC" as the same? That would solve my problem. I have been stumped for quite a while. Any help would be highly
appreciated! Thanks!
The following code uses my library RapidFuzz instead of FuzzyWuzzy, since it is faster and has a process method, extractIndices, which helps here. This solution is quite a bit faster, but since I do not work with pandas regularly, I am sure there are still some things that could be improved :)
import pandas as pd
from rapidfuzz import process, utils
def checkDupTitle(data):
    values = data.values.tolist()
    companies = [company for company, _ in values]
    pcompanies = [utils.default_process(company) for company in companies]
    counts = [count for _, count in values]
    results = []
    while companies:
        company = companies.pop(0)
        pcompany = pcompanies.pop(0)
        count = counts.pop(0)
        duplicates = process.extractIndices(
            pcompany, pcompanies,
            processor=None, score_cutoff=90, limit=None)
        for (i, _) in sorted(duplicates, reverse=True):
            count += counts.pop(i)
            del pcompanies[i]
            del companies[i]
        results.append([company, count])
    return pd.DataFrame(results, columns=['Nname','Ncount'])
test_data=pd.DataFrame({
'name':['Apple.Inc.','apple.inc','APPLE.INC','OMEGA'],
'count':[4,3,2,6]
})
checkDupTitle(test_data)
The result is
pd.DataFrame({
    'Nname':['Apple.Inc.','OMEGA'],
    'Ncount':[9,6]
})
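For data like the test frame in the question, where the variants differ only in letter case and punctuation, a normalised grouping key may be enough without any fuzzy matching. The following is a minimal sketch under that assumption; the normalisation rule (lowercase, drop every non-alphanumeric character) is a hypothetical choice and will not catch real spelling differences, which is where a fuzzy matcher is still needed.

```python
import pandas as pd

test_data = pd.DataFrame({
    "name": ["Apple.Inc.", "apple.inc", "APPLE.INC", "OMEGA"],
    "count": [4, 3, 2, 6],
})

# Hypothetical normalisation: lowercase, then strip everything except a-z and 0-9.
key = (
    test_data["name"]
    .str.lower()
    .str.replace(r"[^a-z0-9]", "", regex=True)
    .rename("key")
)

# Group on the normalised key, keep the first spelling seen as the display name
# and sum the counts of every row that shares the key.
result = (
    test_data.groupby(key, sort=False)
    .agg(Nname=("name", "first"), Ncount=("count", "sum"))
    .reset_index(drop=True)
)
print(result)
```

Keeping the first spelling is one way to answer the "which name do I use" question; any other rule (most frequent, longest) could be swapped into the aggregation.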
I have a small df (173, 21).
I wrote a function that works, however I am using apply() and I would like to, if possible,
do it another way only because of apply()'s reputation for being slow.
On this particular data set it doesn't matter at all as it is so small, but I am trying
to avoid apply() if possible.
The function takes in a row, checks each of five columns (see code below), and increments
a counter whenever the value in a cell is 'YES'. Possible cell values are 'YES', 'NO' or 'NaN'.
Here is the working code:
def clean_deaths(row):
    num_deaths = 0
    columns = ['Death1', 'Death2', 'Death3', 'Death4', 'Death5']
    for c in columns:
        death = row[c]
        if death == 'YES':
            num_deaths += 1
    return num_deaths
true_avengers['Deaths'] = true_avengers.apply(clean_deaths, axis=1)
total = true_avengers['Deaths'].sum()
print(total, '\n') # 88
You are right: you should avoid apply(..., axis=1).
Try this:
true_avengers['Deaths'] = (true_avengers[['Death1', 'Death2', 'Death3', 'Death4', 'Death5']] =='YES').sum(axis=1)
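To see why this works, here is a small self-contained sketch on made-up rows (the real true_avengers frame is not shown in the question, so the values below are assumptions). Comparing the five columns with 'YES' yields a boolean frame, and summing booleans across axis=1 counts the True cells per row; NaN cells compare as False.

```python
import numpy as np
import pandas as pd

# Made-up sample rows standing in for the real true_avengers frame.
true_avengers = pd.DataFrame({
    "Death1": ["YES", "NO", "YES"],
    "Death2": ["YES", np.nan, "NO"],
    "Death3": [np.nan, np.nan, "YES"],
    "Death4": [np.nan, np.nan, np.nan],
    "Death5": [np.nan, np.nan, np.nan],
})

death_cols = ["Death1", "Death2", "Death3", "Death4", "Death5"]

# Boolean frame: True where the cell equals 'YES', False otherwise (including NaN).
is_death = true_avengers[death_cols].eq("YES")

# Row-wise count of True values, with no Python-level loop over rows.
true_avengers["Deaths"] = is_death.sum(axis=1)
total = int(true_avengers["Deaths"].sum())
print(total)
```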
Suppose I have 3 dataframe variables: res_df_union is the main dataframe and df_res and df_vacant are subdataframes created from res_df_union. They all share 2 columns called uniqueid and vacant_has_res. My goal is to compare the uniqueid column values in df_res and df_vacant, and if they match, to assign vacant_has_res in res_df_union with the value of 1.
*Note: I am using geoPandas (gpd Dataframe) instead of just pandas because I am working with spatial data but the concept is the same.
res_df_union = gpd.read_file(union, layer=union_feat)
df_parc_res = res_df_union[res_df_union.Parc_Res == 1]
unq_id_res = df_parc_res.uniqueid.unique()
df_parc_vacant = res_df_union[res_df_union.Parc_Vacant == 1]
unq_id_vac = df_parc_vacant.uniqueid.unique()
vacant_res_ids = []
for id_a in unq_id_vac:
    for id_b in unq_id_res:
        if id_a == id_b:
            vacant_res_ids.append(id_a)
The code up to this point works. I have a list of uniqueid's that match. Now I just want to look for those unique id's in res_df_union and then assign res_df_union['vacant_has_res'] = 1. When I run the following, it either causes my IDE to crash, or never finishes running (after several hours). What am I doing wrong and is there a more efficient way to do this?
def u_row(row, id_val):
    if row['uniqueid'] == id_val:
        return 1
for item in res_df_union['uniqueid']:
    if item in vacant_res_ids:
        res_df_union['Has_Res_Association'] = res_df_union.apply(lambda row: u_row(row, item), axis = 1)
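One likely reason this is so slow: for every matching item, the loop re-runs apply over the whole frame and overwrites the entire column, so the work grows roughly with the square of the row count and only the last match survives. A possible vectorised alternative is sketched below using plain pandas and made-up sample rows (a GeoDataFrame should behave the same for these column operations, though that is an assumption here). It replaces the nested loops with a set intersection and Series.isin.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the frame read with gpd.read_file in the question.
res_df_union = pd.DataFrame({
    "uniqueid":    [1, 2, 3, 4, 2],
    "Parc_Res":    [1, 1, 0, 0, 0],
    "Parc_Vacant": [0, 0, 1, 0, 1],
})

# Unique ids of residential rows and of vacant rows.
unq_id_res = res_df_union.loc[res_df_union["Parc_Res"] == 1, "uniqueid"].unique()
unq_id_vac = res_df_union.loc[res_df_union["Parc_Vacant"] == 1, "uniqueid"].unique()

# Ids present in both groups, without a nested Python loop.
vacant_res_ids = np.intersect1d(unq_id_res, unq_id_vac)

# Flag every row whose id is in the intersection: 1 if matched, 0 otherwise.
res_df_union["vacant_has_res"] = (
    res_df_union["uniqueid"].isin(vacant_res_ids).astype(int)
)
print(res_df_union)
```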
I'm trying to only import part of a CSV file to a list.
In short, the CSV I receive contains two columns [depth and speed]. Depth always starts at zero, gets larger and then goes back to zero again.
I would like to add the first part of the CSV to the list (depth 0-13+). I then want to add the second part of the CSV (13-0) to another list.
I assume a for loop would be the way to go, but I don't know how to check each row for ascending/descending numbers.
pullData = open("svp3.csv","r").read()
dataArray = pullData.split('\n')
depthArrayY = []
speedArrayX = []
depthArrayLength = len(depthArrayY)
for eachLine in dataArray:
    if len(eachLine)>1:
        x,y = eachLine.split(',')
        speedArrayX.append(round(float(x), 2))
        depthArrayY.append(round(float(y), 2))
I'd suggest using pandas; I think it will let you do much more when you need to work with imported data.
import pandas as pd
df = pd.read_csv('svp3.csv')
tmp = df[df.depth <= df.depth.shift(-1)].values
depth_increase = tmp[:,0]
speed_while_depth_increase = tmp[:,1]
tmp = df[df.depth > df.depth.shift(-1)].values
depth_decrease = tmp[:,0]
speed_while_depth_decrease = tmp[:,1]
I assumed that your CSV has the depth column first and then the speed column.
The depth column has values from 0 up to a certain maximum, say 14, and then from 13 back down, e.g. depth column -> [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,13,12,11,10,9,8,7,6,5,4,3,2,1],
and I populated the speed column with some random values.
The following code uses the pandas library and splits the depth column into two lists of ascending and descending values, using the simple logic of storing the current maximum value to determine when the ascending part of the column ends.
import pandas as pd
data = pd.read_csv('svp3.csv')
max_val = -10000
depthArrayAscendingY = []
speedArrayX = []
depthArrayDescendingY = []
for a in data.values:
    if a[0]>max_val:
        depthArrayAscendingY.append(a[0])
        speedArrayX.append(a[1])
        max_val = a[0]
    else:
        depthArrayDescendingY.append(a[0])
        speedArrayX.append(a[1])
The answer to this question by Baleato is more efficient and cleaner than this one; you should definitely check their answer.
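Another way to express the same split, shown here only as a sketch, is to locate the row with the maximum depth and slice the frame at that point. It assumes the depth rises to a single turning point and then falls, that the frame has the default integer index, and that the columns are named depth and speed; the sample values are made up.

```python
import pandas as pd

# Made-up stand-in for pd.read_csv('svp3.csv').
df = pd.DataFrame({
    "depth": [0, 1, 2, 3, 2, 1, 0],
    "speed": [1500.0, 1501.5, 1503.0, 1504.2, 1503.1, 1501.4, 1500.1],
})

# Index label of the deepest row (first occurrence if the maximum repeats).
turn = df["depth"].idxmax()

# .loc slicing is label based and inclusive, so the deepest row
# ends up in the first part only.
going_down = df.loc[:turn]
coming_up = df.loc[turn + 1:]

depth_down = going_down["depth"].tolist()
speed_down = going_down["speed"].tolist()
depth_up = coming_up["depth"].tolist()
speed_up = coming_up["speed"].tolist()
print(depth_down, depth_up)
```

Unlike a row-by-row comparison, this keeps each speed value paired with its own depth in both halves.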
Yes this question has been asked many times! No, I have still not been able to figure out how to run this boolean filter without generating the Pandas SettingWithCopyWarning warning.
for x in range(len(df_A)):
    df_C = df_A.loc[(df_A['age'] >= df_B['age_limits'].iloc[x][0]) &
                    (df_A['age'] <= df_B['age_limits'].iloc[x][1])]
    df_D['count'].iloc[x] = len(df_C) # triggers warning
I've tried:
Copying df_A and df_B in every possible place
Using a mask
Using a query
I know I can suppress the warning, but I don't want to do that.
What am I missing? I know it's probably something obvious.
Many thanks!
For more details on why you got SettingWithCopyWarning, I would suggest reading this answer. It is mostly because selecting the column df_D['count'] and then using iloc[x] is a "chained assignment", which is flagged this way.
To prevent it, you can get the position of the column you want in df_D and then use iloc for both the row and the column in the for loop:
pos_col_D = df_D.columns.get_loc('count')
for x in range(len(df_A)):
    df_C = df_A.loc[(df_A['age'] >= df_B['age_limits'].iloc[x][0]) &
                    (df_A['age'] <= df_B['age_limits'].iloc[x][1])]
    df_D.iloc[x, pos_col_D] = len(df_C) # no more warning
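As a tiny illustration of the single-indexer pattern on its own, the sketch below writes values into a made-up df_D (its label column and the values being written are assumptions, not taken from the question). The point is that the row and the column are selected in one iloc call on the frame itself, so there is no intermediate object to assign into.

```python
import pandas as pd

# Made-up target frame; only the 'count' column name comes from the question.
df_D = pd.DataFrame({"label": ["a", "b", "c"], "count": [0, 0, 0]})

# Integer position of the 'count' column, so iloc can address it.
pos_col_D = df_D.columns.get_loc("count")

# One indexing operation per write: row position and column position together.
for x, value in enumerate([3, 2, 1]):
    df_D.iloc[x, pos_col_D] = value

print(df_D)
```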
Also, because you compare all the values of df_A.age with the bounds in df_B.age_limits, I think you could improve the speed of your code by using numpy.ufunc.outer, with the ufuncs being greater_equal and less_equal, and then summing over axis=0.
#Setup
import numpy as np
import pandas as pd
df_A = pd.DataFrame({'age': [12,25,32]})
df_B = pd.DataFrame({'age_limits':[[3,99], [20,45], [15,30]]})
#your result
for x in range(len(df_A)):
    df_C = df_A.loc[(df_A['age'] >= df_B['age_limits'].iloc[x][0]) &
                    (df_A['age'] <= df_B['age_limits'].iloc[x][1])]
    print (len(df_C))
3
2
1
#with numpy
print ( ( np.greater_equal.outer(df_A.age, df_B.age_limits.str[0])
& np.less_equal.outer(df_A.age, df_B.age_limits.str[1]))
.sum(0) )
array([3, 2, 1])
So you can assign the result of the previous expression directly to df_D['count'] without a for loop. Hope this works for you.
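Put together, the loop-free assignment might look like the sketch below. It converts the columns to NumPy arrays before calling outer, since some pandas versions may reject a Series passed to ufunc.outer; the df_D frame is made up here because the question does not show how it is built.

```python
import numpy as np
import pandas as pd

df_A = pd.DataFrame({"age": [12, 25, 32]})
df_B = pd.DataFrame({"age_limits": [[3, 99], [20, 45], [15, 30]]})
# Hypothetical target frame with one row per age range.
df_D = pd.DataFrame({"count": np.zeros(len(df_B), dtype=int)})

ages = df_A["age"].to_numpy()
# Turn the column of [low, high] lists into a (n_ranges, 2) array.
limits = np.array(df_B["age_limits"].tolist())
lower, upper = limits[:, 0], limits[:, 1]

# in_range[i, j] is True when age i falls inside range j (bounds inclusive).
in_range = np.greater_equal.outer(ages, lower) & np.less_equal.outer(ages, upper)

# Summing over axis 0 counts, for each range, how many ages fall inside it.
df_D["count"] = in_range.sum(axis=0)
print(df_D)
```

The intermediate array has one cell per (age, range) pair, so for very large inputs memory use is worth keeping in mind.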