Calculating across rows and columns at same time - python-3.x

I am trying to do some calculations across rows and columns in Python, and it is taking a painfully long time to execute for a large dataset. The calculation looks like this:
import numpy as np
import pandas as pd

Df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2, 2],
                   'unit': [1, 2, 1, 1, 1, 1, 2],
                   'D1': [100, 100, 100, 200, 300, 400, 3509],
                   'D2': [200, 200, 200, 300, 300, 400, 2500],
                   'D3': [50, 50, 50, 60, 50, 67, 98],
                   'Level1': [1, 4, 0, 4, 4, 4, 5],
                   'Level2': [45, 3, 0, 6, 7, 8, 9],
                   'Level3': [0, 0, 34, 8, 7, 0, 5]
                   })
For each value of A (in the above example, A=1 and A=2) I run a function sequentially; i.e., I cannot run the function for A=1 and A=2 at the same time, since the outcome for A=1 changes some other values used for A=2. I calculate a score as:
def score(data):
    data['score_Level1'] = np.where(data['Level1'] >= data['unit'], data['unit'], 0) * (((np.where(data['Level1'] >= data['unit'], data['unit'], 0)).sum() * 100) + (10 / data['D1']))
    data['score_Level2'] = np.where(data['Level2'] >= data['unit'], data['unit'], 0) * (((np.where(data['Level2'] >= data['unit'], data['unit'], 0)).sum() * 100) + (10 / data['D2']))
    data['score_Level3'] = np.where(data['Level3'] >= data['unit'], data['unit'], 0) * (((np.where(data['Level3'] >= data['unit'], data['unit'], 0)).sum() * 100) + (10 / data['D3']))
    return data
The above code goes row by row and computes a score for each Leveli (i = 1, 2, 3) as follows:
Step 1:
Compare the value of Leveli with the corresponding 'unit' column: if Leveli >= unit, take unit, else 0.
Step 2:
Sum the result of Step 1 across all rows for Leveli, multiply by 100 and add 10/Di. Let's call this "S".
Step 3:
Go row by row again and assign the score for Leveli as:
Step 1 * Step 2 (i.e., S) for each row.
Calling the function for A=1:
score(Df[Df['A'] == 1])
should yield the results below. I am listing only the scoring for Level1; the same thing happens for Level2 and Level3.
Step 1:
Compare 1 >= 1 = True, yields 1; 4 >= 2 = True, yields 2; 0 >= 1 = False, yields 0
Step 2:
(1 + 2 + 0) * 100 + 10/100 = 300.1
Step 3:
Compare 1 >= 1 = True, yields 1 * 300.1 = 300.1
Compare 4 >= 2 = True, yields 2 * 300.1 = 600.2
Compare 0 >= 1 = False, yields 0 * 300.1 = 0
I am doing this for 200 million values of A. Since it has to be done sequentially (A=n depends on the outcome of A=n-1), it is taking a long time to compute.
Any suggestion for making it faster is much appreciated.
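For clarity, a minimal sketch of the sequential driver loop described above (purely illustrative; the question does not show how the outcome of A=n-1 feeds into A=n, so that part is omitted):

# Hypothetical driver: process each value of A in order, calling score() on its subset.
results = []
for a_value in sorted(Df['A'].unique()):
    results.append(score(Df[Df['A'] == a_value].copy()))
scored = pd.concat(results)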

I think you can avoid the np.where calls, which should make it run faster.
Can you please try this code:
def score2(data, score_field, level_field, d_field):
    indexer = data[level_field] >= data['unit']
    data[score_field] = 0.0
    # parentheses so this matches the original formula: unit * (sum(unit) * 100 + 10 / D)
    data.loc[indexer, score_field] = data['unit'] * (data.loc[indexer, 'unit'].sum() * 100 + 10 / data[d_field])
    return data

score2(Df, 'score_Level1', 'Level1', 'D1')
score2(Df, 'score_Level2', 'Level2', 'D2')
score2(Df, 'score_Level3', 'Level3', 'D3')
The .loc in combination with the boolean indexer replaces the where. On the left side of the assignment it only sets the values for the rows in which the level field is greater than or equal to unit. All the others stay as they are; without the line data[score_field] = 0.0 they would contain NaN.
By the way, pandas has its own .where method, which works on Series. It is slightly different from the numpy implementation.
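For illustration, a small example of the pandas equivalent of the np.where call above: Series.where keeps the value where the condition is True and substitutes the second argument where it is False (the name kept_units is just a placeholder):

# pandas equivalent of np.where(Df['Level1'] >= Df['unit'], Df['unit'], 0)
kept_units = Df['unit'].where(Df['Level1'] >= Df['unit'], 0)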

Related

Apache Beam Combine vs GroupByKey

So, I'm facing this seems-to-be-classic problem: extracting the timeframed toppers from an unbounded stream,
using Apache Beam (Flink as the engine).
Assuming an input of site+hits tuples:
{"aaa.com", 1001}, {"bbb.com", 21}, {"aaa.com", 1002}, {"ccc.com", 3001}, {"bbb.com", 22} ....
(Expected rate: 100K+ entries per hour)
The goal is to output the sites which account for >1% of total hits in each 1-hour timeframe,
i.e. for a 1-hour fixed window, pick the sites whose summed hits are >1% of the total hits.
So first, sum by key:
{"aaa.com", 2003}, {"bbb.com", 43}, {"ccc.com", 3001} ....
And finally output the >1%:
{"aaa.com"}, {"ccc.com"}
Alternatives:
1) Group + ParDo:
Fixed time windowing of 1 hour, group all elements, followed by an iterable ParDo over all the window's elements that calculates the sum and outputs the >1% sites.
The cons seem to be that the whole aggregation is done in a single thread, and it also seems to require two passes: one to get the sum and one to find the >1% sites.
2) GroupByKey + Combine:
Fixed time windowing of 1 hour, GroupByKey using key=site, then apply a Combine with a custom accumulator to sum the hits per key.
Although the Combine option (#2) seems more suitable,
I'm missing the part about getting the sum per 1-hour window, which is needed to calculate the >1% elements:
Can the same window be used for two combines, one per key and one for the total hits in this window?
And what is the best approach to combine them both to make the >1% call per element?
Thanks.
You can do this via side inputs. For instance, you'd do something like this (the code is in Python, but the answer for Java is similar):
input_data = ....  # ("aaa.com", 1001), ("bbb.com", 21), ("aaa.com", 1002), ("ccc.com", 3001), ("bbb.com", 22) ....

total_per_key = input_data | beam.CombinePerKey(sum)

global_sum_per_window = beam.pvalue.AsSingleton(
    input_data
    | beam.Values()
    | beam.CombineGlobally(sum).without_defaults())

def find_more_than_1pct(elem, global_sum):
    key, value = elem
    if value > global_sum * 0.01:
        yield elem

# The singleton side input is passed as an extra argument to FlatMap.
over_1_pct_keys = total_per_key | beam.FlatMap(find_more_than_1pct, global_sum_per_window)
In this case, the global_sum_per_window PCollection will have one value for each window, and total_per_key will have one value per key per window.
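If the input PCollection is not already windowed, a minimal sketch of applying the 1-hour fixed window from the question (raw_data is just a placeholder name, not part of the original answer):

import apache_beam as beam
from apache_beam.transforms import window

# Assign each element to a 1-hour fixed window before the combines above.
input_data = raw_data | beam.WindowInto(window.FixedWindows(60 * 60))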
Let me know if that works!

Sum all counts when their fuzz.WRatio > 90 otherwise leave intact

What I want to do is group all similar strings in one column and sum their
corresponding counts if they are similar; otherwise, leave them alone.
This is a little similar to this post, but unfortunately I have not been able to apply it to my case:
How to group Pandas data frame by column with regex match
So I ended up with the following steps:
I wrote a function that prints out the fuzz.WRatio for each row's string,
where each row does a linear search from the top to check whether there are other similar
strings in the rest of the rows. If the WRatio > 90, I would like to sum those rows'
corresponding counts; otherwise, leave them as they are.
I created test data that looks like this:
test_data = pd.DataFrame({
    'name': ['Apple.Inc.', 'apple.inc', 'APPLE.INC', 'OMEGA'],
    'count': [4, 3, 2, 6]
})
So what I want is a result DataFrame like:
result = pd.DataFrame({
    'Nname': ['Apple.Inc.', 'OMEGA'],
    'Ncount': [9, 6]
})
My function so far only gives me the fuzz ratio for each row,
and my understanding is that
each row is compared three times (here we have four rows).
So my function's output looks like:
pd.DataFrame({
    'Nname': ['Apple.Inc.', 'Apple.Inc.', 'Apple.Inc.', 'apple.inc',
              'apple.inc', 'apple.inc'],
    'Ncount': [4, 4, 4, 3, 3, 3],
    'FRatio': [100, 100, 100, 100, 100, 100]})
This is just one portion of the whole output from the function I wrote with this test data.
The last row, "OMEGA", gives me a fuzz ratio of about 18.
My function is like this:
from fuzzywuzzy import fuzz  # fuzz.WRatio comes from FuzzyWuzzy

def checkDupTitle2(data):
    Nname = []
    Ncount = []
    f_ratio = []
    for i in range(0, len(data)):
        current = 0
        count = 0
        space = 0
        for space in range(0, len(data) - 1 - current):
            ratio = fuzz.WRatio(str(data.loc[i]['name']).strip(),
                                str(data.loc[current + space]['name']).strip())
            Nname.append(str(data.loc[i]['name']).strip())
            Ncount.append(str(data.loc[i]['count']).strip())
            f_ratio.append(ratio)
    df = pd.DataFrame({
        'Nname': Nname,
        'Ncount': Ncount,
        'FRatio': f_ratio
    })
    return df
So after running this function and getting the output,
I tried to get what I eventually want.
Here I tried a group-by on the df created above:
output.groupby(output.FRatio > 90).sum()
But this way I still need a "name" in my dataframe:
how do I decide which name to use for this total count, say, 9 here?
"Apple.Inc." or "apple.inc" or "APPLE.INC"?
Or did I make it too complex?
If there were a way to group by "name" up front and treat "Apple.Inc.", "apple.inc" and "APPLE.INC" as the same, then my problem would be solved. I have been stumped for quite a while. Any help would be highly
appreciated! Thanks!
The following code uses my library RapidFuzz instead of FuzzyWuzzy, since it is faster and it has a process method, extractIndices, which helps here. This solution is quite a bit faster, but since I do not work with pandas regularly I am sure there are still some things that could be improved :)
import pandas as pd
from rapidfuzz import process, utils

def checkDupTitle(data):
    values = data.values.tolist()
    companies = [company for company, _ in values]
    pcompanies = [utils.default_process(company) for company in companies]
    counts = [count for _, count in values]
    results = []

    while companies:
        company = companies.pop(0)
        pcompany = pcompanies.pop(0)
        count = counts.pop(0)
        duplicates = process.extractIndices(
            pcompany, pcompanies,
            processor=None, score_cutoff=90, limit=None)
        for (i, _) in sorted(duplicates, reverse=True):
            count += counts.pop(i)
            del pcompanies[i]
            del companies[i]
        results.append([company, count])

    return pd.DataFrame(results, columns=['Nname', 'Ncount'])

test_data = pd.DataFrame({
    'name': ['Apple.Inc.', 'apple.inc', 'APPLE.INC', 'OMEGA'],
    'count': [4, 3, 2, 6]
})

checkDupTitle(test_data)
The result is:
pd.DataFrame({
    'Nname': ['Apple.Inc.', 'OMEGA'],
    'Ncount': [9, 6]
})
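As an aside, addressing the simpler idea raised in the question (normalize the names first and then group by them): a rough sketch that only works when the variants differ by case and trailing punctuation rather than by genuinely fuzzy differences:

# Hypothetical normalization: lowercase, strip trailing dots, then group and sum.
normalized = test_data['name'].str.lower().str.rstrip('.')
test_data.groupby(normalized, sort=False)['count'].sum()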

Using while and if function together with a condition change

I am trying to use Python to do a calculation that sums the values in a column only for the time period during which a certain condition is met.
The summation should begin when the conditions are met (runstat == 0 and oil > 1) and should then stop at the point when oil == 0.
I am new to Python, so I am not sure how to do this.
I connected the code to a spreadsheet for testing purposes, but the intent is to connect it to live data. I figured a while loop in combination with an if statement might work, but I am not winning.
Basically I want the code to start when runstat is zero and oil is higher than 0. It should stop summing the values of oil when the oil row becomes zero, and then it should write the data to a SQL database (this I will figure out later; for now I just want to see if it can work).
This is the code I have tried so far:
import numpy as np
import pandas as pd

data = pd.read_excel('TagValues.xlsx')
df = pd.DataFrame(data)
df['oiltag'] = df['oiltag'].astype(float)
df['runstattag'] = df['runstattag'].astype(float)

oil = df['oiltag']
runstat = df['runstattag']

def startup(oil, runstat):
    while oil.all() > 0:
        if oil > 0 and runstat == 0:
            totaloil = sum(oil.all())
            print(totaloil)
        else:
            return None
    return

print(startup(oil.all(), runstat.all()))
It should sum the values in the column, but it is returning None.
OK, so I think that what you want to do is get the subset of rows between the two conditions, then get a sum of those.
Method: Slice the dataframe to get the relevant rows and then sum.
import numpy as np
import pandas as pd

data = pd.read_excel('TagValues.xlsx')
df = pd.DataFrame(data)
df['oiltag'] = df['oiltag'].astype(float)
df['runstattag'] = df['runstattag'].astype(float)

def startup(dframe):
    start_row = dframe[(dframe.oiltag > 0) & (dframe.runstattag == 0)].index[0]
    end_row = dframe[(dframe.oiltag == 0) & (dframe.index > start_row)].index[0]
    subset = dframe[start_row:end_row + 1]  # +1 because the end slice is non-inclusive
    totaloil = subset.oiltag.sum()
    return totaloil

print(startup(df))
This code will raise an error if it can't find a subset of rows matching your criteria. If you need to handle that case, we could add some exception handling.
EDIT: Please note that this assumes your criteria are only expected to occur once per Excel file. If you have multiple "chunks" that you want to sum, then this will need tweaking.
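For reference, a minimal sketch of the exception handling mentioned above, assuming the failure mode is the IndexError raised by index[0] when no rows match:

def startup_safe(dframe):
    # Return None instead of raising if the start/end condition is never met.
    try:
        return startup(dframe)
    except IndexError:
        return None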

Slow loop aggregating rows and columns

I have a DataFrame with a column named 'UserNbr' and a column named 'Spclty', which is composed of elements like this:
[['104', '2010-01-31'], ['215', '2014-11-21'], ['352', '2016-07-13']]
where there can be 0 or more elements in the list.
Some UserNbr keys appear in multiple rows, and I wish to collapse each such group into a single row such that 'Spclty' contains all the unique dicts like those in the list shown above.
To save the overhead of appending to a DataFrame, I'm appending each output row to a list instead of to the DataFrame.
My code is working, but it's taking hours to run on 0.7M rows of input. (Actually, I've never been able to keep my laptop open long enough for it to finish executing.)
Is there a better way to aggregate into a structure like this, maybe using a library that provides more data-reshaping options instead of looping over UserNbr? (In R, I'd use the data.table and dplyr libraries.)
# loop over all UserNbr:
#   consolidate specialty fields into dict-like sets (to remove redundant codes);
#   output one row per user to a new data frame
out_rows = list()
spcltycol = df_tmp.columns.get_loc('Spclty')
all_UserNbr = df_tmp['UserNbr'].unique()

for user in all_UserNbr:
    df_user = df_tmp.loc[df_tmp['UserNbr'] == user]
    if df_user.shape[0] > 0:
        open_combined = df_user.iloc[0, spcltycol]       # capture 1st row
        for row in range(1, df_user.shape[0]):           # union with any subsequent rows
            open_combined = open_combined.union(df_user.iloc[row, spcltycol])
        new_row = df_user.drop(['Spclty', 'StartDt'], axis=1).iloc[0].tolist()
        new_row.append(open_combined)
        out_rows.append(new_row)

# construct new dataframe with no redundant UserID rows:
df_out = pd.DataFrame(out_rows,
                      columns=['UserNbr', 'Spclty'])

# convert Spclty sets to dicts:
df_out['Spclty'] = [dict(df_out['Spclty'][row]) for row in range(df_out.shape[0])]
The conversion to dict gets rid of specialties that are repeated between rows. In the output, a Spclty value should look like this:
{'104': '2010-01-31', '215': '2014-11-21', '352': '2016-07-13'}
except that there may be more key-value pairs than in any corresponding input row (resulting from aggregation over UserNbr).
I withdraw this question.
I had hoped there was an efficient way to use groupby with something else, but I haven't found any examples with a complex data structure like this one and have received no guidance.
For anyone who gets similarly stuck with very slow aggregation problems in Python, I suggest stepping up to PySpark. I am now tackling this problem with a Databricks notebook and am making headway with the pyspark.sql.window Window functions. (Now, it only takes minutes to run a test instead of hours!)
A partial solution is in the answer here:
PySpark list() in withColumn() only works once, then AssertionError: col should be Column
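For anyone attempting the same collapse in plain pandas, a rough sketch of the groupby direction the question asks about (assuming, as in the example above, that 'Spclty' holds lists of [code, date] pairs; untested against the real data):

# Hypothetical groupby: merge each group's [code, date] pairs into one dict per UserNbr.
def combine_spclty(spclty_series):
    combined = {}
    for pair_list in spclty_series:
        combined.update(dict(pair_list))  # later duplicates overwrite earlier ones
    return combined

df_out = df_tmp.groupby('UserNbr')['Spclty'].apply(combine_spclty).reset_index()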

Spark - Optimize calculation time over a data frame, by using groupBy() instead of filter()

I have a data frame which contains different columns ('features').
My goal is to calculate statistical measures for column X:
mean, standard deviation and variance,
but to calculate all of those with a dependency on column Y,
e.g. get all rows where Y = 1 and calculate the mean, stddev and var for them,
then do the same for all rows where Y = 2.
My current implementation is:
print "For CONGESTION_FLAG = 0:"
log_df.filter(log_df[flag_col] == 0).select([mean(size_col), stddev(size_col),
pow(stddev(size_col), 2)]).show(20, False)
print "For CONGESTION_FLAG = 1:"
log_df.filter(log_df[flag_col] == 1).select([mean(size_col), stddev(size_col),
pow(stddev(size_col), 2)]).show(20, False)
print "For CONGESTION_FLAG = 2:"
log_df.filter(log_df[flag_col] == 2).select([mean(size_col), stddev(size_col),
pow(stddev(size_col), 2)]).show(20, False)
I was told that the filter() approach is wasteful in terms of computation time, and received advice that to make these calculations run faster (I'm using this on a 1 GB data file) it would be better to use the groupBy() method.
Can someone please help me transform these lines to do the same calculations using groupBy instead?
I got mixed up with the syntax and didn't manage to do it correctly.
Thanks.
Filter by itself is not wasteful. The problem is that you are calling it multiple times (once for each value), which means you are scanning the data three times. The operation you are describing is best achieved with groupBy, which basically aggregates data per value of the grouped column.
You could do something like this:
agg_df = log_df.groupBy(flag_col).agg(mean(size_col).alias("mean"), stddev(size_col).alias("stddev"), pow(stddev(size_col),2).alias("pow"))
You might also get better performance by calculating stddev^2 after the aggregation (you should try it on your data):
agg_df = log_df.groupBy(flag_col).agg(mean(size_col).alias("mean"), stddev(size_col).alias("stddev"))
agg_df2 = agg_df.withColumn("pow", agg_df["stddev"] * agg_df["stddev"])
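As an aside (an assumption about the available API rather than part of the original answer), pyspark.sql.functions also exposes a variance() aggregate, so the squared stddev can be requested directly in the aggregation:

from pyspark.sql.functions import mean, stddev, variance

# Hypothetical: mean, stddev and variance per flag value in a single aggregation.
agg_df = log_df.groupBy(flag_col).agg(
    mean(size_col).alias("mean"),
    stddev(size_col).alias("stddev"),
    variance(size_col).alias("variance"))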
You can also do:
log_df.groupBy(log_df[flag_col]).agg(
    mean(size_col), stddev(size_col), pow(stddev(size_col), 2)
)
