Python function to perform calculation among each group of data frame

I need a function that performs the calculation described below.
The dataset is:
The expected output is the value in the 'Difference' column; the remaining columns are inputs.
Within each group we first need to identify the maximum 'Closing_time'; the corresponding Amount is the maximum value for that period. Each row's Amount is then subtracted from the maximum value detected for the previous period, and the result is the difference for that row.
If a record has no previous period, the maximum value is NA and the difference is NA for every record in that period.
Additional detail: within each group (Cost_centre, Account, Year, Month), Closing_time values are ordered so that D-0 00 CST is the minimum and D-0 18 CST is the maximum within D-0; similarly, among D-0, D+1, D+3 and so on, D+3 is the maximum.
I tried to first check whether a previous period exists for each group, then find the maximum time within each period and the Amount corresponding to it.
Then I tried to subtract each record's Amount from that maximum value,
but I could not work out how to implement it. Kindly help.

After posting the question above, I came up with this solution.
I split it into three parts:
a) Find the previous year and month for each cost centre and account.
b) Find the maximum Closing_time within each group of cost centre, account, year and month, then pick the corresponding Amount.
c) Subtract the current Amount from the amount found in b) to get the difference.
import numpy as np
import pandas as pd
def prevPeriod(df):
    period =[]
    for i in range(df.shape[0]):
        if df['Month'][i]==1:
            # January: the previous period is December of the previous year
            val_year = df['Year'][i]-1
            val_month = 12
            new_val =(val_year,val_month)
            period.append(new_val)
        else:
            val_year = df['Year'][i]
            val_month = df['Month'][i]-1
            new_val =(val_year,val_month)
            period.append(new_val)
    print(period)
    df['Previous_period'] = period
    return df
def max_closing_time(group_list):
    # Labels look like 'D-0 18 CST' or 'D+3 00 CST'. Compare them by day first
    # and hour second, and return the original label unchanged so that it can
    # be matched against the 'Closing_time' column later on.
    def sort_key(item):
        day, hour = item.replace('CST','').replace('D','').split()[:2]
        return int(day), int(hour)
    return max(group_list, key=sort_key)
def calculate_difference(df):
    diff =[]
    for i in range(df.shape[0]):
        prev_year, prev_month = df['Previous_period'][i]
        # Rows of the same cost centre and account in the previous period
        prev_rows = df[(df['Cost_centre']==df['Cost_centre'][i])
                       & (df['Account']==df['Account'][i])
                       & (df['Year']==prev_year)
                       & (df['Month']==prev_month)]
        if prev_rows.empty:
            # No previous period: the difference is NA
            diff.append(np.nan)
        else:
            max_closing_time_value = prev_rows['Max_Closing_time'].iloc[0]
            max_amount_consider = prev_rows[prev_rows['Closing_time']==max_closing_time_value]['Amount']
            # Take the single matching Amount instead of calling int() on a Series
            found_diff = max_amount_consider.iloc[0] - df['Amount'][i]
            diff.append(found_diff)
    df['Variance'] = diff
    return df
def calculate_variance(df):
    '''
    The input data frame comes from the query used above to fetch the data
    '''
    # Errors are left to propagate: swallowing them would leave the result undefined
    df = prevPeriod(df)
    # prerequisite for max_time_period: the latest Closing_time label per group,
    # stored under its own column name so it does not clash with 'Closing_time'
    df2 = (df.groupby(['Cost_centre','Account','Year','Month'])['Closing_time']
             .apply(max_closing_time)
             .rename('Max_Closing_time')
             .reset_index())
    df = pd.merge(df,df2, on =['Cost_centre','Account','Year','Month'])
    # final calculation
    final_result = calculate_difference(df)
    return final_result
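The three steps can also be written without row-by-row loops. The following is only a sketch under stated assumptions: the column names come from the question, the helper names closing_key and add_difference are made up for illustration, every Closing_time label has the form 'D-0 18 CST' or 'D+3 00 CST', and the sample frame is invented so the snippet runs on its own.

```python
import pandas as pd


def closing_key(label):
    """Map a label such as 'D-0 18 CST' to a sortable number of hours."""
    day, hour = label.replace("CST", "").replace("D", "").split()
    return int(day) * 24 + int(hour)


def add_difference(df):
    keys = ["Cost_centre", "Account", "Year", "Month"]
    # Amount at the latest Closing_time of every (cost centre, account, period)
    latest = (
        df.assign(_order=df["Closing_time"].map(closing_key))
        .sort_values("_order")
        .groupby(keys, as_index=False)
        .last()[keys + ["Amount"]]
        .rename(columns={"Amount": "Prev_max_amount"})
    )
    # Move each period one month forward so it lines up with the period
    # that should use it as "previous" (December rolls over to January)
    next_index = latest["Year"] * 12 + (latest["Month"] - 1) + 1
    latest["Year"] = next_index // 12
    latest["Month"] = next_index % 12 + 1
    # Left merge: periods without a predecessor get NaN
    out = df.merge(latest, on=keys, how="left")
    out["Difference"] = out["Prev_max_amount"] - out["Amount"]
    return out


# Invented sample data, only to exercise the function
sample = pd.DataFrame({
    "Cost_centre": ["A"] * 5,
    "Account": [1] * 5,
    "Year": [2020, 2020, 2020, 2021, 2021],
    "Month": [12, 12, 12, 1, 1],
    "Closing_time": ["D-0 00 CST", "D+3 00 CST", "D-0 18 CST",
                     "D-0 00 CST", "D+1 00 CST"],
    "Amount": [10, 50, 20, 5, 7],
})
result = add_difference(sample)
print(result[["Year", "Month", "Closing_time", "Amount", "Difference"]])
```

The December 2020 rows have no previous period, so their Difference is NaN; the January 2021 rows are compared against the D+3 amount of December.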

Related

How do I extract specific values from a DataFrame and add them to a list?

Sample DataFrame:
     id          date        price
93   6021501535  2014-07-25  430000
93   6021501535  2014-12-23  700000
313  4139480200  2014-06-18  1384000
313  4139480200  2014-12-09  1400000
first_list = []
second_list = []
I need to add the first price that corresponds to a specific ID to the first list and the second price for that same ID to the second list.
Example:
first_list = [430,000, 1,384,000]
second_list = [700,000, 1,400,000]
After which, I'm going to plot the values from both lists on a lineplot to compare the difference in price between the first and second list.
I've tried doing this with groupby and loc and I kept running into errors. I then tried iterating over each row using a simple for loop but ran into more problems...
I would appreciate some help.

Based on your question I think it's not necessary to save them into a list because you could also store them somewhere else (e.g. another DataFrame) and plot them. The functions below should help with filling wherever you want to store your data.
def date(your_id):
    # First and second dates recorded for this id (column 1 is 'date')
    first_date = df.loc[(df['id']==your_id)].iloc[0,1]
    second_date = df.loc[(df['id']==your_id)].iloc[1,1]
    return first_date, second_date
def price(your_id):
    first_date, second_date = date(your_id)
    # Filter on your_id rather than a hard-coded id (column 2 is 'price')
    price_first_date = df.loc[(df['id']==your_id) & (df['date']==first_date)].iloc[0,2]
    price_second_date = df.loc[(df['id']==your_id) & (df['date']==second_date)].iloc[0,2]
    return price_first_date, price_second_date
price_first_date, price_second_date = price(6021501535)
If now for example you want to store your data in a new df you could do something like:
selected_ids = [6021501535, 4139480200]
new_df = pd.DataFrame(index=np.arange(1,len(selected_ids)+1), columns=['price_first_date', 'price_second_date'])
for i in range(len(selected_ids)):
    your_id = selected_ids[i]
    new_df.iloc[i, 0], new_df.iloc[i, 1] = price(your_id)
new_df then contains all 'first date prices' in the first column and all 'second date prices' in the second column. Plotting should work out.
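If plain lists are still wanted, one possible shortcut is to number the rows of each id and pick positions 0 and 1. This is a sketch, not the only way to do it; it assumes the rows of each id are already in chronological order, as they are in the sample, and it rebuilds the sample frame so the snippet is self-contained.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [6021501535, 6021501535, 4139480200, 4139480200],
    "date": ["2014-07-25", "2014-12-23", "2014-06-18", "2014-12-09"],
    "price": [430000, 700000, 1384000, 1400000],
})

# cumcount numbers the rows of each id 0, 1, 2, ... in their existing order
position = df.groupby("id").cumcount()

first_list = df.loc[position == 0, "price"].tolist()
second_list = df.loc[position == 1, "price"].tolist()

print(first_list)   # [430000, 1384000]
print(second_list)  # [700000, 1400000]
```

If the rows are not sorted by date within each id, sorting with df.sort_values("date") before calling cumcount would be needed.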

Why does my script return "AttributeError: 'str' object has no attribute 'append'"?

I'm trying to get the profitability of every project by dividing profit by revenue.
The code is working, I get the values back.
I just need help with the last part (the dividing part). There is where I'm having some issues.
Here is my code.
The outcome I get is
AttributeError: 'str' object has no attribute 'append'
from observations.constants import PROJECTS_DB_ID
from datetime import datetime
from dateutil.relativedelta import relativedelta
def get(gs_client):
    #Sheet access
    sheet = gs_client.open_by_key(
        PROJECTS_DB_ID).worksheet('Finance')
    #Columns necessary
    projects = sheet.col_values(1)[2:]
    months = sheet.col_values(2)[2:]
    profit = sheet.col_values(11)[2:]
    revenue = sheet.col_values(6)[2:]
    last_modified = sheet.col_values(13)[2:]
    #Lists
    list_projects = []
    list_months = []
    list_profit = []
    list_revenue = []
    list_last_modified = []
    value = []
    #Gets each project
    for project in projects:
        list_projects.append(project)
    #Gets each month
    for month in months:
        list_months.append(month)
    #Gets each value of profit column
    for val in profit:
        list_profit.append(val.strip('$').replace(',',''))
    #Gets each value in revenue column
    for value in revenue:
        list_revenue.append(value.strip('$').replace(',',''))
    #Gets each date in last modified column
    for update in last_modified:
        list_last_modified.append(update)
    #Get profitability per project (profit divided by revenue)
    for x in range(len(projects)):
        value1 = float(list_profit[x])/float(list_revenue[x])
        value.append(value1)
    print(value)
Any help would be greatly appreciated!

Your error is caused by the variable value: you have used it both as a list and as a string.
#Lists
list_projects = []
list_months = []
list_profit = []
list_revenue = []
list_last_modified = []
value = []
#Gets each project
for project in projects:
    list_projects.append(project)
#Gets each month
for month in months:
    list_months.append(month)
#Gets each value of profit column
for val in profit:
    list_profit.append(val.strip('$').replace(',',''))
#Gets each value in revenue column
for val in revenue: # here, changed value to val
    list_revenue.append(val.strip('$').replace(',',''))
#Gets each date in last modified column
for update in last_modified:
    list_last_modified.append(update)
#Get profitability per project (profit divided by revenue)
for x in range(len(projects)):
    value1 = float(list_profit[x])/float(list_revenue[x])
    value.append(value1)
Whenever you write for i in something in Python, i is not a variable local to the for loop as it is in some other languages. After the loop ends, i still holds the last value it was assigned, and it replaces whatever the name referred to before the loop. You have to be careful about reusing variable names in Python.
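A stripped-down illustration of the same effect, using made-up revenue strings rather than the spreadsheet data:

```python
value = []                            # starts out as a list
for value in ["$1,000", "$2,500"]:    # the loop rebinds the same name
    pass

# After the loop, the name refers to the last string, not to the list
leaked_type = type(value).__name__

try:
    value.append(0.5)
    error = None
except AttributeError as exc:
    error = str(exc)

print(leaked_type)  # str
print(error)        # 'str' object has no attribute 'append'
```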

Calculate the values from a column in a csv file based on data from another column

So, I asked this question yesterday but I think I can word it a lot better. I have a csv file with 4 columns, 1 of which contains the day that a ticket has been purchased for (Wed, Thur and Fri), and another containing how many tickets each customer has bought. Wed & Thur tickets are a different price from Fri tickets. I need to get the code to loop through the tickets bought column and only take the data from the rows containing 'W' or 'T' in the day of purchase column so I can calculate how much money was made from Wed & Thur sales, and then the same for the Fri sales. I hope I've explained it well. If it helps, here is my code so far:
wedThur = int(5.00)
friday = int(10.00)
def readFile():
    ticketid = []
    ticketsBought = []
    method = []
    f = open("ticketdata.csv")
    csvFile = csv.reader(f)
    for row in csvFile:
        ticketid.append(row[1])
        ticketsBought.append(int(row[2]))
        method.append(row[3])
    f.close()
    return ticketid, ticketsBought, method
def calculatePurchases(ticketid, ticketsBought):
    price = 0
    amount = len(ticketid)
    if 'W' or 'T' in ticketid:
        price = wedThur * amount
        print(price)
    else:
        price = friday * amount
        print(price)
    return price

Python has many useful features for working with data like this.
First of all, I would change your read-file function to return a more suitable data structure. Instead of returning a tuple of lists, I would return a list of rows.
import csv
def read_file():
    data = []
    f = open("ticketdata.csv")
    csvFile = csv.reader(f)
    for row in csvFile:
        data.append(row)
    f.close()
    return data
Python has a built-in function, sum, which adds up all the elements of a sequence.
sum([1, 2, 3]) returns 6.
All that is needed is to build the right sequence for it.
def iterate_by_day(days, data):
    for d in data:
        # d[0] is assumed to hold the day code and d[1] the number of tickets
        if d[0] in days:
            yield int(d[1])  # csv.reader returns strings, so convert before summing
Because it uses yield, this function is a generator: it produces the values one at a time instead of building a list. The Python tutorial covers generators and is worth reading.
This should print the expected result.
data = read_file()
wed_thur = 5
print(sum(iterate_by_day("WT", data)) * wed_thur)
# This works the same
print(sum(iterate_by_day(["W", "T"], data)) * wed_thur)
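For a version that can be run without the original file, here is a sketch that keeps one running total per day. The column layout (customer, day code, tickets bought, payment method) and the sample rows are assumptions made up for illustration, following the indices used in the question's readFile. Note also that a condition such as 'W' or 'T' in ticketid is always true, because the non-empty string 'W' is truthy on its own, so each row's day has to be tested individually.

```python
import csv
import io

# Hypothetical stand-in for ticketdata.csv: customer, day, tickets, method
SAMPLE = """\
C001,W,2,online
C002,F,1,school
C003,T,3,online
C004,F,4,online
"""

PRICES = {"W": 5, "T": 5, "F": 10}  # price per ticket for each day code


def revenue_by_day(lines):
    """Return a dict mapping each day code to the money made on that day."""
    totals = {}
    for row in csv.reader(lines):
        day, tickets = row[1], int(row[2])
        totals[day] = totals.get(day, 0) + PRICES[day] * tickets
    return totals


totals = revenue_by_day(io.StringIO(SAMPLE))
wed_thur_total = totals["W"] + totals["T"]
friday_total = totals["F"]
print(wed_thur_total, friday_total)  # 25 50
```

With the real file, passing open("ticketdata.csv", newline="") instead of the StringIO object should work the same way, provided the columns are laid out as assumed.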

Analysis of Eye-Tracking data in python (Eye-link)

I have data from eye-tracking (.edf file - from Eyelink by SR-research). I want to analyse it and get various measures such as fixation, saccade, duration, etc.
Is there an existing package to analyse Eye-Tracking data?
Thanks!

At least for importing the .edf-file into a pandas DF, you can use the following package by Niklas Wilming: https://github.com/nwilming/pyedfread/tree/master/pyedfread
This should already take care of saccades and fixations - have a look at the readme. Once they're in the data frame, you can apply whatever analysis you want to it.

pyeparse seems to be another (yet currently unmaintained as it seems) library that can be used for eyelink data analysis.
Here is a short excerpt from their example:
import numpy as np
import matplotlib.pyplot as plt
import pyeparse as pp
fname = '../pyeparse/tests/data/test_raw.edf'
raw = pp.read_raw(fname)
# visualize initial calibration
raw.plot_calibration(title='5-Point Calibration')
# create heatmap
raw.plot_heatmap(start=3., stop=60.)
EDIT: After I posted my answer I found a nice list compiling lots of potential tools for eyelink edf data analysis: https://github.com/davebraze/FDBeye/wiki/Researcher-Contributed-Eye-Tracking-Tools

Hey, the question seems rather old, but maybe I can revive it because I am currently facing the same situation.
To start, I recommend converting your .edf to an .asc file. That makes it easier to read and to get a first impression.
Many tools exist for this, but I used the SR-Research Eyelink Developers Kit (here).
I don't know your setup, but the Eyelink 1000 itself detects saccades and fixations. In my case it looks like this in the .asc file:
SFIX L 10350642
10350642 864.3 542.7 2317.0
...
...
10350962 863.2 540.4 2354.0
EFIX L 10350642 10350962 322 863.1 541.2 2339
SSACC L 10350964
10350964 863.4 539.8 2359.0
...
...
10351004 683.4 511.2 2363.0
ESACC L 10350964 10351004 42 863.4 539.8 683.4 511.2 5.79 221
The first number corresponds to the timestamp, the second and third to x-y coordinates and the last is your pupil diameter (what the last numbers after ESACC are, I don't know).
SFIX -> start fixation
EFIX -> end fixation
SSACC -> start saccade
ESACC -> end saccade
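As a first look at such a file, the end-of-event lines can be collected with plain string splitting. This is a rough sketch: the function name parse_events is made up, and the field meanings are read off the excerpt above (eye, start and end timestamps, duration, then coordinates), so treat them as assumptions to check against your own files.

```python
def parse_events(lines):
    """Collect end-of-fixation and end-of-saccade records from .asc lines."""
    fixations, saccades = [], []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "EFIX":
            # EFIX <eye> <start> <end> <duration> <x> <y> <pupil>
            fixations.append({
                "eye": fields[1],
                "start": int(fields[2]),
                "end": int(fields[3]),
                "duration": int(fields[4]),
                "x": float(fields[5]),
                "y": float(fields[6]),
            })
        elif fields[0] == "ESACC":
            # ESACC <eye> <start> <end> <duration> <x0> <y0> <x1> <y1> ...
            saccades.append({
                "eye": fields[1],
                "start": int(fields[2]),
                "end": int(fields[3]),
                "duration": int(fields[4]),
                "x_start": float(fields[5]),
                "y_start": float(fields[6]),
                "x_end": float(fields[7]),
                "y_end": float(fields[8]),
            })
    return fixations, saccades


excerpt = [
    "SFIX L 10350642",
    "10350642 864.3 542.7 2317.0",
    "EFIX L 10350642 10350962 322 863.1 541.2 2339",
    "SSACC L 10350964",
    "ESACC L 10350964 10351004 42 863.4 539.8 683.4 511.2 5.79 221",
]
fixations, saccades = parse_events(excerpt)
print(fixations[0]["duration"], saccades[0]["x_end"])  # 322 683.4
```

The two lists can be passed to pandas.DataFrame if a table is more convenient.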
You can also check out PyGaze. I haven't worked with it, but it kept popping up whenever I searched for a toolbox.
EDIT
I found this toolbox here. It looks cool and works fine with the example data, but sadly it does not work with mine.
EDIT No 2
Revisiting this question after working on my own eye-tracking data, I thought I might share a function I wrote to work with my data:
import csv
import numpy as np
import pandas as pd
def eyedata2pandasframe(directory):
    '''
    This function takes a directory from which it tries to read in ASCII files containing eyetracking data
    It returns
    eye_data: A pandas dataframe containing data from fixations AND saccades
    fix_data: A pandas dataframe containing only data from fixations
    sac_data: pandas dataframe containing only data from saccades
    fixation: numpy array containing information about fixation onsets and offsets
    saccades: numpy array containing information about saccade onsets and offsets
    blinks: numpy array containing information about blink onsets and offsets
    trials: numpy array containing information about trial onsets
    '''
    eye_data= []
    fix_data = []
    sac_data = []
    data_header = {0: 'TimeStamp',1: 'X_Coord',2: 'Y_Coord',3: 'Diameter'}
    event_header = {0: 'Start', 1: 'End'}
    start_reading = False
    in_blink = False
    in_saccade = False
    fix_timestamps = []
    sac_timestamps = []
    blink_timestamps = []
    trials = []
    sample_rate_info = []
    sample_rate = 0
    # read the file and store the data, depending on the messages
    # we have the following structure:
    # a header -- every line starts with a '**'
    # a bunch of messages containing information about calibration/validation and so on all starting with 'MSG'
    # followed by:
    # START 10350638 LEFT SAMPLES EVENTS
    # PRESCALER 1
    # VPRESCALER 1
    # PUPIL AREA
    # EVENTS GAZE LEFT RATE 500.00 TRACKING CR FILTER 2
    # SAMPLES GAZE LEFT RATE 500.00 TRACKING CR FILTER 2
    # followed by the actual data:
    # normal data --> [TIMESTAMP]\t [X-Coords]\t [Y-Coords]\t [Diameter]
    # Start of EVENTS [BLINKS FIXATION SACCADES] --> S[EVENTNAME] [EYE] [TIMESTAMP]
    # End of EVENTS --> E[EVENT] [EYE] [TIMESTAMP_START]\t [TIMESTAMP_END]\t [TIME OF EVENT]\t [X-Coords start]\t [Y-Coords start]\t [X_Coords end]\t [Y-Coords end]\t [?]\t [?]
    # Trial messages --> MSG timestamp\t TRIAL [TRIALNUMBER]
    try:
        with open(directory) as f:
            csv_reader = csv.reader(f, delimiter ='\t')
            for i, row in enumerate (csv_reader):
                if any ('RATE' in item for item in row):
                    sample_rate_info = row
                if any('SYNCTIME' in item for item in row): # only start reading after this message
                    start_reading = True
                elif any('SFIX' in item for item in row): pass
                    #fix_timestamps[0].append (row)
                elif any('EFIX' in item for item in row):
                    fix_timestamps.append ([row[0].split(' ')[4],row[1]])
                    #fix_timestamps[1].append (row)
                elif any('SSACC' in item for item in row):
                    #sac_timestamps[0].append (row)
                    in_saccade = True
                elif any('ESACC' in item for item in row):
                    sac_timestamps.append ([row[0].split(' ')[3],row[1]])
                    in_saccade = False
                elif any('SBLINK' in item for item in row): # stop reading here because the blinks contain NaN
                    # blink_timestamps[0].append (row)
                    in_blink = True
                elif any('EBLINK' in item for item in row): # start reading again. the blink ended
                    blink_timestamps.append ([row[0].split(' ')[2],row[1]])
                    in_blink = False
                elif any('TRIAL' in item for item in row):
                    # the first element is 'MSG', we don't need it, then we split the second element to separate the timestamp and only keep it as an integer
                    trials.append (int(row[1].split(' ')[0]))
                elif start_reading and not in_blink:
                    eye_data.append(row)
                    if in_saccade:
                        sac_data.append(row)
                    else:
                        fix_data.append(row)
        # drop the last data point, because it is the 'END' message
        eye_data.pop(-1)
        sac_data.pop(-1)
        fix_data.pop(-1)
        # convert every item in list into a float, subtract the start of the first trial to set the start of the first video to t0=0
        # then divide by 1000 to convert from milliseconds to seconds
        for row in eye_data:
            for i, item in enumerate (row):
                row[i] = float (item)
        for row in fix_data:
            for i, item in enumerate (row):
                row[i] = float (item)
        for row in sac_data:
            for i, item in enumerate (row):
                row[i] = float (item)
        for row in fix_timestamps:
            for i, item in enumerate (row):
                row [i] = (float(item)-trials[0])/1000
        for row in sac_timestamps:
            for i, item in enumerate (row):
                row [i] = (float(item)-trials[0])/1000
        for row in blink_timestamps:
            for i, item in enumerate (row):
                row [i] = (float(item)-trials[0])/1000
        sample_rate = float (sample_rate_info[4])
        # convert into pandas data frames for a better overview
        eye_data = pd.DataFrame(eye_data)
        fix_data = pd.DataFrame(fix_data)
        sac_data = pd.DataFrame(sac_data)
        fix_timestamps = pd.DataFrame(fix_timestamps)
        sac_timestamps = pd.DataFrame(sac_timestamps)
        trials = np.array(trials)
        blink_timestamps = pd.DataFrame(blink_timestamps)
        # rename header for an even better overview
        eye_data = eye_data.rename(columns=data_header)
        fix_data = fix_data.rename(columns=data_header)
        sac_data = sac_data.rename(columns=data_header)
        fix_timestamps = fix_timestamps.rename(columns=event_header)
        sac_timestamps = sac_timestamps.rename(columns=event_header)
        blink_timestamps = blink_timestamps.rename(columns=event_header)
        # subtract the first timestamp of trials to set the start of the first video to t0=0
        eye_data.TimeStamp -= trials[0]
        fix_data.TimeStamp -= trials[0]
        sac_data.TimeStamp -= trials[0]
        trials -= trials[0]
        trials = trials /1000 # does not work with trials/=1000
        # divide TimeStamp to get time in seconds
        eye_data.TimeStamp /=1000
        fix_data.TimeStamp /=1000
        sac_data.TimeStamp /=1000
        return eye_data, fix_data, sac_data, fix_timestamps, sac_timestamps, blink_timestamps, trials, sample_rate
    except:
        print ('Could not read ' + str(directory) + ' properly!!! Returned empty data')
        return eye_data, fix_data, sac_data, fix_timestamps, sac_timestamps, blink_timestamps, trials, sample_rate
Hope it helps you guys. You may need to change some parts of the code, such as the index at which the strings are split to get the crucial information about event onsets and offsets. You may also not want to convert your timestamps into seconds, or not want to set the onset of your first trial to 0. That is up to you.
Additionally, in my data we sent a message to mark when we started measuring ('SYNCTIME'), and I had only ONE condition in my experiment, so there is only one 'TRIAL' message.
Cheers

Panda dataframe yield error

I am trying to yield a pandas dataframe one row at a time, but I get an error. The dataframe holds stock price data, including the daily open, close, high and low prices and the volume.
The following is my code. This class gets the data from a MySQL database.
class HistoricMySQLDataHandler(DataHandler):
    def __init__(self, events, symbol_list):
        """
        Initialises the historic data handler by requesting
        a list of symbols.
        Parameters:
        events - The Event Queue.
        symbol_list - A list of symbol strings.
        """
        self.events = events
        self.symbol_list = symbol_list
        self.symbol_data = {}
        self.latest_symbol_data = {}
        self.continue_backtest = True
        self._connect_MySQL()
    def _connect_MySQL(self): #get stock price for symbol s
        db_host = 'localhost'
        db_user = 'sec_user'
        db_pass = 'XXX'
        db_name = 'securities_master'
        con = mdb.connect(db_host, db_user, db_pass, db_name)
        for s in self.symbol_list:
            # Pass the symbol as a query parameter instead of embedding it in the string
            sql = "SELECT * FROM daily_price WHERE symbol = %s"
            self.symbol_data[s] = pd.read_sql(sql, con=con, index_col='price_date', params=(s,))
    def _get_new_bar(self, symbol):
        """
        Returns the latest bar from the data feed as a tuple of
        (symbol, datetime, open, low, high, close, volume).
        """
        for row in self.symbol_data[symbol].itertuples():
            # A tuple literal is used here: tuple() accepts only one argument
            yield (symbol, datetime.datetime.strptime(row[0],'%Y-%m-%d %H:%M:%S'),
                   row[15], row[17], row[16], row[18],row[20])
    def update_bars(self):
        """
        Pushes the latest bar to the latest_symbol_data structure
        for all symbols in the symbol list.
        """
        for s in self.symbol_list:
            try:
                bar = self._get_new_bar(s).__next__()
            except StopIteration:
                self.continue_backtest = False
In the main function:
# Declare the components with respective parameters
symbol_list=["GOOG"]
events=queue.Queue()
bars = HistoricMySQLDataHandler(events,symbol_list)
while True:
    # Update the bars (specific backtest code, as opposed to live trading)
    if bars.continue_backtest == True:
        bars.update_bars()
    else:
        break
    time.sleep(1)
Data example:
symbol_data["GOOG"] =
price_date id exchange_id ticker instrument name ... high_price low_price close_price adj_close_price volume
2014-03-27 29 None GOOG stock Alphabet Inc Class C ... 568.0000 552.9200 558.46 558.46 13100
The update_bars function calls _get_new_bar to move to the next row (the next day's price).
My objective is to get the stock price day by day (iterating over the rows of the dataframe). However, self.symbol_data[s] is a dataframe in _connect_MySQL but a generator in _get_new_bar, hence I get this error:
AttributeError: 'generator' object has no attribute 'itertuples'
Does anyone have any ideas?
I am using Python 3.6. Thanks.
self.symbol_data is a dict, and symbol is a string key used to get the dataframe. The data is stock price data. For example, self.symbol_data["GOOG"] returns a dataframe with Google's daily stock price information indexed by date, each row including the open, low, high and close prices and the volume. My goal is to iterate over this price data day by day using yield.
_connect_MySQL gets the data from the database.
In this example, s = "GOOG" in the function.

I found the bug.
Code elsewhere in my program turned the dataframe into a generator.
A stupid mistake, lol.
I didn't post this line in the question, but it is the one that changes the data type: the trailing .iterrows() call returns a generator instead of a dataframe.
# Reindex the dataframes
for s in self.symbol_list:
    self.symbol_data[s] = self.symbol_data[s].reindex(index=comb_index, method='pad').iterrows()
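The effect is easy to reproduce in isolation. The sketch below uses a tiny made-up frame in place of the price table; the second price row is invented for illustration.

```python
import pandas as pd

frame = pd.DataFrame(
    {"close_price": [558.46, 559.99]},
    index=["2014-03-27", "2014-03-28"],
)

rows = frame.iterrows()                        # a generator, not a DataFrame
has_itertuples = hasattr(rows, "itertuples")   # False, hence the AttributeError

# A generator can only be advanced with next(); each item is (index, Series)
first_index, first_row = next(rows)

# Keeping the DataFrame and creating one iterator up front lets successive
# next() calls walk through the rows instead of restarting at the first one
bar_iter = frame.itertuples()
first_bar = next(bar_iter)
second_bar = next(bar_iter)
print(first_index, first_bar.close_price, second_bar.close_price)
```

Storing the DataFrame in the dict and the iterator in a separate attribute keeps both available.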
