Thanks, everyone, for your time.
I have a DataFrame like the one below:
import pandas as pd
import numpy as np
raw_data = {'Date': ['04-23-2020', '05-05-2020', '05-05-2020', '05-11-2020', '05-11-2020',
                     '05-12-2020', '05-12-2020', '05-27-2020', '06-03-2020'],
            'Type': ['Buy', 'Buy', 'Sell', 'Buy', 'Sell', 'Buy', 'Buy', 'Buy', 'Sell'],
            'Ticker': ['AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'BBB', 'CCC', 'CCC', 'CCC'],
            'Quantity': [60000, 12000, -30000, 49000, -30000, 2000, 10000, 28500, -10000],
            'Price': [60.78, 82.20, 0, 100.00, 0, 545.00, 141.00, 172.00, 0]}
df = pd.DataFrame(raw_data, columns=['Date', 'Type', 'Ticker', 'Quantity', 'Price']) \
       .sort_values(['Date', 'Ticker']).reset_index(drop=True)
My objective is to calculate the weighted average price whenever there is a sell transaction; please see my expected outcome below. I tried a for loop for this, but I was unable to get the results.
My current code:
df['Pur_Value'] = df['Quantity'] * df['Price']
df['TotalQty'] = df.groupby('Ticker')['Quantity'].cumsum()
grpl = df.groupby(by=['Ticker'])
df1 = pd.DataFrame()
finaldf = pd.DataFrame()
for key in grpl.groups.keys():
    df1 = grpl.get_group(key).loc[:, ['Type', 'Ticker', 'Date', 'Quantity', 'Price', 'Pur_Value']]
    df1.sort_values(by=['Ticker', 'Date'], inplace=True)
    df1.reset_index(drop=True, inplace=True)
    Cum_Value = 0
    Cum_Shares = 0
    for Index, Tic in df1.iterrows():
        if Tic['Type'] == "Buy":
            Cum_Value += Tic['Pur_Value']
            Cum_Shares += Tic['Quantity']
        else:
            df1['sold_price'] = Cum_Value / Cum_Shares
    finaldf = finaldf.append(df1)
The expected result is a column that holds the weighted average price for the sold shares.
I was able to solve my own issue with the code below. Thanks!
Ticker = []
Dates = []
sold_price = []
for key in grpl.groups.keys():
    df1 = grpl.get_group(key).loc[
        :, ["Type", "Ticker", "Date", "Quantity", "Price", "Pur_Value"]
    ]
    df1.sort_values(by=["Ticker", "Date"], inplace=True)
    df1.reset_index(drop=True, inplace=True)
    Cum_Value = 0
    Cum_Shares = 0
    Cum_price = 0
    sold_value = 0
    for Index, Tic in df1.iterrows():
        if Tic["Type"] == "Buy":
            Cum_Value += Tic["Pur_Value"]
            Cum_Shares += Tic["Quantity"]
            Cum_price = Cum_Value / Cum_Shares
        else:
            # A sale removes shares at the current weighted average price
            # (Quantity is negative for sells, so sold_value is negative too).
            sold_value = Cum_price * Tic["Quantity"]
            Cum_Shares += Tic["Quantity"]
            Cum_Value += sold_value
            Cum_price = Cum_Value / Cum_Shares
            Ticker.append(Tic["Ticker"])
            Dates.append(Tic["Date"])
            sold_price.append(Cum_price)
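For completeness, a minimal sketch of how the collected lists could be assembled into a result frame (my own addition; the column names are assumptions, and it relies on the Ticker, Dates, and sold_price lists initialized above):

import pandas as pd

# Combine the per-sale lists collected in the loop above into one DataFrame.
solddf = pd.DataFrame({'Ticker': Ticker, 'Date': Dates, 'Sold_Price': sold_price})
print(solddf)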
In my dataset, I am trying to get the margin between two values. The code below runs perfectly if the fourth race is not included. After grouping on a column, it seems that sometimes there is only one value, and therefore no other value to compute a margin against. I want to ignore those groups in that case. Here is my current code:
import pandas as pd
data = {'Name': ['A', 'B', 'B', 'C', 'A', 'C', 'A'],
        'RaceNumber': [1, 1, 2, 2, 3, 3, 4],
        'PlaceWon': ['First', 'Second', 'First', 'Second', 'First', 'Second', 'First'],
        'TimeRanInSec': [100, 98, 66, 60, 75, 70, 75]}
df = pd.DataFrame(data)
print(df)
def winning_margin(times):
    times = list(times)
    winner = min(times)
    times.remove(winner)
    return min(times) - winner
winning_margins = df[['RaceNumber', 'TimeRanInSec']] \
    .groupby('RaceNumber').agg(winning_margin)
winning_margins.columns = ['margin']
winners = df.loc[df.PlaceWon == 'First', :]
winners = winners.join(winning_margins, on='RaceNumber')
avg_margins = winners[['Name', 'margin']].groupby('Name').mean()
avg_margins
How about returning a NaN if times does not have enough elements:
import numpy as np
def winning_margin(times):
    if len(times) <= 1:  # New code
        return np.nan    # New code
    times = list(times)
    winner = min(times)
    times.remove(winner)
    return min(times) - winner
Your code runs with this change and seems to produce sensible results. You can also drop the NaNs later if you want, e.g. in this line:
winning_margins = df[['RaceNumber', 'TimeRanInSec']] \
    .groupby('RaceNumber').agg(winning_margin).dropna()  # note the addition of .dropna()
You could get the winner and margin in one step:
def get_margin(x):
    if len(x) < 2:
        return np.nan
    i = x['TimeRanInSec'].idxmin()        # row index of the fastest time
    nl = x['TimeRanInSec'].nsmallest(2)   # the two fastest times
    margin = nl.max() - nl.min()
    return [x['Name'].loc[i], margin]
Then:
df.groupby('RaceNumber').apply(get_margin).dropna()
RaceNumber
1 [B, 2]
2 [C, 6]
3 [C, 5]
(Note that in the sample data, the 'First' indicator sometimes corresponds to the slower time.)
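As a variation (a sketch, not from the original answer; the function name is mine): returning a pd.Series instead of a list makes the result a proper DataFrame with named columns:

import numpy as np
import pandas as pd

def get_margin_named(x):
    # Races with a single runner have no margin; return NaNs so dropna() removes them.
    if len(x) < 2:
        return pd.Series({'winner': np.nan, 'margin': np.nan})
    i = x['TimeRanInSec'].idxmin()        # row index of the fastest time
    nl = x['TimeRanInSec'].nsmallest(2)   # the two fastest times
    return pd.Series({'winner': x['Name'].loc[i], 'margin': nl.max() - nl.min()})

print(df.groupby('RaceNumber').apply(get_margin_named).dropna())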
The Tuple_list prints out something like "[(2000, 1, 1, 1, 135), (2000, 1, 1, 2, 136), etc...]" and I can't figure out how to assign "year, month, day, hour, height" to every tuple in the list.
def read_file(filename):
    with open(filename, "r") as read:
        pre_list = list()
        for line in read.readlines():
            remove_symbols = line.strip()
            make_list = remove_symbols.replace(" ", ", ")
            pre_list += make_list.split('()')
        Tuple_list = [tuple(map(int, each.split(', '))) for each in pre_list]
        for n in Tuple_list:
            # This fails: n is already a tuple, not a position in the list.
            year, month, day, hour, height = Tuple_list[n][0], Tuple_list[
                n][1], Tuple_list[n][2], Tuple_list[n][3], Tuple_list[n][4]
            print(month)
        return Tuple_list

swag = read_file("VIK_sealevel_2000.txt")
Maybe "Named tuples" is what you are looking for.
In [1]: from collections import namedtuple
In [2]: Measure = namedtuple('Measure', ['year', 'month', 'day', 'hour', 'height'])
In [3]: m1 = Measure(2005,1,1,1,4444)
In [4]: m1
Out[4]: Measure(year=2005, month=1, day=1, hour=1, height=4444)
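Applied to the question's data, a minimal sketch (assuming Tuple_list is the list of 5-tuples returned by read_file) could look like this:

from collections import namedtuple

Measure = namedtuple('Measure', ['year', 'month', 'day', 'hour', 'height'])

# Convert each plain tuple into a named tuple, then access fields by name.
measures = [Measure(*t) for t in Tuple_list]
for m in measures:
    print(m.month)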
I have data in Excel and need to create a dictionary from it.
The expected output is like below:
d = [
    {
        "name": "dhdn",
        "usn": "1bm15mca13",
        "sub": ["c", "java", "python"],
        "marks": [90, 95, 98]
    },
    {
        "name": "subbu",
        "usn": "1bm15mca14",
        "sub": ["java", "perl"],
        "marks": [92, 91]
    },
    {
        "name": "paddu",
        "usn": "1bm15mca17",
        "sub": ["c#", "java"],
        "marks": [80, 81]
    }
]
I tried the code below, but it only works for two columns.
import pandas as pd
existing_excel_file = 'BHARTI_Model-4_Migration Service parameters - Input sheet_v1.0_DRAFT_26-02-2020.xls'
df_service = pd.read_excel(existing_excel_file, sheet_name='Sheet2')
df_service = df_service.fillna(method='ffill')
result = [{'name':k,'sub':g["sub"].tolist(),"marks":g["marks"].tolist()} for k,g in df_service.groupby(['name', 'usn'])]
print (result)
I am getting the output below, but I want it as expected above.
[{'name': ('dhdn', '1bm15mca13'), 'sub': ['c', 'java', 'python'], 'marks': [90, 95, 98]}, {'name': ('paddu', '1bm15mca17'), 'sub': ['c#', 'java'], 'marks': [80, 81]}, {'name': ('subbu', '1bm15mca14'), 'sub': ['java', 'perl'], 'marks': [92, 91]}]
Finally, I solved it myself.
import pandas as pd
from pprint import pprint
existing_excel_file = 'BHARTI_Model-4_Migration Service parameters - Input sheet_v1.0_DRAFT_26-02-2020.xls'
df_service = pd.read_excel(existing_excel_file, sheet_name='Sheet2')
df_service = df_service.fillna(method='ffill')
result = [{'name':k[0],'usn':k[1],'sub':v["sub"].tolist(),"marks":v["marks"].tolist()} for k,v in df_service.groupby(['name', 'usn'])]
pprint (result)
It gives the output exactly as I expected:
[{'marks': [90, 95, 98],
'name': 'dhdn',
'sub': ['c', 'java', 'python'],
'usn': '1bm15mca13'},
{'marks': [80, 81],
'name': 'paddu',
'sub': ['c#', 'java'],
'usn': '1bm15mca17'},
{'marks': [92, 91],
'name': 'subbu',
'sub': ['java', 'perl'],
'usn': '1bm15mca14'}]
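For anyone without the original Excel file, a small stand-in DataFrame (the column layout here is my assumption) reproduces the same behavior:

import pandas as pd
from pprint import pprint

# Hypothetical stand-in for the Excel sheet: merged cells come in as NaN,
# which is why the forward-fill is needed.
df_service = pd.DataFrame({
    'name': ['dhdn', None, None, 'subbu', None, 'paddu', None],
    'usn': ['1bm15mca13', None, None, '1bm15mca14', None, '1bm15mca17', None],
    'sub': ['c', 'java', 'python', 'java', 'perl', 'c#', 'java'],
    'marks': [90, 95, 98, 92, 91, 80, 81],
})
df_service = df_service.fillna(method='ffill')
result = [{'name': k[0], 'usn': k[1], 'sub': v['sub'].tolist(), 'marks': v['marks'].tolist()}
          for k, v in df_service.groupby(['name', 'usn'])]
pprint(result)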
All right! I solved your question although it took me a while.
The first part is the same as your progress.
import pandas as pd
df = pd.read_excel('test.xlsx')
df = df.fillna(method='ffill')
Then we need to get the unique names and how many rows they cover. I'm assuming there are as many unique names as there are unique "usn's". I created a list that stores these 'counts'.
unique_names = df.name.unique()
unique_usn = df.usn.unique()
counts = []
for i in range(len(unique_names)):
    counts.append(df.name.str.count(unique_names[i]).sum())
counts
[3, 2, 2]  # this means that 'dhdn' covers 3 rows, 'subbu' covers 2 rows, etc.
Now we need a smart function that will let us obtain the necessary info from the other columns.
def get_items(column_number):
    empty_list = []
    lower_bound = 0
    for i in range(len(counts)):
        empty_list.append(df.iloc[lower_bound:sum(counts[:i+1]), column_number].values.tolist())
        lower_bound = sum(counts[:i+1])
    return empty_list
I leave it to you to understand what is going on. But basically we are recovering the necessary info. We now just need to apply that to get a list for subs and for marks, respectively.
list_sub = get_items(2)    # 'sub' is the third column (index 2)
list_marks = get_items(3)  # 'marks' is the fourth column (index 3)
Finally, we put it all into one list of dicts.
d = []
for i in range(len(unique_names)):
    diction = {}
    diction['name'] = unique_names[i]
    diction['usn'] = unique_usn[i]
    diction['sub'] = list_sub[i]
    diction['marks'] = list_marks[i]
    d.append(diction)
And voilà!
print(d)
[{'name': 'dhdn', 'usn': '1bm15mca13', 'sub': ['c', 'java', 'python'], 'marks': [90, 95, 98]},
 {'name': 'subbu', 'usn': '1bm15mca14', 'sub': ['java', 'perl'], 'marks': [92, 91]},
 {'name': 'paddu', 'usn': '1bm15mca17', 'sub': ['c#', 'java'], 'marks': [80, 81]}]
I'm doing an Alpha Vantage API pull for historic stock data, pulling one of their indicators. Instead of writing 36 separate functions and pulling manually, I'd like to iterate through the 36 possible combinations and do the pull each time with different variables (the variables being each combination). Below is my code. It currently returns None. What am I doing wrong?
Also, is there a way to combine these two functions into one?
Thanks!
def get_ppo_series(matype, series_type):
    pull_parameters = {
        'function': 'PPO',
        'symbol': stock,
        'interval': interval,
        'series_type': series_type,
        'fastperiod': 12,
        'slowperiod': 26,
        'matype': matype,
        'datatype': 'json',
        'apikey': key
    }
    column = 0
    pull = rq.get(url, params=pull_parameters)
    data = pull.json()
    df = pd.DataFrame.from_dict(data['Technical Analysis: PPO'], orient='index', dtype=float)
    df.reset_index(level=0, inplace=True)
    df.columns = ['Date', 'PPO Series ' + str(column)]
    df.insert(0, 'Stock', stock)
    column += 1
    return df.tail(past_years * annual_trading_days)
def run_ppo_series():
    matype = list(range(8))
    series_type = ['open', 'high', 'low', 'close']
    combinations = product(matype, series_type)
    for matype, series_type in combinations:
        get_ppo_series(matype, series_type)

print(run_ppo_series())
I also tried the following. This version at least ran one iteration and returned data, but it stops there.
def get_ppo_series():
    column = 0
    matype = list(range(8))
    series_type = ['open', 'high', 'low', 'close']
    combinations = product(matype, series_type)
    for matype, series_type in combinations:
        pull_parameters = {
            'function': 'PPO',
            'symbol': stock,
            'interval': interval,
            'series_type': series_type,
            'fastperiod': 12,
            'slowperiod': 26,
            'matype': matype,
            'datatype': 'json',
            'apikey': key
        }
        pull = rq.get(url, params=pull_parameters)
        data = pull.json()
        df = pd.DataFrame.from_dict(data['Technical Analysis: PPO'], orient='index', dtype=float)
        df.reset_index(level=0, inplace=True)
        df.columns = ['Date', 'PPO Series ' + str(column)]
        df.insert(0, 'Stock', stock)
        column += 1
        return df.tail(past_years * annual_trading_days)

print(get_ppo_series())
import requests as rq
import itertools

url = 'https://www.alphavantage.co/query?'
key = 'get your own key'

def get_ppo_series(matype, series_type):
    pull_parameters = {
        'function': 'PPO',
        'symbol': 'msft',
        'interval': '60min',
        'series_type': series_type,
        'fastperiod': 12,
        'slowperiod': 26,
        'matype': matype,
        'datatype': 'json',
        'apikey': key
    }
    pull = rq.get(url, params=pull_parameters)
    data = pull.json()
    print('*' * 50)
    print(f'MAType: {matype}, Series: {series_type}')
    print(data)

def run_ppo_series():
    matype = list(range(8))
    series_type = ['open', 'high', 'low', 'close']
    combinations = itertools.product(matype, series_type)
    for matype, series_type in combinations:
        get_ppo_series(matype, series_type)

run_ppo_series()
The code above works without issue once symbol and interval values are supplied. (As for your two versions: the first prints None because run_ppo_series never returns anything, and the second stops after one iteration because its return statement sits inside the loop.)
Note the API's rate limit; once you exceed it, the response is just:
"Thank you for using Alpha Vantage! Our standard API call frequency is 5 calls per minute and 500 calls per day"
I didn't bother with the DataFrame portion of get_ppo_series because it's not relevant for receiving the data.
I would leave the functions separate: it looks cleaner, and I think it's standard for a function to do one thing.
A counter can be added to the code, with time.sleep(60) after every 5 iterations, unless you have a different API call frequency.
Here is the function with a 60-second wait after every 5 API calls:
import time

def run_ppo_series():
    matype = list(range(8))
    series_type = ['open', 'high', 'low', 'close']
    combinations = itertools.product(matype, series_type)
    count = 0
    for matype, series_type in combinations:
        if count % 5 == 0 and count != 0:
            time.sleep(60)  # stay under the 5-calls-per-minute limit
        get_ppo_series(matype, series_type)
        count += 1
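If you also want to keep the data, one pattern (a sketch, assuming get_ppo_series returns a DataFrame as in the question's first version rather than just printing) is to collect the frames and concatenate them at the end:

import time
import itertools
import pandas as pd

def run_ppo_series():
    matype = list(range(8))
    series_type = ['open', 'high', 'low', 'close']
    combinations = itertools.product(matype, series_type)
    frames = []  # one DataFrame per (matype, series_type) combination
    count = 0
    for matype, series_type in combinations:
        if count % 5 == 0 and count != 0:
            time.sleep(60)  # stay under the 5-calls-per-minute limit
        frames.append(get_ppo_series(matype, series_type))
        count += 1
    return pd.concat(frames, ignore_index=True)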
I was just curious if someone could help me out with a bit of web scraping. I currently start my scrape at this link: https://www.basketball-reference.com/leagues/NBA_1981_games-october.html. I scrape the "schedule" table for each month and then move on to the next year. I can successfully scrape from 1989 to 2001 (every month) and put the data into the format I want, but my code is very fragile. Is there a better methodology than pulling in the schedule table as one massive piece of text and then splicing it to fit my needs? For example, here is my code:
from selenium import webdriver as wd
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
import pandas as pd
import os
chromeDriverPath = r'path of chromeDriver used by Selenium'
browser = wd.Chrome(executable_path= chromeDriverPath)
#Create the links needed
link ="https://www.basketball-reference.com/leagues/NBA_"
years = range(1989,2018,1)
months = ['october', 'november', 'december', 'january', 'february', 'march',
'april', 'may', 'june', 'july', 'august', 'september']
hour_list = ['1:00','1:30', '1:40','2:00','2:30','3:00','3:30','4:00','4:30','5:00',
'5:30','6:00','6:30','7:00','7:30','8:00','8:30','9:00','9:30',
'10:00','10:30','11:00','11:30','12:00', '12:30','12:40']
ampm = ['pm', 'am']
def scrape(url):
    try:
        browser.get(url)
        schedule = WebDriverWait(browser, 5).until(
            EC.presence_of_all_elements_located((By.ID, "schedule")))
    except TimeoutException:
        print(str(url) + ' does not exist!')
        return
    o_players = [schedule[i].text for i in range(0, len(schedule))]
    o_players = ''.join(o_players)
    o_players = o_players.splitlines()
    o_players = o_players[1:]
    o_players = [x.replace(',', '') for x in o_players]
    o_players = [x.split(' ') for x in o_players]
    l0 = []
    l1 = []
    l2 = []
    for x in o_players:
        if "at" in x:
            l1.append(x[:x.index("at")])
        elif 'Game' in x:
            l0.append(x[:x.index("Game")])
        else:
            l2.append(x)
    l3 = l1 + l2 + l0
    for x in l3:
        for y in x:
            if y in hour_list:
                x.remove(y)
        for t in x:
            if t in ampm:
                x.remove(t)
    ot = ['OT', '2OT', '3OT', '4OT', '5OT']
    for x in l3:
        x.insert(0, 'N/A')
        if x[-1] != 'Score' and x[-1] not in ot:
            x.insert(1, x[-1])
        else:
            x.insert(1, 'N/A')
        for y in ot:
            if y in x:
                x.remove('N/A')
                x.remove(y)
                x.insert(0, y)
    l3 = [t for t in l3 if 'Playoffs' not in t]
    for x in l3:
        if len(x) == 17:
            x.insert(0, ' '.join(x[6:9]))
            x.insert(1, ' '.join(x[11:14]))
            x.insert(1, x[11])
            x.insert(3, x[16])
        if len(x) == 16 and x[-1] != 'Score':
            if x[8].isdigit():
                x.insert(0, ' '.join(x[6:8]))
                x.insert(1, ' '.join(x[10:13]))
                x.insert(1, x[10])
                x.insert(3, x[15])
            else:
                x.insert(0, ' '.join(x[6:9]))
                x.insert(1, ' '.join(x[11:13]))
                x.insert(1, x[11])
                x.insert(3, x[15])
        if len(x) == 16 and x[-1] == 'Score':
            x.insert(0, ' '.join(x[6:9]))
            x.insert(1, ' '.join(x[11:14]))
            x.insert(1, x[11])
            x.insert(3, x[16])
        if len(x) == 15 and x[-1] != 'Score':
            x.insert(0, ' '.join(x[6:8]))
            x.insert(1, ' '.join(x[10:12]))
            x.insert(1, x[10])
            x.insert(3, x[14])
        if len(x) == 15 and x[-1] == 'Score':
            if x[8].isdigit():
                x.insert(0, ' '.join(x[6:8]))
                x.insert(1, ' '.join(x[10:13]))
                x.insert(1, x[10])
                x.insert(3, x[15])
            else:
                x.insert(0, ' '.join(x[6:9]))
                x.insert(1, ' '.join(x[11:13]))
                x.insert(1, x[11])
                x.insert(3, x[15])
        if len(x) == 14:
            x.insert(0, ' '.join(x[6:8]))
            x.insert(1, ' '.join(x[10:12]))
            x.insert(1, x[10])
            x.insert(3, x[14])
    l4 = []
    for x in l3:
        x = x[:10]
        l4.append(x)
    # Working with pandas to standardize the data
    df = pd.DataFrame(l4)
    df['Date'] = df[7] + ' ' + df[8] + ', ' + df[9]
    df['Date'] = pd.to_datetime(df['Date'])
    df = df.sort_values(by=['Date'])
    headers = ['Visitor', 'Visitor Points', 'Home', 'Home Points', 'OT',
               'Attendance', 'Weekday', 'Month', 'Day', 'Year', 'Date']
    headers_order = ['Date', 'Weekday', 'Day', 'Month', 'Year', 'Visitor', 'Visitor Points',
                     'Home', 'Home Points', 'OT', 'Attendance']
    df.columns = headers
    df = df[headers_order]
    file_exists = os.path.isfile("NBA_Scrape.csv")
    if not file_exists:
        df.to_csv('NBA_Scrape.csv', mode='a', header=True, index=False)
    else:
        df.to_csv('NBA_Scrape.csv', mode='a', header=False, index=False)

for x in years:
    link0 = link + str(x) + '_games-'
    for y in months:
        final_links = link0 + str(y) + '.html'
        scrape(final_links)
My code starts to return errors around the year 2001, I believe, and I would like to scrape through to the present. Please help me scrape better. I imagine there is a much more robust approach, like looping through each element of the "schedule" table and appending each one to a separate list or a separate pandas column. Please lend me a hand.
Thanks,
Joe
Your target is perfectly static, so there is no need to run Selenium. I would suggest the Scrapy Python library: it was designed to fit all web scraping needs, and it is an incredibly fast and flexible tool. You can use XPath to pull each element from the page separately instead of treating the table as one huge piece of text.
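To illustrate how static the target is, here is a minimal sketch using pandas.read_html instead of Selenium (not the Scrapy route recommended above; it assumes the schedule is the first table on the page and requires lxml or html5lib to be installed):

import pandas as pd

# Parse the static schedule table directly, no browser needed.
url = 'https://www.basketball-reference.com/leagues/NBA_1981_games-october.html'
schedule = pd.read_html(url)[0]  # assumption: the schedule is the first <table>
print(schedule.head())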