Calling values from dict to create boolean column in dataframe - python-3.x

Searched multiple threads on here for an answer but nothing is quite like what I am looking to do. I am trying to make a boolean column in a dataframe where one of the parameters is based on metal and associated cost maximum, see Dict below.
Cost_ranges = {'Metals': ["Cu", "Pb", "Zn", "Ni", "Mo", "Co", "Sn", "U3O8", "Ag", "Au", "Pt", "Pd", "Rh", "Os", "Ru", "Ir"],
               'Cost Maximum': [350, 200, 200, 500, 800, 1000, 250, 2500, 30, 2500, 500, 1000, 6000, 2500, 2500, 2500]}
The dict is used to complete the below formula:
df_Cost['Total Cost'] >= Cost_ranges['Cost Maximum']
i.e. if df_Cost['Metal'] contains the value 'Cu', it looks up the 'Cost Maximum' of 350 from the dict and uses this in the boolean expression, hence the formula would read:
df_Cost['Total Cost'] >= 350
I need it to apply to all rows in a dataframe. I have been using df.eval() but need an extra layer of processing to match up the right limit per metal.
I have tried using df.eval(), df.query(), df.loc and df.apply() but keep getting 'TypeError: 'Series' objects are mutable, thus they cannot be hashed' or 'ValueError: ('Lengths must match to compare', (9999,), (16,))' for each solution.
Look forward to the responses.

Depending on your actual data, you could do something like:
import numpy as np
import pandas as pd
Cost_ranges = {'Metals': ["Cu", "Pb", "Zn", "Ni", "Mo", "Co", "Sn", "U3O8", "Ag", "Au", "Pt", "Pd", "Rh", "Os", "Ru", "Ir"],
               'Cost Maximum': [350, 200, 200, 500, 800, 1000, 250, 2500, 30, 2500, 500, 1000, 6000, 2500, 2500, 2500]}

N = 20
d = pd.DataFrame({'Metals': np.random.choice(["Cu", "Pb", "Zn", "Ni"], N),
                  'Cost': np.random.random(N) * 1000})

d.merge(pd.DataFrame(Cost_ranges).astype({'Cost Maximum': float}),
        on="Metals", how="left") \
 .eval('want = Cost >= `Cost Maximum`')
#   Metals        Cost  Cost Maximum   want
# 0     Cu  297.386007         350.0  False
# 1     Pb   55.570657         200.0  False
# 2     Pb   91.803336         200.0  False
# 3     Cu  916.273995         350.0   True
# 4     Zn  796.383326         200.0   True
# 5     Pb  112.504581         200.0  False

Assuming your df_Cost looks a little like this (with potentially more rows and columns):
>>> df_Cost
   Total Cost Metal
0         315    Cu
1         420    Cu
The easiest way is to use a dictionary to translate from Metal to max cost. First, turn Cost_ranges into a dictionary mapping metal name to cost maximum:
>>> cost_lookup = dict(zip(Cost_ranges['Metals'], Cost_ranges['Cost Maximum']))
>>> cost_lookup
{'Cu': 350, 'Pb': 200, 'Zn': 200, 'Ni': 500, 'Mo': 800, 'Co': 1000, 'Sn': 250, 'U3O8': 2500, 'Ag': 30, 'Au': 2500, 'Pt': 500, 'Pd': 1000, 'Rh': 6000, 'Os': 2500, 'Ru': 2500, 'Ir': 2500}
>>> df_Cost['Metal'].map(cost_lookup)
0    350
1    350
Name: Metal, dtype: int64
>>> df_Cost['Total Cost'] >= df_Cost['Metal'].map(cost_lookup)
0    False
1     True
dtype: bool


Multiply each row of an array with coefficients in list - Python

I am very new to Python and need help. This is the problem statement:
I want to calculate the value of each of the three houses by multiplying the rows of the array X (each row representing one house) with the coefficients in list c, so for the first house: Price = (66x3000) + (5x200) + (15x-50) + (2x5000) + (500x100) = 258,250
Do not use numpy
Print the price of the three houses
This is what I have so far:
# input values for three houses:
# - size [m^2],
# - size of the sauna [m^2],
# - distance to water [m],
# - number of indoor bathrooms,
# - proximity of neighbors [m]
X = [[66, 5, 15, 2, 500],
     [21, 3, 50, 1, 100],
     [120, 15, 5, 2, 1200]]
# coefficient values
c = [3000, 200, -50, 5000, 100]

def predict(X, c):
    price = 0
    for i in range(len(X)):
        for j in range(len(X[i])):
            price += c[j] * X[i][j]
        print(price)

predict(X, c)
The output is
258250
334350
827100
The program adds the values of the 2nd and 3rd houses to the previous running total, rather than printing each house's value separately. How can I fix this?
Many thanks!
Move the line
price = 0
into the outer for loop:
def predict(X, c):
    for i in range(len(X)):
        price = 0
        for j in range(len(X[i])):
            ...

How do I iterate through combinations of two lists and perform a function each time?

Doing an Alphavantage API pull for historic stock data. I'm pulling one of their indicators. Instead of writing 36 separate functions and manually pulling, I'd like to iterate through the 36 possible combinations and do the pull each time with different variables (the variables being each of the combinations). Below is my code. It currently returns "NONE". What am I doing wrong?
Also, is there a way to combine these two functions into one?
Thanks!
def get_ppo_series(matype, series_type):
    pull_parameters = {
        'function': 'PPO',
        'symbol': stock,
        'interval': interval,
        'series_type': series_type,
        'fastperiod': 12,
        'slowperiod': 26,
        'matype': matype,
        'datatype': 'json',
        'apikey': key
    }
    column = 0
    pull = rq.get(url, params=pull_parameters)
    data = pull.json()
    df = pd.DataFrame.from_dict(data['Technical Analysis: PPO'], orient='index', dtype=float)
    df.reset_index(level=0, inplace=True)
    df.columns = ['Date', 'PPO Series ' + str(column)]
    df.insert(0, 'Stock', stock)
    column += 1
    return df.tail(past_years * annual_trading_days)

def run_ppo_series():
    matype = list(range(8))
    series_type = ['open', 'high', 'low', 'close']
    combinations = product(matype, series_type)
    for matype, series_type in combinations:
        get_ppo_series(matype, series_type)

print(run_ppo_series())
I also tried the following. This version at least ran one iteration and returned data. But it stops there ???
def get_ppo_series():
    column = 0
    matype = list(range(8))
    series_type = ['open', 'high', 'low', 'close']
    combinations = product(matype, series_type)
    for matype, series_type in combinations:
        pull_parameters = {
            'function': 'PPO',
            'symbol': stock,
            'interval': interval,
            'series_type': series_type,
            'fastperiod': 12,
            'slowperiod': 26,
            'matype': matype,
            'datatype': 'json',
            'apikey': key
        }
        pull = rq.get(url, params=pull_parameters)
        data = pull.json()
        df = pd.DataFrame.from_dict(data['Technical Analysis: PPO'], orient='index', dtype=float)
        df.reset_index(level=0, inplace=True)
        df.columns = ['Date', 'PPO Series ' + str(column)]
        df.insert(0, 'Stock', stock)
        column += 1
        return df.tail(past_years * annual_trading_days)

print(get_ppo_series())
import requests as rq
import itertools

url = 'https://www.alphavantage.co/query?'
key = 'get your own key'

def get_ppo_series(matype, series_type):
    pull_parameters = {
        'function': 'PPO',
        'symbol': 'msft',
        'interval': '60min',
        'series_type': series_type,
        'fastperiod': 12,
        'slowperiod': 26,
        'matype': matype,
        'datatype': 'json',
        'apikey': key
    }
    column = 0
    pull = rq.get(url, params=pull_parameters)
    data = pull.json()
    print('*' * 50)
    print(f'MAType: {matype}, Series: {series_type}')
    print(data)

def run_ppo_series():
    matype = list(range(8))
    series_type = ['open', 'high', 'low', 'close']
    combinations = itertools.product(matype, series_type)
    for matype, series_type in combinations:
        get_ppo_series(matype, series_type)

run_ppo_series()
The code above works without issue once symbol and interval values are supplied.
Thank you for using Alpha Vantage! Our standard API call frequency is 5 calls per minute and 500 calls per day
I didn't bother with the DataFrame portion of get_ppo_series because it's not relevant for receiving the data
I would leave the functions separate; it looks cleaner, and it's standard for a function to do one thing.
A counter can be added to the code and time.sleep(60) after every 5 iterations unless you have a different API call frequency
Function with 60 second wait after every 5 api calls
import time

def run_ppo_series():
    matype = list(range(8))
    series_type = ['open', 'high', 'low', 'close']
    combinations = itertools.product(matype, series_type)
    count = 0
    for matype, series_type in combinations:
        if count % 5 == 0 and count != 0:
            time.sleep(60)
        get_ppo_series(matype, series_type)
        count += 1
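If you do want to keep the DataFrame portion of get_ppo_series, the usual pattern is to collect each iteration's frame in a list and concatenate once at the end, instead of returning inside the loop. A sketch with a stand-in for the API pull (the real function would build its frame from the JSON response):

```python
import itertools
import pandas as pd

def get_ppo_series(matype, series_type):
    # stand-in for the real API pull + DataFrame construction
    return pd.DataFrame({'matype': [matype], 'series': [series_type]})

def run_ppo_series():
    frames = []  # accumulate every iteration's result
    for matype, series_type in itertools.product(range(8),
                                                 ['open', 'high', 'low', 'close']):
        frames.append(get_ppo_series(matype, series_type))
    # one combined frame, 32 rows (8 matypes x 4 series types)
    return pd.concat(frames, ignore_index=True)

result = run_ppo_series()
```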

Pandas groupby dataframe then return single value result (sum, total)

Dears,
Please help me, I am stuck. I guess it should not be difficult, but I feel overwhelmed.
I need to make an ageing of receivables, so they must be separated into different buckets.
Suppose we have only 3 groups: current, above_10Days and above_20Days and the following table:
d = {'Cust': ['Dfg', 'Ers', 'Dac', 'Vds', 'Mhf', 'Kld', 'Xsd', 'Hun'],
     'Amount': [10000, 100000, 4000, 5411, 756000, 524058, 4444785, 54788],
     'Days': [150, 21, 30, 231, 48, 15, -4, -14]}
I need to group the amounts to a total sum, depending on the Ageing group.
Example:
Current: 4499573, etc.
For that purpose, I tried to group the receivables with such code:
above_10Days = df.groupby((df['Days'] > 0) & (df['Days'] <= 10))
above10sum = above_10Days.Amount.sum().iloc[1]
It works perfectly, but only when there are actual amounts in the group. When there are no such A/R it throws an exception and stops executing. I tried using a function, and converting 'None' values to 0, but with no success.
Hopefully someone could know the solution.
Thanks in advance
IIUC:
import numpy as np
import pandas as pd

d = {'Cust': ['Dfg', 'Ers', 'Dac', 'Vds', 'Mhf', 'Kld', 'Xsd', 'Hun'],
     'Amount': [10000, 100000, 4000, 5411, 756000, 524058, 4444785, 54788],
     'Days': [150, 21, 30, 231, 48, 15, -4, -14]}
df = pd.DataFrame(d)

# Updated to assign to output dataframe
df_out = (df.groupby(pd.cut(df.Days,
                            [-np.inf, 10, 20, np.inf],
                            labels=['Current', 'Above 10 Days', 'Above 20 Days']))['Amount']
            .sum())
Output:
Days
Current          4499573
Above 10 Days     524058
Above 20 Days     875411
Name: Amount, dtype: int64
Variable assignment using .loc:
varCurrent = df_out.loc['Current']
var10 = df_out.loc['Above 10 Days']
var20 = df_out.loc['Above 20 Days']
print(varCurrent,var10,var20)
Output:
4499573 524058 875411
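A useful side effect of grouping on pd.cut categories is that, with observed=False, empty buckets still appear in the result with a sum of 0 instead of raising, which addresses the original exception. A quick sketch (observed=False passed explicitly, since the default has been changing across pandas versions):

```python
import numpy as np
import pandas as pd

# no receivables fall in the 'Current' bucket here (all Days > 20)
df = pd.DataFrame({'Amount': [100, 200], 'Days': [30, 40]})

buckets = pd.cut(df.Days, [-np.inf, 10, 20, np.inf],
                 labels=['Current', 'Above 10 Days', 'Above 20 Days'])
df_out = df.groupby(buckets, observed=False)['Amount'].sum()

print(df_out.loc['Current'])  # 0 -- empty bucket, no exception
```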

The corresponding variable (name) does not print out correctly (numpy)

The data are stored in three columns in csv file.
I am trying to print the highest val with its corresponding name. For some reason, even though it prints the correct highest value, it does not print the correct name. I use the following code. Where does it go wrong?
import numpy as np
names val
Jason 100
Jacob 250
Josh 90
John 80
name = data['Names'] # the first column in csv
val = (100, 250, 90, 80)
print(name[np.max(val)], np.max(val))
It should print Jacob, 250 but mine prints the correct val with the wrong name.
np.max(val) is the maximum value from that array. But name[idx] requires an index, not a value. Have you tried argmax?
In [474]: txt='''names val
...: Jason 100
...: Jacob 250
...: Josh 90
...: John 80'''
...:
In [476]: data = np.genfromtxt(txt.splitlines(), names=True, dtype=None, encoding=None)
In [477]: data
Out[477]:
array([('Jason', 100), ('Jacob', 250), ('Josh', 90), ('John', 80)],
dtype=[('names', '<U5'), ('val', '<i8')])
In [478]: names = data['names']
In [479]: val = data['val']
In [480]: val
Out[480]: array([100, 250, 90, 80])
In [481]: np.max(val)
Out[481]: 250
In [483]: names[np.max(val)]
...
IndexError: index 250 is out of bounds for axis 0 with size 4
names[250] does not make sense, does it?
Using argmax instead:
In [484]: idx=np.argmax(val)
In [485]: idx
Out[485]: 1
In [486]: names[idx]
Out[486]: 'Jacob'
In [487]: data[idx]
Out[487]: ('Jacob', 250)
Or with a boolean mask
In [488]: val==np.max(val)
Out[488]: array([False, True, False, False])
In [489]: data[_]
Out[489]: array([('Jacob', 250)], dtype=[('names', '<U5'), ('val', '<i8')])
where could also be used:
In [490]: np.where(val==np.max(val))
Out[490]: (array([1]),)
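Putting the fix together outside the REPL, a minimal standalone sketch:

```python
import numpy as np

names = np.array(['Jason', 'Jacob', 'Josh', 'John'])
val = np.array([100, 250, 90, 80])

idx = np.argmax(val)         # position of the maximum, not its value
print(names[idx], val[idx])  # Jacob 250
```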

Create new column in pandas dataframe based on if/elif/and functions

I have searched for my exact issue to no avail. These two threads Creating a new column based on if-elif-else condition and
create new pandas dataframe column based on if-else condition with a lookup guided my code though my code fails to execute.
Issues: I have a dataframe which I have example-reproduced below. The region attribute has only two values - a or b (though it could have more), and the same goes for year; region a can appear with both years, etc. What I want to do is create a new column, "dollars": look up the region, and if it is region "a" AND the year is e.g. 2006, take the sales in that row, multiply it by the rate for that year, and put the value in the new column. I am a beginner, and below is what I have so far - via a function - and executing the .apply function returns a ValueError: ('The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().', 'occurred at index 0'). I am particularly interested in a more efficient implementation, as the dataframe is rather large, and I would love to optimize computational efficiency.
import pandas as pd
rate_2006, rate_2007 = 100, 200
c = {
    'region': ["a", "a", "a", "a", "a", "b", "b", "b", "b", "a", "b"],
    'year': [2006, 2007, 2007, 2006, 2006, 2006, 2007, 2007, 2007, 2006, 2007],
    'sales': [500, 100, 2990, 15, 5000, 2000, 150, 300, 250, 1005, 600]
}
df1 = pd.DataFrame(c)
df1

def new_col(row):
    if df1["region"] == "a" and df1["year"] == 2006:
        nc = row["sales"] * rate_2006
    elif df1["region"] == "a" and df1["year"] == 2007:
        nc = row["sales"] * rate_2007
    elif df1["region"] == "b" and df1["year"] == 2006:
        nc = row["sales"] * rate_2006
    else:
        nc = row["sales"] * rate_2007
    return nc

df1["Dollars"] = df1.apply(new_col, axis=1)
df1
The issue may be due to the way you are using it. I don't know if it will fully fit your needs, but I have rewritten the code below and it works:
import pandas as pd

rate_2006, rate_2007 = 100, 200
c = {
    'region': ["a", "a", "a", "a", "a", "b", "b", "b", "b", "a", "b"],
    'year': [2006, 2007, 2007, 2006, 2006, 2006, 2007, 2007, 2007, 2006, 2007],
    'sales': [500, 100, 2990, 15, 5000, 2000, 150, 300, 250, 1005, 600]
}
df1 = pd.DataFrame(c)
print(df1)

def new_col(value):
    if df1.loc[value, "region"] == "a" and df1.loc[value, "year"] == 2006:
        df1.loc[value, "Dollars"] = df1.loc[value, "sales"] * rate_2006
    elif df1.loc[value, "region"] == "a" and df1.loc[value, "year"] == 2007:
        df1.loc[value, "Dollars"] = df1.loc[value, "sales"] * rate_2007
    elif df1.loc[value, "region"] == "b" and df1.loc[value, "year"] == 2006:
        df1.loc[value, "Dollars"] = df1.loc[value, "sales"] * rate_2006
    else:
        df1.loc[value, "Dollars"] = df1.loc[value, "sales"] * rate_2007

for value in range(len(df1)):
    new_col(value)
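Since the rate in this example effectively depends only on the year (both regions use rate_2006 for 2006 and rate_2007 for 2007), a vectorized version avoids the row loop entirely, which matters for a large dataframe. This is my own variation on the answer, not the original code:

```python
import pandas as pd

rate_2006, rate_2007 = 100, 200
df1 = pd.DataFrame({
    'region': ["a", "a", "a", "a", "a", "b", "b", "b", "b", "a", "b"],
    'year': [2006, 2007, 2007, 2006, 2006, 2006, 2007, 2007, 2007, 2006, 2007],
    'sales': [500, 100, 2990, 15, 5000, 2000, 150, 300, 250, 1005, 600]
})

# map each row's year to its rate, then multiply column-wise
df1['Dollars'] = df1['sales'] * df1['year'].map({2006: rate_2006, 2007: rate_2007})
```

If the rate ever did vary by region as well, np.select over (region, year) conditions would be the analogous vectorized tool.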
