I would like to access all elementary flows generated by an activity in Brightway, in a table gathering the flows and their amounts.
Let's assume a random activity:
lca = bw.LCA({random_act: 2761}, method)
lca.lci()
lca.lcia()
lca.inventory
I have tried several ways but none works:
I tried to export my LCI with brightway2-io, but errors appear that I cannot solve:
bw2io.export.excel.lci_matrices_to_excel(db_name) returns an error when writing the biosphere matrix data for a specific row:
--> 120 bm_sheet.write_number(bio_lookup[row] + 1, act_lookup[col] + 1, value)
122 COLUMNS = (
123 u"Index",
124 u"Name",
(...)
128 u"Location",
129 )
131 tech_sheet = workbook.add_worksheet("technosphere-labels")
KeyError: 1757
I also tried to get the amount of a specific elementary flow manually. For example, let's say I want to compute the total amount of Aluminium needed for the activity. To do so, I try this:
flow_Al = Database("biosphere3").search("Aluminium, in ground")[0]
(I only want the resource Aluminium that is extracted as an ore, from the ground)
amount_Al = 0
row = lca.biosphere_dict[flow_Al]
col_indices = lca.biosphere_matrix[row, :].tocoo()
amount_consumers_lca = [lca.inventory[row, index] for index in col_indices.col]
for j in amount_consumers_lca:
    amount_Al = amount_Al + j
amount_Al
This works but the final amount is too low and probably isn't what I'm looking for...
How can I solve this?
Thank you
This will work on Brightway 2 and 2.5:
import pandas as pd
import bw2data as bd
import warnings

def create_inventory_dataframe(lca, cutoff=None):
    # Sum each biosphere row over all activity columns: total amount per elementary flow
    array = lca.inventory.sum(axis=1)
    if cutoff is not None and not (0 < cutoff < 1):
        warnings.warn(f"Ignoring invalid cutoff value {cutoff}")
        cutoff = None
    total = array.sum()
    include = lambda x: abs(x / total) >= cutoff if cutoff is not None else True
    # Brightway 2.5 exposes lca.dicts.biosphere; Brightway 2 uses lca.biosphere_dict
    if hasattr(lca, 'dicts'):
        mapping = lca.dicts.biosphere
    else:
        mapping = lca.biosphere_dict
    data = []
    for key, row in mapping.items():
        amount = array[row, 0]
        if include(amount):
            data.append((bd.get_activity(key), row, amount))
    data.sort(key=lambda x: abs(x[2]))
    return pd.DataFrame([{
        'row_index': row,
        'amount': amount,
        'name': flow.get('name'),
        'unit': flow.get('unit'),
        'categories': str(flow.get('categories'))
    } for flow, row, amount in data])
The cutoff doesn't make much sense for the plain inventory, but the function can be adapted for the LCIA result (characterized_inventory) as well.
Once you have a pandas DataFrame you can filter or export easily.
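For the specific question above, the returned DataFrame can then be filtered by flow name and summed. A minimal usage sketch, reusing the lca object from the question (the filter string and the export file name are illustrative):
df = create_inventory_dataframe(lca)

# e.g. pick out the "Aluminium, in ground" rows and sum their amounts
aluminium = df[df['name'].str.contains("Aluminium, in ground", case=False, regex=False)]
print(aluminium[['name', 'categories', 'unit', 'amount']])
print("Total aluminium, in ground:", aluminium['amount'].sum())

# or export the whole table (requires openpyxl)
df.to_excel("inventory_flows.xlsx", index=False)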
I have an order list with a separate inventory system (Google Sheets). Using Pandas, I'm trying to merge the two for an efficient "pick list" and have had some mild success. However, in testing (adding multiple quantities for an order, having multiple orders with the same item/SKU type) it starts breaking down.
orders = "orderNumber,SKU,Quantity\r\n11111,GreenSneakers,2\r\n11111,Brown_Handbag,1\r\n22222,GreenSneakers,1\r\n33333,Blue_Handbag,1"
str_orders = StringIO(orders, newline='\n')
df_orders = pd.read_csv(str_orders, sep=",")
inventory = "SKU,Location\r\nGreenSneakers,DA13A\r\nGreenSneakers,DA13A\r\nRed_Handbag,DA12A\r\nGreenSneakers,DB34C\r\nGreenSneakers,DB33C\r\n"
str_inventory = StringIO(inventory, newline='\n')
df_inventory = pd.read_csv(str_inventory, sep=",")
df_inventory = df_inventory.sort_values(by='Location', ascending=False)
df_pList = df_orders.merge(df_inventory.drop_duplicates(subset=['SKU']), on='SKU', how='left')
print(df_pList)
pseudo desired output:
'
orderNumber, SKU, Quantity, Location
11111, GreenSneakers, 1, DB34C
11111, GreenSneakers, 1, DB33C
11111, Brown_Handbag, 1, NA
22222, GreenSneakers, 1, DA13A
33333, Blue_Handbag, 1, NA
'
Is merge even a way to solve this type of problem? I'm trying to stay away from looping if possible.
The code below makes three DataFrames:
df_pickList is what you were asking for.
copy_inventory contains what the inventory would look like if you picked everything (in case you want to write the DataFrame back out to overwrite your inventory file). You could elect not to make the copy and use your df_inventory directly, but especially in beta it's handy to work on a copy.
df_outOfStock is a handy bucket to catch things you don't have in inventory. Cross-check it against current orders to see what you need to reorder.
from io import StringIO
import pandas as pd
import copy
orders = """orderNumber,SKU,Quantity
11111,GreenSneakers,2
11111,Brown_Handbag,1
22222,GreenSneakers,1
33333,Blue_Handbag,1
"""
str_orders = StringIO(orders, newline='\n')
df_orders = pd.read_csv(str_orders, sep=",")
inventory = """SKU,Location
GreenSneakers,DA13A
GreenSneakers,DA13A
Red_Handbag,DA12A
GreenSneakers,DB34C
GreenSneakers,DB33C
"""
str_inventory = StringIO(inventory, newline='\n')
df_inventory = pd.read_csv(str_inventory, sep=",")
df_inventory = df_inventory.sort_values(by='Location', ascending=False)
df_outOfStock = pd.DataFrame() #placeholder to report a lack of stock
df_pickList = pd.DataFrame() #placeholder to make pick list
copy_inventory = copy.deepcopy(df_inventory) #make a copy of inventory to decimate
for orderIndex, orderLineItem in df_orders.iterrows():
    for repeat in range(orderLineItem["Quantity"]):  # inventory location is 1 row per item, so do that many picks per order line item
        availableInventory = copy_inventory.loc[copy_inventory.loc[:, "SKU"] == orderLineItem["SKU"], :]
        if len(availableInventory) == 0:
            # Failed to find any SKU to pull
            # DataFrame.append was removed in pandas 2.0, so build the row and concat instead
            df_outOfStock = pd.concat([df_outOfStock, orderLineItem.to_frame().T], ignore_index=True)
        else:
            pickRow = {"orderNumber": orderLineItem["orderNumber"],
                       "SKU": orderLineItem["SKU"],
                       "Quantity": 1,
                       "Location": availableInventory.iloc[0]["Location"]}
            df_pickList = pd.concat([df_pickList, pd.DataFrame([pickRow])], ignore_index=True)
            copy_inventory.drop(index=availableInventory.index[0], inplace=True)
Thanks, this was a fun little exercise compared to dealing with non-integer quantities (e.g. feet of angle iron).
(Original wrong answer below)
I would recommend concatenating the rows into a single table (rather than merging and/or overwriting values), then using groupby to aggregate the values.
As a primer, I would start with these two links on putting your data together:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html#pandas.concat
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html?highlight=groupby#pandas.DataFrame.groupby
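As a further sketch of a loop-free merge, under the assumption that one inventory row equals one pickable unit: expand each order line into one row per unit, number the units per SKU with cumcount, number the inventory rows per SKU the same way, and left-merge on SKU plus that counter (the pick_no column name is made up for the example):
from io import StringIO
import pandas as pd

orders = """orderNumber,SKU,Quantity
11111,GreenSneakers,2
11111,Brown_Handbag,1
22222,GreenSneakers,1
33333,Blue_Handbag,1
"""
inventory = """SKU,Location
GreenSneakers,DA13A
GreenSneakers,DA13A
Red_Handbag,DA12A
GreenSneakers,DB34C
GreenSneakers,DB33C
"""
df_orders = pd.read_csv(StringIO(orders))
df_inventory = pd.read_csv(StringIO(inventory)).sort_values(by='Location', ascending=False)

# One row per unit to pick: repeat each order line by its Quantity
units = df_orders.loc[df_orders.index.repeat(df_orders['Quantity'])].copy()
units['Quantity'] = 1

# Number the units and the inventory rows per SKU, then merge on SKU plus that counter
units['pick_no'] = units.groupby('SKU').cumcount()
df_inventory['pick_no'] = df_inventory.groupby('SKU').cumcount()
df_pickList = units.merge(df_inventory, on=['SKU', 'pick_no'], how='left').drop(columns='pick_no')

print(df_pickList)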
I'm printing out the frequency of murders in each state in each particular decade. However, I just want to print the state, decade, and its victim count. What I have right now prints out all the columns with the same frequencies. How do I change it so that I just have 3 columns: State, Decade, and Victim Count?
I'm currently using the groupby function to group by the state and decade and setting that equal to a variable called counts.
xl = pd.ExcelFile('Wyoming.xlsx')
df = xl.parse('Sheet1')
df['Decade'] = (df['Year'] // 10) * 10
counts = df.groupby(['State', 'Decade']).count()
print(counts)
The outcome prints out all the columns in the file with the same frequencies, whereas I just want 3 columns: State, Decade, Victim Count.
Sample Text File
You should reset_index on the result of the groupby, and then select the columns you want from the new DataFrame.
Something like:
xl = pd.ExcelFile('Wyoming.xlsx')
df = xl.parse('Sheet1')
df['Decade'] = (df['Year'] // 10) * 10
counts = df.groupby(['State', 'Decade']).count()
counts = counts.reset_index()[['State', 'Decade', 'Victim Count']]
print(counts)
Select the columns that you want:
counts = df.loc[:, ['State', 'Decade', 'Victim Count']].groupby(['State', 'Decade']).count()
or, from the counts you already have (State and Decade end up in the index after the groupby):
print(counts[['Victim Count']])
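As an alternative sketch, assuming each row of the sheet is one victim (so the victim count is just the number of rows per group), groupby(...).size() gives exactly the three columns asked for:
import pandas as pd

df = pd.read_excel('Wyoming.xlsx', sheet_name='Sheet1')
df['Decade'] = (df['Year'] // 10) * 10

# One row per victim assumed: size() counts rows per (State, Decade) and names the result
counts = df.groupby(['State', 'Decade']).size().reset_index(name='Victim Count')
print(counts)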
New to Pandas so I'm sorry if there is an obvious solution...
I imported a CSV that only had 2 columns and I created a 3rd column.
Here's a screen shot of the top 10 rows and header:
Screen shot of DataFrame
I've figured out how to find the min and max values in the ['Amount Changed'] column but also need to pull the date associated with the min and max - but not the index and ['Profit/Loss']. I've tried iloc, loc, read about groupby - I can't get any of them to return a single value (in this case a date) that I can use again.
My goal is to create a new variable 'Gi_Date' that is in the same row as the max value in ['Amount Changed'] but tied to the date in the ['Date'] column.
I'm trying to keep the variables separate so I can use them in print statements, write them to txt files, etc.
import os
import csv
import pandas as pd
import numpy as np
#path for CSV file
csvpath = ("budget_data.csv")
#Read CSV into pandas and give it a variable name Bank_pd
Bank_pd = pd.read_csv(csvpath, parse_dates=True)
#Number of month records in the CSV
Months = Bank_pd["Date"].count()
#Total amount of money captured in the data converted to currency
Total_Funds = '${:.0f}'.format(Bank_pd["Profit/Losses"].sum())
#Determine the amount of increase or decrease from the previous month
AmtChange = Bank_pd["Profit/Losses"].diff()
Bank_pd["Amount Changed"] = AmtChange
#Identify the greatest positive change
GreatestIncrease = '${:.0f}'.format(Bank_pd["Amount Changed"].max())
Gi_Date = Bank_pd[Bank_pd["Date"] == GreatestIncrease]
#Identify the greatest negative change
GreatestDecrease = '${:.0f}'.format(Bank_pd["Amount Changed"].min())
Gd_Date = Bank_pd[Bank_pd['Date'] == GreatestDecrease]
print(f"Total Months: {Months}")
print(f"Total: {Total_Funds}")
print(f"Greatest Increase in Profits: {Gi_Date} ({GreatestIncrease})")
print(f"Greatest Decrease in Profits: {Gd_Date} ({GreatestDecrease})")
When I run the script in Git Bash I don't get an error anymore, so I think I'm getting close, but rather than showing the date it says:
$ python PyBank.py
Total Months: 86
Total: $38382578
Greatest Increase in Profits: Empty DataFrame
Columns: [Date, Profit/Losses, Amount Changed]
Index: [] ($1926159)
Greatest Decrease in Profits: Empty DataFrame
Columns: [Date, Profit/Losses, Amount Changed]
Index: [] ($-2196167)
I want it to print out like this:
$ python PyBank.py
Total Months: 86
Total: $38382578
Greatest Increase in Profits: Feb-2012 ($1926159)
Greatest Decrease in Profits: Sept-2013 ($-2196167)
Here is one year's worth of the original DataFrame:
bank_pd = pd.DataFrame({'Date':['Jan-10', 'Feb-10', 'Mar-10', 'Apl-10', 'May-10', 'Jun-10', 'Jul-10', 'Aug-10', 'Sep-10', 'Oct-10', 'Nov-10', 'Dec-10'],
'Profit/Losses':[867884, 984655, 322013, -69417, 310503, 522857, 1033096, 604885, -216386, 477532, 893810, -80353]})
The expected output with the sample df would be:
Total Months: 12
Total Funds: $5651079
Greatest Increase in Profits: Oct-10 ($693918)
Greatest Decrease in Profits: Dec-10 ($-974163)
I also had an error in the sample dataframe from above, I was missing a month when I typed it out quickly - it's fixed now.
Thanks!
I'm seeing a few glitches in the variables used.
Bank_pd["Amount Changed"] = AmtChange
The above statement actually adds the "Amount Changed" column to the DataFrame; after it runs you can use that column for any manipulation.
Below is the updated code; the newly added line is marked with a comment. You could add further formatting:
import pandas as pd

csvpath = "budget_data.csv"
Bank_pd = pd.read_csv(csvpath, parse_dates=True)
inp_bank_pd = pd.DataFrame(Bank_pd)
Months = Bank_pd["Date"].count()
Total_Funds = '${:.0f}'.format(Bank_pd["Profit/Losses"].sum())
AmtChange = Bank_pd["Profit/Losses"].diff()
Bank_pd["Amount Changed"] = AmtChange  # newly added: create the column before using it
GreatestIncrease = Bank_pd["Amount Changed"].max()
Gi_Date = inp_bank_pd.loc[Bank_pd["Amount Changed"] == GreatestIncrease]
print(Months)
print(Total_Funds)
print(Gi_Date['Date'].values[0])
print(GreatestIncrease)
In your example code, Gi_Date and Gd_Date end up as new (empty) DataFrames instead of single values. Change Gi_Date and Gd_Date, sorting on the "Amount Changed" column (that is what "greatest increase/decrease" refers to); na_position keeps the NaN produced by the first diff() row out of the way:
Gi_Date = Bank_pd.sort_values('Amount Changed', na_position='first').tail(1).Date
Gd_Date = Bank_pd.sort_values('Amount Changed', na_position='last').head(1).Date
Check outputs:
Gi_Date
Oct-10
Gd_Date
Dec-10
To print it the way you want using string formatting:
print("Total Months: %s" %(Months))
print("Total: %s" %(Total_Funds))
print("Greatest Increase in Profits: %s %s" %(Gi_Date.to_string(index=False), GreatestIncrease))
print("Greatest Decrease in Profits: %s %s" %(Gd_Date.to_string(index=False), GreatestDecrease))
Note that if you don't use:
Gd_Date.to_string(index=False)
the pandas object information will be included in the print output, like it is in your example where you see the DataFrame info.
Output for the 12 month sample DF:
Total Months: 12
Total: $5651079
Greatest Increase in Profits: Oct-10 $693918
Greatest Decrease in Profits: Dec-10 $-974163
Use Series.idxmin and Series.idxmax with loc:
df.loc[df['Amount Changed'].idxmin(), 'Date']
df.loc[df['Amount Changed'].idxmax(), 'Date']
Full example based on your sample DataFrame:
df = pd.DataFrame({'Date':['Jan-2010', 'Feb-2010', 'Mar-2010', 'Apr-2010', 'May-2010',
'Jun-2010', 'Jul-2010', 'Aug-2010', 'Sep-2010', 'Oct-2010'],
'Profit/Losses': [867884,984655,322013,-69417,310503,522857,
1033096,604885,-216386,477532]})
df['Amount Changed'] = df['Profit/Losses'].diff()
print(df)
Date Profit/Losses Amount Changed
0 Jan-2010 867884 NaN
1 Feb-2010 984655 116771.0
2 Mar-2010 322013 -662642.0
3 Apr-2010 -69417 -391430.0
4 May-2010 310503 379920.0
5 Jun-2010 522857 212354.0
6 Jul-2010 1033096 510239.0
7 Aug-2010 604885 -428211.0
8 Sep-2010 -216386 -821271.0
9 Oct-2010 477532 693918.0
print(df.loc[df['Amount Changed'].idxmin(), 'Date'])
print(df.loc[df['Amount Changed'].idxmax(), 'Date'])
Sep-2010
Oct-2010
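To wire this into the print statements from the question, a short sketch using the question's variable names (the dollar formatting is the same as in the original code):
Bank_pd["Amount Changed"] = Bank_pd["Profit/Losses"].diff()

GreatestIncrease = '${:.0f}'.format(Bank_pd["Amount Changed"].max())
Gi_Date = Bank_pd.loc[Bank_pd["Amount Changed"].idxmax(), "Date"]

GreatestDecrease = '${:.0f}'.format(Bank_pd["Amount Changed"].min())
Gd_Date = Bank_pd.loc[Bank_pd["Amount Changed"].idxmin(), "Date"]

print(f"Greatest Increase in Profits: {Gi_Date} ({GreatestIncrease})")
print(f"Greatest Decrease in Profits: {Gd_Date} ({GreatestDecrease})")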
This is by far the most difficult problem I have faced. I am trying to create plots indexed on ratetype. For example, I want to efficiently build a table of each unique ratetype and the average customer number for that ratetype. Writing the expression that selects the rows for each individual ratetype, takes the average customer number for that type, and then builds a Series from those two equal-length lists is way over my head in pandas.
The number of different ratetypes can be in the hundreds. Building the list programmatically is clearly a better choice than hard-coding each possibility, as the list will only grow in size and variability.
""" a section of the data for example use. Working with column "Ratetype"
column "NumberofCustomers" to work towards getting something like
list1 = unique occurs of ratetypes
list2 = avg number of customers for each ratetype
rt =['fixed','variable',..]
avg_cust_numbers = [45.3,23.1,...]
**basically for each ratetype: get mean of all row data for custno column**
ratetype,numberofcustomers
fixed,1232
variable,1100
vec,199
ind,1211
alg,123
bfd,788
csv,129
ggg,1100
aaa,566
acc,439
"""
df = df[['ratetype', 'numberofcustomers']]
fixed = df.loc[df['ratetype'] == 'fixed']
avg_fixed_custno = fixed['numberofcustomers'].mean()
rt_counts = df.ratetype.value_counts()
rt_uniques = df.ratetype.unique()
# rt_uniques would be the same size vector as avg_cust_nos, has to be anyway
avg_cust_nos = [avg_fixed_custno, avg_variable_custno]  # hard-coded one entry per ratetype so far
My goal is to create and plot these subplots using matplotlib.pyplot.
import matplotlib.pyplot as plt

data = {'ratetypes': pd.Series(rt_counts, index=rt_uniques),
        'Avg_cust_numbers': pd.Series(avg_cust_nos, index=rt_uniques),
        }
df = pd.DataFrame(data)
df = df.sort_values(by=['ratetypes'], ascending=False)
fig, axes = plt.subplots(nrows=2, ncols=1)
for i, c in enumerate(df.columns):
    df[c].plot(kind='bar', ax=axes[i], figsize=(12, 10), title=c)
plt.savefig('custno_byrate.png', bbox_inches='tight')
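A minimal sketch of the groupby approach implied above, assuming the columns are named ratetype and numberofcustomers as in the sample data; it builds both lists in one pass each, with no per-ratetype hard coding:
import pandas as pd
import matplotlib.pyplot as plt

# assuming df has columns 'ratetype' and 'numberofcustomers' as in the sample above
counts = df['ratetype'].value_counts()                              # rows per ratetype
avg_cust_nos = df.groupby('ratetype')['numberofcustomers'].mean()   # mean customers per ratetype

plot_df = pd.DataFrame({'ratetypes': counts, 'Avg_cust_numbers': avg_cust_nos})
plot_df = plot_df.sort_values(by='ratetypes', ascending=False)

fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(12, 10))
for i, c in enumerate(plot_df.columns):
    plot_df[c].plot(kind='bar', ax=axes[i], title=c)
plt.savefig('custno_byrate.png', bbox_inches='tight')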
I'm trying to apply a complex function to a pandas DataFrame, and I'm wondering if there's a faster way to do it. A simplified version of my data looks like this:
UID,UID2,Time,EventType
1,1,18:00,A
1,1,18:05,B
1,2,19:00,A
1,2,19:03,B
2,6,20:00,A
3,4,14:00,A
What I want to do is, for each combination of UID and UID2, check whether there is both a row with EventType = A and a row with EventType = B, then calculate the time difference and add it back as a new column. So the new dataset would be:
UID,UID2,Time,EventType,TimeDiff
1,1,18:00,A,5
1,1,18:05,B,5
1,2,19:00,A,3
1,2,19:03,B,3
2,6,20:00,A,nan
3,4,14:00,A,nan
This is the current implementation, where I group the records by UID and UID2, then have only a small subset of rows to search to identify whether both EventTypes exist. I can't figure out a faster one, and profiling in PyCharm hasn't helped uncover where the bottleneck is.
for (uid, uid2), group in df.groupby(["UID", "UID2"]):
    # if there is a row for both A and B for a UID, UID2 combo
    if len(group[group["EventType"] == "A"]) > 0 and len(group[group["EventType"] == "B"]) > 0:
        time_a = group.loc[group["EventType"] == "A", "Time"].iloc[0]
        time_b = group.loc[group["EventType"] == "B", "Time"].iloc[0]
        timediff = time_b - time_a
        timediff_min = timediff.components.minutes
        df.loc[(df["UID"] == uid) & (df["UID2"] == uid2), "TimeDiff"] = timediff_min
First, I need to make sure the Time column is a timedelta:
df.Time = pd.to_datetime(df.Time)
df.Time = df.Time - pd.to_datetime(df.Time.dt.date)
After that, I create a helper DataFrame with one column per EventType:
df1 = df.set_index(['UID', 'UID2', 'EventType']).unstack().Time
df1
Finally, I take the difference and merge it back onto df:
df = df.merge((df1.B - df1.A).rename('TimeDiff').reset_index())
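Put together as a runnable sketch on the sample data from the question (assuming Time comes in as HH:MM strings; the conversion to whole minutes at the end is an addition, to match the desired TimeDiff column):
from io import StringIO
import pandas as pd

data = """UID,UID2,Time,EventType
1,1,18:00,A
1,1,18:05,B
1,2,19:00,A
1,2,19:03,B
2,6,20:00,A
3,4,14:00,A
"""
df = pd.read_csv(StringIO(data))

# HH:MM strings -> timedelta since midnight
df.Time = pd.to_datetime(df.Time)
df.Time = df.Time - pd.to_datetime(df.Time.dt.date)

# One column per EventType, indexed by (UID, UID2)
df1 = df.set_index(['UID', 'UID2', 'EventType']).unstack().Time

# B - A per (UID, UID2), merged back; pairs without both events stay NaN
df = df.merge((df1.B - df1.A).rename('TimeDiff').reset_index(), how='left')

# Optional: express the difference in whole minutes, as in the desired output
df['TimeDiff'] = df['TimeDiff'].dt.total_seconds() / 60
print(df)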