pandas sampling from series based on distribution - python-3.x

I currently have this df of strings (unique res values below) and a distribution p = [0.5, 0.33, 0.12, 0.05]
vid res
v1 '1072X1920'
v2 '240X416'
v3 '360X640'
v4 '720X1280'
The series is about 5000+ rows and I need to sample 3000 videos with the above distribution. I know I can do this by splitting the df into 4 parts, one for each res, and calling df.sample(n=int(p[i] * 3000)) on each, like
df1072 = df[df['res'] == '1072X1920']
df1072 = df1072.sample(int(0.5 * 3000))
but is there a better way to do this? If I have 10 unique res values I would need to create 10 DataFrames in memory, which doesn't scale well. I was thinking np.random.choice() could help, but I'm not sure at the moment.

For example, use sample to put your df in random order, then use np.split:
df = pd.DataFrame({'A': np.arange(100)})
n = len(df)
df = df.sample(n)  # shuffle all rows
l = np.split(df, [int(0.5*n), int(0.83*n), int(0.95*n)])  # cumulative split points
Test:
list(map(len,l))
Out[1134]: [50, 33, 12, 5]
pd.concat(l).duplicated().any()
Out[1135]: False
For your example you may need a groupby for loop:
d = {}
for y, x in df.groupby('res'):
    n = len(x)
    x = x.sample(n)  # shuffle the group
    l = np.split(x, [int(0.5*n), int(0.83*n), int(0.95*n)])
    d[y] = l  # a dict has no append; store the splits under the res key
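If the goal is just to draw p[i] * 3000 rows per res value, a single pass over df.groupby('res') avoids materialising one DataFrame per resolution. A minimal sketch, assuming the column is named res; the p dict of per-resolution shares below is illustrative:
import pandas as pd

p = {'1072X1920': 0.5, '240X416': 0.33, '360X640': 0.12, '720X1280': 0.05}  # hypothetical shares
total = 3000

# sample each group once, sized by its share, then stitch the pieces back together
sampled = pd.concat(
    group.sample(n=int(p[res] * total), random_state=0)
    for res, group in df.groupby('res')
)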

Related

Get all elementary flows generated by an activity in Brightway

I would like to access all elementary flows generated by an activity in Brightway, in a table gathering the flows and the amounts.
Let's assume a random activity:
lca = bw.LCA({random_act: 2761}, method)
lca.lci()
lca.lcia()
lca.inventory
I have tried several ways but none of them works:
I have tried to export my LCI with brightway2-io, but some errors appear that I cannot solve:
bw2io.export.excel.lci_matrices_to_excel(db_name) returns an error when computing the biosphere matrix data for a specific row:
--> 120 bm_sheet.write_number(bio_lookup[row] + 1, act_lookup[col] + 1, value)
122 COLUMNS = (
123 u"Index",
124 u"Name",
(...)
128 u"Location",
129 )
131 tech_sheet = workbook.add_worksheet("technosphere-labels")
KeyError: 1757
I also tried to manually get the amount of a specific elementary flow. For example, let's say I want to compute the total amount of Aluminium needed for the activity. To do so, I tried this:
flow_Al=Database("biosphere3").search("Aluminium, in ground")
(I only want the resource Aluminium that is extracted as an ore, from the ground)
amount_Al = 0
row = lca.biosphere_dict[flow_Al]
col_indices = lca.biosphere_matrix[row, :].tocoo()
amount_consumers_lca = [lca.inventory[row, index] for index in col_indices.col]
for j in amount_consumers_lca:
    amount_Al = amount_Al + j
amount_Al
This works, but the final amount is too low and probably isn't what I'm looking for...
How can I solve this?
Thank you
This will work on Brightway 2 and 2.5:
import pandas as pd
import bw2data as bd
import warnings
def create_inventory_dataframe(lca, cutoff=None):
    array = lca.inventory.sum(axis=1)
    if cutoff is not None and not (0 < cutoff < 1):
        warnings.warn(f"Ignoring invalid cutoff value {cutoff}")
        cutoff = None
    total = array.sum()
    include = lambda x: abs(x / total) >= cutoff if cutoff is not None else True
    if hasattr(lca, 'dicts'):
        mapping = lca.dicts.biosphere   # Brightway 2.5
    else:
        mapping = lca.biosphere_dict    # Brightway 2
    data = []
    for key, row in mapping.items():
        amount = array[row, 0]
        if include(amount):
            data.append((bd.get_activity(key), row, amount))
    data.sort(key=lambda x: abs(x[2]))
    return pd.DataFrame([{
        'row_index': row,
        'amount': amount,
        'name': flow.get('name'),
        'unit': flow.get('unit'),
        'categories': str(flow.get('categories'))
    } for flow, row, amount in data])
The cutoff doesn't make sense for the inventory database, but it can be adapted for the LCIA result (characterized_inventory) as well.
Once you have a pandas DataFrame you can filter or export easily.
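A minimal usage sketch, assuming the lca object from the question has already had lci() run; the 'Aluminium' filter and the output file name are just illustrations:
inventory_df = create_inventory_dataframe(lca)

# normal pandas operations then work for filtering or export
aluminium_rows = inventory_df[inventory_df['name'].str.contains('Aluminium', case=False, na=False)]
inventory_df.to_excel('inventory.xlsx', index=False)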

pandas groupby performance / combine 2 functions

I am learning python and trying to understand the best practices of data queries.
Here is some dummy data (customer sales) to test
import pandas as pd
df = pd.DataFrame({'Name':['tom', 'bob', 'bob', 'jack', 'jack', 'jack'],'Amount':[3, 2, 5, 1, 10, 100], 'Date':["01.02.2022", "02.02.2022", "03.02.2022", "01.02.2022", "03.02.2022", "05.02.2022"]})
df.Date = pd.to_datetime(df.Date, format='%d.%m.%Y')
I want to investigate 2 kinds of queries:
1. How long has a person been our customer?
2. What is the period between the first and last purchase?
How can I run the first query without writing loops manually?
What I have done so far for the second part is this
result = df.groupby("Name").max() - df.groupby("Name").min()
Is it possible to combine these two groupby queries into one to improve the performance?
P.S. I am trying to understand pandas and key concepts how to optimize queries. Different approaches and explanations are highly appreciated.
You can use GroupBy.agg with a custom function to get the difference between the max and min date.
df.groupby('Name')['Date'].agg(lambda x: x.max()-x.min())
As you already have datetime type, this will nicely yield a Timedelta object, which by default is shown as a string in the form 'x days'.
You can also save the GroupBy object in a variable and reuse it. This way, computation of the groups occurs only once:
g = df.groupby("Name")['Date']
g.max() - g.min()
output:
Name
bob 1 days
jack 4 days
tom 0 days
Name: Date, dtype: timedelta64[ns]
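If you want the first date, the last date, and the span in one grouping pass, named aggregation on the same GroupBy object also works. A minimal sketch reusing df from the question; the result column names first, last, and tenure are just illustrative:
g = df.groupby('Name')['Date']
summary = g.agg(first='min', last='max')  # one pass over the groups
summary['tenure'] = summary['last'] - summary['first']
print(summary)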

create MultiIndex columns based on "lookup"

I'd like to take an existing DataFrame with a single level of columns and modify it to use a MultiIndex based on a reference list of tuples and have the proper ordering/alignment. To illustrate by example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10,5), columns = ['nyc','london','canada','chile','earth'])
coltuples = [('cities','nyc'),('countries','canada'),('countries','usa'),('countries','chile'),('planets','earth'),('planets','mars'),('cities','sf'),('cities','london')]
I'd like to create a new DataFrame which has a top level consisting of 'cities', 'countries', and 'planets' with the corresponding original columns underneath. I am not concerned about order but definitely proper alignment.
It can be assumed that 'coltuples' will not be missing any of the columns from 'df', but may have extraneous pairs, and the ordering of the pairs can be random.
I am trying something along the lines of:
coltuplesuse = [x for x in coltuples if x[1] in df.columns]
cols = pd.MultiIndex.from_tuples(coltuplesuse, names=['level1','level2'])
df.reindex(columns=cols)
which seems to be on the right track, but the underlying data in the resulting DataFrame is all NaN.
Thanks in advance!
Two things to notice: you want set_axis rather than reindex, and sorting the tuples by the original column order ensures each label is assigned to the correct column (this is done in the sorted(..., key=...) bit).
use_cols = [tup for tup in coltuples if tup[1] in df.columns]
use_cols = sorted(use_cols, key=lambda x: list(df.columns).index(x[1]))
multi_index = pd.MultiIndex.from_tuples(use_cols, names=['level1', 'level2'])
df.set_axis(multi_index, axis=1)
output:
level1 cities countries planets
level2 nyc london canada chile earth
0 0.028033 0.540977 -0.056096 1.675698 -0.328630
1 1.170465 -1.003825 0.882126 0.453294 -1.127752
2 -0.187466 -0.192546 0.269802 -1.225172 -0.548491
3 2.272900 -0.085427 0.029242 -2.258696 1.034485
4 -1.243871 -1.660432 -0.051674 2.098602 -2.098941
5 -0.820820 -0.289754 0.019348 0.176778 0.395959
6 1.346459 -0.260583 0.212008 -1.071501 0.945545
7 0.673351 1.133616 1.117379 -0.531403 1.467604
8 0.332187 -3.541103 -0.222365 1.035739 -0.485742
9 -0.605965 -1.442371 -1.628210 -0.711887 -2.104755
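A brief note on why the original reindex attempt came back as all NaN: reindex aligns on the existing flat column labels, and none of them equal the new (level1, level2) tuples, so every column is treated as missing. set_axis just relabels the existing columns, and it returns a new DataFrame, so remember to assign the result (a minimal sketch):
df = df.set_axis(multi_index, axis=1)
# equivalent low-tech alternative: df.columns = multi_index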

Group two Dataframes with MultiIndex columns

I have two dataframes and I would like to create a new one, which will include all the unique columns of the two source dataframes and the aggregation of the common columns.
These are two samples:
And this is the result:
All the column indexes should match in order to be aggregated.
I have written the following code:
df_all = pd.DataFrame
for dfColumn in df_1:
    if dfColumn in df_2.columns:
        df_all[dfColumn] = df_1.loc[:, dfColumn].add(df_2.loc[:, dfColumn])
    else:
        df_all[dfColumn] = df_1[dfColumn]
for dfColumn in df_2:
    if dfColumn not in df_all.columns:
        df_all[dfColumn] = df_2[dfColumn]
However, I get an error on the following line:
df_all[dfColumn] = df_1.loc[:, dfColumn].add(df_2.loc[:, dfColumn])
when I am trying to assign the value to df_all[dfColumn]
All the different possibilities you have in Python drive me crazy, but I cannot find one that makes this work.
Thanks for your help and time.
Actually, I fixed it with just the following:
df_all = pd.concat([df_1, df_2], axis=1)
df_all = df_all.groupby(level=[0, 1, 2], axis=1).sum()
Is there a way to replace level=[0, 1, 2] with something like level=df_all.columns.levels ?
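columns.levels holds the level values themselves, not positions, so it cannot be passed to groupby directly; one way to avoid hard-coding [0, 1, 2] is to build the position list from nlevels. A minimal sketch (keeping axis=1 as in the code above, although newer pandas versions deprecate it):
levels = list(range(df_all.columns.nlevels))  # [0, 1, 2] for a 3-level column index
df_all = df_all.groupby(level=levels, axis=1).sum()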

How to use list comprehension in pandas to create a new series for a plot?

This is by far the most difficult problem I have faced. I am trying to create plots indexed on ratetype: for example, a matrix of the unique ratetypes against the average customer number for each ratetype, built efficiently. Writing an expression that selects the rows for each individual ratetype, takes the average customer number for that type, and then builds a series from those two equal-length lists is way over my head in pandas.
The number of different ratetypes can be in the hundreds, and the list will only grow, so computing it programmatically is clearly a better choice than hard-coding each possibility.
""" a section of the data for example use. Working with column "Ratetype"
column "NumberofCustomers" to work towards getting something like
list1 = unique occurs of ratetypes
list2 = avg number of customers for each ratetype
rt =['fixed','variable',..]
avg_cust_numbers = [45.3,23.1,...]
**basically for each ratetype: get mean of all row data for custno column**
ratetype,numberofcustomers
fixed,1232
variable, 1100
vec, 199
ind, 1211
alg, 123
bfd, 788
csv, 129
ggg, 1100
aaa, 566
acc, 439
"""
df[['ratetype', 'numberofcustomers']]
fixed = df.loc[df['ratetype'] == 'fixed']
avg_fixed_custno = fixed['numberofcustomers'].mean()
rt_counts = df.ratetype.value_counts()
rt_uniques = df.ratetype.unique()
# rt_uniques would be same size vector as avg_cust_nos, has to be anyway
avg_cust_nos = [avg_fixed_custno, avg_variable_custno]
My goal is to create and plot these subplots using matplot.pyplot.
data = {'ratetypes': pd.Series(rt_counts, index=rt_uniques),
'Avg_cust_numbers': pd.Series(avg_cust_nos, index=rt_uniques),
}
df = pd.DataFrame(data)
df = df.sort_values(by=['ratetypes'], ascending=False)
fig, axes = plt.subplots(nrows=2, ncols=1)
for i, c in enumerate(df.columns):
    df[c].plot(kind='bar', ax=axes[i], figsize=(12, 10), title=c)
plt.savefig('custno_byrate.png', bbox_inches='tight')
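No answer is attached to this question here, but the usual pandas approach is a single groupby pass over ratetype. A minimal sketch, assuming the columns are named ratetype and numberofcustomers as in the sample data; the summary column names mirror the ones used above:
import matplotlib.pyplot as plt
import pandas as pd

# one grouping pass gives the per-ratetype count and the average customer number
summary = df.groupby('ratetype')['numberofcustomers'].agg(ratetypes='count',
                                                          Avg_cust_numbers='mean')
summary = summary.sort_values(by='ratetypes', ascending=False)

fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(12, 10))
for i, c in enumerate(summary.columns):
    summary[c].plot(kind='bar', ax=axes[i], title=c)
plt.savefig('custno_byrate.png', bbox_inches='tight')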
