I would like the distribution for the impact of ecoinvent processes, using one of the existing impact assessment methods, doing Monte Carlo simulations. Is there an example notebook or instructions on doing this?
Here is the simplest way to do it (for a random activity and method):
from brightway2 import *
import numpy as np
ecoinvent = Database("ecoinvent 3.2 cutoff")
The MonteCarloLCA class derives from the LCA class and is instantiated just like an LCA object.
my_MC = MonteCarloLCA({ecoinvent.random():1}, methods.random())
Say you want to obtain 1000 samples:
iterations = 1000
You can create an empty numpy array to collect the results:
scores = np.zeros([1, iterations])
Then, you calculate scores using next on your object:
for iteration in range(iterations):
    next(my_MC)
    scores[0, iteration] = my_MC.score
In this example, scores will be a numpy array with 1000 elements. You can then analyse this array with whatever statistical tools you are comfortable with.
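For instance, a minimal sketch of summarising the distribution with plain numpy (using the scores array from above):
# summary statistics of the Monte Carlo score distribution
print("mean:", np.mean(scores))
print("median:", np.median(scores))
print("2.5th / 97.5th percentiles:", np.percentile(scores, [2.5, 97.5]))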
There are several other Monte Carlo based classes that provide additional functionality. Have a look at the source code; you may find something useful.
I don't understand why I see three different behaviours depending on the classifier I use, even though I expected them to behave similarly.
Here is the code to explore the question in depth:
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_validate
import matplotlib.pyplot as plt
import numpy as np
#load data
wine = datasets.load_wine()
X = wine.data
y = wine.target
# some helper functions
def repeat_feature(X, which=1, times=1):
    return np.hstack([X, np.hstack([X[:, :which]] * times)])

def do_the_job(X, y, clf):
    return np.mean(cross_validate(clf, X, y, cv=5)['test_score'])
# define the classifiers
clf1 = DecisionTreeClassifier(max_depth=25, random_state=42)
clf2 = RandomForestClassifier(n_estimators=5, random_state=42)
clf3 = LGBMClassifier(n_estimators=5, random_state=42)
# repeat the same feature up to 49 times and test the classifiers
clf1_result = []
clf2_result = []
clf3_result = []
for i in range(1, 50):
    my_x = repeat_feature(X, times=i)
    clf1_result.append(do_the_job(my_x, y, clf1))
    clf2_result.append(do_the_job(my_x, y, clf2))
    clf3_result.append(do_the_job(my_x, y, clf3))
# plot the mean of the cv-scores for each classifier
plt.figure(figsize=(12,7))
plt.plot(clf1_result,label='tree')
plt.plot(clf2_result,label='forest')
plt.plot(clf3_result,label='boost')
plt.legend()
The result of the previous script is the following graph:
What I want to verify is that adding redundant information (such as a repeated feature) decreases the score, which happens as expected for the random forest.
Why does this not happen with the other two classifiers?
Why do their scores remain stable?
Am I missing something from a theoretical point of view?
Thanks, all.
When fitting a single decision tree (sklearn.tree.DecisionTreeClassifier)
or a LightGBM model using its default behavior (lightgbm.LGBMClassifier), the training algorithm considers all features as candidates for every split, and always chooses the split with the best "gain" (reduction in the training loss).
Because of this, adding multiple identical copies of the same feature will not change the fit to the training data.
For random forest, on the other hand, the training algorithm randomly selects a subset of features to consider at each split. The random forest learns how to explain the training data by ensembling together multiple slightly-different models, and this can be effective because the different models explain different characteristics of the target. If you hold the number of trees + the number of leaves per tree constant, then adding copies of a feature reduces the diversity of the trees in the forest, which reduces the forest's fit to the training data.
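One way to check this explanation (a sketch, not part of the original experiment) is to disable feature subsampling in the forest with max_features=None, so that every split considers all features just like the single tree and the boosted model; the duplicated columns should then no longer dilute the candidate pool, and the score curve should flatten out:
# sketch: forest with no feature subsampling, reusing the helpers defined above
clf2_all = RandomForestClassifier(n_estimators=5, max_features=None, random_state=42)
clf2_all_result = [do_the_job(repeat_feature(X, times=i), y, clf2_all) for i in range(1, 50)]
plt.plot(clf2_all_result, label='forest, max_features=None')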
I'm using the Tune class-based Trainable API. See the code sample:
from ray import tune
import numpy as np
np.random.seed(42)
# first run
tune.run(tune.Trainable, ...)
# second run, expecting same result
np.random.seed(42)
tune.run(tune.Trainable, ...)
The problem is that the tune.run results are still different; the likely reason is that each Ray actor still has a different seed.
Question: how do I make ray.tune.run reproducible?
(This answer focuses on the class API and Ray version 0.8.7. The function API does not support reproducibility due to implementation specifics.)
There are two main sources of nondeterministic results.
1. Search algorithm
Every search algorithm supports a random seed, although the interface to it may vary. This seed initializes the hyperparameter space sampling.
For example, if you're using AxSearch, it looks like this:
from ax.service.ax_client import AxClient
from ray.tune.suggest.ax import AxSearch
client = AxClient(..., random_seed=42)
client.create_experiment(...)
algo = AxSearch(client)
2. Trainable API
This is distributed among worker processes, which requires seeding within the tune.Trainable class. Depending on the tune.Trainable.train logic that you implement, you need to manually seed numpy, tf, or whatever other framework you use inside tune.Trainable.setup, passing the seed through the config argument of tune.run.
The following code is based on RLlib PR 5197, which handled the same issue. See the example:
from ray import tune
import numpy as np
import random
class Tuner(tune.Trainable):
    def setup(self, config):
        seed = config['seed']
        np.random.seed(seed)
        random.seed(seed)
        ...

...

seed = 42
tune.run(Tuner, config={'seed': seed})
Sklearn provides various data generation functions, such as make_blobs and make_regression, in sklearn.datasets.
However, I am not aware of any functions that can generate sequential data. Are there any existing libraries that can generate artificial sequential data?
It really depends on what kind of series you want. Check out this repository for generating different kinds of simulated series; it's called TimeSynth.
But if you just want something you can easily modify yourself, try writing a function similar to this:
import numpy as np

def SynthSeries(start, end, stepSize, coefficients):
    # sample points along the time axis
    samples = np.arange(start, end, stepSize)
    # sum one sinusoid per coefficient (the coefficient sets the frequency)
    array = np.zeros(np.shape(samples))
    for coeff in coefficients:
        array = np.add(array, np.sin(coeff * samples))
    return array, samples
This is sort of a reverse Fourier transform: if you know the base frequencies of the series you want to create, you can pass them into this function to recreate the signal.
You can use it like this:
import matplotlib.pyplot as plt
SeqData, samples = SynthSeries(0, 20, 0.1, [12, 3, 1, 22])
plt.plot(samples, SeqData)
plt.show()
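If you want the series to look less clean, a simple extension (just a sketch) is to add Gaussian noise on top of the summed sinusoids:
# add white noise to the synthetic signal
noisy = SeqData + np.random.normal(scale=0.5, size=SeqData.shape)
plt.plot(samples, noisy)
plt.show()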
Is there a parameter in sklearn that can be tweaked to run a random forest (or other estimator) multiple times to smooth out variation between runs? What's the simplest way to do this?
You can't simply smooth out the variation between runs manually. What you can do is perform hyperparameter tuning using GridSearchCV (or look at other similar methods at this link), as sketched below. You can also look at cross-validating your dataset to better assess the performance of your estimator; have a look at the cross-validation methods in Sklearn.
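For instance, a minimal GridSearchCV sketch for a random forest (the parameter grid here is purely illustrative):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [100, 300], 'max_depth': [None, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)  # X, y: your training data
print(search.best_params_, search.best_score_)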
Also, please provide more information about your problem, such as the type of problem you are solving, the dataset, etc., so that we can help you better.
VotingClassifier with soft voting may be what you are looking for. In general, given several sets of predictions, you can take their geometric mean to smooth them out.
from scipy.stats.mstats import gmean
import pandas as pd

df = pd.DataFrame()
# predictions renamed to 1.csv, 2.csv, ... for convenience
for i in range(1, 4):
    data = pd.read_csv('{}.csv'.format(i), index_col='id')
    data = data.rename(columns={'proba': i})
    df = pd.concat([df, data], axis=1)

# geometric mean across the three prediction columns
df['proba'] = gmean(df[[1, 2, 3]], axis=1)
output = pd.DataFrame(data={'id': df.index, 'proba': df.proba})
output.to_csv('submissions.csv', index=False)
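If you would rather stay inside sklearn, here is a minimal soft-voting sketch (the estimator names and parameters are illustrative); VotingClassifier with voting='soft' averages the predicted probabilities of its members:
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

# three forests that differ only in their random seed; soft voting averages their predicted probabilities
forests = [('rf%d' % i, RandomForestClassifier(n_estimators=100, random_state=i)) for i in range(3)]
ensemble = VotingClassifier(estimators=forests, voting='soft')
ensemble.fit(X_train, y_train)  # X_train, y_train: your training data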
I want to make a 2D binary array (n_samples, n_features), where each sample is a text string and each feature is a word (unigram).
The problem is that the number of samples is 350,000 and the number of features is 40,000, but I only have 4 GB of RAM.
I get a memory error when using CountVectorizer. So, is there any other way (like mini-batches) to do this?
If I use HashingVectorizer, how do I get the feature names, i.e. which column corresponds to which feature? The get_feature_names() method is not available in HashingVectorizer.
To get feature names for HashingVectorizer, you can take a random sample of documents, compute hashes for them, and learn which hash corresponds to which tokens that way. It is not perfect, because there can be other tokens which correspond to a given column and there can be collisions, but often this is enough to inspect the vectorization result (or, e.g., the coefficients of a linear classifier which uses hashing features).
A shameless plug: the https://github.com/TeamHG-Memex/eli5 package has this implemented:
from eli5.sklearn import InvertableHashingVectorizer
# vec should be a HashingVectorizer instance
ivec = InvertableHashingVectorizer(vec)
ivec.fit(docs_sample)  # e.g. every 10th or 100th document
names = ivec.get_feature_names()
See also: Debugging Hashing Vectorizer section in eli5 docs.
Mini-batches are not supported in CountVectorizer. However, sklearn's HashingVectorizer has partial_fit(), which you can use.
Quoting the sklearn documentation: "There is no way to compute the inverse transform (from feature indices to string feature names), which can be a problem when trying to introspect which features are most important to a model."
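For example, a rough out-of-core sketch combining HashingVectorizer with SGDClassifier.partial_fit (the batch size and the texts and y variables are placeholders for your own data):
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
import numpy as np

vec = HashingVectorizer(n_features=2**18, binary=True)  # binary unigram features
clf = SGDClassifier()
classes = np.unique(y)  # y: full label array; texts: list of documents

batch_size = 10000
for start in range(0, len(texts), batch_size):
    X_batch = vec.transform(texts[start:start + batch_size])  # no fitting needed, hashing is stateless
    clf.partial_fit(X_batch, y[start:start + batch_size], classes=classes)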