How to generate artificial sequential data for machine learning? - python-3.x

Sklearn provides different data generation functions, such as make_blobs and make_regression, in sklearn.datasets.
However, I am not aware of any function that can generate sequential data. Are there any existing libraries that can generate artificial sequential data?

It really depends on what kind of series you want. Check out this repository for generating different kinds of simulated series; it's called TimeSynth.
But if you just want something you can easily modify yourself, try writing a function similar to this:
import numpy as np

def SynthSeries(start, end, stepSize, coefficients):
    # sample points at which the series is evaluated
    samples = np.arange(start, end, stepSize)
    # sum one sine component per frequency coefficient
    array = np.zeros(np.shape(samples))
    for coeff in coefficients:
        array = np.add(array, np.sin(coeff * samples))
    return array, samples
This is, in a sense, the reverse of a Fourier transform: if you know the base frequencies of the series you want to create, you can pass them into this function to reconstruct the signal.
You can use it like this:
import matplotlib.pyplot as plt
(SeqData,samples) = SynthSeries(0,20,0.1,[12,3,1,22])
plt.plot(samples, SeqData)
plt.show()
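If the goal is training data for machine learning, you may also want some noise on top of the clean sinusoids. A minimal sketch of that idea, reusing SynthSeries from above (the noise_std value is an illustrative choice, not part of the original answer):
import numpy as np
import matplotlib.pyplot as plt

SeqData, samples = SynthSeries(0, 20, 0.1, [12, 3, 1, 22])
noise_std = 0.25  # illustrative noise level; tune to taste
noisy = SeqData + np.random.normal(0, noise_std, size=SeqData.shape)

plt.plot(samples, SeqData, label='clean')
plt.plot(samples, noisy, label='noisy', alpha=0.6)
plt.legend()
plt.show()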

Related

Tree based algorithm different behavior with duplicated features

I don't understand why I get three different behaviors depending on the classifier I use, even though I would expect them to behave consistently.
Here is the code to dig into the question:
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_validate
import matplotlib.pyplot as plt
import numpy as np

# load data
wine = datasets.load_wine()
X = wine.data
y = wine.target

# some helper functions
def repeat_feature(X, which=1, times=1):
    # append `times` extra copies of the first `which` columns to X
    return np.hstack([X, np.hstack([X[:, :which]] * times)])

def do_the_job(X, y, clf):
    # mean test score over a 5-fold cross-validation
    return np.mean(cross_validate(clf, X, y, cv=5)['test_score'])

# define the classifiers
clf1 = DecisionTreeClassifier(max_depth=25, random_state=42)
clf2 = RandomForestClassifier(n_estimators=5, random_state=42)
clf3 = LGBMClassifier(n_estimators=5, random_state=42)

# repeat the same feature up to 50 times and test the classifiers
clf1_result = []
clf2_result = []
clf3_result = []
for i in range(1, 50):
    my_x = repeat_feature(X, times=i)
    clf1_result.append(do_the_job(my_x, y, clf1))
    clf2_result.append(do_the_job(my_x, y, clf2))
    clf3_result.append(do_the_job(my_x, y, clf3))

# plot the mean of the cv-scores for each classifier
plt.figure(figsize=(12, 7))
plt.plot(clf1_result, label='tree')
plt.plot(clf2_result, label='forest')
plt.plot(clf3_result, label='boost')
plt.legend()
The result of the previous script is the following graph:
What I want to verify is that by adding the same information (a repeated feature) I would get a decrease in the score, which happens as expected for the random forest.
The question is: why does this not happen with the other two classifiers?
Why do their scores remain stable?
Am I missing something from a theoretical point of view?
Thanks all.
When fitting a single decision tree (sklearn.tree.DecisionTreeClassifier)
or a LightGBM model using its default behavior (lightgbm.LGBMClassifier), the training algorithm considers all features as candidates for every split, and always chooses the split with the best "gain" (reduction in the training loss).
Because of this, adding multiple identical copies of the same feature will not change the fit to the training data.
For random forest, on the other hand, the training algorithm randomly selects a subset of features to consider at each split. The random forest learns to explain the training data by ensembling multiple slightly-different models, and this can be effective because the different models explain different characteristics of the target. If you hold the number of trees and the number of leaves per tree constant, then adding copies of a feature reduces the diversity of the trees in the forest, which reduces the forest's fit to the training data.
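If you want to see this mechanism directly, here is a hedged sketch (not part of the original question; the clf2_all_features name is illustrative) that disables feature subsampling in the random forest with max_features=None, so every split considers all features. Duplicated features should then stop hurting its score, for the same reason as with the single tree and default LightGBM:
# Sketch: a forest that considers every feature at each split
# (max_features=None) should be largely insensitive to duplicated features,
# just like the single decision tree and default LightGBM.
clf2_all_features = RandomForestClassifier(
    n_estimators=5,
    max_features=None,   # consider all features as split candidates
    random_state=42,
)
clf2_all_result = []
for i in range(1, 50):
    my_x = repeat_feature(X, times=i)  # helper from the question
    clf2_all_result.append(do_the_job(my_x, y, clf2_all_features))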

How do I make ray.tune.run reproducible?

I'm using the Tune class-based Trainable API. See the code sample:
from ray import tune
import numpy as np
np.random.seed(42)
# first run
tune.run(tune.Trainable, ...)
# second run, expecting same result
np.random.seed(42)
tune.run(tune.Trainable, ...)
The problem is that the tune.run results are still different, likely because each ray actor still has a different seed.
Question: how do I make ray.tune.run reproducible?
(This answer focuses on the class API and ray version 0.8.7. The function API does not support reproducibility due to implementation specifics.)
There are two main sources of non-deterministic results.
1. Search algorithm
Every search algorithm supports a random seed, although the interface to it may vary. This seed initializes the hyperparameter space sampling.
For example, if you're using AxSearch, it looks like this:
from ax.service.ax_client import AxClient
from ray.tune.suggest.ax import AxSearch
client = AxClient(..., random_seed=42)
client.create_experiment(...)
algo = AxSearch(client)
2. Trainable API
Trainables are distributed among worker processes, so seeding has to happen inside the tune.Trainable class. Depending on the tune.Trainable.train logic that you implement, you need to manually seed numpy, tf, or whatever other framework you use inside tune.Trainable.setup, passing the seed through the config argument of tune.run.
The following example is based on RLLib PR5197, which handled the same issue:
from ray import tune
import numpy as np
import random
class Tuner(tune.Trainable):
    def setup(self, config):
        # seed every framework you rely on inside this worker process
        seed = config['seed']
        np.random.seed(seed)
        random.seed(seed)
        ...

...
seed = 42
tune.run(Tuner, config={'seed': seed})

Create random Numpy array following a given distribution and trend

I want to create data that follows the same distribution and trend as sample data, using numpy.
For example, say I have an array x whose trend is increasing and whose distribution is, say, log-normal. Can I create another random array that follows the same distribution and trend using numpy?
Numpy doesn't have the capability to fit distributions to your data. You can either do it manually using the method you like (MLE or method of moments), or you can use scipy, which can fit distributions to your data as shown below:
import scipy.stats as st
# Inferred parameters of the distribution
s, loc, scale = st.lognorm.fit(x)
# Distribution object
dist = st.lognorm(s, loc, scale)
# generate 1000 random samples
samples = dist.rvs(size=1000)
Scipy uses MLE by default.
You will have to explore your data and look into the distributions that fit the best. Numpy or scipy can't do that for you.
Documentation of fit method: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.rv_continuous.fit.html
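The question also asks about reproducing the trend, which the fit above does not capture on its own. Here is a hedged sketch of one possible way to combine the two, assuming a simple linear trend plus log-normal residuals (the additive-trend assumption and variable names are illustrative, not from the original answer):
import numpy as np
import scipy.stats as st

# x is your sample data, assumed to increase with a roughly linear trend
t = np.arange(len(x))

# 1) estimate the trend with a least-squares line
slope, intercept = np.polyfit(t, x, deg=1)
trend = slope * t + intercept

# 2) fit a log-normal distribution to the detrended residuals
s, loc, scale = st.lognorm.fit(x - trend)

# 3) synthetic series: same trend plus fresh draws from the fitted distribution
synthetic = trend + st.lognorm(s, loc, scale).rvs(size=len(x))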

How to calculate the correlation of all features with the target variable (binary classifier, python 3)?

I want to calculate in python the correlation of all my features (all of float type) and the class label (Binary, 0 or 1). In addition, I would like to plot the data to visualize their distribution by class.
This is needed so I can find features coupled to a single label and find out their real importance. Note that I don't want the pairwise feature correlation and that my classifier is binary.
I have tried the following (from a similar post on Stack Overflow), but it is not exactly what I am looking for.
df.drop("Target", axis=1).apply(lambda x: x.corr(df.Target))
Please see the attached picture for how the distribution looks for one of the features (from Weka).
Class distribution for one of the features
Any feedback is really appreciated.
Correlation is not meant to be used with categorical variables. For more explanation, see here.
You can understand the relationship between your independent variables and the target variable with the following approach.
from sklearn.datasets import load_breast_cancer
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# load a sample binary-classification dataset and keep the first five features
data = load_breast_cancer(return_X_y=False)
df = pd.DataFrame(data.data[:, :5])
df.columns = data.feature_names[:5]
df['target'] = data.target.astype(str)

# pairwise scatter plots plus per-feature histograms, colored by class
g = sns.pairplot(df, hue='target', diag_kind='hist',
                 vars=df.columns[:-1],
                 plot_kws=dict(alpha=0.5),
                 diag_kws=dict(alpha=0.5))
plt.show()

Obtaining distribution of results from LCIA

I would like to obtain the distribution of the impact of ecoinvent processes, using one of the existing impact assessment methods and running Monte Carlo simulations. Is there an example notebook or instructions on how to do this?
Here is the simplest way to do it (for a random activity and method):
from brightway2 import *
import numpy as np
ecoinvent = Database("ecoinvent 3.2 cutoff")
The MonteCarloLCA class derives from the LCA class and is instantiated just like an LCA object.
my_MC = MonteCarloLCA({ecoinvent.random():1}, methods.random())
Say you want to obtain 1000 samples:
iterations = 1000
You can create an empty numpy array to collect the results:
scores = np.zeros([1, iterations])
Then, you calculate scores using next on your object:
for iteration in range(iterations):
    next(my_MC)
    scores[0, iteration] = my_MC.score
In this example, scores will be a numpy array with 1000 elements. You can then analyse this array with whatever statistical modules you are comfortable with.
There are several other Monte Carlo-based classes that allow other functionality. Have a look at the source code; you may find something useful.
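As a quick illustration of that last step, here is a minimal sketch of how one might summarise the scores array with plain numpy and matplotlib (the choice of percentiles and the histogram are just examples, not part of the original answer):
import numpy as np
import matplotlib.pyplot as plt

results = scores[0, :]  # the 1000 Monte Carlo scores collected above

# simple summary statistics
print("mean:  ", np.mean(results))
print("median:", np.median(results))
print("2.5% / 97.5% percentiles:", np.percentile(results, [2.5, 97.5]))

# visualise the distribution of impact scores
plt.hist(results, bins=50)
plt.xlabel("LCIA score")
plt.ylabel("frequency")
plt.show()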
