I'm computing learning curves for random forests using sklearn. I need to do this for a lot of different RFs, so I want to use a cluster and Dask to reduce the time of the RF fits.
Currently I implemented the following algorithm:
from sklearn.externals import joblib
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ShuffleSplit, learning_curve
from dask.distributed import Client, LocalCluster

worker_kwargs = dict(memory_limit="2GB", ncores=4)
cluster = LocalCluster(n_workers=4, threads_per_worker=2, **worker_kwargs)  # processes=False?
client = Client(cluster)

X, Y = ..., ...
estimator = RandomForestRegressor(n_jobs=-1, **rf_params)
cv = ShuffleSplit(n_splits=5, test_size=0.2)
train_sizes = [...]  # 20 different values

with joblib.parallel_backend('dask', scatter=[X, Y]):
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, Y, cv=cv, n_jobs=-1, train_sizes=train_sizes)
There are 2 levels of parallelism here:
One for the fitting of an RF (n_jobs=-1)
One for the loop over all the training-set sizes (n_jobs=-1)
My problem is: if the backend is loky, then it takes around 23s.
[Parallel(n_jobs=-1)]: Done 50 out of 50 | elapsed: 22.8s finished
Now, if the backend is dask, then it takes more time:
[Parallel(n_jobs=-1)]: Done 50 out of 50 | elapsed: 30.3s finished
I know that Dask introduces overhead, but I wouldn't expect that to explain all of the difference in running time.
Dask is being developed quickly, and I find a lot of different ways to do the same thing without knowing which one is up to date.
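For reference, here is a minimal sketch (reusing estimator, X, Y, cv and train_sizes from the snippet above; the sketch itself is an assumption, not part of the original setup) that keeps only the outer level of parallelism, to check how much of the gap comes from the inner n_jobs=-1 oversubscribing the 4 Dask workers:
# Hedged diagnostic sketch: disable the inner RF parallelism so that only
# learning_curve's n_jobs=-1 distributes work to the Dask workers.
single_threaded_rf = RandomForestRegressor(n_jobs=1, **rf_params)
with joblib.parallel_backend('dask', scatter=[X, Y]):
    sizes, train_scores, test_scores = learning_curve(
        single_threaded_rf, X, Y, cv=cv, n_jobs=-1, train_sizes=train_sizes)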
I want to pre-train a T5 model using huggingface. The first step is training the tokenizer with this code:
import datasets
from t5_tokenizer_model import SentencePieceUnigramTokenizer

vocab_size = 32_000
input_sentence_size = None

# Initialize a dataset
dataset = datasets.load_dataset("oscar", name="unshuffled_deduplicated_fa", split="train")
tokenizer = SentencePieceUnigramTokenizer(unk_token="<unk>", eos_token="</s>", pad_token="<pad>")

# Build an iterator over this dataset
def batch_iterator(input_sentence_size=None):
    if input_sentence_size is None:
        input_sentence_size = len(dataset)
    batch_length = 100
    for i in range(0, input_sentence_size, batch_length):
        yield dataset[i: i + batch_length]["text"]

# Train tokenizer
tokenizer.train_from_iterator(
    iterator=batch_iterator(input_sentence_size=input_sentence_size),
    vocab_size=vocab_size,
    show_progress=True,
)

# Save files to disk
tokenizer.save("./persian-t5-base/tokenizer.json")
For the downloading part the message is:
Downloading and preparing dataset oscar/unshuffled_deduplicated_fa (download: 9.74 GiB, generated: 37.24 GiB, post-processed: Unknown size, total: 46.98 GiB) to /root/.cache/huggingface/datasets/oscar/unshuffled_deduplicated_fa/1.0.0/...
I am running it on Google Colab Pro (with the High-RAM setting and on TPU). However, it has been about 2 hours and execution is still stuck on load_dataset.
What is it doing? Is it normal for load_dataset to take so much time? Should I interrupt it and run it again?
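The download banner above already reports roughly 47 GiB once the split is prepared, so the load step is heavy. If only batches of text are needed for tokenizer training, one possible alternative, sketched here under the assumption that the installed datasets version supports streaming, is to stream the split instead of materializing it on disk:
import datasets

# Hedged sketch: stream OSCAR instead of downloading and preparing the full split.
streamed = datasets.load_dataset(
    "oscar", name="unshuffled_deduplicated_fa", split="train", streaming=True)

def streaming_batch_iterator(batch_length=100):
    batch = []
    for example in streamed:  # examples arrive lazily, no full local copy
        batch.append(example["text"])
        if len(batch) == batch_length:
            yield batch
            batch = []
    if batch:
        yield batch

tokenizer.train_from_iterator(
    iterator=streaming_batch_iterator(),
    vocab_size=vocab_size,
    show_progress=True,
)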
I'm learning Python and I want to perform a simple linear regression on a .csv dataset. I've successfully imported the data file. I have data for 8 five-year periods; how would I do a simple linear regression on it? The data is by county/state, so my headers are county, state, 1980, 1985, etc. Appreciate any help.
Please specify what target label you have in mind.
Anyway, use the sklearn library and pandas.
import pandas as pd
val = pd.DataFrame(your.data)
val.columns = your.headers
Assuming you have a target header called "Price".
from sklearn.linear_model import LinearRegression
X = val.drop('Price', axis=1)
y = val['Price']
X contains the features on which LR will be performed, and y contains the target.
Now create a Linear Regression object.
lm = LinearRegression()
Start the fitting:
lm.fit(X, y)
Predict your targets:
lm.predict(X)
That's it.
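For the original county/state question, here is a minimal sketch under assumed names (the file name your_data.csv and the exact year headers are hypothetical) that reshapes the wide year columns into long format and regresses the values on the year:
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical file and layout: columns are county, state, then one column per year.
df = pd.read_csv('your_data.csv')
year_cols = [c for c in df.columns if c not in ('county', 'state')]

# Reshape to long format: one row per (county, state, year, value).
long_df = df.melt(id_vars=['county', 'state'], value_vars=year_cols,
                  var_name='year', value_name='value')
long_df['year'] = long_df['year'].astype(int)

# Simple linear regression of value on year, pooled across all counties.
lm = LinearRegression()
lm.fit(long_df[['year']], long_df['value'])
print(lm.coef_[0], lm.intercept_)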
Almost all real-world problems that you are going to encounter will have more than two variables, so let's skip the basic linear regression example. Linear regression involving multiple variables is called "multiple linear regression". The steps to perform multiple linear regression are almost the same as for simple linear regression; the difference lies in the evaluation. You can use it to find out which factor has the highest impact on the predicted output and how different variables relate to each other.
In this section we will use multiple linear regression to predict the gas consumption (in millions of gallons) in 48 US states based on gas taxes (in cents), per capita income (in dollars), paved highways (in miles) and the proportion of the population that has a driver's license.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# %matplotlib inline
dataset = pd.read_csv('C:/your_Path_here/petrol_consumption.csv')
dataset.head()
dataset.describe()
Result:
Index ... Consumption_In_Millions
count 48.00 ... 48.000000
mean 24.50 ... 576.770833
std 14.00 ... 111.885816
min 1.00 ... 344.000000
25% 12.75 ... 509.500000
50% 24.50 ... 568.500000
75% 36.25 ... 632.750000
max 48.00 ... 968.000000
Preparing the Data
The next step is to divide the data into attributes and labels as we did previously. However, unlike last time, this time around we are going to use column names for creating an attribute set and label. Execute the following script:
X = dataset[['Petrol_Tax', 'Average_Income', 'Paved_Highways', 'ProportionWithLicenses']]
y = dataset['Consumption_In_Millions']
Execute the following code to divide our data into training and test sets:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
Training the Algorithm
And finally, to train the algorithm we execute the same code as before, using the fit() method of the LinearRegression class:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
As said earlier, in the case of multivariable linear regression, the regression model has to find the optimal coefficients for all the attributes. To see what coefficients our regression model has chosen, execute the following script:
coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])
coeff_df
Result:
Coefficient
Petrol_Tax -40.016660
Average_Income -0.065413
Paved_Highways -0.004741
ProportionWithLicenses 1341.862121
This means that for a unit increase in "Petrol_Tax", there is a decrease of about 40.02 million gallons in gas consumption. Similarly, a unit increase in the proportion of the population with a driver's license results in an increase of about 1.342 billion gallons of gas consumption. We can see that "Average_Income" and "Paved_Highways" have very little effect on gas consumption.
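As a quick check of this interpretation, the per-unit effect of each feature can be read directly off the fitted coefficients (a small sketch reusing regressor and X from above):
# The coefficient for Petrol_Tax is the expected change in consumption
# (millions of gallons) for a one-unit increase, holding the other features fixed.
petrol_tax_coef = regressor.coef_[list(X.columns).index('Petrol_Tax')]
print(petrol_tax_coef)  # approximately -40.02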
Making Predictions
To make predictions on the test data, execute the following script:
y_pred = regressor.predict(X_test)
To compare the actual output values for X_test with the predicted values, execute the following script:
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df
Result:
Actual Predicted
29 534 469.391989
4 410 545.645464
26 577 589.668394
30 571 569.730413
32 577 649.774809
37 704 646.631164
34 487 511.608148
40 587 672.475177
7 467 502.074782
10 580 501.270734
Evaluating the Algorithm
The final step is to evaluate the performance of the algorithm. We'll do this by finding the values for MAE, MSE and RMSE. Execute the following script:
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Results:
Mean Absolute Error: 56.822247479
Mean Squared Error: 4666.34478759
Root Mean Squared Error: 68.3106491522
You can see that the value of the root mean squared error is 68.31, which is slightly greater than 10% of the mean value of the gas consumption in all states (576.77). This means that our algorithm was not very accurate but can still make reasonably good predictions.
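To make that comparison explicit, the RMSE can be expressed as a percentage of the mean consumption (a small sketch reusing y, y_pred and the metrics import from above):
rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
print(rmse / y.mean() * 100)  # roughly 12% of the mean consumption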
There are many factors that may have contributed to this inaccuracy, a few of which are listed here:
1. Need more data: Only one year's worth of data isn't much, whereas having multiple years' worth could have helped improve the accuracy quite a bit.
2. Bad assumptions: We made the assumption that this data has a linear relationship, but that might not be the case. Visualizing the data may help you determine that.
3. Poor features: The features we used may not have had a high enough correlation to the values we were trying to predict.
Note: the data set used in this example is available from here.
http://people.sc.fsu.edu/~jburkardt/datasets/regression/x16.txt
Finally, see the two links below for more info on this topic.
https://stackabuse.com/linear-regression-in-python-with-scikit-learn/
https://jakevdp.github.io/PythonDataScienceHandbook/05.06-linear-regression.html
Dear TensorFlow Community,
I'm training a model with tf.contrib.factorization.KMeansClustering,
but the training goes really slowly and only uses 1% of my GPU.
However, my 4 CPU cores are hitting about 35% use constantly.
Is it the case that K-Means is written more for the CPU than the GPU?
Is there a way I can shift more of the computation to the GPU, or some
other approach to speed up training?
Below is my script for training (Python3).
Thank you for your time.
import tensorflow as tf
def parser(record):
    features = {
        'feats': tf.FixedLenFeature([], tf.string),
    }
    parsed = tf.parse_single_example(record, features)
    feats = tf.convert_to_tensor(tf.decode_raw(parsed['feats'], tf.float64))
    return {'feats': feats}

def my_input_fn(tfrecords_path):
    dataset = (
        tf.data.TFRecordDataset(tfrecords_path)
        .map(parser)
        .batch(1024)
    )
    iterator = dataset.make_one_shot_iterator()
    batch_feats = iterator.get_next()
    return batch_feats
### SPEC FUNCTIONS ###
train_spec_kmeans = tf.estimator.TrainSpec(input_fn=lambda: my_input_fn('/home/ubuntu/train.tfrecords'), max_steps=10000)
eval_spec_kmeans = tf.estimator.EvalSpec(input_fn=lambda: my_input_fn('/home/ubuntu/eval.tfrecords'))
### INIT ESTIMATOR ###
KMeansEstimator = tf.contrib.factorization.KMeansClustering(
    num_clusters=500,
    feature_columns=[tf.feature_column.numeric_column(
        key='feats',
        dtype=tf.float64,
        shape=(377,),
    )],
    use_mini_batch=True)
### TRAIN & EVAL ###
tf.estimator.train_and_evaluate(KMeansEstimator, train_spec_kmeans, eval_spec_kmeans)
Best,
Josh
Here's my best answer so far, with timing information, building off of Eliethesaiyan's answer and the linked docs.
My original Dataset codeblock and performance:
dataset = (
    tf.data.TFRecordDataset(tfrecords_path)
    .map(parse_fn)
    .batch(1024)
)
real 1m36.171s
user 2m57.756s
sys 0m42.304s
Eliethesaiyan's answer (prefetch + num_parallel_calls)
import multiprocessing

dataset = (
    tf.data.TFRecordDataset(tfrecords_path)
    .map(parse_fn, num_parallel_calls=multiprocessing.cpu_count())
    .batch(1024)
    .prefetch(1024)
)
real 0m41.450s
user 1m33.120s
sys 0m18.772s
From the docs using map_and_batch + num_parallel_batches + prefetch:
dataset = (
    tf.data.TFRecordDataset(tfrecords_path)
    .apply(
        tf.contrib.data.map_and_batch(
            map_func=parse_fn,
            batch_size=1024,
            num_parallel_batches=multiprocessing.cpu_count()
        )
    )
    .prefetch(1024)
)
real 0m32.855s
user 1m11.412s
sys 0m10.408s
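The timings above come from running the full training script. A hedged way to time only the input pipeline, in the same TF 1.x style as the question and reusing my_input_fn, is to drain the iterator in a plain session:
import time
import tensorflow as tf

batch_feats = my_input_fn('/home/ubuntu/train.tfrecords')
start = time.time()
with tf.Session() as sess:
    try:
        while True:  # pull batches until the dataset is exhausted
            sess.run(batch_feats)
    except tf.errors.OutOfRangeError:
        pass
print('input pipeline drained in %.1fs' % (time.time() - start))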
One of the things that I saw that increases GPU and CPU usage is using prefetch on the dataset. It keeps the dataset producer fetching data while the model is consuming the previous batch, thereby maximizing resource usage. Also, specifying the maximum number of CPU cores for the map calls would speed up the process.
I would restructure it this way:
import multiprocessing

dataset = (
    tf.data.TFRecordDataset(tfrecords_path)
    .map(parser, num_parallel_calls=multiprocessing.cpu_count())
    .batch(1024)
)
dataset = dataset.prefetch(1024)
There is a nice guide to best practices for using TFRecords here.
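For newer TensorFlow versions (an assumption, not part of the original answer), the same idea can be expressed with AUTOTUNE so the runtime picks the parallelism and prefetch buffer sizes itself:
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE  # tf.data.AUTOTUNE in TF 2.x

dataset = (
    tf.data.TFRecordDataset(tfrecords_path)
    .map(parser, num_parallel_calls=AUTOTUNE)
    .batch(1024)
    .prefetch(AUTOTUNE)
)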
I have a simple dataframe consisting of one column. In that column are 10320 numerical observations. I'm simulating time-series data by plotting the data in a window of 200 observations at a time. Here is the code for plotting.
import time
import matplotlib.pyplot as plt
from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas
from numpy import arange
from IPython import display

fig_size = plt.rcParams["figure.figsize"]
fig, axes = plt.subplots(1, 1, figsize=(19, 5))

dframe = dframe.set_index(arange(0, len(dframe)))
std = dframe[0].std() * 6
window = 200
iterations = int(len(dframe) / window)
i = 0

while i < iterations:
    frm = window * i
    if i == iterations - 1:  # last window: take all remaining observations
        to = len(dframe)
    else:
        to = frm + window
    df = dframe[frm:to]
    if len(df) > 100:
        df = df.set_index(arange(0, len(df)))
        plt.gca().cla()
        plt.plot(df.index, df[0])
        plt.axhline(y=std, xmin=0, xmax=len(df[0]), c='gray', linestyle='--', lw=2)
        plt.axhline(y=-std, xmin=0, xmax=len(df[0]), c='gray', linestyle='--', lw=2)
        plt.ylim(min(dframe[0]) - 0.5, max(dframe[0]))
        plt.xlim(-50, window + 50)
        display.clear_output(wait=True)
        display.display(plt.gcf())
        canvas = FigureCanvas(fig)
        canvas.print_figure('fig.png', dpi=72, bbox_inches='tight')
    i += 1
plt.close()
This simulates a flow of real-time data and visualizes it. What I want is to apply a theanets RNN LSTM to the data to detect anomalies, unsupervised. Because I am doing it unsupervised, I don't think that I need to split my data into training and test sets. I haven't found much of anything that makes sense to me so far and have been googling for about 2 hours. Just hoping that you guys may be able to help. I want to put the prediction output of the RNN on the graph as well and define a threshold such that, if the error is too large, the values are identified as anomalous. If you need more information please comment and let me know. Thank you!
READING
Like neurons, LSTM networks are built of interconnected LSTM blocks whose training is done via Backpropagation Through Time.
Classical anomaly detection using time series requires predicting the time-series output in the future (at one or more points) and measuring the error of these predictions against the true values. A prediction error above a threshold indicates an anomaly.
SOLUTION
Having said this:
You have to train the network, so you need both training and test sets.
Use N inputs to predict M outputs (decide on N and M by experimentation, choosing values for which the training error is low).
Scroll a window of (N+M) elements over the input data and use each such array of (N+M) items, also termed a frame, to train or test the network (see the sketch below).
Typically, the first 90% of the series is used for training and the remaining 10% for testing.
This scheme will fail if training is not done properly: there will be false prediction errors that are not anomalies. So make sure to provide enough training, and most importantly, shuffle the training frames and cover all variations.
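A minimal sketch of the framing and splitting described above, under the assumption that the series is available as a 1-D NumPy array (for example values = dframe[0].values) and with hypothetical choices N=20, M=5:
import numpy as np

def make_frames(series, n_inputs=20, n_outputs=5):
    # Slide a window of (n_inputs + n_outputs) over the series and split each
    # frame into an input part (X) and a target part (Y).
    frame_len = n_inputs + n_outputs
    X, Y = [], []
    for start in range(len(series) - frame_len + 1):
        frame = series[start:start + frame_len]
        X.append(frame[:n_inputs])
        Y.append(frame[n_inputs:])
    return np.array(X), np.array(Y)

X, Y = make_frames(values, n_inputs=20, n_outputs=5)

# 90/10 split into training and test frames, then shuffle only the training frames.
split = int(0.9 * len(X))
X_train, Y_train = X[:split], Y[:split]
X_test, Y_test = X[split:], Y[split:]
perm = np.random.permutation(len(X_train))
X_train, Y_train = X_train[perm], Y_train[perm]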
Prediction with an SVM model built from 5 features and 3000 samples using default parameters is taking unexpectedly long (more than an hour) when predicting on 100000 samples with the same 5 features. Is there a way of accelerating the prediction?
A few issues to consider here:
Have you standardized your input matrix X? SVM is not scale-invariant, so it can be difficult for the algorithm to do classification if it is given a large number of raw inputs without proper scaling.
The choice of the parameter C: a higher C allows a more complicated, non-smooth decision boundary, and it takes much more time to fit at this complexity. So decreasing C from the default of 1 to a lower value could accelerate the process.
It is also recommended to choose a proper value of gamma. This can be done via grid-search cross-validation.
Here is the code to do grid-search cross validation. I ignore the test set here for simplicity.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, recall_score, f1_score, roc_auc_score, make_scorer
# generate some artificial data
X, y = make_classification(n_samples=3000, n_features=5, weights=[0.1, 0.9])
# make a pipeline for convenience
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf', class_weight='balanced'))
# set up parameter space, we want to tune SVC params C and gamma
# the range below is 10^(-5) to 1 for C and 0.01 to 100 for gamma
param_space = dict(svc__C=np.logspace(-5,0,5), svc__gamma=np.logspace(-2, 2, 10))
# choose your customized scoring function, popular choices are f1_score, accuracy_score, recall_score, roc_auc_score
my_scorer = make_scorer(roc_auc_score, greater_is_better=True)
# construct grid search
gscv = GridSearchCV(pipe, param_space, scoring=my_scorer)
gscv.fit(X, y)
# what's the best estimator
gscv.best_params_
Out[20]: {'svc__C': 1.0, 'svc__gamma': 0.21544346900318834}
# what's the best score, in our case, roc_auc_score
gscv.best_score_
Out[22]: 0.86819366014152421
Note: the SVC is still not running very fast. It takes more than 40s to compute 50 possible combinations of params.
%time gscv.fit(X, y)
CPU times: user 42.6 s, sys: 959 ms, total: 43.6 s
Wall time: 43.6 s
Because the number of features is relatively low, I would start with decreasing the penalty parameter. It controls the penalty for mislabeled samples in the training data, and as your data contains 5 features, I guess it is not exactly linearly separable.
Generally, decreasing this parameter (C) allows the classifier to have a larger margin at the expense of training accuracy (see this for more information).
By default, C=1.0. Start with svm = SVC(C=0.1) and see how it goes.
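One way to "see how it goes" is to time predict() for a couple of C values on synthetic data of the same shape as in the question (the data here is artificial, so the numbers are only illustrative). For a kernel SVM, prediction time grows with the number of support vectors, which can be inspected after fitting:
import time
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: train on 3000 samples, predict on 100000, 5 features each.
X, y = make_classification(n_samples=103000, n_features=5, random_state=0)
X = StandardScaler().fit_transform(X)
X_train, y_train, X_pred = X[:3000], y[:3000], X[3000:]

for C in (1.0, 0.1):
    svm = SVC(C=C).fit(X_train, y_train)
    start = time.time()
    svm.predict(X_pred)
    print("C=%s: %d support vectors, predict took %.2fs"
          % (C, svm.support_.shape[0], time.time() - start))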
One reason might be that the parameter gamma is not the same.
By default, sklearn.svm.SVC uses the RBF kernel with gamma set to 0.0, in which case 1/n_features is used instead. So gamma differs for different numbers of features.
In terms of suggestions, I agree with Jianxun's answer.