Scikit Learn - Random Forest: How is a continuous feature handled?

Random Forest accepts numerical data. Usually, features with text data are converted to numerical categories, and continuous numerical data is fed in as-is, without discretization. How does the RF treat continuous data when creating nodes? Will it bin the continuous numerical data internally, or treat each value as a discrete level?
for example:
I want to feed a data set (of course after categorizing the text features) to RF. How is the continuous data handled by the RF?
Is it advisable to discretize the continuous data (longitudes and latitudes, in this case) before feeding it in? Or is information lost by doing so?

As far as I understand, you are asking how the threshold is chosen for continuous features. The splitting occurs at values where your class changes. For example, consider the following 1D dataset with x as the feature and y as the class variable:
x = [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [ 1, 1, 0, 0, 0, 0, 0, 1, 1, 1]
Two candidate cuts will be considered: (i) between 2 and 3 (which will effectively look like x < 2.5) and (ii) between 7 and 8 (x < 7.5).
Of these two candidates the second one will be chosen, since it provides a better separation. Then the algorithm goes on to the next step.
Therefore it is not advisable to discretize the data yourself. Think about this with the data above: if, for example, you discretize the data into 5 bins [1, 2 | 3, 4 | 5, 6 | 7, 8 | 9, 10], you miss the best split (since 7 and 8 end up in the same bin).
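A minimal sketch (assuming scikit-learn's DecisionTreeClassifier, the base estimator used by Random Forest) that lets a depth-1 tree pick the threshold on the toy data above:
import numpy as np
from sklearn.tree import DecisionTreeClassifier

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
y = np.array([1, 1, 0, 0, 0, 0, 0, 1, 1, 1])

# A single split is enough to see which cut the tree prefers
tree = DecisionTreeClassifier(max_depth=1, random_state=0).fit(x, y)
print(tree.tree_.threshold[0])  # lands between 7 and 8, i.e. 7.5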

You are asking about DecisionTrees. RandomForest is an ensemble model; by itself it doesn't know anything about the data, it fully relies on the decisions of its base estimators (in this case DecisionTrees) and aggregates them.
So, how does a DecisionTree treat continuous features? Look at this official documentation page. A DecisionTreeClassifier was fitted on a continuous dataset (Fisher's irises); if you look at the picture of the tree, each node holds a threshold value over the feature chosen at that node.
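A hedged sketch along the same lines: fit a RandomForestClassifier on the iris data and inspect one of its base trees to see the learned thresholds (assuming the tree_ attribute layout of current scikit-learn):
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

first_tree = forest.estimators_[0]       # a plain DecisionTreeClassifier
print(first_tree.tree_.feature[:5])      # feature index used at each node (-2 marks a leaf)
print(first_tree.tree_.threshold[:5])    # continuous threshold at each node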

Related

Is there any "better way" to train a quadruped to walk using reinforcement Learning

I have been chasing this problem of using RL to train a quadruped to walk, but have had no noteworthy success. Following are the details of the gym env I am using.
Sim: pybullet
env.action_space = box(shape=(12,), upper = 1, lower = -1)
Selected actions are converted by multiplying them by the max_actions specified for each joint.
The action space is the 3 motor positions (hip_joint_y, hip_joint_x, knee_joint) x 4 legs of the robot.
env.observation_space = box(shape=(12,), upper = np.inf, lower = -np.inf)
The observation_space includes:
roll, pitch of the body [r, p]
angular vel [x, y, z]
linear acc [x, y, z]
Binary contact forces for each leg [1 if in contact else 0]. [1, 1, 1, 1]
reward = (
+ distance_reward
- body_rotation_reward
- energy_usage_reward
- body_drift_from x-axis reward
- body_shake_reward)
I have tried the following approaches.
Using PPO from stable-baselines3 for 20 million timesteps [No Distinct improvement]
Using DDPG, TD3, SAC, A2C, and PPO with 5 million timesteps on each algorithm, increasing the policy network up to 4 layers of 1024 neurons each [1024, 1024, 1024, 1024] for qf and vf, or actor and critic.
Using the discrete-delta concept to scale action limits, i.e. changing the action_space from Box to MultiDiscrete with each action ranging from 0 to 6 and discrete_delta_vals = [-0.3, -0.1, -0.03, 0, 0.03, 0.1, 0.3]. Each joint value is decided by choosing one value from the discrete_delta_vals list and adding it to the previous action (see the sketch after this list).
Keeping hip_joint_y of all legs at zero and changing the action space from box(shape=(12,)) to box(shape=(8,)). I trained this agent for another 6M timesteps; there seemed to be a small improvement at first, then the eps_length and mean_reward settled with no significant improvement afterwards.
I have generated half-ellipsoid trajectories with IK and that works, but that is an explicitly robotics approach to this problem. I am currently looking into DeepMimic to use those trajectories to guide RL towards a stable walking gait. No significant breakthrough.
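A hedged sketch of the discrete-delta action scheme described above (it assumes the standard gym.spaces API; apply_discrete_delta is an illustrative helper, not code from the repo):
import numpy as np
from gym import spaces

DISCRETE_DELTA_VALS = np.array([-0.3, -0.1, -0.03, 0.0, 0.03, 0.1, 0.3])

# One categorical choice (0..6) per joint, 12 joints in total
action_space = spaces.MultiDiscrete([len(DISCRETE_DELTA_VALS)] * 12)

def apply_discrete_delta(prev_joint_targets, action):
    # Map each chosen index to a delta and add it to the previous joint command
    deltas = DISCRETE_DELTA_VALS[action]
    return np.clip(prev_joint_targets + deltas, -1.0, 1.0)

prev = np.zeros(12)
print(apply_discrete_delta(prev, action_space.sample()))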
Here is the Repo Link
Check the scripts folder and go through the start_training_v(x).py scripts. Thanks in advance. If you feel like discussing the entire topic to sort this out, please drop your email in a comment and I'll reach out to you.
Hi, try using Nvidia IsaacGym. It uses PyTorch end to end on GPU with PPO. I was able to train a custom URDF to walk in about 10 minutes of training.

How to fit the best probability distribution model to my data in python?

I have about 20,000 rows of data like this:
Id | value
1  | 30
2  | 3
3  | 22
.. | ..
n  | 27
I computed some statistics on my data: the average value is 33.85, median 30.99, min 2.8, max 206, 95% confidence interval 0.21. So most values are around 33, and there are a few outliers. It looks like a distribution with a long tail.
I am new to both distributions and Python. I tried the fitter class https://pypi.org/project/fitter/ to try many distributions from the Scipy package, and the loglaplace distribution showed the lowest error (although I don't quite understand it).
I read almost all the questions in this thread and concluded there are two approaches: (1) fit a distribution model and then draw random values from it in my simulation, or (2) compute the frequency of different groups of values, but this solution will never produce a value above 206, for example.
Given my data, which is just values (numbers), what is the best approach to fit a distribution to it in Python, given that in my simulation I need to draw numbers? The random numbers must have the same pattern as my data. I also need to validate that the model represents my data well by plotting my data and the model curve.
One way is to select the best model according to the Bayesian information criterion (called BIC).
OpenTURNS implements an automatic method of selection (see doc here).
Suppose you have an array x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]; here is a quick example:
import openturns as ot
# Define x as a Sample object. It is a sample of size 11 and dimension 1
sample = ot.Sample([[xi] for xi in x])
# define distributions you want to test on the sample
tested_distributions = [ot.WeibullMaxFactory(), ot.NormalFactory(), ot.UniformFactory()]
# find the best distribution according to BIC and print its parameters
best_model, best_bic = ot.FittingTest.BestModelBIC(sample, tested_distributions)
print(best_model)
>>> Uniform(a = -0.769231, b = 10.7692)
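If you prefer to stay within SciPy, here is a hedged sketch (not part of the answer above) that fits the loglaplace candidate mentioned in the question, draws random values from it, and overlays the fitted density on a histogram for a visual check; the synthetic data array is only a placeholder for your 20,000 values:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

data = np.random.lognormal(mean=3.4, sigma=0.5, size=20_000)  # placeholder for your values

params = stats.loglaplace.fit(data)        # maximum-likelihood fit
dist = stats.loglaplace(*params)
samples = dist.rvs(size=1000)              # random draws to use in the simulation

xs = np.linspace(data.min(), data.max(), 500)
plt.hist(data, bins=100, density=True, alpha=0.5, label="data")
plt.plot(xs, dist.pdf(xs), label="fitted loglaplace")
plt.legend()
plt.show()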

Spark: Difference Between Reduce() vs Fold() [duplicate]

This question already has an answer here:
Why is the fold action necessary in Spark?
I'm studying Spark using Learning Spark, Lightning-Fast Data Analysis book.
I have been to many sites and read many articles, but I still do not understand the difference between reduce() and fold().
According to the book that I'm using:
"Similar to reduce() is fold(), which also takes a function with the same signature as needed for reduce(), but in addition takes a “zero value” to be used for the initial call on each partition. The zero value you provide should be the identity element for your operation; that is, applying it multiple times with your function should not change the value (e.g., 0 for +, 1 for *, or an empty list for concatenation)."
To help me better understand, I run the following code:
rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 2)
rdd.getNumPartitions()
Out[1]: 2
rdd.glom().collect()
Out[2]: [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
rdd.reduce(lambda x,y: x+y)
Out[3]: 55
rdd.fold(0, lambda x,y: x+y)
Out[4]: 55
Question:
1) Referencing: "but in addition takes a “zero value” to be used for the initial call on each partition." What does "initial call on each partition" mean?
2) Referencing: "The zero value you provide should be the identity element for your operation; that is, applying it multiple times with your function should not change the value" If that's the case, what is the point of providing "the value" for the operation?
3) According to the example I provided above, both produced the sum of 55. What's the difference?
The difference is that fold lets you change the type of the result, whereas reduce doesn't take a zero value and thus can only use values from the data.
e.g.
rdd.fold("",lambda x,y: x+str(y))
'12345678910'
Your example doesn't change the type of the result, and indeed in that example you could use reduce instead of fold.
A "normal" fold used in a non-distributed environment uses the initial value once. However, since Spark runs distributed, it runs a fold that starts with the initial value in each partition and uses it again when combining the partition results.
Because in your example you've created the 10 numbers above in 2 partitions, if we call the following:
rdd.fold("HERE",lambda x,y: x+str(y))
we'd get
'HEREHERE12345HERE678910'
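A hedged illustration of why the zero value should be the identity for your operation: with 2 partitions, a non-identity zero is applied once per partition and once more when the partition results are merged on the driver, so the extra amount depends on the number of partitions:
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 2)

print(rdd.fold(0, lambda x, y: x + y))   # 55, identical to reduce
print(rdd.fold(1, lambda x, y: x + y))   # 58 = 55 + 1 per partition (x2) + 1 for the final merge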

Understanding Data Leakage and getting perfect score by exploiting test data

I have read an article on data leakage. In a hackathon there are two sets of data, train data on which participants train their algorithm and test set on which performance is measured.
Data leakage helps in getting a perfect score on the test data, without viewing the train data, by exploiting the leak.
I have read the article, but I am missing the crux of how the leakage is exploited.
The steps shown in the article are the following:
Let's load the test data.
Note that we don't have any training data here, just test data. Moreover, we will not even use any features of the test objects. All we need to solve this task is the file with the indices of the pairs that we need to compare.
Let's load the data with test indices.
test = pd.read_csv('../test_pairs.csv')
test.head(10)
pairId FirstId SecondId
0 0 1427 8053
1 1 17044 7681
2 2 19237 20966
3 3 8005 20765
4 4 16837 599
5 5 3657 12504
6 6 2836 7582
7 7 6136 6111
8 8 23295 9817
9 9 6621 7672
test.shape[0]
368550
For example, we can think that there is a test dataset of images, and each image is assigned a unique Id from 0 to N−1 (N is the number of images). In the dataframe above, FirstId and SecondId point to these Ids and define the pairs that we should compare: e.g. do both images in the pair belong to the same class or not. So, for example, for the first row: if the images with Id=1427 and Id=8053 belong to the same class, we should predict 1, and 0 otherwise.
But in our case we don't really care about the images, or about how exactly we compare them (as long as the comparator is binary).
print(test['FirstId'].nunique())
print(test['SecondId'].nunique())
26325
26310
So the number of pairs we are given to classify is very very small compared to the total number of pairs.
To exploit the leak we need to assume (or prove) that the total number of positive pairs is small compared to the total number of pairs. For example, think about an image dataset with 1000 classes and N images per class. If the task were to tell whether a pair of images belongs to the same class or not, we would have 1000*N*(N−1)/2 positive pairs, while the total number of pairs is 1000*N*(1000*N−1)/2.
Another example: in the Quora competition the task was to classify whether a pair of questions are duplicates of each other or not. Of course, the total number of question pairs is huge, while the number of duplicates (positive pairs) is much, much smaller.
Finally, let's get the fraction of pairs of class 1. We just need to submit a constant prediction of "all ones" and check the returned accuracy. Create a dataframe with columns pairId and Prediction, fill it, export it to a .csv file, and then submit it:
test['Prediction'] = np.ones(test.shape[0])
sub=pd.DataFrame(test[['pairId','Prediction']])
sub.to_csv('sub.csv',index=False)
The all-ones submission gets an accuracy score of 0.500000.
So, we assumed the total number of pairs is much higher than the number of positive pairs, but that is not the case for the test set. It means that the test set was constructed not by sampling random pairs, but with a specific sampling algorithm: pairs of class 1 are oversampled.
Now think: how can we exploit this fact? What is the leak here? If you get it now, you may try to get to the final answer yourself; otherwise you can follow the instructions below.
Building a magic feature
In this section we will build a magic feature, that will solve the problem almost perfectly. The instructions will lead you to the correct solution, but please, try to explain the purpose of the steps we do to yourself -- it is very important.
Incidence matrix
First, we need to build an incidence matrix. You can think of pairs (FirstId, SecondId) as of edges in an undirected graph.
The incidence matrix is a matrix of size (maxId + 1, maxId + 1), where each row (column) i corresponds to the i-th Id. In this matrix we put the value 1 at position [i, j] if and only if the pair (i, j) or (j, i) is present in the given set of pairs (FirstId, SecondId). All the other elements of the incidence matrix are zeros.
Important! Incidence matrices are typically very sparse (a small number of non-zero values). At the same time they are usually huge in terms of total number of elements, and it is impossible to store them in memory in dense format. But due to their sparsity they can easily be represented as sparse matrices. If you are not familiar with sparse matrices, please see the wiki and the scipy.sparse reference. Please use any of the scipy.sparse constructors to build the incidence matrix.
For example, you can use this constructor: scipy.sparse.coo_matrix((data, (i, j))). We highly recommend learning to use the different scipy.sparse constructors and matrix types, but if you feel you don't want to use them, you can always build this matrix with a simple for loop. You would first need to create a matrix using scipy.sparse.coo_matrix((M, N), [dtype]) with an appropriate shape (M, N), and then iterate through the (FirstId, SecondId) pairs and fill the corresponding elements of the matrix with ones.
Note, that the matrix should be symmetric and consist only of zeros and ones. It is a way to check yourself.
import networkx as nx
import numpy as np
import pandas as pd
import scipy.sparse
import matplotlib.pyplot as plt
test = pd.read_csv('../test_pairs.csv')
x = test[['FirstId','SecondId']].rename(columns={'FirstId':'col1', 'SecondId':'col2'})
y = test[['SecondId','FirstId']].rename(columns={'SecondId':'col1', 'FirstId':'col2'})
comb = pd.concat([x,y],ignore_index=True).drop_duplicates(keep='first')
comb.head()
col1 col2
0 1427 8053
1 17044 7681
2 19237 20966
3 8005 20765
4 16837 599
data = np.ones(comb.col1.shape, dtype=int)
inc_mat = scipy.sparse.coo_matrix((data, (comb.col1, comb.col2)), shape=(comb.col1.max() + 1, comb.col1.max() + 1))
inc_mat = inc_mat.tocsr()  # COO matrices do not support row indexing, so convert to CSR first
rows_FirstId = inc_mat[test.FirstId.values, :]
rows_SecondId = inc_mat[test.SecondId.values, :]
f = rows_FirstId.multiply(rows_SecondId)
f = np.asarray(f.sum(axis=1))
f.shape
(368550, 1)
f = f.sum(axis=1)
f = np.squeeze(np.asarray(f))
print (f.shape)
Now build the magic feature
Why did we build the incidence matrix? We can think of the rows of this matrix as representations of the objects: the i-th row is the representation of the object with Id = i. Then, to measure the similarity between two objects, we can measure the similarity between their representations. And we will see that such representations are very good.
Now select the rows from the incidence matrix that correspond to test.FirstId's and test.SecondId's.
Do not forget to convert pd.Series to np.array.
These lines should normally run very quickly:
rows_FirstId = inc_mat[test.FirstId.values,:]
rows_SecondId = inc_mat[test.SecondId.values,:]
Our magic feature will be the dot product between the representations of a pair of objects. The dot product can be regarded as a similarity measure: for our non-negative representations the dot product is close to 0 when the representations are different, and is large when they are similar.
Now compute the dot product between corresponding rows of the rows_FirstId and rows_SecondId matrices.
From magic feature to binary predictions
But how do we convert this feature into binary predictions? We do not have a train set to learn a model from, but we have one piece of information about the test set: the baseline accuracy score we got when submitting the constant prediction. And we also have very strong considerations about the data-generating process, so probably we will be fine even without a training set.
We may try to choose a threshold, and set the prediction to 1 if the feature value f is higher than the threshold, and 0 otherwise. What threshold would you choose?
How do we find the right threshold? Let's first examine this feature: print the frequencies (or counts) of each value in the feature f.
For example, use the np.unique function and check its flags.
Function to count the frequency of each element:
from scipy.stats import itemfreq  # note: itemfreq has been removed from recent SciPy versions; np.unique(f, return_counts=True) gives the same counts
itemfreq(f)
array([[ 14, 183279],
[ 15, 852],
[ 19, 546],
[ 20, 183799],
[ 21, 6],
[ 28, 54],
[ 35, 14]])
Do you see how this feature clusters the pairs? Maybe you can guess a good threshold by looking at the values?
In fact, in other situations it may not be that obvious, but in general, to pick a threshold you only need to remember the score of your baseline submission and use that information.
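A hedged sketch of that reasoning: since the all-ones baseline scored about 0.5, roughly half of the test pairs are class 1, so pick the threshold where the fraction of pairs above it is closest to 0.5 (variable names are illustrative and reuse the f computed above):
import numpy as np

values, counts = np.unique(f, return_counts=True)
total = counts.sum()
for v in values:
    frac_above = counts[values > v].sum() / total
    print(v, round(frac_above, 3))   # choose the v where frac_above is closest to 0.5 (here, 14)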
Choose a threshold below:
pred = f > 14 # SET THRESHOLD HERE
pred
array([ True, False, True, ..., False, False, False], dtype=bool)
submission = test.loc[:,['pairId']]
submission['Prediction'] = pred.astype(int)
submission.to_csv('submission.csv', index=False)
I want to understand the idea behind this. How are we exploiting the leak from the test data only?
There's a hint in the article. The number of positive pairs should be 1000*N*(N−1)/2, while the number of all pairs is 1000*N*(1000*N−1)/2. Of course, the number of all pairs is much, much larger if the test set is sampled at random.
As the author mentions, after you evaluate your constant prediction of 1s on the test set, you can tell that the sampling was not done at random. The accuracy you obtain is 50%. Had the sampling been done correctly, this value should've been much lower.
Thus, they construct the incidence matrix and calculate the dot product (the measure of similarity) between the representations of our ID features. They then reuse the information about the accuracy obtained with constant predictions (at 50%) to obtain the corresponding threshold (f > 14). It's set to be greater than 14 because that constitutes roughly half of our test set, which in turn maps back to the 50% accuracy.
The "magic" value didn't have to be greater than 14. It could have been equal to 14. You could have adjusted this value after some leader board probing (as long as you're capturing half of the test set).
It was observed that the test data was not sampled properly; same-class pairs were oversampled. Thus each pair in the test set has a much higher probability of having target=1 than a random pair would. This led to the belief that one could construct a similarity measure based only on the pairs that are present in the test set, i.e., whether a pair made it into the test set is itself a strong indicator of similarity.
Using this insight, one can calculate an incidence matrix and represent each id j as a binary array (the i-th element indicating the presence of the i-j pair in the test set, and thus a strong probability of similarity between them). This is a pretty accurate measure, allowing one to find the "similarity" between two rows just by taking their dot product.
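A small hedged toy example (with a made-up pair list) of what that dot product computes: it counts how many ids are paired with both members of a pair, i.e. their common neighbours in the pair graph:
import numpy as np
import scipy.sparse

pairs = [(0, 1), (1, 2), (0, 2), (2, 3)]   # toy (FirstId, SecondId) list
n = 4
rows, cols = zip(*(pairs + [(j, i) for i, j in pairs]))   # symmetrize
data = np.ones(len(rows))
inc = scipy.sparse.coo_matrix((data, (rows, cols)), shape=(n, n)).tocsr()

# ids 0 and 1 are both paired with id 2, so their feature value is 1
print(inc[0].multiply(inc[1]).sum())   # -> 1.0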
The cutoff arrived at comes purely from knowledge of the target distribution found by leaderboard probing.

Typed Lists in Theano

Consider the following machine translation problem. Let s be a source sentence and t be a target sentence. Both sentences are conceptually represented as lists of indices, where the indices correspond to the position of the words in the associated dictionaries. Example:
s = [34, 68, 91, 20]
t = [29, 0, 43]
Note that s and t don't necessarily have the same length. Now let S and T be sets of such instances. In other words, they are a parallel corpus. Example:
S = [[34, 68, 91, 20], [4, 7, 1]]
T = [[29, 0, 43], [190, 37, 25, 60]]
Note that not all s's in S have the same length. That is, sentences have variable numbers of words.
I am implementing a machine translation system in Theano, and the first design decision is what kind of data structures to use for S and T. From one of the answers posted on Matrices with different row lengths in numpy, I learnt that typed lists are a good solution for storing variable-length tensors.
However, I realise that they complicate my code a lot. Let me give you one example. Say that we have two typed lists, y and p_y_given_x, and we aim to calculate the negative log likelihood. If they were regular tensors, a simple statement like this would suffice:
loss = t.mean(t.nnet.categorical_crossentropy(p_y_given_x, y))
But categorical_crossentropy can only be applied to tensors, so in case of typed lists I have to iterate over them and apply the function separately to each element:
_loss, _ = theano.scan(
    fn=lambda i, p, y: t.nnet.categorical_crossentropy(p[i], y[i]),
    non_sequences=[p_y_given_x, y],
    sequences=[t.arange(y.__len__(), dtype='int64')])
loss = t.mean(_loss)
On top of making my code more and more messy, these problems propagate. For instance, if I want to calculate the gradient of the loss, the following doesn't work anymore:
grad_params = t.grad(loss, params)
I don't know exactly why it doesn't work. I'm sure it has to do with the type of loss, but I am not interested in investigating further how I could make it work. The mess is growing exponentially, and what I would like to know is whether I am using typed lists in the wrong way, or whether it is time to give up on them because they are not well enough supported yet.
Typed lists aren't used by anybody yet. The idea behind them is that you iterate over them with scan, once per sentence, and do everything you need inside that one scan; you don't do one scan per operation.
So scan is only used to iterate over the examples in the minibatch, and the inside of the scan is everything that is done on one example.
We haven't tested typed lists with grad yet. It is possible that some implementations are missing.

Resources