In equal-width discretization, the variable values are assigned to intervals of the same width. The number of intervals is user-defined and the width is determined by the minimum/maximum values and the number of intervals.
For example, given the values 10, 20, 100, 130, the minimum is 10 and the maximum is 130. If the user defines the number of intervals as six, then, given the formula:
Interval Width = (Max(x) - Min(x)) / N
The width is (130 - 10) / 6 = 20
And the interval boundaries are [10, 30, 50, 70, 90, 110, 130], which define six zero-based intervals (indices 0 through 5).
Finally, the interval assignments are defined for each element in the dataset:
Value in the dataset | New feature engineered value
10  | 0
20  | 0
57  | 2
101 | 4
130 | 5
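The same assignments can be reproduced with numpy, for example (a quick illustrative check; np.digitize against the interior edges yields the zero-based interval indices):
import numpy as np
values = np.array([10, 20, 57, 101, 130])
edges = np.linspace(10, 130, 7)           # [10, 30, 50, 70, 90, 110, 130]
codes = np.digitize(values, edges[1:-1])  # interior edges -> indices 0..5
print(codes)                              # [0 0 2 4 5]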
I have the following code that uses a pandas DataFrame with an sklearn function to divide the column into equal-width intervals:
from sklearn.preprocessing import KBinsDiscretizer
discretizer = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')
df['output_col'] = discretizer.fit_transform(df[['input_col']])
This works fine, but I need to implement an equivalent dask function that will run the process in parallel over multiple partitions, and I cannot find KBinsDiscretizer in dask_ml.preprocessing. Any suggestions? I cannot simply use map_partitions, because it would apply the function to each partition independently, and I need the same intervals applied across the entire dataframe.
You're facing a common tradeoff with distributed workflows. Do you want to spend the time/resource/compute required to determine the exact min/max, which is a pre-requisite for the binning scheme you describe, or is an approximate answer alright? If the latter, how do you design an algorithm which adequately captures the data's min/max while remaining efficient?
We can start with the exact solution, since it's easier to implement. The key is simply to find the min and max first, then digitize the data. Note that this requires passing over all values in the column twice. If persisting the data is an option (e.g. you are working with a distributed cluster or can fit the column to be binned in memory), it would help avoid unnecessary repetition:
import dask
import dask.dataframe
import numpy as np

def discretize_exact(
    s: dask.dataframe.Series, K: int
) -> dask.dataframe.Series:
    """
    Discretize values in a dask.dataframe.Series into K equal-width bins.

    Parameters
    ----------
    s : dask.dataframe.Series
        Series with values to be binned
    K : int
        Number of equal-width bins to generate

    Returns
    -------
    binned : dask.dataframe.Series
        dask.dataframe.Series with a scheduled np.digitize operation
        called using map_partitions. The values in ``binned`` will
        be in [0, K - 1], giving the index of the K bins covering the
        interval [vmin, vmax].
    """
    # schedule the min/max computation
    vmin, vmax = s.min(), s.max()
    # compute vmin and vmax together so the data is only traversed once
    vmin, vmax = dask.compute(vmin, vmax)
    # K - 1 interior edges define K equal-width bins, with the outer
    # ends open, so that the first bin is (-inf, vmin + step) and the
    # last is [vmax - step, inf)
    bins = np.linspace(vmin, vmax, (K + 1))[1:-1]
    return s.map_partitions(
        np.digitize,
        bins=bins,
        meta=('binned', 'uint16'),
    )
This does (I think) what you're looking for, but it does involve computing the min and max prior to scheduling the binning operation. Using an example frame:
import dask.dataframe, pandas as pd, numpy as np
N = 10000
df = dask.dataframe.from_pandas(
    pd.DataFrame({'a': np.random.random(size=N)}),
    chunksize=1000,
)
We can use the above function to discretize our data:
In [68]: df['binned_a'] = discretize_exact(df['a'], K=10)
In [69]: df
Out[69]:
Dask DataFrame Structure:
a binned_a
npartitions=10
0 float64 uint16
1000 ... ...
... ... ...
9000 ... ...
9999 ... ...
Dask Name: assign, 40 tasks
In [70]: df.compute()
Out[70]:
a binned_a
0 0.548415 5
1 0.872668 8
2 0.466869 4
3 0.133986 1
4 0.833126 8
... ... ...
9995 0.223438 2
9996 0.575271 5
9997 0.922593 9
9998 0.030127 0
9999 0.204283 2
[10000 rows x 2 columns]
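As mentioned above, if persisting the data is an option, you can avoid recomputing the upstream graph for both the min/max pass and the digitize pass. A minimal sketch, assuming the column fits in (distributed) memory:
df = df.persist()  # materialize the partitions once; both passes reuse them
df['binned_a'] = discretize_exact(df['a'], K=10)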
Alternatively, you could try to approximate the bin edges. You could do this a number of ways, including sampling the dataframe to identify the min/max of one or more partitions, or the user could provide a deliberately wide estimate of the range. Note that, depending on your workflow, computing the first partition may still involve computing a large part of the overall graph, or even the entire graph if e.g. the dataframe was reshuffled in a recent step.
def find_minmax_of_first_partition(
    s: dask.dataframe.Series
) -> tuple[float, float]:
    """
    Find the min and max of the first partition of a dask.dataframe.Series
    """
    partition_0_stats = (
        s.partitions[0].compute().agg(['min', 'max'])
    )
    return (
        partition_0_stats['min'].item(),
        partition_0_stats['max'].item(),
    )
You could widen this range if desired, using your intuition about the spread of the values:
K = 10
vmin_p0, vmax_p0 = find_minmax_of_first_partition(df['a'])
range_p0 = (vmax_p0 - vmin_p0)
mean_p0 = (vmin_p0 + vmax_p0) / 2
# guess that the overall data is within 10x the range of the first partition
min_est, max_est = mean_p0 - 5*range_p0, mean_p0 + 5*range_p0
# now, bin all values using this estimated min, max. Note that
# any data falling outside your estimated min/max value will be
# coded as values 0 or K + 1.
bins = np.linspace(min_est, max_est, (K + 1))
binned = df['a'].map_partitions(
    np.digitize,
    bins=bins,
    meta=('binned', 'uint16'),
)
These bins will be equally spaced, but will not necessarily start/end at the min/max, and therefore may either not catch all the data or may have empty bins at the edges. You may need to take a look at how your bin specification performs and iterate based on your data.
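One quick way to inspect how an estimated bin specification performs is to count how many values land in each code, including the open-ended codes 0 and K + 1 (a small sketch, reusing the binned series from above):
bin_counts = binned.value_counts().compute().sort_index()
print(bin_counts)  # codes 0 and K + 1 collect values outside [min_est, max_est]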
I have about 20,000 rows of data like this:
Id | value
 1 | 30
 2 | 3
 3 | 22
.. | ..
 n | 27
I computed statistics on my data: the average value is 33.85, the median 30.99, the min 2.8, the max 206, and the 95% confidence interval 0.21. So most values are around 33, and there are a few outliers. It looks like a distribution with a long tail.
I am new to both distributions and Python. I tried the fitter class (https://pypi.org/project/fitter/) to try many distributions from the SciPy package, and the loglaplace distribution showed the lowest error (although I don't quite understand it).
I read almost all questions in this thread and concluded there are two approaches: (1) fit a distribution model and then draw random values from it in my simulation; (2) compute the frequency of different groups of values, but this solution will never produce a value greater than 206, for example.
Given my data, which is numeric values, what is the best approach to fit a distribution to it in Python, since in my simulation I need to draw numbers? The random numbers must follow the same pattern as my data. I also need to validate that the model represents my data well by plotting my data and the model curve.
One way is to select the best model according to the Bayesian information criterion (BIC).
OpenTURNS implements an automatic method of selection (see doc here).
Suppose you have an array x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]; here is a quick example:
import openturns as ot
# Define x as a Sample object. It is a sample of size 11 and dimension 1
sample = ot.Sample([[xi] for xi in x])
# define distributions you want to test on the sample
tested_distributions = [ot.WeibullMaxFactory(), ot.NormalFactory(), ot.UniformFactory()]
# find the best distribution according to BIC and print its parameters
best_model, best_bic = ot.FittingTest.BestModelBIC(sample, tested_distributions)
print(best_model)
>>> Uniform(a = -0.769231, b = 10.7692)
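To address the simulation and validation parts of the question: once the best model is selected, you can draw random values from it and overlay its density on your data. A sketch using standard OpenTURNS facilities (getSample, drawPDF and the viewer module):
from openturns.viewer import View
# draw 1000 random values following the fitted distribution
simulated = best_model.getSample(1000)
# overlay the fitted PDF on an empirical (histogram-based) estimate of the data
graph = ot.HistogramFactory().build(sample).drawPDF()
graph.add(best_model.drawPDF())
View(graph).show()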
I have read an article on data leakage. In a hackathon there are two sets of data: train data, on which participants train their algorithm, and a test set, on which performance is measured.
Data leakage helps in getting a perfect score on the test data, without even looking at the train data, by exploiting the leak.
I have read the article, but I am missing the crux of how the leakage is exploited.
The steps shown in the article are the following:
Let's load the test data.
Note that we don't have any training data here, just test data. Moreover, we will not even use any features of the test objects. All we need to solve this task is the file with the indices of the pairs that we need to compare.
Let's load the data with test indices.
test = pd.read_csv('../test_pairs.csv')
test.head(10)
pairId FirstId SecondId
0 0 1427 8053
1 1 17044 7681
2 2 19237 20966
3 3 8005 20765
4 4 16837 599
5 5 3657 12504
6 6 2836 7582
7 7 6136 6111
8 8 23295 9817
9 9 6621 7672
test.shape[0]
368550
For example, we can think that there is a test dataset of images, and each image is assigned a unique Id from 0 to N−1 (N is the number of images). In the dataframe above, FirstId and SecondId point to these Ids and define the pairs that we should compare: e.g. do both images in the pair belong to the same class or not. So, for example, for the first row: if the images with Id=1427 and Id=8053 belong to the same class, we should predict 1, and 0 otherwise.
But in our case we don't really care about the images, or how exactly we compare them (as long as the comparator is binary).
print(test['FirstId'].nunique())
print(test['SecondId'].nunique())
26325
26310
So the number of pairs we are given to classify is very small compared to the total number of possible pairs.
To exploit the leak we need to assume (or prove) that the total number of positive pairs is small compared to the total number of pairs. For example, think about an image dataset with 1000 classes and N images per class. If the task was to tell whether a pair of images belongs to the same class or not, we would have 1000*N*(N−1)/2 positive pairs, while the total number of pairs would be 1000*N*(1000*N−1)/2.
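To make that concrete, here is the arithmetic for, say, N = 10 images per class (an illustrative calculation, not data from the competition):
N = 10
positive_pairs = 1000 * N * (N - 1) // 2          # 45,000
total_pairs = (1000 * N) * (1000 * N - 1) // 2    # 49,995,000
print(positive_pairs / total_pairs)               # ~0.0009, i.e. under 0.1%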
Another example: in the Quora competition the task was to classify whether a pair of questions are duplicates of each other or not. Of course, the total number of question pairs is huge, while the number of duplicates (positive pairs) is much, much smaller.
Finally, let's estimate the fraction of pairs of class 1. We just need to submit a constant prediction of "all ones" and check the returned accuracy. Create a dataframe with columns pairId and Prediction, fill it, and export it to a .csv file. Then submit:
test['Prediction'] = np.ones(test.shape[0])
sub = pd.DataFrame(test[['pairId', 'Prediction']])
sub.to_csv('sub.csv', index=False)
The "all ones" submission gets an accuracy score of 0.500000.
So, we assumed the total number of pairs is much higher than the number of positive pairs, but it is not the case for the test set. It means that the test set is constructed not by sampling random pairs, but with a specific sampling algorithm. Pairs of class 1 are oversampled.
Now think: how can we exploit this fact? What is the leak here? If you get it now, you may try to get to the final answer yourself; otherwise you can follow the instructions below.
Building a magic feature
In this section we will build a magic feature, that will solve the problem almost perfectly. The instructions will lead you to the correct solution, but please, try to explain the purpose of the steps we do to yourself -- it is very important.
Incidence matrix
First, we need to build an incidence matrix. You can think of the pairs (FirstId, SecondId) as edges in an undirected graph.
The incidence matrix is a matrix of size (maxId + 1, maxId + 1), where each row (column) i corresponds to the i-th Id. In this matrix we put the value 1 at position [i, j] if and only if a pair (i, j) or (j, i) is present in the given set of pairs (FirstId, SecondId). All the other elements in the incidence matrix are zeros.
Important! Incidence matrices are typically very, very sparse (a small number of non-zero values). At the same time, incidence matrices are usually huge in terms of the total number of elements, and it is impossible to store them in memory in dense format. But due to their sparsity incidence matrices can be easily represented as sparse matrices. If you are not familiar with sparse matrices, please see wiki and the scipy.sparse reference. Please use any of the scipy.sparse constructors to build the incidence matrix.
For example, you can use this constructor: scipy.sparse.coo_matrix((data, (i, j))). We highly recommend learning to use the different scipy.sparse constructors and matrix types, but if you feel you don't want to use them, you can always build this matrix with a simple for loop. You will first need to create a matrix using scipy.sparse.coo_matrix((M, N), [dtype]) with an appropriate shape (M, N), and then iterate through the (FirstId, SecondId) pairs and fill the corresponding elements of the matrix with ones.
Note, that the matrix should be symmetric and consist only of zeros and ones. It is a way to check yourself.
import networkx as nx
import numpy as np
import pandas as pd
import scipy.sparse
import matplotlib.pyplot as plt
test = pd.read_csv('../test_pairs.csv')
x = test[['FirstId','SecondId']].rename(columns={'FirstId':'col1', 'SecondId':'col2'})
y = test[['SecondId','FirstId']].rename(columns={'SecondId':'col1', 'FirstId':'col2'})
comb = pd.concat([x,y],ignore_index=True).drop_duplicates(keep='first')
comb.head()
col1 col2
0 1427 8053
1 17044 7681
2 19237 20966
3 8005 20765
4 16837 599
data = np.ones(comb.col1.shape, dtype=int)
inc_mat = scipy.sparse.coo_matrix(
    (data, (comb.col1, comb.col2)),
    shape=(comb.col1.max() + 1, comb.col1.max() + 1),
)
# coo_matrix does not support row indexing; convert to CSR first
inc_mat = inc_mat.tocsr()
rows_FirstId = inc_mat[test.FirstId.values,:]
rows_SecondId = inc_mat[test.SecondId.values,:]
f = rows_FirstId.multiply(rows_SecondId)
f = np.asarray(f.sum(axis=1))
f.shape
(368550, 1)
f = f.sum(axis=1)
f = np.squeeze(np.asarray(f))
print (f.shape)
Now build the magic feature
Why did we build the incidence matrix? We can think of the rows in this matrix as representations of the objects: the i-th row is the representation of the object with Id = i. Then, to measure similarity between two objects, we can measure similarity between their representations. And we will see that such representations are very good.
Now select the rows from the incidence matrix that correspond to test.FirstId's and test.SecondId's.
Do not forget to convert the pd.Series to an np.array.
These lines should normally run very quickly:
rows_FirstId = inc_mat[test.FirstId.values,:]
rows_SecondId = inc_mat[test.SecondId.values,:]
Our magic feature will be the dot product between the representations of a pair of objects. The dot product can be regarded as a similarity measure: for our non-negative representations the dot product is close to 0 when the representations are different, and is large when the representations are similar.
Now compute the dot product between the corresponding rows of the rows_FirstId and rows_SecondId matrices.
From magic feature to binary predictions
But how do we convert this feature into binary predictions? We do not have a train set to learn a model, but we have a piece of information about the test set: the baseline accuracy score that you got when submitting the constant. And we also have very strong considerations about the data-generating process, so probably we will be fine even without a training set.
We may try to choose a threshold, and set the predictions to 1 if the feature value f is higher than the threshold, and 0 otherwise. What threshold would you choose?
How do we find the right threshold? Let's first examine the feature: print the frequencies (or counts) of each value in the feature f.
For example, use the np.unique function and check its flags (return_counts in particular).
A function to count the frequency of each element:
from scipy.stats import itemfreq
itemfreq(f)
array([[ 14, 183279],
[ 15, 852],
[ 19, 546],
[ 20, 183799],
[ 21, 6],
[ 28, 54],
[ 35, 14]])
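As a side note, scipy.stats.itemfreq has since been removed from SciPy; np.unique with return_counts=True gives the same table:
values, counts = np.unique(f, return_counts=True)
print(np.column_stack([values, counts]))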
Do you see how this feature clusters the pairs? Maybe you can guess a good threshold by looking at the values?
In fact, in other situations it may not be that obvious, but in general, to pick a threshold you only need to remember the score of your baseline submission and use that information.
Choose a threshold below:
pred = f > 14 # SET THRESHOLD HERE
pred
array([ True, False, True, ..., False, False, False], dtype=bool)
submission = test.loc[:,['pairId']]
submission['Prediction'] = pred.astype(int)
submission.to_csv('submission.csv', index=False)
I want to understand the idea behind this. How are we exploiting the leak using the test data only?
There's a hint in the article. The number of positive pairs should be 1000*N*(N−1)/2, while the number of all pairs is 1000*N*(1000*N−1)/2. Of course, the number of all pairs would be much, much larger if the test set had been sampled at random.
As the author mentions, after you evaluate your constant prediction of 1s on the test set, you can tell that the sampling was not done at random. The accuracy you obtain is 50%. Had the sampling been done correctly, this value should've been much lower.
Thus, they construct the incidence matrix and calculate the dot product (the measure of similarity) between the representations of the Id features. They then reuse the information about the accuracy obtained with constant predictions (50%) to choose the corresponding threshold (f > 14). It's set to "greater than 14" because that constitutes roughly half of the test set, which in turn maps back to the 50% accuracy.
The "magic" value didn't have to be greater than 14. It could have been equal to 14. You could have adjusted this value after some leader board probing (as long as you're capturing half of the test set).
It was observed that the test data was not sampled properly; same-class pairs were oversampled. Thus each pair in the test set has a much higher probability of having target=1 than any random pair. This led to the belief that one could construct a similarity measure based only on the pairs that are present in the test set, i.e., whether a pair made it into the test set is itself a strong indicator of similarity.
Using this insight one can calculate an incidence matrix and represent each Id j as a binary array (the i-th element representing the presence of the (i, j) pair in the test set, and thus a strong probability of similarity between them). This is a pretty accurate measure, allowing one to find the "similarity" between two rows just by taking their dot product.
The cutoff arrived at comes purely from knowledge of the target distribution found by leaderboard probing.
Let's say I am detecting dogs in images.
The output of my CNN is
Dense(24,activation="relu")
which means I want to detect up to 6 dogs (each dog is represented by xmin, ymin, xmax, ymax = 4 values; 4 * 6 = 24).
Now let's say I have two dogs in a picture and their positions (bounding boxes) are
dog1 = { xmin: 50, ymin:50, xmax:150,ymax:150}
dog2 = { xmin: 300,ymin:300,xmax:400,ymax:400}
Now the label for this picture would look something like
(50, 50, 150, 150, 300, 300, 400, 400, 0, 0, 0, ... 16 zeros)
Now what if my CNN outputs something like
(290, 285, 350, 350, 60, 40, 120, 110, ... 0 ...)
As you can see, the first bounding box the CNN outputs is closer to the bounding box of the second dog, and vice versa.
How should I deal with this?
I can create my own MSE function and output the smallest value, e.g.
import itertools
from keras import backend as K

def custom_mse(y_true, y_pred):
    # try every ordering of the predicted values and keep the smallest MSE
    tmp = 10000000000
    a = list(itertools.permutations(y_pred))
    for i in range(0, len(a)):
        t = K.mean(K.square(a[i] - y_true), axis=-1)
        if t < tmp:
            tmp = t
    return tmp
But this would only yield the "best" loss value; the weights would still get updated with respect to the wrong box ordering.
How can I modify the output of the CNN (permute or rearrange the elements) so this would work?
I hope I explained it clearly.
Thanks for the help.
Your issue lies in which objects you calculate the loss on.
TensorFlow, Keras, and almost any other such library use their own objects in order to compute derivatives and define the computation graph.
Therefore, if you need to do anything with the graph, you must do it using a defined op, or define your own op using the provided methods and objects. TensorFlow also allows you to wrap regular Python functions so they act as ops on tensors.
As for your problem, create a 2D output array of dims [4, num_of_objects] and use TensorFlow operations to reorder the second dimension before calculating the loss. See the full list of ops here. Split it along the second dimension, iterate over the combinations, use tf.reduce_min to find the minimum loss, and optimize only that minimum loss. I have experimented with that approach; it works, also with bounding boxes.
EDIT: I noted that you run your experiments and calculations with Keras. Use the TensorFlow backend and work only on tensors; do NOT retrieve data from the graph into numpy/list objects. Use only tensors.
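A minimal sketch of that idea, staying entirely on tensors (assumptions: a Keras/TensorFlow setup and MAX_BOXES = 6 boxes of 4 coordinates, matching the 24 outputs above; the number of permutations grows factorially, so this brute-force version is only practical for a handful of boxes):
import itertools
import tensorflow as tf

MAX_BOXES = 6  # matches the 6 * 4 = 24 outputs in the question

def permutation_invariant_mse(y_true, y_pred):
    # reshape the flat (batch, 24) vectors into (batch, MAX_BOXES, 4)
    y_true_boxes = tf.reshape(y_true, (-1, MAX_BOXES, 4))
    y_pred_boxes = tf.reshape(y_pred, (-1, MAX_BOXES, 4))
    losses = []
    # evaluate the MSE for every ordering of the predicted boxes
    for perm in itertools.permutations(range(MAX_BOXES)):
        permuted = tf.gather(y_pred_boxes, list(perm), axis=1)
        losses.append(
            tf.reduce_mean(tf.square(permuted - y_true_boxes), axis=[1, 2])
        )
    # keep, per sample, the ordering with the smallest error, so gradients
    # flow only through that assignment
    return tf.reduce_min(tf.stack(losses, axis=0), axis=0)
The graph stays differentiable because the loop only rearranges tensors with tf.gather and the selection is done by tf.reduce_min, so no values ever leave the graph.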
Good luck!
I have a high-dimensional word-bi-gram frequency matrix (1100 x 100658, dtype=int). As column names I'm setting the word bi-grams (like 'of-the', 'and-the', ...) with
myPandaDataFrame.columns = word-bi-grams
and as row index I use, for example, the proficiency (high, medium, low):
myPandaDataFrame.set_index(['PROFICIENCY'], inplace=True, drop=True)
Then I'm doing:
from sklearn.decomposition import PCA
x = 500
pcax = PCA(n_components=x)
pcax.fit(myPandaDataFrame)
PCA(copy=True, n_components=x, whiten=False)
existing_2dx = pcax.transform(myPandaDataFrame)
existing_df_2dx = pandas.DataFrame(existing_2dx)
existing_df_2dx.index = myPandaDataFrame.index
existing_df_2dx.columns = ['PC{0}'.format(i) for i in range(x)]
My first problem, where I think something is wrong, is that I can only set a maximum of 1100 components, which is the number of rows. I'm very new to PCA and have tried a couple of examples, but it seems I can't get it right for my matrix.
Does anyone see where I'm making a mistake, or can someone link to a tutorial/example that is similar to my problem? I would be very happy :)
With best regards
PCA decomposes the empirical data covariance matrix into eigenvalues and eigenvectors. This matrix has rank at most min(n_rows, n_columns). Beyond this number the eigenvalues become 0, so your data are entirely explained by the components up to that point; that many components reflect your data perfectly. In order to do any sort of dimensionality reduction you need to choose fewer components.
You can't have more components than the number of dimensions (the rank) of the space your matrix spans, which in turn is no larger than the minimum of the number of rows and columns (or less, if the matrix is not of full rank).
See the example below: with a matrix of size 500 x 10000, you can ask for 1,000 components and will get back 500, onto which you can then project your matrix, returning a 500 x 500 result:
df = pd.DataFrame(data=np.random.random(size=(500, 10000)))
df.info()
RangeIndex: 500 entries, 0 to 499
Columns: 10000 entries, 0 to 9999
dtypes: float64(10000)
memory usage: 38.1 MB
x = 1000
pca = PCA(n_components=x)
pca.fit(df)
pca.explained_variance_ratio_.shape
(500,)
existing_2dx = pca.transform(df)
existing_2dx.shape
(500, 500)