Implement Equal-Width Intervals feature engineering in Dask - scikit-learn

In equal-width discretization, the variable values are assigned to intervals of the same width. The number of intervals is user-defined and the width is determined by the minimum/maximum values and the number of intervals.
For example, given the values 10, 20, 100, 130 the minimum is 10 and the maximum is 130. If the user defines the number of intervals as six, given the formula:
Interval Width = (Max(x) - Min(x)) / N
The width is (130 - 10) / 6 = 20
And the edges of the six zero-based intervals are: [10, 30, 50, 70, 90, 110, 130]
Finally, the interval assignments are defined for each element in the dataset:
Value in the dataset    New feature engineered value
10                      0
20                      0
57                      2
101                     4
130                     5
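For illustration, this worked example can be reproduced with numpy's np.digitize (an added snippet, not part of the original question):
import numpy as np

values = np.array([10, 20, 57, 101, 130])
vmin, vmax, n_bins = 10, 130, 6
# seven edges define six equal-width intervals of width 20
edges = np.linspace(vmin, vmax, n_bins + 1)   # [10, 30, 50, 70, 90, 110, 130]
codes = np.digitize(values, edges[1:-1])      # interior edges -> codes 0..5
print(codes)                                  # [0 0 2 4 5]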
I have the following code that uses a pandas dataframe with an sklearn function to divide the dataframe into equal-width intervals:
from sklearn.preprocessing import KBinsDiscretizer
discretizer = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')
df['output_col'] = discretizer.fit_transform(df[['input_col']])
This works fine, but I need to implement an equivalent dask function that will trigger the process in parallel across multiple partitions, and I cannot find KBinsDiscretizer in dask_ml.preprocessing. Any suggestions? I cannot use map_partitions on its own because it would apply the function to each partition independently, and I need the intervals applied to the entire dataframe.

You're facing a common tradeoff with distributed workflows. Do you want to spend the time/resource/compute required to determine the exact min/max, which is a pre-requisite for the binning scheme you describe, or is an approximate answer alright? If the latter, how do you design an algorithm which adequately captures the data's min/max while remaining efficient?
We can start with the exact solution, since it's easier to implement. The key is simply to find the min and max first, then digitize the data. Note that this requires scanning the column twice. If persisting the data is an option (e.g. you are working with a distributed cluster or can fit the column to be binned in memory), it would help avoid unnecessary repetition:
import dask
import dask.dataframe
import numpy as np

def discretize_exact(
    s: dask.dataframe.Series, K: int
) -> dask.dataframe.Series:
    """
    Discretize values in a dask.dataframe.Series into K equal-width bins.

    Parameters
    ----------
    s : dask.dataframe.Series
        Series with values to be binned
    K : int
        Number of equal-width bins to generate

    Returns
    -------
    binned : dask.dataframe.Series
        dask.dataframe.Series with a scheduled np.digitize operation
        called using map_partitions. The values in ``binned`` will be
        in [0, K - 1], giving the index of the K bins spanning the
        interval [vmin, vmax].
    """
    # schedule the min/max computation
    vmin, vmax = s.min(), s.max()
    # compute vmin and vmax together so the column is only scanned once
    vmin, vmax = dask.compute(vmin, vmax)
    # K - 1 interior edges define K bins with open outer ends, so the
    # first bin is (-inf, vmin + step) and the last is [vmax - step, inf)
    bins = np.linspace(vmin, vmax, (K + 1))[1:-1]
    return s.map_partitions(
        np.digitize,
        bins=bins,
        meta=('binned', 'uint16'),
    )
This does (I think) what you're looking for, but it does involve computing the min and max prior to scheduling the binning operation. Using an example frame:
import dask.dataframe, pandas as pd, numpy as np

N = 10000
df = dask.dataframe.from_pandas(
    pd.DataFrame({'a': np.random.random(size=N)}),
    chunksize=1000,
)
We can use the above function to discretize our data:
In [68]: df['binned_a'] = discretize_exact(df['a'], K=10)
In [69]: df
Out[69]:
Dask DataFrame Structure:
a binned_a
npartitions=10
0 float64 uint16
1000 ... ...
... ... ...
9000 ... ...
9999 ... ...
Dask Name: assign, 40 tasks
In [70]: df.compute()
Out[70]:
a binned_a
0 0.548415 5
1 0.872668 8
2 0.466869 4
3 0.133986 1
4 0.833126 8
... ... ...
9995 0.223438 2
9996 0.575271 5
9997 0.922593 9
9998 0.030127 0
9999 0.204283 2
[10000 rows x 2 columns]
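If the column fits in (distributed) memory, persisting before the discretization avoids the double scan mentioned above. A minimal sketch, assuming a cluster or enough local memory is available:
# persist materializes the partitions, so the min/max pass and the
# digitize pass reuse them instead of recomputing the upstream graph
df = df.persist()
df['binned_a'] = discretize_exact(df['a'], K=10)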
Alternatively, you could try to approximate the bin edges. You could do this in a number of ways, including sampling the dataframe to identify the min/max of one or more partitions, or the user could provide a deliberately wide estimate of the range. Note that, depending on your workflow, computing the first partition may still involve computing a large part of the overall graph, or even the entire graph if e.g. the dataframe was reshuffled in a recent step.
def find_minmax_of_first_partition(
    s: dask.dataframe.Series
) -> tuple[float, float]:
    """
    Find the min and max of the first partition of a dask.dataframe.Series.
    """
    partition_0_stats = (
        s.partitions[0].compute().agg(['min', 'max'])
    )
    return (
        partition_0_stats['min'].item(),
        partition_0_stats['max'].item(),
    )
You could widen this range if desired, using your intuition about the spread of the values:
vmin_p0, vmax_p0 = find_minmax_of_first_partition(df['a'])
range_p0 = vmax_p0 - vmin_p0
mean_p0 = (vmin_p0 + vmax_p0) / 2
K = 10  # number of interior bins, as in the exact example above
# guess that the overall data lies within 10x the range of the first partition
min_est, max_est = mean_p0 - 5 * range_p0, mean_p0 + 5 * range_p0
# now bin all values using this estimated min/max. Note that any data
# falling outside the estimated range will be coded as 0 or K + 1.
bins = np.linspace(min_est, max_est, (K + 1))
binned = df['a'].map_partitions(
    np.digitize,
    bins=bins,
    meta=('binned', 'uint16'),
)
These bins will be equally spaced, but they will not necessarily start or end at the data's min/max, so they may fail to capture all of the data or leave empty bins at the edges. You may need to look at how your bin specification performs and iterate based on your data.
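One quick way to see how an estimated bin specification performs is to count the occupancy of each code (a sketch, assuming the binned series from the snippet above):
# codes 0 and K + 1 collect anything outside the estimated range, and
# empty edge bins show up as missing or zero counts
bin_counts = binned.value_counts().compute().sort_index()
print(bin_counts)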

Related

What's a potentially better algorithm to solve this python nested for loop than the one I'm using?

I have a nested loop that has to loop through a huge amount of data.
Assume a data frame of random values with 1,000,000 rows, where each row has an X, Y location in 2D space. A window of length 10 moves through all 1M rows one by one until all the calculations are done.
Explaining what the code is supposed to do:
Each row represents a coordinate in the X-Y plane.
r_test contains the diameters of the different circles of investigation in our 2D plane (X-Y plane).
For each window of 10 points/rows and for every single diameter in r_test, we compare the distance between every point and the remaining 9 points, and if the value is less than R we add 2 to H. Then we calculate H/(N**5) and store it in c_10 at the index corresponding to that diameter of investigation.
For this first window of 10 points, once the loop has gone through all the diameters in r_test, we read the slope of the fitted line and save it to S_wind[ii]. The first 9 data points have no value calculated for them, so they keep np.inf to be distinguished later.
Then the window moves one point down the rows and repeats this process until S_wind is filled.
What's a potentially better algorithm to solve this than the one I'm using, in Python 3.x?
Many thanks in advance!
import numpy as np
import pandas as pd

#### generating input data frame
df = pd.DataFrame(data=np.random.randint(2000, 6000, (1000000, 2)))
df.columns = ['X', 'Y']

#### ==== creating upper and lower bound for the diameter of the investigation circles
x_range = max(df['X']) - min(df['X'])
y_range = max(df['Y']) - min(df['Y'])
R = max(x_range, y_range) / 20
d = 2
N = 10  #### Number of points in each window
# r1 = 2*R*(1/N)**(1/d)
# r2 = (R)/(1+d)
# r_test = np.arange(r1, r2, 0.05)
## === avoiding generation of empty r_test
r1 = 80
r2 = 800
r_test = np.arange(r1, r2, 5)

S_wind = np.zeros(len(df['X'])) + np.inf
for ii in range(10, len(df['X'])):  #### maybe the code runs slower because of using len() instead of a number
    c_10 = np.zeros(len(r_test)) + np.inf
    H = 0
    C = 0
    N = 10  ##### maybe I should also remove this
    for ind in range(len(r_test)):
        for i in range(ii - 10, ii):
            for j in range(ii - 10, ii):
                dd = r_test[ind] - np.sqrt((df['X'][i] - df['X'][j])**2 + (df['Y'][i] - df['Y'][j])**2)
                if dd > 0:
                    H += 1
        c_10[ind] = H / (N**2)
    S_wind[ii] = np.polyfit(np.log10(r_test), np.log10(c_10), 1)[0]
You can use numpy broadcasting to eliminate all of the inner loops. I'm not sure if there's an easy way to get rid of the outermost loop, but the others are not too hard to avoid.
The inner loops are comparing ten 2D points against each other in pairs. That's just dying for using a 10x10x2 numpy array:
# replacing the `for ind` loop and its contents:
points = np.hstack((np.asarray(df['X'])[ii-10:ii, None], np.asarray(df['Y'])[ii-10:ii, None]))
differences = np.subtract(points[None, :, :], points[:, None, :]) # broadcast to 10x10x2
squared_distances = (differences * differences).sum(axis=2)
within_range = squared_distances[None,:,:] < (r_test*r_test)[:, None, None] # compare squares
c_10 = within_range.sum(axis=(1,2)).cumsum() * 2 / (N**2)
S_wind[ii] = np.polyfit(np.log10(r_test), np.log10(c_10), 1)[0] # this is unchanged...
I'm not very pandas savvy, so there's probably a better way to get the X and Y values into a single 2-dimensional numpy array. You generated the random data in the format that I'd find most useful, then converted into something less immediately useful for numeric operations!
Note that this code matches the output of your loop code. I'm not sure that's actually doing what you want it to do, as there are several slightly strange things in your current code. For example, you may not want the cumsum in my code, which corresponds to only re-initializing H to zero in the outermost loop. If you don't want the matches for smaller values of r_test to be counted again for the larger values, you can skip that sum (or equivalently, move the H = 0 line to in between the for ind and the for i loops in your original code).
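On the side note about getting the X and Y values into a single 2-D array, one common alternative (a sketch, not part of the original answer) is DataFrame.to_numpy:
# take the whole frame as an (N, 2) ndarray once, then slice the
# 10-row window from it inside the loop
xy = df[['X', 'Y']].to_numpy()
points = xy[ii - 10:ii]   # equivalent to the hstack construction above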

how to generate 100 equally spaced numbers drawn from a normal distribution using python

I generate random numbers using normal distribution in python using np.random.normal()
But is there any way to generate equally spaced numbers using normal distribution in python?
Use numpy.linspace() to generate equally spaced numbers in Python.
The code below will generate 50 equally spaced numbers (on the x-axis) of a normal distribution with mean 8 and standard deviation 3, covering from the 10% to the 90% area of the normal distribution.
import numpy as np
from scipy.stats import norm
x = np.linspace(norm(8,3).ppf(0.1), norm(8,3).ppf(0.9), 50)
To get the y-axis value of the normal distribution for each of the 50 equally spaced x-axis values, use the code below:
y = norm(8,3).pdf(x)
Details of the linspace() command:
numpy.linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None) returns num evenly spaced samples over the interval [start, stop]
start : start of the interval range
stop : end of the interval range
num : [int, optional] number of samples to generate, 50 by default
endpoint : [bool, optional] if True (the default), stop is the last sample
retstep : [bool, optional] if True, return (samples, step). By default retstep = False
dtype : type of the output array
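As a small added illustration of the evenly spaced output, using retstep to also return the step size:
import numpy as np

samples, step = np.linspace(0.0, 1.0, num=5, retstep=True)
print(samples)  # [0.   0.25 0.5  0.75 1.  ]
print(step)     # 0.25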

How to calculate the standard deviation from a histogram? (Python, Matplotlib)

Let's say I have a data set and used matplotlib to draw a histogram of said data set.
n, bins, patches = plt.hist(data, normed=1)
How do I calculate the standard deviation, using the n and bins values that hist() returns? I'm currently doing this to calculate the mean:
s = 0
for i in range(len(n)):
    s += n[i] * ((bins[i] + bins[i+1]) / 2)
mean = s / numpy.sum(n)
which seems to work fine as I get pretty accurate results. However, if I try to calculate the standard deviation like this:
t = 0
for i in range(len(n)):
    t += (bins[i] - mean)**2
std = np.sqrt(t / numpy.sum(n))
my results are way off from what numpy.std(data) returns. Replacing the left bin limits with the central point of each bin doesn't change this either. I have the feeling that the problem is that the n and bins values don't actually contain any information on how the individual data points are distributed within each bin, but the assignment I'm working on clearly demands that I use them to calculate the standard deviation.
You haven't weighted the contribution of each bin with n[i]. Change the increment of t to
t += n[i]*(bins[i] - mean)**2
By the way, you can simplify (and speed up) your calculation by using numpy.average with the weights argument.
Here's an example. First, generate some data to work with. We'll compute the sample mean, variance and standard deviation of the input before computing the histogram.
In [54]: x = np.random.normal(loc=10, scale=2, size=1000)
In [55]: x.mean()
Out[55]: 9.9760798903061847
In [56]: x.var()
Out[56]: 3.7673459904902025
In [57]: x.std()
Out[57]: 1.9409652213499866
I'll use numpy.histogram to compute the histogram:
In [58]: n, bins = np.histogram(x)
mids is the midpoints of the bins; it has the same length as n:
In [59]: mids = 0.5*(bins[1:] + bins[:-1])
The estimate of the mean is the weighted average of mids:
In [60]: mean = np.average(mids, weights=n)
In [61]: mean
Out[61]: 9.9763028267760312
In this case, it is pretty close to the mean of the original data.
The estimated variance is the weighted average of the squared difference from the mean:
In [62]: var = np.average((mids - mean)**2, weights=n)
In [63]: var
Out[63]: 3.8715035807387328
In [64]: np.sqrt(var)
Out[64]: 1.9676136767004677
That estimate is within 2% of the actual sample standard deviation.
The following answer is equivalent to Warren Weckesser's, but may be more familiar to those who prefer to think of the mean as an expected value:
counts, bins = np.histogram(x)
mids = 0.5*(bins[1:] + bins[:-1])
probs = counts / np.sum(counts)
mean = np.sum(probs * mids)
sd = np.sqrt(np.sum(probs * (mids - mean)**2))
Do take note that in certain contexts you may want the unbiased sample variance, where the sum is normalized by N - 1 rather than N.
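For completeness, a sketch of that unbiased (frequency-weighted) estimate, reusing counts, mids and mean from above (an added illustration, not part of the original answer):
# unbiased sample variance: normalize by N - 1 instead of N
n_total = np.sum(counts)
var_unbiased = np.sum(counts * (mids - mean)**2) / (n_total - 1)
sd_unbiased = np.sqrt(var_unbiased)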

Sklearn and PCA. Why is max n_row == max n_components?

I have a high-dimensional word-bi-gram frequency matrix (1100 x 100658, dtype=int). As column names I'm setting the word-bi-grams (like 'of-the', 'and-the', ...) with
myPandaDataFrame.columns = word-bi-grams
As the row index I use, for example, the proficiency (high, medium, low):
myPandaDataFrame.columns.set_index(['PROFICIENCY'], inplace=True, drop=True)
Then I'm doing:
from sklearn.decomposition import PCA
x = 500
pcax = PCA(n_components=x)
pcax.fit(myPandaDataFrame)
PCA(copy=True, n_components=x, whiten=False)
existing_2dx = pcax.transform(myPandaDataFrame)
existing_df_2dx = pandas.DataFrame(existing_2dx)
existing_df_2dx.index = myPandaDataFrame.index
existing_df_2dx.columns = ['PC{0}'.format(i) for i in range(x)]
My first problem, where I think something is wrong, is that I can set only a maximum of 1100 components. That is the number of existing rows. I'm very new to PCA and have tried a couple of examples, but it seems like I can't get it right for my matrix.
Does anyone see where I'm making a mistake, or can someone link to a tutorial/example similar to my problem? I would be very happy :)
With best regards
PCA decomposes the empirical data covariance matrix into eigenvalues and eigenvectors. This matrix has rank min(n_rows, n_columns). Beyond that number the eigenvalues become 0, so your data are entirely explained by the components up to that point; that number of components reflects your data perfectly. In order to do any sort of dimensionality reduction you need to choose fewer components.
You can't have more components than the number of dimensions (rank) of the space your matrix spans, which in turn is no larger than the minimum of the number of rows and columns (or less if the matrix is not of full rank).
See the below example: with a matrix of size 500 x 10000, you can ask for 1,000 components and will get back 500, on which you can then project your matrix, returning a 500 x 500 matrix:
df = pd.DataFrame(data=np.random.random(size=(500, 10000)))

df.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 500 entries, 0 to 499
# Columns: 10000 entries, 0 to 9999
# dtypes: float64(10000)
# memory usage: 38.1 MB

x = 1000
pca = PCA(n_components=x)
pca.fit(df)

pca.explained_variance_ratio_.shape
# (500,)

existing_2dx = pca.transform(df)
existing_2dx.shape
# (500, 500)
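The rank ceiling can also be checked directly with numpy (a small added illustration, separate from the example above):
import numpy as np

# a 5 x 1000 matrix spans at most a 5-dimensional space, so its rank
# (and hence the number of useful PCA components) cannot exceed
# min(n_rows, n_cols) = 5
A = np.random.random(size=(5, 1000))
print(np.linalg.matrix_rank(A))   # 5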

Return value from DataFrame at maximum of a function

For narrow banded processing I want the complex pressure at the peak frequency bin. To find the peak frequency bin I use the frequency with the highest absolute value, within a small range of frequencies.
I have come up with the following code, borrowing heavily from
Use idxmax for indexing in pandas
This seems bulky to me, and hard to generalize. Ideally I hope to be able to make fBins into an array and return many frequencies at once. It's OK to make maxAbsIndex into a list, but I can't see the next step.
import numpy as np
import pandas as pd
# Construct fake frequency data on multiple channels
np.random.seed(0)
numF = 1000
f = np.arange(numF) / (numF * 2)
y = np.random.randn(numF, 2) + 1j * np.random.randn(numF, 2)
# Put time series into a DataFrame, indexed by frequency
yFrame = pd.DataFrame(y, index = f)
fBins = 0.1
tol = 0.01
# Find the index of the maximum absolute value within a given frequency window
absMaxIndex = yFrame[(fBins - tol) : (fBins + tol)].abs().idxmax()
# Return the value at this index
value = [yFrame.ix[items[1], items[0]] for items in absMaxIndex.iteritems()]
print(value)
value should contain the complex values
[(-2.0946030712061448-1.0585718976053677j), (-2.7396771671895563+0.79204149842297422j)]
which have the largest absolute value in yFrame between 0.09 and 0.11 Hz for each channel.
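As a hedged sketch of one possible generalization, treating fBins as a list (the loop and names here are illustrative, not from the original post):
# pick the complex value at the peak-magnitude frequency for every
# channel, for several centre frequencies at once
fBins = [0.1, 0.2]
tol = 0.01
peaks = []
for fc in fBins:
    window = yFrame.loc[(fc - tol):(fc + tol)]
    peak_idx = window.abs().idxmax()   # frequency of the per-channel maximum
    peaks.append([window.loc[peak_idx[col], col] for col in window.columns])
print(peaks)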
