Clustering objects with weighted attributes

I want to cluster a set of objects which have multiple attributes, and some attributes are more important than others.
Is there a simple way to give these specific attributes a heavier weight so that they carry more importance than the others?

Every object in your set can be represented as a multidimensional vector (each attribute of the object is one component of the vector). You can then use distance-based clustering (the distance between similar vectors is small), such as k-means, for which you need to define your own distance function between vectors.
For example, suppose each of your objects has 3 attributes (X, Y, Z) and each attribute has a weight (importance) (wx, wy, wz).
You could then compare two vectors (X1, Y1, Z1) and (X2, Y2, Z2) with a weighted cosine similarity (1 minus this value can serve as the distance):
sim = (wx^2*X1*X2 + wy^2*Y1*Y2 + wz^2*Z1*Z2) / sqrt((wx^2*X1^2 + wy^2*Y1^2 + wz^2*Z1^2) * (wx^2*X2^2 + wy^2*Y2^2 + wz^2*Z2^2))
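As a concrete sketch (the vectors and weights below are purely illustrative), scaling each component by its weight before computing an ordinary cosine similarity reproduces the expression above:

import numpy as np

# Illustrative weights for the three attributes X, Y, Z (assumed values).
w = np.array([2.0, 1.0, 0.5])

def weighted_cosine_similarity(u, v, w):
    # Scaling each component by its weight gives the formula above:
    # numerator = sum(w_i^2 * u_i * v_i), denominator = product of the weighted norms.
    wu, wv = w * u, w * v
    return np.dot(wu, wv) / np.sqrt(np.dot(wu, wu) * np.dot(wv, wv))

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 1.0, 0.0])
print(weighted_cosine_similarity(u, v, w))

You could then plug 1 minus this value into a distance-based clustering routine as the pairwise distance.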

Related

Multiple regression correlation effect

I would like to investigate the effects of two independent variables on a dependent variable. Suppose we have independent variables X1 and X2, and a dependent variable Y.
I use two different approaches. In the first approach, to eliminate the effect of X1 on Y, I generate the conditional distribution of Y|X1 and perform a regression using the second variable X2. When I check the correlation between X2 and Y|X1, I obtain a relatively high correlation (R2 > 0.50). However, when I perform multiple regression over a wide range of data (X1 and X2), the effect of X2 on Y decreases and becomes insignificant. Why do these approaches give conflicting results? What is the most appropriate approach to determine the effect of X2 on Y for a given X1 value? Thanks.
It would be good to see the code, or the above in mathematical notation.
For instance: did you include the constant terms?
What do you see when you fit:
Y = B0 + B1X1 + B2X2
That will be the easiest to check, and B2 will probably give you what you want.
That model is still simple; you could explore something like:
Y = B0 + B1X1 + B2X2 + B3X1X2
or
Y = B0 + B1X1 + B2X2 + B3X1X2 + B4X1^2 + B5X2^2
Then see whether the coefficients change and whether any new coefficients become significant.
You could go further and explore Structural Equation Models.
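If you want to compare those specifications quickly, a minimal sketch with statsmodels formulas could look like the following; the data frame here is purely hypothetical and just stands in for your X1, X2 and Y:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data standing in for your variables.
rng = np.random.default_rng(1)
df = pd.DataFrame({"X1": rng.normal(size=200), "X2": rng.normal(size=200)})
df["Y"] = 1.0 + 2.0 * df["X1"] + 0.5 * df["X2"] + rng.normal(size=200)

# Y = B0 + B1X1 + B2X2
print(smf.ols("Y ~ X1 + X2", data=df).fit().summary())

# Y = B0 + B1X1 + B2X2 + B3X1X2 + B4X1^2 + B5X2^2
print(smf.ols("Y ~ X1 * X2 + I(X1**2) + I(X2**2)", data=df).fit().summary())

The coefficient on X2 and its p-value in each summary are what you would compare across the two specifications.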

Computing the area filled by matplotlib.pyplot.fill(...)

I'd like to compute the area inside a curve defined by two vectors a and b. For reference, the curve looks roughly like the output of pyplot.plot(a, b).
I saw that matplotlib has a fill function that lets you fill the area enclosed by the curve.
I'm wondering whether there's any way to obtain the filled area using that same function. It would be very useful, since the alternative I'm thinking of, numerical integration, is much more cumbersome.
Thank you for your time.
If you really want to find the area that was filled by matplotlib.pyplot.fill(a, b), you can use its output as follows:
import numpy
import matplotlib.pyplot

def computeArea(pos):
    # Shoelace formula for the area of a polygon given its vertices.
    x, y = zip(*pos)
    return 0.5 * numpy.abs(numpy.dot(x, numpy.roll(y, 1)) - numpy.dot(y, numpy.roll(x, 1)))

# pyplot.fill(a, b) returns a list of matplotlib.patches.Polygon.
polygon = matplotlib.pyplot.fill(a, b)

# The area of the first polygon can be computed as follows
# (you could also sum the areas of all polygons in the list):
print(computeArea(polygon[0].xy))
This method is based on this answer,
and it is not the most efficient one.
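For a self-contained check, here is a minimal usage sketch; a and b are assumed to trace a unit circle, so the printed area should come out close to pi:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical curve: a unit circle traced by 500 points.
t = np.linspace(0, 2 * np.pi, 500)
a, b = np.cos(t), np.sin(t)

polygon = plt.fill(a, b)                 # list of matplotlib.patches.Polygon
x, y = zip(*polygon[0].xy)               # vertices of the filled polygon

# Shoelace formula for the polygon area.
area = 0.5 * np.abs(np.dot(x, np.roll(y, 1)) - np.dot(y, np.roll(x, 1)))
print(area)                              # close to pi for the unit circle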

sklearn customized standardization of data

Suppose I have a 2D numpy array:
X = np.array([
    [..., ...],
    [..., ...]])
And I want to standardize the data either with:
X = StandardScaler().fit_transform(X)
or:
X = (X - X.mean())/X.std()
The results are different. Why are they different?
Assume X is a feature matrix of shape (n x m) (n instances and m features). We want to scale each feature so that its values have a mean of zero and unit variance.
To do this you need to calculate the mean and standard deviation of each feature over the provided instances (each column of X) and then scale the feature vectors. Currently you are calculating the mean and standard deviation of the whole dataset and scaling the data with those values, which gives meaningless results in all but a few special cases (e.g., X = np.ones((100, 2)) is such a special case).
Practically, to calculate these statistics per feature you need to set the axis parameter of the .mean() and .std() methods to 0. This performs the calculations along the columns and returns a (1 x m) shaped array (actually an (m,) array, but that's another story), where each value is the mean or standard deviation of the corresponding column. You can then use numpy broadcasting to scale the feature vectors correctly.
The below example shows how you can correctly implement it manually. x1 and x2 are 2 features with 100 training instances. We store them in a feature matrix X.
import numpy as np
from sklearn.preprocessing import StandardScaler

x1 = np.linspace(0, 100, 100)
x2 = 10 * np.random.normal(size=100)
X = np.c_[x1, x2]

# scale the data using the sklearn implementation
X_scaled = StandardScaler().fit_transform(X)

# scale the data taking the mean and std along the columns
X_scaled_manual = (X - X.mean(axis=0)) / X.std(axis=0)
If you print the two you will see that they match. Explicitly,
print(np.sum(X_scaled - X_scaled_manual))
returns 0.0 (up to floating-point precision).

Rendering Fractals: The Mobius Transformation and The Newtonian Basin

I understand how to render (two dimensional) "Escape Time Group" fractals (Julia and Mandelbrot), but I can't seem to get a Mobius Transformation or a Newton Basin rendered.
I'm trying to render them using the same method (by recursively using the polynomial equation on each pixel 'n' times), but I have a feeling these fractals are rendered using totally different methods. Mobius 'Transformation' implies that an image must already exist, and then be transformed to produce the geometry, and the Newton Basin seems to plot each point, not just points that fall into a set.
How are these fractals graphed? Are they graphed using the same iterative methods as the Julia and Mandelbrot?
Equations I'm Using:
Julia: Zn+1 = Zn^2 + C
Where Z is a complex number representing a pixel, and C is a complex constant (Correct).
Mandelbrot: Cn+1 = Cn^2 + Z
Where Z is a complex number representing a pixel, and C is the complex number (0, 0), and is compounded each step (The reverse of the Julia, correct).
Newton Basin: Zn+1 = Zn - (Zn^x - a) / (Zn^y - a)
Where Z is a complex number representing a pixel, x and y are exponents of various degrees, and a is a complex constant (Incorrect - creating a centered, eight legged 'line star').
Mobius Transformation: Zn+1 = (aZn + b) / (cZn + d)
Where Z is a complex number representing a pixel, and a, b, c, and d are complex constants (Incorrect, everything falls into the set).
So how are the Newton Basin and Mobius Transformation plotted on the complex plane?
Update: Mobius Transformations are just that: transformations.
"Every Möbius transformation is
a composition of translations,
rotations, zooms (dilations) and
inversions."
To perform a Mobius Transformation, a shape, picture, smear, etc. must be present already in order to transform it.
Now how about those Newton Basins?
Update 2: My math was wrong for the Newton Basin. The denominator at the end of the equation is (supposed to be) the derivative of the original function. The function can be understood by studying 'NewtonRoot.m' from the MIT MATLAB source code. A search engine can find it quite easily. I'm still at a loss as to how to graph it on the complex plane, though...
Newton Basin:
Zn+1 = Zn - f(Zn) / f'(Zn)
In the Mandelbrot and Julia sets you terminate the inner loop once |z| exceeds a certain threshold, as a measure of how fast the orbit "reaches" infinity:
if(|z| > 4) { stop }
For Newton fractals it is the other way round: since Newton's method usually converges towards a certain value, you are interested in how fast it reaches its limit. You can check this by stopping when the difference between two consecutive values drops below a small epsilon (usually 10^-9 is a good value):
if(|z[n] - z[n-1]| < epsilon) { stop }
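As a rough sketch of how this turns into a picture, assuming the concrete example f(z) = z^3 - 1 (so f'(z) = 3z^2): iterate Newton's step for every pixel, stop when two consecutive values are within epsilon, and color the pixel by the root it converged to (the iteration count can be used for shading):

import numpy as np
import matplotlib.pyplot as plt

# Example basin: f(z) = z**3 - 1, f'(z) = 3*z**2 (assumed for illustration).
roots_of_f = [np.exp(2j * np.pi * k / 3) for k in range(3)]   # the three cube roots of 1

def newton_basin(size=500, max_iter=50, eps=1e-9):
    xs = np.linspace(-2.0, 2.0, size)
    ys = np.linspace(-2.0, 2.0, size)
    root_index = np.zeros((size, size), dtype=int)
    iterations = np.zeros((size, size), dtype=int)
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            z = complex(x, y)
            for n in range(max_iter):
                z_next = z - (z**3 - 1) / (3 * z**2)   # Newton step
                if abs(z_next - z) < eps:              # converged: |z[n] - z[n-1]| < epsilon
                    z = z_next
                    break
                z = z_next
            iterations[i, j] = n
            root_index[i, j] = min(range(3), key=lambda k: abs(z - roots_of_f[k]))
    return root_index, iterations

root_index, iterations = newton_basin()
plt.imshow(root_index, extent=(-2, 2, -2, 2))          # color each pixel by the root it reached
plt.show()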

Clustering by assigning weights to the attributes

I have a data set in an Excel sheet which I need to cluster by assigning weights to the attributes. How can I do it?
You can define a function that computes the distance between two points taking the attribute weights into account. An example of this would be the weighted Euclidean distance.
Specifically, if each point in your dataset has k attributes and the corresponding weights are d1, d2, ..., dk, then the distance between two points X and Y is
d(X, Y) = sum(di * (Xi - Yi)^2) for i = 1, 2, ..., k, where Xi is the value of the ith attribute of the point X.
If the weights are the inverse of the variance of each attribute, this reduces to the Mahalanobis distance (with a diagonal covariance matrix):
http://en.wikipedia.org/wiki/Mahalanobis_distance
Once you define the distance function you can use K-means to cluster your data.
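A minimal sketch of one common way to do this with scikit-learn (the data and weights below are assumed, purely for illustration): multiplying each attribute by the square root of its weight makes the ordinary Euclidean distance used by k-means equal to the square root of the weighted distance d(X, Y) above, so the clustering respects the weights.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: 200 points with 3 attributes, exported from your Excel sheet.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

# Assumed attribute weights d1, d2, d3 (larger weight = more important attribute).
d = np.array([4.0, 1.0, 0.25])

# Multiplying each column by sqrt(d_i) turns plain Euclidean distance into
# sqrt(sum_i d_i * (X_i - Y_i)^2), i.e. the square root of the weighted distance above.
X_weighted = X * np.sqrt(d)

labels = KMeans(n_clusters=3, n_init=10).fit_predict(X_weighted)
print(labels[:10])

To get the data out of the Excel sheet in the first place, pandas.read_excel is one option.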
