I want to generate a multi-class test dataset for a classification problem using only NumPy.
For example, X is a NumPy array of shape (m, n), y has shape (m, 1), and there are k classes. Please help me with the code.
[Here X represents the features and y represents the labels.]
You can use np.random.randint like:
import numpy as np
m = 4
n = 4
k = 5
X = np.random.randint(0,2,(m,n))
X
array([[1, 1, 1, 1],
       [1, 0, 0, 1],
       [1, 1, 0, 0],
       [1, 1, 1, 1]])
y = np.random.randint(0,k,m)
y
array([3, 3, 0, 4])
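With purely random features and labels, X carries no information about y. If the features should depend on the class, one possible sketch is to draw each sample around a per-class centre. The function name make_blobs_numpy and the parameters spread and seed are hypothetical, chosen only for this illustration:

```python
import numpy as np

def make_blobs_numpy(m, n, k, spread=1.0, seed=0):
    """Draw m samples with n features from k Gaussian clusters (illustrative helper)."""
    rng = np.random.default_rng(seed)
    # One random centre per class, shape (k, n)
    centers = rng.uniform(-10, 10, size=(k, n))
    # Random class label for every sample, shape (m,)
    y = rng.integers(0, k, size=m)
    # Each sample is its class centre plus Gaussian noise
    X = centers[y] + spread * rng.standard_normal((m, n))
    return X, y.reshape(m, 1)

X, y = make_blobs_numpy(m=100, n=4, k=5)
print(X.shape, y.shape)  # (100, 4) (100, 1)
```

A smaller spread gives better-separated classes; a larger one makes the problem harder.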
You can create a multi-class dataset using NumPy as follows:
import numpy as np

def generate_dataset(size, classes=2, noise=0.5):
    # Generate random data points: each class gets its own interval of x values
    labels = np.random.randint(0, classes, size)
    x = (np.random.rand(size) + labels) / classes
    y = x + np.random.rand(size) * noise
    # Reshape the arrays to columns so they can be merged
    x = x.reshape(size, 1)
    y = y.reshape(size, 1)
    labels = labels.reshape(size, 1)
    # Merge the data into one (size, 3) array: x, y, label
    data = np.hstack((x, y, labels))
    return data
When plotted with matplotlib, the generated points lie along a noisy diagonal, with each class occupying its own interval of x values.
You can change the number of classes and the spread of the data with the classes and noise parameters. Here I have kept a linear relation between the x-axis and y-axis values, which can also be changed as required.
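As a usage sketch, the returned array can be split back into a feature matrix and a label column. The function is repeated here in compact form so the snippet runs on its own; the seed and the argument values are arbitrary choices for illustration:

```python
import numpy as np

def generate_dataset(size, classes=2, noise=0.5):
    # Same logic as the function above, written compactly
    labels = np.random.randint(0, classes, size)
    x = (np.random.rand(size) + labels) / classes
    y = x + np.random.rand(size) * noise
    return np.column_stack((x, y, labels))

np.random.seed(0)  # fixed seed only to make the run repeatable
data = generate_dataset(200, classes=3, noise=0.2)
X = data[:, :2]              # features: the x and y coordinates
y = data[:, 2:].astype(int)  # labels as an (m, 1) integer column
print(X.shape, y.shape)  # (200, 2) (200, 1)
```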
I am trying to convert each row of a 0/1 matrix to the integer it represents in binary. The code I have is the following:
import numpy as np

def generate_binary_matrix(matrix):
    result = []
    for i in matrix:
        # Build a string such as '0b010' and parse it as base 2
        val = '0b' + ''.join([str(x) for x in i])
        result.append(int(val, 2))
    result = np.array(result)
    return result

initial_matrix = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 1]])
result = generate_binary_matrix(initial_matrix)
print(result)
This code works, but it is very slow. Does anyone know a faster way to do it?
You can convert a 0/1 list to its integer value using just arithmetic, which should be faster (here i is one row of the matrix, as in your loop):
from functools import reduce
b = reduce(lambda r, x: 2*r + x, i)
Suppose your matrix is a NumPy array A with m rows and n columns.
Create a vector b with n elements by:
b = np.power(2, np.arange(n))[::-1]
Then your answer is A @ b.
Example:
import numpy as np
A = np.array([[0, 0, 1], [1, 0, 1]])
n = A.shape[1]
b = np.power(2, np.arange(n))[::-1]
print(A @ b)  # --> [1 5]
Update: I reversed b because the MSB (2^(n-1)) corresponds to A[:, 0], fixed the flipped arguments to power, and added an example.
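Applied to the matrix from the question, the same idea can be wrapped in a small function. This is only a sketch: the name rows_to_int is made up for illustration, and the weights are built with a bit shift instead of np.power, which gives the same values.

```python
import numpy as np

def rows_to_int(matrix):
    """Interpret each 0/1 row as a binary number, most significant bit first."""
    matrix = np.asarray(matrix)
    n = matrix.shape[1]
    # Weights 2^(n-1), ..., 2, 1 so that column 0 is the most significant bit
    weights = 1 << np.arange(n - 1, -1, -1)
    return matrix @ weights

initial_matrix = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 1]])
print(rows_to_int(initial_matrix))  # [2 4 1]
```

Because the weights are fixed-width integers, this approach overflows once the rows are longer than the integer type can hold (around 62 columns for 64-bit integers).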
I have a NumPy vector of shape (17520,), i.e. a single column, and I want to plot it using plt.scatter, but I don't know whether I should change its shape.
The vector contains only the values 0, 1 and 2, where each value is the cluster number assigned by hierarchical clustering.
After reading the Excel file and doing some pre-processing, here is my code:
plt.figure(figsize=(10, 7))
plt.title("Customer Dendograms")
dend = shc.dendrogram(shc.linkage(data1, method='ward'))
cluster = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='ward')
X=cluster.fit_predict(data1)
The variable X holds the NumPy vector that I want to plot, and its shape is (17520,).
I would appreciate any help with plotting the data so that the 0 values are red, the 1 values are blue, and the 2 values are green.
You can try this (assuming that data1 has shape (17520, 2) and X is a 1-D array of shape (17520,), as returned by fit_predict, so that X == 0 can be used as a row mask).
redClass = data1[X == 0]
blueClass = data1[X == 1]
greenClass = data1[X == 2]
plt.scatter(redClass[:, 0], redClass[:, 1], c='r')
plt.scatter(blueClass[:, 0], blueClass[:, 1], c='b')
plt.scatter(greenClass[:, 0], greenClass[:, 1], c='g')
plt.show()
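An alternative to three separate scatter calls is to map every label to a colour with NumPy indexing. A minimal sketch with made-up labels (the names palette and labels are illustrative and stand in for the real cluster vector X):

```python
import numpy as np

# Colour for label 0, 1 and 2 respectively
palette = np.array(['r', 'b', 'g'])
# Stand-in for the cluster labels returned by fit_predict
labels = np.array([0, 2, 1, 1, 0])
# Fancy indexing picks one colour per sample
colors = palette[labels]
print(colors)  # ['r' 'g' 'b' 'b' 'r']
```

The resulting array should then be usable as the c argument of a single call, e.g. plt.scatter(data1[:, 0], data1[:, 1], c=colors).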
I have a Theano tensor and I would like to clip its values, but with a different range for each index.
For example, if I have a vector [a, b, c], I want to clip a to [0, 1], b to [2, 3] and c to [3, 5].
How can I do that efficiently?
Thanks!
The theano.tensor.clip operation supports symbolic minimum and maximum values so you can pass three tensors, all of the same shape, and it will perform an element-wise clip of the first with respect to the second (minimum) and third (maximum).
This code shows two variations on this theme. v1 requires the minimum and maximum values to be passed as separate vectors while v2 allows the minimum and maximum values to be passed more like a list of pairs, represented as a two column matrix.
import theano
import theano.tensor as tt

def v1():
    # Minimum and maximum values passed as two separate vectors
    x = tt.vector()
    min_x = tt.vector()
    max_x = tt.vector()
    y = tt.clip(x, min_x, max_x)
    f = theano.function([x, min_x, max_x], outputs=y)
    print(f([2, 1, 4], [0, 2, 3], [1, 3, 5]))

def v2():
    # Minimum and maximum values passed as a two-column matrix of (min, max) pairs
    x = tt.vector()
    min_max = tt.matrix()
    y = tt.clip(x, min_max[:, 0], min_max[:, 1])
    f = theano.function([x, min_max], outputs=y)
    print(f([2, 1, 4], [[0, 1], [2, 3], [3, 5]]))

def main():
    v1()
    v2()

main()
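The element-wise semantics can be checked without Theano. As an illustration only, np.clip in NumPy accepts array bounds in the same way, so the input used above can be reproduced like this (a NumPy analogue, not the Theano code itself):

```python
import numpy as np

x = np.array([2, 1, 4])
lower = np.array([0, 2, 3])  # per-element minimum
upper = np.array([1, 3, 5])  # per-element maximum
# Each x[i] is clipped to the interval [lower[i], upper[i]]
clipped = np.clip(x, lower, upper)
print(clipped)  # [1 2 4]
```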
I'm trying to create the model shown below with PyMC 3 but can't figure out how to properly map probabilities to the observed data with a lambda function.
import numpy as np
import pymc as pm
data = np.array([[0, 0, 1, 1, 2],
[0, 1, 2, 2, 2],
[2, 2, 1, 1, 0],
[1, 1, 2, 0, 1]])
(D, W) = data.shape
V = len(set(data.ravel()))
T = 3
a = np.ones(T)
b = np.ones(V)
with pm.Model() as model:
    theta = [pm.Dirichlet('theta_%s' % i, a, shape=T) for i in range(D)]
    z = [pm.Categorical('z_%i' % i, theta[i], shape=W) for i in range(D)]
    phi = [pm.Dirichlet('phi_%i' % i, b, shape=V) for i in range(T)]
    w = [pm.Categorical('w_%i_%i' % (i, j),
                        p=lambda z=z[i][j], phi_=phi: phi_[z],  # Error is here
                        observed=data[i, j])
         for i in range(D) for j in range(W)]
The error I get is
AttributeError: 'function' object has no attribute 'shape'
In the model I'm attempting to build, the elements of z indicate which element in phi gives the probability of the corresponding observed value in data (placed in RV w). In other words,
P(data[i,j]) <- phi[z[i,j]][data[i,j]]
I'm guessing I need to define the probability with a Theano expression or use Theano as_op but I don't see how it can be done for this model.
You should specify your categorical p values as Deterministic objects before passing them on to w. Alternatively, the as_op implementation would look something like this:
# Assumes: import theano and import theano.tensor as t
@theano.compile.ops.as_op(itypes=[t.lscalar, t.dscalar, t.dscalar], otypes=[t.dvector])
def p(z=z, phi=phi):
    return [phi[z[i,j]] for i in range(D) for j in range(W)]
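To make the indexing rule from the question, P(data[i,j]) <- phi[z[i,j]][data[i,j]], concrete, here is a plain NumPy sketch. It is not a PyMC model: phi and z are simply sampled once with a fixed seed (an assumption made for illustration) so that the shapes and the lookup can be inspected.

```python
import numpy as np

data = np.array([[0, 0, 1, 1, 2],
                 [0, 1, 2, 2, 2],
                 [2, 2, 1, 1, 0],
                 [1, 1, 2, 0, 1]])
D, W = data.shape
T, V = 3, 3

rng = np.random.default_rng(0)
# One distribution over the V values per topic, shape (T, V)
phi = rng.dirichlet(np.ones(V), size=T)
# One topic index per observed value, shape (D, W)
z = rng.integers(0, T, size=(D, W))

# Probability vector selected for each observation, shape (D, W, V)
p_w = phi[z]
# Probability assigned to the value that was actually observed, shape (D, W)
lik = phi[z, data]
print(p_w.shape, lik.shape)  # (4, 5, 3) (4, 5)
```

In the model itself the same lookup has to be expressed on symbolic variables, which is what the Deterministic or as_op route is for.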
How should I best use scikit-learn for the following supervised classification problem (simplified), with binary features:
import numpy as np
from sklearn.tree import DecisionTreeClassifier
train_data = np.array([[0, 0, 1, 0],
                       [1, 0, 1, 1],
                       [0, 1, 1, 1]], dtype=bool)
train_targets = np.array([0, 1, 2])
c = DecisionTreeClassifier()
c.fit(train_data, train_targets)
# predict expects a 2-D array: one row per sample
p = c.predict(np.array([[1, 1, 1, 1]], dtype=bool))
print(p)
# -> [1]
That works fine. However, suppose now that I know a priori that the presence of feature 0 excludes class 1. Can additional information of this kind be easily included in the classification process?
Currently, I'm just doing some (problem-specific and heuristic) postprocessing to adjust the resulting class. I could perhaps also manually preprocess and split the dataset into two according to the feature, and train two classifiers separately (but with K such features, this ends up in 2^K splitting).
Can additional information of this kind be easily included in the classification process?
Domain-specific hacks are left to the user. The easiest way to do this is to predict probabilities...
>>> prob = c.predict_proba(X)
and then rig the probabilities to get the right class out.
>>> invalid = (prob[:, 1] == 1) & (X[:, 0] == 1)
>>> prob[invalid, 1] = -np.inf
>>> pred = c.classes_[np.argmax(prob, axis=1)]
That's -np.inf instead of 0 so the 1 label doesn't come up as a result of tie-breaking vs. other zero-probability classes.
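The masking step can be tried without a trained classifier. In the sketch below the probabilities and the classes array are made up to stand in for c.predict_proba(X) and c.classes_. As a variation on the snippet above, it masks class 1 for every row where feature 0 is present, not only rows where the predicted probability is exactly 1:

```python
import numpy as np

classes = np.array([0, 1, 2])  # stands in for c.classes_
X = np.array([[1, 1, 1, 1],
              [0, 1, 1, 1],
              [1, 0, 0, 0]], dtype=bool)
# Made-up class probabilities, one row per sample (stands in for predict_proba)
prob = np.array([[0.0, 1.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.3, 0.4, 0.3]])

# Rows where feature 0 is present must never be assigned class 1
excluded = X[:, 0]
masked = prob.copy()
masked[excluded, 1] = -np.inf
pred = classes[np.argmax(masked, axis=1)]
print(pred)  # [0 1 0]
```

When the remaining classes tie, np.argmax returns the first of them, so a problem-specific tie-breaking rule may still be needed.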