Spark FP-growth is not giving multiple items in the consequent - python-3.x

I am using Spark's FP-growth algorithm. I have set minSupport and minConfidence to 0, so I expect to get all combinations.
from pyspark.ml.fpm import FPGrowth

df = spark.createDataFrame([
    (0, [1, 2, 5]),
    (1, [1, 2, 3, 5]),
    (2, [1, 2])
], ["id", "items"])

fpGrowth = FPGrowth(itemsCol="items", minSupport=0.0, minConfidence=0.0)
model = fpGrowth.fit(df)

# Display the generated association rules.
model.associationRules.show()
The first problem is that my consequent always contains only one element.
For example, [1] -> [5, 2] should appear in the output: freq([1]) is 3, freq([5, 2]) is 2, and freq([5, 2, 1]) is 2, so conf([1] -> [5, 2]) = 2/3 and this rule should show up.

The Spark implementation is such that it only ever returns one element in the consequent.
You can check this in the source:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala
// the consequent always contains only one element
itemSupport.get(consequent.head))
This is from the MLlib package (the ML package uses the MLlib implementation).
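If you need rules with multi-item consequents, one workaround is to compute them yourself from model.freqItemsets, since that table does contain the multi-item itemsets. A minimal sketch, assuming the fitted model from the question:
# Sketch: build the confidence of a multi-item-consequent rule from the
# frequent itemsets, since associationRules only emits single-item consequents.
# Assumes `model` is the fitted FPGrowthModel from the question.
freq = {tuple(sorted(row["items"])): row["freq"]
        for row in model.freqItemsets.collect()}

antecedent = (1,)
consequent = (2, 5)
union = tuple(sorted(set(antecedent) | set(consequent)))

# conf([1] -> [2, 5]) = freq([1, 2, 5]) / freq([1]) = 2 / 3 on the example data
confidence = freq[union] / freq[antecedent]
print(confidence)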

Related

Getting Concordance result of lifelines CoxPH model in a dataframe

I am using the CoxPH implementation from the lifelines package in Python. Currently the results are shown as a tabular view of coefficients and related stats and can be seen with print_summary(). Here is an example:
import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({'duration': [4, 6, 5, 5, 4, 6],
                   'event': [0, 0, 0, 1, 1, 1],
                   'cat': [0, 1, 0, 1, 0, 1]})

cph = CoxPHFitter()
cph.fit(df, duration_col='duration', event_col='event', show_progress=True)
cph.print_summary()
out[]: (table of results from print_summary())
How can I get only the concordance index as a dataframe or list? cph.summary
returns a dataframe of the main results, i.e. p-values and coefficients, but it does not include the concordance index and other surrounding information.
You can access the c-index with cph.concordance_index_, and you can put this into a list or dataframe if you wish.
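For example, a small sketch using the fitted cph from the question:
import pandas as pd

# concordance_index_ is a plain float on the fitted model
c_index = cph.concordance_index_

# wrap it in a list or a one-row dataframe if you need that form
c_index_df = pd.DataFrame({"concordance_index": [c_index]})
print(c_index_df)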
You can also compute the concordance index for the Cox model with the small script below, using lifelines.utils.concordance_index.
from lifelines.utils import concordance_index

# here df is assumed to have a duration column 'T' and an event column 'E'
cph = CoxPHFitter().fit(df, 'T', 'E')
Cindex = concordance_index(df['T'], -cph.predict_partial_hazard(df), df['E'])
This gives the C-index value, which also matches cph.concordance_index_.

pyspark--FPGrowth: how does transform work on unseen transactions?

I am using pyspark.ml.fpm.FPGrowth in Spark 2.4, and I have a question about how exactly transform works on transactions that are new.
My understanding is that model.transform will take each transaction X and find all Y such that
Conf(X-->Y) > minConfidence. It will then return the list of such Y ordered by confidence.
However, suppose there is no transaction that contains X, so Conf(X-->Y) is undefined for all Y; I am unsure how the algorithm will transform such a transaction.
This is a simple set of transactions taken from the docs:
DF = spark.createDataFrame([
    (0, [1, 2, 5]),
    (1, [1, 2, 3, 5]),
    (2, [1, 4])
], ["id", "items"])

fpGrowth = FPGrowth(itemsCol="items", minSupport=0, minConfidence=0)
model = fpGrowth.fit(DF)
Then we supply a simple transaction as test data:
test_DF = spark.createDataFrame([
    (0, [4, 5])
], ["id", "items"])

model.transform(test_DF).show()
+---+------+----------+
|num| items|prediction|
+---+------+----------+
| 1|[4, 5]| [1, 3, 2]|
+---+------+----------+
Does anyone know how the prediction [1,3,2] was generated?
I think FPGrowthModel.transform applies the rules mined by FPGrowth to the transactions: whenever it finds an itemset X in a transaction and there is a rule X => Y, it suggests the item Y in the prediction column for that transaction.
However, I noticed that when a transaction contains both X and Y, it returns [] in the prediction column unless there is a rule X & Y => Z, in which case it suggests Z instead.
That makes it hard to evaluate the model with an accuracy metric :(
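In other words, the prediction appears to be the union of the consequents of every rule whose antecedent is a subset of the transaction, minus the items already in the transaction. A minimal sketch that reproduces the [1, 3, 2] prediction by applying the mined rules by hand, assuming the fitted model from the question:
# Sketch: rebuild the prediction column by applying the association rules manually.
rules = model.associationRules.collect()
basket = {4, 5}

prediction = set()
for rule in rules:
    # a rule fires when its whole antecedent is contained in the transaction
    if set(rule["antecedent"]).issubset(basket):
        prediction |= set(rule["consequent"])

# items already present in the transaction are dropped from the prediction
prediction -= basket
print(prediction)  # {1, 2, 3} for the [4, 5] transaction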

python numpy stack matrices and add specific corner/column entries

Say we have two matrices A and B, each of size 2 by 2. Is there a command that can stack them horizontally, adding A[:,1] to B[:,0], so that the resulting matrix C is 2 by 3 with C[:,0] = A[:,0], C[:,1] = A[:,1] + B[:,0], and C[:,2] = B[:,1]? One step further, stack them on the diagonal so that C[0:2,0:2] = A, C[1:3,1:3] = B, and C[1,1] = A[1,1] + B[0,0]; C is 3 by 3 in this case. Hard-coding this routine is not hard, but I'm just curious, since MATLAB has a similar function if my memory serves me well.
A straightforward approach is to copy or add the two arrays into a target:
In [882]: A = np.arange(4).reshape(2,2)
In [883]: C = np.zeros((2,3), int)
In [884]: C[:,:-1] = A
In [885]: C[:,1:] += A   # or B
In [886]: C
Out[886]:
array([[0, 1, 1],
       [2, 5, 3]])
Another approach is to pad A at the end, pad B at the start, and sum; while there is a convenient pad function, it won't be any faster.
And for the diagonal:
In [887]: C = np.zeros((3,3), int)
In [888]: C[:-1,:-1] = A
In [889]: C[1:,1:] += A
In [890]: C
Out[890]:
array([[0, 1, 0],
       [2, 3, 1],
       [0, 2, 3]])
Again, the two arrays could be padded and added.
I'm not aware of any specialized function to do this; even if there were one, it would probably do the same thing. This isn't a common enough operation to justify a compiled version.
I have built up finite-element sparse matrices by adding overlapping element matrices. The sparse formats for both MATLAB and scipy facilitate this (duplicate coordinates are summed).
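As a small illustration of that summing behavior with scipy.sparse (the coordinates below just place the same 2x2 A twice, overlapping at (1, 1)):
import numpy as np
from scipy import sparse

# duplicate (row, col) coordinates are summed when the COO matrix is densified,
# which is what makes assembling overlapping element matrices convenient
rows = np.array([0, 0, 1, 1, 1, 1, 2, 2])
cols = np.array([0, 1, 0, 1, 1, 2, 1, 2])
vals = np.array([0, 1, 2, 3, 0, 1, 2, 3])   # A at [0:2,0:2], then A again at [1:3,1:3]

C = sparse.coo_matrix((vals, (rows, cols)), shape=(3, 3)).toarray()
print(C)
# [[0 1 0]
#  [2 3 1]
#  [0 2 3]]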
============
In [896]: np.pad(A,[[0,0],[0,1]],mode='constant') + np.pad(A,[[0,0],[1,0]],mode='constant')
Out[896]:
array([[0, 1, 1],
       [2, 5, 3]])
In [897]: np.pad(A,[[0,1],[0,1]],mode='constant') + np.pad(A,[[1,0],[1,0]],mode='constant')
Out[897]:
array([[0, 1, 0],
       [2, 3, 1],
       [0, 2, 3]])
What's the special MATLAB code for doing this?
In Octave I found:
prepad(A,3,0,2) + postpad(A,3,0,2)

How can I fetch data from this array and use a condition to build clusters by minimum distance in Python?

I want to implement k-means in Python, but I don't know how to take the minimum over the Euclidean distances.
I have calculated the data for 3 clusters;
this is my result array:
[array([4, 5], dtype=int64), 4.1231056256176606, 0,
array([4, 8], dtype=int64), 4.4721359549995796, 0,
array([14, 23], dtype=int64), 22.022715545545239, 0,
array([4, 5], dtype=int64), 1.0, 1,
array([4, 8], dtype=int64), 2.0, 1,
array([14, 23], dtype=int64), 19.723082923316021, 1]
here is my code:
from sklearn.metrics.pairwise import euclidean_distances

cluster = []
for i in range(len(centroidrandom)):
    for j in range(3):
        jarak_ = euclidean_distances(data[j], centroidrandom[:][i])  # jarak = distance
        cluster.append(data[j])
        cluster.append(jarak_[0][0])
        cluster.append(i)
print(cluster)
Here is some example code for kmeans clustering with three clusters, modified from the example given in the comment above:
from pylab import plot, show
from numpy import vstack, array
from numpy.random import rand
from scipy.cluster.vq import kmeans, vq

# data generation for three sets of data
data = vstack((rand(150,2) + array([.5,.5]), rand(150,2), rand(150,2) + array([0,.5])))

# computing K-Means with K = 3 (3 clusters)
centroids,_ = kmeans(data, 3)

# assign each sample to a cluster
idx,_ = vq(data, centroids)
print(idx)

# some plotting using numpy's logical indexing
plot(data[idx==0,0], data[idx==0,1], 'ob',
     data[idx==1,0], data[idx==1,1], 'or',
     data[idx==2,0], data[idx==2,1], 'oy')
plot(centroids[:,0], centroids[:,1], 'sg', markersize=8)
show()
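If you want to see the "minimum distance" step itself rather than letting vq do it, the idea is: for each point, compute the Euclidean distance to every centroid and take the index of the smallest one. A small sketch (the points and centroids below are made up for illustration):
import numpy as np

points = np.array([[4, 5], [4, 8], [14, 23]])
centroids = np.array([[1, 2], [5, 6], [13, 20]])   # hypothetical centroids

# distances[i, k] = Euclidean distance from point i to centroid k
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

# the cluster of each point is the centroid with the smallest distance
labels = np.argmin(distances, axis=1)
print(labels)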

scikit-learn: Get selected features for prediction data

I have a training set of data. The Python script that creates the model also calculates the attributes into a numpy array (it's a bit vector). I then want to use VarianceThreshold to eliminate all features that have zero variance (e.g. all 0 or all 1). I then run get_support(indices=True) to get the indices of the selected columns.
My issue now is how to get only the selected features for the data I want to predict. I first calculate all features and then use array indexing, but it does not work:
x_predict_all = getAllFeatures(suppl_predict)
x_predict = x_predict_all[indices] #only selected features
indices is a numpy array.
The returned array x_predict has the correct length len(x_predict), but the wrong shape: x_predict.shape[1] is still the original number of features. My classifier then throws an error due to the wrong shape:
prediction = gbc.predict(x_predict)
  File "C:\Python27\lib\site-packages\sklearn\ensemble\gradient_boosting.py", line 1032, in _init_decision_function
    self.n_features, X.shape[1]))
ValueError: X.shape[1] should be 1855, not 2090.
How can I solve this issue?
You can do it like this:
Test data
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 2, 0, 3],
              [0, 1, 4, 3],
              [0, 1, 1, 3]])
selector = VarianceThreshold()
Alternative 1
>>> selector.fit(X)
>>> idxs = selector.get_support(indices=True)
>>> X[:, idxs]
array([[2, 0],
       [1, 4],
       [1, 1]])
Alternative 2
>>> selector.fit_transform(X)
array([[2, 0],
       [1, 4],
       [1, 1]])
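To apply the same selection to the data you want to predict on, either reuse the fitted selector or index the columns with the saved indices (a sketch, using x_predict_all from the question):
# reuse the fitted selector on the new feature matrix ...
x_predict = selector.transform(x_predict_all)

# ... or index the columns explicitly (note the ':' for the rows;
# x_predict_all[indices] selects rows, which is why the shape was wrong)
x_predict = x_predict_all[:, idxs]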
