pyspark--FPGrowth: how does transform work on unseen transactions?

I am using pyspark.ml.fpm.FPGrowth in Spark 2.4 and I have a question about how exactly transform works on new transactions.
My understanding is that model.transform takes each transaction X and finds all Y such that
Conf(X-->Y) > minConfidence. It then returns the list of such Y ordered by confidence.
However, suppose no training transaction contains X, so Conf(X-->Y) is undefined for all Y. I am unsure how the algorithm transforms such a transaction.
This is a simple set of transactions taken from the docs:
from pyspark.ml.fpm import FPGrowth
DF = spark.createDataFrame([
    (0, [1, 2, 5]),
    (1, [1, 2, 3, 5]),
    (2, [1, 4])
], ["id", "items"])
fpGrowth = FPGrowth(itemsCol="items", minSupport=0, minConfidence=0)
model = fpGrowth.fit(DF)
Then we supply a simple transaction as test data:
test_DF = spark.createDataFrame([
(0, [4,5])
], ["id", "items"])
test_DF = spark.createDataFrame(baskets, schema=schema)
model.transform(test_DF).show()
+---+------+----------+
|num| items|prediction|
+---+------+----------+
| 1|[4, 5]| [1, 3, 2]|
+---+------+----------+
Does anyone know how the prediction [1,3,2] was generated?

I think FPGrowthModel.transform applies the rules mined by FPGrowth to the transactions: whenever it finds an itemset X in a transaction and there is a rule X => Y, it suggests the item Y in the prediction column for that transaction.
However, I noticed that when a transaction already contains both X and Y, it returns [ ] in the prediction column, unless there is a rule X & Y => Z, in which case it suggests Z instead.
That makes it hard to evaluate the model with an accuracy metric :(
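The behaviour described above can be illustrated without Spark. This is a minimal plain-Python sketch, not Spark's actual code: the function name `predict` and the hand-written rule list are assumptions for illustration. It applies every rule whose antecedent is contained in the transaction and keeps only consequent items the transaction does not already have. The order of the result here simply follows the rule list and is not meant to reproduce Spark's ordering.

```python
def predict(items, rules):
    """Collect consequents of all applicable rules, skipping items already present.

    `rules` is a list of (antecedent, consequent) pairs; hypothetical helper,
    not part of pyspark.
    """
    itemset = set(items)
    prediction = []
    for antecedent, consequent in rules:
        # A rule applies when the transaction contains its whole antecedent.
        if set(antecedent) <= itemset:
            for item in consequent:
                if item not in itemset and item not in prediction:
                    prediction.append(item)
    return prediction


# Hand-written subset of the rules the training data above can yield
# with minSupport=0 and minConfidence=0 (assumed for illustration).
rules = [([4], [1]), ([5], [1]), ([5], [2]), ([5], [3])]

print(predict([4, 5], rules))        # [1, 2, 3]
print(predict([1, 2, 3, 5], rules))  # []
```

Under this reading, the unseen transaction [4, 5] gets a prediction because the single-item antecedents [4] and [5] each match a rule, even though no training transaction contains both 4 and 5.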

Related

Build edges with similarity value from adjacency matrix using NetworkX?

I have built a graph from nodes like:
import pandas as pd
data = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'a': [55, 2123, -19.3, 9, -8],
    'b': [21, -0.1, 0.003, 4, 2.1]
})
import networkx as nx
G = nx.Graph()
for i, attr in data.set_index('id').iterrows():
    G.add_node(i, **attr.to_dict())
I have calculated similarity matrix (by excluding the id column).
from sklearn.metrics.pairwise import cosine_similarity
# Calculate the pairwise cosine similarities
S = cosine_similarity(data.drop('id', axis=1))
T = S.tolist()
df = pd.DataFrame.from_records(T)
Here is my adj matrix:
adj_mat = pd.DataFrame(df.to_numpy(), index=data['id'], columns=data['id'])
Now, how can I "attach" and connect the nodes using this adj_mat? For example, I want the node with id = 1 to connect to the node with id = 2 by an edge whose similarity attribute equals the similarity calculated in the adjacency matrix.
Please advise how to do it.

Solved by first building the graph from the adjacency matrix:
G = nx.from_pandas_adjacency(adj_mat)
Then removing the self-loops and looping over my node data to update the nodes with their attributes:
G.remove_edges_from(nx.selfloop_edges(G))
for i, attr in data.set_index('id').iterrows():
    G.add_node(i, **attr.to_dict())
Hope it will help others :)

Is it possible to chain iterators on a list in Python?

I have a list with negative and positive numbers:
a = [10, -5, 30, -23, 9]
And I want to get two lists of tuples of the values and their indexes in the original list so I did this.
b = list(enumerate(a))
positive = list(filter(lambda x: x[1] > 0, b))
negative = list(filter(lambda x: x[1] < 0, b))
But it feels kind of bad casting to list in between. Is there a way to chain these iterators immediately?

The iterator returned by enumerate can only be iterated once so that is why you have to cast it to a list or a tuple if you want to reuse the functionality of it multiple times.
If you don't want to cast it or find that looping over the list twice is too inefficient for your liking you might want to consider a standard for loop, since it allows you to append to both lists at the same time in only a single iteration of the original list:
a = [10, -5, 30, -23, 9]
positive, negative = [], []
for idx, val in enumerate(a):
    if val == 0:
        continue
    (positive if val > 0 else negative).append((idx, val))
print(f"Positive: {positive}")
print(f"Negative: {negative}")
Output:
Positive: [(0, 10), (2, 30), (4, 9)]
Negative: [(1, -5), (3, -23)]
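If the goal is mainly to avoid the explicit list(enumerate(a)) step, one alternative worth considering is itertools.tee, which splits a single iterator into independent copies. A minimal sketch, using the same list as above:

```python
from itertools import tee

a = [10, -5, 30, -23, 9]

# tee() turns the one-shot enumerate iterator into two independent iterators.
pos_iter, neg_iter = tee(enumerate(a))

positive = [(idx, val) for idx, val in pos_iter if val > 0]
negative = [(idx, val) for idx, val in neg_iter if val < 0]

print(positive)  # [(0, 10), (2, 30), (4, 9)]
print(negative)  # [(1, -5), (3, -23)]
```

Note that tee buffers items internally until every copy has consumed them, so when one copy is exhausted before the other starts (as here) it saves no memory compared with building the list first. It only removes the explicit cast.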

GRU/LSTM in Keras with input sequence of varying length

I'm working on a small project to better understand RNNs, in particular LSTM and GRU. I'm not at all an expert, so please bear that in mind.
The problem I'm facing is given as data in the form of:
>>> import numpy as np
>>> import pandas as pd
>>> pd.DataFrame([[1, 2, 3],[1, 2, 1], [1, 3, 2],[2, 3, 1],[3, 1, 1],[3, 3, 2],[4, 3, 3]], columns=['person', 'interaction', 'group'])
person interaction group
0 1 2 3
1 1 2 1
2 1 3 2
3 2 3 1
4 3 1 1
5 3 3 2
6 4 3 3
This is just for illustration. We have different persons interacting with different groups in different ways. I've already encoded the various features. The last interaction of a user is always a 3, which means selecting a certain group. In the short example above, person 1 chooses group 2, person 2 chooses group 1, and so on.
My whole data set is much bigger, but I would like to understand the conceptual part first before throwing models at it. The task I would like to learn is: given a sequence of interactions, which group is chosen by the person? More concretely, I would like the output to be a list of all groups (there are 3 groups: 1, 2, 3) sorted by the most likely choice, followed by the second and third likeliest group. The loss function is therefore a mean reciprocal rank.
I know that in Keras, GRUs/LSTMs can handle variable-length input. So my three questions are:
The input is of the format:
(samples, timesteps, features)
writing high level code:
import keras.layers as L
import keras.models as M
model_input = L.Input(shape=(?, None, 2))
timestep=None should imply the varying size and 2 is for the feature interaction and group. But what about the samples? How do I define the batches?
For the output, I'm a bit puzzled about what it should look like in this example. I think for each last interaction of a person I would like to have a list of length 3. Assuming I've set up the output
model_output = L.LSTM(3, return_sequences=False)
I then want to compile it. Is there a way of using the mean reciprocal rank?
model.compile('adam', '?')
I know the questions are fairly high level, but I would like to understand first the big picture and start to play around. Any help would therefore be appreciated.

The concept you've drawn in your question is a pretty good start already. I'll add a few things to make it work, as well as a code example below:
You can specify LSTM(n_hidden, input_shape=(None, 2)) directly, instead of inserting an extra Input layer; the batch dimension is to be omitted for the definition.
Since your model is going to perform some kind of classification (based on time series data), the final layer is what we'd expect from "normal" classification as well: a Dense(num_classes, activation='softmax'). Chaining the LSTM and the Dense layer together will first pass the time series input through the LSTM layer and then feed its output (whose size is determined by the number of hidden units) into the Dense layer. activation='softmax' computes a class score for each class (we're going to use one-hot encoding in a data preprocessing step, see code example below). The class scores are not returned in ranked order, but you can always rank them via np.argsort or pick the top class via np.argmax.
Categorical crossentropy loss is suited for comparing the classification score, so we'll use that one: model.compile(loss='categorical_crossentropy', optimizer='adam').
Since the number of interactions, i.e. the length of the model input, varies from sample to sample, we'll use a batch size of 1 and feed in one sample at a time.
The following is a sample implementation based on the above considerations. Note that I modified your sample data a bit, in order to provide more "reasoning" behind group choices. Also, each person needs to perform at least one interaction before choosing a group (i.e. the input sequence cannot be empty); if this is not the case for your data, then introducing an additional no-op interaction (e.g. 0) can help.
import pandas as pd
import tensorflow as tf
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.LSTM(10, input_shape=(None, 2))) # LSTM for arbitrary length series.
model.add(tf.keras.layers.Dense(3, activation='softmax')) # Softmax for class probabilities.
model.compile(loss='categorical_crossentropy', optimizer='adam')
# Example interactions:
# * 1: Likes the group,
# * 2: Dislikes the group,
# * 3: Chooses the group.
df = pd.DataFrame([
[1, 1, 3],
[1, 1, 3],
[1, 2, 2],
[1, 3, 3],
[2, 2, 1],
[2, 2, 3],
[2, 1, 2],
[2, 3, 2],
[3, 1, 1],
[3, 1, 1],
[3, 1, 1],
[3, 2, 3],
[3, 2, 2],
[3, 3, 1]],
columns=['person', 'interaction', 'group']
)
data = [person[1][['interaction', 'group']].values for person in df.groupby('person')]
x_train = [x[:-1] for x in data]
y_train = tf.keras.utils.to_categorical([x[-1, 1]-1 for x in data]) # Expects class labels from 0 to n (-> subtract 1).
print(x_train)
print(y_train)
class TrainGenerator(tf.keras.utils.Sequence):
    def __init__(self, x, y):
        self.x = x
        self.y = y
    def __len__(self):
        return len(self.x)
    def __getitem__(self, index):
        # Need to expand arrays to have batch size 1.
        return self.x[index][None, :, :], self.y[index][None, :]
# Note: fit_generator is deprecated since TensorFlow 2.1; model.fit accepts the same generator there.
model.fit_generator(TrainGenerator(x_train, y_train), epochs=1000)
pred = [model.predict(x[None, :, :]).ravel() for x in x_train]
for p, y in zip(pred, y_train):
    print(p, y)
And the corresponding sample output:
[...]
Epoch 1000/1000
3/3 [==============================] - 0s 40ms/step - loss: 0.0037
[0.00213619 0.00241093 0.9954529 ] [0. 0. 1.]
[0.00123938 0.99718493 0.00157572] [0. 1. 0.]
[9.9632275e-01 7.5039308e-04 2.9268670e-03] [1. 0. 0.]
Using custom generator expressions: According to the documentation we can use any generator to yield the data. The generator is expected to yield batches of the data and loop over the whole data set indefinitely. When using tf.keras.utils.Sequence we do not need to specify the parameter steps_per_epoch as this will default to len(train_generator). Hence, when using a custom generator, we shall provide this parameter as well:
import itertools as it
model.fit_generator(((x_train[i % len(x_train)][None, :, :],
y_train[i % len(y_train)][None, :]) for i in it.count()),
epochs=1000,
steps_per_epoch=len(x_train))
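The question also asked for a ranked list of groups and a mean reciprocal rank. A minimal plain-Python sketch of both is given here, computed on the softmax scores after prediction as an evaluation metric, while training still uses categorical crossentropy as above. The helper names `rank_classes` and `mean_reciprocal_rank` are made up for illustration, and the scores are copied from the sample output.

```python
def rank_classes(scores):
    """Return 1-based group labels sorted from most to least likely."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i + 1 for i in order]


def mean_reciprocal_rank(all_scores, true_labels):
    """Average of 1 / (rank of the true group) over all samples."""
    total = 0.0
    for scores, label in zip(all_scores, true_labels):
        rank = rank_classes(scores).index(label) + 1
        total += 1.0 / rank
    return total / len(true_labels)


# Softmax outputs taken from the sample run above.
predictions = [
    [0.00213619, 0.00241093, 0.9954529],
    [0.00123938, 0.99718493, 0.00157572],
    [0.99632275, 0.00075039308, 0.0029268670],
]
true_groups = [3, 2, 1]  # groups chosen by persons 1, 2, 3 in the modified data

print(rank_classes(predictions[0]))                    # [3, 2, 1]
print(mean_reciprocal_rank(predictions, true_groups))  # 1.0
```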

Spark fp growth is not giving multiple items in consequent

I am using the Spark FP-growth algorithm. I have set minSupport and minConfidence to 0, so I should get all combinations.
from pyspark.ml.fpm import FPGrowth
df = spark.createDataFrame([
(0, [1, 2, 5]),
(1, [1, 2, 3, 5]),
(2, [1, 2])
], ["id", "items"])
fpGrowth = FPGrowth(itemsCol="items", minSupport=0.0, minConfidence=0.0)
model = fpGrowth.fit(df)
# Display generated association rules.
model.associationRules.show()
The first problem is that my consequent always contains only one element.
[1] -> [5, 2] should be a sample output: the frequency of [1] is 3, the frequency of [5, 2] is 2, and the frequency of [5, 2, 1] is 2, so this rule should appear.

Spark's implementation only generates rules with a single item in the consequent.
You can check this in the source linked below.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala
//the consequent contains always only one element
itemSupport.get(consequent.head))
This is from the MLlib package (the ML package uses the MLlib implementation).
Cheers,
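If a multi-item consequent is still needed, its confidence can be computed by hand from support counts, since Conf(X -> Y) = freq(X ∪ Y) / freq(X). This is a minimal plain-Python sketch on the three transactions from the question. The helper names are hypothetical, and it scans the raw transactions rather than using anything from Spark.

```python
# The three transactions from the question, as sets.
transactions = [{1, 2, 5}, {1, 2, 3, 5}, {1, 2}]


def support_count(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def confidence(antecedent, consequent):
    """Conf(antecedent -> consequent) = freq(antecedent | consequent) / freq(antecedent)."""
    return support_count(antecedent | consequent) / support_count(antecedent)


print(support_count({1}))        # 3
print(support_count({1, 2, 5}))  # 2
print(confidence({1}, {2, 5}))   # 2/3, about 0.667
```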

scikit learn: train_test_split, can I ensure same splits on different datasets

I understand that the train_test_split method splits a dataset into random train and test subsets. And using random_state=int can ensure we have the same splits on this dataset for each time the method is called.
My problem is slightly different.
I have two datasets, A and B. They contain identical sets of examples, and the order in which these examples appear in each dataset is also identical. The key difference is that the examples in each dataset use a different set of features.
I would like to test whether the features used in A lead to better performance than the features used in B. So I would like to ensure that when I call train_test_split on A and B, I get the same splits on both datasets so that the comparison is meaningful.
Is this possible? Do I simply need to ensure the random_state in both method calls for both datasets are the same?
Thanks

Yes, random state is enough.
>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X2 = np.hstack((X,X))
>>> X_train, X_test, _, _ = train_test_split(X,y, test_size=0.33, random_state=42)
>>> X_train2, X_test2, _, _ = train_test_split(X2,y, test_size=0.33, random_state=42)
>>> X_train
array([[4, 5],
[0, 1],
[6, 7]])
>>> X_train2
array([[4, 5, 4, 5],
[0, 1, 0, 1],
[6, 7, 6, 7]])
>>> X_test
array([[2, 3],
[8, 9]])
>>> X_test2
array([[2, 3, 2, 3],
[8, 9, 8, 9]])

Looking at the code for the train_test_split function, it sets the random seed inside the function at every call. So it will result in the same split every time. We can check that this works pretty simply
import numpy as np
from sklearn import model_selection
X1 = np.random.random((200, 5))
X2 = np.random.random((200, 5))
y = np.arange(200)
X1_train, X1_test, y1_train, y1_test = model_selection.train_test_split(X1, y,
                                                                        test_size=0.1,
                                                                        random_state=42)
# Split the second feature set (X2, not X1) with the same random_state.
X2_train, X2_test, y2_train, y2_test = model_selection.train_test_split(X2, y,
                                                                        test_size=0.1,
                                                                        random_state=42)
print(np.all(y1_train == y2_train))
print(np.all(y1_test == y2_test))
Which outputs:
True
True
Which is good! Another way of doing this problem is to create one training and test split on all your features and then split your features up before training. However if you're in a weird situation where you need to do both at once (sometimes with similarity matrices you don't want test features in your training set), then you can use the StratifiedShuffleSplit function to return the indices of the data that belongs to each set. For example:
n_splits = 1
sss = model_selection.StratifiedShuffleSplit(n_splits=n_splits,
test_size=0.1,
random_state=42)
train_idx, test_idx = list(sss.split(X, y))[0]
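The index-based idea can also be sketched without scikit-learn, to show why sharing indices guarantees aligned splits. This is a plain-Python illustration using the standard random module, not what train_test_split does internally. The helper `split_indices` and the toy datasets A and B are assumptions.

```python
import random


def split_indices(n_samples, test_size, seed):
    """Shuffle row indices once and cut them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_size))
    return indices[n_test:], indices[:n_test]


# Two toy datasets with the same examples in the same order, different features.
A = [[i, i * 2] for i in range(10)]
B = [[i * 10] for i in range(10)]

train_idx, test_idx = split_indices(len(A), test_size=0.3, seed=42)

# Applying the same index lists to both datasets keeps the rows aligned.
A_train, A_test = [A[i] for i in train_idx], [A[i] for i in test_idx]
B_train, B_test = [B[i] for i in train_idx], [B[i] for i in test_idx]

print(len(A_train), len(A_test))  # 7 3
```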

Since sklearn.model_selection.train_test_split(*arrays, **options) accepts a variable number of arrays, you can simply pass both datasets in one call:
A_train, A_test, B_train, B_test, _, _ = train_test_split(A, B, y,
test_size=0.33,
random_state=42)

As mentioned above, you can use the random_state parameter.
But if you want the same results globally, i.e. to fix the random state for all future calls, you can seed NumPy's global generator:
np.random.seed(42)  # any fixed integer
