How do I add extra features to the array created by doc2Vec model.infer_vector? - doc2vec

I am new to NLP and doc2Vec.
I used doc2vec to generate an array for each document.
I want to use the array and extra features (e.g. Income) as features for another model, such as logistic regression. How do I combine the document vectors with the extra features?
def get_vectors(model, tagged_docs):
    sents = tagged_docs.values
    targets, regressors = zip(*[(doc.tags[0], model.infer_vector(doc.words)) for doc in sents])
    return targets, regressors

model = Doc2Vec(dm=0, vector_size=300, negative=5, hs=0, min_count=2, sample=0, workers=cores)
model.build_vocab(train_tagged.values)
model.train(train_tagged.values, total_examples=len(train_tagged.values), epochs=1)
y_train_doc, X_train_doc = get_vectors(model, train_tagged)
print(X_train_doc)
(array([ 0.168, -0.36 , -0.13], dtype=float32),
array([ 0.185, 0.17, 0.04], dtype=float32),....)
X_train_doc is a tuple of arrays. So for each array, do I put each element into a separate column of a DataFrame, like below?
doc | Income | doc_feature1 | doc_feature2| doc_feature3 |
1 | 10000 | 0.168 | -0.36 | -0.13 |
2 | 500 | 0.185 | 0.17 | 0.04 |

That's going to depend on exactly which downstream libraries/models you're using. In general, you wouldn't want to re-introduce the overhead of pandas DataFrames; downstream models are more likely to consume raw numpy arrays.
If using scikit-learn pipelines, the FeatureUnion class may be of use:
https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html
If instead you've got, say…
a numpy array of 10000 rows with 300 dimensions each for your 10000 elements' doc-vectors, then
10000 rows of 5 dimensions each for your 10000 elements' other features, in the same order
…then you may want to concatenate those horizontally, into 10000 rows of 305 dimensions each, using something like the numpy hstack function (one of a variety of options):
https://numpy.org/doc/stable/reference/generated/numpy.hstack.html#numpy.hstack
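For instance, a minimal sketch of the hstack route, using made-up values and the hypothetical name extra_feats for the additional per-document columns (Income, etc.), kept in the same row order as the inferred vectors:
import numpy as np

doc_vecs = np.array(X_train_doc)              # shape: (n_docs, vector_size)
extra_feats = np.array([[10000.0],            # hypothetical extra features,
                        [500.0]])             # one row per document
X_train = np.hstack([doc_vecs, extra_feats])  # shape: (n_docs, vector_size + n_extra)
The stacked array can then go straight into a scikit-learn model such as LogisticRegression.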

Related

Concatenate non-image feature vector (dimension 3x1) with the image feature vector (dimension Nx1) at the flattened layer of a CNN & feed it to a dense layer

I am working with a chest X-ray image dataset containing 15 classes. The image filenames are stored in a CSV file along with some non-image values. The image dataset is split into train, test, and validation sets. I used ImageDataGenerator to augment the images.
| Image Index      | Patient Gender | View Position |
|------------------|----------------|---------------|
| 00008236_001.png | 1              | 0             |
| 00016410_014.png | 0              | 1             |
| 00014751_001.png | 1              | 0             |
| 00020318_012.png | 1              | 1             |
[CSV file containing non-image features (Patient Gender and chest X-ray View Position are encoded as {0,1})]
I want to concatenate these two columns' values with the flattened layer of the CNN.
I tried the following code, but it showed an error.
train_set_features = train_set[['View Position','Patient Gender']]
input_features = train_set_features.values # shape = (90771, 2)
from keras.applications import *
from keras.layers import GlobalAveragePooling2D, Dense, Dropout, Flatten,Concatenate
from keras.models import Sequential
base_model = MobileNet( include_top=False,input_shape=(224,224,3))
x = base_model.output
x = Flatten()(x) #output shape = (None,7168)
non_image_features = Input(shape=[2,], name="non_image") #output shape = (None,2)
x= concatenate([x, non_image_features]) #output shape = (None,7170)
# and a logistic layer
predictions = Dense(15, activation="sigmoid",name='visualized_layer')(x)
model = Model(inputs=[base_model.input,non_image_features], outputs=predictions)
opt = Adam(learning_rate=0.001)
model.compile(optimizer=opt, loss='binary_crossentropy',metrics=['binary_accuracy','mae'])
history = model.fit_generator([train_generator, input_features],
                              validation_data=valid_generator,
                              steps_per_epoch=100,
                              validation_steps=25,
                              epochs=64)
predicted_values = model.predict_generator(test_generator, steps = len(test_generator))
Is this the right way to concatenate these values with the flattened layer?
There are a few problems with your code. First, you call concatenate, but you never imported such a function; what you did import is the Concatenate layer.
Second, if you call that layer you need to do it in a different way. Check the documentation to see how it works (look for the usage examples): https://keras.io/api/layers/merging_layers/concatenate/
Lastly, if you still experience problems, please include the exact error you are receiving so we can work out what is wrong.
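For instance, with the class already imported in the question, the call would look roughly like this (a sketch against the question's code, not a complete fix):
# Sketch: using the imported Concatenate layer class instead of the undefined
# lowercase concatenate; x and non_image_features are the tensors from the question.
x = Concatenate()([x, non_image_features]) # output shape = (None, 7170)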
First of all, as already pointed out, it's Concatenate, not concatenate.
From the documentation, fit_generator expects a generator, not a list like [train_generator, input_features]. The generator is expected to yield a tuple (inputs, targets) at every iteration, and each batch (generated at each iteration) usually contains around 64 or 128 training examples.
In your case the first argument is most likely just train_generator, and not [train_generator, input_features].
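Since the model here takes two inputs, whatever generator you end up passing has to yield both inputs together with the labels. A rough sketch of one way to wrap the existing train_generator (a hypothetical helper, assuming the generator was built with shuffle=False so its batches stay aligned row-for-row with input_features):
def multi_input_generator(image_gen, features):
    # Hypothetical wrapper: pairs each (image_batch, label_batch) from the Keras
    # generator with the matching rows of the non-image feature array.
    i = 0
    while True:
        image_batch, label_batch = next(image_gen)
        n = len(label_batch)
        feature_batch = features[i:i + n]
        i = 0 if i + n >= len(features) else i + n
        yield [image_batch, feature_batch], label_batch

history = model.fit_generator(
    multi_input_generator(train_generator, input_features),
    steps_per_epoch=100,
    epochs=64)
validation_data would need a similar wrapper if the validation images also come with non-image features.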

What type of data does the predict method of a LinearRegression instance from sklearn receive as a parameter?

I am working through a linear regression example with scikit-learn, but I am confused about the predict method.
In scikit-learn you will see this:
my_Linear_Model.predict(self, X)
Parameters:
X : array_like or sparse matrix, shape (n_samples, n_features)
Samples.
Note: array_like does not give me enough information about what type of data the predict method can receive. Remember that with pandas we deal with Series and DataFrame objects.
I want to know the different types of array the predict method could receive.
For linear regression in scikit-learn, your columns need to be of a numeric type (int or float); other types cannot be read.
If you have a DataFrame df:
df
   A  B   C  target
   1  1   1       1
   2  3  -1      10
you can select your array X directly as columns from your DataFrame:
X = df[['A', 'B', 'C']]
Your target variable y you can also select from the DataFrame:
y = df[['target']]
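To make the "array_like" part concrete, here is a small sketch (with made-up numbers) showing that, once fitted, predict accepts a pandas DataFrame, a NumPy array, or a plain nested list, as long as the shape is (n_samples, n_features):
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({'A': [1, 2], 'B': [1, 3], 'C': [1, -1], 'target': [1, 10]})
X = df[['A', 'B', 'C']]
y = df['target']            # a 1-D Series; the 2-D df[['target']] above also works

model = LinearRegression().fit(X, y)

model.predict(X)            # pandas DataFrame
model.predict(X.values)     # NumPy array of shape (n_samples, n_features)
model.predict([[3, 2, 0]])  # plain nested list with the same number of columns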

Bunch object not callable - scikit-learn rcv1 dataset

I want to split the built-in RCV1 dataset into train and test sets and apply the k-means algorithm. However, while trying to split the data, I get an error saying the Bunch object is not callable.
from sklearn.datasets import fetch_rcv1
rcv1 = fetch_rcv1()
x_train = rcv1(subset='train')
Indeed it is not; nor is it a dataframe - see the docs. Some extra info is included in the DESCR attribute:
from sklearn.datasets import fetch_rcv1
rcv1 = fetch_rcv1()
print(rcv1.DESCR)
Result:
.. _rcv1_dataset:
RCV1 dataset
------------
Reuters Corpus Volume I (RCV1) is an archive of over 800,000 manually
categorized newswire stories made available by Reuters, Ltd. for research
purposes. The dataset is extensively described in [1]_.
**Data Set Characteristics:**
============== =====================
Classes 103
Samples total 804414
Dimensionality 47236
Features real, between 0 and 1
============== =====================
:func:`sklearn.datasets.fetch_rcv1` will load the following
version: RCV1-v2, vectors, full sets, topics multilabels::
>>> from sklearn.datasets import fetch_rcv1
>>> rcv1 = fetch_rcv1()
It returns a dictionary-like object, with the following attributes:
``data``:
The feature matrix is a scipy CSR sparse matrix, with 804414 samples and
47236 features. Non-zero values contains cosine-normalized, log TF-IDF vectors.
A nearly chronological split is proposed in [1]_: The first 23149 samples are
the training set. The last 781265 samples are the testing set. This follows
the official LYRL2004 chronological split. The array has 0.16% of non zero
values::
>>> rcv1.data.shape
(804414, 47236)
``target``:
The target values are stored in a scipy CSR sparse matrix, with 804414 samples
and 103 categories. Each sample has a value of 1 in its categories, and 0 in
others. The array has 3.15% of non zero values::
>>> rcv1.target.shape
(804414, 103)
``sample_id``:
Each sample can be identified by its ID, ranging (with gaps) from 2286
to 810596::
>>> rcv1.sample_id[:3]
array([2286, 2287, 2288], dtype=uint32)
``target_names``:
The target values are the topics of each sample. Each sample belongs to at
least one topic, and to up to 17 topics. There are 103 topics, each
represented by a string. Their corpus frequencies span five orders of
magnitude, from 5 occurrences for 'GMIL', to 381327 for 'CCAT'::
>>> rcv1.target_names[:3].tolist() # doctest: +SKIP
['E11', 'ECAT', 'M11']
The dataset will be downloaded from the `rcv1 homepage`_ if necessary.
The compressed size is about 656 MB.
.. _rcv1 homepage: http://jmlr.csail.mit.edu/papers/volume5/lewis04a/
.. topic:: References
.. [1] Lewis, D. D., Yang, Y., Rose, T. G., & Li, F. (2004).
RCV1: A new benchmark collection for text categorization research.
The Journal of Machine Learning Research, 5, 361-397.
So, if you want to stick to the original training & test subsets, as described above, you should simply do:
X_train = rcv1.data[:23149]
X_train.shape
# (23149, 47236)
X_test = rcv1.data[23149:]
X_test.shape
# (781265, 47236)
and similarly for your y_train and y_test, using rcv1.target.
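For example, a small sketch mirroring the same 23149-row boundary on the labels:
y_train = rcv1.target[:23149]
y_test = rcv1.target[23149:]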
If you want to use a different training & test partition, use:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    rcv1.data, rcv1.target, test_size=0.33, random_state=42)
adjusting your test_size accordingly.

how to "normalize" vectors values when using Spark CountVectorizer?

CountVectorizer and CountVectorizerModel often create a sparse feature vector that looks like this:
(10,[0,1,4,6,8],[2.0,1.0,1.0,1.0,1.0])
This basically says that the total size of the vocabulary is 10, the current document has 5 unique elements, and in the feature vector these 5 unique elements take positions 0, 1, 4, 6 and 8. Also, one of the elements shows up twice, hence the 2.0 value.
Now, I would like to "normalize" the above feature vector and make it look like this,
(10,[0,1,4,6,8],[0.3333,0.1667,0.1667,0.1667,0.1667])
i.e., each value is divided by 6, the total number of all elements together. For example, 0.3333 = 2.0/6.
So is there a way to do this efficiently here?
Thanks!
You can use Normalizer
class pyspark.ml.feature.Normalizer(*args, **kwargs)
Normalize a vector to have unit norm using the given p-norm.
with 1-norm
from pyspark.ml.linalg import SparseVector
from pyspark.ml.feature import Normalizer
df = spark.createDataFrame([
    (SparseVector(10, [0, 1, 4, 6, 8], [2.0, 1.0, 1.0, 1.0, 1.0]), )
], ["features"])
Normalizer(inputCol="features", outputCol="features_norm", p=1).transform(df).show(1, False)
# +--------------------------------------+---------------------------------------------------------------------------------------------------------------------+
# |features |features_norm |
# +--------------------------------------+---------------------------------------------------------------------------------------------------------------------+
# |(10,[0,1,4,6,8],[2.0,1.0,1.0,1.0,1.0])|(10,[0,1,4,6,8],[0.3333333333333333,0.16666666666666666,0.16666666666666666,0.16666666666666666,0.16666666666666666])|
# +--------------------------------------+---------------------------------------------------------------------------------------------------------------------+

Behaviour of train_test_split() from Scikit-learn

I am curious how the train_test_split() method of Scikit-learn will behave in the following scenario:
An imaginary dataset:
id, count, size
1, 4, 8
2, 5, 9
3, 6, 0
say I would divide it into two separate sets like this (keeping 'id' in both):
id, count | id, size
1, 4 | 1, 8
2, 5 | 2, 9
3, 6 | 3, 0
And split them both with train_test_split() with the same random_state of 0. Would the order of both be the same with 'id' as reference? (since you are shuffling the same dataset but with different parts left out)
I am curious how this works because I have two models. The first one is trained on the dataset and adds its results to the dataset, part of which is then used to train the second model.
When doing this, it's important that no data points used to test the generalization of the second model were also used to train the first model. That data has been 'seen before', the model will know what to do with it, and you would no longer be testing generalization to new data.
It would be great if train_test_split() shuffled them the same way, since then one would not need to keep track of which data was used to train the first model in order to prevent contamination of the test results.
They should have the same resulting indices if you use the same random_state parameter in each call.
However, you could also just reverse your order of operations: call train_test_split on the parent dataset, then create the two column subsets from the resulting train and test sets.
Example:
print(df)
id count size
0 1 4 8
1 2 5 9
2 3 6 0
from sklearn.model_selection import train_test_split
dfa = df[['id', 'count']].copy()
dfb = df[['id', 'size']].copy()
rstate = 123
traina, testa = train_test_split(dfa, random_state=rstate)
trainb, testb = train_test_split(dfb, random_state=rstate)
assert traina.index.equals(trainb.index) # passes: both splits selected the same rows
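And a sketch of the reverse-order approach mentioned above: split the parent DataFrame once, then carve the two column subsets out of the resulting train and test frames, so both views are guaranteed to cover exactly the same rows:
train, test = train_test_split(df, random_state=123)
traina, trainb = train[['id', 'count']], train[['id', 'size']]
testa, testb = test[['id', 'count']], test[['id', 'size']]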
