I am doing the following:
import spacy
nlp = spacy.load("en")
doc = nlp('Hello Stack Over Flow, my name is Steve')
and I get the following doc.vector:
In [1]: doc = nlp('Hello Stack Over Flow, my name is Steve')
In [2]: doc.vector
Out[2]:
array([ 1.67874452e-02, 1.43885329e-01, -1.64147541e-01, -3.52525562e-02,
1.71078995e-01, 5.81666678e-02, 1.42294103e-02, -1.58536658e-01,
-1.17119223e-01, 1.00338888e+00, -1.03455082e-01, 5.80027774e-02,
5.08872233e-02, -2.64734793e-02, -4.76809964e-02, -3.61649990e-02,
-4.25985567e-02, 4.86545563e-01, -5.22996634e-02, 2.66118869e-02,
-7.14791119e-02, 2.33504437e-02, -1.01438001e-01, 1.78358995e-03,
6.41188920e-02, -1.93965547e-02, -1.72182247e-02, -4.99197766e-02,
3.82994451e-02, 2.89904438e-02, 1.10834874e-01, 1.07230783e-01,
1.72666041e-03, 9.85269994e-02, -2.64622234e-02, 1.47332232e-02,
1.49853658e-02, -3.25594470e-02, -2.28943750e-02, -6.28201067e-02,
-4.13866527e-03, 4.12439965e-02, -1.09200180e-03, -3.77365127e-02,
3.02788876e-02, -2.47912239e-02, -3.86282206e-02, -8.49756673e-02,
8.79433304e-02, -7.35666696e-03, -2.35625561e-02, 1.29868105e-01,
-8.24742168e-02, 3.79751101e-02, 6.52077794e-03, 4.12433175e-03,
-4.44555469e-03, -8.54532197e-02, 4.30566669e-02, -4.90945578e-02,
1.08687999e-02, -3.58653292e-02, 3.19277793e-02, 1.70548886e-01,
7.04367757e-02, -1.03306666e-01, -6.25603348e-02, -4.16669573e-05,
-9.90156457e-03, 4.87144403e-02, -6.59128875e-02, 2.21944507e-03,
6.23853356e-02, -1.16886329e-02, -2.20711138e-02, 1.35971338e-01,
5.85511066e-02, -2.78507806e-02, -4.42699976e-02, 1.22686662e-01,
-4.96295579e-02, 8.47733300e-03, -1.72136649e-02, 3.73593345e-02,
1.38313353e-01, -1.81285888e-01, 8.07836726e-02, -1.01186670e-01,
1.90296680e-01, -8.37400090e-03, -4.79855575e-02, 4.62987460e-02,
4.97333193e-03, 1.08253332e-02, 1.37178123e-01, -4.36927788e-02,
-9.02644824e-03, 2.52826661e-02, -2.60283332e-02, 7.33327791e-02,
-4.21555527e-02, -9.45088938e-02, -2.36399993e-02, -2.59645544e-02,
-1.17972204e-02, -7.21249953e-02, -1.62978880e-02, 4.46572453e-02,
8.05888604e-03, 1.73073336e-02, -1.11245394e-01, -1.35631096e-02,
4.26412188e-02, -1.24742221e-02, -4.93782237e-02, -3.84650044e-02,
9.32500139e-03, -2.58344412e-02, 5.39288903e-03, -2.51024440e-02,
-1.68177821e-02, 1.81681886e-02, 6.95144460e-02, 5.96744493e-02,
1.28178876e-02, 8.18611085e-02, 2.03688871e-02, -1.45592675e-01,
-2.97091678e-02, 1.67966553e-03, 2.56901123e-02, -1.57507751e-02,
-3.29821557e-02, 3.69144455e-02, 2.69458871e-02, -7.87097737e-02,
-3.22544426e-02, 9.35557822e-04, 2.51506642e-02, -1.39920013e-02,
-5.63631117e-01, 1.28184333e-01, 8.25011209e-02, 4.69026715e-02,
-2.58401129e-02, 3.11454497e-02, 7.81277791e-02, -1.18433349e-02,
2.19431128e-02, 2.38199951e-03, -2.19482221e-02, 5.75609989e-02,
1.32304668e-01, 4.28974479e-02, -1.32128010e-02, 4.54772264e-02,
-9.00077820e-02, -7.34564438e-02, -8.14672261e-02, -5.10835573e-02,
-3.27358916e-02, 2.09213328e-02, 5.85612208e-02, -2.49340013e-02,
-1.03430830e-01, -1.28346771e-01, 4.52880040e-02, 5.96577907e-03,
1.12773672e-01, -3.90797779e-02, -5.79966642e-02, 4.97789842e-05,
2.49000057e-03, -2.88800001e-02, -9.96003374e-02, 3.41123343e-02,
-3.62301096e-02, -7.10571110e-02, -5.67906946e-02, 4.61289100e-03,
7.72120059e-02, -1.36105552e-01, -6.25717789e-02, -8.04037750e-02,
2.12122276e-02, -6.30133413e-03, -9.87700000e-02, 6.31399453e-02,
-8.64481106e-02, -4.26407792e-02, -8.36099982e-02, 1.07030040e-02,
-1.34339988e-01, 6.82333438e-03, 5.62012270e-02, 6.89233318e-02,
5.61566688e-02, -9.32652280e-02, 6.18273281e-02, 1.12723336e-01,
-1.04766667e-01, -2.15716790e-02, -1.15266666e-01, 4.57017794e-02,
7.47987852e-02, -9.02220607e-04, 7.75654465e-02, -2.66306698e-02,
1.93627775e-02, -4.89100069e-03, -1.43213451e-01, -6.52845576e-02,
1.64663326e-02, -5.07618897e-02, -1.49422223e-02, 4.21274304e-02,
1.06691113e-02, -5.97029589e-02, -1.20738111e-01, -1.61822215e-02,
-5.95551059e-02, 3.67141105e-02, 2.88833342e-02, 5.24356700e-02,
7.51844468e-03, -3.79579999e-02, 9.96864438e-02, 1.28289998e-01,
1.56755541e-02, -1.55926663e-02, -4.89732213e-02, 2.24273317e-02,
-9.15533304e-03, 7.32631087e-02, -7.48946667e-02, -1.15108885e-01,
-5.56773357e-02, -8.49866867e-03, -3.00188921e-02, 3.55113335e-02,
-4.22161110e-02, 7.19971135e-02, 3.67489979e-02, -1.00055551e-02,
7.52926618e-02, -1.43726662e-01, -4.08722041e-03, -1.49663329e-01,
1.41400262e-03, 5.52397817e-02, 8.86320025e-02, -7.44862184e-02,
-3.23222089e-03, 3.30205560e-02, 3.77681069e-02, 6.58650026e-02,
2.83081792e-02, -3.24210003e-02, 1.93070006e-02, 5.67157790e-02,
6.17166609e-02, 1.09540010e-02, 4.71896678e-02, 7.68444464e-02,
-2.51592230e-02, -4.28744499e-03, -2.40004435e-02, 3.28795537e-02,
1.25606894e-01, -6.05716556e-02, 5.52507788e-02, -2.12161113e-02,
-8.45399946e-02, -7.95067847e-02, -1.33965556e-02, -5.02544455e-02,
-3.03339995e-02, 1.19719980e-02, 6.15093298e-02, 1.11455554e-02,
1.24445252e-01, 5.54273315e-02, 1.28475904e-01, -9.19478834e-02,
-2.29498874e-02, -4.18815538e-02, 5.02915531e-02, -1.14721097e-02,
1.06602885e-01, -8.45602229e-02, -4.17976640e-02, 1.39088994e-02,
-2.19033333e-03, 7.99388885e-02, 1.08606648e-02, -1.27933361e-02,
-2.84678000e-03, -2.97433343e-02, -8.61347839e-02, 9.06177703e-03],
dtype=float32)
But when I run the following, I get:
In [3]: for token in doc: print("{} : {}".format(token, token.vector[:3]))
Hello : [0. 0. 0.]
Stack : [0. 0. 0.]
Over : [0. 0. 0.]
Flow : [0. 0. 0.]
, : [-0.082752 0.67204 -0.14987 ]
my : [ 0.08649 0.14503 -0.4902 ]
name : [ 0.23231 -0.024102 -0.83964 ]
is : [-0.084961 0.502 0.0023823]
Steve : [0. 0. 0.]
Why do I get different representations?
Is the first vector a representation of the whole sentence?
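The solution: according to the documentation, doc.vector is "a real-valued meaning representation. Defaults to an average of the token vectors." So doc.vector describes the whole sentence, while token.vector describes a single token.
Source: https://spacy.io/api/doc#vector
I hope this helps others too.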
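The averaging can be checked against the numbers printed in the question. The sketch below is only an illustration with NumPy, not spaCy's own code: it copies the first three components of each token vector shown above (zeros for the tokens that printed as zeros) and averages them over all nine tokens.

```python
import numpy as np

# First three components of each token vector, copied from the output above.
token_vectors = np.array([
    [0.0, 0.0, 0.0],                   # Hello
    [0.0, 0.0, 0.0],                   # Stack
    [0.0, 0.0, 0.0],                   # Over
    [0.0, 0.0, 0.0],                   # Flow
    [-0.082752, 0.67204, -0.14987],    # ,
    [0.08649, 0.14503, -0.4902],       # my
    [0.23231, -0.024102, -0.83964],    # name
    [-0.084961, 0.502, 0.0023823],     # is
    [0.0, 0.0, 0.0],                   # Steve
])

# Average over all tokens, including the all-zero ones.
doc_vector_head = token_vectors.mean(axis=0)
print(doc_vector_head)
```

The result agrees with the first three entries of doc.vector in the question (roughly 0.0168, 0.1439 and -0.1641), which is why the document vector is non-zero even though several token vectors are all zeros.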
To solve a 5 parameter model, I need at least 5 data points to get a unique solution. For x and y data below:
import numpy as np
x = np.array([[-0.24155831, 0.37083184, -1.69002708, 1.4578805 , 0.91790011,
0.31648635, -0.15957368],
[-0.37541846, -0.14572825, -2.19695883, 1.01136142, 0.57288752,
0.32080956, -0.82986857],
[ 0.33815532, 3.1123936 , -0.29317028, 3.01493602, 1.64978158,
0.56301755, 1.3958912 ],
[ 0.84486735, 4.74567324, 0.7982888 , 3.56604097, 1.47633894,
1.38743513, 3.0679506 ],
[-0.2752026 , 2.9110031 , 0.19218081, 2.0691105 , 0.49240373,
1.63213241, 2.4235483 ],
[ 0.89942508, 5.09052174, 1.26048572, 3.73477373, 1.4302902 ,
1.91907482, 3.70126468]])
y = np.array([-0.81388378, -1.59719762, -0.08256274, 0.61297275, 0.99359647,
1.11315445])
I used only 6 data points to fit an 8-parameter model (7 slopes and 1 intercept).
from sklearn.linear_model import LinearRegression
lr = LinearRegression().fit(x, y)
print(lr.coef_)
array([-0.83916772, -0.57249998, 0.73025938, -0.02065629, 0.47637768,
-0.36962192, 0.99128474])
print(lr.intercept_)
0.2978781587718828
Clearly, it's using some kind of assignment to reduce the degrees of freedom. I tried to look into the source code but couldn't find anything about that. What method does it use to find the parameters of an underspecified model?
You don't need to reduce the degrees of freedom; it simply finds a solution to the least squares problem min sum_i (dot(beta,x_i)+beta_0-y_i)**2. For example, in the non-sparse case it uses the lstsq function from scipy.linalg. The default solver for this optimization problem is the gelsd LAPACK driver, which returns the minimum-norm solution when the problem is underdetermined. If
A = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)
is the augmented array with ones as its first column, then a least-squares solution is given by
beta = np.linalg.pinv(A.T @ A) @ A.T @ y
where we use the pseudoinverse precisely because the matrix may not be of full rank (note the matrix product @; with NumPy arrays, * would multiply elementwise). Of course, the solver doesn't actually evaluate this formula; it works from the singular value decomposition of A instead.
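To make the minimum-norm idea concrete, here is a minimal NumPy sketch on the data from the question. It rests on an assumption that is not stated above: that the intercept is handled by centering x and y first, so only the slopes enter the minimized norm. The rcond cutoff of 1e-10 is also an arbitrary choice for this illustration.

```python
import numpy as np

x = np.array([
    [-0.24155831, 0.37083184, -1.69002708, 1.4578805, 0.91790011, 0.31648635, -0.15957368],
    [-0.37541846, -0.14572825, -2.19695883, 1.01136142, 0.57288752, 0.32080956, -0.82986857],
    [0.33815532, 3.1123936, -0.29317028, 3.01493602, 1.64978158, 0.56301755, 1.3958912],
    [0.84486735, 4.74567324, 0.7982888, 3.56604097, 1.47633894, 1.38743513, 3.0679506],
    [-0.2752026, 2.9110031, 0.19218081, 2.0691105, 0.49240373, 1.63213241, 2.4235483],
    [0.89942508, 5.09052174, 1.26048572, 3.73477373, 1.4302902, 1.91907482, 3.70126468],
])
y = np.array([-0.81388378, -1.59719762, -0.08256274, 0.61297275, 0.99359647, 1.11315445])

# Assumption: remove the intercept by centering, then solve the centered problem.
x_mean, y_mean = x.mean(axis=0), y.mean()
xc, yc = x - x_mean, y - y_mean

# Minimum-norm least-squares slopes via the SVD-based pseudoinverse.
coef = np.linalg.pinv(xc, rcond=1e-10) @ yc
intercept = y_mean - x_mean @ coef

# The same slopes from an SVD-based least-squares solver.
coef_lstsq = np.linalg.lstsq(xc, yc, rcond=1e-10)[0]

print(coef, intercept)
print(np.allclose(x @ coef + intercept, y))  # the 6 points are fitted exactly
```

With 6 points and 8 parameters there are infinitely many exact fits; the pseudoinverse simply picks the one whose slope vector has the smallest Euclidean norm. Whether these numbers match the lr.coef_ printed in the question depends on the centering assumption.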
I have a word2vec model that I created in PySpark. The model is saved as a .parquet file. I want to be able to access and query the model (or the words and word vectors) using vanilla Python because I am building a Flask app that will allow a user to enter words of interest for finding synonyms.
I've extracted the words and word vectors, but I've noticed that while I have approximately 7000 unique words, my word vectors have a length of 100. For example, here are two words "serious" and "breaks". Their vectors only have a length of 100. Why is this? How is it able to then reconstruct the entire vector space with only 100 values for each word? Is it simply only giving me the top 100 or the first 100 values?
vectors.take(2)
Out[48]:
[Row(word=u'serious', vector=DenseVector([0.0784, -0.0882, -0.0342, -0.0153, 0.0223, 0.1034, 0.1218, -0.0814, -0.0198, -0.0325, -0.1024, -0.2412, -0.0704, -0.1575, 0.0342, -0.1447, -0.1687, 0.0673, 0.1248, 0.0623, -0.0078, -0.0813, 0.0953, -0.0213, 0.0031, 0.0773, -0.0246, -0.0822, -0.0252, -0.0274, -0.0288, 0.0403, -0.0419, -0.1122, -0.0397, 0.0186, -0.0038, 0.1279, -0.0123, 0.0091, 0.0065, 0.0884, 0.0899, -0.0479, 0.0328, 0.0171, -0.0962, 0.0753, -0.187, 0.034, -0.1393, -0.0575, -0.019, 0.0151, -0.0205, 0.0667, 0.0762, -0.0365, -0.025, -0.184, -0.0118, -0.0964, 0.1744, 0.0563, -0.0413, -0.054, -0.1764, -0.087, 0.0747, -0.022, 0.0778, -0.0014, -0.1313, -0.1133, -0.0669, 0.0007, -0.0378, -0.1093, -0.0732, 0.1494, -0.0815, -0.0137, 0.1009, -0.0057, 0.0195, 0.0085, 0.025, 0.0064, 0.0076, 0.0676, 0.1663, -0.0078, 0.0278, 0.0519, -0.0615, -0.0833, 0.0643, 0.0032, -0.0882, 0.1033])),
Row(word=u'breaks', vector=DenseVector([0.0065, 0.0027, -0.0121, 0.0296, -0.0467, 0.0297, 0.0499, 0.0843, 0.1027, 0.0179, -0.014, 0.0586, 0.06, 0.0534, 0.0391, -0.0098, -0.0266, -0.0422, 0.0188, 0.0065, -0.0309, 0.0038, -0.0458, -0.0252, 0.0428, 0.0046, -0.065, -0.0822, -0.0555, -0.0248, -0.0288, -0.0016, 0.0334, -0.0028, -0.0718, -0.0571, -0.0668, -0.0073, 0.0658, -0.0732, 0.0976, -0.0255, -0.0712, 0.0899, 0.0065, -0.04, 0.0964, 0.0356, 0.0142, 0.0857, 0.0669, -0.038, -0.0728, -0.0446, 0.1194, -0.056, 0.1022, 0.0459, -0.0343, -0.0861, -0.0943, -0.0435, -0.0573, 0.0229, 0.0368, 0.085, -0.0218, -0.0623, 0.0502, -0.0645, 0.0247, -0.0371, -0.0785, 0.0371, -0.0047, 0.0012, 0.0214, 0.0669, 0.049, -0.0294, -0.0272, 0.0642, -0.006, -0.0804, -0.06, 0.0719, -0.0109, -0.0272, -0.0366, 0.0041, 0.0556, 0.0108, 0.0624, 0.0134, -0.0094, 0.0219, 0.0164, -0.0545, -0.0055, -0.0193]))]
Any thoughts on the best way to reconstruct this model in vanilla python?
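Just to improve on the comment by zero323, for anyone else who arrives here.
Word2Vec creates word vectors of 100 dimensions by default, regardless of how many unique words are in the vocabulary. To change this in gensim (where the parameter is called size, or vector_size from gensim 4.0 on):
model = Word2Vec(sentences, size=300)
Passing this when initializing the model will create vectors of 300 dimensions. In PySpark, the corresponding parameter is vectorSize.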
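For querying the exported vectors in vanilla Python, one common approach is to stack them into a matrix with one row per word and rank words by cosine similarity. The sketch below is an illustration only: the 3-dimensional vectors and the words "grave" and "pauses" are made up, and most_similar is a hypothetical helper, not part of PySpark or gensim. A real export would have one 100-value row for each of the roughly 7000 words.

```python
import numpy as np

# Hypothetical 3-dimensional vectors; a real model would have 100 values per word.
vectors = {
    "serious": np.array([0.9, 0.1, 0.2]),
    "grave":   np.array([0.8, 0.2, 0.3]),
    "breaks":  np.array([-0.1, 0.9, 0.4]),
    "pauses":  np.array([0.0, 0.8, 0.5]),
}

words = list(vectors)
matrix = np.stack([vectors[w] for w in words])  # shape: (n_words, n_dims)

# Normalize rows so that a dot product equals the cosine similarity.
unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)

def most_similar(word, topn=2):
    """Return the topn words closest to `word` by cosine similarity."""
    query = unit[words.index(word)]
    scores = unit @ query
    ranked = [i for i in np.argsort(-scores) if words[i] != word]
    return [(words[i], float(scores[i])) for i in ranked[:topn]]

print(matrix.shape)
print(most_similar("serious"))
```

The vector length is the dimensionality of the embedding space, not the vocabulary size, so nothing is truncated: every word gets its own row, and similarity is computed between rows.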
I think the problem lies with the minCount parameter value of your Word2Vec model.
If this value is too high, fewer words get used in the training of the model, resulting in a word vector of only 100.
7000 unique words is not a lot.
Try setting minCount lower than the default of 5.
model.setMinCount(value)
https://spark.apache.org/docs/latest/api/python/pyspark.ml.html?highlight=word2vec#pyspark.ml.feature.Word2Vec
I use a random forest (RF) twice in a row.
First, I fit it with max_features='auto' on the whole dataset (109 features) in order to perform feature selection.
The following is RandomForestClassifier.feature_importances_; it correctly gives me 109 scores, one per feature:
[0.00118087, 0.01268531, 0.0017589 , 0.01614814, 0.01105567,
0.0146838 , 0.0187875 , 0.0190427 , 0.01429976, 0.01311706,
0.01702717, 0.00901344, 0.01044047, 0.00932331, 0.01211333,
0.01271825, 0.0095337 , 0.00985686, 0.00952823, 0.01165877,
0.00193286, 0.0012602 , 0.00208145, 0.00203459, 0.00229907,
0.00242616, 0.00051358, 0.00071606, 0.00975515, 0.00171034,
0.01134927, 0.00687018, 0.00987706, 0.01507474, 0.01223525,
0.01170495, 0.00928417, 0.01083082, 0.01302036, 0.01002457,
0.00894818, 0.00833564, 0.00930602, 0.01100774, 0.00818604,
0.00675784, 0.00740617, 0.00185461, 0.00119627, 0.00159034,
0.00154336, 0.00478926, 0.00200773, 0.00063574, 0.00065675,
0.01104192, 0.00246746, 0.01663812, 0.01041134, 0.01401842,
0.02038318, 0.0202834 , 0.01290935, 0.01476593, 0.0108275 ,
0.0118773 , 0.01050919, 0.0111477 , 0.00684507, 0.01170021,
0.01291888, 0.00963295, 0.01161876, 0.00756015, 0.00178329,
0.00065709, 0. , 0.00246064, 0.00217982, 0.00305187,
0.00061284, 0.00063431, 0.01963523, 0.00265208, 0.01543552,
0.0176546 , 0.01443356, 0.01834896, 0.01385694, 0.01320648,
0.00966011, 0.0148321 , 0.01574166, 0.0167107 , 0.00791634,
0.01121442, 0.02171706, 0.01855552, 0.0257449 , 0.02925843,
0.01789742, 0. , 0. , 0.00379275, 0.0024365 ,
0.00333905, 0.00238971, 0.00068355, 0.00075399]
Then I transform the dataset using the previous fit, which should reduce its dimensionality, and re-fit the RF on it.
Given max_features='auto' and the 109 features, I would expect to end up with ~10 features in total. Instead, rf.feature_importances_ returns more (62):
[ 0.01261971, 0.02003921, 0.00961297, 0.02505467, 0.02038449,
0.02353745, 0.01893777, 0.01932577, 0.01681398, 0.01464485,
0.01672119, 0.00748981, 0.01109461, 0.01116948, 0.0087081 ,
0.01056344, 0.00971319, 0.01532258, 0.0167348 , 0.01601214,
0.01522208, 0.01625487, 0.01653784, 0.01483562, 0.01602748,
0.01522369, 0.01581573, 0.01406688, 0.01269036, 0.00884105,
0.02538574, 0.00637611, 0.01928382, 0.02061512, 0.02566056,
0.02180902, 0.01537295, 0.01796305, 0.01171095, 0.01179759,
0.01371328, 0.00811729, 0.01060708, 0.015717 , 0.01067911,
0.01773623, 0.0169396 , 0.0226369 , 0.01547827, 0.01499467,
0.01356075, 0.01040735, 0.01360752, 0.01754145, 0.01446933,
0.01845195, 0.0190799 , 0.02608652, 0.02095663, 0.02939744,
0.01870901, 0.02512201]
Why? Shouldn't it return just ~10 feature importances?
You misunderstood the meaning of max_features, which is
The number of features to consider when looking for the best split
It is not the number of features kept when transforming the data.
It is the threshold parameter of the transform method that determines which features are kept as the most important.
threshold : string, float or None, optional (default=None)
The threshold value to use for feature selection. Features whose importance is greater or equal are kept while the others are discarded. If “median” (resp. “mean”), then the threshold value is the median (resp. the mean) of the feature importances. A scaling factor (e.g., “1.25*mean”) may also be used. If None and if available, the object attribute threshold is used. Otherwise, “mean” is used by default.
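The count of 62 can be reproduced from the importances printed in the question. This is a minimal NumPy sketch, assuming the transform fell back to the default "mean" threshold described in the quoted documentation; it does not call scikit-learn itself.

```python
import numpy as np

# The 109 feature importances from the first fit, copied from the question.
importances = np.array([
    0.00118087, 0.01268531, 0.0017589, 0.01614814, 0.01105567,
    0.0146838, 0.0187875, 0.0190427, 0.01429976, 0.01311706,
    0.01702717, 0.00901344, 0.01044047, 0.00932331, 0.01211333,
    0.01271825, 0.0095337, 0.00985686, 0.00952823, 0.01165877,
    0.00193286, 0.0012602, 0.00208145, 0.00203459, 0.00229907,
    0.00242616, 0.00051358, 0.00071606, 0.00975515, 0.00171034,
    0.01134927, 0.00687018, 0.00987706, 0.01507474, 0.01223525,
    0.01170495, 0.00928417, 0.01083082, 0.01302036, 0.01002457,
    0.00894818, 0.00833564, 0.00930602, 0.01100774, 0.00818604,
    0.00675784, 0.00740617, 0.00185461, 0.00119627, 0.00159034,
    0.00154336, 0.00478926, 0.00200773, 0.00063574, 0.00065675,
    0.01104192, 0.00246746, 0.01663812, 0.01041134, 0.01401842,
    0.02038318, 0.0202834, 0.01290935, 0.01476593, 0.0108275,
    0.0118773, 0.01050919, 0.0111477, 0.00684507, 0.01170021,
    0.01291888, 0.00963295, 0.01161876, 0.00756015, 0.00178329,
    0.00065709, 0.0, 0.00246064, 0.00217982, 0.00305187,
    0.00061284, 0.00063431, 0.01963523, 0.00265208, 0.01543552,
    0.0176546, 0.01443356, 0.01834896, 0.01385694, 0.01320648,
    0.00966011, 0.0148321, 0.01574166, 0.0167107, 0.00791634,
    0.01121442, 0.02171706, 0.01855552, 0.0257449, 0.02925843,
    0.01789742, 0.0, 0.0, 0.00379275, 0.0024365,
    0.00333905, 0.00238971, 0.00068355, 0.00075399,
])

# Keep features whose importance is greater than or equal to the mean importance.
threshold = importances.mean()
kept = importances >= threshold

print(len(importances), threshold, int(kept.sum()))
```

With these values the mean is about 1/109, and 62 features lie at or above it, which matches the number of importances returned by the second fit. To keep roughly 10 features instead, a stricter threshold would be needed.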
Correct me if I'm wrong: the "thresholds" returned by scikit-learn's roc_curve should be an array of numbers in [0,1]. However, it sometimes gives me an array whose first number is close to 2. Is it a bug, or did I do something wrong? Thanks.
In [1]: import numpy as np
In [2]: from sklearn.metrics import roc_curve
In [3]: np.random.seed(11)
In [4]: aa = np.random.choice([True, False],100)
In [5]: bb = np.random.uniform(0,1,100)
In [6]: fpr,tpr,thresholds = roc_curve(aa,bb)
In [7]: thresholds
Out[7]:
array([ 1.97396826, 0.97396826, 0.9711752 , 0.95996265, 0.95744405,
0.94983331, 0.93290463, 0.93241372, 0.93214862, 0.93076592,
0.92960511, 0.92245024, 0.91179548, 0.91112166, 0.87529458,
0.84493853, 0.84068543, 0.83303741, 0.82565223, 0.81096657,
0.80656679, 0.79387241, 0.77054807, 0.76763223, 0.7644911 ,
0.75964947, 0.73995152, 0.73825262, 0.73466772, 0.73421299,
0.73282534, 0.72391126, 0.71296292, 0.70930102, 0.70116428,
0.69606617, 0.65869235, 0.65670881, 0.65261474, 0.6487222 ,
0.64805644, 0.64221486, 0.62699782, 0.62522484, 0.62283401,
0.61601839, 0.611632 , 0.59548669, 0.57555854, 0.56828967,
0.55652111, 0.55063947, 0.53885029, 0.53369398, 0.52157349,
0.51900774, 0.50547317, 0.49749635, 0.493913 , 0.46154029,
0.45275916, 0.44777116, 0.43822067, 0.43795921, 0.43624093,
0.42039077, 0.41866343, 0.41550367, 0.40032843, 0.36761763,
0.36642721, 0.36567017, 0.36148354, 0.35843793, 0.34371331,
0.33436415, 0.33408289, 0.33387442, 0.31887024, 0.31818719,
0.31367915, 0.30216469, 0.30097917, 0.29995201, 0.28604467,
0.26930354, 0.2383461 , 0.22803687, 0.21800338, 0.19301808,
0.16902881, 0.1688173 , 0.14491946, 0.13648451, 0.12704826,
0.09141459, 0.08569481, 0.07500199, 0.06288762, 0.02073298,
0.01934336])
Most of the time these thresholds are not used, for example in calculating the area under the curve, or plotting the False Positive Rate against the True Positive Rate.
Yet to plot what looks like a reasonable curve, one needs to have a threshold that incorporates 0 data points. Since Scikit-Learn's ROC curve function need not have normalised probabilities for thresholds (any score is fine), setting this point's threshold to 1 isn't sufficient; setting it to inf is sensible but coders often expect finite data (and it's possible the implementation also works for integer thresholds). Instead the implementation uses max(score) + epsilon where epsilon = 1. This may be cosmetically deficient, but you haven't given any reason why it's a problem!
From the documentation:
thresholds : array, shape = [n_thresholds]
Decreasing thresholds on the decision function used to compute
fpr and tpr. thresholds[0] represents no instances being predicted
and is arbitrarily set to max(y_score) + 1.
So the first element of thresholds is close to 2 because it is max(y_score) + 1, in your case thresholds[1] + 1.
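The role of that extra threshold can be illustrated without calling roc_curve at all. The sketch below uses made-up labels and scores, and rates_at is a hypothetical helper written for this example; it only shows that a threshold of max(y_score) + 1 predicts no positives, which gives the (0, 0) starting point of the curve.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = np.array([True, False] * 50)    # hypothetical labels, 50 of each class
y_score = rng.uniform(0, 1, size=100)    # hypothetical scores in [0, 1)

def rates_at(threshold):
    """Return (fpr, tpr) when predicting positive for score >= threshold."""
    predicted = y_score >= threshold
    tpr = np.sum(predicted & y_true) / np.sum(y_true)
    fpr = np.sum(predicted & ~y_true) / np.sum(~y_true)
    return float(fpr), float(tpr)

# Same construction as described in the documentation quote: max(y_score) + 1.
first_threshold = y_score.max() + 1

print(first_threshold)
print(rates_at(first_threshold))  # no instance is predicted positive
print(rates_at(y_score.min()))    # every instance is predicted positive
```

Because the scores are below 1 here, the first threshold lands between 1 and 2, just as in the question, and it corresponds to the point where both rates are zero.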
This seems like a bug to me: in roc_curve(aa, bb), 1 is added to the first threshold. You should create an issue here: https://github.com/scikit-learn/scikit-learn/issues