Why do mllib word2vec word vectors only have 100 elements? - apache-spark

I have a word2vec model that I created in PySpark. The model is saved as a .parquet file. I want to be able to access and query the model (or the words and word vectors) using vanilla Python, because I am building a Flask app that will allow a user to enter words of interest for finding synonyms.
I've extracted the words and word vectors, but I've noticed that while I have approximately 7000 unique words, my word vectors have a length of 100. For example, here are two words "serious" and "breaks". Their vectors only have a length of 100. Why is this? How is it able to then reconstruct the entire vector space with only 100 values for each word? Is it simply only giving me the top 100 or the first 100 values?
vectors.take(2)
Out[48]:
[Row(word=u'serious', vector=DenseVector([0.0784, -0.0882, -0.0342, -0.0153, 0.0223, 0.1034, 0.1218, -0.0814, -0.0198, -0.0325, -0.1024, -0.2412, -0.0704, -0.1575, 0.0342, -0.1447, -0.1687, 0.0673, 0.1248, 0.0623, -0.0078, -0.0813, 0.0953, -0.0213, 0.0031, 0.0773, -0.0246, -0.0822, -0.0252, -0.0274, -0.0288, 0.0403, -0.0419, -0.1122, -0.0397, 0.0186, -0.0038, 0.1279, -0.0123, 0.0091, 0.0065, 0.0884, 0.0899, -0.0479, 0.0328, 0.0171, -0.0962, 0.0753, -0.187, 0.034, -0.1393, -0.0575, -0.019, 0.0151, -0.0205, 0.0667, 0.0762, -0.0365, -0.025, -0.184, -0.0118, -0.0964, 0.1744, 0.0563, -0.0413, -0.054, -0.1764, -0.087, 0.0747, -0.022, 0.0778, -0.0014, -0.1313, -0.1133, -0.0669, 0.0007, -0.0378, -0.1093, -0.0732, 0.1494, -0.0815, -0.0137, 0.1009, -0.0057, 0.0195, 0.0085, 0.025, 0.0064, 0.0076, 0.0676, 0.1663, -0.0078, 0.0278, 0.0519, -0.0615, -0.0833, 0.0643, 0.0032, -0.0882, 0.1033])),
Row(word=u'breaks', vector=DenseVector([0.0065, 0.0027, -0.0121, 0.0296, -0.0467, 0.0297, 0.0499, 0.0843, 0.1027, 0.0179, -0.014, 0.0586, 0.06, 0.0534, 0.0391, -0.0098, -0.0266, -0.0422, 0.0188, 0.0065, -0.0309, 0.0038, -0.0458, -0.0252, 0.0428, 0.0046, -0.065, -0.0822, -0.0555, -0.0248, -0.0288, -0.0016, 0.0334, -0.0028, -0.0718, -0.0571, -0.0668, -0.0073, 0.0658, -0.0732, 0.0976, -0.0255, -0.0712, 0.0899, 0.0065, -0.04, 0.0964, 0.0356, 0.0142, 0.0857, 0.0669, -0.038, -0.0728, -0.0446, 0.1194, -0.056, 0.1022, 0.0459, -0.0343, -0.0861, -0.0943, -0.0435, -0.0573, 0.0229, 0.0368, 0.085, -0.0218, -0.0623, 0.0502, -0.0645, 0.0247, -0.0371, -0.0785, 0.0371, -0.0047, 0.0012, 0.0214, 0.0669, 0.049, -0.0294, -0.0272, 0.0642, -0.006, -0.0804, -0.06, 0.0719, -0.0109, -0.0272, -0.0366, 0.0041, 0.0556, 0.0108, 0.0624, 0.0134, -0.0094, 0.0219, 0.0164, -0.0545, -0.0055, -0.0193]))]
Any thoughts on the best way to reconstruct this model in vanilla Python?

Just to improve on the comment by zero323, for anyone else who arrives here.
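Word2Vec creates 100-dimensional word vectors by default. To change this (shown here with the gensim API; gensim 4.0 and later name the parameter vector_size):
model = Word2Vec(sentences, size=300)
Passing this when initializing the model creates vectors of 300 dimensions. In PySpark the equivalent setting is vectorSize, e.g. Word2Vec().setVectorSize(300).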
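Since each word is a point in a 100-dimensional space, synonyms can be looked up in vanilla Python by comparing vectors directly. The following is a minimal sketch, not the Spark implementation: the names cosine and find_synonyms are hypothetical, the toy vectors are made up and have 3 values instead of 100, and it assumes the extracted rows have already been turned into a plain dict.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def find_synonyms(word, vectors, top_n=5):
    """Return the top_n (word, similarity) pairs closest to `word`."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Toy 3-dimensional vectors, made up for illustration. Real vectors would
# have 100 values each; assuming the extracted rows shown in the question,
# the dict could be built as
# {row.word: list(row.vector) for row in vectors.collect()}.
toy_vectors = {
    "serious": [0.9, 0.1, 0.0],
    "grave": [0.8, 0.2, 0.1],
    "breaks": [0.0, 0.1, 0.9],
}
print(find_synonyms("serious", toy_vectors, top_n=2))
```

For roughly 7000 words a linear scan like this should be fast enough for a small web app; a larger vocabulary would probably call for a vectorized or indexed search.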

I think the problem lies with the minCount parameter of your Word2Vec model.
If this value is too high, fewer words are used in training the model, resulting in word vectors of only 100.
7000 unique words is not a lot.
Try setting minCount lower than the default of 5.
model.setMinCount(value)
https://spark.apache.org/docs/latest/api/python/pyspark.ml.html?highlight=word2vec#pyspark.ml.feature.Word2Vec

Related

How to get around truncating long sentences with Hugging Face Tokenizers?

I am new to tokenizers. My understanding is that the truncate attribute just cuts the sentences. But I need the whole sentence for context.
For example, my sentence is :
"Ali bin Abbas'ın Kitab Kamilü-s Sina adlı eseri daha sonra 980 yılında nasıl adlandırılmıştır? Ali bin Abbas'ın eseri Rezi'nin hangi isimli eserinden daha özlü ve daha sistematikdir? Ali bin Abbas'ın Kitab Kamilü-s Sina adlı eseri İbn-i Sina'nın hangi isimli eserinden daha uygulamalı bir biçimde yazılmıştır? Kitab el-Maliki Avrupa'da Constantinus Africanus tarafından hangi dile çevrilmiştir? Kitab el-Maliki'nin ilk bölümünde neye ağırlık verilmiştir?
But when I use max_length=64, truncation=True and pad_to_max_length=True for my encoder (as suggested on the internet), half of the sentence is lost:
['▁Ali', '▁bin', '▁Abbas', "'", 'ın', '▁Kitab', '▁Kami', 'lü', '-', 's', '▁Sina', '▁ad', 'lı', '▁es', 'eri', '▁daha', '▁sonra', '▁980', '▁yıl', 'ında', '▁na', 'sıl', '▁adlandır', 'ılmıştır', '?', '▁', '<sep>', '▁Ali', '▁bin', '▁Abbas', "'", 'ın', '▁es', 'eri', '▁Rez', 'i', "'", 'nin', '▁', 'hangi', '▁is', 'imli', '▁es', 'erinden', '▁daha', '▁', 'özlü', '▁ve', '▁daha', '▁sistema', 'tik', 'dir', '?', '▁', '<sep>', '▁Ali', '▁bin', '▁Abbas', "'", 'ın', '▁Kitab', '▁Kami', 'lü', '</s>']
And when I increase the max length, CUDA runs out of memory, of course. What should my approach be for long texts in the dataset?
My code for encoding:
# Encode the contexts, truncated and padded to 512 subword tokens.
input_encodings = tokenizer.batch_encode_plus(
    example_batch['context'],
    max_length=512,
    add_special_tokens=True,
    truncation=True,
    pad_to_max_length=True)
# Encode the target questions, truncated and padded to 64 subword tokens.
target_encodings = tokenizer.batch_encode_plus(
    example_batch['questions'],
    max_length=64,
    add_special_tokens=True,
    truncation=True,
    pad_to_max_length=True)
Yes, the truncate attribute just keeps the given number of subwords from the left. The workaround depends on the task you are solving and the data that you use.
Are long sequences frequent in your data? If not, you can safely throw those instances away, because it is unlikely that the model would learn to generalize to long sequences anyway.
If you really need the long context, you have plenty of options:
Decrease the batch size (and perhaps do updates once after several batches).
Make the model smaller: use either a smaller dimension or fewer layers.
Use a different architecture: Transformers need quadratic memory w.r.t. sequence length. Wouldn't an LSTM or CNN do the job? What about architectures for long sequences (e.g., Reformer, Longformer)?
If you need to use a pre-trained BERT-like model and there is no model of the size that would fit your needs, you can distill a smaller model or a model with a more suitable architecture yourself.
Perhaps you can split the input. In tasks like answer span selection, you can split the text where you are looking for an answer into smaller chunks and search in the chunks independently.
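The last option, splitting the input, can be sketched in plain Python as overlapping windows over an already tokenized sequence. This is only an illustration under stated assumptions: split_into_windows is a hypothetical helper, and the token names are made up. Hugging Face tokenizers offer related behaviour through their return_overflowing_tokens and stride arguments, which may be the more convenient route in practice.

```python
def split_into_windows(tokens, max_length, stride):
    """Split a token list into windows of at most max_length tokens.

    Consecutive windows overlap by `stride` tokens, so an answer that
    sits near a window boundary still appears whole in some window.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    step = max_length - stride
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_length])
        if start + max_length >= len(tokens):
            break  # the last window reached the end of the sequence
        start += step
    return windows

# Made-up tokens standing in for real subwords.
tokens = [f"tok{i}" for i in range(10)]
for window in split_into_windows(tokens, max_length=4, stride=1):
    print(window)
```

Each window can then be encoded and scored independently, and the per-window predictions merged afterwards (for span selection, for example, by keeping the highest-scoring span).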

fasttext: why do aligned vectors contain only one value per word?

I was taking a look at the fastText aligned vectors for some languages and was surprised to find that each vector consisted of only one value. I was expecting a matrix with multidimensional vectors belonging to each word; instead there is only one column of numbers. I'm very new to this field and was wondering if somebody could explain to me how this single number belonging to each word came to be, and whether I'm looking at a semantic space as I was expecting or something different (if so, what is it, and are aligned multidimensional semantic spaces available somewhere?).
I think you may be misinterpreting those files.
When I look at one of those files – for example wiki.en.align.vec – each line is a word-token, then 300 different values (to provide a 300-dimensional word-vector).
For example, the 4th line of the file is:
the -0.0324 -0.0462 -0.0087 0.0994 0.0147 -0.0198 -0.0811 -0.0362 0.0445 0.0402 -0.0199 -0.1173 0.0906 -0.0304 -0.0320 -0.0374 -0.0249 -0.0099 0.0017 0.0719 -0.0834 0.0382 -0.1141 -0.0288 -0.0666 -0.0365 -0.0006 0.0098 0.0282 0.0310 -0.0773 0.0755 -0.0528 0.1225 -0.0138 -0.0879 0.0036 -0.0593 0.0416 -0.0588 0.0266 -0.0011 -0.0419 0.0141 0.0388 -0.0597 -0.0203 0.0444 0.0253 -0.0316 0.0352 -0.0318 -0.0473 0.0347 -0.0250 0.0289 0.0426 0.0218 -0.0254 0.0486 -0.0252 -0.0904 0.1607 -0.0379 0.0231 -0.0988 -0.1213 -0.0926 -0.1116 0.0345 -0.1856 -0.0409 0.0306 -0.0653 -0.0377 -0.0301 0.0361 0.1212 0.0105 -0.0354 0.0552 0.0363 -0.0427 0.0555 -0.0031 -0.0830 -0.0325 0.0415 -0.0461 -0.0615 -0.0412 0.0060 0.1680 -0.1347 0.0271 -0.0438 0.0364 0.0121 0.0018 -0.0138 -0.0625 -0.0161 -0.0009 -0.0373 -0.1009 -0.0583 0.0038 0.0109 -0.0068 0.0319 -0.0043 -0.0412 -0.0506 -0.0674 0.0426 -0.0031 0.0788 0.0924 0.0559 0.0449 0.1364 0.1132 -0.0378 0.1060 0.0130 0.0349 0.0638 0.1020 0.0459 0.0634 -0.0870 0.0447 -0.0124 0.0167 -0.0603 0.0297 -0.0298 0.0691 -0.0280 0.0749 0.0474 0.0275 0.0255 0.0184 0.0085 0.1116 0.0233 0.0176 0.0327 0.0471 0.0662 -0.0353 -0.0387 -0.0336 -0.0354 -0.0348 0.0157 -0.0294 0.0710 0.0299 -0.0602 0.0732 -0.0344 0.0419 0.0773 0.0119 -0.0550 0.0377 0.0808 -0.0424 -0.0977 -0.0386 -0.0334 -0.0384 -0.0520 0.0641 0.0049 0.1226 -0.0011 -0.0131 0.0224 0.0138 -0.0243 0.0544 -0.0164 0.1194 0.0916 -0.0755 0.0565 0.0235 -0.0009 -0.0818 0.0953 0.0873 -0.0215 0.0240 -0.0271 0.0134 -0.0870 0.0597 -0.0073 -0.0230 -0.0220 0.0562 -0.0069 -0.0796 -0.0118 0.0059 0.0221 0.0509 0.1175 0.0508 -0.0044 -0.0265 0.0328 -0.0525 0.0493 -0.1309 -0.0674 0.0148 -0.0024 -0.0163 -0.0241 0.0726 -0.0165 0.0368 -0.0914 0.0197 0.0018 -0.0149 0.0654 0.0912 -0.0638 -0.0135 -0.0277 -0.0078 0.0092 -0.0477 0.0054 -0.0153 -0.0411 -0.0177 0.0874 0.0221 0.1040 0.1004 0.0595 -0.0610 0.0650 -0.0235 0.0257 0.1208 0.0129 -0.0086 -0.0846 0.1102 -0.0338 -0.0553 0.0166 -0.0602 0.0128 0.0792 -0.0181 0.0046 -0.0548 
-0.0394 -0.0546 0.0425 0.0048 -0.1172 -0.0925 -0.0357 -0.0123 0.0371 -0.0142 0.0157 0.0442 0.1186 0.0834 -0.0293 0.0313 -0.0287 0.0095 0.0080 0.0566 -0.0370 0.0257 0.1032 -0.0431 0.0544 0.0323 -0.1076 -0.0187 0.0407 -0.0198 -0.0255 -0.0505 0.0827 -0.0650 0.0176
Thus every one of the 2,519,370 word-tokens has a 300-dimensional vector.
If this isn't what you're seeing, you should explain further. If this is what you're seeing and you were expecting something else, you should explain further what you were expecting.
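One way to check what a .vec file really contains is to parse it and count the values per line. The sketch below assumes the usual text layout of these files, a header line with the word count and the dimension followed by one word and its values per line; read_vec is a hypothetical helper, and the sample is a made-up 3-dimensional stand-in (only the first three values for "the" are taken from the line quoted above).

```python
import io

def read_vec(handle):
    """Parse .vec text: a 'count dim' header, then 'word v1 v2 ...' lines."""
    count, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        if not line.strip():
            continue  # ignore blank lines
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(
                f"{word!r}: expected {dim} values, got {len(values)}")
        vectors[word] = values
    return count, dim, vectors

# Tiny made-up sample with 3 dimensions instead of 300.
sample = io.StringIO(
    "2 3\n"
    "the -0.0324 -0.0462 -0.0087\n"
    "of 0.0120 0.0310 -0.0450\n"
)
count, dim, vectors = read_vec(sample)
print(count, dim, len(vectors["the"]))
```

If a viewer shows only one number per word, a likely cause is that it splits on a delimiter other than the space, or displays only the first column; reading the file as above should make the full vectors visible.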

How to estimate camera pose according to a projective transformation matrix of two consecutive frames?

I'm working on the KITTI visual odometry dataset. I use a projective transformation to register two consecutive 2D frames (see projective transformation example here). I want to know how this 3x3 projective transformation matrix is related to the ground truth poses provided by the KITTI dataset.
This dataset gives the ground truth poses (trajectory) for the sequences, which is described below:
Folder 'poses':
The folder 'poses' contains the ground truth poses (trajectory) for the
first 11 sequences. This information can be used for training/tuning your
method. Each file xx.txt contains a N x 12 table, where N is the number of
frames of this sequence. Row i represents the i'th pose of the left camera
coordinate system (i.e., z pointing forwards) via a 3x4 transformation
matrix. The matrices are stored in row aligned order (the first entries
correspond to the first row), and take a point in the i'th coordinate
system and project it into the first (=0th) coordinate system. Hence, the
translational part (3x1 vector of column 4) corresponds to the pose of the
left camera coordinate system in the i'th frame with respect to the first
(=0th) frame. Your submission results must be provided using the same data
format.
Some samples of the given ground-truth poses:
1.000000e+00 9.043680e-12 2.326809e-11 5.551115e-17 9.043683e-12 1.000000e+00 2.392370e-10 3.330669e-16 2.326810e-11 2.392370e-10 9.999999e-01 -4.440892e-16
9.999978e-01 5.272628e-04 -2.066935e-03 -4.690294e-02 -5.296506e-04 9.999992e-01 -1.154865e-03 -2.839928e-02 2.066324e-03 1.155958e-03 9.999971e-01 8.586941e-01
9.999910e-01 1.048972e-03 -4.131348e-03 -9.374345e-02 -1.058514e-03 9.999968e-01 -2.308104e-03 -5.676064e-02 4.128913e-03 2.312456e-03 9.999887e-01 1.716275e+00
9.999796e-01 1.566466e-03 -6.198571e-03 -1.406429e-01 -1.587952e-03 9.999927e-01 -3.462706e-03 -8.515762e-02 6.193102e-03 3.472479e-03 9.999747e-01 2.574964e+00
9.999637e-01 2.078471e-03 -8.263498e-03 -1.874858e-01 -2.116664e-03 9.999871e-01 -4.615826e-03 -1.135202e-01 8.253797e-03 4.633149e-03 9.999551e-01 3.432648e+00
9.999433e-01 2.586172e-03 -1.033094e-02 -2.343818e-01 -2.645881e-03 9.999798e-01 -5.770163e-03 -1.419150e-01 1.031581e-02 5.797170e-03 9.999299e-01 4.291335e+00
9.999184e-01 3.088363e-03 -1.239599e-02 -2.812195e-01 -3.174350e-03 9.999710e-01 -6.922975e-03 -1.702743e-01 1.237425e-02 6.961759e-03 9.998991e-01 5.148987e+00
9.998890e-01 3.586305e-03 -1.446384e-02 -3.281178e-01 -3.703403e-03 9.999605e-01 -8.077186e-03 -1.986703e-01 1.443430e-02 8.129853e-03 9.998627e-01 6.007777e+00
9.998551e-01 4.078705e-03 -1.652913e-02 -3.749547e-01 -4.231669e-03 9.999484e-01 -9.229794e-03 -2.270290e-01 1.649063e-02 9.298401e-03 9.998207e-01 6.865477e+00
9.998167e-01 4.566671e-03 -1.859652e-02 -4.218367e-01 -4.760342e-03 9.999347e-01 -1.038342e-02 -2.554151e-01 1.854788e-02 1.047004e-02 9.997731e-01 7.724036e+00
9.997738e-01 5.049868e-03 -2.066463e-02 -4.687329e-01 -5.289072e-03 9.999194e-01 -1.153730e-02 -2.838096e-01 2.060470e-02 1.164399e-02 9.997198e-01 8.582886e+00
9.997264e-01 5.527315e-03 -2.272922e-02 -5.155474e-01 -5.816781e-03 9.999025e-01 -1.268908e-02 -3.121547e-01 2.265686e-02 1.281782e-02 9.996611e-01 9.440275e+00
9.996745e-01 6.000540e-03 -2.479692e-02 -5.624310e-01 -6.345160e-03 9.998840e-01 -1.384246e-02 -3.405416e-01 2.471098e-02 1.399530e-02 9.995966e-01 1.029896e+01
9.996182e-01 6.468772e-03 -2.686440e-02 -6.093087e-01 -6.873365e-03 9.998639e-01 -1.499561e-02 -3.689250e-01 2.676374e-02 1.517453e-02 9.995266e-01 1.115757e+01
9.995562e-01 7.058450e-03 -2.894213e-02 -6.562052e-01 -7.530449e-03 9.998399e-01 -1.623192e-02 -3.973964e-01 2.882292e-02 1.644266e-02 9.994492e-01 1.201541e+01
9.995095e-01 5.595311e-03 -3.081450e-02 -7.018788e-01 -6.093682e-03 9.998517e-01 -1.610315e-02 -4.239119e-01 3.071983e-02 1.628303e-02 9.993953e-01 1.286965e+01
The common name for your "projective transformation" is homography. In a calibrated setup (i.e. if you know your camera's field of view or, equivalently, its focal length) a homography can be decomposed into a 3D rotation and a translation, the latter only up to scale. The decomposition algorithm additionally produces the normal to the 3D plane inducing the homography. The algorithm has up to 4 solutions, of which only one is feasible when you apply additional constraints, such as that the matched image points triangulate in front of the camera, and that the general direction of the translation matches a known prior.
More information about the method is in a well-known paper by Malis and Vargas. There is an implementation in OpenCV, under the name decomposeHomographyMat.
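To compare a decomposed homography with the KITTI ground truth, the absolute poses first have to be turned into frame-to-frame motion, because each ground-truth row maps frame i into frame 0 rather than into frame i-1. The following is a minimal pure-Python sketch of that conversion; parse_pose and relative_pose are hypothetical helper names, and the two rows are the first two ground-truth samples quoted in the question.

```python
def parse_pose(line):
    """Turn one row of 12 numbers into a 3x3 rotation R and translation t."""
    v = [float(x) for x in line.split()]
    if len(v) != 12:
        raise ValueError("expected 12 values per pose row")
    rows = [v[0:4], v[4:8], v[8:12]]  # row-aligned 3x4 matrix [R | t]
    R = [row[:3] for row in rows]
    t = [row[3] for row in rows]
    return R, t

def relative_pose(pose_prev, pose_curr):
    """Pose of frame `curr` expressed in the coordinates of frame `prev`.

    With T = [R | t] mapping frame coordinates into frame 0, the relative
    transform is inv(T_prev) * T_curr, which gives
    R_rel = R_prev^T R_curr and t_rel = R_prev^T (t_curr - t_prev).
    """
    (Rp, tp), (Rc, tc) = pose_prev, pose_curr
    Rp_T = [[Rp[j][i] for j in range(3)] for i in range(3)]
    R_rel = [[sum(Rp_T[i][k] * Rc[k][j] for k in range(3))
              for j in range(3)] for i in range(3)]
    diff = [tc[i] - tp[i] for i in range(3)]
    t_rel = [sum(Rp_T[i][k] * diff[k] for k in range(3)) for i in range(3)]
    return R_rel, t_rel

row0 = ("1.000000e+00 9.043680e-12 2.326809e-11 5.551115e-17 "
        "9.043683e-12 1.000000e+00 2.392370e-10 3.330669e-16 "
        "2.326810e-11 2.392370e-10 9.999999e-01 -4.440892e-16")
row1 = ("9.999978e-01 5.272628e-04 -2.066935e-03 -4.690294e-02 "
        "-5.296506e-04 9.999992e-01 -1.154865e-03 -2.839928e-02 "
        "2.066324e-03 1.155958e-03 9.999971e-01 8.586941e-01")
R_rel, t_rel = relative_pose(parse_pose(row0), parse_pose(row1))
print(t_rel)
```

Because the homography decomposition recovers translation only up to scale, only the direction of t_rel (for example after normalizing it to unit length) can be compared with the decomposed translation, while R_rel can be compared with the decomposed rotation directly.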

Reducing Model file size in LIBSVM

I want to reduce the model file size. Can we reduce it by reducing the number of digits in the weights of the model file? The number of classes in my model is around 3800 and the number of features is around 357000. Here is an excerpt from the model file. Can I reduce the number of digits in these weights?
solver_type L2R_L2LOSS_SVC_DUAL
nr_class 3821
nr_feature 357021
bias -1.000000000000000
w
-0.6298615183549175 -0.6884816945277815 -0.9850473581929793
-0.2730180225739936 -0.4444522939544599 -0.3045368061994185
-0.6752904784743610 -0.4936186126242763 -0.8167435931134331
-0.8747648882598349 -0.4980187300672689 -0.8255372912521536
-0.3329812532124196 -0.1751416471640286 -0.7447656595877303
-0.4240569914873799 -0.9004909961812873 -0.9857813112641359
-0.3674085365663847 -0.4819407419877990 -0.3645238468547681
-0.5827397105860186 -0.7290781581209491 -0.8615229165775795
-0.3975308017493017 -0.6522787326004871 -0.9846626520798610
-0.5583216247458188 -0.9488816092738117 -0.6469158771901011
-0.2306256734853684 -0.2940612946888093 -0.6895719661937446
-0.3041407180695167 -0.5602587606930518 -0.4434458835686698
-0.3960629365410545 -0.7512211790407204 -0.6082476608695304
-1.336132842955273 -0.6057066303450040 -0.5726087731282288
-0.4918814547677718 -0.7606578865363953 -0.2951659264868926
-0.3881680788359501 -0.3109241231671961 -0.7078707491799914
-0.3623625688446360 -0.4430137729068305 -0.9279271098475936
-0.2290838088700753 -0.3870980678621480 -0.8000332693180561
-0.7964744879675550 -0.4950551119251316 -0.5201500981458075
-0.6654200978736288 -0.9037766341356712 -0.5921799507740539
-0.4552915755388566 -0.8048467444625557 -0.08638961422716016
-0.3175800991399296 -0.8889281355804046 -0.8889673432972257
0.009443893188055608 -0.3033030733905986 -0.6063958370642328
-0.7781676697747630 -0.9969339455729528 -0.7847641855193951
-0.3709450948897945 -0.9293821956430142 -0.6711216076980766
-0.6472048031763484 -0.2844660995208588 -0.4547657013618363
-0.3093274839631762 -0.8264594986328345 -0.2693948669009715
-0.5691246530468883 -0.5816949288414970 -0.7988407843132017
-0.5846410991542126 -0.6102733673192773 -0.9474472897104326
-0.4619018809588187 -0.6922626991585266 -0.8529509393486879
-0.9341690394723746 -0.2048861760333368 -0.5763255438056814
-0.4753823007333206 -0.9847858814169310 -0.6084670508904806
-0.6097889096385636 -0.1558026578670219 -0.5407452525949980
-0.8426597160875828 -0.5728578082647764 -0.6254655056167889
-0.5002570985981800 -0.5660289375686121 -0.6966970933117435
-0.3595184568720410 -0.8869769517170271 -0.8293060581021244
-0.7660244640066636 -0.9191108227612158 -0.7495472111112249
-0.3250789003708131 -0.8545862221106031 -0.9847863669982040
-0.9862358540926807 -0.9843872487122278 -0.3764841688606632
-0.6665806111063707 -0.6998869717621219 -0.8398491506346015
-0.7498849663083538 -0.2584536929034274 -0.8798094698402976
-0.8659064866640068 -0.8540212609217359 -0.4705628403387491
-0.9848057457322186 -0.5870303872290659 -0.9105115844147157
-0.6855534064105064 -0.7447256224770895 -0.9845164901161550
-0.9267803381073205 -0.6874399094864110 -0.9868490844056681
-0.9871049327408159 -0.9127271706215343 -0.8894132571749456
-0.7481430771200624 -0.7661512147794380 -0.4619076734386954
-0.3463253354355214 -0.7324122395130058 -0.7198934949704492
-0.3869971300152642 -0.3580173602243875 -0.8144411145869335
-0.4708508640578066 -0.7583061726079500 -0.6102585014526588
-0.2323551831668570 -0.7124730357532248 -0.6407019387626708
-0.8770555543363814 -0.7747723882503575 -0.8880529094965369
-0.5221765657051773 -0.8927103129537772 -0.8873570244928761
-0.6814118942525524 -0.4812414843861851 -0.07723442473878635
-0.3004215736435181 -0.7901826925719376 -0.6000050603345796
-0.9391488020802135 -0.6130019120301854 -0.6519260224181763
-0.6312423953207323 -0.6236684911320279 -0.8319901021019791
-0.9846585341126538 -0.8241847119432536 -0.9849733862258551
0.03619613868867930 -0.9402473523400392 -0.4963043182116479
-0.06988396609313940 -0.6160025364808686 -0.9485679374403244
-0.9552678112333591 -0.2951058860501357 -0.9871232492575841
-0.2801466899229405 -0.5623043303
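What reducing the number of digits would look like can be sketched as a small rewrite of the file. This is an illustration under assumptions, not a LIBSVM feature: round_weights is a hypothetical helper, it assumes everything after the w line is whitespace-separated floats as in the excerpt above, and whether the loss of precision is acceptable would have to be checked by re-evaluating the rewritten model on held-out data.

```python
import io

def round_weights(src, dst, digits=6):
    """Copy a model file, rewriting the weights after `w` with fewer digits."""
    in_weights = False
    for line in src:
        if not in_weights:
            dst.write(line)  # header lines are copied verbatim
            in_weights = line.strip() == "w"
            continue
        values = [float(x) for x in line.split()]
        # Keep `digits` significant digits per weight.
        dst.write(" ".join(f"{v:.{digits}g}" for v in values) + "\n")

# Header and first two weight rows taken from the excerpt above.
sample = (
    "solver_type L2R_L2LOSS_SVC_DUAL\n"
    "nr_class 3821\n"
    "nr_feature 357021\n"
    "bias -1.000000000000000\n"
    "w\n"
    "-0.6298615183549175 -0.6884816945277815 -0.9850473581929793\n"
    "-0.2730180225739936 -0.4444522939544599 -0.3045368061994185\n"
)
out = io.StringIO()
round_weights(io.StringIO(sample), out, digits=6)
print(out.getvalue())
```

Going from about 16 to 6 significant digits cuts each weight from roughly 19 characters to roughly 9, so the text file would shrink to about half its size under these assumptions.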

Why does the input LibSVM data format for Decision Tree in Spark MLLib look like this?

I am looking at the documentation of Decision Tree in Spark MLLib. Here is a line of code
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
that loads the input data. When I opened the sample_libsvm_data.txt file, one of the lines looked like:
0 128:51 129:159 130:253 131:159 132:50 155:48 156:238 157:252 158:252 159:252 160:237 182:54 183:227 184:253 185:252 186:239 187:233 188:252 189:57 190:6 208:10 209:60 210:224 211:252 212:253 213:252 214:202 215:84 216:252 217:253 218:122 236:163 237:252 238:252 239:252 240:253 241:252 242:252 243:96 244:189 245:253 246:167 263:51 264:238 265:253 266:253 267:190 268:114 269:253 270:228 271:47 272:79 273:255 274:168 290:48 291:238 292:252 293:252 294:179 295:12 296:75 297:121 298:21 301:253 302:243 303:50 317:38 318:165 319:253 320:233 321:208 322:84 329:253 330:252 331:165 344:7 345:178 346:252 347:240 348:71 349:19 350:28 357:253 358:252 359:195 372:57 373:252 374:252 375:63 385:253 386:252 387:195 400:198 401:253 402:190 413:255 414:253 415:196 427:76 428:246 429:252 430:112 441:253 442:252 443:148 455:85 456:252 457:230 458:25 467:7 468:135 469:253 470:186 471:12 483:85 484:252 485:223 494:7 495:131 496:252 497:225 498:71 511:85 512:252 513:145 521:48 522:165 523:252 524:173 539:86 540:253 541:225 548:114 549:238 550:253 551:162 567:85 568:252 569:249 570:146 571:48 572:29 573:85 574:178 575:225 576:253 577:223 578:167 579:56 595:85 596:252 597:252 598:252 599:229 600:215 601:252 602:252 603:252 604:196 605:130 623:28 624:199 625:252 626:252 627:253 628:252 629:252 630:233 631:145 652:25 653:128 654:252 655:253 656:252 657:141 658:37
I understand that the first element is the class label (0), and I know about the decision tree algorithm, but I don't understand why each feature looks like a tuple. Shouldn't we have just numbers representing features? What is the meaning of 128:51 as a feature value here?
128:51 means that column 128 holds the value 51. This is the SVMLight format, first introduced in svmlight, and it is good for representing sparse vectors. Indices that are not listed are omitted because those features have the value 0. In other words, all columns from 1 to 127 are 0 in your example.
Note: the indices in the LIBSVM file are one-based, while Spark's sparse vectors are zero-based. MLUtils.loadLibSVMFile converts the indices when loading, so the entry 128:51 ends up at index 127 of the resulting vector.
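To make the index:value notation concrete, here is a minimal pure-Python sketch that parses one such line into a sparse mapping and expands it into a dense list. The helper names are hypothetical, the vector size of 200 is arbitrary, and the sketch assumes one-based indices in the file that are shifted to zero-based positions, which is how Spark's loader treats them.

```python
def parse_libsvm_line(line):
    """Parse 'label idx:val idx:val ...' into (label, {zero_based_idx: val})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index) - 1] = float(value)  # file indices are one-based
    return label, features

def to_dense(features, size):
    """Expand the sparse mapping into a dense list, filling gaps with 0.0."""
    dense = [0.0] * size
    for index, value in features.items():
        dense[index] = value
    return dense

# First three entries of the sample line from the question.
label, features = parse_libsvm_line("0 128:51 129:159 130:253")
dense = to_dense(features, size=200)  # size chosen arbitrarily for the demo
print(label, dense[0], dense[127], dense[128], dense[129])
```

Only three of the 200 positions are non-zero here, which is why the format stores just the index:value pairs instead of every feature.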
