Reducing model file size in LIBSVM

I want to reduce the model file size. Can this be done by reducing the number of digits in the weights stored in the model file? My model has about 3,800 classes and about 357,000 features. Here is an excerpt from the model file. Can I safely reduce the number of digits in these weights?
solver_type L2R_L2LOSS_SVC_DUAL
nr_class 3821
nr_feature 357021
bias -1.000000000000000
w
-0.6298615183549175 -0.6884816945277815 -0.9850473581929793
-0.2730180225739936 -0.4444522939544599 -0.3045368061994185
-0.6752904784743610 -0.4936186126242763 -0.8167435931134331
-0.8747648882598349 -0.4980187300672689 -0.8255372912521536
-0.3329812532124196 -0.1751416471640286 -0.7447656595877303
-0.4240569914873799 -0.9004909961812873 -0.9857813112641359
-0.3674085365663847 -0.4819407419877990 -0.3645238468547681
-0.5827397105860186 -0.7290781581209491 -0.8615229165775795
-0.3975308017493017 -0.6522787326004871 -0.9846626520798610
-0.5583216247458188 -0.9488816092738117 -0.6469158771901011
-0.2306256734853684 -0.2940612946888093 -0.6895719661937446
-0.3041407180695167 -0.5602587606930518 -0.4434458835686698
-0.3960629365410545 -0.7512211790407204 -0.6082476608695304
-1.336132842955273 -0.6057066303450040 -0.5726087731282288
-0.4918814547677718 -0.7606578865363953 -0.2951659264868926
-0.3881680788359501 -0.3109241231671961 -0.7078707491799914
-0.3623625688446360 -0.4430137729068305 -0.9279271098475936
-0.2290838088700753 -0.3870980678621480 -0.8000332693180561
-0.7964744879675550 -0.4950551119251316 -0.5201500981458075
-0.6654200978736288 -0.9037766341356712 -0.5921799507740539
-0.4552915755388566 -0.8048467444625557 -0.08638961422716016
-0.3175800991399296 -0.8889281355804046 -0.8889673432972257
0.009443893188055608 -0.3033030733905986 -0.6063958370642328
-0.7781676697747630 -0.9969339455729528 -0.7847641855193951
-0.3709450948897945 -0.9293821956430142 -0.6711216076980766
-0.6472048031763484 -0.2844660995208588 -0.4547657013618363
-0.3093274839631762 -0.8264594986328345 -0.2693948669009715
-0.5691246530468883 -0.5816949288414970 -0.7988407843132017
-0.5846410991542126 -0.6102733673192773 -0.9474472897104326
-0.4619018809588187 -0.6922626991585266 -0.8529509393486879
-0.9341690394723746 -0.2048861760333368 -0.5763255438056814
-0.4753823007333206 -0.9847858814169310 -0.6084670508904806
-0.6097889096385636 -0.1558026578670219 -0.5407452525949980
-0.8426597160875828 -0.5728578082647764 -0.6254655056167889
-0.5002570985981800 -0.5660289375686121 -0.6966970933117435
-0.3595184568720410 -0.8869769517170271 -0.8293060581021244
-0.7660244640066636 -0.9191108227612158 -0.7495472111112249
-0.3250789003708131 -0.8545862221106031 -0.9847863669982040
-0.9862358540926807 -0.9843872487122278 -0.3764841688606632
-0.6665806111063707 -0.6998869717621219 -0.8398491506346015
-0.7498849663083538 -0.2584536929034274 -0.8798094698402976
-0.8659064866640068 -0.8540212609217359 -0.4705628403387491
-0.9848057457322186 -0.5870303872290659 -0.9105115844147157
-0.6855534064105064 -0.7447256224770895 -0.9845164901161550
-0.9267803381073205 -0.6874399094864110 -0.9868490844056681
-0.9871049327408159 -0.9127271706215343 -0.8894132571749456
-0.7481430771200624 -0.7661512147794380 -0.4619076734386954
-0.3463253354355214 -0.7324122395130058 -0.7198934949704492
-0.3869971300152642 -0.3580173602243875 -0.8144411145869335
-0.4708508640578066 -0.7583061726079500 -0.6102585014526588
-0.2323551831668570 -0.7124730357532248 -0.6407019387626708
-0.8770555543363814 -0.7747723882503575 -0.8880529094965369
-0.5221765657051773 -0.8927103129537772 -0.8873570244928761
-0.6814118942525524 -0.4812414843861851 -0.07723442473878635
-0.3004215736435181 -0.7901826925719376 -0.6000050603345796
-0.9391488020802135 -0.6130019120301854 -0.6519260224181763
-0.6312423953207323 -0.6236684911320279 -0.8319901021019791
-0.9846585341126538 -0.8241847119432536 -0.9849733862258551
0.03619613868867930 -0.9402473523400392 -0.4963043182116479
-0.06988396609313940 -0.6160025364808686 -0.9485679374403244
-0.9552678112333591 -0.2951058860501357 -0.9871232492575841
-0.2801466899229405 -0.5623043303
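As a rough illustration of the idea (this is not part of LIBSVM/LIBLINEAR itself), the weight section could be post-processed after training and rewritten with fewer significant digits. The file names and the 6-significant-digit precision below are arbitrary choices, and whether the resulting precision loss is acceptable would have to be checked against validation accuracy.

def shrink_model(src="model", dst="model_small", digits=6):
    """Copy a LIBLINEAR-style model file, rounding every weight after the 'w' marker."""
    with open(src) as fin, open(dst, "w") as fout:
        in_weights = False
        for line in fin:
            if in_weights:
                # rewrite each weight with `digits` significant digits
                fout.write(" ".join(f"{float(tok):.{digits}g}"
                                    for tok in line.split()) + "\n")
            else:
                fout.write(line)
                if line.strip() == "w":  # the weight section starts after this header line
                    in_weights = True

shrink_model()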

Related

Numbers of hidden layers and units in AutoKeras dense block

I am training a model with AutoKeras. So far my best model is this:
structured_data_block_1/normalize:false
structured_data_block_1/dense_block_1/use_batchnorm:true
structured_data_block_1/dense_block_1/num_layers:2
structured_data_block_1/dense_block_1/units_0:32
structured_data_block_1/dense_block_1/dropout:0
structured_data_block_1/dense_block_1/units_1:32
dense_block_2/use_batchnorm:true
dense_block_2/num_layers:2
dense_block_2/units_0:128
dense_block_2/dropout:0
dense_block_2/units_1:16
dense_block_3/use_batchnorm:false
dense_block_3/num_layers:1
dense_block_3/units_0:32
dense_block_3/dropout:0
dense_block_3/units_1:32
regression_head_1/dropout:0
optimizer:"adam"
learning_rate:0.1
dense_block_2/units_2:32
structured_data_block_1/dense_block_1/units_2:256
dense_block_3/units_2:128
My first dense_block_1 has 2 layers (num_layers: 2), so how can it have three unit counts? It says units_0: 32, units_1: 32 and units_2: 256, which implies to me that there are three layers, so why is num_layers: 2?
If I wanted to recreate the above model with this code, how would I do it properly?
input_node = ak.StructuredDataInput()
output_node = ak.StructuredDataBlock(categorical_encoding=False, normalize=False)(input_node)
output_node = ak.DenseBlock()(output_node)
output_node = ak.DenseBlock()(output_node)
output_node = ak.RegressionHead()(output_node)
Thanks for any input.
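For what it is worth, below is a sketch of how the searched settings listed above might be pinned when rebuilding the graph. It assumes an AutoKeras version whose DenseBlock and RegressionHead expose num_layers, use_batchnorm and dropout as constructor arguments; the per-layer unit counts (units_0, units_1, ...) and the optimizer/learning rate are still left to the tuner here, and dense_block_1 lives inside the StructuredDataBlock, so treat this as a starting point rather than an exact reconstruction.

import autokeras as ak

input_node = ak.StructuredDataInput()
# structured_data_block_1 (its internal dense_block_1 is tuned inside the block)
output_node = ak.StructuredDataBlock(categorical_encoding=False, normalize=False)(input_node)
# dense_block_2: 2 layers, batchnorm on, no dropout
output_node = ak.DenseBlock(num_layers=2, use_batchnorm=True, dropout=0.0)(output_node)
# dense_block_3: 1 layer, no batchnorm, no dropout
output_node = ak.DenseBlock(num_layers=1, use_batchnorm=False, dropout=0.0)(output_node)
output_node = ak.RegressionHead(dropout=0.0)(output_node)

auto_model = ak.AutoModel(inputs=input_node, outputs=output_node, max_trials=1, overwrite=True)
# auto_model.fit(x_train, y_train)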

How to work around truncating long sentences with Hugging Face tokenizers?

I am new to tokenizers. My understanding is that the truncation attribute just cuts the sentence, but I need the whole sentence for context.
For example, my sentence is:
"Ali bin Abbas'ın Kitab Kamilü-s Sina adlı eseri daha sonra 980 yılında nasıl adlandırılmıştır? Ali bin Abbas'ın eseri Rezi'nin hangi isimli eserinden daha özlü ve daha sistematikdir? Ali bin Abbas'ın Kitab Kamilü-s Sina adlı eseri İbn-i Sina'nın hangi isimli eserinden daha uygulamalı bir biçimde yazılmıştır? Kitab el-Maliki Avrupa'da Constantinus Africanus tarafından hangi dile çevrilmiştir? Kitab el-Maliki'nin ilk bölümünde neye ağırlık verilmiştir?
But when I use max_length=64, truncation=True and pad_to_max_length=True for my encoder (as suggested on the internet), half of the sentence is cut off:
▁Ali', '▁bin', '▁Abbas', "'", 'ın', '▁Kitab', '▁Kami', 'lü', '-', 's', '▁Sina', '▁ad', 'lı', '▁es', 'eri', '▁daha', '▁sonra', '▁980', '▁yıl', 'ında', '▁na', 'sıl', '▁adlandır', 'ılmıştır', '?', '▁', '<sep>', '▁Ali', '▁bin', '▁Abbas', "'", 'ın', '▁es', 'eri', '▁Rez', 'i', "'", 'nin', '▁', 'hangi', '▁is', 'imli', '▁es', 'erinden', '▁daha', '▁', 'özlü', '▁ve', '▁daha', '▁sistema', 'tik', 'dir', '?', '▁', '<sep>', '▁Ali', '▁bin', '▁Abbas', "'", 'ın', '▁Kitab', '▁Kami', 'lü', '</s>']
And when I increase max_length, CUDA runs out of memory, of course. What should my approach be for long texts in the dataset?
My code for encoding:
input_encodings = tokenizer.batch_encode_plus(
    example_batch['context'],
    max_length=512,
    add_special_tokens=True,
    truncation=True,
    pad_to_max_length=True)

target_encodings = tokenizer.batch_encode_plus(
    example_batch['questions'],
    max_length=64,
    add_special_tokens=True,
    truncation=True,
    pad_to_max_length=True)
Yes, truncation just keeps the given number of subwords from the left. The workaround depends on the task you are solving and the data you use.
Are long sequences frequent in your data? If not, you can safely throw those instances away, because it is unlikely the model would learn to generalize to long sequences anyway.
If you really need the long context, you have plenty of options:
Decrease the batch size (and perhaps accumulate gradients over several batches before updating).
Make the model smaller: use either a smaller hidden dimension or fewer layers.
Use a different architecture: Transformers need memory quadratic in the sequence length. Wouldn't an LSTM or CNN do the job? There are also architectures designed for long sequences (e.g., Reformer, Longformer).
If you need to use a pre-trained BERT-like model and no existing model of a suitable size fits your needs, you can distill a smaller model, or a model with a more suitable architecture, yourself.
Perhaps you can split the input. In tasks like answer span selection, you can split the text in which you are looking for an answer into smaller chunks and search the chunks independently (see the sketch below).
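To make that last option concrete, here is a rough sketch of letting a fast Hugging Face tokenizer split each context into overlapping windows instead of discarding the tail. return_overflowing_tokens and stride are standard arguments of the fast tokenizers; the 512/128 window sizes are arbitrary choices for illustration.

chunked = tokenizer(
    example_batch['context'],
    max_length=512,
    truncation=True,
    padding='max_length',
    stride=128,                      # overlap between consecutive windows
    return_overflowing_tokens=True)  # emit one encoded row per window (fast tokenizers only)

# each window can be traced back to the example it came from
print(chunked['overflow_to_sample_mapping'][:10])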

My tensorflow object detection model produced the average precision of zero for first class

I have trained an object detection model for three classes: id=1 (LR), id=2 (PM), id=3 (YR). The model produced AP(LR): 0.002, AP(PM): 0.84, AP(YR): 1.00. After that I changed the ids to id=1 (YR), id=2 (PM), id=3 (LR), and the model gives AP(YR): 0.002, AP(PM): 0.79, AP(LR): 0.89.
Is the first class being treated as a dummy class, or is there another reason for this? Please help me out with this.
Following are the changes I made in the .config file to get the average precision:
eval_config: {
  metrics_set: "pascal_voc_detection_metrics"
  use_moving_averages: false
  batch_size: 1;
  num_visualizations: 20
  max_num_boxes_to_visualize: 10
  visualize_groundtruth_boxes: true
  eval_interval_secs: 30
}

Detect highest peaks automatically from noisy data in Python

Is there any way to detect the highest peaks using a Python library without setting any parameters? I'm developing a user interface and I want the algorithm to be able to detect the highest peaks automatically.
I want it to be able to detect the peaks in the picture below:
graph here
Data looks like this:
8.60291e-07
-1.5491e-06
5.64568e-07
-9.51195e-07
1.07203e-06
4.6521e-07
6.43967e-07
-9.86092e-07
-9.82323e-07
6.38977e-07
-1.93884e-06
-2.98309e-08
1.33543e-06
1.05064e-06
1.17332e-06
-1.53549e-07
-8.9357e-07
1.59176e-06
-2.17331e-06
1.46756e-06
5.63301e-07
-8.77556e-07
7.47681e-09
-8.30101e-07
-3.6647e-07
5.27046e-07
-1.94983e-06
1.89018e-07
1.22533e-06
8.00735e-07
-8.51166e-07
1.13437e-06
-2.75787e-07
1.79601e-06
-1.67875e-06
1.13529e-06
-1.29865e-06
9.9688e-07
-9.34486e-07
8.89931e-07
-3.88634e-07
1.15124e-06
-4.23569e-07
-1.8029e-07
1.20537e-07
4.10736e-07
-9.99077e-07
-3.62984e-07
2.97916e-06
-1.95828e-06
-1.07398e-06
2.422e-06
-6.33202e-07
-1.36953e-06
1.6694e-06
-4.71764e-07
3.98849e-07
-1.0071e-06
-9.72984e-07
8.13553e-07
2.64193e-06
-3.12365e-06
1.34049e-06
-1.30419e-06
1.48369e-07
1.26033e-06
-2.59872e-07
4.28284e-07
-6.44356e-07
2.99934e-07
8.34335e-07
3.53226e-07
-7.08252e-07
4.1243e-07
2.41525e-06
-8.92159e-07
8.82339e-08
4.31945e-06
3.75152e-06
1.091e-06
3.8204e-06
-1.21356e-06
3.35564e-06
-1.06234e-06
-5.99808e-07
2.18155e-06
5.90652e-07
-1.36728e-06
-4.97017e-07
-7.77283e-08
8.68263e-07
4.37645e-07
-1.26514e-06
2.26413e-06
-8.52966e-07
-7.35596e-07
4.11911e-07
1.7585e-06
-inf
1.10779e-08
-1.49507e-06
9.87305e-07
-3.85296e-06
4.31265e-06
-9.89227e-07
-1.33537e-06
4.1713e-07
1.89362e-07
3.21968e-07
6.80237e-08
2.31636e-07
-2.98523e-07
7.99133e-07
7.36305e-07
6.39862e-07
-1.11932e-06
-1.57262e-06
1.86305e-06
-3.63716e-07
3.83865e-07
-5.23293e-07
1.31812e-06
-1.23608e-06
2.54684e-06
-3.99796e-06
2.90441e-06
-5.20203e-07
1.36295e-06
-1.89317e-06
1.22366e-06
-1.10373e-06
2.71276e-06
9.48181e-07
7.70881e-06
5.17066e-06
6.21254e-06
1.3513e-05
1.47878e-05
8.78543e-06
1.61819e-05
1.68438e-05
1.16082e-05
5.74059e-06
4.92458e-06
1.11884e-06
-1.07419e-06
-1.28517e-06
-2.70949e-06
1.65662e-06
1.42964e-06
3.40604e-06
-5.82825e-07
1.98288e-06
1.42819e-06
1.65517e-06
4.42749e-07
-1.95609e-06
-2.1756e-07
1.69164e-06
8.7204e-08
-5.35324e-07
7.43546e-07
-1.08687e-06
2.07289e-06
2.18529e-06
-2.8161e-06
1.88821e-06
4.07272e-07
1.063e-06
8.47244e-07
1.53879e-06
-9.0799e-07
-1.26709e-07
2.40044e-06
-9.48166e-07
1.41788e-06
3.67615e-07
-1.29199e-06
3.868e-06
9.54654e-06
2.51951e-05
2.2769e-05
7.21716e-06
1.36545e-06
-1.32681e-06
-3.09641e-06
4.90417e-07
2.99335e-06
1.578e-06
6.0025e-07
2.90656e-06
-2.08258e-06
-1.54214e-06
2.19757e-07
3.74982e-06
-1.76944e-06
2.15018e-06
-1.01935e-06
4.37469e-07
1.39078e-06
6.39587e-07
-1.7807e-06
-6.16455e-09
1.61557e-06
1.59644e-06
-2.35217e-06
5.29449e-07
1.9169e-06
-7.54822e-07
2.00342e-06
-3.28452e-06
3.91663e-06
1.66016e-08
-2.65897e-06
-1.4064e-06
4.67987e-07
1.67786e-06
4.69543e-07
-8.90106e-07
-1.4584e-06
1.37915e-06
1.98483e-06
-2.3735e-06
4.45618e-07
1.91504e-06
1.09653e-06
-8.00873e-07
1.32321e-06
2.04846e-06
-1.50656e-06
7.23816e-07
2.06049e-06
-2.43918e-06
1.64417e-06
2.65411e-07
-2.66107e-06
-8.01788e-07
2.05121e-06
-1.74988e-06
1.83594e-06
-8.14026e-07
-2.69342e-06
1.81152e-06
1.11664e-07
-4.21863e-06
-7.20551e-06
-5.92407e-07
-1.44629e-06
-2.08136e-06
2.86105e-06
3.77911e-06
-1.91898e-06
1.41742e-06
2.67914e-07
-8.55835e-07
-9.8584e-07
-2.74115e-06
3.39044e-06
1.39639e-06
-2.4964e-06
8.2486e-07
2.02432e-06
1.65793e-06
-1.43094e-06
-3.36807e-06
-8.96515e-07
5.31323e-06
-8.27209e-07
-1.39221e-06
-3.3754e-06
2.12372e-06
3.08218e-06
-1.42947e-06
-2.36777e-06
3.86218e-06
2.29327e-06
-3.3941e-06
-1.67291e-06
2.63828e-06
2.21008e-07
7.07794e-07
1.8172e-06
-2.00082e-06
1.80664e-06
6.69739e-07
-3.95395e-06
1.92148e-06
-1.07187e-06
-4.04938e-07
-1.76553e-06
2.7099e-06
1.30768e-06
1.41812e-06
-1.55518e-07
-3.78302e-06
4.00137e-06
-8.38623e-07
4.54651e-07
1.00027e-06
1.32196e-06
-2.62717e-06
1.67865e-06
-6.99249e-07
2.8837e-06
-1.00516e-06
-3.68011e-06
1.61847e-06
1.90887e-06
1.59641e-06
4.16779e-07
-1.35245e-06
1.65717e-06
-2.92667e-06
3.6203e-07
2.53528e-06
-2.0578e-07
-3.41919e-07
-1.42154e-06
-2.33322e-06
3.07175e-06
-2.69165e-08
-8.21045e-07
2.3175e-06
-7.22992e-07
1.49069e-06
8.75488e-07
-2.02676e-06
-2.81158e-07
3.6004e-06
-3.94708e-06
4.72983e-06
-1.38873e-06
-6.92139e-08
-1.4678e-06
1.04251e-06
-2.06625e-06
3.10406e-06
-8.13873e-07
7.23694e-07
-9.78912e-07
-8.65967e-07
7.37335e-07
1.52563e-06
-2.33591e-06
1.78265e-06
9.58435e-07
-5.22064e-07
-2.29736e-07
-4.26996e-06
-6.61411e-06
1.14789e-06
-4.32697e-06
-5.32779e-06
2.12241e-06
-1.40726e-06
1.76086e-07
-3.77194e-06
-2.71326e-06
-9.49402e-08
1.70807e-07
-2.495e-06
4.22324e-06
-3.62476e-06
-9.56055e-07
7.16583e-07
3.01447e-06
-1.41229e-06
-1.67694e-06
7.61627e-07
3.55881e-06
2.31015e-06
-9.50378e-07
4.45251e-08
-1.94791e-06
2.27081e-06
-3.34717e-06
3.05688e-06
4.57062e-07
3.87326e-06
-2.39215e-06
-3.52682e-06
-2.05212e-06
5.26495e-06
-3.28613e-07
-5.76569e-07
-7.46338e-07
5.98795e-06
8.80493e-07
-4.82965e-06
2.56839e-06
-1.58792e-06
-2.2294e-06
1.83841e-06
2.65482e-06
-3.10474e-06
-3.46741e-07
2.45557e-06
2.01328e-06
-3.92606e-06
inf
-8.11737e-07
5.72174e-07
1.57245e-06
8.02612e-09
-2.901e-06
1.22079e-06
-6.31714e-07
3.06241e-06
1.20059e-06
-1.80344e-06
4.90784e-07
3.74243e-06
-2.94342e-07
-3.45764e-08
-3.42099e-06
-1.43695e-06
5.91064e-07
3.47308e-06
3.78232e-06
4.01093e-07
-1.58435e-06
-3.47375e-06
1.34943e-06
1.11768e-06
1.95212e-06
-8.28033e-07
1.53705e-06
6.38031e-07
-1.84702e-06
1.34689e-06
-6.98669e-07
1.81653e-06
-2.42355e-06
-1.35257e-06
3.04367e-06
-1.21976e-06
1.61896e-06
-2.69528e-06
1.84601e-06
6.45447e-08
-4.94263e-07
3.47568e-06
-2.00531e-06
3.56693e-06
-3.19446e-06
2.72141e-06
-1.39059e-06
2.20032e-06
-1.76819e-06
2.32727e-07
-3.47382e-07
2.11823e-07
-5.22614e-07
2.69846e-06
-1.47983e-06
2.14554e-06
-6.27594e-07
-8.8501e-10
7.89124e-07
-2.8653e-07
8.30902e-07
-2.12857e-06
-1.90887e-07
1.07593e-06
1.40781e-06
2.41641e-06
-4.52689e-06
2.37207e-06
-2.19479e-06
1.65131e-06
1.2706e-06
-2.18387e-06
-1.72821e-07
5.41687e-07
7.2879e-07
7.56927e-07
1.57739e-06
-3.79395e-07
-1.02887e-06
-1.20987e-06
1.43066e-06
8.96301e-08
5.09766e-07
-2.8812e-06
-2.35944e-06
2.25912e-06
-2.78967e-06
-4.69913e-06
1.60822e-06
6.9342e-07
4.6225e-07
-1.33276e-06
-3.59033e-06
1.11206e-06
1.83521e-06
2.39163e-06
2.3468e-08
5.91431e-07
-8.80249e-07
-2.77405e-08
-1.13184e-06
-1.28036e-06
1.66229e-06
2.81784e-06
-2.97589e-06
8.73413e-08
1.06439e-06
2.39075e-06
-2.76974e-06
1.20862e-06
-5.12817e-07
-5.19104e-07
4.51324e-07
-4.7168e-07
2.35608e-06
5.46906e-07
-1.66748e-06
5.85236e-07
6.42944e-07
2.43164e-07
4.01031e-07
-1.93646e-06
2.07416e-06
-1.16116e-06
4.27155e-07
5.2951e-07
9.09149e-07
-8.71887e-08
-1.5564e-09
1.07266e-06
-9.49402e-08
2.04016e-06
-6.38123e-07
-1.94241e-06
-5.17294e-06
-2.18622e-06
-8.26703e-06
2.54364e-06
4.32614e-06
8.3847e-07
-2.85309e-06
2.72345e-06
-3.42752e-06
-1.36871e-07
2.23346e-06
5.26825e-07
1.3566e-06
-2.17111e-06
2.1463e-07
2.06479e-06
1.76929e-06
-1.2655e-06
-1.3797e-06
3.10706e-06
-4.72189e-06
4.38138e-06
6.41815e-07
-3.25623e-08
-4.93707e-06
5.05743e-06
5.17578e-07
-5.30524e-06
3.62463e-06
5.68909e-07
1.16226e-06
1.10843e-06
-5.00854e-07
9.48761e-07
-2.18701e-06
-3.57635e-07
4.26709e-06
-1.50836e-06
-5.84412e-06
3.5054e-06
3.94019e-06
-4.7623e-06
2.05856e-06
-2.22992e-07
1.64969e-06
2.64694e-06
-8.49487e-07
-3.63562e-06
1.0386e-06
1.69461e-06
-2.05798e-06
3.60349e-06
3.42651e-07
-1.46686e-06
1.19949e-06
-1.60519e-06
2.37793e-07
6.12366e-07
-1.54669e-06
1.43668e-06
1.87009e-06
-2.22626e-06
2.15155e-06
-3.10571e-06
2.05188e-06
-4.40002e-07
2.06683e-06
-1.11362e-06
5.96924e-07
-2.64471e-06
2.4892e-06
1.13083e-06
-3.23181e-07
5.10651e-07
2.73499e-07
-1.24899e-06
1.40564e-06
-9.3158e-07
1.45947e-06
3.70544e-07
-1.62628e-06
-1.70215e-06
1.72098e-06
8.19031e-07
-5.57709e-07
1.10107e-06
-2.81845e-06
1.57654e-07
3.30716e-06
-9.75403e-07
1.73126e-07
1.30447e-06
7.64771e-08
-6.65344e-07
-1.4346e-06
5.03171e-06
-2.84576e-06
2.3212e-06
-2.73373e-06
2.16675e-08
2.24026e-06
-4.11682e-08
-3.36642e-06
1.78775e-06
1.28174e-08
-9.32068e-07
2.97177e-06
-1.05338e-06
9.42505e-07
2.02362e-07
-1.81326e-06
2.16995e-06
2.83722e-07
-1.2648e-06
9.21814e-07
-8.9447e-07
-1.61597e-06
3.5036e-06
-6.79626e-08
1.52823e-06
-2.98682e-06
5.57404e-07
9.5166e-07
7.10419e-07
-1.28528e-06
-3.76038e-07
-1.03845e-06
2.96631e-06
-1.18356e-06
-2.77313e-07
3.24149e-06
-1.85455e-06
-1.27747e-07
3.6264e-07
4.66431e-07
-1.54443e-06
1.38437e-06
-1.53119e-06
7.4231e-07
-1.2388e-06
1.99774e-06
1.15799e-06
1.39478e-06
-2.93527e-06
-2.03012e-06
2.46667e-06
2.16751e-06
-2.50354e-06
3.95905e-07
5.74371e-07
1.33575e-07
-3.98315e-07
4.93927e-07
-5.23987e-07
-1.74713e-07
6.49384e-07
-7.16766e-07
2.35733e-06
-4.91333e-08
-1.88138e-06
1.74722e-06
4.03503e-07
3.5965e-07
1.44836e-07
The task you are describing could be treated as anomaly/outlier detection.
One possible solution is to use a z-score transformation and treat every value with a z-score above a certain threshold as an outlier. Because there is no clear definition of an outlier, such peaks cannot be detected without setting any parameters at all (here, the threshold).
One possible solution could be:
import numpy as np

def detect_outliers(data):
    outliers = []
    d_mean = np.mean(data)
    d_std = np.std(data)
    threshold = 3  # this defines what you would consider a peak (outlier)
    for point in data:
        z_score = (point - d_mean) / d_std
        if np.abs(z_score) > threshold:
            outliers.append(point)
    return outliers

# create normal data
data = np.random.normal(size=100)
# create outliers
outliers = np.random.normal(100, size=3)
# combine normal data and outliers
full_data = data.tolist() + outliers.tolist()
# print outliers
print(detect_outliers(full_data))
If you only want to detect peaks (and not troughs), remove the np.abs call from the code.
This code snippet is based on a Medium post, which also shows another way of detecting outliers.
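As a usage note, if you want to run the snippet above directly on the series posted in the question, keep in mind that it contains inf/-inf entries, which would break the mean and standard deviation. A hedged example (the file name is hypothetical) that drops non-finite values first:

import numpy as np

signal = np.loadtxt("signal.txt")     # one value per line, as posted above
signal = signal[np.isfinite(signal)]  # drop the inf / -inf entries
print(detect_outliers(signal))        # the ~1e-05 spikes should exceed the z-score threshold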

Why do mllib word2vec word vectors only have 100 elements?

I have a word2vec model that I created in PySpark. The model is saved as a .parquet file. I want to be able to access and query the model (or the words and word vectors) using vanilla Python, because I am building a Flask app that will allow a user to enter words of interest to find synonyms.
I've extracted the words and word vectors, but I've noticed that while I have approximately 7000 unique words, my word vectors have a length of 100. For example, here are two words, "serious" and "breaks"; their vectors only have a length of 100. Why is this? How is it able to reconstruct the entire vector space with only 100 values for each word? Is it simply giving me only the top 100 or the first 100 values?
vectors.take(2)
Out[48]:
[Row(word=u'serious', vector=DenseVector([0.0784, -0.0882, -0.0342, -0.0153, 0.0223, 0.1034, 0.1218, -0.0814, -0.0198, -0.0325, -0.1024, -0.2412, -0.0704, -0.1575, 0.0342, -0.1447, -0.1687, 0.0673, 0.1248, 0.0623, -0.0078, -0.0813, 0.0953, -0.0213, 0.0031, 0.0773, -0.0246, -0.0822, -0.0252, -0.0274, -0.0288, 0.0403, -0.0419, -0.1122, -0.0397, 0.0186, -0.0038, 0.1279, -0.0123, 0.0091, 0.0065, 0.0884, 0.0899, -0.0479, 0.0328, 0.0171, -0.0962, 0.0753, -0.187, 0.034, -0.1393, -0.0575, -0.019, 0.0151, -0.0205, 0.0667, 0.0762, -0.0365, -0.025, -0.184, -0.0118, -0.0964, 0.1744, 0.0563, -0.0413, -0.054, -0.1764, -0.087, 0.0747, -0.022, 0.0778, -0.0014, -0.1313, -0.1133, -0.0669, 0.0007, -0.0378, -0.1093, -0.0732, 0.1494, -0.0815, -0.0137, 0.1009, -0.0057, 0.0195, 0.0085, 0.025, 0.0064, 0.0076, 0.0676, 0.1663, -0.0078, 0.0278, 0.0519, -0.0615, -0.0833, 0.0643, 0.0032, -0.0882, 0.1033])),
Row(word=u'breaks', vector=DenseVector([0.0065, 0.0027, -0.0121, 0.0296, -0.0467, 0.0297, 0.0499, 0.0843, 0.1027, 0.0179, -0.014, 0.0586, 0.06, 0.0534, 0.0391, -0.0098, -0.0266, -0.0422, 0.0188, 0.0065, -0.0309, 0.0038, -0.0458, -0.0252, 0.0428, 0.0046, -0.065, -0.0822, -0.0555, -0.0248, -0.0288, -0.0016, 0.0334, -0.0028, -0.0718, -0.0571, -0.0668, -0.0073, 0.0658, -0.0732, 0.0976, -0.0255, -0.0712, 0.0899, 0.0065, -0.04, 0.0964, 0.0356, 0.0142, 0.0857, 0.0669, -0.038, -0.0728, -0.0446, 0.1194, -0.056, 0.1022, 0.0459, -0.0343, -0.0861, -0.0943, -0.0435, -0.0573, 0.0229, 0.0368, 0.085, -0.0218, -0.0623, 0.0502, -0.0645, 0.0247, -0.0371, -0.0785, 0.0371, -0.0047, 0.0012, 0.0214, 0.0669, 0.049, -0.0294, -0.0272, 0.0642, -0.006, -0.0804, -0.06, 0.0719, -0.0109, -0.0272, -0.0366, 0.0041, 0.0556, 0.0108, 0.0624, 0.0134, -0.0094, 0.0219, 0.0164, -0.0545, -0.0055, -0.0193]))]
Any thoughts on the best way to reconstruct this model in vanilla Python?
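A hedged sketch of what querying the exported vectors in plain Python could look like, assuming the vectors have already been re-exported to parquet as a "word" column plus a plain float-array "vector" column (Spark's own DenseVector struct would need an extra conversion step), and with a hypothetical file path:

import numpy as np
import pandas as pd

df = pd.read_parquet("word_vectors.parquet")       # columns: word, vector (list of floats)
vocab = df["word"].tolist()
mat = np.vstack(df["vector"].to_numpy())           # shape (n_words, 100)
mat /= np.linalg.norm(mat, axis=1, keepdims=True)  # normalise rows for cosine similarity

def synonyms(word, k=10):
    scores = mat @ mat[vocab.index(word)]
    return [vocab[i] for i in np.argsort(-scores)[1:k + 1]]  # skip the word itself

print(synonyms("serious"))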
Just to improve on the comment by zero323, for anyone else who arrives here:
Word2Vec defaults to creating word vectors of 100 dimensions. To change this:
model = Word2Vec(sentences, size=300)
when initializing the model will create vectors of 300 dimensions.
I think the problem lies with your minCount parameter value for the Word2Vec model.
If this value is too high, fewer words get used in training the model, resulting in a smaller vocabulary.
7000 unique words is not a lot.
Try setting minCount lower than the default of 5.
model.setMinCount(value)
https://spark.apache.org/docs/latest/api/python/pyspark.ml.html?highlight=word2vec#pyspark.ml.feature.Word2Vec
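For completeness, a sketch using the pyspark.ml API from the link above: vectorSize controls the embedding dimension (the default is 100) and minCount the vocabulary cutoff. The DataFrame and column names here are placeholders.

from pyspark.ml.feature import Word2Vec

w2v = Word2Vec(vectorSize=300, minCount=2,
               inputCol="tokens", outputCol="vector")    # "tokens" is an array-of-strings column
model = w2v.fit(tokenized_df)                            # tokenized_df is a placeholder DataFrame
model.getVectors().show(5, truncate=False)               # one 300-dimensional vector per vocabulary word
# model.findSynonyms("serious", 10).show()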
