language model with SRILM
I'm trying to build a language model using SRILM.
I have a list of phrases and I create the model using:
./ngram-count -text corpus.txt -order 3 -ukndiscount -interpolate -unk -lm corpus.lm
After this I tried some examples to see the probabilities of different phrases, and it turned out that <unk> has a log probability of -0.9.
The problem is that some words seen in the training data have a lower log probability than that. For example, "abatantuono" occurs 5 times in the training data, yet its log probability is -4.8.
I find this strange, because it means the phrase <s> <unk> </s> is more probable than <s> abatantuono </s>, even though the 3-gram <s> abatantuono </s> is present in the training set!
This can be seen here:
% ./ngram -lm corpus.lm -ppl ../../../corpus.txt.test -debug 2 -unk
reading 52147 1-grams
reading 316818 2-grams
reading 91463 3-grams
abatantuono
p( abatantuono | <s> ) = [2gram] 1.6643e-05 [ -4.77877 ]
p( </s> | abatantuono ...) = [3gram] 0.717486 [ -0.144186 ]
1 sentences, 1 words, 0 OOVs
0 zeroprobs, logprob= -4.92296 ppl= 289.386 ppl1= 83744.3
abatantonno
p( <unk> | <s> ) = [1gram] 0.00700236 [ -2.15476 ]
p( </s> | <unk> ...) = [1gram] 0.112416 [ -0.949172 ]
1 sentences, 1 words, 0 OOVs
0 zeroprobs, logprob= -3.10393 ppl= 35.6422 ppl1= 1270.36
file ../../../corpus.txt.test: 2 sentences, 2 words, 0 OOVs
0 zeroprobs, logprob= -8.02688 ppl= 101.56 ppl1= 10314.3
What do you think the problem could be?
Thank you
This is a known problem with SRILM (see Kenneth Heafield's thesis, the footnote on page 30, and his website notes on SRILM). The way probability mass is allocated to unknown words can assign them a higher probability than rare words that were actually seen in the training data. You could have a look at the KenLM package, which only implements modified Kneser-Ney (generally performs better than plain Kneser-Ney smoothing) but allocates mass to unknown words in a way that prevents this from happening.
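To make the inversion concrete, a quick back-of-the-envelope check (plain Python, not SRILM-specific) converts the two total log10 sentence scores from the question's debug output back to probabilities:

```python
# log10 sentence scores taken from the "ngram -debug 2" output above
logprob_seen = -4.92296  # <s> abatantuono </s>, a word seen 5 times in training
logprob_unk = -3.10393   # <s> abatantonno </s>, an OOV word mapped to <unk>

# Ratio of the two sentence probabilities: 10^(difference of log10 scores)
ratio = 10 ** (logprob_unk - logprob_seen)
print(round(ratio))  # ~66: the <unk> sentence is scored about 66x more probable
```

So the model scores the misspelled, never-seen sentence roughly 66 times more probable than the one containing a genuine training word, which is exactly the anomaly Heafield flags.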
Related
fasttext: why do aligned vectors contain only one value per word?
I was taking a look at the fastText aligned vectors for some languages and was surprised to find that each vector consisted of only one value. I was expecting a matrix with a multidimensional vector belonging to each word; instead there is only one column of numbers. I'm very new to this field and was wondering if somebody could explain how this single number belonging to each word came to be, and whether I'm looking at a semantic space as I was expecting or something different (if so, what is it, and are aligned multidimensional semantic spaces available somewhere?)
I think you may be misinterpreting those files. When I look at one of those files – for example wiki.en.align.vec – each line is a word-token, then 300 different values (to provide a 300-dimensional word-vector). For example, the 4th line of the file is: the -0.0324 -0.0462 -0.0087 0.0994 0.0147 -0.0198 -0.0811 -0.0362 0.0445 0.0402 -0.0199 -0.1173 0.0906 -0.0304 -0.0320 -0.0374 -0.0249 -0.0099 0.0017 0.0719 -0.0834 0.0382 -0.1141 -0.0288 -0.0666 -0.0365 -0.0006 0.0098 0.0282 0.0310 -0.0773 0.0755 -0.0528 0.1225 -0.0138 -0.0879 0.0036 -0.0593 0.0416 -0.0588 0.0266 -0.0011 -0.0419 0.0141 0.0388 -0.0597 -0.0203 0.0444 0.0253 -0.0316 0.0352 -0.0318 -0.0473 0.0347 -0.0250 0.0289 0.0426 0.0218 -0.0254 0.0486 -0.0252 -0.0904 0.1607 -0.0379 0.0231 -0.0988 -0.1213 -0.0926 -0.1116 0.0345 -0.1856 -0.0409 0.0306 -0.0653 -0.0377 -0.0301 0.0361 0.1212 0.0105 -0.0354 0.0552 0.0363 -0.0427 0.0555 -0.0031 -0.0830 -0.0325 0.0415 -0.0461 -0.0615 -0.0412 0.0060 0.1680 -0.1347 0.0271 -0.0438 0.0364 0.0121 0.0018 -0.0138 -0.0625 -0.0161 -0.0009 -0.0373 -0.1009 -0.0583 0.0038 0.0109 -0.0068 0.0319 -0.0043 -0.0412 -0.0506 -0.0674 0.0426 -0.0031 0.0788 0.0924 0.0559 0.0449 0.1364 0.1132 -0.0378 0.1060 0.0130 0.0349 0.0638 0.1020 0.0459 0.0634 -0.0870 0.0447 -0.0124 0.0167 -0.0603 0.0297 -0.0298 0.0691 -0.0280 0.0749 0.0474 0.0275 0.0255 0.0184 0.0085 0.1116 0.0233 0.0176 0.0327 0.0471 0.0662 -0.0353 -0.0387 -0.0336 -0.0354 -0.0348 0.0157 -0.0294 0.0710 0.0299 -0.0602 0.0732 -0.0344 0.0419 0.0773 0.0119 -0.0550 0.0377 0.0808 -0.0424 -0.0977 -0.0386 -0.0334 -0.0384 -0.0520 0.0641 0.0049 0.1226 -0.0011 -0.0131 0.0224 0.0138 -0.0243 0.0544 -0.0164 0.1194 0.0916 -0.0755 0.0565 0.0235 -0.0009 -0.0818 0.0953 0.0873 -0.0215 0.0240 -0.0271 0.0134 -0.0870 0.0597 -0.0073 -0.0230 -0.0220 0.0562 -0.0069 -0.0796 -0.0118 0.0059 0.0221 0.0509 0.1175 0.0508 -0.0044 -0.0265 0.0328 -0.0525 0.0493 -0.1309 -0.0674 0.0148 -0.0024 -0.0163 -0.0241 0.0726 -0.0165 0.0368 -0.0914 0.0197 0.0018 -0.0149 0.0654 0.0912 
-0.0638 -0.0135 -0.0277 -0.0078 0.0092 -0.0477 0.0054 -0.0153 -0.0411 -0.0177 0.0874 0.0221 0.1040 0.1004 0.0595 -0.0610 0.0650 -0.0235 0.0257 0.1208 0.0129 -0.0086 -0.0846 0.1102 -0.0338 -0.0553 0.0166 -0.0602 0.0128 0.0792 -0.0181 0.0046 -0.0548 -0.0394 -0.0546 0.0425 0.0048 -0.1172 -0.0925 -0.0357 -0.0123 0.0371 -0.0142 0.0157 0.0442 0.1186 0.0834 -0.0293 0.0313 -0.0287 0.0095 0.0080 0.0566 -0.0370 0.0257 0.1032 -0.0431 0.0544 0.0323 -0.1076 -0.0187 0.0407 -0.0198 -0.0255 -0.0505 0.0827 -0.0650 0.0176 Thus every one of the 2,519,370 word-tokens has a 300-dimensional vector. If this isn't what you're seeing, you should explain further. If this is what you're seeing and you were expecting something else, you should explain further what you were expecting.
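As a quick sanity check of the file layout, splitting one line of a .vec file on whitespace should yield the token followed by its vector components. A minimal sketch, using a copy of the "the" line quoted above truncated to 5 dimensions purely for illustration (real lines in wiki.en.align.vec carry 300 values):

```python
# Truncated copy of the "the" line from wiki.en.align.vec (5 of 300 values)
line = "the -0.0324 -0.0462 -0.0087 0.0994 0.0147"

# Each line is "<word> <float> <float> ...": first field is the token,
# the rest are the vector components.
token, *components = line.split()
vector = [float(v) for v in components]

print(token, len(vector))  # the 5
```

If parsing a real file this way gives you only one value per word, the file was most likely truncated or split on the wrong delimiter during download or loading.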
How can I get vectors for words that were not present in the word2vec vocabulary?
I have checked the previous post link, but it doesn't seem to work for my case. I have a pre-trained word2vec model:

from gensim.models import Word2Vec
model = Word2Vec.load('w2v_model')

Now I have a pandas dataframe with keywords:

keyword
corruption
people
budget
cambodia
.......
......

All I want is to add the vector for each keyword in its corresponding columns, but when I use model['cambodia'] it throws the error KeyError: "word 'cambodia' not in vocabulary". So I tried updating with the keyword: model.train(['cambodia']). But this doesn't work for me either; when I use model['cambodia'] it still gives KeyError: "word 'cambodia' not in vocabulary". How can I add new words to the word2vec vocabulary so I can get their vectors? The expected output would be:

keyword     V1       V2        V3         V4        V5         V6
corruption  0.07397  0.290874  -0.170812  0.085428  -0.148551  0.38846
people      ..............................................................
budget      ...........................................................
You can initialize a dedicated first vector as [0, 0, ..., 0] and map every word that is not in the vocabulary to it:

   keyword  V1       V2        V3         V4        V5         V6
0           0        0         0          0         0          0
1           0.07397  0.290874  -0.170812  0.085428  -0.148551  0.38846
2           ..............................................................
3           ...........................................................

You can use two dicts to solve the problem:

word2id['corruption'] = 1
vec['corruption'] = [0.07397, 0.290874, -0.170812, 0.085428, -0.148551, 0.38846]
...
word2id['cambodia'] = 0
vec['cambodia'] = [0, 0, 0, 0, 0, 0]
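A minimal runnable sketch of the two-dict fallback described above. The vector values are copied from the question's expected output, and the dimensionality of 6 is purely for illustration (a real word2vec model would typically have 100 or 300 dimensions):

```python
DIM = 6  # illustration only; match your model's vector size in practice

word2id = {"cambodia": 0, "corruption": 1}
vec = {
    "corruption": [0.07397, 0.290874, -0.170812, 0.085428, -0.148551, 0.38846],
    "cambodia": [0.0] * DIM,  # out-of-vocabulary words get the all-zero vector
}

def get_vector(word):
    # Any word not in the vocabulary falls back to the zero vector (id 0).
    return vec.get(word, [0.0] * DIM)

print(get_vector("corruption")[0])  # 0.07397
print(get_vector("never-seen"))     # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

Note that a zero vector is only a placeholder; it carries no semantic information, so downstream models will treat all OOV words as identical.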
Why do mllib word2vec word vectors only have 100 elements?
I have a word2vec model that I created in PySpark. The model is saved as a .parquet file. I want to be able to access and query the model (or the words and word vectors) using vanilla Python because I am building a flask app that will allow a user to enter words of interest for finding synonyms. I've extracted the words and word vectors, but I've noticed that while I have approximately 7000 unique words, my word vectors have a length of 100. For example, here are two words "serious" and "breaks". Their vectors only have a length of 100. Why is this? How is it able to then reconstruct the entire vector space with only 100 values for each word? Is it simply only giving me the top 100 or the first 100 values? vectors.take(2) Out[48]: [Row(word=u'serious', vector=DenseVector([0.0784, -0.0882, -0.0342, -0.0153, 0.0223, 0.1034, 0.1218, -0.0814, -0.0198, -0.0325, -0.1024, -0.2412, -0.0704, -0.1575, 0.0342, -0.1447, -0.1687, 0.0673, 0.1248, 0.0623, -0.0078, -0.0813, 0.0953, -0.0213, 0.0031, 0.0773, -0.0246, -0.0822, -0.0252, -0.0274, -0.0288, 0.0403, -0.0419, -0.1122, -0.0397, 0.0186, -0.0038, 0.1279, -0.0123, 0.0091, 0.0065, 0.0884, 0.0899, -0.0479, 0.0328, 0.0171, -0.0962, 0.0753, -0.187, 0.034, -0.1393, -0.0575, -0.019, 0.0151, -0.0205, 0.0667, 0.0762, -0.0365, -0.025, -0.184, -0.0118, -0.0964, 0.1744, 0.0563, -0.0413, -0.054, -0.1764, -0.087, 0.0747, -0.022, 0.0778, -0.0014, -0.1313, -0.1133, -0.0669, 0.0007, -0.0378, -0.1093, -0.0732, 0.1494, -0.0815, -0.0137, 0.1009, -0.0057, 0.0195, 0.0085, 0.025, 0.0064, 0.0076, 0.0676, 0.1663, -0.0078, 0.0278, 0.0519, -0.0615, -0.0833, 0.0643, 0.0032, -0.0882, 0.1033])), Row(word=u'breaks', vector=DenseVector([0.0065, 0.0027, -0.0121, 0.0296, -0.0467, 0.0297, 0.0499, 0.0843, 0.1027, 0.0179, -0.014, 0.0586, 0.06, 0.0534, 0.0391, -0.0098, -0.0266, -0.0422, 0.0188, 0.0065, -0.0309, 0.0038, -0.0458, -0.0252, 0.0428, 0.0046, -0.065, -0.0822, -0.0555, -0.0248, -0.0288, -0.0016, 0.0334, -0.0028, -0.0718, -0.0571, -0.0668, -0.0073, 
0.0658, -0.0732, 0.0976, -0.0255, -0.0712, 0.0899, 0.0065, -0.04, 0.0964, 0.0356, 0.0142, 0.0857, 0.0669, -0.038, -0.0728, -0.0446, 0.1194, -0.056, 0.1022, 0.0459, -0.0343, -0.0861, -0.0943, -0.0435, -0.0573, 0.0229, 0.0368, 0.085, -0.0218, -0.0623, 0.0502, -0.0645, 0.0247, -0.0371, -0.0785, 0.0371, -0.0047, 0.0012, 0.0214, 0.0669, 0.049, -0.0294, -0.0272, 0.0642, -0.006, -0.0804, -0.06, 0.0719, -0.0109, -0.0272, -0.0366, 0.0041, 0.0556, 0.0108, 0.0624, 0.0134, -0.0094, 0.0219, 0.0164, -0.0545, -0.0055, -0.0193]))] Any thoughts on the best way to reconstruct this model in vanilla python?
Just to improve on the comment by zero323, for anyone else who arrives here: Word2Vec creates word vectors of 100 dimensions by default. To change this, set the vector size when initializing the model; in gensim that is model = Word2Vec(sentences, size=300), which creates 300-dimensional vectors (the corresponding parameter in PySpark's Word2Vec is vectorSize).
I think the problem lies with the minCount parameter value of your Word2Vec model. If this value is too high, fewer words get used in training the model, which shrinks the resulting vocabulary. 7000 unique words is not a lot, so try setting minCount lower than the default of 5: model.setMinCount(value). https://spark.apache.org/docs/latest/api/python/pyspark.ml.html?highlight=word2vec#pyspark.ml.feature.Word2Vec
user defined feature in CRF++
I tried to add more features to the CRF++ template, according to "How can I tell CRF++ classifier that a word x is capitalized or understanding punctuations?"

Training sample:

The DT 0 1 0 1 B-MISC
Oxford NNP 0 1 0 1 I-MISC
Companion NNP 0 1 0 1 I-MISC
to TO 0 0 0 0 I-MISC
Philosophy NNP 0 1 0 1 I-MISC

Feature template:

# Unigram
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-1,0]/%x[0,0]
U06:%x[0,0]/%x[1,0]
U07:%x[-2,0]/%x[-1,0]/%x[0,0]

# shape feature
U08:%x[-2,2]
U09:%x[-1,2]
U10:%x[0,2]
U11:%x[1,2]
U12:%x[2,2]

B

The training phase is OK, but I get no output from crf_test:

tilney@ubuntu:/data/wikipedia/en$ crf_test -m validation_model test.data
tilney@ubuntu:/data/wikipedia/en$

Everything works fine if I ignore the shape features above. Where did I go wrong?
I figured this out: it was a problem with my test data. I thought every feature would be taken from the trained model, so I had only two columns in my test data (word and tag). It turns out that the test file must have exactly the same format as the training data!
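To illustrate the fix: a test.data file matching the training format above needs all seven columns (word, POS tag, the four shape-feature columns, and a final tag column), not just word and tag. For example (rows copied from the training sample purely for illustration):

```
The        DT  0 1 0 1 B-MISC
Oxford     NNP 0 1 0 1 I-MISC
Companion  NNP 0 1 0 1 I-MISC
```

This is required because the template's %x[row,col] references (e.g. U10:%x[0,2]) index into those columns by position, so crf_test silently fails when the columns it expects are missing.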
Weka ignoring unlabeled data
I am working on an NLP classification project using a Naive Bayes classifier in Weka. I intend to use semi-supervised machine learning, hence I am working with unlabeled data. When I test the model obtained from my labeled training data on an independent set of unlabeled test data, Weka ignores all the unlabeled instances. Can anybody please guide me on how to solve this? Someone has already asked this question here before, but no appropriate solution was provided. Here is a sample test file:

@relation referents

@attribute feature1 NUMERIC
@attribute feature2 NUMERIC
@attribute feature3 NUMERIC
@attribute feature4 NUMERIC
@attribute class {1, -1}

@data
1, 7, 1, 0, ?
1, 5, 1, 0, ?
-1, 1, 1, 0, ?
1, 1, 1, 1, ?
-1, 1, 1, 1, ?
The problem is that when you specify a training set with -t train.arff and a test set with -T test.arff, the mode of operation is to calculate the performance of the model based on the test set. But you can't calculate performance of any kind without knowing the actual class: without it, how will you know if your prediction is right or wrong?

I used the data you gave as train.arff and as test.arff, with arbitrary class labels assigned by me. The relevant output lines are:

=== Error on training data ===

Correctly Classified Instances           4               80      %
Incorrectly Classified Instances         1               20      %
Kappa statistic                          0.6154
Mean absolute error                      0.2429
Root mean squared error                  0.4016
Relative absolute error                 50.0043 %
Root relative squared error             81.8358 %
Total Number of Instances                5

=== Confusion Matrix ===

 a b   <-- classified as
 2 1 | a = 1
 0 2 | b = -1

and:

=== Error on test data ===

Total Number of Instances                0
Ignored Class Unknown Instances          5

=== Confusion Matrix ===

 a b   <-- classified as
 0 0 | a = 1
 0 0 | b = -1

Weka can give you those statistics for the training set because it knows both the actual class labels and the predicted ones (from applying the model to the training set). For the test set, it can't get any information about the performance, because it doesn't know the true class labels.

What you might want to do instead is:

java -cp weka.jar weka.classifiers.bayes.NaiveBayes -t train.arff -T test.arff -p 1-4

which in my case gives:

=== Predictions on test data ===

inst#  actual  predicted  error  prediction (feature1,feature2,feature3,feature4)
    1     1:?        1:1             1     (1,7,1,0)
    2     1:?        1:1             1     (1,5,1,0)
    3     1:?       2:-1             0.786 (-1,1,1,0)
    4     1:?       2:-1             0.861 (1,1,1,1)
    5     1:?       2:-1             0.861 (-1,1,1,1)

So you can get the predictions, but you can't get a performance measure, because you have unlabeled test data.