I'm trying to create a CRF model that segments Japanese sentences into words. At the moment I'm not worried about perfect results as it's just a test. The training goes fine but when it's finished it always gives the same guess for every sentence I try to tag.
"""Labels: X: character is mid-word, S: character starts a word, E: character ends a word, O: one-character word"""
Sentence:広辞苑や大辞泉には次のようにある。
Prediction:['S', 'X', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E']
Truth:['S', 'X', 'E', 'O', 'S', 'X', 'E', 'O', 'O', 'O', 'O', 'S', 'E', 'O', 'S', 'E', 'O']
Sentence:他にも、言語にはさまざまな分類がある。
Prediction:['S', 'X', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E']
Truth:['O', 'O', 'O', 'O', 'S', 'E', 'O', 'O', 'S', 'X', 'X', 'X', 'E', 'S', 'E', 'O', 'S', 'E', 'O']
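To make the labeling scheme concrete, here is a minimal sketch that turns a label sequence back into words. The helper name labels_to_words is hypothetical and not part of any CRF library, and it assumes a well-formed sequence in which every word ends on E or O.

```python
def labels_to_words(sentence, labels):
    """Split `sentence` into words using S/X/E/O character labels.

    Hypothetical helper for illustration; assumes one label per character
    and that every word is closed by an E or O label.
    """
    words, current = [], ""
    for ch, label in zip(sentence, labels):
        current += ch
        if label in ("E", "O"):  # E ends a multi-character word, O is a word by itself
            words.append(current)
            current = ""
    if current:  # keep any unterminated trailing characters
        words.append(current)
    return words


sentence = "広辞苑や大辞泉には次のようにある。"
truth = ['S', 'X', 'E', 'O', 'S', 'X', 'E', 'O', 'O', 'O', 'O',
         'S', 'E', 'O', 'S', 'E', 'O']
print(labels_to_words(sentence, truth))
```

Applied to the first sentence and its truth labels, this gives the segmentation the model is supposed to learn.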
When looking at the transition info for the model:
{('E', 'E'): -3.820618,
('E', 'O'): 3.414133,
('E', 'S'): 2.817927,
('E', 'X'): -3.056175,
('O', 'E'): -4.249522,
('O', 'O'): 2.583123,
('O', 'S'): 2.601341,
('O', 'X'): -4.322003,
('S', 'E'): 7.05034,
('S', 'O'): -4.817578,
('S', 'S'): -4.400028,
('S', 'X'): 6.104851,
('X', 'E'): 4.985887,
('X', 'O'): -5.141898,
('X', 'S'): -4.499069,
('X', 'X'): 4.749289}
This looks good, since all the transitions with negative values are impossible: E -> X, for example, would go from the end of a word to the middle of the following one. S -> E has the highest value, and as seen above the model simply gets into a pattern of labeling S then E repeatedly until the sentence ends. I followed this demo when trying this, though that demo is for separating Latin. My features are similarly just n-grams:
['bias',
'char=ま',
'-2-gram=さま',
'-3-gram=はさま',
'-4-gram=にはさま',
'-5-gram=語にはさま',
'-6-gram=言語にはさま',
'2-gram=まざ',
'3-gram=まざま',
'4-gram=まざまな',
'5-gram=まざまな分',
'6-gram=まざまな分類']
I've tried reducing the labels to just S and X (start and other), but then the model just repeats S, X, S, X until it runs out of characters. I've gone up to 6-grams in both directions, which took a lot longer but didn't change anything. I've also tried training for more iterations and adjusting the L1 and L2 constants a bit. I've trained on up to 100,000 sentences, which is about as far as I can go since it takes almost all 16 GB of my RAM. Are my features structured wrong? How do I get the model to stop guessing in a pattern, and is that even what's happening? Help would be appreciated, and let me know if I need to add more info to the question.
Turns out I was missing a step: I was passing raw sentences to the tagger rather than features. Because the CRF can apparently accept a character string as if it were a list of almost featureless entries, it did not raise an error and simply defaulted to following the highest-rated transitions. I'm not sure whether this will help anyone else, given it was a stupid mistake, but I'll leave an answer here until I decide whether to remove the question.
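A sketch of the missing step, assuming features shaped like the list shown in the question. The function names char_features and sentence_to_features are hypothetical, and the call that hands the result to the tagger is left out because it depends on the CRF library in use.

```python
def char_features(sentence, i, max_n=6):
    """Build n-gram features for the character at index i (hypothetical helper)."""
    feats = ["bias", "char=" + sentence[i]]
    # Backward n-grams: substrings that end at the current character.
    for n in range(2, max_n + 1):
        start = i - n + 1
        if start >= 0:
            feats.append(f"-{n}-gram={sentence[start:i + 1]}")
    # Forward n-grams: substrings that start at the current character.
    for n in range(2, max_n + 1):
        if i + n <= len(sentence):
            feats.append(f"{n}-gram={sentence[i:i + n]}")
    return feats


def sentence_to_features(sentence):
    """One feature list per character; this is what the tagger should receive."""
    return [char_features(sentence, i) for i in range(len(sentence))]


features = sentence_to_features("他にも、言語にはさまざまな分類がある。")
print(features[9])
```

The key point is that the tagger should be given sentence_to_features(sentence), a list with one feature list per character, and not the raw string.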
I'm doing a coding exercise and it's to build a password generator. I understand I need to utilize the for loop with the list containing the elements but I'm having trouble getting multiple random elements. If the user input is 5, I'm able to generate a random letter and 5 times of the same element but I can't get it to generate 5 different elements. What code do I need to utilize to generate random elements depending on user input? I know my code and logic is incorrect but I can't figure out how else to get around this. Any feedback is much appreciated, thank you.
import random
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']
nr_letters= int(input("How many letters would you like in your password?\n"))
for letter in letters:
    random_letter = random.choice(letters) * nr_letters
print(random_letter)
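There could be better ways; I've just adapted your code.
The for loop you are using is redundant: it iterates over every entry of letters, and random.choice(letters) * nr_letters picks one letter and repeats it nr_letters times.
You can do something like this instead: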
import random
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']
nr_letters= int(input("How many letters would you like in your password?\n"))
random_letter=''
for i in range(nr_letters):
    random_letter += random.choice(letters)
print(random_letter)
You actually don't have to use a for loop to get your desired password.
import random
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']
nr_letters= int(input("How many letters would you like in your password?\n"))
random_letter = "".join(random.choices(letters, k= nr_letters))
print(random_letter)
If you must use a loop, you can wrap the code above in one as you wish. Happy coding.
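As a variation on the answers above, here is a minimal sketch that wraps the idea in a function and draws from the standard library's secrets module, which the Python documentation recommends over random for passwords. The function name generate_password is hypothetical, and string.ascii_letters stands in for the hand-written letters list.

```python
import secrets
import string


def generate_password(length):
    """Return `length` letters, each drawn independently from a-z and A-Z."""
    return "".join(secrets.choice(string.ascii_letters) for _ in range(length))


print(generate_password(5))
```

Each character is chosen independently, so letters can still repeat by chance; they are just no longer forced to be the same letter.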
I have a list which looks like seen=['poll','roll','toll','told']
I need to compare characters from each of the elements from that list.
When I try to split the words into characters using
for i in range(len(seen)):
    chain1=[]
    for j in range(len(seen)):
        chain1.append(seen[i][j])
    print(chain1)
I get an output like this
['p', 'o', 'l', 'l']
['r', 'o', 'l', 'l']
['t', 'o', 'l', 'l']
['t', 'o', 'l', 'd']
Since these are all separate lists, I can't seem to iterate over them.
My thinking is, if I can manage to get those lists into a single list of list I can do my iterations.
Any suggestions on how to make it into a list of list or some other way to iterate over those words?
You can collect them into a single list like this:
seen=['poll','roll','toll','told']
alist=[]
for i in seen:
    chain=[]
    for j in i:
        chain.append(j)
    alist.append(chain)
print(alist)
Output:
[['p', 'o', 'l', 'l'], ['r', 'o', 'l', 'l'], ['t', 'o', 'l', 'l'], ['t', 'o', 'l', 'd']]
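A shorter sketch of the same idea, assuming all words have the same length as in the example: a list comprehension builds the list of lists, and zip(*seen) groups the characters by position, which is often what comparing characters across words needs. The variable names chars and columns are illustrative.

```python
seen = ['poll', 'roll', 'toll', 'told']

# One list of characters per word, same result as the nested loops.
chars = [list(word) for word in seen]

# Transpose: one tuple per character position across all words.
columns = list(zip(*seen))

# Report the positions where the words disagree.
for position, column in enumerate(columns):
    if len(set(column)) > 1:
        print(position, column)
```

For this input, the loop reports positions 0 and 3, the only places where the four words differ.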
Why do we use this particular range, from 97 to 123? I'd also like to know more about generating the alphabet using map.
list(map(chr,range(97,123)))
The ASCII codes for the lowercase English letters run from 97 to 122.
The range call in the line you provided creates an iterable over the integers 97 to 122 (the end value 123 is excluded). You are mapping these with the chr function, which returns the character for a given code point. For example,
>>> chr(97)
'a'
>>> chr(100)
'd'
>>> chr(122)
'z'
Your map call performs this operation for every number from 97 to 122.
>>> map(chr,range(97,123))
<map object at 0x000002EEAE8F46C8>
But map returns a map object; to convert it to a list, you can call list on it.
>>> list(map(chr,range(97,123)))
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
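As a quick check of the explanation above, this sketch derives the same bounds with ord instead of hard-coding 97 and 123, and compares the result with string.ascii_lowercase from the standard library.

```python
import string

# ord gives the code point of a character; chr is its inverse.
print(ord('a'), ord('z'))  # 97 122

# range excludes its end value, hence the + 1 (which is where 123 comes from).
alphabet = list(map(chr, range(ord('a'), ord('z') + 1)))

print(alphabet == list(string.ascii_lowercase))  # True
```

If you only need the letters, string.ascii_lowercase is the more readable choice.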
I'm trying to sort a list of objects based on frequency of occurrence (in increasing order). I'm seeing that the sort behaves differently when the list holds numbers versus characters. Does anyone know why this is happening?
Below is a list of numbers sorted by frequency of occurrence.
import collections
# Sort list of numbers based on increasing order of frequency
nums = [1,1,2,2,2,3]
countMap = collections.Counter(nums)
nums.sort(key = lambda x: countMap[x])
print(nums)
# Returns correct output
[3, 1, 1, 2, 2, 2]
But If I sort a list of characters, the order of 'l' and 'o' is incorrect in the below example:
# Sort list of characters based on increasing order of frequency
alp = ['l', 'o', 'v', 'e', 'l', 'e', 'e', 't', 'c', 'o', 'd', 'e']
countMap = collections.Counter(alp)
alp.sort(key = lambda x: countMap[x])
print(alp)
# Returns Below output - characters 'l' and 'o' are not in the correct sorted order
['v', 't', 'c', 'd', 'l', 'o', 'l', 'o', 'e', 'e', 'e', 'e']
# Expected output
['v', 't', 'c', 'd', 'l', 'l', 'o', 'o', 'e', 'e', 'e', 'e']
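Python's sort is stable: elements whose sort keys are equal keep their original relative order. Here 'l' and 'o' both have a count of 2, so they stay interleaved exactly as they appeared in the input. In the numbers example no two distinct values share a count, so the problem never shows up.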
from collections import Counter
# Sort list of characters based on increasing order of frequency
alp = ['l', 'o', 'v', 'e', 'l', 'e', 'e', 't', 'c', 'o', 'd', 'e']
countMap = Counter(alp)
alp.sort(key = lambda x: (countMap[x], x)) # in a tie, the letter will be used to un-tie
print(alp)
['c', 'd', 't', 'v', 'l', 'l', 'o', 'o', 'e', 'e', 'e', 'e']
This fixes it by using the letter itself as a second sort criterion.
To get your exact output you can use:
# use original position as tie-breaker in case counts are identical
countMap = Counter(alp)
pos = {k:alp.index(k) for k in countMap}
alp.sort(key = lambda x: (countMap[x], pos[x]))
print(alp)
['v', 't', 'c', 'd', 'l', 'l', 'o', 'o', 'e', 'e', 'e', 'e']
See "Is python's sorted() function guaranteed to be stable?" or https://wiki.python.org/moin/HowTo/Sorting/ for details on sorting.
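The stability described above can be checked directly. This is a small sketch using the question's own data; sorted is used instead of list.sort so the inputs stay unchanged, and the variable names are illustrative.

```python
from collections import Counter

alp = list("loveleetcode")
counts = Counter(alp)

# Ties ('l' and 'o' both occur twice) keep their input order: l, o, l, o.
by_count = sorted(alp, key=lambda ch: counts[ch])
print(by_count)

# With numbers, no two distinct values share a count, so nothing interleaves.
nums = [1, 1, 2, 2, 2, 3]
num_counts = Counter(nums)
print(sorted(nums, key=lambda n: num_counts[n]))
```

The difference between the two cases comes from the data, not from the element type: a list of numbers with tied counts would interleave in the same way.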
The title says it all. I want to generate the alphabet as a vector of characters. I did consider simply creating a range of 97-122 and converting it to characters, but I was hoping there would be a nicer-looking way, such as Python's string.ascii_lowercase.
The resulting vector or string should have the characters a-z.
Hard-coding this sort of thing makes sense, as it can then be a compiled constant, which is great for efficiency.
static ASCII_LOWER: [char; 26] = [
'a', 'b', 'c', 'd', 'e',
'f', 'g', 'h', 'i', 'j',
'k', 'l', 'm', 'n', 'o',
'p', 'q', 'r', 's', 't',
'u', 'v', 'w', 'x', 'y',
'z',
];
(Decide for yourself whether to use static or const.)
This is pretty much how Python does it in string.py:
lowercase = 'abcdefghijklmnopqrstuvwxyz'
# ...
ascii_lowercase = lowercase
Collecting the characters of a str doesn't seem like a bad idea...
let alphabet: Vec<char> = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ".chars().collect();
Old question, but you can create a range of chars and collect it directly (an inclusive char range is already an iterator, so no into_iter() call is needed):
('a'..='z').collect::<Vec<char>>()