I have JSON data which looks like this:
{
  "text": "Dispute Case ID MM-E-904982837 the amount of $20.06 should be AU dollars total is $62.34 US dollars.",
  "spans": [
    {"start": 82, "end": 99, "label": "dis_amt", "ngram": "$62.34 US dollars"},
    {"start": 45, "end": 51, "label": "dis_amt", "ngram": "$20.06"}
  ]
}
I want to convert this data into the format below: wherever a span occurs in the text, I replace the corresponding tokens with the span's label. A span may cover more than one token.
**First step**: ['Dispute', 'Case', 'ID', 'MM-E-904982837', 'the', 'amount', 'of', '$20.06', 'should', 'be', 'AU', 'dollars', 'total', 'is', '$62.34', 'US', 'dollars.']
**Second step**: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'dis_amt', 'O', 'O', 'O', 'O', 'O', 'O', 'dis_amt', 'dis_amt', 'dis_amt']
My code:
import ast

for data in dataset:
    data = ast.literal_eval(data)
    text = data['text']
    split_txt = text.split()
    print(split_txt)
    nerd_label = ['O' for i in range(len(split_txt))]
    for sp in data['spans']:
        ngrams = sp['ngram']
        split_ngram = ngrams.split()
        for ngram in split_ngram:
            if ngram in split_txt:
                idx = split_txt.index(ngram)
                nerd_label[idx] = sp['label']
I get this wrong output:
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'dis_amt', 'O', 'O', 'O', 'dis_amt', 'O', 'O', 'dis_amt', 'dis_amt', 'O']
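For reference, a minimal sketch (not from the original post) of one way to get the expected labels: compute each token's character offsets and label every token whose start falls inside a span's [start, end) range. This avoids the failure mode above, where list.index() always returns the first occurrence of a repeated token such as "dollars".

```python
# Label tokens via character offsets instead of token lookup
data = {
    "text": "Dispute Case ID MM-E-904982837 the amount of $20.06 should be AU dollars total is $62.34 US dollars.",
    "spans": [
        {"start": 82, "end": 99, "label": "dis_amt", "ngram": "$62.34 US dollars"},
        {"start": 45, "end": 51, "label": "dis_amt", "ngram": "$20.06"},
    ],
}

tokens = []
offset = 0
for tok in data["text"].split():
    start = data["text"].index(tok, offset)  # character offset of this token
    tokens.append((tok, start, start + len(tok)))
    offset = start + len(tok)

labels = ["O"] * len(tokens)
for sp in data["spans"]:
    for i, (tok, s, e) in enumerate(tokens):
        # a token belongs to the span if it starts inside [start, end)
        if sp["start"] <= s < sp["end"]:
            labels[i] = sp["label"]
print(labels)
```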
How can I join [5, 'N', 'K', 'r', 9, 'j', 'K', '(', 'E', 't'] into a single string like "5NKr9jK(Et"?
Just join it :)
julia> join([5, 'N', 'K', 'r', 9, 'j', 'K', '(', 'E', 't'])
"5NKr9jK(Et"
The individual elements of the vector are converted to strings using the print function.
I'm doing a coding exercise to build a password generator. I understand I need to use a for loop with the list containing the elements, but I'm having trouble getting multiple random elements. If the user input is 5, I can generate one random letter repeated 5 times, but I can't get it to generate 5 different elements. What code do I need to generate random elements depending on user input? I know my code and logic are incorrect, but I can't figure out how else to get around this. Any feedback is much appreciated, thank you.
import random
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']
nr_letters = int(input("How many letters would you like in your password?\n"))
for letter in letters:
    random_letter = random.choice(letters) * nr_letters
    print(random_letter)
There could be better ways; I've just built on your code. The for loop you are using iterates over every letter in the list, which is redundant here. You can do something like this:
import random
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']
nr_letters = int(input("How many letters would you like in your password?\n"))
random_letter = ''
for i in range(nr_letters):
    random_letter += random.choice(letters)
print(random_letter)
You actually don't have to use a for loop to get your desired password.
import random
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']
nr_letters = int(input("How many letters would you like in your password?\n"))
random_letter = "".join(random.choices(letters, k=nr_letters))
print(random_letter)
But if you must use a loop, you can wrap the same logic in one as you wish. Happy coding.
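A side note not raised in the answers above: Python's random module is not suitable for security-sensitive passwords. A minimal variation using the stdlib secrets module instead (the helper name make_password is mine, just for illustration):

```python
import secrets
import string

# Same a-z, A-Z pool as the letters list above
letters = string.ascii_letters

def make_password(n):
    # secrets.choice draws from the OS CSPRNG, unlike random.choice,
    # which uses the predictable Mersenne Twister
    return "".join(secrets.choice(letters) for _ in range(n))

print(make_password(5))
```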
I'm trying to sort a list of objects based on frequency of occurrence (increasing order). I'm seeing that the sort behaves differently if the list has numbers versus characters. Does anyone know why this is happening?
Below is a list of numbers sorted by frequency of occurrence.
# Sort a list of numbers based on increasing order of frequency
import collections

nums = [1, 1, 2, 2, 2, 3]
countMap = collections.Counter(nums)
nums.sort(key=lambda x: countMap[x])
print(nums)
# Returns correct output
[3, 1, 1, 2, 2, 2]
But if I sort a list of characters, the order of 'l' and 'o' is incorrect in the example below:
# Sort list of characters based on increasing order of frequency
alp = ['l', 'o', 'v', 'e', 'l', 'e', 'e', 't', 'c', 'o', 'd', 'e']
countMap = collections.Counter(alp)
alp.sort(key = lambda x: countMap[x])
print(alp)
# Returns Below output - characters 'l' and 'o' are not in the correct sorted order
['v', 't', 'c', 'd', 'l', 'o', 'l', 'o', 'e', 'e', 'e', 'e']
# Expected output
['v', 't', 'c', 'd', 'l', 'l', 'o', 'o', 'e', 'e', 'e', 'e']
Python's sort is stable: if two elements compare equal under the sort key, they keep their relative order from the input. Here 'l' and 'o' both have a count of 2, so they stay interleaved in their original order 'l', 'o', 'l', 'o'.
from collections import Counter
# Sort list of characters based on increasing order of frequency
alp = ['l', 'o', 'v', 'e', 'l', 'e', 'e', 't', 'c', 'o', 'd', 'e']
countMap = Counter(alp)
alp.sort(key=lambda x: (countMap[x], x))  # break ties by the character itself
print(alp)
['c', 'd', 't', 'v', 'l', 'l', 'o', 'o', 'e', 'e', 'e', 'e']
This fixes it by using the character itself as the second criterion.
To get your exact output you can use:
# Use each character's first position in the list as the tie-breaker
countMap = Counter(alp)
pos = {k: alp.index(k) for k in countMap}
alp.sort(key=lambda x: (countMap[x], pos[x]))
print(alp)
['v', 't', 'c', 'd', 'l', 'l', 'o', 'o', 'e', 'e', 'e', 'e']
See Is python's sorted() function guaranteed to be stable? or https://wiki.python.org/moin/HowTo/Sorting/ for details on sorting.
I'm trying to create a CRF model that segments Japanese sentences into words. At the moment I'm not worried about perfect results as it's just a test. The training goes fine but when it's finished it always gives the same guess for every sentence I try to tag.
"""Labels: X: Character is mid word, S: Character starts a word, E:Character ends a word, O: One character word"""
Sentence:広辞苑や大辞泉には次のようにある。
Prediction:['S', 'X', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E']
Truth:['S', 'X', 'E', 'O', 'S', 'X', 'E', 'O', 'O', 'O', 'O', 'S', 'E', 'O', 'S', 'E', 'O']
Sentence:他にも、言語にはさまざまな分類がある。
Prediction:['S', 'X', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E']
Truth:['O', 'O', 'O', 'O', 'S', 'E', 'O', 'O', 'S', 'X', 'X', 'X', 'E', 'S', 'E', 'O', 'S', 'E', 'O']
When looking at the transition info for the model:
{('E', 'E'): -3.820618,
('E', 'O'): 3.414133,
('E', 'S'): 2.817927,
('E', 'X'): -3.056175,
('O', 'E'): -4.249522,
('O', 'O'): 2.583123,
('O', 'S'): 2.601341,
('O', 'X'): -4.322003,
('S', 'E'): 7.05034,
('S', 'O'): -4.817578,
('S', 'S'): -4.400028,
('S', 'X'): 6.104851,
('X', 'E'): 4.985887,
('X', 'O'): -5.141898,
('X', 'S'): -4.499069,
('X', 'X'): 4.749289}
This looks good, since all the transitions with negative values are impossible; E -> X, for example, would go from the end of a word to the middle of the following one. S -> E has the highest value, and as seen above the model simply falls into a pattern of labeling S then E repeatedly until the sentence ends. I followed this demo when trying this, though that demo is for segmenting Latin text. My features are similarly just n-grams:
['bias',
'char=ま',
'-2-gram=さま',
'-3-gram=はさま',
'-4-gram=にはさま',
'-5-gram=語にはさま',
'-6-gram=言語にはさま',
'2-gram=まざ',
'3-gram=まざま',
'4-gram=まざまな',
'5-gram=まざまな分',
'6-gram=まざまな分類']
I've tried changing the labels to just S and X for start and other, but this just causes the model to repeat S, X, S, X till it runs out of characters. I've gone up to 6-grams in both directions, which took a lot longer but didn't change anything. I've tried training for more iterations and changing the L1 and L2 constants a bit. I've trained on up to 100,000 sentences, which is about as far as I can go, as it takes almost all 16GB of my RAM. Are my features structured wrong? How do I get the model to stop guessing in a pattern, and is that even what's happening? Help would be appreciated, and let me know if I need to add more info to the question.
Turns out I was missing a step: I was passing raw sentences to the tagger rather than feature lists. Because the CRF can apparently accept a character string as if it were a list of almost featureless entries, it was just defaulting to guessing the highest-rated transitions rather than raising an error. I'm not sure this will help anyone else, given it was a simple mistake, but I'll leave this answer here until I decide whether or not to remove the question.
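For anyone hitting the same issue, here is a minimal sketch of the missing step (the function names are mine, not from the original code): build a feature list for each character, matching the n-gram feature strings shown in the question, and pass those lists to the tagger instead of the raw sentence.

```python
def char_features(sent, i, max_n=3):
    # Feature strings in the style shown in the question
    feats = ['bias', 'char=' + sent[i]]
    for n in range(2, max_n + 1):
        if i - n + 1 >= 0:
            # n-gram ending at position i (looking backwards)
            feats.append('-%d-gram=%s' % (n, sent[i - n + 1:i + 1]))
        if i + n <= len(sent):
            # n-gram starting at position i (looking forwards)
            feats.append('%d-gram=%s' % (n, sent[i:i + n]))
    return feats

def sent_features(sent):
    # One feature list per character - this is what the tagger must see
    return [char_features(sent, i) for i in range(len(sent))]

# The fix: tag features, not the raw string, e.g.
# tagger.tag(sent_features(sentence))   # not tagger.tag(sentence)
```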