Can I put 2 or 3 or even more sentences in the reference text used to generate the language model - cmusphinx

I have read some tutorials on creating an LM. I was a little confused about weather.txt.
I have sentences like this (with silence marks at the start and end of each sentence):
<s> OK </s>
<s> IS THERE A COMPUTER IN MY ROOM </s>
Can I put two sentences into one line (again with silence marks at the start and end)?
<s> OK . IS THERE A COMPUTER IN MY ROOM </s>
We tried this and it seems to work fine, provided the WAV file says the two sentences together.
Going even further, can I put all 60~100 sentences into one line, with the end punctuation mark '.' splitting each sentence?
<s> OK . IS THERE A COMPUTER IN MY ROOM. YES, THERE IS ON THE DESK. RIGHT. ARE THERE ANY BALLS </s>

You can do anything, but it's better to have one line per sentence.
Dots must be removed before LM training.
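As a concrete sketch of that advice (plain Python, nothing CMUSphinx-specific; the file name weather_clean.txt is made up), you could split the transcript on the dots and emit one <s> ... </s> line per sentence:

    # Split a one-line transcript on '.' and write one sentence per line,
    # wrapped in silence marks, with the dots removed before LM training.
    raw = "OK . IS THERE A COMPUTER IN MY ROOM. YES, THERE IS ON THE DESK. RIGHT. ARE THERE ANY BALLS"

    with open("weather_clean.txt", "w") as out:
        for sentence in raw.split("."):
            sentence = sentence.strip()
            if sentence:  # skip empty fragments between consecutive dots
                out.write("<s> %s </s>\n" % sentence)

The output then has one line per sentence, which is the format recommended above.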

Related

OpenAI - Limit TL/DR Summarization to X characters & complete sentences

I'm currently learning how to use the OpenAI API for text summarization for a project. Overall, it's pretty amazing, but there is one thing I'm struggling with.
I need a tl/dr summary that is 1-2 complete sentences with a max of 250 characters. I can play around with the MaximumLength option, but if I make it too short, the summary often just ends up with a sentence that is cut off in the middle.
Another problem is - if there is a bullet list in the main text, the summary will be a few of those bullets. Again, I need 1-2 complete sentences, not bullets.
Lastly, if the main text is quite short, often my summary will be 2 sentences that say the exact same thing with a slight variation.
I've tried this using various engines (text-davinci, davinci-instruct-beta). Any suggestions on how I can instruct/guide OpenAI to give me the output that I'm looking for? Or do I need to start using the "Fine Tuning" option? If I feed it 1,000+ examples of 1-2 sentences with < 250 characters & no bullets, will it understand what I need?
Many thanks in advance.
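One model-independent workaround (separate from the prompting question itself) is to post-process the completion so that only whole sentences within the character budget survive. A minimal sketch, assuming the raw summary is already in a Python string:

    import re

    def trim_summary(text, max_sentences=2, max_chars=250):
        # Naive split after ., ! or ? -- fine for typical model output,
        # not for abbreviation-heavy text.
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        result = ""
        for s in sentences[:max_sentences]:
            candidate = (result + " " + s).strip()
            if len(candidate) > max_chars:
                break
            result = candidate
        return result

    print(trim_summary("First sentence. A second, longer sentence. A third."))

This guarantees the 250-character limit and complete sentences regardless of what MaximumLength does; the bullet and repetition problems would still need prompt or fine-tuning work.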

number of tokenized sentences does not match number of sentences in text

I have some problems with the nltk.sent_tokenize function.
My text (that I want to tokenize) consists of 54116 sentences that are separated by a dot. I removed all other punctuation.
I would like to tokenize my text on a sentence level by using nltk.sent_tokenize.
However, if I apply tokenized_text = sent_tokenize(mytext), the length of tokenized_text is only 51582 instead of 54116.
Any ideas why this could happen?
Kind regards
This would typically happen because the model for sentence boundary detection cannot detect all sentence boundaries correctly; it is limited by its accuracy, which would be of the order of 97%-99%. That said, since you say the corpus has sentences strictly separated by a dot, you may simply split it on '.', provided there are no abbreviations like Prof. or Dr. or Sr., etc. You may like to refer to https://www.aclweb.org/anthology/C12-2096.pdf for further details.
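For instance, the mismatch can be reproduced on a toy corpus (mytext below is a placeholder for the real corpus). Punkt treats "Dr." as an abbreviation, while the naive split on the dot does not:

    # Requires: import nltk; nltk.download('punkt')
    from nltk.tokenize import sent_tokenize

    mytext = "He met Dr. Smith. They talked. The end."  # placeholder corpus

    by_tokenizer = sent_tokenize(mytext)
    by_dot = [s.strip() for s in mytext.split(".") if s.strip()]

    # Prints 3 vs 4 -- the same kind of mismatch as 51582 vs 54116 above.
    print(len(by_tokenizer), len(by_dot))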

Algorithm For Determining Sentence Subject Similarity

I'm looking to generate an algorithm that can determine the similarity of a series of sentences. Specifically, given a starter sentence, I want to determine if the following sentence is a suitable addition.
For example, take the following:
My dog loves to drink water.
All is good, this is just the first sentence.
The dog hates cats.
All is good, both sentences reference dogs.
It enjoys walks on the beach.
All is good, "it" is neutral enough to be an appropriate continuation.
Pizza is great with pineapple on top.
This would not be a suitable addition, as the sentence does not build on to the "narrative" created by the first three sentences.
To outline the project a bit: I've created a library that generates Markov text chains from input text. That text is then corrected grammatically to produce viable sentences. I now want to string these sentences together to create coherent paragraphs.
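No specific algorithm is given in the question, but a crude lexical-overlap baseline illustrates the shape of the problem (the 0.1 threshold is an arbitrary placeholder, and the "it" example above is exactly what such a baseline misses without coreference resolution):

    def is_suitable(paragraph, candidate, threshold=0.1):
        # Jaccard overlap between the candidate sentence and the
        # accepted text so far; pronouns and synonyms defeat this.
        a = set(paragraph.lower().split())
        b = set(candidate.lower().split())
        return len(a & b) / len(a | b) >= threshold

    story = "My dog loves to drink water. The dog hates cats."
    print(is_suitable(story, "The dog enjoys walks."))            # True
    print(is_suitable(story, "Pizza is great with pineapple."))   # False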

Annotating sentences that span multiple lines in GATE

I have an issue with Sentence Splitter module in GATE. My text is something like this:
Social history. He drank a lot in his young age. He did
not attend a school. He was depressed of his condition.
We are sure that the sentences should be split like this:
Sentence 1: Social history.
Sentence 2: He drank a lot in his young age.
Sentence 3: He did not attend a school.
Sentence 4: He was depressed of his condition.
However, the ANNIE Sentence Splitter assumes that text on different lines belongs to different sentences, and thus produces this:
Sentence 1: Social history.
Sentence 2: He drank a lot in his young age.
Sentence 3: He did
Sentence 4: not attend a school.
Sentence 5: He was depressed of his condition.
That is because the sentence is spread across multiple lines. Is there a way to tell the sentence splitter that a sentence might span more than one line? Or is there a better method to recognise sentences in this type of text?
Thank you :)
Try using the RegEx Sentence Splitter instead of ANNIE.
With the ANNIE Sentence Splitter, you have the parameter TransducerURL, which by default points to something like:
/PATH-TO-GATE/plugins/ANNIE/resources/sentenceSplitter/grammar/main-single-nl.jape
In this folder there is also a JAPE file called:
/PATH-TO-GATE/plugins/ANNIE/resources/sentenceSplitter/grammar/main.jape
If you point the parameter at main.jape instead, it should work.
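If pre-processing outside GATE is an option, the same idea can be sketched in plain Python (assuming, as in the example above, that every sentence really ends with a period): join the wrapped lines first, then split after the sentence-ending punctuation.

    import re

    text = ("Social history. He drank a lot in his young age. He did\n"
            "not attend a school. He was depressed of his condition.")

    # Join the wrapped lines into one stream, then split after each period.
    joined = " ".join(line.strip() for line in text.splitlines())
    for i, s in enumerate(re.split(r"(?<=\.)\s+", joined), 1):
        print("Sentence %d: %s" % (i, s))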

String-matching algorithm for noisy text

I have used OCR (optical character recognition) to get text from images. The images contain book covers. Because the images are so noisy, some characters are misrecognised, or some noise is recognised as characters.
Examples:
"w COMPUTER Nnwonxs i I "(Compuer Networks)
"s.ll NEURAL NETWORKS C "(Neural Networks)
"1llllll INFRODUCIION ro PROBABILITY ti iitiiili My "(Introduction of Probability)
I built a dictionary of words, but I want to somehow match the recognised text against the dictionary. I tried LCS (longest common subsequence), but it's not very effective.
What is the best string-matching algorithm for this kind of problem? (Part of the string is just noise, but the important part of the string can also contain misrecognised characters.)
That's really a big question; the following is what I know about it. For more details, you can read some related papers.
For a single word, use Hamming distance to calculate the similarity between the word recognised by OCR and the words in your dictionary;
this step corrects words produced by OCR that do not exist in the dictionary.
E.g., if the result of OCR is INFRODUCIION, which doesn't exist in your dictionary, you can find that the word 'INTRODUCTION' has a Hamming distance of 2 from it, so it may have been misrecognised as 'INFRODUCIION'.
However, the same OCR output may match several different dictionary words at the same Hamming distance.
E.g., if the result of OCR is CAY, you may find that CAR and CAT both have a Hamming distance of 1, which is ambiguous.
In this case, several things can be used for the analysis:
Still at the single-word level, the visual difference between CAT and CAY is smaller than that between CAR and CAY, so CAT seems to be the right word with greater probability.
Then use the context to calculate another probability. If the whole sentence is 'I drove my new CAY this morning', then since people usually drive a CAR and not a CAT, we have a better reason to read CAY as CAR rather than CAT.
For the frequency of words used in similar articles, use TF-IDF.
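The single-word step can be sketched in a few lines of Python (the dictionary here is a made-up toy):

    def hamming(a, b):
        # Number of positions where two equal-length strings differ.
        return sum(x != y for x, y in zip(a, b))

    dictionary = ["CAR", "CAT", "DOG"]
    word = "CAY"

    # CAR and CAT both come out at distance 1 -- exactly the ambiguity
    # described above, which image or context evidence must resolve.
    same_length = [w for w in dictionary if len(w) == len(word)]
    print(sorted((hamming(word, w), w) for w in same_length))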
Are you saying you have a dictionary that defines all acceptable words?
If so, it should be fairly straightforward to take each word and find the closest match in your dictionary. Set a match threshold and discard the word if it does not reach the threshold.
I would experiment with the Soundex and Metaphone algorithms, or the Levenshtein distance algorithm.
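For the closest-match-with-threshold idea, Python's standard library already ships an approximate matcher; difflib scores by a similarity ratio rather than raw Levenshtein distance, but the effect is similar. The dictionary and the 0.5 cutoff below are toy values that would need tuning on real data:

    import difflib

    dictionary = ["COMPUTER", "NETWORKS", "NEURAL", "INTRODUCTION", "PROBABILITY"]

    def correct(token, cutoff=0.5):
        # Closest dictionary word above the cutoff, else None
        # (which discards pure-noise tokens like "w" or "i").
        matches = difflib.get_close_matches(token.upper(), dictionary,
                                            n=1, cutoff=cutoff)
        return matches[0] if matches else None

    for token in "w COMPUTER Nnwonxs i I".split():
        print(token, "->", correct(token))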