Microsoft Translator engine customization: parallel txt files - Azure

I am trying to perform some NMT engine customization for Japanese but I am having some difficulties uploading parallel txt files. I've gathered 10k parallel sentences and I've put them into two txt files:
As the guide suggested, I've also been careful to remove sentences containing \n and \r characters, but upon uploading I get the following:
What's wrong?

We display the sentence counts because the model training engine operates at the sentence level. The expected format of the parallel txt file set is one sentence per line. During the upload process we run a sentence breaker, which identifies end-of-sentence markers and breaks accordingly. This is why the count of sentences does not always match the count of lines: sentences, not lines of the input file, are the units we operate on.
This is also why we suggest removing newline characters within sentences. A newline is treated as an end-of-sentence marker, so a newline inside a sentence creates a false sentence break.
In response to your second concern, we do run a sentence-alignment process on most submitted data. If there is an inconsistent number of sentences in the uploaded parallel files, we can usually recover most of the sentence pairs, as long as the sentences are reasonably close.
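As a rough illustration of the line-versus-sentence distinction (using a naive split on the Japanese full stop, not the service's actual sentence breaker), the following hypothetical sketch shows how two lines can yield three sentences:

# A rough, hypothetical illustration (not the service's actual sentence
# breaker): a naive split on the Japanese full stop shows why the reported
# sentence count can exceed the number of lines in the uploaded file.
import re

lines = [
    "これは最初の文です。これは二番目の文です。",  # one line, two sentences
    "三番目の文です。",                            # one line, one sentence
]

sentence_count = 0
for line in lines:
    sentences = [s for s in re.split(r"(?<=。)", line) if s.strip()]
    sentence_count += len(sentences)

print("lines:", len(lines))          # 2
print("sentences:", sentence_count)  # 3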

After some "debugging" I've noticed that the number shown in the portal is the number of sentences (instead of lines, my bad!). I find it somewhat confusing (and not really useful, in my opinion). What is the usefulness of displaying this information?
In addition, I've noticed that there is no warning if you upload one file containing fewer lines than the other (which would make the parallel files not parallel anymore - the whole point of parallel files is to have X lines in the source file and X lines in the target file). It would be helpful if at least a warning were shown to prevent mistakes: if you are using parallel files and len(f1) != len(f2), it's a strong indicator that something is off.
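A minimal pre-upload sanity check along these lines might look like the sketch below; the file names are hypothetical, and it only verifies that the two files are still parallel, reports stray carriage returns, and flags empty segments.

# A minimal pre-upload sanity check (file names are hypothetical): verify the
# two files have the same number of lines, report stray \r characters, and
# flag segments that are empty on one side.
SOURCE_FILE = "train.ja.txt"   # hypothetical source-side file
TARGET_FILE = "train.en.txt"   # hypothetical target-side file

def load_lines(path):
    # newline="" keeps \r visible so stray carriage returns can be reported
    with open(path, encoding="utf-8", newline="") as f:
        raw = f.read()
    if "\r" in raw:
        print(f"{path}: contains \\r characters; normalize line endings first")
    return raw.replace("\r\n", "\n").replace("\r", "\n").rstrip("\n").split("\n")

src = load_lines(SOURCE_FILE)
tgt = load_lines(TARGET_FILE)

if len(src) != len(tgt):
    print(f"Not parallel: {len(src)} source lines vs {len(tgt)} target lines")

for i, (s, t) in enumerate(zip(src, tgt), start=1):
    if not s.strip() or not t.strip():
        print(f"Line {i}: empty segment on one side")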

Related

koRpus: 'tokenize' command on a large folder of Word files

I have made some headway in getting koRpus to analyze my data, but there are lingering problems.
The 'tokenize' command seems to work--kind of. I run the following line of code:
word <- tokenize("/Users/gdballingrud/Desktop/WPSCASES 1/", lang="en")
And it produces a 'Large krp.text' object. However, the size of that object (5.6 MB) is far smaller than the size of the folder I reference in the code (260 MB). Further, when I use the 'readability' command to generate text analysis scores, like so:
all <- readability(word)
It returns one readability score for the whole krp.text object (one per readability measure, I mean).
I need readability scores for each Word file I have in my folder, and I need to use koRpus (others like quanteda don't generate some of the readability measures that I need, like LIX and Kuntzsch's Text-Redundanz-Index).
Is anyone experienced enough with koRpus to point out what I have done wrong? The recurring problems are: 1) getting the tokenize command to recognize each file in my folder, and 2) getting readability scores for each separate file.
Thanks,
Gordon

spaCy Doc.sents not Splitting Correctly

In an NLP text summarization example, I've come across a weird situation. The example uses the spaCy library to process the text. I'm explaining the situation through the two cases below.
In the first case (see the first pic), spaCy doesn't split the sentences after the period character, as you see in the red outlined part, "won by the Whites.".
In the second case (see the second pic), after I've moved the sentence ending with "Whites." up, spaCy does split the sentences after the period character, as you see in the red-outlined part, "won by the Whites.,". Note that this time there is a comma at the end of the sentence ending with "Whites.", which means this sentence has been split from the next one, unlike in the first case.
I've observed the same behavior when moving the sentence to other positions as well.
Nothing comes to mind except that this might be a bug. (I've copied the text to a text editor and then pasted it into the notebook to make sure there is no special character next to the period.)
What do you think?
I'm sharing the notebook here so that you can play with it:
https://colab.research.google.com/drive/1MXRIrak0y680U84g0a0glpjX-clkkdtG?usp=sharing
I think the problem might be that the text is being treated as a list in the second case. But feel free to correct me if I'm wrong.
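To compare the two cases without relying on the screenshots, it can help to print the boundaries that Doc.sents actually produces. A small sketch, assuming the en_core_web_sm model is installed and using a hypothetical stand-in for the notebook's paragraph:

# A small sketch for inspecting Doc.sents directly; the sample text is a
# hypothetical stand-in for the paragraph used in the notebook.
import spacy

nlp = spacy.load("en_core_web_sm")

text = ("The match was won by the Whites. The next game "
        "took place a week later in the same stadium.")

doc = nlp(text)
for i, sent in enumerate(doc.sents):
    print(i, repr(sent.text))

If you paste the notebook text in place of the sample, the repr() output also makes invisible characters around the period visible (stray newlines, non-breaking spaces, list bullets), which is usually what changes the segmentation.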

Mallet topic modeling: remove most common words

I'm new to Mallet and to topic modeling, working in the field of art history. I'm using Mallet 2.0.8 from the command line (I don't know Java yet). I'd like to remove the most common and the least common words (fewer than 10 occurrences in the whole corpus, as D. Mimno recommends) before training the model, because the results aren't clean (even with the stoplist), which is not surprising.
I've found that the prune command could be useful, with options like --prune-document-freq. Is that right? Or is there another way? Could someone explain the whole procedure in detail (for example: at which stage to create/input the Vectors2Vectors file, and what to do next)? It would be much appreciated!
I'm sorry for this question, I'm a beginner with Mallet and text mining! But it's quite exciting!
Thanks a lot for your help!
There are two places you can use Mallet to curate the vocabulary. The first is at data import, for example with the import-file command. The --remove-stopwords option removes a fixed set of English stopwords. It is there for backwards-compatibility reasons, and is probably not a bad idea for some English-language prose, but you can generally do better by creating a custom list. I would recommend instead the --stoplist-file option along with the name of a file. All words in this file, separated by spaces and/or newlines, will be removed. (Using both options removes the union of the two lists, which is probably not what you want.) Another useful option is --replacement-files, which allows you to specify multi-word strings to treat as single words. For example, this file:
black hole
white dwarf
will convert "black hole" into "black_hole". Here newlines are treated differently from spaces: each line defines one multi-word term, and spaces separate the words within it. You can also specify multi-word stopwords with --deletion-files.
Once you have a Mallet file, you can modify it with the prune command. --prune-count N will remove words that occur fewer than N times in any document. --prune-document-freq N will remove words that occur at least once in N documents; this version can be more robust against words that occur a lot in a single document. You can also prune by proportion: --min-idf removes infrequent words and --max-idf removes frequent words. A word with an IDF of 10.0 occurs less than once in 20,000 documents; a word with an IDF below 2.0 occurs in more than 13% of the collection.
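To make the order of operations concrete, here is a sketch of the whole pipeline with hypothetical file names, wrapped in Python's subprocess purely for illustration; it uses the options described above plus the standard --input/--output/--keep-sequence flags.

# A sketch (file names are hypothetical) of the two-stage workflow described
# above: import the corpus with a custom stoplist, prune the vocabulary,
# then train topics on the pruned file.
import subprocess

MALLET = "bin/mallet"  # path to your Mallet 2.0.8 installation

# Stage 1: import raw text into a Mallet file, removing custom stopwords.
subprocess.run([
    MALLET, "import-file",
    "--input", "corpus.txt",          # one document per line: "id label text"
    "--output", "corpus.mallet",
    "--keep-sequence",                # keep word order, required for topic models
    "--stoplist-file", "stoplist.txt",
], check=True)

# Stage 2: prune the imported file, e.g. dropping words with fewer than
# 10 occurrences (see --prune-document-freq / --min-idf / --max-idf above
# for the other criteria).
subprocess.run([
    MALLET, "prune",
    "--input", "corpus.mallet",
    "--output", "corpus.pruned.mallet",
    "--prune-count", "10",
], check=True)

# Stage 3: train the topic model on the pruned file.
subprocess.run([
    MALLET, "train-topics",
    "--input", "corpus.pruned.mallet",
    "--num-topics", "20",
    "--output-topic-keys", "topic-keys.txt",
    "--output-doc-topics", "doc-topics.txt",
], check=True)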

OpenNLP SentenceDetector doesn't recognize whole sentence

I'm working on a research project and I need an NLP program to detect sentences in many different circumstances. I was advised to use OpenNLP, and I was convinced after reading its wiki pages. So I use OpenNLP to detect sentences as well as any words or phrases which do not belong to a sentence (also called sentence fragments).
OpenNLP accepts .txt files as input if you want to redirect the input. If you want to use a .doc file as input, you have to convert it to a .txt file. My problem starts right here.
I have many different files in different formats. I would like to detect sentences in each file if it contains any text. Therefore, I started to convert each potentially text-containing file to a .txt file. The conversion process is not perfect. For example, if a sentence is too long (say, longer than a line), the conversion tool treats the two lines of the sentence as separate lines. As a result, OpenNLP produces each line as a different sentence because of the end-of-line character at the end of the first line.
My question is: is there any way I can parameterize or configure OpenNLP to recognize the whole sentence (first and second lines together)?
I suggest you use Apache Tika for the conversion of those different files.
Apache Tika has an AutoDetectParser which detects different file types and extracts the text in them (even the metadata if you want), and you can save that to a .txt file.
Try your paragraph, with the newlines replaced by spaces, in CoreNLP: nlp.stanford.edu:8080/corenlp/process
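If you stay with OpenNLP, a simple preprocessing step in the spirit of that suggestion is to undo the hard line wrapping before running the sentence detector: join single newlines into spaces and keep blank lines as paragraph breaks. A rough sketch with a hypothetical file name:

# A rough sketch (the file name is hypothetical) that undoes hard line
# wrapping before sentence detection: single newlines inside a paragraph
# become spaces, while blank lines are kept as paragraph breaks.
import re

with open("converted.txt", encoding="utf-8") as f:
    text = f.read()

paragraphs = re.split(r"\n\s*\n", text)                    # blank line = paragraph break
joined = [re.sub(r"\s*\n\s*", " ", p).strip() for p in paragraphs]
unwrapped = "\n\n".join(p for p in joined if p)

with open("converted.unwrapped.txt", "w", encoding="utf-8") as f:
    f.write(unwrapped)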

How to automatically detect sentence fragments in a text file

I am working on a project and need a tool or an API to detect sentence fragments in large texts. There are many solutions, such as OpenNLP, for detecting sentences in a given file. However, I wasn't able to find any explicit solution to the problem of finding words, phrases, or even character combinations which do not belong to any grammatically correct sentence.
Any help will be greatly appreciated.
Thanks,
Lorderon
You could use n-grams as a workaround:
Suppose you have a large collection of text with real sentences for reference. You could extract all sequences of 1, 2, 3, 4, 5, or more words and then check whether the fragments from your text exist as n-grams in that reference collection.
You can download n-grams directly from Google: http://googleresearch.blogspot.de/2006/08/all-our-n-gram-are-belong-to-you.html but you might need a lot of traffic.
You could also count the n-grams yourself; in that case you can take the parsed data sets of Wikipedia from my website:
http://glm.rene-pickhardt.de/data/ and the source code from https://github.com/renepickhardt/generalized-language-modeling-toolkit in order to create the n-grams yourself (or use any other n-gram toolkit like SRILM, Kylm, OpenGrm, ...)
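As a toy illustration of that workaround (the reference sentences and fragments below are made up), you can collect the n-grams of a reference corpus into a set and flag any span of text whose n-grams never occur in it:

# A toy sketch of the n-gram workaround: build a set of n-grams from a
# reference corpus of real sentences, then flag candidate spans whose
# n-grams are never seen in the reference. All data here is hypothetical.
from itertools import islice

def ngrams(tokens, n):
    # yield all consecutive n-token windows of the list
    return zip(*(islice(tokens, i, None) for i in range(n)))

reference_sentences = [
    "the cat sat on the mat",
    "she went to the market yesterday",
]
candidates = ["sat on the", "mat market she to"]

N = 3
seen = set()
for sentence in reference_sentences:
    seen.update(ngrams(sentence.split(), N))

for span in candidates:
    grams = list(ngrams(span.split(), N))
    unknown = [g for g in grams if g not in seen]
    if not grams or unknown:
        print(span, "-> possible fragment, unseen n-grams:", unknown)
    else:
        print(span, "-> covered by reference n-grams")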

Resources