I'm new to Mallet and to topic modeling in the field of art history. I'm working with Mallet 2.0.8 from the command line (I don't know Java yet). I'd like to remove the most common and the least common words (fewer than 10 times in the whole corpus, as D. Mimno recommends) before training the model, because the results aren't clean (even with the stoplist), which is not surprising.
I've found that the prune command could be useful, with options like prune-document-freq. Is that right? Or is there another way? Could someone explain the whole procedure in detail (for example: create/input the Vectors2Vectors file, at which stage, and then what)? It would be much appreciated!
I'm sorry for this question; I'm a beginner with Mallet and text mining! But it's quite exciting!
Thanks a lot for your help!
There are two places you can use Mallet to curate the vocabulary. The first is in data import, for example the import-file command. The --remove-stopwords option removes a fixed set of English stopwords. This is here for backwards compatibility reasons, and is probably not a bad idea for some English-language prose, but you can generally do better by creating a custom list. I would recommend using instead the --stoplist-file option along with the name of a file. All words in this file, separated by spaces and/or newlines, will be removed. (Using both options will remove the union of the two lists, probably not what you want.) Another useful option is --replacement-files, which allows you to specify multi-word strings to treat as single words. For example, this file:
black hole
white dwarf
will convert "black hole" into "black_hole". Here newlines are treated differently from spaces. You can also specify multi-word stopwords with --deletion-files.
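As a sketch, an import command combining a custom stoplist with a replacement file might look like the following; the file names are placeholders, and you should check bin/mallet import-file --help for the exact options available in your version:

bin/mallet import-file \
    --input corpus.tsv \
    --output corpus.mallet \
    --keep-sequence \
    --stoplist-file my-stoplist.txt \
    --replacement-files multiword-terms.txt

The --keep-sequence option preserves word order, which you need if the resulting file will be used to train a topic model.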
Once you have a Mallet file, you can modify that file with the prune command. --prune-count N will remove words that occur fewer than N times in the whole corpus. --prune-document-freq N will remove words that occur at least once in fewer than N documents. This version can be more robust against words that occur many times in a single document. You can also prune by proportion: --min-idf removes frequent words, --max-idf removes infrequent words. A word with IDF 10.0 occurs less than once in 20000 documents; a word with IDF below 2.0 occurs in more than 13% of the collection.
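For the original question (removing words that occur fewer than 10 times), a possible command sequence is sketched below. Again, file names and the topic count are placeholders, and option behavior can vary slightly between Mallet versions, so check bin/mallet prune --help:

# Prune the vocabulary of the imported file.
bin/mallet prune \
    --input corpus.mallet \
    --output corpus.pruned.mallet \
    --prune-count 10

# Train a topic model on the pruned file.
bin/mallet train-topics \
    --input corpus.pruned.mallet \
    --num-topics 20 \
    --optimize-interval 20 \
    --output-topic-keys topic-keys.txt \
    --output-doc-topics doc-topics.txt

So the stages are: import (with stoplist/replacement files), then prune, then train-topics on the pruned .mallet file.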
Related
I am using automatic speech recognition to extract text from an audio file. However, the output is just a long sequence of words with no punctuation whatsoever. What I'd like to do is use some NLP technique to estimate beginnings and endings of sentences, or, in other words, predict positions of punctuation markers. I found that CoreNLP can do sentence splitting, but apparently only if punctuation is already present.
You may find relevant info in the answers to this other question: Sentence annotation in text without punctuation.
In particular, one of the answers claims the deepsegment package works well on unpunctuated text.
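As a rough sketch of how that package is typically used (this assumes the deepsegment package is installed and exposes the DeepSegment class as in its documentation; treat the details as an assumption to verify):

# pip install deepsegment
from deepsegment import DeepSegment

segmenter = DeepSegment("en")  # load the pretrained English model
text = "i went to the shop then i met my friend we talked for an hour"
print(segmenter.segment(text))  # expected: a list of estimated sentences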
In spoken language you often find that people don't use sentences, but that the clauses simply run into each other. The degree to which this happens depends on the formality and setting -- a speech will conform more to written sentence structures than a conversation in a pub among friends.
One approach you could try is to identify words that typically begin/end sentences in written text, and see if that can help you segment your data. Or look for verbs, and then try to find boundaries between them; these might be clause boundaries rather than sentence boundaries, but as I said, in spoken language there often are no sentences.
Let's assume that I have a dataset of car accidents. Each accident has a textual description made using a set of cameras and other sensors.
Suppose now I have only the data of a single camera (e.g. the frontal one) and I want to remove all the sentences of the description that are not related to it. I think a basic and easy solution could be a boolean retrieval system that uses a set of specific keywords to remove unwanted sentences, but I don't know whether it is a good idea nor whether it would work; could someone suggest an approach? What kind of statistics might be useful to study this problem? Thanks
Regex could be one solution.
I created a regex matching the word "front", case insensitive, which searches for "front" and then captures the whole sentence for one or more matches.
The results may need to be trimmed of some leading whitespace (which can probably be removed as well with some fine-tuning).
You can swap the word out through a variable taking values from a list, if you need "front", "rear", "side", "right", "left" or others.
Regex Example https://regex101.com/r/ZHU0kr/5
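If you want to do the same thing in Python rather than interactively, a minimal sketch could look like this (it is not the exact pattern from the link above, and it assumes sentences end with ., ! or ?):

import re

def sentences_with_keyword(text, keyword):
    """Return the sentences that mention the keyword, case-insensitively."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    return [s.strip() for s in sentences if pattern.search(s)]

report = ("The front camera recorded a vehicle crossing the line. "
          "The rear sensor detected nothing unusual. "
          "Damage to the Front bumper was extensive.")
print(sentences_with_keyword(report, "front"))

Swapping "front" for "rear", "side" and so on, or looping over a list of keywords, works the same way.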
I am trying to perform some NMT engine customization for Japanese but I am having some difficulties uploading parallel txt files. I've gathered 10k parallel sentences and put them into two txt files.
As the guide suggested, I've also been careful to remove sentences containing \n and \r characters, but upon uploading I get the following:
What's wrong?
We display the sentence counts because the model training engine operates at the sentence level. The expected format of the txt parallel file set is one sentence per line. During the upload process we run a sentence breaker which identifies end-of-sentence markers and breaks accordingly. This is why the count of sentences does not always match the count of lines: sentences, not lines of the input file, are the units we operate on.
This is also why we suggest removing newline characters within sentences. The newline is considered an end of sentence marker, so having newlines within a sentence creates a false sentence break.
In response to your second concern, we do run a sentence aligning process on most data that is submitted. If there is an inconsistent number of sentences in the uploaded parallel files we can usually get most of the sentence pairs, as long as the sentences are fairly close.
After some "debugging" I've noticed that the number shown in the portal is the number of sentences (instead of lines, my bad!). I find it kind of confusing (and not really useful in my opinion). What would be the usefulness of displaying this information?
In addition, I've noticed that there is no warning if you upload one file containing fewer lines than the other (which would make the parallel files not parallel anymore; the whole point of parallel files is to have X lines in the source file and X lines in the target file). It would be helpful if at least a warning were shown to prevent mistakes (if you use parallel files and len(f1) != len(f2), that's a good indicator that something is off).
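A quick local sanity check before uploading can catch both problems (mismatched line counts and stray empty lines). This is just a sketch with hypothetical file names, not part of the upload tool:

def check_parallel(src_path, tgt_path):
    """Report basic problems in a pair of parallel text files."""
    with open(src_path, encoding="utf-8") as f:
        src = f.read().splitlines()
    with open(tgt_path, encoding="utf-8") as f:
        tgt = f.read().splitlines()
    if len(src) != len(tgt):
        print(f"Line counts differ: {len(src)} vs {len(tgt)}")
    for i, (s, t) in enumerate(zip(src, tgt), start=1):
        if not s.strip() or not t.strip():
            print(f"Line {i}: empty segment on one side")
    return len(src) == len(tgt)

check_parallel("train.ja.txt", "train.en.txt")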
I'm looking for a solution to the following task. I take a few random pages from a random book in English, remove all non-letter characters, and convert all characters to lower case. As a result I have something like:
wheniwasakidiwantedtobeapilot...
Now what I'm looking for is something that could reverse that process with reasonably good accuracy. I need to find words and sentence separators. Any ideas how to approach this problem? Are there existing solutions I can build on without reinventing the wheel?
This is harder than normal tokenization since the basic tokenization task assumes spaces. Basically all that normal tokenization has to figure out is, for example, whether punctuation should be part of a word (like in "Mr.") or separate (like at the end of a sentence). If this is what you want, you can just download the Stanford CoreNLP package which performs this task very well with a rule-based system.
For your task, you need to figure out where to put in the spaces. This tutorial on Bayesian inference has a chapter on word segmentation in Chinese (Chinese writing doesn't use spaces). The same techniques could be applied to space-free English.
The basic idea is that you have a language model (an N-Gram would be fine) and you want to choose a splitting that maximizes the probability of the data according to the language model. So, for example, placing a space between "when" and "iwasakidiwantedtobeapilot" would give you a higher probability according to the language model than placing a split between "whe" and "niwasakidiwantedtobeapilot", because "when" is a better word than "whe". You could do this many times, adding and removing spaces, until you figured out what gave you the most English-looking sentence.
Doing this will give you a long list of tokens. Then when you want to split those tokens into sentences you can actually use the same technique except instead of using a word-based language model to help you add spaces between words, you'll use a sentence-based language model to split that list of tokens into separate sentences. Same idea, just on a different level.
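To make the word-level part concrete, here is a small sketch of that idea with a unigram language model and dynamic programming. The word probabilities are made-up toy values; a real system would estimate them from a large corpus:

import math
from functools import lru_cache

# Toy unigram probabilities, invented for illustration only.
WORD_PROBS = {"when": 0.01, "i": 0.03, "was": 0.02, "a": 0.04,
              "kid": 0.001, "wanted": 0.001, "to": 0.03, "be": 0.01,
              "pilot": 0.0001}

def word_logprob(word):
    # Unknown strings get a penalty that grows with their length,
    # so the model prefers splitting them into known words.
    return math.log(WORD_PROBS.get(word, 1e-12 / 10 ** len(word)))

def segment(text, max_word_len=20):
    """Split an unspaced, lowercased string into its most probable words."""
    @lru_cache(maxsize=None)
    def best(start):
        if start == len(text):
            return 0.0, ()
        candidates = []
        for end in range(start + 1, min(len(text), start + max_word_len) + 1):
            word = text[start:end]
            score, rest = best(end)
            candidates.append((word_logprob(word) + score, (word,) + rest))
        return max(candidates, key=lambda c: c[0])
    return list(best(0)[1])

print(segment("wheniwasakidiwantedtobeapilot"))
# -> ['when', 'i', 'was', 'a', 'kid', 'i', 'wanted', 'to', 'be', 'a', 'pilot']

The same dynamic program works for the sentence-level pass if you replace the word probabilities with a model that scores sequences of tokens as sentences.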
The tasks you describe are called "word tokenization" and "sentence segmentation". There is a lot of literature about them in NLP. They have very simple, straightforward solutions as well as advanced probabilistic approaches based on language models. Choosing one depends on your exact goal.
As part of teaching myself python I've written a script which allows a user to play hangman. At the moment, the hangman word to be guessed is simply entered manually at the start of the script's code.
I want instead for the script to choose randomly from a large list of english words. This I know how to do - my problem is finding that list of words to work from in the first place.
Does anyone know of a source on the net for, say, 1000 common English words that can be downloaded as a block of text or something similar that I can work with?
(My initial thought was grabbing a chunk of a novel from Project Gutenberg [this project is only for my own amusement and won't be available anywhere else, so copyright etc. doesn't matter hugely to me, btw], but anything like that is likely to contain too many names or non-standard words that wouldn't be suitable for hangman. I need text that only has words legal for use in Scrabble, basically.)
It's a slightly odd question for here I suppose, but actually I thought the answer might be of use not just to me but to anyone else working on a project for a word game or similar that needs a large seed list of words to work from.
Many thanks for any links or suggestions :)
Would this be useful?
Have you tried /usr/share/dict/words?
Create a word list manually
Grab text from Project Gutenberg, Wikipedia or some other source. Go through the text and count how many times each word is found. The words that are found most frequently will be pronouns, conjunctions, etc. Just throw them out.
Proper nouns will likely be the least frequently found words, unless of course your text is a story; then the character names will likely be found quite often. Probably the best way to handle proper nouns is to use many sources and count how many sources each word is found in. Essentially, words that are common among a lot of different sources will likely not be proper nouns. Words that are specific to one text source you can throw out. This idea is related to tf-idf.
Once you have calculated these word frequencies, it's also easy to just look over the words, and tweak your list as necessary.
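A rough sketch of that counting step (the file names are hypothetical plain-text sources you would supply yourself):

import re
from collections import Counter

def word_counts(paths):
    """Return total word counts and the number of sources each word appears in."""
    total = Counter()
    source_freq = Counter()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            words = re.findall(r"[a-z]+", f.read().lower())
        total.update(words)
        source_freq.update(set(words))
    return total, source_freq

paths = ["austen.txt", "darwin.txt", "dickens.txt"]
total, source_freq = word_counts(paths)

# Keep words found in every source (unlikely to be proper nouns),
# then drop the few hundred most frequent ones (pronouns, conjunctions, ...).
in_all = {w for w, n in source_freq.items() if n == len(paths)}
too_common = {w for w, _ in total.most_common(300)}
candidates = sorted(in_all - too_common)
print(candidates[:50])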
Use WordNet
Another idea is to download words from WordNet. WordNet tells you the part of speech for a lot of words. You could just stick to nouns and verbs for your purpose.
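A small sketch of that, using NLTK's interface to WordNet (assumes nltk is installed and the wordnet data has been fetched once with nltk.download("wordnet")):

import random
from nltk.corpus import wordnet as wn

# Collect single-word noun lemmas of hangman-friendly length.
nouns = {name.lower()
         for synset in wn.all_synsets("n")
         for name in synset.lemma_names()
         if name.isalpha() and 4 <= len(name) <= 10}

print(random.choice(sorted(nouns)))

Note that WordNet still contains plenty of obscure entries and some proper nouns, so you may want extra filtering (e.g. against a frequency list) for a friendlier hangman game.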