I'm conducting a topic modeling experiment with Mallet on German texts. Since German nouns begin with an uppercase letter, I want to keep this feature. Does anyone know how to deactivate lowercasing?
Add --preserve-case when importing documents, e.g. bin/mallet import-dir --input data/ --keep-sequence --preserve-case --output texts.mallet. The flag works with both import-dir and import-file.
I am implementing a search engine in Spanish. In order to ensure gender neutrality, I need to get the gender of Spanish nouns - e.g. "pintora" (painter, female) and "pintor" (painter, male). I am currently using the FAIR library, which is really great for NER in Spanish. However, I cannot find any good implementation/library for gender detection in Spanish nouns. Could you help me?
Thank you in advance for your help
After using multiple search engines, including academic ones, to try to find research papers on Spanish word gender detection and related topics, it seems that no one has tackled the problem and implemented a solution in a modern library.
Regardless, you can still tackle the problem by running a Spanish part-of-speech (PoS) tagger (for example, RuPERTa-base (Spanish RoBERTa) + POS) to detect nouns/pronouns, combining those labels with your NER output where required, and then writing your own rules for determining the gender of particular nouns/pronouns based on Spanish grammar rules (such as those detailed in A New Reference Grammar of Modern Spanish, specifically Chapter 1, Gender of Nouns).
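If you end up writing those rules yourself, note that spaCy's Spanish models already expose morphological gender on tokens, which may cover many cases out of the box. A minimal sketch, assuming the es_core_news_sm model is installed (spaCy is an alternative to the RuPERTa approach above, not something prescribed here):

```python
# Sketch: reading morphological gender from spaCy's Spanish model.
# Requires: pip install spacy && python -m spacy download es_core_news_sm
import spacy

nlp = spacy.load("es_core_news_sm")
for token in nlp("La pintora y el pintor trabajan juntos."):
    if token.pos_ in ("NOUN", "PROPN", "PRON"):
        # token.morph.get("Gender") returns e.g. ['Fem'] or ['Masc']
        print(token.text, token.morph.get("Gender"))
```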
Hopefully that helps give you some direction if you don't end up finding a ready-made implementation.
I am looking for some way to determine if textual input takes the form of a valid sentence; I would like to provide a warning to the user if not. Examples of input I would like to warn the user about:
"dog hat can ah!"
"slkj ds dsak"
It seems like this is a difficult problem, since grammars are usually derived from treebanks, and the words in the provided sentence input might not appear in the grammar. It also seems like parsers may assume that the textual input consists of valid English words to begin with (just my brief takeaway from playing around with Stanford NLP's GUI tool). My questions are as follows:
Is there some tool available to scan through text input and determine whether it is made up of valid English words, or at least offer a probability of that? If not, I can write this; I'm just wondering whether it already exists. I figure this would be step 1 before determining grammatical correctness.
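For illustration, the kind of check I have in mind for step 1 might look like this rough sketch (using NLTK's words corpus, which needs to be downloaded first; the idea of thresholding the fraction is mine and any cutoff would be arbitrary):

```python
# Rough step 1: what fraction of tokens are known English words?
# Requires: import nltk; nltk.download("words")
from nltk.corpus import words

ENGLISH = set(w.lower() for w in words.words())

def english_fraction(text):
    tokens = [t.strip(".,!?\"'") for t in text.lower().split()]
    tokens = [t for t in tokens if t]
    return sum(t in ENGLISH for t in tokens) / max(len(tokens), 1)

print(english_fraction("dog hat can ah!"))  # high: real words, invalid sentence
print(english_fraction("slkj ds dsak"))     # low: gibberish
```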
My understanding is that determining whether a sentence is grammatically correct is done simply by attempting to parse the sentence and seeing if it succeeds. Is that accurate? Are there probabilistic parsers that offer a degree of confidence when ambiguity is encountered (e.g., a proper noun that isn't recognized)?
I hesitate to ask this last question, since I saw it was asked on SO over a decade ago, but any updates as to whether there is a basic, readily available grammar for NLTK? I know English isn't simple, but I am truly just looking to parse relatively simple, single sentence input.
Thanks!
A starting point is classification models trained on the Corpus of Linguistic Acceptability (CoLA) task. There are several recent blog articles on how to fine-tune the BERT models from HuggingFace (Python) for this task; here is one such article. You can also find already fine-tuned models for CoLA, for various BERT flavors, in the HuggingFace model zoo.
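For example, with the transformers library you can load one of those fine-tuned checkpoints directly. A minimal sketch; the checkpoint name is just one example from the hub, and label names vary by checkpoint, so check its model card:

```python
# Sketch: acceptability scoring with a CoLA-fine-tuned BERT.
from transformers import pipeline

classifier = pipeline("text-classification",
                      model="textattack/bert-base-uncased-CoLA")

for sentence in ["The dog chased the cat.", "dog hat can ah!"]:
    # Each result is a label (roughly acceptable/unacceptable) plus a score.
    print(sentence, classifier(sentence))
```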
"I had safe journey" ,assume this is a feedback for a driver ,provided by a passenger. I need to extract theses information from this sentence..
"I had safe journey" ->
SUBJECT= "driving"
SENTIMENT= "positive"
I tried the "Extracting Information from Text" method from NLP, but I don't know how to recognize entities in these kinds of sentences. How am I supposed to do that?
To categorize the entities of a sentence, or a sentence as a whole, you first need a defined set of classes/categories/groups.
For example, to categorize "journey" under travelling/driving, you should train your system/algorithm to identify specific patterns of sentences that fall under the category of driving/journey.
This training involves machine learning concepts; text categorization is what you should be searching for.
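To make that concrete, here is a minimal sketch of text categorization with scikit-learn; the categories and the handful of training examples are toy placeholders, and a real system needs many labeled sentences per class:

```python
# Sketch: toy text categorization with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder training data -- replace with real labeled feedback.
train_texts = [
    "I had a safe journey", "the driver drove very carefully",
    "the car was clean", "the seats were comfortable",
]
train_labels = ["driving", "driving", "vehicle", "vehicle"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(train_texts, train_labels)

print(model.predict(["I had safe journey"]))  # -> ['driving'], given enough data
```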
Here is a reference (to just give you an idea) and you can find many more over the web.
Good Luck!
Note: below are some links to Coursera, which offers courses on NLP
Link 1
Link 2
I am working on a natural language processing application. I have a text describing 30 domains; each domain is defined by a short paragraph that explains it. My aim is to build a thesaurus from this text so I can determine, from an input string, which domains are concerned. The text is about 5000 words, and each domain is described in about 150 words. My questions are:
Do I have a long enough text to create a thesaurus from ?
Is my idea of building a thesaurus legit or should I just use NLP libraries to analyse my corpus and the input string ?
At the moment, I have calculated the total number of occurrences of each word, grouped by domain, because I first thought of an indexed approach. But I am really not sure which method is best. Does someone have experience in both NLP and thesaurus building?
I think what you are looking for is topic modeling. Given a word, you want to get the probability of which domain the word belongs to. I would recommend using off-the-shelf algorithms that implement LDA (Latent Dirichlet Allocation).
Alternatively, you can visit David Blei's website. He has written some great software that implements LDA, and topic modeling in general. He also has presented several tutorials for topic modeling for beginners.
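If you prefer a library over reference code, a minimal sketch with gensim's LDA implementation might look like this (one "document" per domain paragraph; the tokenization is deliberately crude, and the placeholder strings stand in for your 30 paragraphs):

```python
# Sketch: fitting LDA over the domain paragraphs with gensim.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

domain_paragraphs = ["words describing domain one",
                     "words describing domain two"]  # your 30 paragraphs

texts = [p.lower().split() for p in domain_paragraphs]  # crude tokenization
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(corpus, num_topics=30, id2word=dictionary, passes=10)

# Score an input string against the learned topics:
query = dictionary.doc2bow("some input string".lower().split())
print(lda.get_document_topics(query))  # [(topic_id, probability), ...]
```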
If your goal is to build a thesaurus then build a thesaurus; if your goal is not to build a thesaurus, then you better use stuff available out there.
More generally, for any task in NLP - from data acquisition to machine translation - you're gonna face numerous problems (both technical and theoretical), and it is very easy to stray from the path, as these problems are - most of the time - fascinating.
Whatever the task is, build a system using existing resources. Then you get the big picture; then you can start thinking about improving component A or B.
Good luck.
I'm looking for some sort of module (preferably for python) that would allow me to give that module a string about 200 characters long. The module should then return how many positive or negative words that string had. (e.g. love, like, enjoy vs. hate, dislike, bad)
I'd really like to avoid having to reinvent the wheel in natural language processing, so if there is anything you guys know of that would allow me to do what I described above, it'd be a huge time-saver if you could share.
Thanks for the help!
I think you're looking for sentiment analysis. Here's a Twitter sentiment app.
Here's a question about sentiment analysis using Python.
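If you want something off the shelf in Python, NLTK ships VADER, a lexicon-based sentiment scorer. A minimal sketch; note it returns aggregate scores rather than the raw positive/negative word counts you described:

```python
# Sketch: off-the-shelf sentiment scoring with NLTK's VADER.
# Requires: import nltk; nltk.download("vader_lexicon")
from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("I love this, but the service was bad.")
print(scores)  # dict with 'neg', 'neu', 'pos', and 'compound' keys
```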
Before you analyse pieces of text, you need to preprocess the given text by stripping punctuation, repairing the language (e.g. spelling), splitting on spaces, lowercasing the whole text, and storing the words in an iterable data structure.
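A minimal sketch of that preprocessing step (plain standard library; spelling repair is left out):

```python
# Sketch: strip punctuation, lowercase, and split into words.
import string

def preprocess(text):
    """Return a list of lowercased words with punctuation removed."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return text.split()

words = preprocess("I LOVE this film, but the ending was bad!")
# -> ['i', 'love', 'this', 'film', 'but', 'the', 'ending', 'was', 'bad']
```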
For some basic sentiment analysis, the following techniques can be used:
Bag of words
In the bag-of-words technique, we basically go through a bag (file) of words and check whether the iterable we built contains them. If it does, we assign some value to each word's presence in order to weigh the total sentiment of the text.
This link should help you understand more about this
https://en.wikipedia.org/wiki/Bag-of-words_model
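A toy sketch of that scoring idea, building on the preprocessing above; the two word lists are tiny placeholders, and a real lexicon (e.g. AFINN or an opinion lexicon) has thousands of entries:

```python
# Sketch: toy lexicon-based sentiment scoring.
POSITIVE = {"love", "like", "enjoy", "great", "good"}
NEGATIVE = {"hate", "dislike", "bad", "awful", "poor"}

def score(words):
    """Return (#positive, #negative) counts for a preprocessed word list."""
    pos = sum(word in POSITIVE for word in words)
    neg = sum(word in NEGATIVE for word in words)
    return pos, neg

print(score(["i", "love", "this", "film", "but", "the", "ending", "was", "bad"]))
# -> (1, 1)
```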
Keyword Extraction and Tagging
Keywords and important information can be extracted from the input text by tagging the elements and then removing unwanted data.
For example:
My name is John.
Here "John" and "name" are the information, while "is" isn't really needed.
Similarly, verbs and other unimportant words can be removed in order to retain only the main information. Chunking and chinking help here.
This link must be of help.
http://nltk.org/book/ch07.html
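A minimal sketch of that chunking idea with NLTK, using the example sentence above (the chunk grammar here is a simple noun-phrase pattern of my own, not one prescribed by the book):

```python
# Sketch: keep noun phrases, drop everything else.
# Requires: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
import nltk

tagged = nltk.pos_tag(nltk.word_tokenize("My name is John."))
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"       # simple noun-phrase chunk pattern
tree = nltk.RegexpParser(grammar).parse(tagged)
for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
    print(subtree.leaves())  # e.g. [('name', 'NN')] and [('John', 'NNP')]
```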
You can tokenize your text and get the sentiment using existing sentiment analysis tools. The most comprehensive resource that I know of is SentiBench, which is basically a survey of sentiment analysis tools, along with code and examples of how to use them.