Text categorization for classifying gender from blog entries - text

I am researching text analysis.
I have a corpus of documents, and I know the gender of the author of each file.
How should I build feature vectors for classifiers (naive Bayes, SVM, ...)? Alternatively, can you suggest some useful reading on this topic? Thank you!
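One common starting point for this kind of question is a bag-of-words representation, where each document becomes a vector of word counts over a shared vocabulary. The sketch below uses only the standard library; the function names, the toy corpus, and the labels are made up for illustration and are not from the original post.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def build_vocabulary(documents, min_count=1):
    """Map each word seen at least min_count times to a column index."""
    counts = Counter(tok for doc in documents for tok in tokenize(doc))
    words = sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i for i, w in enumerate(words)}

def vectorize(text, vocabulary):
    """Return a list of word counts, one entry per vocabulary word."""
    vector = [0] * len(vocabulary)
    for tok in tokenize(text):
        idx = vocabulary.get(tok)
        if idx is not None:  # words outside the vocabulary are ignored
            vector[idx] += 1
    return vector

# Hypothetical toy corpus; real blog entries would be read from files.
corpus = ["I went hiking today", "Today I cooked dinner", "Hiking and dinner"]
labels = ["female", "male", "female"]  # made-up labels, one per document

vocab = build_vocabulary(corpus)
features = [vectorize(doc, vocab) for doc in corpus]
```

The rows of `features`, paired with `labels`, are the kind of input a naive Bayes or SVM implementation typically expects. Raising `min_count` drops rare words, which keeps the vectors smaller.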

Related

Label Dutch reviews on specific customer categories for language classification

I am looking for a classification module that can classify reviews into custom categories. It needs to work specifically for Dutch reviews.
Does anyone know which package would be most suitable for this kind of project?
Thank you in advance.
Kind regards

How do I combine all BERT embeddings to form a feature?

Thank you in advance for any help offered. I am working on a product classification task. I used BERT to embed the customer reviews of each product one by one. I want to form a new feature called "customer review" (a vector representation of the reviews) for the products I want to classify. Is it feasible to form this feature by combining all the BERT embeddings for one product? If so, how should I do it? Any suggestions are appreciated.
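One simple way to combine a variable number of review embeddings into a single fixed-size feature is to average them element-wise (mean pooling). This is only a sketch of that one option, not the only reasonable choice; the function name, the product IDs, and the tiny 3-dimensional vectors are hypothetical stand-ins for real BERT embeddings.

```python
def mean_pool(embeddings):
    """Average a list of equal-length vectors element-wise."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    dim = len(embeddings[0])
    if any(len(vec) != dim for vec in embeddings):
        raise ValueError("all embeddings must have the same dimension")
    n = len(embeddings)
    return [sum(vec[i] for vec in embeddings) / n for i in range(dim)]

# Hypothetical 3-dimensional stand-ins for real review embeddings
# (BERT-base embeddings would have 768 dimensions).
review_embeddings = {
    "product_a": [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]],
    "product_b": [[0.0, 0.0, 6.0]],
}

# One "customer review" feature vector per product.
product_features = {
    product_id: mean_pool(vectors)
    for product_id, vectors in review_embeddings.items()
}
```

Averaging gives every product a vector of the same size regardless of how many reviews it has, at the cost of blurring differences between individual reviews.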

What is the impact of word frequency on Gensim LDA Topic modelling

I am trying to use Gensim LDA to topic-model a dataset of food recipes. I want the topics to be based on the key ingredients in each recipe, but the recipe text contains many generic English words that are not ingredient names, so my topics are not as good as expected. I am trying to understand the impact of word frequency on the LDA topics. Thanks.
Have you tried removing stop words from the data on which you build the LDA model?
Also, please bear in mind that it is not really possible to influence the assignment of words among the topics. This has been discussed in the answer to this question: how to improve word assignment in different topics in LDA
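To make the stop-word suggestion concrete, here is a minimal standard-library sketch that drops stop words and words appearing in too many or too few documents before any topic model is built. The function name, thresholds, and toy recipes are assumptions for illustration; Gensim's `Dictionary.filter_extremes` offers similar pruning on a real dictionary.

```python
from collections import Counter

def filter_by_document_frequency(tokenized_docs, stop_words=(),
                                 no_below=1, no_above=0.5):
    """Drop stop words and words that are too rare or too widespread.

    A word is kept if it appears in at least `no_below` documents and in
    at most a fraction `no_above` of all documents.
    """
    stop = set(stop_words)
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each word once per document
    keep = {
        word for word, df in doc_freq.items()
        if word not in stop and df >= no_below and df / n_docs <= no_above
    }
    return [[word for word in doc if word in keep] for doc in tokenized_docs]

# Made-up, already tokenized recipes for illustration.
recipes = [
    ["mix", "the", "flour", "and", "sugar"],
    ["mix", "the", "tomato", "and", "basil"],
    ["mix", "the", "rice", "and", "beans"],
    ["chop", "the", "tomato", "and", "onion"],
]
filtered = filter_by_document_frequency(
    recipes, stop_words={"the", "and"}, no_above=0.5
)
```

In this toy data, "mix" appears in three of four recipes and is removed as too generic, while "tomato" (two of four) stays. The thresholds usually need tuning per dataset.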

Detecting questions in text

I have a project where I need to analyze a text to determine whether the user who posted it needs help with something. I tried sentiment analysis, but it did not work as expected. My idea was to find the negative posts, extract the main words from each post, and suggest articles on that subject to the user. If there is another approach that could help, please post it below. Thanks.
The dataset I used was built for sentiment analysis, but I found that it does not work here, and I need a dataset suited to this task.
Apply NLP preprocessing before running the sentiment analysis. Use TF-IDF or Word2Vec to create vectors from the dataset, and then try the sentiment analysis. You may also need GloVe vectors for the analysis.
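For reference, a bare-bones TF-IDF computation can be sketched with the standard library alone. This uses the plain `tf * log(N / df)` weighting, which is only one of several variants (libraries often apply smoothing or normalization); the function name and the toy posts are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Return (vocabulary, vectors) using tf * log(N / df) weighting."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each word once per document
    vocab = sorted(doc_freq)
    idf = {word: math.log(n_docs / doc_freq[word]) for word in vocab}
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append([counts[word] / total * idf[word] for word in vocab])
    return vocab, vectors

# Made-up, already tokenized posts for illustration.
posts = [
    ["i", "need", "help"],
    ["i", "fixed", "it"],
    ["i", "need", "advice"],
]
vocab, vectors = tfidf_vectors(posts)
```

With this weighting, a word that occurs in every post (here "i") gets a weight of zero, while words specific to a few posts get higher weights.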
I found that this area of machine learning is called "Natural Language Questions". Models in this field are trained to detect questions in text and suggest answers to them based on the dataset you are working with. Check this article for more detail.

Sources of classified sentiment data?

I'm looking to train a naive Bayes classifier with some new data sources that haven't been used before. I've already looked at the Pang & Lee corpus of IMDB reviews and the MPQA opinion corpus. I'm looking for new web services that fit the following criteria.
Easily Classified - must have a like/dislike or 5 star rating
Readily available
Pertain to new material (less important than the first two)
Here are some samples I have come up with on my own.
Etsy API
Rotten Tomatoes API
Yelp API
Any other suggestions would be much appreciated =)
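As an illustration of the "easily classified" criterion, star ratings can be mapped to sentiment labels with a small helper. This is a sketch under assumed thresholds (1-2 stars negative, 4-5 positive, 3 skipped as ambiguous); the function name and the sample reviews are made up and do not come from any of the APIs listed.

```python
def rating_to_label(stars, negative_max=2, positive_min=4):
    """Map a 1-5 star rating to 'negative', 'positive', or None (neutral)."""
    if not 1 <= stars <= 5:
        raise ValueError(f"rating out of range: {stars}")
    if stars <= negative_max:
        return "negative"
    if stars >= positive_min:
        return "positive"
    return None  # middle ratings are ambiguous and skipped here

# Made-up reviews for illustration; real ones would come from a review API.
reviews = [
    ("Loved it", 5),
    ("Terrible service", 1),
    ("It was okay", 3),
    ("Pretty good", 4),
]

training_data = []
for text, stars in reviews:
    label = rating_to_label(stars)
    if label is not None:
        training_data.append((text, label))
```

Dropping the middle ratings trades away some data for cleaner labels; whether that is worthwhile depends on how many reviews the source provides.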
Pang & Lee's later work (2008), "Opinion Mining and Sentiment Analysis", has a section on publicly available resources, with links to those corpora.
Take a look at Sentiment140. It has a corpus that you can download and train on, and you can easily extend it with new tweets.

Resources