I am preprocessing data for topic modeling, and I ran into a problem when trying to split my data into sentences.
I am working with party manifestos from the Manifesto Project (package ManifestoR), and I am trying to split with quanteda::corpus_reshape.
This is my code:
mp_setapikey("manifesto_apikey.txt")
corpus_green <- mp_corpus(party == "41113")
corpus_ger <- corpus(corpus_green)
corpus_ger_sent <- corpus_reshape(corpus_ger, to = "sentences")
Some of the texts aren’t split into sentences but are still bigger text chunks - see photos.
I want to continue with putting the data in a tidy data frame, then into a document-feature-matrix, and then do topic modeling.
Can someone help fix the sentences, please?
Related
I'm very new to data analytics and English is not my first language so please bear with me.
Currently working in Weka doing a sentiment (positive / negative) analysis of tweets. I have applied the StringToWordVector (including stemming and removing stop words) and afterwards ranked and removed all unnecessary attributes. I am left with 773 words / attributes.
My goal is to create a word cloud in either Tableau or Wordart that of course highlights how frequent each word is and also whether the word has a predominantly positive or negative
connotation.
I am struggling to find a way to process my Weka file so that the 773 words, their frequency and whether they have a positive or negative sentiment is saved and easily read into Tableau or Wordart.
Can anyone help with how to accomplish this?
I have attached a picture of the Weka preprocess window where I'm currently stuck.
Thanks!
My process so far:
RemoveWithValues to remove unwanted sentiment categories, so I'm left with 'positive' and 'negative'
StringToWordVector followed by ranking and removal of some attributes / words
Stuck... When loading the data into Tableau in it's current state just creates a big mess.
Goal is to visualize the data in a meaningful way. Additions to a word cloud are also welcome
I have just started a project in NLP. Suppose I have a graph for each word that shows the polarity distribution of sentiments for that word in different sentences. I want to know what I can use to recognize the feelings of new words? Any other use you have in mind I will be happy to share.
I apologize for any possible errors in my writing. Thanks a lot
Assuming you've got some words that have been hand-labeled with positive/negative sentiments, but then you encounter some new words that aren't labeled:
If you encounter the new words totally alone, outside of contexts, there's not much you can do. (Maybe, you could go out to try to find extra texts with those new words, such as vis dictionaries or the web, then use those larger texts in the next approach.)
If you encounter the new words inside texts that also include some of your hand-labeled words, you could try guessing that the new words are most like the words you already know that are closest-to, or used-in-the-same-places. This would leverage what's called "the distributional hypothesis" – words with similar distributions have similar meanings – that underlies a lot of computer natural-language analysis, including word2vec.
One simple thing to try along these lines: across all your texts, for every unknown word U, tally up the counts all neighboring words within N positions. (N could be 1, or larger.) From that, pick the top 5 words occuring most often near the unknown word, and look up your prior labels, and avergae them together (perhaps weighted by the number of occurrences.)
You'll then have a number for the new word.
Alternatively, you could train a word2vec set-of-word-vectors for all of your texts, including the unknown & know words. Then, ask that model for the N most-similar neighbors to your unknown word. (Again, N could be small or large.) Then, from among those neighbors with known labels, average them together (again perhaps weighted by similarity), to get a number for the previously unknown word.
I wouldn't particularly expect either of these techniques to work very well. The idea that individual words can have specific sentiment is somewhat weak given the way that in actual language, their meaning is heavily modified, or even reversed, by the surrounding grammar/context. But in each case these simple calculate-from-neighbors techniqyes are probably better than random guesses.
If your real aim is to calculate the overall sentiment of longer texts, like sentences, paragraphs, reviews, etc, then you should discard your labels of individual words an acquire/create labels for full texts, and apply real text-classification techniques to those larger texts. A simple word-by-word approach won't do very well compared to other techniques – as long as those techniques have plenty of labeled training data.
I'm currently learning how to use OpenAI API for text summarization for a project. Overall, it's pretty amazing but there is one thing I'm struggling with.
I need a tl/dr summary that is 1 - 2 complete sentences with max of 250 characters. I can play around with the MaximumLength option but if I make it too short, the summary often just ends up with a sentence that is just cut off in the middle.
Another problem is - if there is a bullet list in the main text, the summary will be a few of those bullets. Again, I need 1-2 complete sentences, not bullets.
Lastly, if the main text is quite short, often my summary will be 2 sentences that say the exact same thing with a slight variation.
I've tried this using the various engines (text-davinci, davinci-instruct-beta). Any suggestions on how I can instruct/guide OpenAI to give me the output that I'm looking for? Or, do I need to start doing the "Fine Tuning" option. If I feed it 1,000+ examples of 1-2 sentences with < 250 characters & no bullets, will it understand what I need?
Many thanks in advance.
Thanks for taking the time to read my question!
So I am running an experiment to see if I can predict whether an individual has been diagnosed with depression (or at least says they have been) based on the words (or tokens)they use in their tweets. I found 139 users that at some point tweeted "I have been diagnosed with depression" or some variant of this phrase in an earnest context (.e. not joking or sarcastic. Human beings that were native speakers in the language of the tweet were used to discern whether the tweet being made was genuine or not).
I then collected the entire public timeline of tweets of all of these users' tweets, giving me a "depressed user tweet corpus" of about 17000 tweets.
Next I created a database of about 4000 random "control" users, and with their timelines created a "control tweet corpus" of about 800,000 tweets.
Then I combined them both into a big dataframe,which looks like this:
,class,tweet
0,depressed,tweet text .. *
1,depressed,tweet text.
2,depressed,# tweet text
3,depressed,저 tweet text
4,depressed,# tweet text😚
5,depressed,# tweet text😍
6,depressed,# tweet text ?
7,depressed,# tweet text ?
8,depressed,tweet text *
9,depressed,# tweet text ?
10,depressed,# tweet text
11,depressed,tweet text *
12,depressed,#tweet text
13,depressed,
14,depressed,tweet text !
15,depressed,tweet text
16,depressed,tweet text. .
17,depressed,tweet text
...
50595,control,#tweet text?
150596,control,"# tweet text."
150597,control,# tweet text.
150598,control,"# tweet text. *"
150599,control,"#tweet text?"t
150600,control,"# tweet text?"
150601,control,# tweet text?
150602,control,# tweet text.
150603,control,#tweet text~
150604,control,# tweet text.
Then I trained a multinomial naive bayes classifier using an object from the CountVectorizer() class imported from the sklearn library:
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(tweet_corpus['tweet'].values)
classifier = MultinomialNB()
targets = tweet_corpus['class'].values
classifier.fit(counts, targets)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior= True)
Unfortunately, after running a 6-fold cross validation test, the results suck and I am trying to figure out why.
Total tweets classified: 613952
Score: 0.0
Confusion matrix:
[[596070 743]
[ 17139 0]]
So, I didn't properly predict a single depressed person's tweet! My initial thought is that I have not properly normalized the counts of the control group, and therefore even tokens which appear more frequently among the depressed user corpus are over represented in the control tweet corpus due to its much larger size. I was under the impression that .fit() did this already, so maybe I am on the wrong track here? If not, any suggestions on the most efficient way to normalize the data between two groups of disparate size?
You should use a re-sampling techniques to deal with unbalanced classes. There are many ways to do that "by hand" in Python, but I recommend unbalanced learn which compiles re-sampling techniques commonly used in datasets showing strong between-class imbalance.
If you are using Anaconda, you can use:
conda install -c glemaitre imbalanced-learn.
or simply:
pip install -U imbalanced-learn
This library is compteible with sci-kit learn. Your dataset looks very interesting, is it public? Hope this helps.
I recently installed PredictionIO.
What I'd like to achieve is: I'd like to categorize content on the words included in the text. But how can I import data like raw Tweets to PredictionIO? Is it possible to let PredictionIO run over the content and find strong words and to sort them in categories?
What I'd like to get is something like this: Query for Boston Red Sox --> keywords that should appear would be: baseball, Boston, sports, ...
So I'll add on a little to what Thomas said. He's right, it all depends whether or not you have labels associated to your tweets. If your data is labeled then this will be a Text Classification problem. Look at this for more detailed info:
If you're instead looking to cluster, or group, a set of unlabeled observations then, as Thomas said, your best bet is to incorporate LDA into the works. Look at the latter documentation for more information, but basically once you run the LDA model you'll obtain an object of type DistributedLDAModel which has a method topicDistributions which gives you, for each tweet, a vector where each component is associated to a topic, and the component entry gives you the probability that the tweet belongs to that topic. You can cluster by assigning each tweet the topic with highest probability.
You also have access to a matrix of size MxN, where M is the number of words in your vocabulary, and N is the number of topics, or clusters, you wish to discover in your data. You can roughly interpret the ij th entry of this Topics Matrix as the probability that the word i appears in a document given that the document belongs to topic j. Another rule you could use for clustering is to treat each word vector associated to your tweets as a vector of counts. Then, you can interpret the ij entry of the product of your word matrix (tweets as rows, words as columns) and the Topics Matrix returned by LDA as the probability that tweet i belongs to topic j (this follows under certain assumptions, feel free to ask if you want more details). Again now you assign tweet i to the topic associated to the largest numerical value in row i of the resulting matrix. You can even use this clustering rule for assigning topics to incoming observations once you have used your original set of tweets for topic discovery!
Now, for data processing, you can still use the Text Classification reference for transforming your Tweets to word count vectors via the DataSource and Preparator components. As for importing your data, if you have the tweets saved locally on a file, you can use PredictionIO's Python SDK to import your data. An example is also given in the classification reference.
Feel free to ask any questions if anything isn't clear, and good luck!
So, really depends on if you have labelled data.
For example:
Baseball :: "I love Boston Red Sox #GoRedSox"
Sports :: "Woohoo! I love sports #winning"
Boston :: "Baseball time at Fenway Park. Red Sox FTW!"
...
Then you would be able to train a model to classifying Tweets against these keywords. You might be interested in templates for MLlib Naive Bayes, Decision Trees.
If you don't have labelled data (really, who wants to manually label Tweets) you might be able to use approaches such as Topic Modeling (e.g., LDA).
I don't think there is a template for LDA but being an active open source project it wouldn't surprise me if someone has already implemented this so might be a good idea to ask on PredictionIO user or dev forums.