I have a dataset that includes restaurant reviews. I've processed my data, and this is what it looks like (the 0 or 1 label indicates whether the review is negative or positive):
0 ['wow', 'loved', 'place'] 1
1 ['crust', 'good'] 0
2 ['not', 'tasty', 'texture', 'nasty'] 0
3 ['stopped', 'late', 'may', 'bank', 'holiday', ... 1
4 ['the', 'selection', 'menu', 'great', 'prices'] 1
In short, I want to use PorterStemmer, and this is how I tried to use it:
for i in range(1000):
    for word in df['Review'][i]:
        word = stemmer.stem(word)
I tried to use PorterStemmer for stemming, but it did not work: no word was stemmed (for example, in the first row I expected 'loved' to become 'love'). My data is still the same as the DataFrame I shared above, and I could not fix this.
Your code - which would be much easier to run/debug if it were a minimal reproducible example - never stores the result of the stemming anywhere: reassigning the loop variable word has no effect on df['Review'][i]. You need to write the stemmed word back into the list:
for i in range(1000):
    for j, word in enumerate(df['Review'][i]):
        word = stemmer.stem(word)
        df['Review'][i][j] = word  ########## added: store the stemmed word back
If you also add:
        print(f"{word=}")
the stemming output is:
word='wow'
word='love'
word='place'
word='crust'
word='good'
word='not'
word='tasti'
word='textur'
word='nasti'
word='stop'
word='late'
word='may'
word='bank'
word='holiday'
word='the'
word='select'
word='menu'
word='great'
word='price'
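As an aside, an index loop like this isn't idiomatic pandas. Here is a sketch of the same stemming done with apply, assuming df['Review'] holds lists of tokens and stemmer is an NLTK PorterStemmer:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# apply builds new lists, so nothing needs to be mutated in place
df['Review'] = df['Review'].apply(lambda words: [stemmer.stem(word) for word in words])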
Next time you ask a question, make the code a minimal reproducible example. This does two things: (1) reducing your code to a minimal example and confirming that it still shows the same problem often helps you find and fix the problem yourself, and (2) it makes it much easier for readers of your question to test things for themselves. Never forget that you're asking people to spend their own time and effort to help you, with at best a point or two of reputation as the reward; providing short, runnable code and a clear description of what's wrong will help you get an answer.
Related
I made a neural network that shows me the probability that a comment is positive, calculated from 0 to 1.
I can now feed it new data, and this line gives me a result:
Dcnn(np.array([tokenizer.encode("I feel very happy with the product")]), training=False).numpy()
The result then looks something like this:
array([[0.9083]], dtype=float32)
As you can see, I passed in a single text; now I would like to write a loop that feeds it n texts. I would be happy if someone can help me.
I am expecting to get a result for each comment, something like this:
Text 1: "......." ; prob: 0.0002
Text 2: "......." ; prob: 0.7840
This should be as simple as:
for index, comment in enumerate(comments, 1):
    pred_proba = Dcnn(np.array([tokenizer.encode(comment)]), training=False).numpy()[0][0]
    print(f"Text {index}: '{comment}'; prob: {pred_proba}")
Hope this helps!
The dataset has 14k rows and contains many titles, etc. I am a beginner in Pandas and Python, and I'd like to know how to extract the first name and last name from this dataset.
Dataset:
0 Pr.Doz.Dr. Klaus Semmler Facharzt für Frauenhe...
1 Dr. univ. (Budapest) Dalia Lax
2 Dr. med. Jovan Stojilkovic
3 Dr. med. Dirk Schneider
4 Marc Scheuermann
14083 Bag Kinderarztpraxis
14084 Herr Ulrich Bromig
14085 Sohn Heinrich
14086 Herr Dr. sc. med. Amadeus Hartwig
14087 Jasmin Rieche
for name in dataset:
    first = name.split()[-2]
    last = name.split()[-1]
    # save here
This will work for most names, but not all. To make it more reliable you may need a list of titles, such as (dr., md., univ.), to skip over.
As the data doesn't contain any structure, you're out of luck. An ad-hoc solution could be to write down a list of all the locations/titles/conjunctions and other noise you've identified, and then strip those from the rows. Then, if you notice other things you'd like to exclude, just add them to your list.
This will not solve the issue of certain rows having their name in reverse order, so it'll still require you to manually go over everything and check whether each row is valid, but it might be quicker than editing each row by hand.
A simple, brute-force example would be:
excludes = {'dr.', 'herr', 'budapest', 'med.', 'für', ... }

new_entries = []
for title in all_entries:
    cleaned_result = []
    parts = title.split(' ')
    for part in parts:
        if part.lower() not in excludes:  # str has no .lowercase(); the method is .lower()
            cleaned_result.append(part)
    new_entries.append(' '.join(cleaned_result))
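Since the question mentions Pandas, the same cleaning can also be applied column-wise. A minimal sketch, assuming the names live in a DataFrame column called 'name' (the column name here is made up):

import pandas as pd

excludes = {'dr.', 'herr', 'med.', 'für'}  # extend as you identify more noise

def clean(title):
    return ' '.join(part for part in title.split() if part.lower() not in excludes)

df['cleaned'] = df['name'].apply(clean)
# rough first/last-name guess from the last two tokens, as in the first answer
df['first'] = df['cleaned'].str.split().str[-2]
df['last'] = df['cleaned'].str.split().str[-1]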
I have some files that need to be sorted by name. Unfortunately I can't use a regular sort, because I also want to sort by the numbers in the string, so I did some research and found that what I am looking for is called natural sorting.
I tried the solution given here and it worked perfectly.
However, strings like PresserInc-1_10.jpg and PresserInc-1_11.jpg cause that specific natural-key algorithm to fail, because it only matches the first integer, which in this case would be 1 and 1, and that throws off the sorting. What I think might help is to match all the numbers in the string and group them together, so for PresserInc-1_11.jpg the algorithm should give me 111 back. So my question is: is this possible?
Here's a list of filenames:
files = ['PresserInc-1.jpg', 'PresserInc-1_10.jpg', 'PresserInc-1_11.jpg', 'PresserInc-10.jpg', 'PresserInc-2.jpg', 'PresserInc-3.jpg', 'PresserInc-4.jpg', 'PresserInc-5.jpg', 'PresserInc-6.jpg', 'PresserInc-11.jpg']
Google: Python natural sorting.
Result 1: The page you linked to.
But don't stop there!
Result 2: Jeff Atwood's blog that explains how to do it properly.
Result 3: An answer I posted based on Jeff Atwood's blog.
Here's the code from that answer:
import re

def natural_sort(l):
    convert = lambda text: int(text) if text.isdigit() else text.lower()
    alphanum_key = lambda key: [convert(c) for c in re.split('([0-9]+)', key)]
    return sorted(l, key=alphanum_key)
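For intuition: the capturing group in re.split keeps the digit runs, so each filename becomes a mixed list of strings and ints that compares numerically where it matters. For example:

>>> import re
>>> convert = lambda text: int(text) if text.isdigit() else text.lower()
>>> [convert(c) for c in re.split('([0-9]+)', 'PresserInc-1_10.jpg')]
['presserinc-', 1, '_', 10, '.jpg']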
Results for your data:
PresserInc-1.jpg
PresserInc-1_10.jpg
PresserInc-1_11.jpg
PresserInc-2.jpg
PresserInc-3.jpg
etc...
See it working online: ideone
If you don't mind third party libraries, you can use natsort to achieve this.
>>> import natsort
>>> files = ['PresserInc-1.jpg', 'PresserInc-1_10.jpg', 'PresserInc-1_11.jpg', 'PresserInc-10.jpg', 'PresserInc-2.jpg', 'PresserInc-3.jpg', 'PresserInc-4.jpg', 'PresserInc-5.jpg', 'PresserInc-6.jpg', 'PresserInc-11.jpg']
>>> natsort.natsorted(files)
['PresserInc-1.jpg',
'PresserInc-1_10.jpg',
'PresserInc-1_11.jpg',
'PresserInc-2.jpg',
'PresserInc-3.jpg',
'PresserInc-4.jpg',
'PresserInc-5.jpg',
'PresserInc-6.jpg',
'PresserInc-10.jpg',
'PresserInc-11.jpg']
I am working on time series forecasting (daily entries) using pyramid-arima's auto_arima in Python, where y is my target and x_features are all exogenous variables. I want the best-order model based on the lowest AIC, but auto_arima returns only a few order combinations.
My first call (start_p = start_q = 0 and max_p = 0, max_q = 3) returns all 4 combinations, but my second call (start_p = start_q = 0 and max_p = 3, max_q = 3) returns only 7 combinations and didn't give (0,1,2), (0,1,3) and others, which leads to wrong model selection based on AIC. All other parameters are at their defaults, e.g. max_order = 10.
Is there anything I am missing or have done wrong?
Thank you in advance.
You pass error_action='ignore', so (0,1,2) and (0,1,3) (and other orders) probably raised errors, which is why they don't appear in the results.
(I don't have enough reputation to write a comment, sorry).
The number of models auto_arima trains depends on the data you feed in and on the stepwise parameter. With stepwise=True, auto_arima uses a proven stepwise search to reduce the number of iterations needed to find the best model, which works well in the vast majority of cases unless the data is highly irregular.
If you want the remaining models to run as well, and it isn't taking a lot of time to execute, try stepwise=False, which trains on all possible parameter combinations.
Hope this helps.
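A minimal sketch of such an exhaustive search with pmdarima (the successor to pyramid-arima), using y and x_features from the question; note that depending on your version the exogenous keyword is X= or exogenous=:

import pmdarima as pm

model = pm.auto_arima(
    y, X=x_features,
    start_p=0, start_q=0, max_p=3, max_q=3,
    stepwise=False,       # evaluate every (p, d, q) combination
    error_action='warn',  # surface orders that fail instead of silently skipping them
    trace=True,           # print each tried order together with its AIC
)
print(model.order, model.aic())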
I want to get the total number of positive sentences and negative sentences from a dataset that I have run my testing on. How can I count the total number of positive and negative sentences?
import sklearn
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline  # use a pipeline for feature extraction and the algorithm

moviedirt = r'C:\Users\premier\Downloads\Reviews\test'  # raw string, so single backslashes suffice
movie_test = load_files(moviedirt, shuffle=True)
movie_test.target_names
movie_test.data[0:10000]

pipeline = Pipeline([('vect', CountVectorizer(stop_words='english')),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB(fit_prior=False))])

clf = pipeline.fit(movie_train.data, movie_train.target)  # train the classifier; movie_train is loaded the same way as movie_test
predict1 = clf.predict(movie_test.data)

for review, category in zip(movie_test.data, predict1):
    print('%r => %s' % (review, movie_train.target_names[category]))
This is the full testing code.
Here is the output:
b"Don't hate Heather Graham because she's beautiful, hate her because she's
fun to watch in this movie. Like the hip clothing and funky surroundings, the
actors in this flick work well together. Casey Affleck is hysterical and
Heather Graham literally lights up the screen. The minor characters - Goran
Visnjic {sigh} and Patricia Velazquez are as TALENTED as they are gorgeous.
Congratulations Miramax & Director Lisa Krueger!" => pos
b'I don\'t know how this movie has received so many positive comments. One
can call it "artistic" and "beautifully filmed", but those things don\'t make
up for the empty plot that was filled with sexual innuendos. I wish I had not
wasted my time to watch this movie. Rather than being biographical, it was a
poor excuse for promoting strange and lewd behavior. It was just another
Hollywood attempt to convince us that that kind of life is normal and OK.
From the very beginning I asked my self what was the point of this movie,and
I continued watching, hoping that it would change and was quite disappointed
that it continued in the same vein. I am so glad I did not spend the money to
see this in a theater!' => neg
import numpy as np
# Number of pos/neg samples in your training set
print(np.unique(movie_train.target, return_counts=True))
# Number of pos/neg samples in your predictions
print(np.unique(predict1, return_counts=True))
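To print those counts with their label names instead of raw 0/1 values, a small extension (target_names comes from load_files, as in your code):

labels, counts = np.unique(predict1, return_counts=True)
for label, count in zip(labels, counts):
    print(f"{movie_train.target_names[label]}: {count}")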