How do I remove words that appear fewer than x times (for example, fewer than 3 times) in a pandas DataFrame? I use nltk to remove non-English words, but the result is not good, so I assume that words appearing fewer than 3 times are non-English.
input_text=["this is th text one tctst","this is text two asdf","this text will be remove"]
def clean_non_english(text):
    text = " ".join(w for w in nltk.wordpunct_tokenize(text)
                    if w.lower() in words or not w.isalpha())
    return text

Dataset['text'] = Dataset['text'].apply(lambda x: clean_non_english(x))
Desired output
input_text=["this is text ","this is text ","this is text"]
so words that appear in the list fewer than 3 times will be removed
Try this
import numpy as np

input_text = ["this is th text one tctst", "this is text two asdf", "this text will be remove"]
all_ = [x for y in input_text for x in y.split(' ')]
a, b = np.unique(all_, return_counts=True)
to_remove = a[b < 3]
output_text = [' '.join(np.array(y.split(' '))[~np.isin(y.split(' '), to_remove)])
               for y in input_text]
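If you'd rather avoid numpy, the same filtering can be sketched with collections.Counter. (Note that strictly applying the "fewer than 3 times" rule also drops "is", which only appears twice in the sample, so the output differs slightly from the desired output shown above.)

```python
from collections import Counter

input_text = ["this is th text one tctst",
              "this is text two asdf",
              "this text will be remove"]

# count every word across all rows
counts = Counter(word for line in input_text for word in line.split())

# keep only words that appear at least 3 times
output_text = [' '.join(w for w in line.split() if counts[w] >= 3)
               for line in input_text]
print(output_text)  # ['this text', 'this text', 'this text']
```

For a DataFrame column, you could build `counts` from the whole column first and then apply the same filter with `.apply()`.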
I'm a beginner in Python and I have a small question. There is a plain text file with timing tags (hours:minutes:seconds). 'Some text 1' and 'Some text 3' are advertisements. I need to calculate the total duration of the advertisements in the text. How do I do it?
<time=”08:00:00"><type=ad> some text 1 <time=”08:02:24"></type=ad> some text 2 <time=”08:10:18"><type=ad> some text 3 <time=”08:12:20"></type=ad>
If each timestamp is preceded by the string "time=", you can find the number of occurrences using the count method, e.g. num = string.count("time=")
Then you can use a for loop over this range to find the actual strings.
times = []
index = 0
for i in range(0, num):
    time = string.index("time=", index)  # search for "time=", starting from index
    times.append(string[time+6:time+14])
    index = time + 1
Then you can split the strings into hour-minute-seconds integer strings in a list like so:
hms = [timestr.split(":") for timestr in times]
In all, the following full code should work sufficiently
string = '<time=”08:00:00"><type=ad> some text 1 <time=”08:02:24"></type=ad> some text 2 <time=”08:10:18"><type=ad> some text 3 <time=”08:12:20"></type=ad>'
num = string.count("time=")
times = []
index = 0
for i in range(0, num):
    time = string.index("time=", index)  # search for "time=", starting from index
    times.append(string[time+6:time+14])
    index = time + 1
print(times)
hms = [timestr.split(":") for timestr in times]
print(hms)
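Since the question asks for the total ad duration, here is a short sketch of the final step, assuming (as in the sample) that the extracted timestamps alternate ad start / ad end:

```python
from datetime import datetime

# timestamps as extracted by the code above, alternating start / end
times = ["08:00:00", "08:02:24", "08:10:18", "08:12:20"]

# pair them up (start, end) and sum the differences in seconds
total = sum(
    (datetime.strptime(end, "%H:%M:%S")
     - datetime.strptime(start, "%H:%M:%S")).total_seconds()
    for start, end in zip(times[::2], times[1::2])
)
print(total)  # 266.0 seconds (2:24 + 2:02)
```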
Leave a comment if you have any questions
I'm writing a Python 3 program to count strings in a specified field within each line of one or more CSV files.
Where the csv file contains:
Field1, Field2, Field3, Field4
A, B, C, D
A, E, F, G
Z, E, C, D
Z, W, C, Q
the script is executed, for example:
$ ./script.py 1,2,3,4 file.csv
And the result is:
A 10
C 7
D 2
E 2
Z 2
B 1
Q 1
F 1
G 1
W 1
ERROR
the script is executed, for example:
$ ./script.py 1,2,3,4 file.csv file.csv file.csv
Where the error occurs:
for rowitem in reader:
    for pos in field:
        pos = rowitem[pos]  # <--- line generating the error --->
        if pos not in fieldcnt:
            fieldcnt[pos] = 1
        else:
            fieldcnt[pos] += 1
TypeError: list indices must be integers or slices, not str
Thank you!
Judging from the output, I'd say that the fields in the CSV file do not influence the count of the string. If string uniqueness is case-insensitive, remember to use yourstring.lower() so that different-case matches are counted as one. Also keep in mind that if your text is large, the number of unique strings could be very large as well, so some sort of sorting must be in place to make sense of it (otherwise it might be a long list of random counts, with a large portion of it being just 1s).
Now, to get a count of unique strings using the collections module is an easy way to go.
file = open('yourfile.txt', encoding="utf8")
a = file.read()

# if you have some words you'd like to exclude
stopwords = set(line.strip() for line in open('stopwords.txt'))
stopwords = stopwords.union(set(['<media', 'omitted>', "it's", 'two', 'said']))

# make an empty key-value dict to contain matched words and their counts
wordcount = {}
for word in a.lower().split():  # use the delimiter you want (a comma, I think?)
    # replace punctuation so it isn't counted as part of a word
    word = word.replace(".", "")
    word = word.replace(",", "")
    word = word.replace("\"", "")
    word = word.replace("!", "")
    if word not in stopwords:
        if word not in wordcount:
            wordcount[word] = 1
        else:
            wordcount[word] += 1
That should do it. The wordcount dict contains each word and its frequency. After that, just sort it using collections and print it out.

import collections

word_counter = collections.Counter(wordcount)
for word, count in word_counter.most_common(20):
    print(word, ": ", count)
I hope this solves your problem. Lemme know if you face problems.
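For what it's worth, the TypeError in the question comes from indexing the row with a string: splitting the argument "1,2,3,4" on commas yields strings, and rowitem[pos] needs an integer index. A minimal sketch of the counting logic with that conversion (the function name and layout here are my own, not from the question):

```python
import csv
from collections import Counter

def count_fields(paths, field_spec):
    """Count values in the given 1-based column numbers across CSV files."""
    fields = [int(f) - 1 for f in field_spec.split(',')]  # ints, not strings
    counts = Counter()
    for path in paths:
        with open(path, newline='') as fh:
            reader = csv.reader(fh)
            next(reader)  # skip the header row
            for row in reader:
                for pos in fields:
                    counts[row[pos].strip()] += 1
    return counts

# e.g. count_fields(sys.argv[2:], sys.argv[1]) for: ./script.py 1,2,3,4 file.csv
```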
I have a document that contains numbers in between. Is there a way I can replace all the numbers with their English equivalents?
eg:
My age is 10. I am in my 7th grade.
expected-o/p :
My age is Ten and I am in my seventh grade.
Thanks in advance
You'll want to take a look at num2words.
You'll have to construct a regexp to catch the numbers you want to replace and pass them to num2words. Based on the example provided, you also might need the ordinal flag.
import re
from num2words import num2words

# this is just an example, NOT ready-to-use code
text = "My age is 10. I am in my 7th grade."
to_replace = set(re.findall(r'\d+', text))           # find numbers to replace
longest = sorted(to_replace, key=len, reverse=True)  # sort so the longest are replaced first
for m in longest:
    n = int(m)                      # convert from string to number
    result = num2words(n)           # generate the text representation
    text = re.sub(m, result, text)  # substitute it in the text
print(text)
edited to reflect that OP wants to catch all digits
I have a challenge in my class that is to split a sentence into a list of separate words using iteration. I can't use any .split functions. Anybody had any ideas?
sentence = 'how now brown cow'
words = []
wordStartIndex = 0
for i in range(0, len(sentence)):
    if sentence[i:i+1] == ' ':
        if i > wordStartIndex:
            words.append(sentence[wordStartIndex:i])
        wordStartIndex = i + 1
if i > wordStartIndex:
    words.append(sentence[wordStartIndex:len(sentence)])
for w in words:
    print('word = ' + w)
Needs tweaking for leading spaces or multiple spaces or punctuation.
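One way to sketch those tweaks: accumulate characters one at a time instead of tracking indices, which handles leading and repeated spaces with no extra bookkeeping.

```python
sentence = '  how now  brown cow '
words = []
current = ''
for ch in sentence:
    if ch == ' ':
        if current:        # only emit when we have collected characters
            words.append(current)
            current = ''
    else:
        current += ch
if current:                # flush the final word, if any
    words.append(current)
print(words)  # ['how', 'now', 'brown', 'cow']
```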
I never miss an opportunity to drag out itertools.groupby():
from itertools import groupby
sentence = 'How now brown cow?'
words = []
for isalpha, characters in groupby(sentence, str.isalpha):
    if isalpha:  # characters are letters
        words.append(''.join(characters))
print(words)
OUTPUT
% python3 test.py
['How', 'now', 'brown', 'cow']
%
Now go back and define what you mean by 'word', e.g. what do you want to do about hyphens, apostrophes, etc.
I want to check if one of the words in "a" is within "text"
text = "testing if this works"
a = ['asd' , 'test']
print text.find(a)
how can I do this?
thanks
If you want to check whether any of the words in a is in text, use, well, any:
any(word in text for word in a)
If you want to know the number of words in a that occur in text, you can simply add them:
print('Number of words in a that match text: %s' %
      sum(word in text for word in a))
If you want to only match full words (i.e. you don't want to match test the word testing), split the text into words, as in:
words = set(text.split())
any(word in words for word in a)
In [20]: wordset = set(text.split())
In [21]: any(w in wordset for w in a)
Out[21]: False
Regexes can be used to search for multiple match patterns in a single pass:
>>> import re
>>> a = ['asd' , 'test']
>>> regex = re.compile('|'.join(map(re.escape, sorted(a, key=len, reverse=True))))
>>> print bool(regex.search(text)) # determine whether there are any matches
True
>>> print regex.findall(text) # extract all matching text
['test']
>>> regex.search(text).start() # find the position of the first match
0
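If, as in the set-based answer above, only whole words should count as matches, the regex approach can be combined with \b word boundaries, so 'test' no longer matches inside 'testing':

```python
import re

text = "testing if this works"
a = ['asd', 'test']

# \b anchors the alternation so only whole words match
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, a)) + r')\b')
print(bool(pattern.search(text)))  # False: 'test' only occurs inside 'testing'
print(pattern.findall("a test works"))  # ['test']
```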