I have a question. First of all, here is my code:
from urllib import request
from collections import Counter
from nltk import word_tokenize
URL = 'https://www.gutenberg.org/files/46/46-0.txt'
RESPONSE = request.urlopen(URL)
RAW = RESPONSE.read().decode('utf8')
print('\n')
type(RAW)
print('\n')
len(RAW)
TOKENS = word_tokenize(RAW)
print(type(TOKENS))
X = print(len(TOKENS))
print(TOKENS[:X])
print('\n')
c = Counter(RAW)
print(c.most_common(30))
Here is the first part of the output, which I am satisfied with:
['\ufeffThe', 'Project', 'Gutenberg', 'EBook', 'of', 'A', 'Christmas', 'Carol', ',', 'by', 'Charles',...]
Here is the second part of the output, which I am not satisfied with:
[(' ', 28438), ('e', 16556), ('t', 11960), ('o', 10940), ('a', 10092), ('n', 8868), ('i', 8791),...]
Here is my question: as you can see, I am counting the most frequently occurring strings in the text, but the problem is that I want to count the whole elements of the list of words. The final part of the second output should look something like this:
[('Dickens', 28438), ('Project', 16556), ('Gutenberg', 11960),...]
and not like the second part of the output above. I want to show the 30 most frequently used words in the text, not counts of the individual characters inside the list elements.
Do you know how I can solve that problem? Thanks for helping.
Try changing this line:
c = Counter(TOKENS)
Here is your full code with that change (I also tidied up the X = print(...) lines, since print() returns None):
from urllib import request
from collections import Counter
from nltk import word_tokenize
URL = 'https://www.gutenberg.org/files/46/46-0.txt'
RESPONSE = request.urlopen(URL)
RAW = RESPONSE.read().decode('utf8')
print('\n')
type(RAW)
print('\n')
len(RAW)
TOKENS = word_tokenize(RAW)
print(type(TOKENS))
print(len(TOKENS))
print(TOKENS)  # the original X = print(len(TOKENS)) set X to None, so TOKENS[:X] printed the whole list anyway
print('\n')
c = Counter(TOKENS)
print(c.most_common(500))
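To see why this fixes it: Counter counts whatever it iterates over, so a plain string gives per-character counts, while a list of tokens gives per-word counts. A tiny illustration:

from collections import Counter

sample = "to be or not to be"
print(Counter(sample).most_common(3))          # characters: [(' ', 5), ('o', 4), ('t', 3)]
print(Counter(sample.split()).most_common(3))  # words: [('to', 2), ('be', 2), ('or', 1)]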
To understand the values of each variable, I reworked a URL-replacement script from a Udacity class, converting the code inside a function into regular top-level code. However, my code does not work, while the code in the function does. I would appreciate it if anyone could explain why. Please pay particular attention to the function "tokenize".
The code below is from the Udacity class (copyright belongs to Udacity).
# download necessary NLTK data
import nltk
nltk.download(['punkt', 'wordnet'])

# import statements
import re
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

url_regex = 'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

def load_data():
    df = pd.read_csv('corporate_messaging.csv', encoding='latin-1')
    df = df[(df["category:confidence"] == 1) & (df['category'] != 'Exclude')]
    X = df.text.values
    y = df.category.values
    return X, y

def tokenize(text):
    detected_urls = re.findall(url_regex, text)  # here, "detected_urls" is a list for sure
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")  # I do not understand why this works here but not in my code unless I convert url to a string

    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens

X, y = load_data()
for message in X[:5]:
    tokens = tokenize(message)
    print(message)
    print(tokens, '\n')
This prints each message followed by its clean tokens, and it works as expected.
I want to understand the variables' values inside the function "tokenize()". The following is my code.
X, y = load_data()

detected_urls = []
for message in X[:5]:
    detected_url = re.findall(url_regex, message)
    detected_urls.append(detected_url)
print("detected_urls: ", detected_urls)  # outputs a list without problems

# replace each url in text string with placeholder
i = 0
for url in detected_urls:
    text = X[i].strip()
    i += 1
    print("LN1.url= ", url, "\ttext= ", text, "\n type(text)=", type(text))
    url = str(url).strip()  # if I do not convert it to a string, it is a list; it does not work in text.replace() below, but works in the function above
    if url in text:
        print("yes")
    else:
        print("no")  # always shows no
    text = text.replace(url, "urlplaceholder")
    print("\nLN2.url=", url, "\ttext= ", text, "\n type(text)=", type(text), "\n===============\n\n")
In the output, the values printed for "LN1" and "LN2" are the same, and the "if" condition always prints "no". I do not understand why this happens.
Any further help and advice would be highly appreciated.
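For reference, here is a minimal sketch of the effect being described (the message text is made up for illustration): wrapping a list in str() produces a bracketed, quoted representation, which is never a substring of the original text.

import re

url_regex = 'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
text = "see https://example.com for details"  # made-up message, not from the real data

detected = re.findall(url_regex, text)  # a list, e.g. ['https://example.com']
print(detected[0] in text)              # True: the bare URL string is in the text
print(str(detected) in text)            # False: "['https://example.com']" is not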
What is the fastest way to find all the substrings in a string, without using any modules and without making duplicates?
def lols(s):
    if not s:
        return 0
    lst = []
    for i in range(len(s)):
        for j in range(i, len(s) + 1):
            if not s[i:j]:
                pass
            elif len(s[i:j]) == len(set(s[i:j])):
                lst.append(s[i:j])
    res = max(lst, key=len)
s = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!\"#$%&'()*+,-./:;<=>?#[\\]^_`{|}~"
s = s*100
lols(s)
This function works fine with strings shorter than 1000 characters, but it freezes when the example string above is used, and the time limit is exceeded for large strings.
I recommend that you don't try this with super long strings like your s!
If you're able to install nltk so that it works (I recently had a problem with that, but managed to solve it by installing it in the Windows Sandbox; see: Python3: Could not find "vcomp140.dll (or one of its dependencies)" while trying to import nltk), then this is one way to do it:
from nltk import ngrams

def lols(s):
    lst = []
    for i in range(1, len(s)):
        lst.extend([''.join(j) for j in ngrams(s, i)])
    lst.append(s)
    return lst
If not, you can use the following instead of "from nltk import ngrams":
import collections, itertools

def ngrams(words, n):
    """Edited from https://stackoverflow.com/questions/17531684/n-grams-in-python-four-five-six-grams"""
    d = collections.deque(maxlen=n)
    d.extend(words[:n])
    words = words[n:]
    answer = []
    for window, word in zip(itertools.cycle((d,)), words):
        answer.append(''.join(window))
        d.append(word)
    answer.append(''.join(window))
    return answer
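A quick sanity check of the drop-in replacement on its own:
>>> ngrams('username', 3)
['use', 'ser', 'ern', 'rna', 'nam', 'ame']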
Demo:
>>> lols('username')
['u', 's', 'e', 'r', 'n', 'a', 'm', 'e', 'us', 'se', 'er', 'rn', 'na', 'am', 'me', 'use', 'ser', 'ern', 'rna', 'nam', 'ame', 'user', 'sern', 'erna', 'rnam', 'name', 'usern', 'serna', 'ernam', 'rname', 'userna', 'sernam', 'ername', 'usernam', 'sername', 'username']
Maybe your function's performance degrades like n² or n! (that is, O(n²) or O(n!)), or memory is getting tight.
As for the maximum size of string you can print to stdout with the print function: since you have to pass your text to print as a Python object, and since the maximum size of a variable depends on your platform, it can be 2**31 - 1 on a 32-bit platform and 2**63 - 1 on a 64-bit platform.
For more information, see sys.maxsize.
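You can check that limit on your own machine like this (just a small illustration):

import sys

print(sys.maxsize)               # e.g. 9223372036854775807 on a 64-bit build
print(sys.maxsize == 2**63 - 1)  # True on a 64-bit build, False on a 32-bit one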
def bibek():
    test_list = [[]]
    x = int(input("Enter the length of String elements using enter -: "))
    for i in range(0, x):
        a = str(input())
        a = list(a)
        test_list.append(a)
    del test_list[0]

    def filt(b):
        d = ['b', 'i', 'b']
        if b in d:
            return True
        else:
            return False

    for t in test_list:
        x = filter(filt, t)
        for i in x:
            print(i)

bibek()
Suppose test_list = [['b','i','b'], ['s','i','b'], ['r','i','b']]; the output should be "ib", since "ib" is common to all of them.
An option is to use set and its methods:
test_list = [['b','i','b'], ['s','i','b'], ['r','i','b']]

common = set(test_list[0])
for item in test_list[1:]:
    common.intersection_update(item)

print(common)  # {'i', 'b'}
UPDATE: now that you have clarified your question, I would do this:
from difflib import SequenceMatcher

test_list = [['b','i','b','b'], ['s','i','b','b'], ['r','i','b','b']]

# convert the lists to simple strings
strgs = [''.join(item) for item in test_list]

common = strgs[0]
for item in strgs[1:]:
    sm = SequenceMatcher(isjunk=None, a=item, b=common)
    match = sm.find_longest_match(0, len(item), 0, len(common))
    common = common[match.b:match.b + match.size]

print(common)  # 'ibb'
The trick here is to use difflib.SequenceMatcher in order to get the longest common string.
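To see what find_longest_match returns, here is a small standalone illustration with two of the strings above:

from difflib import SequenceMatcher

sm = SequenceMatcher(isjunk=None, a="sibb", b="bibb")
m = sm.find_longest_match(0, 4, 0, 4)
print(m)                         # Match(a=1, b=1, size=3)
print("bibb"[m.b:m.b + m.size])  # 'ibb'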
One more update, after another clarification of your question, this time using collections.Counter:
from collections import Counter

strgs = 'app', 'bapp', 'sardipp', 'ppa'

common = Counter(strgs[0])
print(common)

for item in strgs[1:]:
    c = Counter(item)
    for key, number in common.items():
        common[key] = min(number, c.get(key, 0))

print(common)                     # Counter({'p': 2, 'a': 1})
print(sorted(common.elements()))  # ['a', 'p', 'p']
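Counter also supports &, which keeps the minimum of each count, so the inner loop can be written more compactly with the same result:

from collections import Counter

strgs = 'app', 'bapp', 'sardipp', 'ppa'

common = Counter(strgs[0])
for item in strgs[1:]:
    common &= Counter(item)  # intersection: minimum of the corresponding counts

print(common)                     # Counter({'p': 2, 'a': 1})
print(sorted(common.elements()))  # ['a', 'p', 'p']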
I am trying to create a list of dictionaries that contain lists of words under the 'body' and 'summ' keys, using spacy. I am also using BeautifulSoup, since the actual data is raw HTML.
This is what I have so far:
from pymongo import MongoClient
from bs4 import BeautifulSoup as bs
import spacy
import string

clt = MongoClient('localhost')
db1 = clt['mchack']
db2 = clt['clean_data']

nlp = spacy.load('en')
valid_shapes = ['X.X', 'X.X.', 'X.x', 'X.x.', 'x.x', 'x.x.', 'x.X', 'x.X.']

cake = list()
sent_x = list()
temp_b = list()
temp_s = list()
sent_y = list()

table = str.maketrans(dict.fromkeys(string.punctuation))

for item in db1.article.find().limit(1):
    finale_doc = {}
    x = bs(item['news']['article']['Body'], 'lxml')
    y = bs(item['news']['article']['Summary'], 'lxml')

    for content in x.find_all('p'):
        v = content.text
        v = v.translate(table)
        sent_x.append(v)
    body = ' '.join(sent_x)

    for content in y.find_all('p'):
        v = content.text
        v = v.translate(table)
        sent_y.append(v)
    summ = ' '.join(sent_y)

    b_nlp = nlp(body)
    s_nlp = nlp(summ)

    for token in b_nlp:
        if token.is_alpha:
            temp_b.append(token.text.lower())
        elif token.shape_ in valid_shapes:
            temp_b.append(token.text.lower())
        elif token.pos_ == 'NUM':
            temp_b.append('<NUM>')
        elif token.pos_ == "<SYM>":
            temp_b.append('<SYM>')

    for token in s_nlp:
        if token.is_alpha:
            temp_s.append(token.text.lower())
        elif token.shape_ in valid_shapes:
            temp_s.append(token.text.lower())
        elif token.pos_ == 'NUM':
            temp_s.append('<NUM>')
        elif token.pos_ == "<SYM>":
            temp_s.append('<SYM>')

    finale_doc.update({'body': temp_b, 'summ': temp_s})
    cake.append(finale_doc)
    print(cake)

    del sent_x[:]
    del sent_y[:]
    del temp_b[:]
    del temp_s[:]
    del finale_doc
print(cake)
The first print statement gives the proper output:
'summ': ['as', 'per', 'the', 'budget', 'estimates', 'we', 'are', 'going', 'to', 'spend', 'rs', '<NUM>', 'crore', 'in', 'the', 'next', 'year'],
'body': ['central', 'government', 'has', 'proposed', 'spendings', 'worth', 'over', 'rs', '<NUM>', 'crore', 'on', 'medical', 'and', 'cash', 'benefits', 'for', 'workers', 'and', 'family', 'members']}]
However, after emptying the lists sent_x, sent_y, temp_b and temp_s, the second print gives:
[{'summ': [], 'body': []}]
You keep passing references to temp_b and temp_s. That's why, after you empty these lists, cake's content also changes (the values of the dictionary are the same objects as temp_b and temp_s)!
You simply need to make a copy before appending the finale_doc dict to the cake list:
finale_doc.update({'body': list(temp_b), 'summ': list(temp_s)})
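A minimal standalone illustration of the difference between storing a reference and storing a copy:

temp = [1, 2]
aliased = {'body': temp}       # stores a reference to the same list object
copied = {'body': list(temp)}  # stores an independent copy

del temp[:]                    # empty the original list in place

print(aliased)  # {'body': []}      -- follows the original list
print(copied)   # {'body': [1, 2]}  -- unaffected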
You should try creating a minimal reproducible version of this, as it would meet Stack Overflow guidelines and you would likely answer your own question.
I think what you are asking is this:
How can I empty a list without changing other instances of that list?
I made some code and I think it should work:
items = []
contents = []
for value in (1, 2):
    contents.append(value)
    items.append(contents)
    print(contents)
    del contents[:]
print(items)
This prints [1], [2] like I want, but then it prints [[], []] instead of [[1], [2]].
Then I could answer your question:
Objects (including lists) are stored and passed by reference, so this won't work.
Instead of modifying (adding to and then emptying) the same list, you probably want to create a new list inside the loop. You can verify this by looking at id(contents), id(items[0]), etc., and seeing that they are all the same list. You can even do contents.append(None); print(items) and see that you now have [[None], [None]].
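A quick check of that claim, using the same loop as above with the identity test added:

items = []
contents = []
for value in (1, 2):
    contents.append(value)
    items.append(contents)
    print(contents)
    del contents[:]

print(id(contents) == id(items[0]) == id(items[1]))  # True: one shared list object
contents.append(None)
print(items)  # [[None], [None]]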
Try doing

for ...:
    contents = []
    contents.append(value)

instead of

contents = []
for ...:
    del contents[:]
Edit: Another answer suggests making a copy of the values as you add them. This will work, but in your case I feel that making a copy and then nulling is unnecessarily complicated. This might be appropriate if you continued to add to the list.
I need to count the words on a web page using Python 3. Which module should I use? urllib?
Here is my code:
import urllib.request

def web():
    f = urllib.request.urlopen("https://americancivilwar.com/north/lincoln.html")
    lu = f.read()
    print(lu)
The self-explanatory code below gives you a good starting point for counting words within a web page:
import requests
from bs4 import BeautifulSoup
from collections import Counter
from string import punctuation

# We get the url
r = requests.get("https://en.wikiquote.org/wiki/Khalil_Gibran")
soup = BeautifulSoup(r.content)

# We get the words within paragraphs
text_p = (''.join(s.findAll(text=True)) for s in soup.findAll('p'))
c_p = Counter((x.rstrip(punctuation).lower() for y in text_p for x in y.split()))

# We get the words within divs
text_div = (''.join(s.findAll(text=True)) for s in soup.findAll('div'))
c_div = Counter((x.rstrip(punctuation).lower() for y in text_div for x in y.split()))

# We sum the two counters and get a list of word counts from most to least common
total = c_div + c_p
list_most_common_words = total.most_common()
If you want, for example, the 10 most common words, you just do:
total.most_common(10)
Which in this case outputs:
In [100]: total.most_common(10)
Out[100]:
[('the', 2097),
('and', 1651),
('of', 998),
('in', 625),
('i', 592),
('a', 529),
('to', 529),
('that', 426),
('is', 369),
('my', 365)]
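To run the same approach on the page from the question, only the URL needs to change (a sketch reusing the imports above; it assumes that page keeps its visible text in p and div tags, as most pages do):

r = requests.get("https://americancivilwar.com/north/lincoln.html")
soup = BeautifulSoup(r.content)

# collect text from both paragraph and div tags in one pass
text = (''.join(s.findAll(text=True)) for s in soup.findAll(['p', 'div']))
total = Counter(x.rstrip(punctuation).lower() for y in text for x in y.split())

print(total.most_common(10))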