Word Frequency in a Wikipedia Article - python-3.x

How can I get the frequency of a specified word in a Wikipedia article without storing the whole article and then processing it? For example, how many times does the word "India" occur in this article: https://simple.wikipedia.org/wiki/India

Here's a simple-minded example that reads the web page line by line. But there is no guarantee the HTML is broken into lines. (It is in this case, over 1300 of them.)
import re
import urllib.request
from collections import Counter

URL = 'https://simple.wikipedia.org/wiki/India'

counter = Counter()
with urllib.request.urlopen(URL) as source:
    for line in source:
        # Split on runs of non-letters so markup and punctuation don't stick to words.
        words = re.split(r"[^A-Z]+", line.decode('utf-8'), flags=re.I)
        counter.update(words)

for word in ['India', 'Indian', 'Indians']:
    print('{}: {}'.format(word, counter[word]))
OUTPUT
> python3 test.py
India: 547
Indian: 75
Indians: 11
>
Note that this also counts terms that appear in the HTML structure of the page, not just in the visible content.
If you want to focus on content, consider the Pywikibot Python library, which uses the preferred MediaWiki API to extract content, though it appears to be based on a "complete page at a time" model, which you noted you're trying to avoid. Regardless, that module's documentation points to a list of similar, but more advanced, packages that you might want to review.
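As a rough sketch of that content-focused approach (my addition, not from the original answer; it assumes Pywikibot is installed and configured, and it does fetch the whole page's wikitext in one request):
import re
from collections import Counter
import pywikibot

# Assumes a working Pywikibot setup (a user-config.py, or PYWIKIBOT_NO_USER_CONFIG=1).
site = pywikibot.Site('simple', 'wikipedia')
page = pywikibot.Page(site, 'India')
text = page.text  # the full wikitext of the article, fetched in one request

counter = Counter(re.split(r"[^A-Za-z]+", text))
for word in ['India', 'Indian', 'Indians']:
    print('{}: {}'.format(word, counter[word]))
The counts will differ from the HTML-based version above, since wikitext omits the surrounding page chrome.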

Related

Is there a way to fetch the URL from a Google search result when a CSV file full of keywords is uploaded in Python?

Is it possible to obtain the URL from a Google search result page, given a keyword? I have a CSV file that contains a lot of company names, and I want their websites, i.e. whatever shows up at the top of the Google search results. When the CSV file is uploaded, each company name/keyword should be fetched and put into the search field.
For example: "stack overflow" is one of the entries in my CSV file; it should be fetched, put into the search field, and the best match/first URL from the search results should be returned, e.g. www.stackoverflow.com.
The returned result should be stored in the same file I uploaded, next to the keyword it was searched for.
I am not very familiar with these concepts, so any help will be much appreciated.
Thanks!
The google package has one dependency, beautifulsoup4, which needs to be installed first. Then install:
pip install google
search(query, tld='com', lang='en', num=10, start=0, stop=None, pause=2.0)
query : query string that we want to search for.
tld : the top-level domain, i.e. whether to search on google.com, google.in, or some other domain.
lang : the language to search in.
num : Number of results we want.
start : First result to retrieve.
stop : Last result to retrieve. Use None to keep searching forever.
pause : Lapse to wait between HTTP requests. Too short a pause may cause Google to block your IP; a longer pause makes your program slower but is the safer option.
Return : Generator (iterator) that yields found URLs. If the stop parameter is None, the iterator will loop forever.
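For instance, a minimal call (the query string here is just an illustration) looks like this:
from googlesearch import search  # installed via "pip install google"

# Print the first result for a single query; stop=1 limits the generator to one URL.
for url in search("stack overflow", tld="com", num=10, stop=1, pause=2.0):
    print(url)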
The code below is a solution to your question.
import pandas
from googlesearch import search

df = pandas.read_csv('test.csv')
result = []
for i in range(len(df['keys'])):
    # stop=1 keeps only the first (best-match) URL for each keyword.
    for j in search(df['keys'][i], tld="com", num=10, stop=1, pause=2):
        result.append(j)

dict1 = {'keys': df['keys'], 'url': result}
df = pandas.DataFrame(dict1)
df.to_csv('test.csv')
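One caveat (my addition, not part of the original answer): if a keyword yields no results, result ends up shorter than df['keys'] and the DataFrame constructor will raise. A defensive variant might look like this:
import pandas
from googlesearch import search

df = pandas.read_csv('test.csv')
urls = []
for key in df['keys']:
    # next() with a default guards against keywords that return no results.
    urls.append(next(search(key, tld="com", num=10, stop=1, pause=2), 'no result'))

df['url'] = urls
df.to_csv('test.csv', index=False)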

Unknown encoding of files in a resulting Beautiful Soup txt file

I downloaded 13,000 files (10-K reports from different companies) and I need to extract a specific part of these files (section 1A, Risk Factors). The problem is that I can open these files in Word easily and they look perfect, while in a plain text editor each document appears to be HTML with tons of encrypted-looking strings at the end (EDIT: I suspect this is due to the XBRL format of these files). The same thing happens in the output from BeautifulSoup.
I've tried an online decoder, because I thought this might be Base64, but it seems that none of the known encodings help. I saw that some files begin with something like "created with Certent Disclosure Management 6.31.0.1" and other programs, so I thought maybe that causes the encoding. Nevertheless, Word is able to open these files, so I guess there must be a known key to it. This is a sample of the encoded data:
M1G2RBE#MN)T='1,SC4,]%$$Q71T3<XU#[AHMB9#*E1=E_U5CKG&(77/*(LY9
ME$N9MY/U9DC,- ZY:4Z0EWF95RMQY#J!ZIB8:9RWF;\"S+1%Z*;VZPV#(MO
MUCHFYAJ'V#6O8*[R9L<VI8[I8KYQB7WSC#DMFGR[E6+;7=2R)N)1Q\24XQ(K
MYQDS$>UJ65%MV4+(KBRHJ3HFIAR76#G/F$%=*9FOU*DM-6TSTC$Q\[C$YC$/
And here is a sample file from the 13,000 that I downloaded.
Below is the BeautifulSoup code that I use to extract the text. It does its job, but I need to find a clue about this encoded string and somehow decode it in the Python code below.
from bs4 import BeautifulSoup

with open("98752-TOROTEL INC-10-K-2019-07-23", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'html.parser')
print(soup.getText())

with open("extracted_test.txt", "w", encoding="utf-8") as f:
    f.write(soup.getText())
What I want to achieve is decoding of this dummy string in the end of the file.
OK, this is going to be somewhat messy, but it will get you close to what you are looking for without using regex (which is notoriously problematic with HTML). The fundamental problem you'll face is that EDGAR filings are VERY inconsistent in their formatting, so what works for one 10-Q (or 10-K or 8-K) filing may not work for a similar filing, even from the same filer... For example, the word 'item' may appear in lower, upper, or mixed case, hence the use of the string.lower() method, etc. So there's going to be some cleanup under all circumstances.
Having said that, the code below should get you the RISK FACTORS sections from both filings (including the one which has none):
import requests
from bs4 import BeautifulSoup as bs

url = [one of these two]

response = requests.get(url)
soup = bs(response.content, 'html.parser')

risks = soup.find_all('a')
for risk in risks:
    # Look for anchors whose attributes mention "item" and "1a" (i.e. Item 1A).
    if 'item' in str(risk.attrs).lower() and '1a' in str(risk.attrs).lower():
        for i in risk.find_all_next():
            if 'item' in str(i.attrs).lower():
                break
            else:
                print(i.text.strip())
Good luck with your project!

Web scraping an http text file page at repeated intervals

I have successfully written code to web scrape an https text page
https://services.swpc.noaa.gov/text/goes-xray-flux-primary.txt
This page is automatically updated every 60 seconds, and I have used BeautifulSoup4 to scrape it. Here are my two questions: 1) How do I call a loop to re-scrape the page every 60 seconds? 2) Since there are no HTML tags associated with the page, how can I scrape only a specific line of data?
I was thinking that I might have to save the scraped page as a CSV file and then use the saved file to extract the data I need. However, I'm hoping this can all be done without saving the page to my local machine, and that there is some Python package that can do all of this for me.
import bs4 as bs
import urllib.request

sauce = urllib.request.urlopen("https://services.swpc.noaa.gov/text/goes-xray-flux-primary.txt").read().decode('utf-8')
soup = bs.BeautifulSoup(sauce, 'lxml')
print(soup)
I would like to automatically scrape the first line of data every 60 seconds. Here is an example first line of data:
2019 03 30 1233 58572 45180 9.94e-09 1.00e-09
The header that goes with this data is
YR MO DA HHMM Day Day Short Long
Ultimately I would like to use PyAutoGUI to trigger a CCD imaging application to start a sequence of images when the 'Short' and/or 'Long' X-ray flux reaches e-04 or greater.
Every tool has its place.
BeautifulSoup is a wonderful tool, but the .txt suffix on that URL is a big hint that this isn't quite the HTML input which bs4 was designed for.
I recommend a simpler approach for this fairly simple input.
from itertools import filterfalse

def is_comment(line):
    # The data file prefixes header/metadata lines with ':' or '#'.
    return (line.startswith(':')
            or line.startswith('#'))

lines = list(filterfalse(is_comment, sauce.split('\n')))
Now you can do word split on each line to convert to CSV or pandas dataframe.
Or you can just use lines[0] to access the first line.
For example, you might parse it out in this way:
yr, mo, da, hhmm, jday, sec, short, long = map(float, lines[0].split())
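To address the first question (re-scraping every 60 seconds), a plain polling loop with time.sleep() is enough. Here is a minimal sketch (my addition, assuming the same URL and the e-04 threshold mentioned in the question; trigger_imaging() is a hypothetical stand-in for the PyAutoGUI step):
import time
import urllib.request
from itertools import filterfalse

URL = "https://services.swpc.noaa.gov/text/goes-xray-flux-primary.txt"
THRESHOLD = 1e-4  # flux level at which to start the imaging sequence

def is_comment(line):
    return line.startswith(':') or line.startswith('#')

while True:
    text = urllib.request.urlopen(URL).read().decode('utf-8')
    lines = list(filterfalse(is_comment, text.split('\n')))
    yr, mo, da, hhmm, jday, sec, short, long_ = map(float, lines[0].split())
    if short >= THRESHOLD or long_ >= THRESHOLD:
        print('Threshold reached:', short, long_)
        # trigger_imaging()  # hypothetical PyAutoGUI call would go here
    time.sleep(60)  # wait a minute before polling again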

Confusion as to how to reference HTML code in Python

Here's my situation: I want to go to Yahoo's NHL site, here: http://sports.yahoo.com/nhl/scoreboard?d=2013-04-01
The above link is for April's scores.
I'm trying to get my Python code to display for me the various scores of the various table rows, as well as with the names of the teams.
However, the issue is that I am not good at referencing HTML code when using XPath, and I also feel like my code may actually be really wrong.
Here it is:
from lxml import etree
from urllib.request import urlopen

data = urlopen('http://sports.yahoo.com/nhl/scoreboard?d=2013-04-01').read()
result = etree.HTML(data)
for tr in result.xpath(''):
    print(tr)
The "for tr in result.xpath(''):" is left blank in the parentheses due to the issue I listed above.
It's a lot to cover. Sorry about that.
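For what it's worth, the general lxml pattern is to select the score-table rows with one XPath expression and then drill into each row for teams and scores. A rough sketch follows; the class names in the expressions are hypothetical placeholders, since the right ones can only be found by inspecting the page's actual markup:
from lxml import etree
from urllib.request import urlopen

data = urlopen('http://sports.yahoo.com/nhl/scoreboard?d=2013-04-01').read()
result = etree.HTML(data)

# 'scoreboard-row', 'team', and 'score' are made-up class names; replace them after
# inspecting the real page in your browser's developer tools.
for tr in result.xpath('//tr[contains(@class, "scoreboard-row")]'):
    teams = tr.xpath('.//td[contains(@class, "team")]//text()')
    scores = tr.xpath('.//td[contains(@class, "score")]//text()')
    print(teams, scores)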

How do I get a list of the most common words in various languages?

Stack Overflow implemented its "Related Questions" feature by taking the title of the current question being asked and removing from it the 10,000 most common English words according to Google. The remaining words are then submitted as a fulltext search to find related questions.
How do I get such a list of the most common English words? Or most common words in other languages? Is this something I can just get off the Google website?
A word frequency list is what you want. You can also make your own, or customize one for use within a particular domain, which is a nice way to become familiar with some good libraries. Start with some text such as that discussed in this question, then try out some variants of this back-of-the-envelope script:
import os
import string
from collections import defaultdict

from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
word_count = defaultdict(int)
source_directory = '/some/dir/full/of/text'

for root, dirs, files in os.walk(source_directory):
    for item in files:
        current_text = os.path.join(root, item)
        with open(current_text, 'r') as f:
            words = f.read().split()
        for word in words:
            # Strip punctuation, lowercase, then stem, so inflected forms collapse together.
            entry = ps.stem(word.strip(string.punctuation).lower())
            word_count[entry] += 1

results = [[word_count[i], i] for i in word_count]
print(sorted(results))
This gives the following (the tail of the sorted list, i.e. the most common word stems) on a couple of downloaded books:
[2955, 'that'], [4201, 'in'], [4658, 'to'], [4689, 'a'], [6441, 'and'], [6705, 'of'], [14508, 'the']]
See what happens when you filter out the most common x, y, or z number of words from your queries, or leave them out of your text search index entirely. You might also get some interesting results if you include real-world data; for example, "community" and "wiki" are probably not common words on a generic list, but on SO that obviously wouldn't be the case, and you might want to exclude them.
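If you would rather not build the list from your own text, one alternative (a sketch of my own, not from the original answer) is to derive it from a standard corpus shipped with NLTK, such as the Brown corpus:
import nltk
from nltk import FreqDist
from nltk.corpus import brown

nltk.download('brown')  # one-time corpus download

# Count alphabetic tokens, lowercased, and keep the 10,000 most frequent.
fd = FreqDist(w.lower() for w in brown.words() if w.isalpha())
most_common = [word for word, count in fd.most_common(10000)]
print(most_common[:20])
The Brown corpus is small and dated, so for production use a larger frequency list (e.g. one derived from Google's n-gram data) would be more representative.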
