How can I get the search results for a word into a txt file from a Wikipedia site using bs4/Python? - python-3.x

I searched for the word 'Eudicots' on a Wikipedia page. The search URL shows 262 titles. How can I write the titles to a txt file? Is it possible with BeautifulSoup4/Python? How?

import requests, bs4

url = 'https://ta.wikipedia.org/w/index.php?title=specal:Search&limit=500&offset=0&profile=default&search=Eudicots&searchToken=doo0wuq364b1m60hlcb894gt6'
r = requests.get(url)
soup = bs4.BeautifulSoup(r.text, 'lxml')
t_tags = soup.find_all('div', class_="mw-search-result-heading")
# an explicit encoding keeps the Tamil titles intact on any platform
with open('a.txt', 'w', encoding='utf-8') as f:
    for t in t_tags:
        print(t.text, file=f)
out:
இருவித்திலைத் தாவரம்
கழுதைப்பிட்டி-மூலிகை
ஃபபேசியே பூக்குடும்பத்தின் பேரினங்கள் பட்டியல்
வில்வம்
பாலை (மரம்)
சந்தனம்
ஆத்தி
தோடம்பழம்
வேம்பு
நெல்லி
செங்கொடுவேரி
கரந்தை

Related

How do I combine paragraphs of web pages (from a text file containing urls)?

I want to read all the web pages, extract the text from them, and then remove white space and punctuation. My goal is to combine all the words across all the web pages and produce a dictionary that counts the number of times each word appears.
Following is my code:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as ureq
import re

def web_parsing(filename):
    with open(filename, "r") as df:
        urls = df.readlines()
    for url in urls:
        uClient = ureq(url.strip())  # strip the trailing newline from readlines()
        page_html = uClient.read()
        uClient.close()
        page_soup = soup(page_html, "html.parser")
        par = page_soup.findAll('p')
        for node in par:
            #print(node)
            text = ''.join(node.findAll(text=True))
            #text = text.lower()
            #text = re.sub(r"[^a-zA-Z-0-9 ]","",text)
            text = text.strip()
            print(text)
The output I got is:
[Paragraph1]
[paragraph2]
[paragraph3]
.....
What I want is:
[Paragraph1 paragraph2 paragraph 3]
Now, if I split the text here it gives me multiple lists:
[paragraph1], [paragraph2], [paragraph3]..
I want all the words of all the paragraphs of all the webpages in one list.
Any help is appreciated.
As far as I understand your question, you have a list of nodes from which you can extract strings, and you want those strings merged into a single string. This can be done simply by creating an empty string and appending each subsequent string to it.
result = ""
for node in par:
    text = ''.join(node.findAll(text=True)).strip()
    result += text + " "  # add a separator so the words do not run together
result = result.strip()
print(result)    # "Paragraph1 Paragraph2 Paragraph3"
print([result])  # ["Paragraph1 Paragraph2 Paragraph3"]
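To get from the merged text to the stated goal (a dictionary counting how often each word appears across all pages), a minimal sketch, assuming the paragraph texts have already been collected into a list; the `paragraphs` list here is made-up sample data:

```python
from collections import Counter
import re

# made-up sample texts standing in for the scraped paragraphs
paragraphs = ["The quick brown fox", "the lazy dog and the fox"]

words = []
for text in paragraphs:
    # lowercase and strip punctuation before splitting into words
    cleaned = re.sub(r"[^a-z0-9 ]", "", text.lower())
    words.extend(cleaned.split())

word_counts = Counter(words)
print(word_counts["the"])  # 3
print(word_counts["fox"])  # 2
```

`Counter` already behaves like the desired dictionary, so the counts can be written to a file or compared directly.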

How to print the table from a website using python script?

Here is my python script so far.
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'my_company_website'
# opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
# html parsing
page_soup = soup(page_html, "html.parser")
# grabs each product
containers = page_soup.findAll("div", {"class": "navigator-content"})
print(containers)
After this, in inspect element it is like below,
<div class ="issue-table-container">
<div>
<table id ="issuetable" class>
<thead>...</thead>
<tbody>...<t/body> (This contains all the information i want to print)
</table>
How can I print the table and export it to CSV?
For each of the containers you should grab the table [1], then find the body of the table, iterate over its rows [2], and compile a line for your CSV file from the table cells (td) [3]:
for container in containers:
    table = container.find(id="issuetable")  # [1]
    # if you are sure of the structure and/or the tables have unique ids and
    # there is only one table per container, you can also simply do:
    # table = container.table  # [1]
    for tr in table.tbody.find_all("tr"):  # [2]
        line = ""
        for td in tr.find_all("td"):  # [3]
            # add the td's text to the line, followed by the separator of
            # your choice (here a comma)
            line += td.text + ","
        # drop the trailing separator; "\n" is the newline character
        csvfile.write(line[:-1] + "\n")
There are different ways of navigating the soup tree depending on your needs and on how flexible your script needs to be.
Have a look at https://www.crummy.com/software/BeautifulSoup/bs4/doc/ and check out the find / find_all sections.
Good luck!
/Teo
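For the CSV export itself, Python's standard csv module is worth considering, since it quotes cells that contain commas. A minimal sketch, assuming the rows have already been extracted from the table into lists of strings; the `rows` data and the `issues.csv` filename are hypothetical:

```python
import csv

# hypothetical rows already extracted from the table's td cells
rows = [["KEY-1", "Open", "Bug"], ["KEY-2", "Done", "Task"]]

with open("issues.csv", "w", newline="") as f:
    writer = csv.writer(f)  # quotes cell text that contains commas
    writer.writerows(rows)

# read the file back to confirm the round trip
with open("issues.csv", newline="") as f:
    data = list(csv.reader(f))
print(data == rows)  # True
```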

Use bs4 to scrape specific html table among several tables in same page

So I want to scrape the last table, titled "Salaries", on this website: http://www.baseball-reference.com/players/a/alberma01.shtml
import urllib.request
from bs4 import BeautifulSoup

url = 'http://www.baseball-reference.com/players/a/alberma01.shtml'
r = urllib.request.urlopen(url).read()
soup = BeautifulSoup(r, 'lxml')
I've tried
div = soup.find('div', id='all_br-salaries')
and
div = soup.find('div', attrs={'id': 'all_br-salaries'})
When I print div I see the data from the table but when I try something like:
div.find('thead')
div.find('tbody')
I get nothing. My question is: how can I select the table correctly so that I can iterate over the tr/td and th tags to extract the data?
The reason? The HTML for that table is — don't ask me why — in a comment field. Therefore, dig the HTML out of the comment, turn that into soup and mine the soup in the usual way.
>>> import requests
>>> page = requests.get('http://www.baseball-reference.com/players/a/alberma01.shtml').text
>>> from bs4 import BeautifulSoup
>>> table_code = page[page.find('<table class="sortable stats_table" id="br-salaries"'):]
>>> soup = BeautifulSoup(table_code, 'lxml')
>>> rows = soup.findAll('tr')
>>> len(rows)
14
>>> for row in rows[1:]:
... row.text
...
'200825Baltimore\xa0Orioles$395,000? '
'200926Baltimore\xa0Orioles$410,000? '
'201027Baltimore\xa0Orioles$680,0002.141 '
'201128Boston\xa0Red\xa0Sox$875,0003.141 '
'201229Boston\xa0Red\xa0Sox$1,075,0004.141contracts '
'201330Cleveland\xa0Indians$1,750,0005.141contracts '
'201431Houston\xa0Astros$2,250,0006.141contracts '
'201532Chicago\xa0White\xa0Sox$1,500,0007.141contracts '
'201532Houston\xa0Astros$200,000Buyout of contract option'
'201633Chicago\xa0White\xa0Sox$2,000,0008.141 '
'201734Chicago\xa0White\xa0Sox$250,000Buyout of contract option'
'2017 StatusSigned thru 2017, Earliest Free Agent: 2018'
'Career to date (may be incomplete)$11,385,000'
EDIT: I found that this was in a comment field by opening the HTML for the page in the Chrome browser and then looking down through it for the desired table; the table's HTML sits right after an opening <!--.
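Instead of slicing the raw page string, the commented-out table can also be dug out with bs4's Comment class. A sketch using a minimal stand-in for the page (the inline `html` string mimics the structure described above; on the real page you would pass the downloaded page text to BeautifulSoup instead):

```python
from bs4 import BeautifulSoup, Comment

# minimal stand-in for the real page: the "Salaries" table sits inside an
# HTML comment within the div id="all_br-salaries"
html = """
<div id="all_br-salaries">
<!--
<table class="sortable stats_table" id="br-salaries">
<tr><td>2008</td><td>$395,000</td></tr>
</table>
-->
</div>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", id="all_br-salaries")
# pull the comment's text out of the div and parse it as its own soup
comment = div.find(string=lambda text: isinstance(text, Comment))
inner = BeautifulSoup(comment, "html.parser")
table = inner.find("table", id="br-salaries")
cells = [td.text for td in table.find_all("td")]
print(cells)  # ['2008', '$395,000']
```

Once the inner soup exists, the usual `find`/`find_all` navigation over tr/td/th works as on any other table.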

(Python) - How to store text extracted from an HTML table using BeautifulSoup in a structured Python list

I parse a webpage using beautifulsoup:
import requests
from bs4 import BeautifulSoup
page = requests.get("webpage url")
soup = BeautifulSoup(page.content, 'html.parser')
I find the table and print the text
Ear_yield= soup.find(text="Earnings Yield").parent
print(Ear_yield.parent.text)
And then I get the output of a single row in a table
Earnings Yield
0.01
-0.59
-0.33
-1.23
-0.11
I would like this output to be stored in a list so that I can write it to xls and operate on the elements (for example, if Earnings_Yield[0] > Earnings_Yield[1]).
So I write:
import html2text

text1 = Ear_yield.parent.text
Ear_yield_text = html2text.html2text(text1)
list_Ear_yield = []
for i in Ear_yield_text:
    list_Ear_yield.append(i)
Thinking that my web data has gone into the list, I print the fourth item to check:
print(list_Ear_yield[3])
I expect the output to be -0.33, but I get:
n
That means the list holds individual characters rather than the full words.
Please let me know where I am going wrong.
That is because your Ear_yield_text is a string rather than a list. Assuming the text contains newlines, you can split it directly:
list_Ear_yield = Ear_yield_text.split('\n')
Now if you print list_Ear_yield you will get this result:
['Earnings Yield', '0.01', '-0.59', '-0.33', '-1.23', '-0.11']
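Building on that, the split list can be converted into numbers for comparisons like the one the asker wants. A small sketch, starting from the split result shown above (empty strings, which `split('\n')` can produce, are filtered out first):

```python
# the split result from the answer above
list_Ear_yield = ['Earnings Yield', '0.01', '-0.59', '-0.33', '-1.23', '-0.11']

# drop any empty strings, keep the label, convert the rest to floats
label, *values = [s for s in list_Ear_yield if s]
numbers = [float(v) for v in values]
print(numbers[2])               # -0.33
print(numbers[0] > numbers[1])  # True
```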

How can I get the special search results for a word into a txt file from a Wikipedia site using bs4/Python?

I searched for the word 'Monocots' on one of Wikipedia's special search pages. The special search (special:WhatLinksHere) results show these words. How can I write the words to a txt file? Is it possible with BeautifulSoup4/Python? How?
import bs4, requests

r = requests.get('https://ta.wikipedia.org/w/index.php?title=special:WhatLinksHere/Monocots&limit=500')
soup = bs4.BeautifulSoup(r.text, 'lxml')
for li in soup.find(id='mw-whatlinkshere-list').find_all('li'):
    print(li.a['title'])
out:
கத்தூரி மஞ்சள்
கோரை
எருவை (புல்)
வார்ப்புரு:Taxonomy/Asparagus
வார்ப்புரு:Taxonomy/Asparagoideae
வார்ப்புரு:Taxonomy/Asparagaceae
வார்ப்புரு:Taxonomy/Asparagales
சாத்தாவாரி
மலையன்கிழங்கு
குழிவாழை
கருப்பன் புல்
காட்டுச்சேனை
துடைப்பப்புல்
கொண்டை ராகிசு
குறத்தி நிலப்பனை
செவ்வள்ளிக் கொடி
சாலமிசிரி
This is a very simple task, and BS4 is the right tool to use.
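Since the question asks for the words in a txt file rather than printed to the console, a minimal sketch of the writing step, assuming the titles have already been collected into a list (the filename `monocots.txt` and the sample titles are stand-ins for the scraped `li.a['title']` values):

```python
# made-up sample titles standing in for the scraped li.a['title'] values
titles = ['கத்தூரி மஞ்சள்', 'கோரை']

# an explicit UTF-8 encoding keeps the Tamil titles intact on any platform
with open('monocots.txt', 'w', encoding='utf-8') as f:
    for title in titles:
        print(title, file=f)

# read the file back to confirm the titles survived the round trip
with open('monocots.txt', encoding='utf-8') as f:
    lines = f.read().splitlines()
print(lines == titles)  # True
```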
