Extracting a table from a webpage in Python - python-3.x

I have tried reading a table from a website. As you can see from my code, I have taken a very roundabout route to get the table; I would appreciate it if someone could show me a quicker way to do the same.
Here's my code:
import urllib.request
from bs4 import BeautifulSoup

url = "http://www.kazusa.or.jp/codon/cgi-bin/showcodon.cgi?species=9606&aa=1&style=N"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html)
text = soup.get_text()

with open('myfile.txt', 'w') as file:
    file.writelines(text)

with open('myfile.txt', 'r') as g:
    f = g.readlines()

tab = f[12:31]
table = [x.strip() for x in tab]
Every time I run it, the round trip of writing the text to a file and reading it back causes problems.

You shouldn't need files. Filter for the pre tag instead, to target the table alone.
soup = BeautifulSoup(html)
text = soup.find('pre')
table = [x.strip() for x in text]
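For this particular page the pre block holds plain preformatted text, so splitting it into lines should give the table rows directly. A minimal end-to-end sketch along those lines, using the URL from the question:

import urllib.request
from bs4 import BeautifulSoup

url = "http://www.kazusa.or.jp/codon/cgi-bin/showcodon.cgi?species=9606&aa=1&style=N"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# the codon usage table sits in the page's <pre> block (assumed to be the first one)
pre = soup.find('pre')
# keep the non-empty, stripped lines as the table rows
table = [line.strip() for line in pre.get_text().splitlines() if line.strip()]
print(table)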

Related

How to scrape multiple pages with requests in python

I recently started getting into web scraping and have managed OK so far, but now I'm stuck and can't find the answer or figure it out.
Here is my code for scraping and exporting info from a single page
import requests
page = requests.get("https://www.example.com/page.aspx?sign=1")
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
# finds the right heading to grab
box = soup.find('h1').text
heading = box.split()[0]
# finds the right paragraph to grab
reading = soup.find_all('p')[0].text
print(heading, reading)
import csv
from datetime import datetime
# open a csv file with append, so old data will not be erased
with open('index.csv', 'a') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow([heading, reading, datetime.now()])
The problem occurs when I try to scrape multiple pages at the same time.
They are all the same; only the pagination changes, e.g.
https://www.example.com/page.aspx?sign=1
https://www.example.com/page.aspx?sign=2
https://www.example.com/page.aspx?sign=3
https://www.example.com/page.aspx?sign=4 etc
Instead of writing the same code 20 times, how do I collect all the data in a tuple or an array and export it to CSV?
Many thanks in advance.
Just use a loop and keep requesting pages until no page is available any more (the request is not OK). Should be easy enough:
import requests
from bs4 import BeautifulSoup
import csv
from datetime import datetime

results = []
page_number = 1
while True:
    response = requests.get(f"https://www.example.com/page.aspx?sign={page_number}")
    if response.status_code != 200:
        break
    soup = BeautifulSoup(response.content, 'html.parser')
    # finds the right heading to grab
    box = soup.find('h1').text
    heading = box.split()[0]
    # finds the right paragraph to grab
    reading = soup.find_all('p')[0].text
    # append a list
    # results.append([heading, reading, datetime.now()])
    # or a tuple.. your call
    results.append((heading, reading, datetime.now()))
    page_number = page_number + 1

with open('index.csv', 'a') as csv_file:
    writer = csv.writer(csv_file)
    for result in results:
        writer.writerow(result)
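One caveat, hedged because the real site is masked as example.com here: some sites answer 200 even for pages past the end and simply serve an empty template, so the status-code check never breaks. If that happens, break on the page content instead, for example right after soup is built inside the loop:
    if soup.find('h1') is None:
        break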

I am trying to import HTML text into a CSV with Beautiful Soup but the script outputs a "blank" CSV

I have written the code below to get the data from the website into a CSV.
Basically, I am interested in text like this in its entirety:
(Beskrivning:
PL-CH-DSPTC-AD10Semi-professional Technician / Administrator: Vocationally trained positions that need both practical and theoretical understanding and some significant advanced vocational experience to perform broad range of varying tasks and issues, in related field of work. Work performed is still procedurized, however issues and problems)
So my table should have one row for each description, please.
from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv

url = "http://www.altrankarlstad.com/wisp"
page = requests.get(url)
pagetext = page.text
soup = BeautifulSoup(pagetext, 'html.parser')
gdp_table = soup.find("table", attrs={"class": "table workitems-table mt-2"})

def table_to_df(table):
    return pd.DataFrame([[td.text for td in row.findAll('td')] for row in table.tbody.findAll('tr')])

file = open("data.csv", 'w')
for row in soup.find_all('tr'):
    for col in row.find_all('td'):
        print(col.text)
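Nothing is ever written to data.csv in the code above (the loop only prints), which is why the file comes out blank. A minimal sketch of the missing writing step with csv.writer, assuming the table with class "table workitems-table mt-2" is actually present in the fetched HTML and is the one wanted:

import csv
import requests
from bs4 import BeautifulSoup

url = "http://www.altrankarlstad.com/wisp"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
gdp_table = soup.find("table", attrs={"class": "table workitems-table mt-2"})

with open("data.csv", 'w', newline='') as f:
    writer = csv.writer(f)
    # one CSV row per table row; each <td>'s text becomes one column
    for row in gdp_table.find_all('tr'):
        cells = [td.get_text(strip=True) for td in row.find_all('td')]
        if cells:
            writer.writerow(cells)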

Scraping data and putting it in different columns using BeautifulSoup

I have written a script to scrape data from a website. It has 2 columns, but I want to add another column to it (an abstract column). How can I do this inside the same loop? I need to get the 'abstract' data in the third column. Image attached below.
The code is below:
import requests
import csv
from bs4 import BeautifulSoup
file = "Details181.csv"
Headers = ["Category", "Vulnerabilities", "Abstract"]
url = "https:/en/weakness?po={}"
with open(file, 'w', newline='') as f:
csvriter = csv.writer(f, delimiter=',', quotechar='"')
csvriter.writerow(Headers)
for page in range(1, 131):
r = requests.get(url.format(page))
soup = BeautifulSoup(r.text, 'lxml')
for title in soup.select('div.title > h1'):
csvriter.writerow([title.strip() for title in
title.text.split(':')]);
According to your description, I guessed that the abstract, category, and vulnerability probably share a common parent div element.
I then found that common div and extracted the data from it on every loop iteration, which confirmed the guess. I also added a default value for the vulnerability when the title has no vulnerability content.
The following code runs successfully:
import requests
import csv
from bs4 import BeautifulSoup

file = "Details181.csv"
Headers = ["Category", "Vulnerabilities", "Abstract"]
url = "https://vulncat.fortify.com/en/weakness?po={}"
with open(file, 'w', newline='') as f:
    csvriter = csv.writer(f, delimiter=',', quotechar='"')
    csvriter.writerow(Headers)
    for page in range(1, 131):
        r = requests.get(url.format(page))
        soup = BeautifulSoup(r.text, 'lxml')
        # find the common father div info
        all_father_info = soup.find_all("div", class_="detailcell weaknessCell panel")
        for father in all_father_info:
            # find the son h1, then extract category and vulnerability
            son_info_12 = father.find('h1').text.split(":")
            if len(son_info_12) == 2:
                category, vulnerability = son_info_12[0].strip(), son_info_12[1].strip()
            elif len(son_info_12) == 1:
                category = son_info_12[0].strip()
                vulnerability = ""
            else:
                category, vulnerability = "", ""
            # find the son div, then extract the abstract
            abstract = father.find("div", class_="t").text.strip()
            # write the data into the csv file
            csvriter.writerow([category, vulnerability, abstract])
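As a quick sanity check (just a sketch, assuming pandas is available), the finished file can be read back to confirm that all three columns were filled:

import pandas as pd

df = pd.read_csv("Details181.csv")
print(df.shape)   # rows x columns actually written
print(df.head())  # first few Category / Vulnerabilities / Abstract entries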

BeautifulSoup4 findAll() not getting all of the links on the webpage

I am trying to grab all of the 'a' links from a webpage:
from bs4 import BeautifulSoup
import requests

source_code = requests.get(starting_url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, "html.parser")
for link in soup.findAll('a'):
    href = link.get('href')
    print(href)
The list printed out is not all of the links on the page. If I try to print out plain_text, I can see all these links, but they are not printed as href.
First week learning Python! All help is greatly appreciated. Thanks!
Update: I forgot to share the plain_text file here. Sorry for the confusion.
The plain_text is pretty long so I'll just post the starting_url
starting_url = 'https://freeexampapers.com/index.php?option=com_content&view=article&id=1&Itemid=101&jsmallfib=1&dir=JSROOT/IB'
and yes I'm a high school student:-)
Since you have not given any data sample, here is a sample that you could try:
import re
soup = BeautifulSoup(html_page, "html.parser")
for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
    print(link.get('href'))
This should do it.
import re
import requests
from bs4 import BeautifulSoup
import os
import fileinput

Link = 'https://animetosho.org/view/jacobswaggedup-kill-la-kill-bd-1280x720-mp4-batch.n677876'
q = requests.get(Link)
soup = BeautifulSoup(q.text, 'html.parser')
#print(soup)
subtitles = soup.findAll('div', {'class': 'links'})
#print(subtitles)
with open("Anilinks.txt", "w") as f:
    for link in subtitles:
        x = link.find_all('a', limit=26)
        for a in x:
            url = a['href']
            f.write(url + '\n')
Now, if you want to do something like keep only certain links in the text file (here the solidfiles.com ones), do the following.
# Store the links we need in a list
links_to_keep = []
with open("Anilinks.txt", "r") as f:
    for line in f.readlines():
        if 'solidfiles.com' in line:
            links_to_keep.append(line)

# Write all the links in our list to the file
with open("Anilinks.txt", "w") as f:
    for link in links_to_keep:
        f.write(link)
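If the intermediate text file is not actually needed, the filtering can also be done directly on the parsed page with a CSS attribute selector; a minimal sketch, assuming the same animetosho page and that the wanted links sit under the div with class "links":

import requests
from bs4 import BeautifulSoup

Link = 'https://animetosho.org/view/jacobswaggedup-kill-la-kill-bd-1280x720-mp4-batch.n677876'
soup = BeautifulSoup(requests.get(Link).text, 'html.parser')

# <a> tags inside div.links whose href contains "solidfiles.com"
links_to_keep = [a['href'] for a in soup.select('div.links a[href*="solidfiles.com"]')]

with open("Anilinks.txt", "w") as f:
    for url in links_to_keep:
        f.write(url + '\n')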

Accessing commented HTML lines with BeautifulSoup

I am attempting to webscrape stats from this specific webpage: https://www.sports-reference.com/cfb/schools/louisville/2016/gamelog/
However, the table for the 'Defensive Game Log' appears to be commented out when I look at the HTML source (it starts with <!-- and ends with -->).
Because of this, the following code only grabs the offensive data, which is not commented out, while the defensive data stays hidden inside the comment.
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import re

accessurl = 'https://www.sports-reference.com/cfb/schools/oklahoma-state/2016/gamelog/'
req = Request(accessurl)
link = urlopen(req)
soup = BeautifulSoup(link.read(), "lxml")

tables = soup.find_all(['th', 'tr'])
my_table = tables[0]
rows = my_table.findChildren(['tr'])
for row in rows:
    cells = row.findChildren('td')
    for cell in cells:
        value = cell.string
        print(value)
I am curious whether there is any way to get all of the defensive values into a list the same way the offensive data is stored, whether inside or outside of BeautifulSoup4. Thanks!
Note that I added onto the solution given below, derived from here:
data = []
table = defensive_log
table_body = table.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])  # Get rid of empty values
The Comment object will give you what you want:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup, Comment

accessurl = 'https://www.sports-reference.com/cfb/schools/oklahoma-state/2016/gamelog/'
req = Request(accessurl)
link = urlopen(req)
soup = BeautifulSoup(link, "lxml")

comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for comment in comments:
    comment = BeautifulSoup(str(comment), 'lxml')
    defensive_log = comment.find('table')  # search as an ordinary tag
    if defensive_log:
        break
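Once defensive_log holds the commented-out table, the row-extraction snippet from the question can be run against it unchanged. As an alternative sketch, assuming pandas is installed and that the first commented-out table is the one wanted, pandas.read_html can turn the comment straight into a DataFrame:

import pandas as pd
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup, Comment

accessurl = 'https://www.sports-reference.com/cfb/schools/oklahoma-state/2016/gamelog/'
soup = BeautifulSoup(urlopen(Request(accessurl)), "lxml")

for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    if '<table' in comment:
        # read_html parses every <table> in the comment string into a DataFrame
        df = pd.read_html(str(comment))[0]
        print(df.head())
        break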
