Scraping information from website using beautifullsoup wont work - python-3.x

I have been using beautiful soup to extract info from the website http://slc.bioparadigms.org
But I am only interested in the diseases and OMIM number, so for each SLC transporter which i already have in a list i want to extract these 2 characteristics. The thing is that both are related to class prt_col2. So if i search for this class i get a lot of hits. How can I only get the diseases? Also sometimes there are no diseases related to the SLC transporter or sometimes there is no OMIM number. How can i extract the information? I put some screenshots below to show you how it looks like. Any help will be highly appreciated! This is my first post here so forgive me for any mistakes or missing information. Thank you!
http://imgur.com/aTiGi84 other one is /L65HSym
So ideally the output will be for example:
transporter: SLC1A1
Disease: Epilepsy
OMIM: 12345
Edit: the code i have so far:
import os
import re
from bs4 import BeautifulSoup as BS
import requests
import sys
import time
def hasNumbers(inputString): #get transporter names which contain numbers
return any(char.isdigit() for char in inputString)
def get_list(file): #get a list of transporters
transporter_list=[]
lines = [line.rstrip('\n') for line in open(file)]
for line in lines:
if 'SLC' in line and hasNumbers(line) == True:
get_SLC=line.split()
if 'SLC' in get_SLC[0]:
transporter_list.append(get_SLC[0])
return transporter_list
def get_transporter_webinfo(transporter_list):
output_Website=open("output_website.txt", "w") # get the website content of all transporters
for transporter in transporter_list:
text = requests.get('http://slc.bioparadigms.org/protein?GeneName=' + transporter).text
output_Website.write(text) #ouput from the SLC tables website
soup=BS(text, "lxml")
disease = soup(text=re.compile('Disease'))
characteristics=soup.find_all("span", class_="prt_col2")
memo=soup.find_all("span", class_='expandable prt_col2')
print(transporter,disease,characteristics[6],memo)
def convert(html_file):
file2= open(html_file, 'r')
clean_file= open('text_format_SLC','w')
soup=BS(file2,'lxml')
clean_file.write(soup.get_text())
clean_file.close()
def main():
start_time=time.time()
os.chdir('/home/Programming/Fun stuff')
sys.stdout= open("output_SLC.txt","w")
SLC_list=get_list("SLC.txt")
get_transporter_webinfo(SLC_list) #already have the website content so little redundant
print("this took",time.time() - start_time, "seconds to run")
convert("output_SLC.txt")
sys.stdout.close()
if __name__ == "__main__":
main()

No offence intended, I didn't feel like reading such a large piece of code as what you put in your question.
I would say it can be simplified.
You can get the complete list of links to the SLCs in the line at SLCs =. The next line shows how many there are, and the line beyond that exhibits the href attribute that the last link contains, as an example.
In each SLC's page I look for the string 'Disease' and then, if it's there, I navigate to the link nearby. I find the OMIM in a similar way.
Notice that I process only the first SLC.
>>> import requests
>>> import bs4
>>> main_url = 'http://slc.bioparadigms.org/'
>>> main_page = requests.get(main_url).content
>>> main_soup = bs4.BeautifulSoup(main_page, 'lxml')
>>> stem_url = 'http://slc.bioparadigms.org/protein?GeneName=SLC1A1'
>>> SLCs = main_soup.select('td.slct.tbl_cell.tbl_col1 a')
>>> len(SLCs)
418
>>> SLCs[-1].attrs['href']
'protein?GeneName=SLC52A3'
>>> stem_url = 'http://slc.bioparadigms.org/'
>>> for SLC in SLCs:
... SLC_page = requests.get(stem_url+SLC.attrs['href'], 'lxml').content
... SLC_soup = bs4.BeautifulSoup(SLC_page, 'lxml')
... disease = SLC_soup.find_all(string='Disease: ')
... if disease:
... disease = disease[0]
... diseases = disease.findParent().findNextSibling().text.strip()
... else:
... diseases = 'No diseases'
... OMIM = SLC_soup.find_all(string='OMIM:')
... if OMIM:
... OMIM = OMIM[0]
... number = OMIM.findParent().findNextSibling().text.strip()
... else:
... OMIM = 'No OMIM'
... number = -1
... SLC.text, number, diseases
... break
...
('SLC1A1', '133550', "Huntington's disease, epilepsy, ischemia, Alzheimer's disease, Niemann-Pick disease, obsessive-compulsive disorder")

Related

Why substring cannot be found in the target string?

To understand the values of each variable, I improved a script for replacement from Udacity class. I convert the codes in a function into regular codes. However, my codes do not work while the codes in the function do. I appreciate it if anyone can explain it. Please pay more attention to function "tokenize".
Below codes are from Udacity class (CopyRight belongs to Udacity).
# download necessary NLTK data
import nltk
nltk.download(['punkt', 'wordnet'])
# import statements
import re
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
url_regex = 'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
def load_data():
df = pd.read_csv('corporate_messaging.csv', encoding='latin-1')
df = df[(df["category:confidence"] == 1) & (df['category'] != 'Exclude')]
X = df.text.values
y = df.category.values
return X, y
def tokenize(text):
detected_urls = re.findall(url_regex, text) # here, "detected_urls" is a list for sure
for url in detected_urls:
text = text.replace(url, "urlplaceholder") # I do not understand why it can work while does not work in my code if I do not convert it to string
tokens = word_tokenize(text)
lemmatizer = WordNetLemmatizer()
clean_tokens = []
for tok in tokens:
clean_tok = lemmatizer.lemmatize(tok).lower().strip()
clean_tokens.append(clean_tok)
return clean_tokens
X, y = load_data()
for message in X[:5]:
tokens = tokenize(message)
print(message)
print(tokens, '\n')
Below is its output:
I want to understand the variables' values in function "tokenize()". Following are my codes.
X, y = load_data()
detected_urls = []
for message in X[:5]:
detected_url = re.findall(url_regex, message)
detected_urls.append(detected_url)
print("detected_urs: ",detected_urls) #output a list without problems
# replace each url in text string with placeholder
i = 0
for url in detected_urls:
text = X[i].strip()
i += 1
print("LN1.url= ",url,"\ttext= ",text,"\n type(text)=",type(text))
url = str(url).strip() #if I do not convert it to string, it is a list. It does not work in text.replace() below, but works in above function.
if url in text:
print("yes")
else:
print("no") #always show no
text = text.replace(url, "urlplaceholder")
print("\nLN2.url=",url,"\ttext= ",text,"\n type(text)=",type(text),"\n===============\n\n")
The output is shown below.
The outputs for "LN1" and "LN2" are same. The "if" condition always output "no". I do not understand why it happens.
Any further help and advice would be highly appreciated.

Stripping list into 3 columns Problem. BeautifulSoup, Requests, Pandas, Itertools

Python novice back again! I got a lot of great help on this but am now stumped. The code below scrapes soccer match data and scores from Lehigh University soccer website. I am trying to split the scores format ['T', '0-0(2 OT)'] into 3 columns 'T', '0-0, '2 OT but I am running into problems. The issue lies in this part of the code:
=> for result in soup.findAll("div", {'class': 'sidearm-schedule-game-result'}):
=> result = result.get_text(strip=True).split(',')
I tried .split(',') but that did not work as it created ['T', '0-0(2 OT)']. Is there a way to split than into 3 columns 1) T, 2) 0-0 and 3) 2 OT???
All help much appreciated.
Thanks
import requests
from bs4 import BeautifulSoup
import pandas as pd
from itertools import zip_longest
d = []
n = []
res = []
op = []
yr = []
with requests.Session() as req:
for year in range(2003, 2020):
print(f"Extracting Year# {year}")
r = req.get(
f"https://lehighsports.com/sports/mens-soccer/schedule/{year}")
if r.status_code == 200:
soup = BeautifulSoup(r.text, 'html.parser')
for date in soup.findAll("div", {'class': 'sidearm-schedule-game-opponent-date flex-item-1'}):
d.append(date.get_text(strip=True, separator=" "))
for name in soup.findAll("div", {'class': 'sidearm-schedule-game-opponent-name'}):
n.append(name.get_text(strip=True))
for result in soup.findAll("div", {'class': 'sidearm-schedule-game-result'}):
result = result.get_text(strip=True)
#result = result.get_text(strip=True).split(',')
res.append(result)
if len(d) != len(res):
res.append("None")
for opp in soup.findAll("div", {'class': 'sidearm-schedule-game-opponent-text'}):
op.append(opp.get_text(strip=True, separator=' '))
yr.append(year)
data = []
for items in zip_longest(yr, d, n, op, res):
data.append(items)
df = pd.DataFrame(data, columns=['Year', 'Date', 'Name', 'opponent', 'Result']).to_excel('lehigh.xlsx', index=False)
I'm going to focus here only on splitting the res list into three columns, and you can incorporate it into your code as you see fit. So let's say you have this:
res1='T, 0-0(2 OT)'
res2='W,2-1OT'
res3='T,2-2Game called '
res4='W,2-0'
scores = [res1,res2,res3,res4]
We split them like this:
print("result","score","extra")
for score in scores:
n_str = score.split(',')
target = n_str[1].strip()
print(n_str[0].strip(),' ',target[:3],' ',target[3:])
Output:
result score extra
T 0-0 (2 OT)
W 2-1 OT
T 2-2 Game called
W 2-0
Note that this assumes that no game ends with double digits scores (say, 11-2, or whatever); so this should work for your typical soccer game, but will fail with basketball :D

Print text output into single list with Selenium web scraper for loop

I am running the following program which scrapes this website. The program uses a list which fills 3 search fields on the website then prints the text of the selected page. It does this over and over again until the list_2.txt comes to an end.
Here is the code:
list_2 = [['7711564', '14', '93'], ['0511442', '7', '27']]
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from time import sleep
driver = webdriver.Firefox()
driver.get("https://www.airdrie.ca/index.cfm?serviceID=284")
for query in list_2:
driver.find_element_by_name("whichPlan").send_keys(query[0])
driver.find_element_by_name("whichBlock").send_keys(query[1])
driver.find_element_by_name("whichLot").send_keys(query[2])
driver.find_element_by_name("legalSubmit").click()
sleep(3)
text_element = driver.find_elements_by_xpath("//div[#class='datagrid']")
text_element2 =
driver.find_elements_by_xpath("//table[#class='quickkey_tbl ']")
txt = [x.text for x in text_element]
print(txt, '\n')
txt2 = [x.text for x in text_element2]
print(txt2, '\n')
driver.back()
driver.refresh()
sleep(2)
I want to be able to print ALL the results from each loop/iteration into a single list. I tried using += but this ends up printing double outputs for the first item on my list only.
You can try something like this:
results_list = []
for query in list_2:
...
txt = [x.text for x in text_element]
print(txt, '\n')
txt2 = [x.text for x in text_element2]
print(txt2, '\n')
results_list.append(txt + txt2)
...
Hope it helps you!

Socket Error Exceptions in Python when Scraping

I am trying to learn scraping,
I use exceptions lower down in the code to pass through errors because they dont affect the writing of data to csv
I keep getting a "socket.gaierror" but in the handling of that there is a "urllib.error.URLError" in the handling of that I get "NameError: name 'socket' is not defined" which seems circuitous
I kind of understand that using these exceptions may not be the best way to run the code but I cant seem to get past these errors and I dont know a way around or how to fix the errors.
If you have any suggestions outside of fixing the error exceptions that would be greatly appreciated as well.
import csv
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup
base_url = 'http://www.fangraphs.com/' # used in line 27 for concatenation
years = ['2017','2016','2015'] # for enough data to run tests
#Getting Links for letters
player_urls = []
data = urlopen('http://www.fangraphs.com/players.aspx')
soup = BeautifulSoup(data, "html.parser")
for link in soup.find_all('a'):
if link.has_attr('href'):
player_urls.append(base_url + link['href'])
#Getting Alphabet Links
test_for_playerlinks = 'players.aspx?letter='
player_alpha_links = []
for i in player_urls:
if test_for_playerlinks in i:
player_alpha_links.append(i)
# Getting Player Links
ind_player_urls = []
for l in player_alpha_links:
data = urlopen(l)
soup = BeautifulSoup(data, "html.parser")
for link in soup.find_all('a'):
if link.has_attr('href'):
ind_player_urls.append(link['href'])
#Player Links
jan = 'statss.aspx?playerid'
players = []
for j in ind_player_urls:
if jan in j:
players.append(j)
# Building Pitcher List
pitcher = 'position=P'
pitchers = []
pos_players = []
for i in players:
if pitcher in i:
pitchers.append(i)
else:
pos_players.append(i)
# Individual Links to Different Tables Sorted by Base URL differences
splits = 'http://www.fangraphs.com/statsplits.aspx?'
game_logs = 'http://www.fangraphs.com/statsd.aspx?'
split_pp = []
gamel = []
years = ['2017','2016','2015']
for i in pos_players:
for year in years:
split_pp.append(splits + i[12:]+'&season='+ year)
gamel.append(game_logs+ i[12:] + '&type=&gds=&gde=&season=' + year)
split_pitcher = []
gl_pitcher = []
for i in pitchers:
for year in years:
split_pitcher.append(splits + i[12:]+'&season=' + year)
gl_pitcher.append(game_logs + i[12:] + '&type=&gds=&gde=&season=' + year)
# Splits for Pitcher Data
row_sp = []
rows_sp = []
try:
for i in split_pitcher:
sauce = urlopen(i)
soup = BeautifulSoup(sauce, "html.parser")
table1 = soup.find_all('strong', {"style":"font-size:15pt;"})
row_sp = []
for name in table1:
nam = name.get_text()
row_sp.append(nam)
table = soup.find_all('table', {"class":"rgMasterTable"})
for h in table:
he = h.find_all('tr')
for i in he:
td = i.find_all('td')
for j in td:
row_sp.append(j.get_text())
rows_sp.append(row_sp)
except(RuntimeError, TypeError, NameError, URLError, socket.gaierror):
pass
try:
with open('SplitsPitchingData2.csv', 'w') as fp:
writer = csv.writer(fp)
writer.writerows(rows_sp)
except(RuntimeError, TypeError, NameError):
pass
I'm guessing your main problem was that you - without any sleep what so ever - queried the site for a huge amount of invalid urls (you create 3 urls for the years 2015-2017 for 22880 pitchers in total, but most of these do not fall within that scope so you have tens of thousands of queries that return errors).
I'm surprised your IP wasn't banned by site admin. That said: It would be better to do some filtering so you avoid all those error queries...
The filter I applied is not perfect. It checks if the years in the list either appears in the start or end the years given on the site (e.g. '2004 - 2015'). This also creates error links but no way near the amount the original script did.
In code it could look like this:
from urllib.request import urlopen
from bs4 import BeautifulSoup
from time import sleep
import csv
base_url = 'http://www.fangraphs.com/'
years = ['2017','2016','2015']
# Getting Links for letters
letter_links = []
data = urlopen('http://www.fangraphs.com/players.aspx')
soup = BeautifulSoup(data, "html.parser")
for link in soup.find_all('a'):
try:
link = base_url + link['href']
if 'players.aspx?letter=' in link:
letter_links.append(link)
except:
pass
print("[*] Retrieved {} links. Now fetching content for each...".format(len(letter_links)))
# the data resides in two different base_urls:
splits_url = 'http://www.fangraphs.com/statsplits.aspx?'
game_logs_url = 'http://www.fangraphs.com/statsd.aspx?'
# we need (for some reason) players in two lists - pitchers_split and pitchers_game_log - and the rest of the players in two different, pos_players_split and pis_players_game_log
pos_players_split = []
pos_players_game_log = []
pitchers_split = []
pitchers_game_log = []
# and if we wanted to do something with the data from the letter_queries, lets put that in a list for safe keeping:
ind_player_urls = []
current_letter_count = 0
for link in letter_links:
current_letter_count +=1
data = urlopen(link)
soup = BeautifulSoup(data, "html.parser")
trs = soup.find('div', class_='search').find_all('tr')
for player in trs:
player_data = [tr.text for tr in player.find_all('td')]
# To prevent tons of queries to fangraph with invalid years - check if elements from years list exist with the player stat:
if any(year in player_data[1] for year in years if player_data[1].startswith(year) or player_data[1].endswith(year)):
href = player.a['href']
player_data.append(base_url + href)
# player_data now looks like this:
# ['David Aardsma', '2004 - 2015', 'P', 'http://www.fangraphs.com/statss.aspx?playerid=1902&position=P']
ind_player_urls.append(player_data)
# build the links for game_log and split
for year in years:
split = '{}{}&season={}'.format(splits_url,href[12:],year)
game_log = '{}{}&type=&gds=&gde=&season={}'.format(game_logs_url, href[12:], year)
# checking if the player is pitcher or not. We're append both link and name (player_data[0]), so we don't need to extract name later on
if 'P' in player_data[2]:
pitchers_split.append([player_data[0],split])
pitchers_game_log.append([player_data[0],game_log])
else:
pos_players_split.append([player_data[0],split])
pos_players_game_log.append([player_data[0],game_log])
print("[*] Done extracting data for players for letter {} out of {}".format(current_letter_count, len(letter_links)))
sleep(2)
# CONSIDER INSERTING CSV-PART HERE....
# Extracting and writing pitcher data to file
with open('SplitsPitchingData2.csv', 'a') as fp:
writer = csv.writer(fp)
for i in pitchers_split:
try:
row_sp = []
rows_sp = []
# all elements in the pitchers_split are lists. Player name is i[1]
data = urlopen(i[1])
soup = BeautifulSoup(data, "html.parser")
# append name to row_sp from pitchers_split
row_sp.append(i[0])
# the page has 3 tables with the class rgMasterTable, the first i Standard, the second Advanced, the 3rd Batted Ball
# we're only grabbing standard
table_standard = soup.find_all('table', {"class":"rgMasterTable"})[0]
trs = table_standard.find_all('tr')
for tr in trs:
td = tr.find_all('td')
for content in td:
row_sp.append(content.get_text())
rows_sp.append(row_sp)
writer.writerows(rows_sp)
sleep(2)
except Exception as e:
print(e)
pass
Since I'm not sure precisely how you wanted the data formatted on output you need some work on that.
If you want to avoid waiting for all letter_links to be extracted before you retrieve the actual pitcher stats (and fine tune your output) you can move the csv writer part up, so it runs as a part of the letter loop. If you do this don't forget to empty the pitchers_split list before grabbing another letter_link...

Python3 compare the output print

I'm trying to crawl a web page and print the text. But, I find some errors in print function when I compare the result of print function.
Compare the output of two print functions, some content is omitted when add a line number parameter in first print function.
My platform:
win7 32bit, Python3.5, MyCharm5.0
The code is:
import re
import requests
from bs4 import BeautifulSoup
baseBoardLink = 'https://bbs.ustc.edu.cn/cgi/'
def assayArticle(link):
fullLink = baseBoardLink + link
response = requests.get(fullLink)
html = response.text
bsobj = BeautifulSoup(html, 'html.parser')
content = bsobj.find('div', {'class': 'post_text'}).get_text()
print('bytes show:', bytes(content, encoding='utf-8'))
lines = re.findall(re.compile('\n[^\n]*'), content)
for line in range(len(lines)):
print('%d' % line, lines[line], end='')
print('*' * 50)
for line in range(len(lines)):
print(lines[line], end='')
startLink = '/bbscon?bn=CS&fn=M5787A417&num=8086'
assayArticle(startLink)
Here is the result:
The output of the first print:
The output of the second print:
I have this error for a few days now, and can't find a solution for it.

Resources