Navigating the html tree with BeautifulSoup and/or Selenium - python-3.x

I've just started using BeautifulSoup and came across an obstacle at the very beginning. I looked up similar posts but didn't find a solution to my specific problem, or there is something fundamental I’m not understanding. My goal is to extract Japanese words with their English translations and examples from this page.
https://iknow.jp/courses/566921
and save them in a dataFrame or a csv file.
I am able to see the parsed output and the content of some tags, but whenever I try requesting something with a class I'm interested in, I get no results. First I’d like to get a list of the Japanese words, and I thought I should be able to do it with:
import urllib.request
from bs4 import BeautifulSoup
url = ["https://iknow.jp/courses/566921"]
data = []
for pg in url:
    r = urllib.request.urlopen(pg)
    soup = BeautifulSoup(r, "html.parser")
    soup.find_all("a", {"class": "cue"})
But I get nothing, and the same happens when I search for the response field:
responseList = soup.findAll('p', attrs={ "class" : "response"})
for word in responseList:
    print(word)
I tried moving down the tree by finding children but couldn’t get to the text I want. I will be grateful for your help. Here are the fields I'm trying to extract:
After great help from jxpython, I've now stumbled upon a new challenge (perhaps this should be a new thread, but it's quite related, so maybe it's OK here). My goal is to create a dataframe or a csv file, each row containing a Japanese word, its translation, and examples with transliterations. With the lists created using:
driver.find_elements_by_class_name()
driver.find_elements_by_xpath()
I get lists with different numbers of elements, so it's not possible to easily create a dataframe.
# len(cues) 100
# len(responses) 100
# len(transliterations) 279 (a strange number, because some words don't have transliterations)
# len(texts) 200
# len(translations) 200
The transliterations list contains a mix of transliterations for single words and for sentences. To get the content for the first row of my dataframe, I think I would need to loop through the
<li class="item">
content (xpath? /html/body/div[2]/div/div/section/div/section/div/div/ul/li[1]) and, for each one, extract the word with its translation, sentences and transliterations. I'm not sure this would be the best approach, though.
As an example, the information I would like to have in the first row of my dataframe (from the box highlighted in screenshot) is:
行く, いく, go, 日曜日は図書館に行きます。, にちようび は としょかん に いきます。, I go to the library on Sundays.,私は夏休みにプールに行った。, わたし は なつやすみ に プール に いった。, I went to the pool during summer vacation.

The tags you are trying to scrape are not in the page source, probably because the page is rendered with JavaScript. Open this URL to see for yourself:
view-source:https://iknow.jp/courses/566921
The Python module Selenium solves this problem. If you would like I could write some code for you to start on.
Here is some code to start on:
from selenium import webdriver
url = 'https://iknow.jp/courses/566921'
driver = webdriver.Chrome()
driver.get(url)
# Wait up to 2 seconds for elements to appear when they are looked up
driver.implicitly_wait(2)
# Note: newer Selenium releases replace the find_elements_by_* helpers
# with driver.find_elements(By.CLASS_NAME, 'cue') and similar calls
cues = driver.find_elements_by_class_name('cue')
cues = [cue.text for cue in cues]
responses = driver.find_elements_by_class_name('response')
responses = [response.text for response in responses]
# The first <p> inside each sentence-text element holds the example sentence
texts = driver.find_elements_by_xpath('//*[@class="sentence-text"]/p[1]')
texts = [text.text for text in texts]
transliterations = driver.find_elements_by_class_name('transliteration')
transliterations = [transliteration.text for transliteration in transliterations]
translations = driver.find_elements_by_class_name('translation')
translations = [translation.text for translation in translations]
driver.close()
Note: You first need to install a webdriver. I chose Chrome.
Here is a link: https://chromedriver.storage.googleapis.com/index.html?path=2.41/. Also add it to your PATH!
If you have any other questions let me know!
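On the follow-up about lists of unequal length: one possible approach is to search inside each <li class="item"> separately, so that a missing transliteration in one entry cannot shift the others. The sketch below parses HTML with BeautifulSoup (with Selenium, driver.page_source could be passed in). Only the class names come from this thread; the sample markup, its nesting, and the helper names are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: class names are from the thread, the nesting is assumed.
SAMPLE_HTML = """
<ul>
  <li class="item">
    <a class="cue">行く</a>
    <span class="transliteration">いく</span>
    <p class="response">go</p>
    <div class="sentence-text">
      <p>日曜日は図書館に行きます。</p>
      <p class="transliteration">にちようび は としょかん に いきます。</p>
    </div>
    <p class="translation">I go to the library on Sundays.</p>
  </li>
  <li class="item">
    <a class="cue">見る</a>
    <p class="response">see, look at</p>
  </li>
</ul>
"""


def first_text(item, css_class):
    """Text of the first element with this class inside item, or ''."""
    node = item.find(class_=css_class)
    return node.get_text(strip=True) if node else ""


def all_texts(item, css_class):
    """Texts of every element with this class inside item (may be empty)."""
    return [node.get_text(strip=True) for node in item.find_all(class_=css_class)]


def rows_from_html(html):
    """Build one dict per <li class="item">, searching only inside that item."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("li", class_="item"):
        # Mirrors the answer's XPath: the first <p> of each sentence-text box.
        first_paragraphs = [box.find("p") for box in item.find_all(class_="sentence-text")]
        rows.append({
            "cue": first_text(item, "cue"),
            "response": first_text(item, "response"),
            "transliterations": all_texts(item, "transliteration"),
            "texts": [p.get_text(strip=True) for p in first_paragraphs if p],
            "translations": all_texts(item, "translation"),
        })
    return rows


rows = rows_from_html(SAMPLE_HTML)
```

Each dict then corresponds to one row, and a list of such dicts can be handed to pandas.DataFrame or flattened for the csv module.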

Related

I want to get a element by Text in beautiful soup

elem = browser.find_element_by_partial_link_text("WEBSITE")
The code above finds the element whose link text contains WEBSITE, but I don't want to use Selenium here; I want to find the element by its text using bs4. I tried the following code, but got no results:
elem = soup.find(text=re.compile('WEBSITE'))
According to the documentation, you can do something like the following.
ele = soup.find('tag', string='text_for_search')
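As a small illustration of the difference between an exact and a partial text match, here is a sketch on a made-up snippet (the HTML and the URL in it are placeholders). A plain string has to equal the tag's whole text, while a compiled regex matches anywhere inside it.

```python
import re
from bs4 import BeautifulSoup

# Placeholder HTML, invented for this example.
html = '<p><a href="https://example.com">Visit our WEBSITE today</a></p>'
soup = BeautifulSoup(html, "html.parser")

# A plain string must equal the tag's entire text.
exact = soup.find("a", string="Visit our WEBSITE today")
too_short = soup.find("a", string="WEBSITE")

# A compiled regex is searched for within the text, so a fragment is enough.
partial = soup.find("a", string=re.compile("WEBSITE"))
```

Here too_short is None, while exact and partial both refer to the <a> tag, which is roughly what find_element_by_partial_link_text does in Selenium.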

Python Selenium - how to get text in div after span

I have a list of urls that go to different anime on myanimelist.net. For each anime, I want to get the text for the genres for each anime that can be found on the website and add it to a list of strings (one element for each anime, not 5 separate elements if an anime has 5 genres listed)
Here is the HTML code for an anime on myanimelist.net. I want to essentially get the genre text at top of the image and put in a list so in the image shown, its entry in the list would be ["Mystery, Police, Psychological, Supernatural, Thriller, Shounen"] and for each url in my list, another string containing the genres for that anime is appended to the list.
This is the main part of my code
driver = webdriver.Firefox()
flist = [url1, url2, url3] #List of urls
genres = []
for item in flist:
    driver.get(item) #Opens each url
    elem = driver.find_element_by_xpath("/html/body/div[1]/div[3]/div[3]/div[2]/table/tbody/tr/td[1]/div/div[16]").text
    genres.append(elem)
The code works for some anime and not for others. Sometimes the position is different for some anime and instead of getting the info about the genres, I get info about the studio that produced the anime, etc.
What I want is to specify "Genres:" in the span class and get the genres that are listed below it, as shown in my image above. I can't seem to find anything similar to what I'm looking for (though that may be because I'm not phrasing my searches well and lack experience with XPath).
driver.get('https://myanimelist.net/anime/35760/Shingeki_no_Kyojin_Season_3')
# Genre links are the <a> children of the div whose text contains 'Genres'
links = driver.find_elements_by_xpath("//div[contains(string(), 'Genres')]/a[contains(@href,'genre')]")
genres = []
for link in links:
    title = link.get_attribute("title")
    genres.append(title)
print(genres)
genresString = ",".join(genres)
print(genresString)
Sample Output:
['Action', 'Military', 'Mystery', 'Super Power', 'Drama', 'Fantasy', 'Shounen']
Action,Military,Mystery,Super Power,Drama,Fantasy,Shounen
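To get one string per anime, as the question asks, the lookup can be wrapped in a function and applied to each URL. This is only a sketch: the function names are made up, and it uses the find_elements(by, value) call from newer Selenium releases, passing the literal "xpath" (the value of By.XPATH) so the snippet does not need to import Selenium itself.

```python
# XPath taken from the answer above.
GENRE_LINKS_XPATH = "//div[contains(string(), 'Genres')]/a[contains(@href,'genre')]"


def genres_for(driver, url):
    """Open one page and return its genre titles as a single string."""
    driver.get(url)
    # "xpath" is the string behind selenium's By.XPATH constant.
    links = driver.find_elements("xpath", GENRE_LINKS_XPATH)
    return ", ".join(link.get_attribute("title") for link in links)


def genres_for_all(driver, urls):
    """Return a list with one genre string per URL, in the same order."""
    return [genres_for(driver, url) for url in urls]
```

With a real driver this would be called as genres_for_all(webdriver.Firefox(), flist), giving a list with one entry per URL in flist.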

How to extract data from multiple dt and dd tags in tabled form (within a looped statement) using python v3 beautiful soup v4?

Source:
I’ve only chosen one year for simplicity but my intention is for all years (n=117).
https://pubs.er.usgs.gov/browse/Report/USGS%20Numbered%20Series/Open-File%20Report/
(2018 only)
https://pubs.er.usgs.gov/browse/Report/USGS%20Numbered%20Series/Open-File%20Report/2018/
Resources:
I’ve found 2 blogs and 2 Stack Overflow threads that have steered my attempts to replicate their work, but my lack of experience and the uniqueness of the website and task have proven difficult. I’ve tried next_siblings a little, but without success.
Blog #1: Extract tabled data as a table:
https://journalistsresource.org/tip-sheets/research/python-scrape-website-data-criminal-justice
https://gist.github.com/phillipsm/404780e419c49a5b62a8
Blog #2: Extract data from tags into a table
https://www.dataquest.io/blog/web-scraping-beautifulsoup/
Stack overflow forum #1:
Using BeautifulSoup to extract specific dl and dd list elements
Stack overflow forum #2:
Use BeautifulSoup to get a value after a specific tag
Problems encountered:
1. Each year’s publications have different “Additional Publication Details”. To help with this, I ran the code I have and compiled (not in tabled form) the unique dt tag text headers to make sure all are captured for 2018 (pasted below). But doing this for all years would take time… right? I'll add them in a comment if necessary.
2. For statements… I find I keep having to nest “for” statements to get to the final webpage where the publication details live (a minimum of 2 links). This seems to restrict what I can return and how, and unless I limit the number of results ([:1]), my code can very easily fail (whether because of the source server or something else).
3. I have to first extract dt element text, then extract dd element text.
Code:
(The commented-out dt element grab and print statements are only for my own record keeping of what’s being done. Again, I compiled the unique dt element text headers for reference… see the comment above. Apologies upfront if my code is ‘dizzying’…)
import requests
from bs4 import BeautifulSoup
import csv
import re
import time
url = 'https://pubs.er.usgs.gov/browse/Report/USGS%20Numbered%20Series/Open-File%20Report'
url2 = 'https://pubs.er.usgs.gov'
response = requests.get(url)
data = response.text
pubti_links = []
soup = BeautifulSoup(data, "html.parser")
type(soup)
year_containers = soup.findAll('li',{'class':'pubs-browse-list-theme'})
for year in year_containers[:1]:
    for a in soup.findAll('a'):
        if '/browse/Report/USGS%20Numbered%20Series/Open-File%20Report/2018' in a['href']:
            link_containers = a.get('href')
            #print (link_containers)
            pubti_links = url2 + link_containers
            #print (pubti_links)
            for pubti_link in pubti_links[:1]:
                response2 = requests.get(pubti_links)
                soup2 = BeautifulSoup(response2.text, "html.parser")
                time.sleep(2)
                for elm in soup2.find_all('li',{'class':'pubs-browse-list-theme'}):
                    for a_elm in elm.findAll('a'):
                        #print(a.get('href'))
                        pub_containers = a_elm.get('href')
                        pubdetails_links = url2 + pub_containers
                        response3 = requests.get(pubdetails_links)
                        soup3 = BeautifulSoup(response3.text, "html.parser")
                        pubdetail_containers = soup3.findAll('dd',{'class': ["" "","dark"]})
                        dd_data = soup3.findAll('dd',{'class':["" "","dark"]})
                        #dt_data = soup3.findAll('dt',{'class':["" "","dark"]})
                        for dd_item in dd_data:
                            print(dd_item.string)
                        #for dt_item in dt_data:
                            #print (dt_item.string)
Desired result (the goal is to create a table of all USGS publication for each year):
Output Table example
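One way to handle problems 1 and 3 together might be to pair each dt with the dd that follows it, giving one dict per publication, and to let csv.DictWriter fill in blanks for headers a given publication lacks. The sketch below assumes each dt is followed by a sibling dd; the two HTML fragments and their labels are invented placeholders, not the real USGS markup.

```python
import csv
import io
from bs4 import BeautifulSoup

# Placeholder fragments standing in for two publication detail pages.
PAGE_A = "<dl><dt>Publication type</dt><dd>Report</dd><dt>Year Published</dt><dd>2018</dd></dl>"
PAGE_B = "<dl><dt>Publication type</dt><dd>Report</dd><dt>Language</dt><dd class='dark'>English</dd></dl>"


def details_from_html(html):
    """Map each dt label to the text of the next dd sibling."""
    soup = BeautifulSoup(html, "html.parser")
    details = {}
    for dt in soup.find_all("dt"):
        dd = dt.find_next_sibling("dd")
        if dd is not None:
            details[dt.get_text(strip=True)] = dd.get_text(" ", strip=True)
    return details


def write_table(records, out):
    """Write records as CSV; columns are the union of all labels seen."""
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    # restval fills cells for labels a particular record does not have.
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(records)
    return fieldnames


records = [details_from_html(PAGE_A), details_from_html(PAGE_B)]
buffer = io.StringIO()
columns = write_table(records, buffer)
```

Because the column list is built from whatever labels actually occur, the set of dt headers would not need to be compiled by hand for each year. In the nested loops above, details_from_html(response3.text) could replace the separate dd and dt passes.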

Can't pull out the information from object using Beautiful Soup 4

I am working (for the first time) with scraping a website. I am trying to pull the latitude (in decimal degrees) from a website. I have managed to pull out the correct parent node that contains the information, but I am stuck on how to pull out the actual number from this. All of the searching I have done has only told me how to pull it out if I know the string (which I don't) or if the string is in a child node, which it isn't. Any help would be great.
Here is my code:
a_string = soup.find(string="Latitude in decimal degrees")
a_string.find_parents("p")
Out[46]: [<p><b>Latitude in decimal degrees</b><font size="-2">
(<u>see definition</u>)
</font><b>:</b> 35.7584895</p>]
test = a_string.find_parents("p")
print(test)
[<p><b>Latitude in decimal degrees</b><font size="-2"> (<u>see definition</u>)</font>
<b>:</b> 35.7584895</p>]
I need to pull out the 35.7584895 and save it as an object so I can append it to a dataset.
I am using Beautiful Soup 4 and python 3
The first thing to notice is that, since you have used the find_parents method (plural), test is a list. You need only the first item of it.
I will simulate your situation by doing this.
>>> import bs4
>>> HTML = '<p><b>Latitude in decimal degrees</b><font size="-2"> (<u>see definition</u>)</font><b>:</b> 35.7584895</p>'
>>> item_soup = bs4.BeautifulSoup(HTML, 'lxml')
The simplest way of recovering the textual content of this is to do this:
>>> item_soup.text
'Latitude in decimal degrees (see definition): 35.7584895'
However, you want the number. You can get this in various ways, two of which come to mind. I assign the result of the previous statement to text so that I can manipulate it (the name str would shadow the built-in type).
>>> text = item_soup.text
One way is to search for the colon.
>>> text[1+text.rfind(':'):].strip()
'35.7584895'
The other is to use a regex.
>>> import re
>>> re.search(r'(\d+\.\d+)', text).group(1)
'35.7584895'
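A third option, sketched here under the assumption that the number is always the last piece of text in the paragraph, is to take the final string of the parent p and convert it to a float. It uses find_parent (singular), which returns the tag itself instead of a list.

```python
from bs4 import BeautifulSoup

HTML = ('<p><b>Latitude in decimal degrees</b><font size="-2"> '
        '(<u>see definition</u>)</font><b>:</b> 35.7584895</p>')
soup = BeautifulSoup(HTML, "html.parser")

# Locate the label, then step up to the single enclosing <p>.
label = soup.find(string="Latitude in decimal degrees")
paragraph = label.find_parent("p")

# stripped_strings yields each text fragment without surrounding whitespace;
# the last fragment is the value that follows the colon.
latitude = float(list(paragraph.stripped_strings)[-1])
```

Converting with float also keeps a leading minus sign, which the \d+\.\d+ pattern would drop for negative values.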

Incoherent results from feeds search through the API

I want to visualize on a world map all feeds from the user 'airqualityegg'. To do this I wrote the following Python script (if you try it yourself, check the indentation in your text editor):
import json
import urllib
import csv
list=[]
for page in range(7):
    url = 'https://api.xively.com/v2/feeds?user=airqualityegg&per_page=100&page='+str(page)
    rawData=urllib.urlopen(url)
    #Loads the data in json format
    dataJson = json.load(rawData)
    print dataJson['totalResults']
    print dataJson['itemsPerPage']
    for entry in dataJson['results']:
        try:
            list2=[]
            list2.append(entry['id'])
            list2.append(entry['creator'])
            list2.append(entry['status'])
            list2.append(entry['location']['lat'])
            list2.append(entry['location']['lon'])
            list2.append(entry['created'])
            list.append(list2)
        except:
            print 'failed to scrape a row'
def escribir():
    abrir = open('all_users2_andy.csv', 'w')
    wr = csv.writer(abrir, quoting=csv.QUOTE_ALL)
    headers = ['id','creator', 'status','lat', 'lon', 'created']
    wr.writerow (headers)
    for item in list:
        row=[item[0], item[1], item[2], item[3], item[4], item[5]]
        wr.writerow(row)
    abrir.close()
escribir()
I request 7 pages because this user has posted 684 feeds in total (as you can see by opening 'https://api.xively.com/v2/feeds?user=airqualityegg' directly in the browser).
The CSV file produced by this script contains duplicated rows, which might be explained by the order of results varying from one call to the next. The same row can therefore appear in the results of different calls, and I end up with fewer unique results than I should.
Do you know why the results on different pages might not be unique?
Thanks,
María
You can try passing
order=created_at (see docs).
The problem arises because the default is order=updated_at, so the ordering can shift between requests as feeds are updated, and the same feed can then show up on more than one page.
You should also consider using the official Python library.
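As a rough sketch of that suggestion, written for Python 3 unlike the script in the question, the query string can be built with urlencode so that every parameter is separated by & and order=created_at is always included. Dropping repeated ids afterwards is an extra safeguard, not something the API requires. The function names are made up, and the snippet only builds URLs and filters entries; it does not contact the API.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.xively.com/v2/feeds"


def page_url(page, per_page=100, user="airqualityegg"):
    """Build the feed-search URL for one page with a stable sort order."""
    params = {
        "user": user,
        "per_page": per_page,
        "page": page,
        "order": "created_at",  # the ordering suggested in the answer
    }
    return BASE_URL + "?" + urlencode(params)


def unique_by_id(entries):
    """Keep the first occurrence of each feed id, preserving order."""
    seen = set()
    unique = []
    for entry in entries:
        if entry["id"] not in seen:
            seen.add(entry["id"])
            unique.append(entry)
    return unique
```

The results from all pages could be concatenated and passed through unique_by_id before the rows are written to the CSV file.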
