Python Selenium - how to get text in div after span - python-3.x

I have a list of urls that go to different anime on myanimelist.net. For each anime, I want to get the text for the genres for each anime that can be found on the website and add it to a list of strings (one element for each anime, not 5 separate elements if an anime has 5 genres listed)
Here is the HTML code for an anime on myanimelist.net. I want to essentially get the genre text at top of the image and put in a list so in the image shown, its entry in the list would be ["Mystery, Police, Psychological, Supernatural, Thriller, Shounen"] and for each url in my list, another string containing the genres for that anime is appended to the list.
This is the main part of my code
driver = webdriver.Firefox()
flist = [url1, url2, url3] #List of urls
genres = []
for item in flist:
driver.get(item) #Opens each url
elem = driver.find_element_by_xpath("/html/body/div[1]/div[3]/div[3]/div[2]/table/tbody/tr/td[1]/div/div[16]").text
genres.append(elem)
The code works for some anime and not for others. Sometimes the position is different for some anime and instead of getting the info about the genres, I get info about the studio that produced the anime, etc.
What I want is to specify "Genres:" in the span class and get the genres that are listed below it as shown in my image above. I can't seem to find anything similar to what I'm looking for (though I might just not be phrasing my questions right as well as a lack of experience using xpaths)

driver.get('https://myanimelist.net/anime/35760/Shingeki_no_Kyojin_Season_3')
links = driver.find_elements_by_xpath("//div[contains(string(), 'Genres')]/a[contains(#href,'genre')]")
for link in links:
title= elem.get_attribute("title")
genres.append(title)
print(genres)
genresString = ",".join(genres)
print(genresString)
Sample Output:
['Action', 'Military', 'Mystery', 'Super Power', 'Drama', 'Fantasy', 'Shounen']
Action,Military,Mystery,Super Power,Drama,Fantasy,Shounen

Related

how to extract particular lines from the list?

I have a list and wanted to extract a particular line from the list. Below is my list
I wanted to extract 'src link' from the above list
example:
(src="https://r-cf.bstatic.com/xdata/images/hotel/square600/244245064.webp?k=8699eb2006da453ae8fe257eee2dcc242e70667ef29845ed85f70dbb9f61726a&o="). My final aim is to extract only the link. I have 20 records in the list. Hence, the need to extract 20 links from the same
My code (I stored the list in 'aas')
links = []
for i in aas:
link = re.search('CONCLUSION: (.*?)([A-Z]{2,})', i).group(1)
links.append(link)
````
I am getting an error: "expected string or bytes-like object"
Any suggestions?
As per the Beautiful Soup documentation, you can access a tag’s attributes by treating the tag like a dictionary, like so:
for img in img_list:
print(img["src"])

Problem with loop inside a try-except-else

I have a list of authors and books and I'm using an API to retrieve the descriptions and ratings of each of them. I'm not able to iterate the list and store the descriptions in a new list.
The Goodreads API gives me a chance to look for books' information, sending the title and author for the accuracy, or only the title. What I want is:
Loop the list of titles and authors, get the 1st title and 1st author and try to retrieve the description and save it in a third list.
If not found, try to retrieve the description using only the title and save it to the list.
If not found yet, add a standard error message.
I've tried the code below, but I'm not able to iterate the entire list and save the results.
#Running the API with author's name and book's title
book_url_w_author = 'https://www.goodreads.com/book/title.xml?author='+edit_authors[0]+'&key='+api_key+'&title='+edit_titles[0]
#Running the API with only book's title
book_url_n_author = 'https://www.goodreads.com/book/title.xml?'+'&key='+api_key+'&title='+edit_titles[0]
# parse book url with author
html_n_author = requests.get(book_url_w_author).text
soup_n_author = BeautifulSoup(html_n_author, "html.parser")
# parse book url without author
html_n_author = requests.get(book_url_n_author).text
soup_n_author = BeautifulSoup(html_n_author, "html.parser")
#Retrieving the books' descriptions
description = []
try:
#fetch description for url with author and title and add it to descriptions list
for desc_w_author in soup_w_author.book.description:
description.append(desc_w_author)
except:
#fetch description for url with only title and add it to descriptions list
for desc_n_author in soup_n_author.book.description:
description.append(desc_n_author)
else:
#return and inform that no description was found
description.append('Description not found')
Expected:
description = [description1, description2, description3, ....]

Python, Beautifulsoup - Extracting strings from tags based on items in list

I am trying to scrape the site https://www.livechart.me/winter-2019/tv to get the number of episodes that have currently aired for certain shows this season. I do this by extracting the "episode-countdown" tag data which gives something like "EP11:" then a timestamp after it, and then I slice that string to only give the number (in this case the "11") and then subtract by 1 to get how many episodes have currently aired (as the timestamp is for when EP11 will air).
I have a list of the different shows I am watching this season in order to filter what shows I extract the episode-countdown strings for instead of extracting the countdown for every show airing. The big problem I am having is that the "episode-countdown" strings are not in the same order as my list of shows I am watching. For example, if my list is [show1, show2, show3, show4], I might get the "episodes-countdown" string tag in the order of show3, show4, show1, show2 if they are listed in that order on the website.
My current code is as follows:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
def countdown():
html = Request('https://www.livechart.me/winter-2019/tv', headers={'User-Agent': 'Mozilla/5.0'})
page = urlopen(html)
soup = BeautifulSoup(page, 'html.parser')
shows = ['Jojo no Kimyou na Bouken: Ougon no Kaze', 'Dororo', 'Mob Psycho 100 II', 'Yakusoku no Neverland']
for tag in soup.find_all('article', attrs={'class': 'anime'}):
if any(x in tag['data-romaji'] for x in shows):
rlist = tag.find('div', attrs={'class': 'episode-countdown'}).text
r2 = rlist[:rlist.index(":")][2:]
print('{} has aired {} episodes so far'.format(tag['data-romaji'], int(r2)-1))
Each show listed on the website is inside of an "article" tag so for every show in the soup.find_all() statement, if the "data-romaji" (the name of the show listed on the website) matches a show in my "shows" list, then I extract the "episode-countdown" string and then slice the string to just the number as previously explained and then print to make sure I did it correctly.
If you go to the website, the order that the shows are listed are "Yakusoku no Neverland", "Mob Psycho", "Dororo", and "Jojo" which is the order that you get the episode-countdown strings in if you run the code. What I want to do is have it in order of my "shows" list so that I have a list of shows and a list of episodes aired that match each other. I want to add the episodes aired list as a column in a pandas dataframe I am currently building so having it not match the "shows" column would be a problem.
Is there a way for me to extract the "episode-countdown" string based on the order of my "shows" list instead of the order used on the website (if that makes sense)?
Is this what you're looking for?
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import pandas as pd
html = Request('https://www.livechart.me/winter-2019/tv', headers={'User-Agent': 'Mozilla/5.0'})
page = urlopen(html)
soup = BeautifulSoup(page, 'html.parser')
shows = ['Jojo no Kimyou na Bouken: Ougon no Kaze', 'Dororo', 'Mob Psycho 100 II', 'Yakusoku no Neverland']
master = []
for show in shows:
for tag in soup.find_all('article', attrs={'class': 'anime'}):
show_info = []
if show in tag['data-romaji']:
show_info.append(tag['data-romaji'])
rlist = tag.find('div', attrs={'class': 'episode-countdown'}).text
r2 = rlist[:rlist.index(":")][2:]
show_info.append(r2)
master.append(show_info)
df=pd.DataFrame(master,columns=['Show','Episodes'])
df
Output:
Show Episodes
0 Jojo no Kimyou na Bouken: Ougon no Kaze 23
1 Dororo 11
2 Mob Psycho 100 II 11
3 Yakusoku no Neverland 11

Navigating the html tree with BeautifulSoup and/or Selenium

I've just started using BeautifulSoup and came across an obstacle at the very beginning. I looked up similar posts but didn't find a solution to my specific problem, or there is something fundamental I’m not understanding. My goal is to extract Japanese words with their English translations and examples from this page.
https://iknow.jp/courses/566921
and save them in a dataFrame or a csv file.
I am able to see the parsed output and the content of some tags, but whenever I try requesting something with a class I'm interested in, I get no results. First I’d like to get a list of the Japanese words, and I thought I should be able to do it with:
import urllib
from bs4 import BeautifulSoup
url = ["https://iknow.jp/courses/566921"]
data = []
for pg in url:
r = urllib.request.urlopen(pg)
soup = BeautifulSoup(r,"html.parser")
soup.find_all("a", {"class": "cue"})
But I get nothing, also when I search for the response field:
responseList = soup.findAll('p', attrs={ "class" : "response"})
for word in responseList:
print(word)
I tried moving down the tree by finding children but couldn’t get to the text I want. I will be grateful for your help. Here are the fields I'm trying to extract:
After great help from jxpython, I've now stumbed upon a new challenge (perhaps this should be a new thread, but it's quite related, so maybe it's OK here). My goal is to create a dataframe or a csv file, each row containing a Japanese word, translation and examples with transliterations. With the lists created using:
driver.find_elements_by_class_name()
driver.find_elements_by_xpath()
I get lists with different number of element, so it's not possible to easily creatre a dataframe.
# len(cues) 100
# len(responses) 100
# len(transliterations)279 stramge number because some words don't have transliterations
# len(texts) 200
# len(translations)200
The transliterations lists contains a mix of transliterations for single words and sentences. I think to be able to get content to populate the first line of my dataframe I would need to loop through the
<li class="item">
content (xpath? #/html/body/div2/div/div/section/div/section/div/div/ul/li1) and for each extract the word with translation, sentences and transliteration...I'm not sure if this would be the best approach though...
As an example, the information I would like to have in the first row of my dataframe (from the box highlighted in screenshot) is:
行く, いく, go, 日曜日は図書館に行きます。, にちようび は としょかん に いきます。, I go to the library on Sundays.,私は夏休みにプールに行った。, わたし は なつやすみ に プール に いった。, I went to the pool during summer vacation.
The tags you are trying to scrape are not in the source code. Probably because the page is JavaScript rendered. Try this url to see yourself:
view-source:https://iknow.jp/courses/566921
The Python module Selenium solves this problem. If you would like I could write some code for you to start on.
Here is some code to start on:
from selenium import webdriver
url = 'https://iknow.jp/courses/566921'
driver = webdriver.Chrome()
driver.get(url)
driver.implicitly_wait(2)
cues = driver.find_elements_by_class_name('cue')
cues = [cue.text for cue in cues]
responses = driver.find_elements_by_class_name('response')
responses = [response.text for response in responses]
texts = driver.find_elements_by_xpath('//*[#class="sentence-text"]/p[1]')
texts = [text.text for text in texts]
transliterations = driver.find_elements_by_class_name('transliteration')
transliterations = [transliteration.text for transliteration in transliterations]
translations = driver.find_elements_by_class_name('translation')
translations = [translation.text for translation in translations]
driver.close()
Note: You first need to install a webdriver. I choose chrome.
Here is a link: https://chromedriver.storage.googleapis.com/index.html?path=2.41/. Also add this to your path!
If you have any other questions let me know!

I want to get a element by Text in beautiful soup

elem = browser.find_element_by_partial_link_text("WEBSITE")
above code finds out element with a link text as WEBSITE, but I don't want to use Selenuim here and Find element by text by using bs4. I tried the following code but no results
elem = soup.find(text=re.compile('WEBSITE'))
By the documentation provided here. You can do something like below.
ele = soup.find('tag', string='text_for_search')

Resources