I want to get an element by text in Beautiful Soup - python-3.x

elem = browser.find_element_by_partial_link_text("WEBSITE")
The code above finds an element whose link text contains WEBSITE, but I don't want to use Selenium here; I want to find the element by text using bs4. I tried the following code but got no results:
elem = soup.find(text=re.compile('WEBSITE'))

Per the documentation provided here, you can do something like below.
ele = soup.find('tag', string='text_for_search')
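A minimal, self-contained sketch of both variants (the HTML snippet here is invented for illustration). Note that text=/string= on its own returns the matching string rather than the tag, so combine it with a tag name; pass a compiled regex for partial matches:

```python
from bs4 import BeautifulSoup
import re

html = """
<div>
  <a href="/home">HOME</a>
  <a href="/site">WEBSITE LINK</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact match: string= matches the tag's full text
exact = soup.find("a", string="WEBSITE LINK")

# Partial match: pass a compiled regex instead of a plain string
partial = soup.find("a", string=re.compile("WEBSITE"))

print(exact["href"])   # /site
print(partial.text)    # WEBSITE LINK
```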


Can't get multiple span class text with selenium python

I'm getting an error when I try to scrape a flashscore match summary. Example:
flashscore
I want to get, for example, all the results on that page, but driver.find_element_by_class_name("h2h__result") only returns the first result (put inside a for loop, obviously).
If I try driver.find_elements_by_class_name I get an error and I can't understand why.
Code example:
driver.get("https://www.flashscore.com/match/Qs85KCdA/#h2h/overall")
time.sleep(2)
h2h = driver.find_elements_by_class_name("rows")
for x in h2h:
    p = driver.find_element_by_css_selector("span.h2h__regularTimeResult")
    print(p.text)
Can someone help me understand where I'm going wrong? Thank you a lot, guys.
The class name rows matches the whole table. Use the class name h2h__row so that each row is targeted and you will be able to extract the details from that particular row.
Try the XPaths below to get the elements.
from selenium.webdriver.common.by import By
driver.get("https://www.flashscore.com/match/Qs85KCdA/#h2h/overall")
rows = driver.find_elements(By.XPATH, "//div[@class='h2h__row']")
for row in rows:
    results = row.find_element(By.XPATH, ".//span[@class='h2h__regularTimeResult']")  # The leading dot scopes the XPath to within this element
    print(results.text)
You can also use the CSS_SELECTOR below to get the elements directly.
regularTimeResult = driver.find_elements(By.CSS_SELECTOR,"div.h2h__row span.h2h__regularTimeResult")
for item in regularTimeResult:
    print(item.text)
Update:
rows = driver.find_elements(By.XPATH, "//div[@class='h2h__row']")
for row in rows:
    results = row.find_element(By.XPATH, ".//span[@class='h2h__regularTimeResult']")  # The leading dot scopes the XPath to within this element
    if "0 : 0" not in results.text:
        print(results.text)

How to extract particular lines from the list?

I have a list and want to extract a particular line from it. Below is my list.
I want to extract the 'src' link from the above list,
example:
(src="https://r-cf.bstatic.com/xdata/images/hotel/square600/244245064.webp?k=8699eb2006da453ae8fe257eee2dcc242e70667ef29845ed85f70dbb9f61726a&o="). My final aim is to extract only the link. I have 20 records in the list, hence the need to extract 20 links.
My code (I stored the list in 'aas')
links = []
for i in aas:
    link = re.search('CONCLUSION: (.*?)([A-Z]{2,})', i).group(1)
    links.append(link)
I am getting an error: "expected string or bytes-like object"
Any suggestions?
As per the Beautiful Soup documentation, you can access a tag’s attributes by treating the tag like a dictionary, like so:
for img in img_list:
    print(img["src"])
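A fuller sketch of that approach, assuming the list was built from <img> tags (the HTML and variable names here are illustrative, not from the original post):

```python
from bs4 import BeautifulSoup

html = """
<div>
  <img src="https://example.com/hotel1.webp">
  <img src="https://example.com/hotel2.webp">
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Build the list of <img> tags, then read each tag's src attribute
# by treating the tag like a dictionary
img_list = soup.find_all("img")
links = [img["src"] for img in img_list]
print(links)
```

This avoids the regex entirely; the "expected string or bytes-like object" error came from passing bs4 Tag objects, not strings, to re.search.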

Navigating the html tree with BeautifulSoup and/or Selenium

I've just started using BeautifulSoup and came across an obstacle at the very beginning. I looked up similar posts but didn't find a solution to my specific problem, or there is something fundamental I’m not understanding. My goal is to extract Japanese words with their English translations and examples from this page.
https://iknow.jp/courses/566921
and save them in a dataFrame or a csv file.
I am able to see the parsed output and the content of some tags, but whenever I try requesting something with a class I'm interested in, I get no results. First I’d like to get a list of the Japanese words, and I thought I should be able to do it with:
import urllib.request
from bs4 import BeautifulSoup
url = ["https://iknow.jp/courses/566921"]
data = []
for pg in url:
    r = urllib.request.urlopen(pg)
    soup = BeautifulSoup(r, "html.parser")
    soup.find_all("a", {"class": "cue"})
But I get nothing, and the same happens when I search for the response field:
responseList = soup.findAll('p', attrs={"class": "response"})
for word in responseList:
    print(word)
I tried moving down the tree by finding children but couldn’t get to the text I want. I will be grateful for your help. Here are the fields I'm trying to extract:
After great help from jxpython, I've now stumbled upon a new challenge (perhaps this should be a new thread, but it's quite related, so maybe it's OK here). My goal is to create a dataframe or a csv file, each row containing a Japanese word, translation and examples with transliterations. With the lists created using:
driver.find_elements_by_class_name()
driver.find_elements_by_xpath()
I get lists with different numbers of elements, so it's not possible to easily create a dataframe.
# len(cues) 100
# len(responses) 100
# len(transliterations) 279 - strange number, because some words don't have transliterations
# len(texts) 200
# len(translations)200
The transliterations list contains a mix of transliterations for single words and sentences. I think to be able to get content to populate the first line of my dataframe I would need to loop through the
<li class="item">
content (xpath: /html/body/div[2]/div/div/section/div/section/div/div/ul/li[1]) and for each extract the word with translation, sentences and transliteration... I'm not sure if this would be the best approach though...
As an example, the information I would like to have in the first row of my dataframe (from the box highlighted in screenshot) is:
行く, いく, go, 日曜日は図書館に行きます。, にちようび は としょかん に いきます。, I go to the library on Sundays.,私は夏休みにプールに行った。, わたし は なつやすみ に プール に いった。, I went to the pool during summer vacation.
The tags you are trying to scrape are not in the source code, probably because the page is JavaScript-rendered. Open this url to see for yourself:
view-source:https://iknow.jp/courses/566921
The Python module Selenium solves this problem. If you would like I could write some code for you to start on.
Here is some code to start on:
from selenium import webdriver
url = 'https://iknow.jp/courses/566921'
driver = webdriver.Chrome()
driver.get(url)
driver.implicitly_wait(2)
cues = driver.find_elements_by_class_name('cue')
cues = [cue.text for cue in cues]
responses = driver.find_elements_by_class_name('response')
responses = [response.text for response in responses]
texts = driver.find_elements_by_xpath('//*[@class="sentence-text"]/p[1]')
texts = [text.text for text in texts]
transliterations = driver.find_elements_by_class_name('transliteration')
transliterations = [transliteration.text for transliteration in transliterations]
translations = driver.find_elements_by_class_name('translation')
translations = [translation.text for translation in translations]
driver.close()
Note: You first need to install a webdriver. I chose Chrome.
Here is a link: https://chromedriver.storage.googleapis.com/index.html?path=2.41/. Also add it to your PATH!
If you have any other questions let me know!
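On the follow-up problem of mismatched list lengths: one option is to iterate over each item container and extract the fields within it, instead of collecting each field page-wide. A minimal sketch of the idea with BeautifulSoup on an invented snippet (the class names mirror the ones above, but the HTML structure is an assumption; the same row-wise idea works in Selenium by calling find_elements on each item element rather than on the driver):

```python
from bs4 import BeautifulSoup

# Invented HTML mimicking one entry per <li class="item">;
# the second item deliberately has no transliteration
html = """
<ul>
  <li class="item">
    <span class="cue">行く</span>
    <span class="response">go</span>
    <span class="transliteration">いく</span>
  </li>
  <li class="item">
    <span class="cue">図書館</span>
    <span class="response">library</span>
  </li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for item in soup.select("li.item"):
    # Searching inside `item` keeps each row's fields together,
    # so a missing field becomes None instead of shifting the lists
    cue = item.select_one(".cue")
    response = item.select_one(".response")
    translit = item.select_one(".transliteration")
    rows.append({
        "cue": cue.text if cue else None,
        "response": response.text if response else None,
        "transliteration": translit.text if translit else None,
    })

print(rows)
```

Each dict in rows then maps cleanly onto one dataframe row, regardless of which fields are missing.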

Python Selenium - how to get text in div after span

I have a list of urls that go to different anime on myanimelist.net. For each anime, I want to get the text of the genres listed on the website and add it to a list of strings (one element for each anime, not 5 separate elements if an anime has 5 genres listed).
Here is the HTML code for an anime on myanimelist.net. I essentially want to get the genre text at the top of the image and put it in a list, so in the image shown its entry in the list would be ["Mystery, Police, Psychological, Supernatural, Thriller, Shounen"], and for each url in my list another string containing the genres for that anime is appended to the list.
This is the main part of my code
driver = webdriver.Firefox()
flist = [url1, url2, url3] #List of urls
genres = []
for item in flist:
    driver.get(item)  # Opens each url
    elem = driver.find_element_by_xpath("/html/body/div[1]/div[3]/div[3]/div[2]/table/tbody/tr/td[1]/div/div[16]").text
    genres.append(elem)
The code works for some anime and not for others. Sometimes the position is different, and instead of getting the genres I get info about the studio that produced the anime, etc.
What I want is to target the span containing "Genres:" and get the genres listed after it, as shown in my image above. I can't seem to find anything similar to what I'm looking for (though I might just not be phrasing my questions right, as well as lacking experience with xpaths).
driver.get('https://myanimelist.net/anime/35760/Shingeki_no_Kyojin_Season_3')
links = driver.find_elements_by_xpath("//div[contains(string(), 'Genres')]/a[contains(@href,'genre')]")
genres = []
for link in links:
    title = link.get_attribute("title")
    genres.append(title)
print(genres)
genresString = ",".join(genres)
print(genresString)
Sample Output:
['Action', 'Military', 'Mystery', 'Super Power', 'Drama', 'Fantasy', 'Shounen']
Action,Military,Mystery,Super Power,Drama,Fantasy,Shounen
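If Selenium isn't required, the same "anchor on the label text" idea can be sketched with BeautifulSoup. The HTML below is a simplified stand-in for the sidebar markup, so treat the exact structure and class names as assumptions:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the MyAnimeList sidebar markup
html = """
<div>
  <span class="dark_text">Studios:</span> <a href="/anime/producer/858">Wit Studio</a>
</div>
<div>
  <span class="dark_text">Genres:</span>
  <a href="/anime/genre/1/Action" title="Action">Action</a>,
  <a href="/anime/genre/8/Drama" title="Drama">Drama</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Find the "Genres:" label, then collect only the genre links
# that share its parent div (ignoring e.g. the Studios links)
label = soup.find("span", string="Genres:")
genre_links = label.parent.find_all("a", href=lambda h: h and "genre" in h)
genres = [a["title"] for a in genre_links]
print(",".join(genres))
```

Anchoring on the label instead of a positional index is what makes this robust to the layout shifting between anime pages.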

Loop json results

I'm totally new to Python. I have this code:
import requests
won = 'https://api.pipedrive.com/v1/deals?status=won&start=0&api_token=xxxx'
json_data = requests.get(won).json()
deal_name = json_data['data'][0]['title']
print(deal_name)
It prints the first title for me, but I would like it to loop through all titles in the JSON. I can't figure out how; can anyone guide me in the right direction?
You want to read up on dictionaries and lists. It seems like your json_data["data"] contains a list, so:
Seeing you wrote this:
deal_name = json_data['data'][0]['title']
print(deal_name)
What you are looking for is:
for i in range(len(json_data["data"])):
    print(json_data["data"][i]["title"])
Print it with a for loop
1. for item in json_data['data']: will take each element in the list json_data['data']
2. Then we print the title property of the object using the line print(item['title'])
Code:
import requests
won = 'https://api.pipedrive.com/v1/deals?status=won&start=0&api_token=xxxx'
json_data = requests.get(won).json()
for item in json_data['data']:
    print(item['title'])
If you are OK with printing the titles as a list, you can use a list comprehension; please refer to the links in the references to learn more.
print([x['title'] for x in json_data['data']])
References:
Python Loops
Python Lists
Python Comprehensions