How to extract image src from the website - python-3.x

I tried scraping the table rows from the website to get the data on coronavirus spread.
I wanted to extract the src of all the img tags so as to get the source of each country's flag image along with all the data for that country. Could someone help?
import pandas as pd
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
options = Options()
options.add_argument('--headless')
driver = webdriver.Firefox(options=options)
driver.get("https://google.com/covid19-map/?hl=en")
df = pd.read_html(driver.page_source)[1]
df.to_csv("Data.csv", index=False)
driver.quit()

While Gareth's answer has already been accepted, his answer inspired me to write this one from a pandas point of view. Since we know the URL for each flag follows a fixed pattern and the only thing that changes is the country name, we can create a new column by lowercasing the name, replacing spaces with underscores, and weaving the result into the fixed URL pattern:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
options = Options()
options.add_argument('--headless')
driver = webdriver.Firefox(options=options)
driver.get("https://google.com/covid19-map/?hl=en")
df = pd.read_html(driver.page_source)[1]
df['flag_url'] = df.apply(lambda row: f"https://www.gstatic.com/onebox/sports/logos/flags/{row.Location.lower().replace(' ', '_')}_icon_square.svg", axis=1)
df.to_csv("Data.csv", index=False)
driver.quit()
OUTPUT SAMPLE
Location,Confirmed,Cases per 1M people,Recovered,Deaths,flag_url
Worldwide,882068,125.18,185067,44136,https://www.gstatic.com/onebox/sports/logos/flags/worldwide_icon_square.svg
United Kingdom,29474,454.19,135,2352,https://www.gstatic.com/onebox/sports/logos/flags/united_kingdom_icon_square.svg
United States,189441,579.18,7082,4074,https://www.gstatic.com/onebox/sports/logos/flags/united_states_icon_square.svg
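As a small follow-up, since the lambda only reads the Location column, the same flag_url column can also be built with pandas' vectorized string methods (a sketch equivalent to the apply call above):
base = "https://www.gstatic.com/onebox/sports/logos/flags/"
df['flag_url'] = (base
                  + df['Location'].str.lower().str.replace(' ', '_', regex=False)
                  + "_icon_square.svg")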

Not the most ingenious way, but since you already have the page source, how about using a regex to match the URLs of the images?
import re
print(re.findall(r'https://www\.gstatic\.com/onebox/sports/logos/flags/.+?\.svg', driver.page_source))
The image links are in order, so they match the order of confirmed cases, except that on my computer the country I'm in right now is at the top of the list.
If this is not what you want, I can delete this answer.
As mentioned by @Chris Doyle in the comments, this can be done even more simply by noticing that the URLs all follow the same pattern, with ".+?" replaced by the country's name (all lowercase, joined with underscores). You already have that information in the csv file.
country_name = "United Kingdom"
url = "https://www.gstatic.com/onebox/sports/logos/flags/"
url += '_'.join(country_name.lower().split())
url += '.svg'
print (url)
Also be sure to check out his answer using pure pandas :)
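If you would rather pull the src values straight from the rendered page instead of reconstructing the URLs, a minimal selenium sketch along those lines (assuming the flags are rendered as img elements inside the stats table) could look like this:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument('--headless')
driver = webdriver.Firefox(options=options)
driver.get("https://google.com/covid19-map/?hl=en")
# assumption about the page structure: every flag is an <img> inside a table,
# so collect the src attribute of each of those images
flag_srcs = [img.get_attribute("src")
             for img in driver.find_elements_by_css_selector("table img")]
print(flag_srcs[:5])
driver.quit()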

Related

Read a table from google docs without using API

I want to read a Google Sheets table in Python, but without using the API.
I tried with BytesIO and BeautifulSoup.
I know about the solution with gspread, but I need to read the table without a token, only using the URL.
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from lxml.etree import tostring
from io import BytesIO
import requests
req=requests.get('https://docs.google.com/spreadsheets/d/sheetId/edit#gid=', auth=('email', 'password'))
page = req.text
Here I've got HTML code like <!doctype html><html lang="en-US" dir="ltr"><head><base href="h and so on...
I also tried the BeautifulSoup library, but the result is the same.
For reading a table from html, you can use pandas.read_html.
If it's an unrestricted spreadsheet like this one, you probably don't even need requests - you can just pass the URL directly to read_html.
import pandas as pd
sheetId = '1bQo1an4yS1tSOMDhmUTGYtUlgnHDQ47EmIcj4YyuIxo' ## REPLACE WITH YOUR SHEETID
sheetUrl = f'https://docs.google.com/spreadsheets/d/{sheetId}'
sheetDF = pd.read_html(
    sheetUrl, attrs={'class': 'waffle'}, skiprows=1
)[0].drop(['1'], axis=1).dropna(axis=0, how='all').dropna(axis=1, how='all')
If it's not unrestricted, then the way you're using requests.get would not work either, because you're not passing the auth argument correctly. I actually don't think there is any way to log in to Google with just requests.auth. You could log in to Drive in your browser, open a sheet, and then copy the request to https://curlconverter.com/ and paste the result into your code from there.
import pandas as pd
import requests
from bs4 import BeautifulSoup
sheetUrl = 'YOUR_SHEET_URL'
cookies = {}  # PASTE FROM https://curlconverter.com/
headers = {}  # PASTE FROM https://curlconverter.com/
req = requests.get(sheetUrl, cookies=cookies, headers=headers)
# in one line, no error-handling
# sheetDF = pd.read_html(req.text, attrs={'class': 'waffle'}, skiprows=1)[0].drop(['1'], axis=1).dropna(axis=0, how='all').dropna(axis=1, how='all')
# req.raise_for_status() # raise error if request fails
if req.status_code != 200:
    print(req.status_code, req.reason, '- failed to get', sheetUrl)
soup = BeautifulSoup(req.content, 'html.parser')
wTable = soup.find('table', class_="waffle")
if wTable is not None:
    dfList = pd.read_html(str(wTable), skiprows=1)  # set skiprows=1 to skip top row with column names A, B, C...
    sheetDF = dfList[0]  # because read_html returns a LIST of dataframes
    sheetDF = sheetDF.drop(['1'], axis='columns')  # drop the column of row numbers 1, 2, 3...
    sheetDF = sheetDF.dropna(axis='rows', how='all')  # drop empty rows
    sheetDF = sheetDF.dropna(axis='columns', how='all')  # drop empty columns
    ### WHATEVER ELSE YOU WANT TO DO WITH THE DATAFRAME ###
else:
    print('COULD NOT FIND TABLE')
However, please note that the cookies are probably only good for up to 5 hours maximum (and then you'll need to paste new ones), and that if there are multiple sheets in one spreadsheet, you'll only be able to scrape the first sheet with requests/pandas. So, it would be better to use the API for restricted or multi-sheet spreadsheets.
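For unrestricted spreadsheets with several sheets, one more option worth mentioning (not part of the answer above, and it assumes the file is viewable by anyone with the link) is Google's CSV export endpoint, which lets you pick a specific sheet by its gid and read it straight into pandas:
import pandas as pd

sheetId = 'YOUR_SHEET_ID'
gid = 0  # gid of the tab you want, visible at the end of the browser URL
csvUrl = f'https://docs.google.com/spreadsheets/d/{sheetId}/export?format=csv&gid={gid}'
sheetDF = pd.read_csv(csvUrl)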

I'm having trouble returning an HTML link as I pull links from a google search query in Python

I'm attempting to pull website links from a Google search, but I'm having trouble returning any values. I think the issue is with the attributes I'm using to select the link, but I'm not sure why, as I was able to use the same attributes in webdriver to accomplish this.
Here's the code:
import requests
import sys
import webbrowser
import bs4
from parsel import Selector
import xlsxwriter
from openpyxl import load_workbook
import pandas as pd
print('Searching...')
res = requests.get('https://google.com/search?q="retail software" AND "mission"')
soup = bs4.BeautifulSoup(res.content, 'html.parser')
for x in soup.find_all('div', class_='yuRUbf'):
    anchors = x.find_all('a')
    if anchors:
        link = anchors[0]['href']
        print(link)
This is the output:
Searching...
That's it. Thanks in advance for the help!
The class value is dynamic, so you should use the following selector to retrieve the href value:
"a[href*='/url']"
This will get any a tag that contains the pattern /url.
So, just change your for loop to
for anchor_tags in soup.select("a[href*='/url']"):
    print(anchor_tags.attrs.get('href'))
Example of href printed
/url?q=https://www.devontec.com/mission-vision-and-values/&sa=U&ved=2ahUKEwjShcWU-aPtAhWEH7cAHYfQAlEQFjAAegQICBAB&usg=AOvVaw14ziJ-ipXIkPWH3oMXCig1
To get the link, you only need to split the string.
You can do it like this
for anchor_tags in soup.select("a[href*='/url']"):
    link = anchor_tags.attrs.get('href').split('q=')[-1]  # split by 'q=' and take the last part
    print(link)
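Note that splitting on 'q=' keeps everything after the target URL (the &sa=, &ved= and &usg= parameters). If you want just the destination URL, a cleaner sketch (not part of the original answer) is to parse the /url redirect href with urllib.parse:
from urllib.parse import urlparse, parse_qs

for anchor_tags in soup.select("a[href*='/url']"):
    query = urlparse(anchor_tags.attrs.get('href')).query  # e.g. q=https://...&sa=U&ved=...
    target = parse_qs(query).get('q', [None])[0]  # take only the value of the q parameter
    if target:
        print(target)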

Clicking links on the website to get contents in the bubbles with selenium

I'm trying to get the course information on http://bulletin.iit.edu/graduate/colleges/science/applied-mathematics/master-data-science/#programrequirementstext.
In my code, I tried to first click on each course, next get the description in the bubble, and then close the bubble as it may overlay on top of other course links.
My problem is that I couldn't get the description in the bubble and some course links were still skipped though I tried to avoid it by closing the bubble.
Any idea about how to do this? Thanks in advance!
import time
from selenium import webdriver

info = []
driver = webdriver.Chrome()
driver.get('http://bulletin.iit.edu/graduate/colleges/science/applied-mathematics/master-data-science/#programrequirementstext')
for i in range(1, 3):
    for j in range(2, 46):
        try:
            driver.find_element_by_xpath('//*[@id="programrequirementstextcontainer"]/table[' + str(i) + ']/tbody/tr[' + str(j) + ']/td[1]/a').click()
            info.append(driver.find_elements_by_xpath('/html/body/div[8]/div[3]/div/div')[0].text)
            driver.find_element_by_xpath('//*[@id="lfjsbubbleclose"]').click()
            time.sleep(3)
        except:
            pass
I'm not sure why you have put static ranges in the for loops, especially since none of the i and j index combinations in your xpath find any element on the page.
I would suggest finding all the course link elements on the page with a single locator and looping through them to get the descriptions from the bubbles.
Use below code:
course_list = driver.find_elements_by_css_selector("table.sc_courselist a.bubblelink.code")
wait = WebDriverWait(driver, 20)
for course in course_list:
    try:
        print("grabbing info of course : ", course.text)
        course.click()
        wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.courseblockdesc")))
        info.append(driver.find_element_by_css_selector('div.courseblockdesc>p').text)
        wait.until(EC.visibility_of_element_located((By.ID, "lfjsbubbleclose")))
        driver.find_element_by_id('lfjsbubbleclose').click()
    except:
        print("error while grabbing info")
print(info)
As it requires some time to load the content in the bubble, you should introduce an explicit wait in your script until the bubble content is completely visible, and only then grab it.
Import the packages below to use the wait in the above code:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
Please note, this code grabs all the course descriptions from the bubbles. Let me know if you are looking for something specific rather than all of them.
To load the bubble, the website makes an ajax call.
import requests
from bs4 import BeautifulSoup
def course(course_code):
    data = {"page": "getcourse.rjs", "code": course_code}
    res = requests.get("http://bulletin.iit.edu/ribbit/index.cgi", data=data)
    soup = BeautifulSoup(res.text, "lxml")
    result = {}
    result["description"] = soup.find("div", class_="courseblockdesc").text.strip()
    result["title"] = soup.find("div", class_="coursetitle").text.strip()
    return result
Output for course("CS 522")
{'description': 'Continued exploration of data mining algorithms. More sophisticated algorithms such as support vector machines will be studied in detail. Students will continuously study new contributions to the field. A large project will be required that encourages students to push the limits of existing data mining techniques.',
'title': 'Advanced Data Mining'}
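Building on that helper, here is a short sketch of how you might collect every course in the program without selenium at all. It assumes the course links are present in the bulletin page's static HTML under the same table.sc_courselist a.bubblelink.code selector used in the selenium answer above, and that each link's text is the course code:
import requests
from bs4 import BeautifulSoup

bulletin = "http://bulletin.iit.edu/graduate/colleges/science/applied-mathematics/master-data-science/"
soup = BeautifulSoup(requests.get(bulletin).text, "lxml")

catalog = {}
for a in soup.select("table.sc_courselist a.bubblelink.code"):
    code = a.get_text(strip=True).replace("\xa0", " ")  # link text like "CS 522" (non-breaking space assumed)
    catalog[code] = course(code)  # reuse the course() helper defined above
print(len(catalog), "courses fetched")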

Using Python, Selenium, and BeautifulSoup to scrape for content of a tag?

Relative beginner here. There are similar topics to this, but I can see how my solution works; I just need help connecting these last few dots. I'd like to scrape follower counts from Instagram without the use of the API. Here's what I have so far:
Python 3.7.0
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
> DevTools listening on ws://.......
driver.get("https://www.instagram.com/cocacola")
soup = BeautifulSoup(driver.page_source)
elements = soup.find_all(attrs={"class":"g47SY "})
# Note the full class is 'g47SY lOXF2' but I can't get this to work
for element in elements:
    print(element)
>[<span class="g47SY ">667</span>,
<span class="g47SY " title="2,598,456">2.5m</span>, # Need what's in title, 2,598,456
<span class="g47SY ">582</span>]
for element in elements:
    t = element.get('title')
    if t:
        count = t
        count = count.replace(",", "")
    else:
        pass
print(int(count))
>2598456 # Success
Is there any easier or quicker way to get to the 2,598,456 number? My original hope was that I could just use the class 'g47SY lOXF2', but spaces in the class name aren't functional in BS4 as far as I'm aware. I just want to make sure this code is succinct and functional.
I had to use the headless option and add executable_path for testing. You can remove that.
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(executable_path="chromedriver.exe",chrome_options=options)
driver.get('https://www.instagram.com/cocacola')
soup = BeautifulSoup(driver.page_source,'lxml')
#This will give you span that has title attribute. But it gives us multiple results
#Follower count is in the inner of a tag.
followers = soup.select_one('a > span[title]')['title'].replace(',','')
print(followers)
#Output 2598552
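On the multi-class point from the question: BeautifulSoup can match both classes if you use a CSS selector instead of find_all with the raw class string, because select treats each dot-separated class independently. A small sketch reusing the soup from above:
# match <span> elements that carry BOTH classes, in any order
for span in soup.select("span.g47SY.lOXF2"):
    print(span.get_text(), span.get('title'))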
You could use a regular expression to get the number.
Try this:
import re
followerRegex = re.compile(r'title="((\d){1,3}(,)?)+')
followerCount = followerRegex.search(str(elements))
result = followerCount.group().strip('title="').replace(',', '')

Blank List returned when using XPath with Morningstar Key Ratios

I am trying to pull a piece of data from the Morningstar key ratios page for any given stock using XPath. I have the full path that returns a result in the XPath Helper toolbar add-on for Google Chrome, but when I plug it into my code I get a blank list returned.
How do I get the result that I want returned? Is this even possible? Am I using the wrong approach?
Any help is much appreciated!
Piece of data that I want returned (AMD key ratios example): [screenshot omitted]
My Code:
from urllib.request import urlopen
import os.path
import sys
from lxml import html
import requests
page = requests.get('http://financials.morningstar.com/ratios/r.html?t=AMD&region=USA&culture=en_US')
tree = html.fromstring(page.content)
rev = tree.xpath('/html/body/div[1]/div[3]/div[2]/div[1]/div[1]/div[1]/table/tbody/tr[2]/td[1]')
print(rev)
Result of code:
[]
Desired result from XPath Helper: [screenshot omitted]
Thanks,
Not Euler
This is one of those pages that downloads much of its content in stages. If you look for the item you want after using just requests you will find that it's not yet available, as shown here.
>>> import requests
>>> url = 'http://financials.morningstar.com/ratios/r.html?t=AMD&region=USA&culture=en_US'
>>> page = requests.get(url).text
>>> '5,858' in page
False
One strategy for processing these pages involves the use of the selenium library. Here, selenium launches a copy of the Chrome browser, loads that url then uses an xpath expression to locate the td element of interest. Finally, the number you want becomes available as the text property of that element.
>>> from selenium import webdriver
>>> driver = webdriver.Chrome()
>>> driver.get(url)
>>> td = driver.find_element_by_xpath('.//th[@id="i0"]/td[1]')
>>> td
<selenium.webdriver.remote.webelement.WebElement (session="f436b07c27742abb36b262639245801f", element="0.12745670001529863-2")>
>>> td.text
'5,858'
As the content of that page is generated dynamically, you can either go through the process Bill Bell has already shown, or grab the page source and apply a css selector to it to get the desired value. Here is an alternative to xpath:
from lxml import html
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('http://financials.morningstar.com/ratios/r.html?t=AMD&region=USA&culture=en_US')
tree = html.fromstring(driver.page_source)
driver.quit()
rev = tree.cssselect('td[headers^=Y0]')[0].text
print(rev)
Result:
5,858
