How to get JavaScript-generated content - python-3.x

I am currently having an issue getting the HTML of JavaScript-generated content on this webpage ( https://aca3.accela.com/MILARA/GeneralProperty/PropertyLookUp.aspx ). The content is generated by JavaScript on the page itself, and I was wondering what I am doing wrong. The code I'm using is this:
import time
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver import ActionChains
from selenium.webdriver.common.keys import Keys
import selenium.webdriver.support.ui as ui
from bs4 import BeautifulSoup
driver = webdriver.Chrome(executable_path="/Users/MrPete/Downloads/chromedriver_win32/chromedriver")
driver.get('https://aca3.accela.com/MILARA/GeneralProperty/PropertyLookUp.aspx')
profession = Select(driver.find_element_by_xpath('//*[@id="ctl00_PlaceHolderMain_refLicenseeSearchForm_ddlLicenseType"]'))
profession.select_by_value("Pharmacist")
time.sleep(5) # Let the user actually see something!
lName = driver.find_element_by_xpath('//*[@id="ctl00_PlaceHolderMain_refLicenseeSearchForm_txtLastName"]')
lName.send_keys('roy')
search = driver.find_element_by_xpath('//*[@id="ctl00_PlaceHolderMain_btnNewSearch"]')
search.click()
time.sleep(5)
html = driver.execute_script("return document.getElementsByTagName('table')[38].innerHTML")
print(html)
Now, it's not that I'm getting no output; the output I'm getting is:
<tbody><tr>
<td style="white-space:nowrap;"><span class="ACA_SmLabel ACA_SmLabel_FontSize"> Showing 1-13 of 13 </span></td>
</tr>
</tbody>
Which is (kind of) the title of the table I'm trying to get. What I want as the output is the HTML of the entire table (I posted a picture of the table that is generated by JavaScript). What I'm currently getting is the small title at the top of the picture, the 'Showing 1-13 of 13', and what I want is the entire table.

Try changing
html = driver.execute_script("return document.getElementsByTagName('table')[38].innerHTML")
print(html)
To:
target = driver.find_element_by_xpath('//table[@class="ACA_GridView ACA_Grid_Caption"]')
print(target.text)
Output:
Showing 1-13 of 13
License Type
License Number
First Name
Middle Initial
Last Name
Organization Name
DBA/Trade Name
License Status
License Expiration Date
Pharmacist
5302017621
Arthur
James
etc.
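If you need the raw HTML of the whole table rather than its flattened text, a minimal sketch (reusing the table class from the answer above, and assuming BeautifulSoup is available as in the question) is to read the element's outerHTML and parse the rows:
from bs4 import BeautifulSoup
# Grab the table element located above and take its full HTML, not just .text
target = driver.find_element_by_xpath('//table[@class="ACA_GridView ACA_Grid_Caption"]')
table_html = target.get_attribute('outerHTML')
# Parse the rows into lists of cell texts
soup = BeautifulSoup(table_html, 'html.parser')
rows = []
for tr in soup.find_all('tr'):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
    if cells:
        rows.append(cells)
print(rows)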

Related

Finding HTML after Selenium actions on a website

Have a look at this website: https://mops.twse.com.tw/mops/web/t146sb05 .
Enter the value 6150 in the text box and press Enter.
See that the url does not change, but the HTML changes. I want to scrape the values 117,129 and -13.57% from this page. I have entered the value and pressed Enter using Selenium, but don't know how to proceed further.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import requests
from lxml import html
from selenium.webdriver.common.by import By
DRIVER_PATH = 'E:/Anaconda3/chromedriver'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get('https://mops.twse.com.tw/mops/web/t146sb05')
input_entering = driver.find_element_by_xpath('//*[@id="co_id"]').click()
new_driver = driver.find_element_by_xpath('//*[@id="co_id"]').send_keys(6150, Keys.RETURN)
In your code you are trying to perform actions on objects that are not valid elements, so the click and send_keys calls do not behave as expected.
Please try the solution below:
driver.get('https://mops.twse.com.tw/mops/web/t146sb05')
element = driver.find_element_by_xpath('//*[@id="co_id"]')
element.send_keys("6150", Keys.RETURN)
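To then read the figures off the refreshed page (the URL does not change), one rough sketch is to wait briefly and parse the updated DOM with BeautifulSoup, continuing from the code above; the exact location of the 117,129 and -13.57% cells is an assumption you would need to confirm by inspecting the page:
import time
from bs4 import BeautifulSoup
# The page updates in place after Enter, so give it a moment before reading page_source
time.sleep(2)
soup = BeautifulSoup(driver.page_source, 'html.parser')
# Dump each table's text so you can spot which one holds the values you want;
# replace this with a targeted selector once you know the table's structure.
for table in soup.find_all('table'):
    print(table.get_text(' ', strip=True)[:200])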

Extracting data tables from HTML source after scraping using Selenium & Python

I am trying to scrape data from this link. I've researched questions that have already been asked and I've successfully done some scraping, but I have a few issues with the results that are generated. The following is the piece of code that I've used to scrape:
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.keys import Keys
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(chrome_options=options)
driver.get('http://www.scstrade.com/MarketStatistics/MS_HistoricalIndices.aspx')
inputElement_index = driver.find_element_by_id("txtSearch")
inputElement_index.send_keys('KSE ALL')
inputElement_date = driver.find_element_by_id("date1")
inputElement_date.send_keys('03/12/2019')
inputElement_date_end = driver.find_element_by_id("date2")
inputElement_date_end.send_keys('03/12/2020')
inputElement_viewprice = driver.find_element_by_id("btn1")
inputElement_viewprice.send_keys(Keys.ENTER)
tabel = driver.find_elements_by_css_selector('table > tbody')[0]
The aim is to extract data from the link for dates between 12th Mar 2020 and 3rd Mar 2020, with the KSE ALL index. Now the above code works, but in the last line the table object is blank when the code runs for the first time; if I re-run that last line, it gives the table (as a string) from the first page. Why don't I get the table when the code runs for the first time? And how can I get a pandas DataFrame from the table object, which is a string?
I tried the following code to get the first page's data into a pandas DataFrame, but the table object turns out to be 'NoneType'.
htmlSource = driver.page_source
soup = BeautifulSoup(htmlSource, 'html.parser')
table = soup.find('table', class_='tbody')
Second, I want to extract the entire data, not just the data on the first page, and the number of pages is dynamic; it changes as the date range changes. To move to the next page I tried the following piece of code:
driver.find_element_by_id("next_pager").click()
I got the following error.
selenium.common.exceptions.ElementClickInterceptedException: Message: element click intercepted: Element <td id="next_pager" class="ui-pg-button" title="Next Page">...</td> is not clickable at point (790, 95). Other element would receive the click: <div class="loading row" id="load_list" style="display: block;">...</div>
I tried to look up how this issue can be resolved and wrote the code below to add some waiting time, but got the same error as above.
wait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '[title="Next Page"]'))).click()
How can I move to subsequent pages, extract data from all pages (the number of pages is dynamic, depending on the date range), and append it to the data extracted from the previous pages?
I would rather prefer the API approach in this case; it is faster and easier to get the data, and you also don't have to page through the table.
Below is the API code to get the response (I just changed the date range to make sure you will see multiple pages' worth of data in one request call):
import requests
url = "http://www.scstrade.com/MarketStatistics/MS_HistoricalIndices.aspx/chart"
payload = "{\"par\": \"KSE All\", \"date1\": \"01/03/2020\",\"date2\": \"03/12/2020\"}"
headers = {
'Content-Type': 'application/json'
}
response = requests.request("POST", url, headers=headers, data=payload)
print(response.text.encode('utf8'))
The only thing is you have to convert the date format in the response (the dates come back as /Date(epoch_ms)/ values).
result:
b'{"d":[{"kse_index_id":13362,"kse_index_type_id":1,"kse_index_date":"\\/Date(1577991600000)\\/","kse_index_open":30046.67,"kse_index_high":30053.64,"kse_index_low":29665.65,"kse_index_close":29774.00,"kse_index_value":322398592,"kse_index_change":-98.97,"kse_index_changep":-0.33},{"kse_index_id":13366,"kse_index_type_id":1,"kse_index_date":"\\/Date(1578250800000)\\/","kse_index_open":29547.06,"kse_index_high":29774.00,"kse_index_low":29101.65,"kse_index_close":29145.52,"kse_index_value":266525664,"kse_index_change":-628.48,"kse_index_changep":-2.11},{"kse_index_id":13370,"kse_index_type_id":1,"kse_index_date":"\\/Date(1578337200000)\\/","kse_index_open":29209.91,"kse_index_high":29393.74,"kse_index_low":29072.69,"kse_index_close":29375.75,"kse_index_value":206397936,"kse_index_change":230.23,"kse_index_changep":0.79},{"kse_index_id":13374,"kse_index_type_id":1,"kse_index_date":"\\/Date(1578423600000)\\/","kse_index_open":29157.77,"kse_index_high":29375.75,"kse_index_low":28882.75,"kse_index_close":29010.85,"kse_index_value":279807072,"kse_index_change":-364.90,"kse_index_changep":-1.24},{"kse_index_id":13378,"kse_index_type_id":1,"kse_index_date":"\\/Date(1578510000000)\\/","kse_index_open":29319.08,"kse_index_high":29667.92,"kse_index_low":29010.85,"kse_index_close":29654.66,"kse_index_value":361992128,"kse_index_change":643.81,"kse_index_changep":2.22},{"kse_index_id":13382,"kse_index_type_id":1,"kse_index_date":"\\/Date(1578596400000)\\/","kse_index_open":29732.02,"kse_index_high":30070.99,"kse_index_low":29654.66,"kse_index_close":30058.45,"kse_index_value":400051936,"kse_index_change":403.79,"kse_index_changep":1.36},{"kse_index_id":13386,"kse_index_type_id":1,"kse_index_date":"\\/Date(1578855600000)\\/","kse_index_open":30109.26,"kse_index_high":30194.74,"kse_index_low":29901.75,"kse_index_close":30020.98,"kse_index_value":365810592,"kse_index_change":-37.47,"kse_index_changep":-0.13},{"kse_index_id":13390,"kse_index_type_id":1,"kse_index_date":"\\/Date(1578942000000)\\/","kse_index_open":30059.23,"kse_index_high":30150.96,"kse_index_low":29932.22,"kse_index_close":29973.44,"kse_index_value":249556960,"kse_index_change":-47.54,"kse_index_changep":-0.16},{"kse_index_id":13394,"kse_index_type_id":1,"kse_index_date":"\\/Date(1579028400000)\\/","kse_index_open":29986.93,"kse_index_high":29999.17,"kse_index_low":29799.04,"kse_index_close":29892.79,"kse_index_value":171127728,"kse_index_change":-80.65,"kse_index_changep":-0.27},{"kse_index_id":13398,"kse_index_type_id":1,"kse_index_date":"\\/Date(1579114800000)\\/","kse_index_open":29913.22,"kse_index_high":30007.53,"kse_index_low":29779.46,"kse_index_close":29914.47,"kse_index_value":229585632,"kse_index_change":21.68,"kse_index_changep":0.07},{"kse_index_id":13402,"kse_index_type_id":1,"kse_index_date":"\\/Date(1579201200000)\\/","kse_index_open":29929.81,"kse_index_high":30037.83,"kse_index_low":29914.46,"kse_index_close":29998.45,"kse_index_value":211220464,"kse_index_change":83.98,"kse_index_changep":0.28},{"kse_index_id":13406,"kse_index_type_id":1,"kse_index_date":"\\/Date(1579460400000)\\/","kse_index_open":30043.65,"kse_index_high":30089.73,"kse_index_low":29734.95,"kse_index_close":29808.60,"kse_index_value":173774336,"kse_index_change":-189.85,"kse_index_changep":-0.63},{"kse_index_id":13410,"kse_index_type_id":1,"kse_index_date":"\\/Date(1579546800000)\\/","kse_index_open":29856.28,"kse_index_high":29928.72,"kse_index_low":29621.78,"kse_index_close":29735.95,"kse_index_value":177421264,"kse_index_change":-72.65,"kse_index_chang
ep":-0.24},{"kse_index_id":13414,"kse_index_type_id":1,"kse_index_date":"\\/Date(1579633200000)\\/","kse_index_open":29746.05,"kse_index_high":29754.25,"kse_index_low":29308.76,"kse_index_close":29561.63,"kse_index_value":177486256,"kse_index_change":-174.32,"kse_index_changep":-0.59},{"kse_index_id":13418,"kse_index_type_id":1,"kse_index_date":"\\/Date(1579719600000)\\/","kse_index_open":29621.60,"kse_index_high":29759.68,"kse_index_low":29409.24,"kse_index_close":29456.52,"kse_index_value":230561152,"kse_index_change":-105.11,"kse_index_changep":-0.36},{"kse_index_id":13422,"kse_index_type_id":1,"kse_index_date":"\\/Date(1579806000000)\\/","kse_index_open":29440.00,"kse_index_high":29585.39,"kse_index_low":29318.90,"kse_index_close":29529.89,"kse_index_value":172677024,"kse_index_change":73.37,"kse_index_changep":0.25},{"kse_index_id":13426,"kse_index_type_id":1,"kse_index_date":"\\/Date(1580065200000)\\/","kse_index_open":29533.27,"kse_index_high":29594.55,"kse_index_low":29431.95,"kse_index_close":29462.60,"kse_index_value":198224992,"kse_index_change":-67.29,"kse_index_changep":-0.23},{"kse_index_id":13430,"kse_index_type_id":1,"kse_index_date":"\\/Date(1580151600000)\\/","kse_index_open":29457.47,"kse_index_high":29462.59,"kse_index_low":29230.53,"kse_index_close":29345.90,"kse_index_value":188781760,"kse_index_change":-116.70,"kse_index_changep":-0.40},{"kse_index_id":13434,"kse_index_type_id":1,"kse_index_date":"\\/Date(1580238000000)\\/","kse_index_open":29354.64,"kse_index_high":29446.90,"kse_index_low":29083.61,"kse_index_close":29135.35,"kse_index_value":197011200,"kse_index_change":-210.55,"kse_index_changep":-0.72},{"kse_index_id":13438,"kse_index_type_id":1,"kse_index_date":"\\/Date(1580324400000)\\/","kse_index_open":29132.60,"kse_index_high":29181.59,"kse_index_low":28969.60,"kse_index_close":29123.53,"kse_index_value":162120016,"kse_index_change":-11.82,"kse_index_changep":-0.04},{"kse_index_id":13442,"kse_index_type_id":1,"kse_index_date":"\\/Date(1580410800000)\\/","kse_index_open":29166.18,"kse_index_high":29257.79,"kse_index_low":28945.19,"kse_index_close":29067.54,"kse_index_value":193415040,"kse_index_change":-55.99,"kse_index_changep":-0.19},{"kse_index_id":13446,"kse_index_type_id":1,"kse_index_date":"\\/Date(1580670000000)\\/","kse_index_open":28941.02,"kse_index_high":29067.54,"kse_index_low":28246.97,"kse_index_close":28315.61,"kse_index_value":202691712,"kse_index_change":-751.93,"kse_index_changep":-2.59},{"kse_index_id":13450,"kse_index_type_id":1,"kse_index_date":"\\/Date(1580756400000)\\/","kse_index_open":28356.76,"kse_index_high":28506.86,"kse_index_low":28245.23,"kse_index_close":28493.84,"kse_index_value":145986304,"kse_index_change":178.23,"kse_index_changep":0.63},{"kse_index_id":13454,"kse_index_type_id":1,"kse_index_date":"\\/Date(1580929200000)\\/","kse_index_open":28577.12,"kse_index_high":28633.74,"kse_index_low":28375.60,"kse_index_close":28398.38,"kse_index_value":127719744,"kse_index_change":-95.46,"kse_index_changep":-0.34},{"kse_index_id":13458,"kse_index_type_id":1,"kse_index_date":"\\/Date(1581015600000)\\/","kse_index_open":28458.74,"kse_index_high":28458.75,"kse_index_low":27983.62,"kse_index_close":28042.82,"kse_index_value":193151648,"kse_index_change":-355.56,"kse_index_changep":-1.25},{"kse_index_id":13462,"kse_index_type_id":1,"kse_index_date":"\\/Date(1581274800000)\\/","kse_index_open":28043.58,"kse_index_high":28053.71,"kse_index_low":27470.38,"kse_index_close":27520.35,"kse_index_value":180630816,"kse_index_change":-522.47,"kse_
index_changep":-1.86},{"kse_index_id":13466,"kse_index_type_id":1,"kse_index_date":"\\/Date(1581361200000)\\/","kse_index_open":27601.00,"kse_index_high":28017.17,"kse_index_low":27492.28,"kse_index_close":27865.16,"kse_index_value":161458304,"kse_index_change":344.81,"kse_index_changep":1.25},{"kse_index_id":13470,"kse_index_type_id":1,"kse_index_date":"\\/Date(1581447600000)\\/","kse_index_open":27959.20,"kse_index_high":28384.45,"kse_index_low":27865.16,"kse_index_close":28309.35,"kse_index_value":179861264,"kse_index_change":444.19,"kse_index_changep":1.59},{"kse_index_id":13474,"kse_index_type_id":1,"kse_index_date":"\\/Date(1581534000000)\\/","kse_index_open":28380.58,"kse_index_high":28468.96,"kse_index_low":28191.97,"kse_index_close":28256.09,"kse_index_value":197307008,"kse_index_change":-53.26,"kse_index_changep":-0.19},{"kse_index_id":13478,"kse_index_type_id":1,"kse_index_date":"\\/Date(1581620400000)\\/","kse_index_open":28327.55,"kse_index_high":28330.57,"kse_index_low":27917.81,"kse_index_close":28015.75,"kse_index_value":117521904,"kse_index_change":-240.34,"kse_index_changep":-0.85},{"kse_index_id":13482,"kse_index_type_id":1,"kse_index_date":"\\/Date(1581879600000)\\/","kse_index_open":28023.74,"kse_index_high":28130.89,"kse_index_low":27900.27,"kse_index_close":28002.69,"kse_index_value":99813272,"kse_index_change":-13.06,"kse_index_changep":-0.05},{"kse_index_id":13486,"kse_index_type_id":1,"kse_index_date":"\\/Date(1581966000000)\\/","kse_index_open":28036.95,"kse_index_high":28141.44,"kse_index_low":27758.54,"kse_index_close":27807.10,"kse_index_value":91269288,"kse_index_change":-195.59,"kse_index_changep":-0.70},{"kse_index_id":13490,"kse_index_type_id":1,"kse_index_date":"\\/Date(1582052400000)\\/","kse_index_open":27843.99,"kse_index_high":28108.02,"kse_index_low":27807.11,"kse_index_close":28063.85,"kse_index_value":142765888,"kse_index_change":256.75,"kse_index_changep":0.92},{"kse_index_id":13494,"kse_index_type_id":1,"kse_index_date":"\\/Date(1582138800000)\\/","kse_index_open":28122.04,"kse_index_high":28132.98,"kse_index_low":27989.14,"kse_index_close":28018.02,"kse_index_value":111998784,"kse_index_change":-45.83,"kse_index_changep":-0.16},{"kse_index_id":13498,"kse_index_type_id":1,"kse_index_date":"\\/Date(1582225200000)\\/","kse_index_open":28028.61,"kse_index_high":28039.38,"kse_index_low":27856.26,"kse_index_close":27895.15,"kse_index_value":85454400,"kse_index_change":-122.87,"kse_index_changep":-0.44},{"kse_index_id":13502,"kse_index_type_id":1,"kse_index_date":"\\/Date(1582484400000)\\/","kse_index_open":27880.35,"kse_index_high":27895.15,"kse_index_low":27200.92,"kse_index_close":27248.30,"kse_index_value":144128160,"kse_index_change":-646.85,"kse_index_changep":-2.32},{"kse_index_id":13506,"kse_index_type_id":1,"kse_index_date":"\\/Date(1582570800000)\\/","kse_index_open":27206.95,"kse_index_high":27321.33,"kse_index_low":26851.06,"kse_index_close":27018.98,"kse_index_value":124276016,"kse_index_change":-229.32,"kse_index_changep":-0.84},{"kse_index_id":13510,"kse_index_type_id":1,"kse_index_date":"\\/Date(1582657200000)\\/","kse_index_open":27058.85,"kse_index_high":27070.75,"kse_index_low":26560.92,"kse_index_close":26687.95,"kse_index_value":147798160,"kse_index_change":-331.03,"kse_index_changep":-1.23},{"kse_index_id":13514,"kse_index_type_id":1,"kse_index_date":"\\/Date(1582743600000)\\/","kse_index_open":26355.50,"kse_index_high":26687.95,"kse_index_low":25780.38,"kse_index_close":26396.96,"kse_index_value":248988672,"kse_index_change":-290.
99,"kse_index_changep":-1.09},{"kse_index_id":13518,"kse_index_type_id":1,"kse_index_date":"\\/Date(1582830000000)\\/","kse_index_open":26302.05,"kse_index_high":26519.47,"kse_index_low":26181.00,"kse_index_close":26289.38,"kse_index_value":201662240,"kse_index_change":-107.58,"kse_index_changep":-0.41},{"kse_index_id":13522,"kse_index_type_id":1,"kse_index_date":"\\/Date(1583089200000)\\/","kse_index_open":26342.71,"kse_index_high":27096.59,"kse_index_low":26289.38,"kse_index_close":27059.34,"kse_index_value":215058320,"kse_index_change":769.96,"kse_index_changep":2.93},{"kse_index_id":13526,"kse_index_type_id":1,"kse_index_date":"\\/Date(1583175600000)\\/","kse_index_open":27200.11,"kse_index_high":27385.30,"kse_index_low":26854.16,"kse_index_close":27054.89,"kse_index_value":225222304,"kse_index_change":-4.45,"kse_index_changep":-0.02},{"kse_index_id":13530,"kse_index_type_id":1,"kse_index_date":"\\/Date(1583262000000)\\/","kse_index_open":27070.16,"kse_index_high":27069.35,"kse_index_low":26797.32,"kse_index_close":26919.79,"kse_index_value":186877760,"kse_index_change":-135.10,"kse_index_changep":-0.50},{"kse_index_id":13534,"kse_index_type_id":1,"kse_index_date":"\\/Date(1583348400000)\\/","kse_index_open":26961.15,"kse_index_high":27369.98,"kse_index_low":26919.79,"kse_index_close":27228.79,"kse_index_value":340043072,"kse_index_change":309.00,"kse_index_changep":1.15},{"kse_index_id":13538,"kse_index_type_id":1,"kse_index_date":"\\/Date(1583434800000)\\/","kse_index_open":27126.48,"kse_index_high":27228.79,"kse_index_low":26517.64,"kse_index_close":26557.85,"kse_index_value":244063824,"kse_index_change":-670.94,"kse_index_changep":-2.46},{"kse_index_id":13542,"kse_index_type_id":1,"kse_index_date":"\\/Date(1583694000000)\\/","kse_index_open":25878.94,"kse_index_high":26557.85,"kse_index_low":25304.60,"kse_index_close":25875.06,"kse_index_value":307753952,"kse_index_change":-682.79,"kse_index_changep":-2.57},{"kse_index_id":13546,"kse_index_type_id":1,"kse_index_date":"\\/Date(1583780400000)\\/","kse_index_open":25758.62,"kse_index_high":26210.06,"kse_index_low":25719.55,"kse_index_close":26184.13,"kse_index_value":274065504,"kse_index_change":309.07,"kse_index_changep":1.19},{"kse_index_id":13550,"kse_index_type_id":1,"kse_index_date":"\\/Date(1583866800000)\\/","kse_index_open":26331.02,"kse_index_high":26562.31,"kse_index_low":26061.81,"kse_index_close":26127.67,"kse_index_value":217595296,"kse_index_change":-56.46,"kse_index_changep":-0.22},{"kse_index_id":13554,"kse_index_type_id":1,"kse_index_date":"\\/Date(1583953200000)\\/","kse_index_open":26002.00,"kse_index_high":26127.67,"kse_index_low":25245.98,"kse_index_close":25310.97,"kse_index_value":230028032,"kse_index_change":-816.70,"kse_index_changep":-3.13}]}'

Web Scraping reviews -Flipkart

I am trying to take out the entire review of a product (the remaining half of the review is displayed after clicking READ MORE), but I am still not able to do so. It is not displaying the entire content of a review, which gets displayed after clicking the READ MORE option. Below is the code, which clicks the READ MORE option and also gets data from the website.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
response = requests.get("https://www.flipkart.com/poco-f1-graphite-black-64-gb/product-reviews/itmf8fyjyssnt25c?page=2&pid=MOBF85V7A6PXETAX")
data = BeautifulSoup(response.content, 'lxml')
chromepath = r"C:\Users\Mohammed\Downloads\chromedriver.exe"
driver=webdriver.Chrome(chromepath)
driver.get("https://www.flipkart.com/poco-f1-graphite-black-64-gb/product-reviews/itmf8fyjyssnt25c?page=2&pid=MOBF85V7A6PXETAX")
d = driver.find_element_by_class_name("_1EPkIx")
d.click()
title = data.find_all("p",{"class" : "_2xg6Ul"})
text1 = data.find_all("div",{"class" : "qwjRop"})
name = data.find_all("p",{"class" : "_3LYOAd _3sxSiS"})
for t2, t, t1 in zip(title, text1, name):
    print(t2.text, '\n', t.text, '\n', t1.text)
To get the full reviews, it is necessary to click on those READ MORE buttons to unwrap the rest. As you have already used selenium in combination with BeautifulSoup, I've modified the script to follow that logic. The script will first click on those READ MORE buttons; once that is done, it will parse all the titles and reviews from there. You can now get the titles and reviews from multiple pages (up to 4 pages).
import time
from bs4 import BeautifulSoup
from selenium import webdriver
link = "https://www.flipkart.com/poco-f1-graphite-black-64-gb/product-reviews/itmf8fyjyssnt25c?page={}&pid=MOBF85V7A6PXETAX"
driver = webdriver.Chrome() #If necessary, define the chrome path explicitly
for page_num in range(1,5):
    driver.get(link.format(page_num))
    [item.click() for item in driver.find_elements_by_class_name("_1EPkIx")]
    time.sleep(1)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    for items in soup.select("._3DCdKt"):
        title = items.select_one("p._2xg6Ul").text
        review = ' '.join(items.select_one(".qwjRop div:nth-of-type(2)").text.split())
        print(f'{title}\n{review}\n')
driver.quit()

Using Python, Selenium, and BeautifulSoup to scrape for content of a tag?

Relative beginner here. There are similar topics to this, but I just need help connecting these last few dots. I'd like to scrape follower counts from Instagram without using the API. Here's what I have so far:
Python 3.7.0
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
> DevTools listening on ws://.......
driver.get("https://www.instagram.com/cocacola")
soup = BeautifulSoup(driver.page_source)
elements = soup.find_all(attrs={"class":"g47SY "})
# Note the full class is 'g47SY lOXF2' but I can't get this to work
for element in elements:
    print(element)
>[<span class="g47SY ">667</span>,
<span class="g47SY " title="2,598,456">2.5m</span>, # Need what's in title, 2,598,456
<span class="g47SY ">582</span>]
for element in elements:
    t = element.get('title')
    if t:
        count = t
        count = count.replace(",","")
    else:
        pass
print(int(count))
>2598456 # Success
Is there any easier or quicker way to get to the 2,598,456 number? My original hope was that I could just use the class 'g47SY lOXF2', but spaces in class names don't work in BS4 as far as I'm aware. I just want to make sure this code is succinct and functional.
I had to use the headless option and add executable_path for testing; you can remove those.
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(executable_path="chromedriver.exe",chrome_options=options)
driver.get('https://www.instagram.com/cocacola')
soup = BeautifulSoup(driver.page_source,'lxml')
#Selecting any span that has a title attribute would give multiple results;
#the follower count is the span nested inside an <a> tag, hence 'a > span[title]'.
followers = soup.select_one('a > span[title]')['title'].replace(',','')
print(followers)
#Output 2598552
You could use a regular expression to get the number.
Try this:
import re
fallowerRegex = re.compile(r'title="((\d){1,3}(,)?)+')
fallowerCount = fallowerRegex.search(str(elements))
result = fallowerCount.group().strip('title="').replace(',','')
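Another option, if you want to skip BeautifulSoup entirely, is to read the title attribute straight from Selenium. A short sketch follows, with the caveat that Instagram's class names and markup change often, so the selector here simply mirrors the one in the first answer and may need adjusting:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.instagram.com/cocacola')
# The follower count lives in the title attribute of a span nested inside an <a>
span = driver.find_element_by_css_selector('a > span[title]')
followers = int(span.get_attribute('title').replace(',', ''))
print(followers)
driver.quit()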

Blank List returned when using XPath with Morningstar Key Ratios

I am trying to pull a piece of data from the Morningstar key ratios page for any given stock using XPath. I have the full path that returns a result in the XPath Helper toolbar add-on for Google Chrome, but when I plug it into my code I get a blank list returned.
How do I get the result that I want returned? Is this even possible? Am I using the wrong approach?
Any help is much appreciated!
Piece of data that I want returned (AMD key ratios example): [screenshot omitted]
My Code:
from urllib.request import urlopen
import os.path
import sys
from lxml import html
import requests
page = requests.get('http://financials.morningstar.com/ratios/r.html?t=AMD&region=USA&culture=en_US')
tree = html.fromstring(page.content)
rev = tree.xpath('/html/body/div[1]/div[3]/div[2]/div[1]/div[1]/div[1]/table/tbody/tr[2]/td[1]')
print(rev)
Result of code:
[]
Desired result from XPath Helper: [screenshot omitted]
Thanks,
Not Euler
This is one of those pages that downloads much of its content in stages. If you look for the item you want after using just requests, you will find that it's not yet available, as shown below.
>>> import requests
>>> url = 'http://financials.morningstar.com/ratios/r.html?t=AMD&region=USA&culture=en_US'
>>> page = requests.get(url).text
>>> '5,858' in page
False
One strategy for processing these pages involves the use of the selenium library. Here, selenium launches a copy of the Chrome browser, loads that url then uses an xpath expression to locate the td element of interest. Finally, the number you want becomes available as the text property of that element.
>>> from selenium import webdriver
>>> driver = webdriver.Chrome()
>>> driver.get(url)
>>> td = driver.find_element_by_xpath('.//th[@id="i0"]/td[1]')
<selenium.webdriver.remote.webelement.WebElement (session="f436b07c27742abb36b262639245801f", element="0.12745670001529863-2")>
>>> td.text
'5,858'
As the content of that page is generated dynamically, you can either go through the process Bill Bell has already shown, or you can grab the page source and apply a CSS selector to it to get the desired value. Here is an alternative to XPath:
from lxml import html
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('http://financials.morningstar.com/ratios/r.html?t=AMD&region=USA&culture=en_US')
tree = html.fromstring(driver.page_source)
driver.quit()
rev = tree.cssselect('td[headers^=Y0]')[0].text
print(rev)
Result:
5,858
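If the dynamically loaded figures aren't present yet by the time page_source is read, an explicit wait is safer than relying on timing. A small sketch, reusing the CSS selector from the answer above:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from lxml import html
driver = webdriver.Chrome()
driver.get('http://financials.morningstar.com/ratios/r.html?t=AMD&region=USA&culture=en_US')
# Wait until the revenue cell has actually been rendered before grabbing the page source
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'td[headers^=Y0]'))
)
tree = html.fromstring(driver.page_source)
driver.quit()
print(tree.cssselect('td[headers^=Y0]')[0].text)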
