Finding tweet id from parsed html page - python-3.x

I am trying to get tweet id from the parsed HTML. Here is my code:
tweet_ids = []
stat = statnum_parser(page_soup)
name = stat["Full_Name"]
print(page_soup.select("div.tweet"))
for tweet in page_soup.select("div.tweet"):  # doesn't work properly
    if tweet['data-name'] == name:
        tweet_ids.append(tweet['data-tweet-id'])
The if condition checks if the tweet is not retweeted. The for loop does not work properly. Can someone help me?
I am using Selenium and BeautifulSoup.

I figured out the problem. The problem was that I was not using Selenium together with BeautifulSoup properly. Here is the code to correctly get the HTML content of the website:
from selenium import webdriver

path_to_chrome_driver = "path_to_your_chrome_driver"
driver = webdriver.Chrome(executable_path=path_to_chrome_driver)
driver.base_url = "URL of the website"
driver.get(driver.base_url)
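Once the driver has rendered the page, its source can be handed to BeautifulSoup and filtered as in the question. A minimal, self-contained sketch — the sample markup below just mirrors the `data-name`/`data-tweet-id` attributes described in the question; with a live driver the HTML would come from `driver.page_source`:

```python
from bs4 import BeautifulSoup

# Sample markup mirroring the attributes described in the question
html = """
<div class="tweet" data-name="Alice" data-tweet-id="101"></div>
<div class="tweet" data-name="Bob" data-tweet-id="102"></div>
<div class="tweet" data-name="Alice" data-tweet-id="103"></div>
"""
page_soup = BeautifulSoup(html, "html.parser")

name = "Alice"
tweet_ids = []
for tweet in page_soup.select("div.tweet"):
    # Keep only tweets whose data-name matches the profile owner (skips retweets)
    if tweet.get("data-name") == name:
        tweet_ids.append(tweet["data-tweet-id"])

print(tweet_ids)
```

Using `tweet.get("data-name")` instead of `tweet['data-name']` avoids a `KeyError` on tweets that lack the attribute.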

Related

How to fix "businessObject not defined"

I am a newbie to Python and web scraping. To practice, I am just trying to pull some business names from some HTML tags on a website. However, the code does not run and throws an 'object is not defined' error.
from bs4 import BeautifulSoup
import requests
url = 'https://marketplace.akc.org/groomers/?location=Michigan&page=1'
response = requests.get(url, timeout = 5)
content = BeautifulSoup(response.content, "html.parser")
for business in content.find_all('div', attrs={"class": "groomer-salon-card__details"}):
    businessObject = {
        "BusinessName": business.find('h4', attrs={"class": "groomer-salon-card__name"}).text.encode('utf-8')
    }
    print(businessObject)
Expected: I am trying to retrieve the business names from this web page.
Result:
NameError: name 'businessObject' is not defined
When you did
content.find_all('div', attrs={"class": "groomer-salon-card__details"})
you actually got an empty list, because nothing matched.
So when you did
for business in content.find_all('div', attrs={"class": "groomer-salon-card__details"}):
the loop body never ran, and
businessObject
was never created. As mentioned in the comments, that is what led to your error.
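The mechanism is easy to reproduce without any scraping at all: iterating over an empty list never executes the loop body, so a name first assigned inside it is never created.

```python
matches = []  # what find_all() returns when nothing matches

for business in matches:
    businessObject = {"BusinessName": business}  # never executed

try:
    print(businessObject)
except NameError as e:
    print(e)  # name 'businessObject' is not defined
```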
Content is dynamically loaded from elsewhere in the DOM using JavaScript (along with other DOM modifications). You can still use a regex to extract the JavaScript object containing the content used to update the DOM as you saw it in the browser, then parse it with a json parser as follows:
import requests, re, json
url = 'https://marketplace.akc.org/groomers/?location=Michigan&page=1'
response = requests.get(url, timeout = 5)
p = re.compile(r'state: (.*?)\n', re.DOTALL)
data = json.loads(p.findall(response.text)[0])
for listing in data['content']['search_results']['pages']['data']:
    print(listing['organization_name'])
If you view the page source you will see that the DOM is essentially populated dynamically from top to bottom, with mutation observers monitoring progress.

Unable to get the expected html element details using Python

I am trying to scrape a website using Python. I have been able to scrape it, however the expected result is not being fetched. I think it has something to do with the JavaScript of the web page.
My Code below:
driver.get("https://my website")
soup=BeautifulSoup(driver.page_source,'lxml')
all_text = soup.text
ct = all_text.replace('\n', ' ')
cl_text = ct.replace('\t', ' ')
cln_text_t = cl_text.replace('\r', ' ')
cln_text = re.sub(' +', ' ', cln_text_t)
print(cln_text)
Instead of giving me the website details it is giving the below data. Any idea how could I fix this?
html, body {height:100%;margin:0;} You have to enable javascript in your browser to use an application built with Vaadin.........
Why do you need this BeautifulSoup at all? It doesn't seem to support JavaScript.
If you need to get the web page text, you can fetch the document root using the simple XPath selector //html and read the innerText property of the resulting WebElement.
Suggested code change:
driver.get("my website")
root = driver.find_element_by_xpath("//html")
all_text = root.get_attribute("innerText")

Can I get url, that is generated by JavaScript, using Selenium and Python 3?

I am writing a parser using Selenium and Python 3.7 for the following site - https://www.oddsportal.com/soccer/germany/bundesliga/nurnberg-dortmund-fNa2KmU4/
I want to get the URL that is generated by JavaScript, using Selenium in Python 3.
I need to get the URLs for the events from the sites the table data is taken from.
For example, it seems to me that the data in the first row (10Bet) is obtained from this page - https://www.10bet.com/sports/football/germany-1-bundesliga/20190218/nurnberg-vs-dortmund/
How can I get the URL of this page?
To get the URL of all the links on a page, you can store all the elements with tag name a in a WebElement list and then fetch the href attribute of each WebElement.
You can refer to the following code for all the links present in the whole page:
List<WebElement> links = driver.findElements(By.tagName("a")); // store all the link WebElements in a list
for (WebElement ele : links) { // take the URL of each link
    String url = ele.getAttribute("href"); // getAttribute() with "href" returns the link
    System.out.println(url);
}
In case you need a particular link explicitly, you need to pass the XPath of the element:
WebElement ele = driver.findElement(By.xpath("XPath of the element"));
and after that:
String url = ele.getAttribute("href"); // the url of that particular element
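Since the question is about Python 3, the Java snippet above translates to something like the following sketch. It assumes a Selenium `driver` that is already on the target page, and uses the old-style `find_elements_by_tag_name` API seen elsewhere in this thread:

```python
def collect_links(driver):
    """Return the href of every <a> element on the current page."""
    # Gather all anchor elements, then read each one's href attribute
    anchors = driver.find_elements_by_tag_name("a")
    return [a.get_attribute("href") for a in anchors]

# usage with a live driver:
# for url in collect_links(driver):
#     print(url)
```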
Try the code below, which prints the required URLs:
from selenium import webdriver
driver = webdriver.Chrome('C:\\NotBackedUp\\chromedriver.exe')
driver.maximize_window()
driver.get('https://www.oddsportal.com/soccer/germany/bundesliga/nurnberg-dortmund-fNa2KmU4/')
# Locators for locating the required URL's
xpath = "//div[@id='odds-data-table']//tbody//tr"
rows = "//div[@id='odds-data-table']//tbody//tr/td[1]"
print("=> Required URL's is/are : ")
# Fetching & Printing the required URL's
for i in range(1, len(driver.find_elements_by_xpath(rows)), 1):
    anchor = driver.find_elements_by_xpath(xpath + "[" + str(i) + "]/td[2]/a")
    if len(anchor) > 0:
        print(anchor[0].get_attribute('href'))
    else:
        a = driver.find_elements_by_xpath("//div[@id='odds-data-table']//tbody//tr[" + str(i + 1) + "]/td[1]//a[2]")
        if len(a) > 0:
            print(a[0].get_attribute('href'))
print('Done...')
I hope it helps...

How can I scrape data which is not having any of the source code?

scrape.py
# code to scrape the links from the html
from bs4 import BeautifulSoup
import urllib.request
data = open('scrapeFile','r')
html = data.read()
data.close()
soup = BeautifulSoup(html,features="html.parser")
# code to extract links
links = []
for div in soup.find_all('div', {'class': 'main-bar z-depth-1'}):
    # print(div.a.get('href'))
    links.append('https://godamwale.com' + str(div.a.get('href')))
print(links)
file = open("links.txt", "w")
for link in links:
    file.write(link + '\n')
    print(link)
I have successfully got the list of links using this code. But when I try to scrape the data from those links, their HTML pages do not contain the data in the source, which makes extracting it tough. I have used a Selenium driver, but it did not work well for me.
I want to scrape the data from the link below, which contains data in HTML sections: customer details, licence and automation, commercial details, floor-wise details, and operational details. I want to extract the name, location, contact number and type.
https://godamwale.com/list/result/591359c0d6b269eecc1d8933
If someone finds a solution, please share it with me.
Using Developer tools in your browser, you'll notice whenever you visit that link there is a request for https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933 that returns a json response probably containing the data you're looking for.
Python 2.x:
import urllib2, json
contents = json.loads(urllib2.urlopen("https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933").read())
print contents
Python 3.x:
import urllib.request, json
contents = json.loads(urllib.request.urlopen("https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933").read().decode('UTF-8'))
print(contents)
Here you go. The main problem with the site seems to be that it takes time to load, which is why it was returning an incomplete page source; you have to wait until the page loads completely. Notice the time.sleep(8) line in the code below:
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
import time
CHROMEDRIVER_PATH = r"C:\Users\XYZ\Downloads\Chromedriver.exe"
wd = webdriver.Chrome(CHROMEDRIVER_PATH)
wd.get("https://godamwale.com/list/result/591359c0d6b269eecc1d8933")
time.sleep(8)  # wait until the page loads completely
soup = BeautifulSoup(wd.page_source, 'lxml')
props_list = []
propvalues_list = []
div = soup.find_all('div', {'class': 'row'})
for childtags in div[6].findChildren('div', {'class': 'col s12 m4 info-col'}):
    props = childtags.find("span").contents
    props_list.append(props)
    propvalue = childtags.find("p", recursive=True).contents
    propvalues_list.append(propvalue)
print(props_list)
print(propvalues_list)
Note: the code returns the construction details in 2 separate lists.
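A fixed time.sleep(8) either wastes time when the page loads quickly or can still be too short on a slow connection. The more robust pattern is to poll until the content actually appears; Selenium ships this as WebDriverWait, but the idea can be sketched generically (`wait_until` below is a hypothetical helper, not part of any library):

```python
import time

def wait_until(condition, timeout=15, poll=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result  # condition met: hand back whatever it produced
        time.sleep(poll)
    raise TimeoutError("condition not met within %s seconds" % timeout)

# With Selenium this would look like:
# wait_until(lambda: wd.find_elements_by_class_name("row"))
```

The equivalent built-in approach is WebDriverWait(wd, 15).until(...) with an expected condition from selenium.webdriver.support.expected_conditions.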

Web Crawler keeps saying no attribute even though it really has

I have been developing a web-crawler for this website (http://www.bobaedream.co.kr/cyber/CyberCar.php?gubun=I&page=1). But I am having trouble crawling the title of each listing. I am pretty sure the attribute exists for carinfo_title = carinfo.find_all('a', class_='title').
Please check out the attached code and website code, and then give me any advice.
Thanks.
(Website Code)
https://drive.google.com/open?id=0BxKswko3bYpuRV9seTZZT3REak0
(My code)
from bs4 import BeautifulSoup
import urllib.request
target_url = "http://www.bobaedream.co.kr/cyber/CyberCar.php?gubun=I&page=1"
def fetch_post_list():
    URL = target_url
    res = urllib.request.urlopen(URL)
    html = res.read()
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', class_='cyber')
    # Car info and link
    carinfo = table.find_all('td', class_='carinfo')
    carinfo_title = carinfo.find_all('a', class_='title')
    print(carinfo_title)
    return carinfo_title

fetch_post_list()
You have multiple elements with the carinfo class, and for every "carinfo" you need to get the car title. Loop over the result of table.find_all('td', class_='carinfo'):
for carinfo in table.find_all('td', class_='carinfo'):
    carinfo_title = carinfo.find('a', class_='title')
    print(carinfo_title.get_text())
Would print:
미니 쿠퍼 S JCW
지프 랭글러 3.8 애니버서리 70주년 에디션
...
벤츠 뉴 SLK200 블루이피션시
포르쉐 뉴 카이엔 4.8 GTS
마쯔다 MPV 2.3
Note that if you need only car titles, you can simplify it down to a single line:
print([elm.get_text() for elm in soup.select('table.cyber td.carinfo a.title')])
where the string inside the .select() method is a CSS selector.