Scraping Stack Overflow user data - python-3.x

import requests
from bs4 import BeautifulSoup
import csv

response = requests.get('https://stackoverflow.com/users?page=3&tab=reputation&filter=week').text
soup = BeautifulSoup(response, 'lxml')

for items in soup.select('.user-details'):
    name = items.select("a")[0].text
    location = items.select(".user-location")[0].text
    reputation = items.select(".reputation-score")[0].text
    print(name, location, reputation)
    with open('stackdata.csv', 'a', newline='') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow([name, location, reputation])
When we change the page number in the URL, the output stays the same.

I came across a similar problem. The solution that worked for me was Selenium. Although I used a headless browser (PhantomJS), I assume it should work with other browsers too.
from selenium import webdriver

driver = webdriver.PhantomJS('/home/practice/selenium/webdriver/phantomjs/bin/phantomjs')

users = []
page_num = 1
driver.get('https://stackoverflow.com/users?page={page_num}&tab=reputation&filter=week'.format(page_num=page_num))
content = driver.find_element_by_id('content')
for details in content.find_elements_by_class_name('user-details'):
    users.append(details.text)
print(users)
Change the page_num to get the desired result.
Hope this will help!
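If you need several pages in one run, a minimal sketch (assuming the same Selenium/PhantomJS setup as the snippet above) is to move the driver.get call inside a loop over page_num:

from selenium import webdriver

driver = webdriver.PhantomJS('/home/practice/selenium/webdriver/phantomjs/bin/phantomjs')
users = []
for page_num in range(1, 4):  # pages 1-3; adjust the range as needed
    driver.get('https://stackoverflow.com/users?page={}&tab=reputation&filter=week'.format(page_num))
    content = driver.find_element_by_id('content')
    for details in content.find_elements_by_class_name('user-details'):
        users.append(details.text)
driver.quit()
print(users)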

Related

Trying to create a web scraper for Dell drivers using Python 3 and Beautiful Soup

I am trying to create a web scraper to grab info about Dell drivers from their website. Apparently, the site uses JavaScript to load the driver data onto the page, and I am having difficulty getting the driver info out of it. This is what I have cobbled together so far.
from bs4 import BeautifulSoup
import urllib.request
import json
resp = urllib.request.urlopen("https://www.dell.com/support/home/en-us/product-support/product/precision-15-5520-laptop/drivers")
soup = BeautifulSoup(resp, 'html.parser', from_encoding=resp.info().get_param('charset'))
So far none of these have worked to try and get the data for the drivers:
data = json.loads(soup.find('script', type='text/preloaded').text)
data = json.loads(soup.find('script', type='application/x-suppress').text)
data = json.loads(soup.find('script', type='text/javascript').text)
data = json.loads(soup.find('script', type='application/ld+json').text)
I am not very skilled at Python; I have been looking all over, trying to cobble something together that works. Any assistance to help me get a little further in my endeavor would be greatly appreciated.
You can use selenium:
from selenium import webdriver
import time
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.get('https://www.dell.com/support/home/en-us/product-support/product/precision-15-5520-laptop/drivers')
time.sleep(3)
page = driver.page_source
driver.close()
soup = BeautifulSoup(page,'html5lib')
I was able to get Sushil's answer working on my machine with some minor changes:
from selenium import webdriver
import time
from bs4 import BeautifulSoup
driver = webdriver.Chrome('C:/temp/chromedriver_win32/chromedriver.exe')
driver.get('https://www.dell.com/support/home/en-us/product-support/product/precision-15-5520-laptop/drivers')
time.sleep(3)
page = driver.page_source
driver.close()
soup = BeautifulSoup(page,'html.parser')
results = soup.find(id='downloads-table')
results2 = results.find_all(class_='dl-desk-view')
results3 = results.find_all(class_='details-control sorting_1')
results4 = results.find_all(class_='details-control')
results5 = results.find_all(class_='btn-download-lg btn btn-sm no-break text-decoration-none dellmetrics-driverdownloads btn-outline-primary')
The problem, though, is that this still only gets me 10 out of 79 drivers.
I need a way to list all of the available drivers.
I got it figured out:
from selenium import webdriver
import time
from bs4 import BeautifulSoup
driver = webdriver.Chrome('C:/temp/chromedriver_win32/chromedriver.exe')
driver.get('https://www.dell.com/support/home/en-us/product-support/product/precision-15-5520-laptop/drivers')
time.sleep(3)
driver.find_element_by_xpath("//button[contains(.,'Show all')]").click()
page = driver.page_source
driver.close()
soup = BeautifulSoup(page,'html.parser')
results = soup.find(id='downloads-table')
results2 = results.find_all(class_='dl-desk-view')
results3 = results.find_all(class_='details-control sorting_1')
results4 = results.find_all(class_='details-control')
results5 = results.find_all(class_='btn-download-lg btn btn-sm no-break text-decoration-none dellmetrics-driverdownloads btn-outline-primary')
for results2, results3, results4, results5 in zip(results2, results3, results4, results5):
    print(results2, results3, results4, results5)
I was able to pull the JSON file that has the driver information. This saves a lot of hassle compared with using a web driver or other tricks.
Example for Dell Precision 7760 with Windows 10:
https://www.dell.com/support/driver/en-us/ips/api/driverlist/fetchdriversbyproduct?productcode=precision-17-7760-laptop&oscode=WT64A
(Note: "productcode" and "oscode" parameters.)
In order for this to work, you must send the request header "X-Requested-With" with the value "XMLHttpRequest". If you do not, you will get a "no content" result.
Format the resulting JSON and you should easily see the structure of the results including all of the driver data that you see on the support website.
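As a rough sketch of that request (using the requests library; the productcode and oscode values are just the Precision 7760 example from above):

import requests

url = ('https://www.dell.com/support/driver/en-us/ips/api/driverlist/fetchdriversbyproduct'
       '?productcode=precision-17-7760-laptop&oscode=WT64A')
headers = {'X-Requested-With': 'XMLHttpRequest'}  # without this header the API returns no content

response = requests.get(url, headers=headers)
data = response.json()  # parsed JSON containing the driver list
print(data)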

How can I scrape data that is not in the page source?

scrape.py
# code to scrape the links from the html
from bs4 import BeautifulSoup
import urllib.request

data = open('scrapeFile', 'r')
html = data.read()
data.close()
soup = BeautifulSoup(html, features="html.parser")

# code to extract links
links = []
for div in soup.find_all('div', {'class': 'main-bar z-depth-1'}):
    # print(div.a.get('href'))
    links.append('https://godamwale.com' + str(div.a.get('href')))
print(links)

file = open("links.txt", "w")
for link in links:
    file.write(link + '\n')
    print(link)
I have successfully got the list of links using this code. But when I try to scrape the data from those links, their HTML pages don't contain the source code that holds the data, which makes extraction difficult. I have tried the Selenium driver, but it doesn't work well for me.
I want to scrape the data from the link below, which contains data in HTML sections covering customer details, licence and automation, commercial details, floor-wise details, and operational details. I want to extract the name, location, contact number and type.
https://godamwale.com/list/result/591359c0d6b269eecc1d8933
If someone finds a solution, please share it with me.
Using the developer tools in your browser, you'll notice that whenever you visit that link there is a request to https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933 that returns a JSON response, probably containing the data you're looking for.
Python 2.x:
import urllib2, json
contents = json.loads(urllib2.urlopen("https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933").read())
print contents
Python 3.x:
import urllib.request, json
contents = json.loads(urllib.request.urlopen("https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933").read().decode('UTF-8'))
print(contents)
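Alternatively, if the requests library is available, the same endpoint can be fetched a bit more concisely; a rough sketch with the same URL as above:

import requests

# requests decodes the body and parses the JSON in one step
contents = requests.get("https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933").json()
print(contents)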
Here you go. The main problem with the site seems to be that it takes time to load, which is why it was returning an incomplete page source. You have to wait until the page loads completely; note the time.sleep(8) line in the code below:
from bs4 import BeautifulSoup
from selenium import webdriver
import time

CHROMEDRIVER_PATH = r"C:\Users\XYZ\Downloads\Chromedriver.exe"  # raw string, otherwise \U is an invalid escape
wd = webdriver.Chrome(CHROMEDRIVER_PATH)
wd.get("https://godamwale.com/list/result/591359c0d6b269eecc1d8933")
time.sleep(8)  # wait until the page loads completely
soup = BeautifulSoup(wd.page_source, 'lxml')

props_list = []
propvalues_list = []

div = soup.find_all('div', {'class': 'row'})
for childtags in div[6].findChildren('div', {'class': 'col s12 m4 info-col'}):
    props = childtags.find("span").contents
    props_list.append(props)
    propvalue = childtags.find("p", recursive=True).contents
    propvalues_list.append(propvalue)

print(props_list)
print(propvalues_list)
Note: the code returns the construction details in two separate lists.
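If you would rather have the property names paired with their values instead of two separate lists, a small follow-up sketch (assuming the props_list and propvalues_list built above, whose elements are .contents lists):

# pair each property name with its value; .contents gives lists of nodes, so join them into strings first
details = {
    ' '.join(str(p) for p in props): ' '.join(str(v) for v in values)
    for props, values in zip(props_list, propvalues_list)
}
print(details)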

Web Scraping with BeautifulSoup code review

from bs4 import BeautifulSoup
import requests
import pandas as pd

records = []
keep_looking = True
url = 'https://www.tapology.com/fightcenter'

while keep_looking:
    re = requests.get(url)
    soup = BeautifulSoup(re.text, 'html.parser')
    data = soup.find_all('section', attrs={'class': 'fcListing'})
    for d in data:
        event = d.find('a').text
        date = d.find('span', attrs={'class': 'datetime'}).text[1:-4]
        location = d.find('span', attrs={'class': 'venue-location'}).text
        mainEvent = first.find('span', attrs={'class': 'bout'}).text
    url_tag = soup.find('div', attrs={'class': 'fightcenterEvents'})
    if not url_tag:
        keep_looking = False
    else:
        url = "https://www.tapology.com" + url_tag.find('a')['href']
I am wondering if there are any errors in my code. It is running, but it is taking a very long time to finish and I am afraid it might be stuck in an infinite loop. Any feedback would be helpful. Please do not rewrite all of it, as I would like to keep this format; I am learning and want to improve.
Although this is not the right site to ask for a code review, I'm offering a solution because, from your description, it sounds like you may be falling into an infinite loop.
Try this to get the information from that site. It will run as long as there is a next-page link to traverse; when there is no more new page link to follow, the script will stop automatically.
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests

url = 'https://www.tapology.com/fightcenter'

while True:
    re = requests.get(url)
    soup = BeautifulSoup(re.text, 'html.parser')
    for data in soup.find_all('section', attrs={'class': 'fcListing'}):
        event = data.select_one('.name a').get_text(strip=True)
        date = data.find('span', attrs={'class': 'datetime'}).get_text(strip=True)[:-1]
        location = data.find('span', attrs={'class': 'venue-location'}).get_text(strip=True)
        try:
            mainEvent = data.find('span', attrs={'class': 'bout'}).get_text(strip=True)
        except AttributeError:
            mainEvent = ""
        print(f'{event} {date} {location} {mainEvent}')

    urltag = soup.select_one('.pagination a[rel="next"]')
    if not urltag:
        break  # as soon as there is no next page link, break out of the loop
    url = urljoin(url, urltag.get("href"))  # urljoin saves using a hardcoded prefix
For future reference: questions asking for a review of working code are a better fit for the Code Review Stack Exchange site.
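Since the original code imports pandas and defines a records list without using either, one way to keep the scraped rows is sketched below; it assumes you append a tuple per listing inside the loop above and build a DataFrame once the while-loop ends:

import pandas as pd

records = []
# inside the for-loop above, instead of (or in addition to) the print call:
#     records.append((event, date, location, mainEvent))

# after the while-loop finishes:
df = pd.DataFrame(records, columns=['event', 'date', 'location', 'main_event'])
print(df.head())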

Web crawler keeps saying "no attribute" even though the attribute really exists

I have been developing a web crawler for this website (http://www.bobaedream.co.kr/cyber/CyberCar.php?gubun=I&page=1). But I am having trouble crawling each title of the stock listings. I am pretty sure the attribute exists for the line carinfo_title = carinfo.find_all('a', class_='title').
Please check out the attached code and website code, and then give me any advice.
Thanks.
(Website Code)
https://drive.google.com/open?id=0BxKswko3bYpuRV9seTZZT3REak0
(My code)
from bs4 import BeautifulSoup
import urllib.request

target_url = "http://www.bobaedream.co.kr/cyber/CyberCar.php?gubun=I&page=1"

def fetch_post_list():
    URL = target_url
    res = urllib.request.urlopen(URL)
    html = res.read()
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', class_='cyber')

    # Car info and link
    carinfo = table.find_all('td', class_='carinfo')
    carinfo_title = carinfo.find_all('a', class_='title')
    print(carinfo_title)
    return carinfo_title

fetch_post_list()
You have multiple elements with the carinfo class, and find_all() returns a list of them rather than a single element, so you cannot call find_all() on that result directly. Loop over the result of table.find_all('td', class_='carinfo') and get the car title from each one:
for carinfo in table.find_all('td', class_='carinfo'):
    carinfo_title = carinfo.find('a', class_='title')
    print(carinfo_title.get_text())
Would print:
미니 쿠퍼 S JCW
지프 랭글러 3.8 애니버서리 70주년 에디션
...
벤츠 뉴 SLK200 블루이피션시
포르쉐 뉴 카이엔 4.8 GTS
마쯔다 MPV 2.3
Note that if you need only car titles, you can simplify it down to a single line:
print([elm.get_text() for elm in soup.select('table.cyber td.carinfo a.title')])
where the string inside the .select() method is a CSS selector.

Beautiful Soup Questions

I am trying to scrape the following website; however, I have encountered some problems. As you can see, I am not really familiar with regex, and I hope you can give me some pointers on how to solve this problem.
Basically, I hope I can download all the transaction data into the database. However, I need to retrieve it first.
Thank you
Below is my code:
from bs4 import BeautifulSoup
import requests
import re

url = 'https://bochk.etnet.com.hk/content/bochkweb/eng/quote_transaction_daily_history.php?services=STK&code=6881\
&ip=boc.etnet.com.hk&host=BochkUMSContent&sessionId=44c99b61679e019666f0570db51ad932'

pattern = re.compile('var json_result = (.*?);')

def turnover_detail(url):
    response = requests.get(url)
    html = response.content
    soup = BeautifulSoup(html, "html.parser")
    data = soup.find_all("script")
    for json in data:
        print(json)

turnover_detail(url)
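For what it's worth, the compiled pattern is never actually used in the snippet above. A rough sketch of how it might be applied, reusing url and pattern from above and assuming (unverified) that one of the script tags contains a var json_result = ...; assignment:

import json

response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

for script in soup.find_all("script"):
    match = pattern.search(script.get_text())
    if match:
        # group(1) holds whatever the regex captured between "var json_result = " and ";"
        transactions = json.loads(match.group(1))
        print(transactions)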
