Is it possible to scrape (crawl) a tooltip after clicking a map marker? - python-3.x

[Problem]
I can't send a left-mouse-click event to the marker to activate its tooltip through Selenium.
[My intention]
Scrape (crawl) the text from the tooltip window on a map marker of this web service with Selenium (Python code).
Daum map web service: http://www.socar.kr/reserve#jeju
<map id="daum.maps.Marker.Area:13u" name="daum.maps.Marker.Area:13u"><area href="javascript:void(0)" alt="" shape="rect" coords="0,0,40,38" title="" style="-webkit-tap-highlight-color: transparent;"></map>
<div class="tooltip myInfoWindow"><h4><a class="map_zone_name" href="#"><em class="map_zone_id" style="display:none;">2390</em><span title="제주대 후문주차장">제주대 후문주차장</span><span class="bg"></span></a></h4><p><a title="제주도 제주시 아라1동 368-60">제주도 제주시 아라1동 368-6...</a><br>운영차량 : 총 <em>4</em>대</p><p class="btn"><em class="map_zone_id" style="display:none;">2390</em><a class="btn_overlay_search" href="#"><img src="/template/asset/images/reservation/btn_able_socar.png" alt="예약가능 쏘카 보기"></a></p><img src="/template/asset/images/reservation/btn_layer_close.png" alt="닫기"></div>
P.S.: Is it also possible to crawl the text of a tooltip window on a Google Maps marker?

When you click a tooltip, an XHR request is sent to https://api.socar.kr/reserve/zone_info using a zone_id. You may have to filter out the zones you want by using the page content. I don't have any more time to spend on this right now, but the following recreates the requests:
import requests
from time import time, sleep

# These params are for https://api.socar.kr/reserve/oneway_zone_list,
# which we can get the zone_ids from.
params = {"type": "start", "_": str(time())}
# We use the zone_id from each dict we parse from the JSON received.
params2 = {"zone_id": ""}

with requests.Session() as s:
    s.get("http://www.socar.kr/reserve#jeju")
    s.headers.update({
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"})
    r = s.get("https://api.socar.kr/reserve/oneway_zone_list", params=params)
    result = r.json()["result"]
    for d in result:
        params2["zone_id"] = d["zone_id"]
        params2["_"] = str(time())
        sleep(1)
        r2 = s.get("https://api.socar.kr/reserve/zone_info", params=params2)
        print(r2.json())
Each d in result is a dict like:
{u'zone_lat': u'37.248859', u'zone_id': u'2902', u'zone_region1_short': u'\uacbd\uae30', u'zone_open_time': u'00:00:00', u'zone_region1': u'\uacbd\uae30\ub3c4', u'zone_close_time': u'23:59:59', u'zone_name': u'SK\ud558\uc774\ub2c9\uc2a4 \uc774\ucc9c', u'open_weekend': u'close', u'zone_region3': u'\ubd80\ubc1c\uc74d', u'zone_region2': u'\uc774\ucc9c\uc2dc', u'zone_lng': u'127.490639', u'zone_addr': u'\uacbd\uae30\ub3c4 \uc774\ucc9c\uc2dc \ubd80\ubc1c\uc74d \uc544\ubbf8\ub9ac 707'}
There is probably other info in there that would let you filter by a specific place; I don't speak Korean, so I can't completely follow how the data relates.
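If you only care about the Jeju zones, a minimal filtering sketch (continuing from result in the snippet above; it assumes the Jeju entries carry "제주" in zone_region1 or zone_addr, which is only a guess based on the sample dict):
# Keep only zones whose region or address mentions Jeju ("제주").
# NOTE: the "제주" substring check is an assumption, not confirmed from the API docs.
jeju_zones = [d for d in result
              if "제주" in d.get("zone_region1", "") or "제주" in d.get("zone_addr", "")]
for d in jeju_zones:
    print(d["zone_id"], d["zone_name"])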
The second request gives us a dict like:
{u'retCode': u'1', u'retMsg': u'', u'result': {u'oper_way': u'\uc655\ubcf5', u'notice': u'<br>\u203b \ubc18\ub4dc\uc2dc \ubc29\ubb38\uc790 \uc8fc\ucc28\uc7a5 \uc9c0\uc815\uc8fc\ucc28\uad6c\uc5ed\uc5d0 \ubc18\ub0a9\ud574\uc8fc\uc138\uc694.<br>', u'notice_oneway': u'', u'zone_addr': u'\uacbd\uae30\ub3c4 \uc774\ucc9c\uc2dc \ubd80\ubc1c\uc74d \uc544\ubbf8\ub9ac 707', u'total_num': 2, u'able_num': 2, u'visit': u'\uc131\uc6b02\ub2e8\uc9c0 \uc544\ud30c\ud2b8 \uae30\uc900 \uc804\ubc29 \ud604\ub300\uc5d8\ub9ac\ubca0\uc774\ud130 \ubc29\uba74\uc73c\ub85c \ud6a1\ub2e8\ubcf4\ub3c4 \uc774\uc6a9 \ud6c4 \ud558\uc774\ub2c9\uc2a4 \uc774\ucc9c \ubc29\ubb38\uc790 \uc8fc\ucc28\uc7a5 \ub0b4 \uc3d8\uce74\uc804\uc6a9\uc8fc\ucc28\uad6c\uc5ed', u'zone_alias': u'\ud558\uc774\ub2c9\uc2a4 \ubc29\ubb38\uc790 \uc8fc\ucc28\uc7a5', u'zone_attr': u'[\uc774\ubca4\ud2b8]', u'state': True, u'link': u'http://blog.socar.kr/4074', u'oper_time': u'00:00~23:59', u'lat': u'37.248859', u'zone_name': u'SK\ud558\uc774\ub2c9\uc2a4 \uc774\ucc9c', u'lng': u'127.490639', u'zone_props': 0, u'visit_link': u'http://dmaps.kr/24ij6', u'zone_id': u'2902'}}
Again, I'm not sure of everything that is in there, but you can see HTML tags under u'notice' and lots of other info.
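If you want that notice as plain text, one option (a sketch, assuming BeautifulSoup is installed) is to parse the fragment and keep only the text:
from bs4 import BeautifulSoup

# Strip the embedded HTML (e.g. <br> tags) from the "notice" field of a zone_info response.
info = r2.json()["result"]
notice_text = BeautifulSoup(info.get("notice", ""), "html.parser").get_text(" ", strip=True)
print(info["zone_name"], "-", notice_text)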

Related

How to extract information using BeautifulSoup from a particular site

My objective is to extract info from the site https://shopopenings.com/merchant-search after entering the pin code of the respective area and copy all the info from there, i.e. whether an outlet is open or closed. There has to be a loop.
This site has an underlying API that you can use to get JSON responses. To find the endpoints and what is expected as request and response, you can use the Firefox or Chrome developer tools under the Network tab.
import json
import requests
SEARCH_ADDRESS = "California City, CA 93505"
urlEndpoint_AutoComplete = "https://shopopenings.com/api/autocomplete"
urlEndpoint_Search = "https://shopopenings.com/api/search"
search_Location = {"type":"address", "searchText":SEARCH_ADDRESS, "language":"us"}
locations = requests.post(urlEndpoint_AutoComplete, data=search_Location)
local = json.loads(locations.text)[0] # get first address
local["place_address"] = local.pop("name") # fix key name for next post request
local["place_id"] = local.pop("id") # fix key name for next post request
local["shopTypes"] = ["ACC", "ARA", "AFS", "AUT", "BTN", "BWL", "BKS", "AAC",
"CEA", "CSV", "DPT", "DIS", "DSC", "DLS", "EQR", "AAF", "GHC", "GRO", "HBM",
"HIC", "AAM", "AAX", "MER", "MOT", "BMV", "BNM", "OSC", "OPT", "EAP", "SHS",
"GSF", "SGS", "TEV", "TOY", "TAT", "DVG", "WHC", "AAW"]
local["range"] = 304.8
local["language"] = "us"
results = requests.post(urlEndpoint_Search, data=local)
print(json.loads(results.text))
{'center': {'latitude': 35.125801, 'longitude': -117.9859038},
'range': '304.8',
'merchants': [{'mmh_id': '505518130',
'latitude': 35.125801,
'longitude': -117.9859,
'shopName': 'Branham M Branham Mtr',
'shopFullAddressString': 'California City, CA',
'isOpen': False,
'nfc': False,
'shopType': 'AUT',
'distance': 0.34636329,
'country': 'USA'},
{'mmh_id': '591581670',
'latitude': 35.125442,
'longitude': -117.986083,
'shopName': 'One Stop Market',
'shopFullAddressString': '7990 California City Blvd, California City, CA 93505-2518',
'isOpen': True,
'nfc': True,
'shopType': 'AFS',
'distance': 43.04766933,
'country': 'USA'},
...
...
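Since the goal was a loop that reports whether each outlet is open or closed, a short follow-up sketch over the merchants list (based only on the keys shown in the response above) could look like:
# Report open/closed status for every merchant returned by the search.
for shop in json.loads(results.text).get("merchants", []):
    status = "open" if shop["isOpen"] else "closed"
    print(f"{shop['shopName']} ({shop['shopFullAddressString']}): {status}")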
I think you can use Selenium to control the navigation and the entering of the pin code, then use BeautifulSoup to work with the page source after your action. Here is the documentation; it's easy enough to get you started.
Selenium -- https://selenium-python.readthedocs.io/
BeautifulSoup -- https://readthedocs.org/projects/beautiful-soup-4/
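A rough sketch of that Selenium + BeautifulSoup flow (the element locator and the fixed wait are placeholders/assumptions; inspect the page for the real selectors):
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

driver = webdriver.Chrome()
driver.get("https://shopopenings.com/merchant-search")

# "searchText" is a placeholder locator -- find the real input name/id in devtools.
search_box = driver.find_element_by_name("searchText")
search_box.send_keys("California City, CA 93505")
search_box.send_keys(Keys.ENTER)

time.sleep(5)  # crude wait; the results are rendered by JavaScript

soup = BeautifulSoup(driver.page_source, "html.parser")
# Parse the rendered result cards from soup here.
driver.quit()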
Enjoy!!

find_all() in BeautifulSoup returns empty ResultSet

I am trying to scrape data from a website to practice web scraping, but find_all() returns an empty ResultSet. How can I resolve this issue?
#importing required modules
import requests,bs4
#sending request to the server
req = requests.get("https://www.udemy.com/courses/search/?q=python")
# checking the status on the request
print(req.status_code)
req.raise_for_status()
#converting using BeautifulSoup
soup = bs4.BeautifulSoup(req.text,'html.parser')
#Trying to scrape the particular div with the class but returning 0
container = soup.find_all('div',class_='popover--popover--t3rNO popover--popover-hover--14ngr')
#trying to print the number of container returned.
print(len(container))
Output :
200
0
See my comment about it being entirely JavaScript-driven content. Modern websites often use JavaScript to make HTTP requests to the server and grab data on demand when needed. If you disable JavaScript here (which you can easily do in Chrome via the settings when you inspect the page), you will see that no text is available on this website, which is probably quite different from IMDB as you pointed out. If you check the BeautifulSoup-parsed HTML, you'll see you don't have any of the page content that is rendered with JavaScript.
There are two ways to get data from a JavaScript-rendered website:
1. Mimic the HTTP request to the server.
2. Use a browser automation package like Selenium.
The first option is better and more efficient; the second option is more brittle and not great for larger data sets.
Fortunately, Udemy is getting the data you want from an API endpoint, which it makes HTTP requests to with JavaScript; the response is then fed back to the browser.
Code Example
import requests
cookies = {
'__udmy_2_v57r': '4f711b308da548b49394854a189d3179',
'ud_firstvisit': '2020-05-29T13:48:56.584511+00:00:1jefNY:9F1BJVEUJpv7gmNPgYNini76UaE',
'existing_user': 'true',
'optimizelyEndUserId': 'oeu1590760136407r0.2130390415126655',
'EUCookieMessageShown': 'true',
'_ga': 'GA1.2.1359933509.1590760142',
'_pxvid': '26d89ed1-a1b3-11ea-9179-cb750fa4136b',
'_ym_uid': '1585144165890161851',
'_ym_d': '1590760145',
'__ssid': 'd191bc02a1063fd2c75fbab525ededc',
'stc111655': 'env:1592304425%7C20200717104705%7C20200616111705%7C1%7C1014616:20210616104705|uid:1590760145861.374775813.04725504.111655.1839745362:20210616104705|srchist:1069270%3A1%3A20200629134905%7C1014624%3A1592252104%3A20200716201504%7C1014616%3A1592304425%3A20200717104705:20210616104705|tsa:0:20200616111705',
'ki_t': '1590760146239%3B1592304425954%3B1592304425954%3B3%3B5',
'ki_r': 'aHR0cHM6Ly93d3cuZ29vZ2xlLmNvbS8%3D',
'IR_PI': '00aea1e6-9da9-11ea-af3a-42010a24660a%7C1592390825988',
'_gac_UA-12366301-1': '1.1592304441.CjwKCAjw26H3BRB2EiwAy32zhfcltNEr_HHFK5JRaJar5qxUn4ifG9FVFctWyTUXigNZvKeOCz7PgxoCAfAQAvD_BwE',
'csrftoken': 'pPOdtdbH0HPaHvDfAZMzEOdvWqKZuQWufu8dUrEeXuy5mOOrnFRbWZ9vq8Dfd2ts',
'__cfruid': 'f1963d736e3891a2e307ebc9f918c89065ffe40f-1596962093',
'__cfduid': 'df4d951c87bc195c73b2f12b5e29568381597085850',
'ud_cache_price_country': 'GB',
'ud_cache_device': 'desktop',
'ud_cache_language': 'en',
'ud_cache_logged_in': '0',
'ud_cache_release': '0804b40d37e001f97dfa',
'ud_cache_modern_browser': '1',
'ud_cache_marketplace_country': 'GB',
'ud_cache_brand': 'GBen_US',
'ud_cache_version': '1',
'ud_cache_user': '',
'seen': '1',
'eventing_session_id': '66otW5O9TQWd5BYq1_etrA-1597087737933',
'ud_cache_campaign_code': '',
'exaff': '%7B%22start_date%22%3A%222020-08-09T08%3A52%3A04.083577Z%22%2C%22code%22%3A%22_7fFXpljNdk-m3_OJPaWBwAQc5gVKutaSg%22%2C%22merchant_id%22%3A39197%2C%22aff_type%22%3A%22LS%22%2C%22aff_id%22%3A60680%7D:1k5D3W:2PemPLTm4xaHixBYRvRyBaAukL4',
'evi': 'SlFfLh4RBzwTSVBjXFdHehNJUGMYQE99HVFdIExYQ3gARVY8QkAWIEEDCXsVQEd0BEsJexVAA24LQgdjGANXdgZBG3ETH1luRBdHKBoHV3ZKURl5XVBXdkpRXWNUU1luRxIJe1lTQXhMDgdjHRAFbgsICXNWVk1uCwgJN0xYRGATBUpjVFVEdAEOB2NcWkR+E0lQYxhAT30dUV0gTFhCfAhDVm1MUEJ0B1EROkwUV3YAXwk3D0BPewFAHzxCQEd0BUcJexVAA24LQgdjGANXdgZCHHETTld+BkUdY1QZVzoTSRptTBQUbgtFEnleHwhgEwBcY1QZV34HShtjVBlXOhNJE21MFBRuC0UceV4fWW4DSxh3TFgObkdREXBCQAMtE0kccFtUCGATQR54VkBPNxMFCXtfTlc6UFERd1tUTTEdURlzX1JXdkpRXWNUU1luRxIJe1tXQnpMXwlzVldDbgsICTdMWEdgEwVKY1RVRHUJDgdjXFdCdBNJUGMYQE99HVFdIExYQ3kCQ1Y8Ew==',
'ud_rule_vars': 'eJyFjkuOwyAQBa9isZ04agyYz1ksIYxxjOIRGmhPFlHuHvKVRrPItvWqus4EXT4EDJP9jSViyobPktKRgZqc4GrkmmmuBHdU6YlRqY1P6RgDMQ05D2SOueCDtZPDMNT7QDrooAXRdrqhzHBlRL8XUjPgXwAGYCC7ulpdRX3acglPA8bvPwbVgm6g4p0Bvqeyhsh_BkybXyxmN8_R21J9vvpcjm5cn7ZDTidc7G2xxnvlm87hZwvlU7wE2VP1en0hlyuoG10j:1k5D3W:nxRv-tyLU7lxhsF2jRYvkJA53uM',
}
headers = {
'authority': 'www.udemy.com',
'x-udemy-cache-release': '0804b40d37e001f97dfa',
'x-udemy-cache-language': 'en',
'x-udemy-cache-user': '',
'x-udemy-cache-modern-browser': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36',
'accept': 'application/json, text/plain, */*',
'x-udemy-cache-brand': 'GBen_US',
'x-udemy-cache-version': '1',
'x-requested-with': 'XMLHttpRequest',
'x-udemy-cache-logged-in': '0',
'x-udemy-cache-price-country': 'GB',
'x-udemy-cache-device': 'desktop',
'x-udemy-cache-marketplace-country': 'GB',
'x-udemy-cache-campaign-code': '',
'sec-fetch-site': 'same-origin',
'sec-fetch-mode': 'cors',
'sec-fetch-dest': 'empty',
'referer': 'https://www.udemy.com/courses/search/?q=python',
'accept-language': 'en-US,en;q=0.9',
}
params = (
('q', 'python'),
('skip_price', 'false'),
)
response = requests.get('https://www.udemy.com/api-2.0/search-courses/', headers=headers, params=params, cookies=cookies)
ids = []
titles = []
durations = []
ratings = []
for a in response.json()['courses']:
    title = a['title']
    duration = int(a['estimated_content_length']) / 60
    rating = a['rating']
    id = str(a['id'])
    titles.append(title)
    ids.append(id)
    durations.append(duration)
    ratings.append(rating)
clean_ids = ','.join(ids)
params2 = (
('course_ids', clean_ids),
('fields/[pricing_result/]', 'price,discount_price,list_price,price_detail,price_serve_tracking_id'),
)
response = requests.get('https://www.udemy.com/api-2.0/pricing/', params=params2)
data = response.json()['courses']
prices = []
for a in ids:
    price = response.json()['courses'][a]['price']['amount']
    prices.append(price)
data = zip(titles, durations, ratings, prices)
for a in data:
    print(a)
Output
('Learn Python Programming Masterclass', 56.53333333333333, 4.54487, 14.99)
('The Python Mega Course: Build 10 Real World Applications', 25.3, 4.51476, 16.99)
('Python for Beginners: Learn Python Programming (Python 3)', 2.8833333333333333, 4.4391, 17.99)
('The Python Bible™ | Everything You Need to Program in Python', 9.15, 4.64238, 17.99)
('Python for Absolute Beginners', 3.066666666666667, 4.42209, 14.99)
('The Modern Python 3 Bootcamp', 30.3, 4.64714, 16.99)
('Python for Finance: Investment Fundamentals & Data Analytics', 8.25, 4.52908, 12.99)
('The Complete Python Course | Learn Python by Doing', 35.31666666666667, 4.58885, 17.99)
('REST APIs with Flask and Python', 17.033333333333335, 4.61233, 12.99)
('Python for Financial Analysis and Algorithmic Trading', 16.916666666666668, 4.53173, 12.99)
('Python for Beginners with Examples', 4.25, 4.27316, 12.99)
('Python OOP : Four Pillars of OOP in Python 3 for Beginners', 2.6166666666666667, 4.46451, 12.99)
('Python Bootcamp 2020 Build 15 working Applications and Games', 32.13333333333333, 4.2519, 14.99)
('The Complete Python Masterclass: Learn Python From Scratch', 32.36666666666667, 4.39151, 16.99)
('Learn Python MADE EASY : A Concise Python Course in Python 3', 2.1166666666666667, 4.76601, 12.99)
('Complete Python Web Course: Build 8 Python Web Apps', 15.65, 4.37577, 13.99)
('Python for Excel: Use xlwings for Data Science and Finance', 16.116666666666667, 4.92293, 12.99)
('Python 3 Network Programming - Build 5 Network Applications', 12.216666666666667, 4.66143, 12.99)
('The Complete Python & PostgreSQL Developer Course', 21.833333333333332, 4.5664, 12.99)
('The Complete Python Programmer Bootcamp 2020', 13.233333333333333, 4.63859, 12.99)
Explanation
There are two ways to do this; re-engineering the requests, shown here, is the more efficient solution. To get the necessary information, you'll need to inspect the page and look at which HTTP requests give which information. You can do this through the network tools --> XHR when you inspect the page. You can see there are two requests that give you information. My suggestion would be to look at the previews of the responses on the right-hand side when you select a request. The first gives you the title, duration, price and ratings; for the second request you need the IDs of the courses to get the prices of the courses.
I usually copy the cURL of the HTTP requests the JavaScript invokes into curl.trillworks.com, which converts the necessary headers, parameters and cookies to Python format.
In the first request, headers, cookies and parameters are required. The second request only requires the parameters.
The response you get is a JSON object; response.json() converts it into a Python dictionary. You have to do a bit of digging in this dictionary to get what you want, but for each item in response.json()['courses'] all the necessary data for each 'card' on the website is there. So we run a for loop around where the data sits in the dictionary we've created. I would play around with response.json() till you get a feel for what the object gives you, to understand the code.
The duration comes in minutes, so I've done a quick conversion to hours here. Also, the IDs need to be strings, because in the second request we use them as parameters to get the necessary prices for the courses. We join the IDs into a comma-separated string and feed that as a parameter.
The second request then gives us the necessary prices. Again, you have to go digging in the dictionary object, and I suggest you do this yourself to confirm that the price is nested in there.
We zip the data up to combine all the lists, and then a for loop prints it all. You could feed this into pandas if you wanted, etc.
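For example, if you did want pandas, a minimal sketch (assuming pandas is installed) that turns the same lists into a DataFrame:
import pandas as pd

# Build a DataFrame from the lists collected above and sort by rating.
df = pd.DataFrame(list(zip(titles, durations, ratings, prices)),
                  columns=["title", "hours", "rating", "price"])
print(df.sort_values("rating", ascending=False).head())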
To get the required data you need to send requests to the appropriate API. For that you need to create a Session:
import requests

s = requests.Session()
cookies = s.get('https://www.udemy.com').cookies
headers = {"Referer": "https://www.udemy.com/courses/search/?q=python&skip_price=false"}

for page_counter in range(1, 500):
    data = s.get('https://www.udemy.com/api-2.0/search-courses/?p={}&q=python&skip_price=false'.format(page_counter), cookies=cookies, headers=headers).json()
    for course in data['courses']:
        params = {'course_ids': [str(course['id']), ],
                  'fields/[pricing_result/]': ['price', ]}
        title = course['title']
        price = s.get('https://www.udemy.com/api-2.0/pricing/', params=params, cookies=cookies).json()['courses'][str(course['id'])]['price']['amount']
        print({'title': title, 'price': price})

Can't scrape Google search results with BeautifulSoup

I want to scrape Google search results, but whenever I try to do so, the program returns an empty list.
from bs4 import BeautifulSoup
import requests
keyWord = input("Input Your KeyWord :")
url = f'https://www.google.com/search?q={keyWord}'
src = requests.get(url).text
soup = BeautifulSoup(src, 'lxml')
container = soup.findAll('div', class_='g')
print(container)
Complementing Andrej Kesely's answer: if you're getting empty results, you can always climb one div up or down to test and go from there.
Code (say you want to scrape title, summary and link):
from bs4 import BeautifulSoup
import requests
import json
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('https://www.google.com/search?q=ice cream',
                    headers=headers).text
soup = BeautifulSoup(html, 'lxml')

summary = []

for container in soup.findAll('div', class_='tF2Cxc'):
    heading = container.find('h3', class_='LC20lb DKV0Md').text
    article_summary = container.find('span', class_='aCOpRe').text
    link = container.find('a')['href']
    summary.append({
        'Heading': heading,
        'Article Summary': article_summary,
        'Link': link,
    })
print(json.dumps(summary, indent=2, ensure_ascii=False))
Portion of output:
[
{
"Heading": "Ice cream - Wikipedia",
"Article Summary": "Ice cream (derived from earlier iced cream or cream ice) is a sweetened frozen food typically eaten as a snack or dessert. It may be made from dairy milk or cream and is flavoured with a sweetener, either sugar or an alternative, and any spice, such as cocoa or vanilla.",
"Link": "https://en.wikipedia.org/wiki/Ice_cream"
},
{
"Heading": "Jeni's Splendid Ice Creams",
"Article Summary": "Jeni's Splendid Ice Cream, built from the ground up with superlative ingredients. Order online, visit a scoop shop, or find the closest place to buy Jeni's near you.",
"Link": "https://jenis.com/"
}
]
Alternatively, you can do this using Google Search Engine Results API from SerpApi. It's a paid API with a free trial.
import os
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "q": "ice cream",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(f"Title: {result['title']}\nSummary: {result['snippet']}\nLink: {result['link']}\n")
Portion of the output:
Title: Ice cream - Wikipedia
Summary: Ice cream (derived from earlier iced cream or cream ice) is a sweetened frozen food typically eaten as a snack or dessert. It may be made from dairy milk or cream and is flavoured with a sweetener, either sugar or an alternative, and any spice, such as cocoa or vanilla.
Link: https://en.wikipedia.org/wiki/Ice_cream
Title: 6 Ice Cream Shops to Try in Salem, Massachusetts ...
Summary: 6 Ice Cream Shops to Try in Salem, Massachusetts · Maria's Sweet Somethings, 26 Front Street · Kakawa Chocolate House, 173 Essex Street · Melt ...
Link: https://www.salem.org/icecream/
Title: Melt Ice Cream - Salem
Summary: Homemade ice cream made on-site in Salem, MA. Bold innovative flavors, exceptional customer service, local ingredients.
Link: https://meltsalem.com/
Disclaimer, I work for SerpApi.
To get the correct result page from Google, specify a User-Agent HTTP header. For English-only results, put the hl=en parameter in the URL:
from bs4 import BeautifulSoup
import requests
keyWord = input("Input Your KeyWord :")
url = f'https://www.google.com/search?hl=en&q={keyWord}'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'}
src = requests.get(url, headers=headers).text
soup = BeautifulSoup(src, 'lxml')
containers = soup.findAll('div', class_='g')
for c in containers:
    print(c.get_text(strip=True, separator=' '))

Web scraping : not able to scrape text and href for a given div, class and to skip the span tag

I am trying to get the text and href for the top news but am not able to scrape them.
website : News site
My code:
import requests
from bs4 import BeautifulSoup
import psycopg2
import time
def checkResponse(url):
    response = requests.get(url)
    if response.status_code == 200:
        return response.content
    else:
        return None

def getTitleURL():
    url = 'http://sandesh.com/'
    response = checkResponse(url)
    if response is not None:
        html = BeautifulSoup(response, 'html.parser')
        for values in html.find_all('div', class_='d-top-news-latest'):
            headline = values.find(class_='d-s-NSG-regular').text
            url = values.find(class_='d-s-NSG-regular')['href']
            print(headline + "->" + url)

if __name__ == '__main__':
    print('Getting the list of names....')
    names = getTitleURL()
    print('... done.\n')
Output:
Getting the list of names....
Corona live
મેડિકલ સ્ટાફ પર હુમલા અંગે અમિત શાહે ડોક્ટર્સ સાથે કરી ચર્ચા, સુરક્ષાની ખાતરી આપતા કરી અપીલ
Ahmedabad
ગુજરાતમાં કૂદકેને ભૂસકે વધ્યો કોરોના વાયરસનો કહેર, આજે નવા 94 કેસ નોંધાયા, જાણો કયા- કેટલા કેસ નોંધાયા
Corona live
જીવન અને મોત વચ્ચે સંઘર્ષ કરી રહ્યો છે દુનિયાનો સૌથી મોટો તાનાશાહ કિમ જોંગ! ટ્રમ્પે કહી આ વાત
Ahmedabad
અમદાવાદમાં નર્સિંગ સ્ટાફનો ગુસ્સો ફૂટ્યો, ‘અમારું કોઈ સાંભળતું નથી, અમારો કોરોના ટેસ્ટ જલદી કરાવો’
Business
ભારતીય ટેલિકોમ જગતમાં સૌથી મોટી ડીલ, ફેસબુક બની જિયોની સૌથી મોટી શેરહોલ્ડર
->http://sandesh.com/amit-shah-talk-with-ima-and-doctors-through-video-conference-on-attack/
... done.
I want to skip the text inside the span tag, and I am only able to get 1 href. Also, the headline is a list.
How do I get each title and URL?
I am trying to scrape the part in red:
First, at for values in html.find_all('div', class_='d-top-news-latest') you don't need to use for, because the DOM has just one element with the class d-top-news-latest.
Second, to get the title, you can use select('span'), because your title is inside the span tag.
Third, you know the headline is a list, so you need to use for to get each title and URL.
values = html.find('div', class_='d-top-news-latest')
for i in values.find_all('a', href=True):
    print(i.select('span'))
    print(i['href'])
OUTPUT
Getting the list of names....
[<span>
Corona live
</span>]
http://sandesh.com/maharashtra-home-minister-anil-deshmukh-issue-convicts-list-of-
palghar-case/
[<span>
Corona live
</span>]
http://sandesh.com/two-doctors-turn-black-after-treatment-of-coronavirus-in-china/
[<span>
Corona live
</span>]
http://sandesh.com/bihar-asi-gobind-singh-suspended-for-holding-home-guard-jawans-
after-stopping-officers-car-asi/
[<span>
Ahmedabad
</span>]
http://sandesh.com/jayanti-ravi-surprise-statement-sparks-outcry-big-decision-taken-
despite-more-patients-in-gujarat/
[<span>
Corona live
</span>]
http://sandesh.com/amit-shah-talk-with-ima-and-doctors-through-video-conference-on-
attack/
... done.
to remove the "span" part:
values = html.find('div', class_='d-top-news-latest')
for i in values.find_all('a', href=True):
    i.span.decompose()
    print(i.text)
    print(i['href'])
Output:
Getting the list of names....
ગુજરાતમાં કોરોનાનો કહેરઃ રાજ્યમાં આજે કોરોનાના 135 નવા કેસ, વધુ 8 લોકોનાં મોત
http://sandesh.com/gujarat-corona-update-206-new-cases-and-18-deaths/
ચીનના વૈજ્ઞાનિકોએ જ ખોલી જીનપિંગની પોલ, કોરોના વાયરસને લઈને કર્યો સનસની ખુલાસો
http://sandesh.com/chinese-scientists-claim-over-corona-virus/
શું લોકડાઉન ફરી વધારાશે? PM મોદી 27મીએ ફરી એકવાર તમામ CM સાથે કરશે ચર્ચા
http://sandesh.com/pm-modi-to-hold-video-conference-with-cms-on-april-27-lockdown-
extension/
કોરોના વાયરસને લઈ મોટી ભવિષ્યવાણી, દુનિયાના 30 દેશો પર ઉભુ થશે ભયંકર સંકટ
http://sandesh.com/after-corona-attack-now-hunger-will-kill-many-people-in-the-world/
દેશમાં 24 કલાકમાં 1,486 કોરોનાનાં નવા કેસ, પરંતુ મળ્યા સૌથી મોટા રાહતનાં સમાચાર
http://sandesh.com/recovery-rate-increased-in-corona-patients-says-health-ministry/
... done.

Is there a way to iterate over a list to add into the selenium code?

I am trying to iterate over a large list of dealership names and cities. I want it to refer back to the list, loop over each entry, and get the results separately.
# This is only a portion of the dealers; the rest are in a file
Dealers= ['Mossy Ford', 'Abel Chevrolet Pontiac Buick', 'Acura of Concord', 'Advantage Audi' ]
driver=webdriver.Chrome("C:\\Users\\kevin\\Anaconda3\\chromedriver.exe")
driver.set_page_load_timeout(30)
driver.get("https://www.bbb.org/")
driver.maximize_window()
driver.implicitly_wait(10)
driver.find_element_by_xpath("/html/body/div[1]/div/div/div/div[2]/div[1]/div/div[2]/div/form/div[2]/div[2]/button").click()
driver.find_element_by_xpath("""//*[#id="findTypeaheadInput"]""").send_keys("Mossy Ford")
driver.find_element_by_xpath("""//*[#id="nearTypeaheadInput"]""").send_keys("San Diego, CA")
driver.find_element_by_xpath("""/html/body/div[1]/div/div/div/div[2]/div[1]/div/div[2]/div/form/div[2]/button""").click()
driver.implicitly_wait(10)
driver.find_element_by_xpath("/html/body/div[1]/div/div/div/div[2]/div/div[2]/div[2]/div[1]/div[6]/div").click()
driver.implicitly_wait(10)
driver.find_element_by_xpath('/html/body/div[1]/div/div/div/div[2]/div[2]/div/div/div[1]/div/div[2]/div/div[2]/a').click()
#contact_names= driver.find_elements_by_xpath('/html/body/div[1]/div/div/div/div[2]/div/div[5]/div/div[1]/div[1]/div/div/ul[1]')
#print(contact_names)
#print("Query Link: ", driver.current_url)
#driver.quit()
from selenium import webdriver
dealers= ['Mossy Ford', 'Abel Chevrolet Pontiac Buick', 'Acura of Concord']
cities = ['San Diego, CA', 'Rio Vista, CA', 'Concord, CA']
driver=webdriver.Chrome("C:\\Users\\kevin\\Anaconda3\\chromedriver.exe")
driver.set_page_load_timeout(30)
driver.get("https://www.bbb.org/")
driver.maximize_window()
driver.implicitly_wait(10)
driver.find_element_by_xpath("/html/body/div[1]/div/div/div/div[2]/div[1]/div/div[2]/div/form/div[2]/div[2]/button").click()
for d in dealers:
    driver.find_element_by_xpath("""//*[@id="findTypeaheadInput"]""").send_keys("dealers")
    for c in cities:
        driver.find_element_by_xpath("""//*[@id="nearTypeaheadInput"]""").send_keys("cities")
        driver.find_element_by_xpath("""/html/body/div[1]/div/div/div/div[2]/div[1]/div/div[2]/div/form/div[2]/button""").click()
        driver.implicitly_wait(10)
        driver.find_element_by_xpath("/html/body/div[1]/div/div/div/div[2]/div/div[2]/div[2]/div[1]/div[6]/div").click()
        driver.implicitly_wait(10)
        driver.find_element_by_xpath('/html/body/div[1]/div/div/div/div[2]/div[2]/div/div/div[1]/div/div[2]/div/div[2]/a').click()
        contact_names = driver.find_elements_by_class_name('styles__UlUnstyled-sc-1fixvua-1 ehMHcp')
        print(contact_names)
        print("Query Link: ", driver.current_url)
driver.quit()
I want to be able to go to each of these different dealerships' pages, pull all of their details, and then loop through the rest. I am just struggling with the idea of for loops within Selenium.
It's better to create a dictionary mapping each dealer to its city and loop through it:
Dealers_Cities_Dict = {
    "Mossy Ford": "San Diego, CA",
    "Abel Chevrolet Pontiac Buick": "City",
    "Acura of Concord": "City",
    "Advantage Audi": "City"
}

for dealer, city in Dealers_Cities_Dict.items():
    # This is where the code sits
    driver.find_element_by_id("findTypeaheadInput").send_keys(dealer)
    driver.find_element_by_id("nearTypeaheadInput").send_keys(city)
