I'm trying to run the code below to take a list of URLs (in the file out.txt) and pull the text content from each page using XPath. The code finds the domain from the URL, then looks up the domain in a JSON file I've created that contains both the domain and the XPath. The XPath is then used to find the content.
However, if I run the code outside of the loop it works fine (page = 200), but inside the loop I get page = 404.
I'm sure this is a simple mistake with the loop. What am I doing wrong?
URLList = open("out.txt").readlines()
for item in URLList:
    inputurl = item
    print(inputurl)
    type(inputurl)
    # this takes a URL and finds the xpath - it uses an external
    # domainlookup.json that is manually created
    # inputurl = input("PLEASE PROVIDE A URL FROM AN APPROVED DOMAIN: ")
    t = urlparse(inputurl).netloc
    domain = ('.'.join(t.split('.')[1:]))
    with open('domainlookup.json') as json_data:
        domainlookup = json.load(json_data)
    for i in domainlookup:
        if i['DOMAIN'] == domain:
            xpath = (i['XPATH'])
    # this requests the xpath from the URL and scrapes the text content
    page = requests.get(inputurl)
    tree = html.fromstring(page.content)
    content = tree.xpath(xpath)
You'll find what's wrong with your code by running this:
URLList = open("out.txt").readlines()
for item in URLList:
    inputurl = item
    print("[{0}]".format(inputurl))
As you can see from the output, you're not removing the newline character from the URL, which is why requests can't load it later. Just strip() it before using it:
inputurl = item.strip()
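Putting it all together, a minimal sketch of the corrected loop (the imports are assumed from the calls the original snippet makes; they are not shown in the question):
from urllib.parse import urlparse
import json
import requests
from lxml import html

URLList = open("out.txt").readlines()
for item in URLList:
    inputurl = item.strip()  # strip the trailing newline so requests gets a clean URL
    domain = '.'.join(urlparse(inputurl).netloc.split('.')[1:])
    with open('domainlookup.json') as json_data:
        domainlookup = json.load(json_data)
    for i in domainlookup:
        if i['DOMAIN'] == domain:
            xpath = i['XPATH']
    page = requests.get(inputurl)  # should now return 200 instead of 404
    tree = html.fromstring(page.content)
    content = tree.xpath(xpath)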
Beginner here.
I am creating a web scraper that will requests.get() multiple pages, for instance the pages of a search result.
I've made a list called pages and then run a loop that requests.get()s all the pages. Essentially I want a function that auto-generates a list of URLs but changes part of the URL each time the loop repeats.
Roughly like this:
url_base = "www.page.com/page"
pnumber = 1
url = url_base + str(pnumber)  # pnumber has to be converted to a string before concatenation
But every time the loop repeats, 1 is added to pnumber, so I end up with a list like this:
pages = [
    'www.page.com/page1',
    'www.page.com/page2',
]
Use range
Ex:
url_base = "www.page.com/page{}"
for i in range(10):
    url = url_base.format(i)
    print(url)
Output:
www.page.com/page0
www.page.com/page1
www.page.com/page2
www.page.com/page3
www.page.com/page4
www.page.com/page5
www.page.com/page6
www.page.com/page7
www.page.com/page8
www.page.com/page9
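If you actually want the URLs collected in a list like pages rather than just printed, a list comprehension does the same thing; the range of 1 to 10 below is just an assumption to match the page1/page2 example:
url_base = "www.page.com/page{}"

# Build the whole list in one pass; range(1, 11) starts at page1
pages = [url_base.format(i) for i in range(1, 11)]
print(pages[:2])  # ['www.page.com/page1', 'www.page.com/page2']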
I'm writing a parser using Selenium and Python 3.7 for the following site - https://www.oddsportal.com/soccer/germany/bundesliga/nurnberg-dortmund-fNa2KmU4/
How can I get a URL that is generated by JavaScript, using Selenium in Python 3?
I need to get the URLs of the events from the sites the data in the table is taken from.
For example, it seems to me that the data in the first row (10Bet) is obtained from this page - https://www.10bet.com/sports/football/germany-1-bundesliga/20190218/nurnberg-vs-dortmund/
How can I get the URL of this page?
To get the URLs of all the links on a page, you can store all the elements with tag name a in a WebElement list and then fetch the href attribute of each WebElement.
You can refer to the following code to get all the links present in the whole page:
List<WebElement> links = driver.findElements(By.tagName("a")); // This will store all the link WebElements into a list
for (WebElement ele : links) { // This way you can take the URL of each link
    String url = ele.getAttribute("href"); // To get the link you can use getAttribute() with "href" as an argument
    System.out.println(url);
}
In case you need a particular link explicitly, you need to pass the XPath of the element:
WebElement ele = driver.findElement(By.xpath("Pass the xpath of the element"));
and after that you need to do this:
String url = ele.getAttribute("href"); // to get the url of the particular element
Try the code below, which will print the required URLs as per your requirement:
from selenium import webdriver

driver = webdriver.Chrome('C:\\NotBackedUp\\chromedriver.exe')
driver.maximize_window()
driver.get('https://www.oddsportal.com/soccer/germany/bundesliga/nurnberg-dortmund-fNa2KmU4/')

# Locators for locating the required URL's
xpath = "//div[@id='odds-data-table']//tbody//tr"
rows = "//div[@id='odds-data-table']//tbody//tr/td[1]"

print("=> Required URL's is/are : ")
# Fetching & Printing the required URL's
for i in range(1, len(driver.find_elements_by_xpath(rows)), 1):
    anchor = driver.find_elements_by_xpath(xpath + "[" + str(i) + "]/td[2]/a")
    if len(anchor) > 0:
        print(anchor[0].get_attribute('href'))
    else:
        a = driver.find_elements_by_xpath("//div[@id='odds-data-table']//tbody//tr[" + str(i + 1) + "]/td[1]//a[2]")
        if len(a) > 0:
            print(a[0].get_attribute('href'))
print('Done...')
I hope it helps...
I'm running into some trouble trying to enter and analyze several items within a page.
I have a certain page which contains items; the code looks something like this:
class Spider(CrawlSpider):
    name = 'spider'
    maxId = 20
    allowed_domain = ['www.domain.com']
    start_urls = ['http://www.startDomain.com']
In the start URL I have some items that all follow, in XPath, the following path (within the startDomain):
def start_requests(self):
    for i in range(self.maxId):
        yield Request('//*[@id="result_{0}"]/div/div/div/div[2]/div[1]/div[1]/a/h2'.format(i), callback=self.parse_item)
I'd like to find a way to access each of these links (the ones tied to result_{number}) and then scrape the contents of that particular item.
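A minimal sketch of one way this could work: Request() expects a URL, not an XPath, so the usual pattern is to select the result_{number} anchors in parse() and follow their href values. The selector below is only an assumption based on the path shown in the question, not the site's real markup:
import scrapy

class Spider(scrapy.Spider):
    name = 'spider'
    maxId = 20
    start_urls = ['http://www.startDomain.com']

    def parse(self, response):
        # Select the anchors under each result_{number} container and
        # follow their hrefs instead of passing an XPath to Request().
        for i in range(self.maxId):
            hrefs = response.xpath('//*[@id="result_{0}"]//a/@href'.format(i)).getall()
            for href in hrefs:
                yield response.follow(href, callback=self.parse_item)

    def parse_item(self, response):
        # scrape the contents of the individual item page here
        yield {'title': response.xpath('//h2/text()').get()}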
I have tried the code given below, but every time I run it some links are missing. I want to get all the links on the page into a list so that I can go to any link I want using slicing.
links = []
eles = driver.find_elements_by_xpath("//*[@href]")
for elem in eles:
    url = elem.get_attribute('href')
    print(url)
    links.append(url)
Is there any way to get all the elements without missing any?
Sometimes the links reside inside frames.
Search for frames in your website using inspect; if there are any, you need to switch to the frame first:
driver.switch_to.frame("x1")  # switch into the frame before collecting its links
links = []
eles = driver.find_elements_by_xpath("//*[@href]")
for elem in eles:
    url = elem.get_attribute('href')
    print(url)
    links.append(url)
driver.switch_to.default_content()  # switch back to the main document afterwards
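If you don't know in advance which frames contain links, a rough sketch of sweeping every iframe on the page (the frame name "x1" above was just an example; this version finds frames generically and does not descend into nested frames):
def collect_links(driver):
    # hrefs in the top-level document
    links = [e.get_attribute('href') for e in driver.find_elements_by_xpath("//*[@href]")]
    # plus hrefs inside every iframe on the page
    for frame in driver.find_elements_by_tag_name("iframe"):
        driver.switch_to.frame(frame)
        links += [e.get_attribute('href') for e in driver.find_elements_by_xpath("//*[@href]")]
        driver.switch_to.default_content()
    return links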
I've created a crawler which parses certain content from a website.
Firstly, it scrapes links to the categories from the left-hand sidebar.
Secondly, it harvests all the links spread across the pagination that lead to the profile pages.
And finally, going to each profile page, it scrapes the name, phone and web address.
So far it is doing well. The only problem I see with this crawler is that it always starts scraping from the second page, skipping the first page. I suppose there must be some way to get around this. Here is the complete code I am trying with:
import requests
from lxml import html

url = "https://www.houzz.com/professionals/"

def category_links(mainurl):
    req = requests.Session()
    response = req.get(mainurl).text
    tree = html.fromstring(response)
    for titles in tree.xpath("//a[@class='sidebar-item-label']/@href"):
        next_pagelink(titles)  # links to the category from left-sided bar

def next_pagelink(process_links):
    req = requests.Session()
    response = req.get(process_links).text
    tree = html.fromstring(response)
    for link in tree.xpath("//ul[@class='pagination']//a[@class='pageNumber']/@href"):
        profile_pagelink(link)  # the whole links spread through pagination connected to the profile page

def profile_pagelink(procured_links):
    req = requests.Session()
    response = req.get(procured_links).text
    tree = html.fromstring(response)
    for titles in tree.xpath("//div[@class='name-info']"):
        links = titles.xpath(".//a[@class='pro-title']/@href")[0]
        target_pagelink(links)  # profile page of each link

def target_pagelink(main_links):
    req = requests.Session()
    response = req.get(main_links).text
    tree = html.fromstring(response)
    def if_exist(titles, xpath):
        info = titles.xpath(xpath)
        if info:
            return info[0]
        return ""
    for titles in tree.xpath("//div[@class='container']"):
        name = if_exist(titles, ".//a[@class='profile-full-name']/text()")
        phone = if_exist(titles, ".//a[contains(concat(' ', @class, ' '), ' click-to-call-link ')]/@phone")
        web = if_exist(titles, ".//a[@class='proWebsiteLink']/@href")
        print(name, phone, web)

category_links(url)
The problem with the first page is that it doesn't have a 'pagination' class, so this expression: tree.xpath("//ul[@class='pagination']//a[@class='pageNumber']/@href") returns an empty list and the profile_pagelink function never gets executed.
As a quick fix you can handle this case separately in the category_links function :
def category_links(mainurl):
    response = requests.get(mainurl).text
    tree = html.fromstring(response)
    if mainurl == "https://www.houzz.com/professionals/":
        profile_pagelink("https://www.houzz.com/professionals/")
    for titles in tree.xpath("//a[@class='sidebar-item-label']/@href"):
        next_pagelink(titles)
Also, I noticed that target_pagelink prints a lot of empty strings as a result of if_exist returning "". You can skip those cases if you add a condition to the for loop:
for titles in tree.xpath("//div[@class='container']"):  # use class='profile-cover' if you get duplicates
    name = if_exist(titles, ".//a[@class='profile-full-name']/text()")
    phone = if_exist(titles, ".//a[contains(concat(' ', @class, ' '), ' click-to-call-link ')]/@phone")
    web = if_exist(titles, ".//a[@class='proWebsiteLink']/@href")
    if name + phone + web:
        print(name, phone, web)
Finally, requests.Session is mostly used for persisting cookies and other headers across requests, which is not necessary for your script. You can just use requests.get and get the same results.
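For example, next_pagelink could drop the Session and behave exactly the same (a minimal illustration of that point, reusing the code from the question):
def next_pagelink(process_links):
    # requests.get is enough here; no cookies or shared headers are needed
    response = requests.get(process_links).text
    tree = html.fromstring(response)
    for link in tree.xpath("//ul[@class='pagination']//a[@class='pageNumber']/@href"):
        profile_pagelink(link)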