Python scraping: trouble extracting a value - python-3.x

I'm trying to extract values from the table on this site: https://www.geonames.org/search.html?q=&country=IT
In my example I want to extract the name 'Rome', and I used this code:
import requests
import lxml.html
html = requests.get('https://www.geonames.org/search.html?q=&country=IT')
doc = lxml.html.fromstring(html.content)
table_body = doc.xpath('//*[@id="search"]/table')[0]
cities = table_body.xpath('//*[@id="search"]/table/tbody/tr[3]/td[2]/a[1]/text()')
Everything seems OK to me, but when I print it the result is:
>>> print(cities)
[]
I really have no idea what the problem could be. Does someone have a suggestion?

If you're looking to get "Rome", you can omit tbody. This element was inserted by the browser and isn't present in the original document returned by the request.
Additionally, the line table_body = doc.xpath('//*[@id="search"]/table')[0] is redundant; you can search directly from the root.
import requests
import lxml.html
html = requests.get('https://www.geonames.org/search.html?q=&country=IT')
doc = lxml.html.fromstring(html.content)
print(doc.xpath('//*[@id="search"]/table/tr[3]/td[2]/a[1]/text()')[0]) # => Rome

Here is a simple script to extract all the cities on that page:
import requests
import lxml.html
html = requests.get('https://www.geonames.org/search.html?q=&country=IT')
doc = lxml.html.fromstring(html.content)
# corrected the XPath in the line below
cities = doc.xpath("//table[@class='restable']//td[a][2]/a[1]/text()")
for city in cities:
    print(city)
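As an aside, when the target is a whole HTML table, pandas.read_html can often grab it in a couple of lines. A minimal sketch, assuming the page's restable table parses cleanly:
import requests
import pandas as pd

html = requests.get('https://www.geonames.org/search.html?q=&country=IT')
# read_html returns a list of DataFrames, one per matching <table>
tables = pd.read_html(html.text, attrs={'class': 'restable'})
print(tables[0].head())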

Related

How Can I Assign A Variable To All Of The Items In A List?

I'm following a guide, and it says to print the first item from an HTML document that contains the dollar sign.
It seems to work, outputting a price to the terminal that is actually present on the webpage. However, I don't want just that single listing; I want all of the listings printed to the terminal.
I'm almost positive you could do this with a for loop, but I don't know how to set it up correctly. Here's the code I have so far; the comment marks the part I'm asking about.
from bs4 import BeautifulSoup
import requests
import os
os.system("clear")
url = 'https://www.newegg.com/p/pl?d=RTX+3080'
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
prices = doc.find_all(text="$")
#Print all prices instead of just the specified number?
parent = prices[0].parent
strong = parent.find("strong")
print(strong.string)
You could try the following:
from bs4 import BeautifulSoup
import requests
import os
os.system("clear")
url = 'https://www.newegg.com/p/pl?d=RTX+3080'
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
prices = doc.find_all(text="$")
for price in prices:
    parent = price.parent
    strong = parent.find("strong")
    print(strong.string)
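One caveat worth adding: find() returns None when a matched "$" node's parent has no <strong> child, and None.string then raises AttributeError. A defensive variant of the loop (an assumption about the page, not something the original answer needed):
for price in prices:
    strong = price.parent.find("strong")
    if strong is not None:  # skip "$" nodes without a <strong> price
        print(strong.string)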

How can I rewrite the below code to get an actual data set instead of the empty data frame that I am getting?

I have written the following lines of code to scrape a book store website to get the book title, the price of the book, and the availability of the book. My code runs without errors, but I get an empty data frame instead of the data I want. Please assist.
>>> import requests
>>> import bs4
>>> import re
>>> import pandas as pd
>>> full_dict={'Title':[],'Price':[],'Availability':[]}
>>> for index in range(1,50):
...     res=requests.get("http://books.toscrape.com/catalogue/category/books_1/index?={index}.html")
...     soup=bs4.BeautifulSoup(res.text,'lxml')
...     books=soup.find_all(class_='product_prod')
...     for book in books:
...         book_title=book.find(href=re.compile("title"))
...         book_price=book.find('div',{'class':'product_price'})
...         book_availability=book.find('p',{'class':'instock.availability'})
...         full_dict['Title'].append(title)
...         full_dict['Price'].append(price)
...         full_dict['Availability'].append(availability)
>>> df=pd.DataFrame(full_dict)
>>> print(df)
I want the book title, book price, and book availability (whether the book is in stock) displayed as results, from http://books.toscrape.com/index.html, for the first 50 pages.
You need to correct your URL, otherwise you get a 404. I would then also switch to faster CSS selectors and make sure your variable names are consistent:
import requests
import bs4
full_dict={'Title':[],'Price':[],'Availability':[]}
for index in range(1,3):
    res = requests.get(f"http://books.toscrape.com/catalogue/page-{index}.html") # e.g. http://books.toscrape.com/catalogue/page-2.html
    soup = bs4.BeautifulSoup(res.text,'lxml')
    books = soup.select('.product_pod')
    for book in books:
        book_title = book.select_one('h3 a').text
        book_price = book.select_one('.price_color').text.replace('Â','')
        book_availability = book.select_one('.availability').text.strip()
        full_dict['Title'].append(book_title)
        full_dict['Price'].append(book_price)
        full_dict['Availability'].append(book_availability)
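To finish, the dictionary can be turned into a DataFrame exactly as in the original script:
import pandas as pd

df = pd.DataFrame(full_dict)
print(df)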
OK, I just saw the mistake: your variables are called e.g. book_title, but you append just title. It must be:
full_dict['Title'].append(book_title)
full_dict['Price'].append(book_price)
full_dict['Availability'].append(book_availability)
It seems that you're getting a 404 error from the webpage.

How can I scrape data that isn't in the page source?

scrape.py
# code to scrape the links from the html
from bs4 import BeautifulSoup
import urllib.request
data = open('scrapeFile','r')
html = data.read()
data.close()
soup = BeautifulSoup(html,features="html.parser")
# code to extract links
links = []
for div in soup.find_all('div', {'class':'main-bar z-depth-1'}):
    # print(div.a.get('href'))
    links.append('https://godamwale.com' + str(div.a.get('href')))
print(links)
# write the links to a file, one per line; "with" closes the file automatically
with open("links.txt", "w") as file:
    for link in links:
        file.write(link + '\n')
        print(link)
This code successfully gets the list of links. But when I want to scrape the data from those links, their HTML pages don't contain the source code that holds the data, which makes extracting it tough. I have used the Selenium driver, but it didn't work well for me.
I want to scrape the data from the link below, which has data in HTML sections for customer details, licence and automation, commercial details, floor-wise details, and operational details. I want to extract these data along with the name, location, contact number, and type.
https://godamwale.com/list/result/591359c0d6b269eecc1d8933
If someone finds a solution, please share it.
Using Developer Tools in your browser, you'll notice that whenever you visit that link there is a request to https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933 which returns a JSON response, probably containing the data you're looking for.
Python 2.x:
import urllib2, json
contents = json.loads(urllib2.urlopen("https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933").read())
print contents
Python 3.x:
import urllib.request, json
contents = json.loads(urllib.request.urlopen("https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933").read().decode('UTF-8'))
print(contents)
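Equivalently, with the requests library used elsewhere in this thread (a small sketch of the same fetch):
import requests

# requests can decode the JSON body directly
contents = requests.get("https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933").json()
print(contents)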
Here you go. The main problem with the site seems to be that it takes time to load, which is why it was returning an incomplete page source; you have to wait until the page loads completely. Notice the time.sleep(8) line in the code below:
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
import time
CHROMEDRIVER_PATH = r"C:\Users\XYZ\Downloads\Chromedriver.exe" # raw string, so "\U" isn't treated as an escape sequence
wd = webdriver.Chrome(CHROMEDRIVER_PATH)
wd.get("https://godamwale.com/list/result/591359c0d6b269eecc1d8933")
time.sleep(8) # wait until the page loads completely
soup = BeautifulSoup(wd.page_source, 'lxml')
props_list = []
propvalues_list = []
div = soup.find_all('div', {'class':'row'})
for childtags in div[6].findChildren('div',{'class':'col s12 m4 info-col'}):
    props = childtags.find("span").contents
    props_list.append(props)
    propvalue = childtags.find("p",recursive=True).contents
    propvalues_list.append(propvalue)
print(props_list)
print(propvalues_list)
Note: the code will return the construction details in two separate lists.
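As an aside, a fixed time.sleep(8) is fragile: too short on a slow connection, wasted time on a fast one. Selenium's explicit waits block only until the element actually appears. A sketch, assuming the info-col divs are a reliable signal that the page has finished loading:
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# wait up to 20 seconds for the first info column to be present
WebDriverWait(wd, 20).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div.col.s12.m4.info-col"))
)
soup = BeautifulSoup(wd.page_source, 'lxml')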

find() in Beautifulsoup returns None

I'm very new to programming in general, and I'm trying to write my own little torrent leecher. I'm using BeautifulSoup to extract the title and the magnet link of a torrent file. However, find() keeps returning None no matter what I do. The page is correct. I've also tested with find_next_sibling and read all the similar questions, but to no avail. Since there are no errors, I have no idea what my mistake is.
Any help would be much appreciated. Below is my code:
import urllib3
from bs4 import BeautifulSoup
print("Please enter the movie name: \n")
search_string = input("")
search_string = search_string.strip() # str.strip() returns a new string; the result must be assigned back
open_page = ('https://www.yify-torrent.org/search/' + search_string + '/s-1/all/all/') # get link - creates a search string with input value
print(open_page)
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
manager = urllib3.PoolManager(10)
page_content = manager.urlopen('GET',open_page)
soup = BeautifulSoup(page_content,'html.parser')
magnet = soup.find('a', attrs={'class': 'movielink'}, href=True)
print(magnet)
Check out the following script, which does exactly what you want to achieve. I used the requests library instead of urllib3. The main mistake you made is that you looked for the magnet link in the wrong place; you need to go one layer deeper to dig out that link. Also, try using quote instead of string concatenation to fit your search query into the URL.
Give this a shot:
import requests
from urllib.parse import urljoin
from urllib.parse import quote
from bs4 import BeautifulSoup
keyword = 'The Last Of The Mohicans'
url = 'https://www.yify-torrent.org/search/'
base = f"{url}{quote(keyword)}{'/p-1/all/all/'}"
res = requests.get(base)
soup = BeautifulSoup(res.text,'html.parser')
tlink = urljoin(url,soup.select_one(".img-item .movielink").get("href"))
req = requests.get(tlink)
sauce = BeautifulSoup(req.text,"html.parser")
title = sauce.select_one("h1[itemprop='name']").text
magnet = sauce.select_one("a#dm").get("href")
print(f"{title}\n{magnet}")

How to Extract just the text from a website using Jupyter?

I am trying to get the text of an article from a link, but along with the text I am getting other links, advertisement links, and image names, which I don't need for my analysis.
import re
from nltk import word_tokenize, sent_tokenize, ngrams
from collections import Counter
from urllib import request
from bs4 import BeautifulSoup
url = "https://www.yahoo.com/news/best-bites-weeknight-meals-cauliflower-
120000419.html" #this is the link
html = request.urlopen(url).read().decode('utf8')
raw = BeautifulSoup(html,"lxml").get_text()
raw
I am getting this result (I copied just a few lines; the actual article text is there too, on other lines):
window.performance && window.performance.mark &&
window.performance.mark(\'PageStart\');Best Bites: Weeknight meals
cauliflower vegetable fried rice!function(s,f,p){var
a=[],e={_version:"3.6.0",_config:{classPrefix:"",enableClasses:!0,enableJSClass:!0,usePrefixes:!0},_q:[],on:function(e,t){var
n=this;setTimeout(function(){t(n[e])},0)},addTest:function(e,t,n){a.push({name:e,fn:t,options:n})},addAsyncTest:function(e){a.push({name:null,fn:e})}},l=function(){};l.prototype=e,l=new
l;var c=[];function v(e,t){return typeof e===t}var t="Moz O ms
Webkit",u=e._config
I just want to know if there is any way for me to extract just the text of an article, ignoring all these values.
When BS4 parses a site, it builds its own DOM internally as an object.
To access different parts of the DOM, we have to use the correct accessors or tags, like below:
import re
from collections import Counter
from urllib import request
from bs4 import BeautifulSoup
url = "https://www.yahoo.com/news/best-bites-weeknight-meals-cauliflower-120000419.html" #this is the link
html = request.urlopen(url).read().decode('utf8')
parsedHTML = BeautifulSoup(html, "html.parser")
readableText = parsedHTML.article.get_text() # <- we got the text from inside the <article> tag
print(readableText)
You were close, but you didn't specify which tag you wanted to get_text() from.
Also, find() and find_all() are very useful for finding tags on a page.
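For example, continuing from the parsedHTML object above, you could pull only the paragraphs inside the article (a sketch, assuming the story text lives in <p> tags under <article>):
for p in parsedHTML.article.find_all("p"):
    print(p.get_text())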