How to Extract just the text from a website using Jupyter? - python-3.x

I am trying to get the text of an article from a link, but while importing the text I am also getting other links, advertisement links, and image names, which I don't need for my analysis.
import re
from nltk import word_tokenize, sent_tokenize, ngrams
from collections import Counter
from urllib import request
from bs4 import BeautifulSoup
url = "https://www.yahoo.com/news/best-bites-weeknight-meals-cauliflower-
120000419.html" #this is the link
html = request.urlopen(url).read().decode('utf8')
raw = BeautifulSoup(html,"lxml").get_text()
raw
I am getting this result (copied just a few lines; I do get the actual text of the article as well, but it is buried among lines like these):
window.performance && window.performance.mark &&
window.performance.mark(\'PageStart\');Best Bites: Weeknight meals
cauliflower vegetable fried rice!function(s,f,p){var
a=[],e={_version:"3.6.0",_config:{classPrefix:"",enableClasses:!0,enableJSClass:!0,usePrefixes:!0},_q:[],on:function(e,t){var
n=this;setTimeout(function(){t(n[e])},0)},addTest:function(e,t,n){a.push({name:e,fn:t,options:n})},addAsyncTest:function(e){a.push({name:null,fn:e})}},l=function(){};l.prototype=e,l=new
l;var c=[];function v(e,t){return typeof e===t}var t="Moz O ms
Webkit",u=e._config
I just want to know if there is any way for me to extract just the text of an article, ignoring all these values.

When BS4 parses a site, it creates its own DOM internally as an object.
To access different parts of the DOM we have to use the correct accessors or tags, like below:
import re
from collections import Counter
from urllib import request
from bs4 import BeautifulSoup
url = "https://www.yahoo.com/news/best-bites-weeknight-meals-cauliflower-120000419.html" #this is the link
html = request.urlopen(url).read().decode('utf8')
parsedHTML = BeautifulSoup(html, "html.parser")
readableText = parsedHTML.article.get_text() # <- we got the text from inside the <article> tag
print(readableText)
You were close, but you didn't specify which tag you wanted to get_text() from.
Also, find() and find_all() are very useful for finding tags on a page; for example:
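Here is a minimal sketch of both, continuing from the parsedHTML object above (the <p> tag is just an illustration, not something specific to the Yahoo page):
first_paragraph = parsedHTML.find('p')        # first matching tag, or None if absent
all_paragraphs = parsedHTML.find_all('p')     # a list of every matching tag
body_text = ' '.join(p.get_text() for p in all_paragraphs)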

Related

Use beautifulsoup to download href links

Looking to download href links using beautifulsoup4, Python 3, and the requests library.
This is the code that I have now. I thought it would be tough to use regex in this situation, but I'm not sure if this can be done with BeautifulSoup instead. I have to download all of the shape files from the grid and am looking to automate this task. Thank you!
URL:
https://earth-info.nga.mil/index.php?dir=coordsys&action=gars-20x20-dloads
import requests
from bs4 import BeautifulSoup
import re
URL = 'https://earth-info.nga.mil/index.php?dir=coordsys&action=gars-20x20-dloads'
page = requests.get(URL)
soup = BeautifulSoup(page.content,'html.parser')
results = re.findall(r'<a[^>]* href="([^"]*)"', page)
print(results)
Those files are all associated with an area tag, so I would simply select those:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://earth-info.nga.mil/index.php?dir=coordsys&action=gars-20x20-dloads')
soup = bs(r.content, 'lxml')
files = ['https://earth-info.nga.mil/' + i['href'] for i in soup.select('area')]
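If the goal is to then download those shape files, here is a minimal follow-up sketch continuing from files above (it assumes each href resolves to a directly downloadable file, and the local-filename logic is only illustrative):
for url in files:
    r = requests.get(url)
    # derive a local name from the last path segment of the URL
    name = url.rstrip('/').split('/')[-1] or 'download.zip'
    with open(name, 'wb') as f:
        f.write(r.content)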
Alternatively, you can convert page to a string in order to search for all the a tags using regex.
Instead of:
results = re.findall(r'<a[^>]* href="([^"]*)"', page)
Use:
results = re.findall(r'<a[^>]* href="([^"]*)"', page.text)

Beautiful Soup Nested Loops

I was hoping to create a list of all of the firms featured on this list. I expected each winner to have their own section in the HTML, but it looks like multiple winners are grouped together across several divs. How would you recommend solving this? I was able to pull all of the divs, but I don't know how to cycle through them appropriately. Thanks!
import requests
from bs4 import BeautifulSoup
import csv
request = requests.get("https://growthcapadvisory.com/growthcaps-top-40-under-40-growth-investors-of-2020/")
text = request.text
soup = BeautifulSoup(text, 'html.parser')
element = soup.find()
person = soup.find_all('div', class_="under40")
This solution uses CSS selectors:
import requests
from bs4 import BeautifulSoup
request = requests.get("https://growthcapadvisory.com/growthcaps-top-40-under-40-growth-investors-of-2020/")
text = request.text
soup = BeautifulSoup(text, 'html.parser')
# if you have an older version you'll need to use contains instead of -soup-contains
firm_tags = soup.select('h5:-soup-contains("Firm") strong')
# extract the text from the selected bs4.Tags
firms = [tag.text for tag in firm_tags]
# if there is extra whitespace
clean_firms = [f.strip() for f in firms]
It works by selecting all the strong tags whose parent h5 tag contains the word "Firm".
See the SoupSieve Docs for more info on bs4's CSS Selectors
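If your bs4/soupsieve version predates :-soup-contains (as the comment in the code notes), the same selection should work with the older :contains alias:
firm_tags = soup.select('h5:contains("Firm") strong')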

I'm having trouble returning an HTML link as I pull links from a google search query in Python

I'm attempting to pull website links from a Google search, but I'm having trouble returning any value. I think the issue is with the attributes I'm using to call the web link, but I'm not sure why, as I was able to use the same attributes in webdriver to accomplish this.
Here's the code:
import requests
import sys
import webbrowser
import bs4
from parsel import Selector
import xlsxwriter
from openpyxl import load_workbook
import pandas as pd
print('Searching...')
res = requests.get('https://google.com/search?q="retail software" AND "mission"')
soup = bs4.BeautifulSoup(res.content, 'html.parser')
for x in soup.find_all('div', class_='yuRUbf'):
    anchors = x.find_all('a')
    if anchors:
        link = anchors[0]['href']
        print(link)
This is the output:
Searching...
That's it. Thanks in advance for the help!
The class value is dynamic, so you should use the following selector to retrieve the href value:
"a[href*='/url']"
This will match any a tag whose href contains the pattern /url.
So, just change your for loop to:
for anchor_tags in soup.select("a[href*='/url']"):
    print(anchor_tags.attrs.get('href'))
Example of a printed href:
/url?q=https://www.devontec.com/mission-vision-and-values/&sa=U&ved=2ahUKEwjShcWU-aPtAhWEH7cAHYfQAlEQFjAAegQICBAB&usg=AOvVaw14ziJ-ipXIkPWH3oMXCig1
To get the link, you only need to split the string.
You can do it like this:
for anchor_tags in soup.select("a[href*='/url']"):
    link = anchor_tags.attrs.get('href').split('q=')[-1]  # split on 'q=' and take the last piece
    print(link)
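Splitting on 'q=' works here, but it keeps Google's trailing &sa=...&ved=... tracking parameters. A slightly more robust sketch parses the redirect URL properly with the standard library:
from urllib.parse import urlparse, parse_qs
for anchor_tags in soup.select("a[href*='/url']"):
    params = parse_qs(urlparse(anchor_tags['href']).query)
    if 'q' in params:
        print(params['q'][0])  # the destination URL without the tracking parameters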

How can I scrape data that doesn't appear in the page source?

scrape.py
# code to scrape the links from the html
from bs4 import BeautifulSoup
import urllib.request
with open('scrapeFile', 'r') as data:
    html = data.read()
soup = BeautifulSoup(html, features="html.parser")
# code to extract links
links = []
for div in soup.find_all('div', {'class': 'main-bar z-depth-1'}):
    # print(div.a.get('href'))
    links.append('https://godamwale.com' + str(div.a.get('href')))
print(links)
with open("links.txt", "w") as file:
    for link in links:
        file.write(link + '\n')
        print(link)
I have successfully got the list of links using this code. But when I want to scrape the data from those links, their pages don't have the data in the HTML source, which makes extracting it tough. I have used the Selenium driver, but it didn't work well for me.
I want to scrape the data from the link below, which contains data in HTML sections covering customer details, licence and automation, commercial details, floor-wise details, and operational details. I want to extract these data with name, location, contact number, and type.
https://godamwale.com/list/result/591359c0d6b269eecc1d8933
If someone finds a solution, please share it.
Using the developer tools in your browser, you'll notice that whenever you visit that link there is a request to https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933 that returns a JSON response, probably containing the data you're looking for.
Python 2.x:
import urllib2, json
contents = json.loads(urllib2.urlopen("https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933").read())
print contents
Python 3.x:
import urllib.request, json
contents = json.loads(urllib.request.urlopen("https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933").read().decode('UTF-8'))
print(contents)
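Equivalently, using the requests library that the other snippets here already rely on:
import requests
contents = requests.get("https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933").json()
print(contents)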
Here you go. The main problem with the site seems to be that it takes time to load, which is why it was returning an incomplete page source. You have to wait until the page loads completely; notice the time.sleep(8) line in the code below:
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
import time
CHROMEDRIVER_PATH = r"C:\Users\XYZ\Downloads\Chromedriver.exe"  # raw string so the backslashes aren't treated as escapes
wd = webdriver.Chrome(CHROMEDRIVER_PATH)
wd.get("https://godamwale.com/list/result/591359c0d6b269eecc1d8933")
time.sleep(8)  # wait until the page loads completely
soup = BeautifulSoup(wd.page_source, 'lxml')
props_list = []
propvalues_list = []
div = soup.find_all('div', {'class': 'row'})
for childtags in div[6].findChildren('div', {'class': 'col s12 m4 info-col'}):
    props = childtags.find("span").contents
    props_list.append(props)
    propvalue = childtags.find("p", recursive=True).contents
    propvalues_list.append(propvalue)
print(props_list)
print(propvalues_list)
Note: the code will return the construction details in two separate lists.
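If a single mapping is easier to work with, the two lists can be zipped together (a small usage sketch; it assumes each span held text and that the two lists stay aligned):
details = {str(props[0]).strip(): values for props, values in zip(props_list, propvalues_list)}
print(details)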

bs4 how to use find after using find_all?

I would like to search a bbs and find the content of one specific user. Here is the code I have:
import requests
from bs4 import BeautifulSoup
main_html = requests.get('http://bbs.tianya.cn/post-free-2328897-1.shtml')
main_soup = BeautifulSoup(main_html.text, 'lxml')
all_a = main_soup.find_all('div', _host='羡鱼则结网')
Now I have all the code from this user, and I would like to search it again for the content. However, I can't use find because find_all returns a list.
What should I do to search twice with BeautifulSoup?
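find_all returns a ResultSet, but every element in it is a Tag that itself supports find and find_all, so one option is to loop over the results and search inside each one (a minimal sketch; the inner 'p' lookup is just a placeholder for whatever content tag is actually needed):
for post in all_a:
    content = post.find('p')  # search within this user's post only
    if content:
        print(content.get_text())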
