Modifying a scraped url and changing its extension - python-3.x

I am new to programming and am trying to download images and PDFs from a website. In the source code, the items I need are in option tags with partial URLs. The site lists these items in a drop-down menu and displays them in an iframe, but each item can be opened on its own page using its full URL.
So far, my code finds the options, appends each partial URL to the page's base address to build the full URL, removes the trailing "/" from the .tif and .TIF URLs, and adds ".pdf".
However, for the .tif and .TIF URLs, I also need to change "convert" to "pdf" so they open on their own page. Is there a way to do this only for the .tif.pdf and .TIF.pdf URLs while leaving the others unchanged?
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import os

my_url = 'http://example.com'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")

options = page_soup.findAll("select", {"id": "images"})[0].findAll("option")
values = [o.get("value") for o in options]
split_values = [i.split("|", 1)[0] for i in values]
# The option value is split to separate the url from its label
# <option value="/convert/ASRIMG/new/hop.TIF/|New Form"></option>

new_val = []
for val in split_values:
    ext = os.path.splitext(val.rstrip('/'))[-1]
    new_ext = ext
    if ext.lower() == '.tif':
        new_ext += '.pdf'
    new_val.append(val.rstrip('/').replace(ext, new_ext))

for i in range(len(new_val)):
    image_urls = ('http://example.com' + new_val[i])
My current results:
print (new_val)
/ASRIMG/good.jpg
/ASRIMG/foo/bar1.jpg
/ASRIMG/foo/bar2.jpg
/ASRIMG/foo/bar3.jpg
/convert/ASRIMG/new/hop.TIF.pdf
/convert/REG/green1.tif.pdf
/convert/REG//green2.tif.pdf
/convert/SHIP/green3.tif.pdf
/convert/SHIP/green4.tif.pdf
/convert/SHIP/green5.tif.pdf
/SKETCHIMG/001.png
/SKETCH/002.JPG
print (image_urls)
http://example.com/ASRIMG/good.jpg
http://example.com/ASRIMG/foo/bar1.jpg
http://example.com/ASRIMG/foo/bar2.jpg
http://example.com/ASRIMG/foo/bar3.jpg
http://example.com/convert/ASRIMG/new/hop.TIF.pdf
http://example.com/convert/REG/green1.tif.pdf
http://example.com/convert/REG//green2.tif.pdf
http://example.com/convert/SHIP/green3.tif.pdf
http://example.com/convert/SHIP/green4.tif.pdf
http://example.com/convert/SHIP/green5.tif.pdf
http://example.com/SKETCHIMG/001.png
http://example.com/SKETCH/002.JPG
What I need:
http://example.com/ASRIMG/good.jpg
http://example.com/ASRIMG/foo/bar1.jpg
http://example.com/ASRIMG/foo/bar2.jpg
http://example.com/ASRIMG/foo/bar3.jpg
http://example.com/pdf/ASRIMG/new/hop.TIF.pdf
http://example.com/pdf/REG/green1.tif.pdf
http://example.com/pdf/REG//green2.tif.pdf
http://example.com/pdf/SHIP/green3.tif.pdf
http://example.com/pdf/SHIP/green4.tif.pdf
http://example.com/pdf/SHIP/green5.tif.pdf
http://example.com/SKETCHIMG/001.png
http://example.com/SKETCH/002.JPG

After this step:
split_values = [i.split("|", 1)[0] for i in values]
This code handles both uppercase and lowercase .tif extensions:
In [48]: import os

In [49]: split_values = ['/ASRIMG/good.jpg', '/convert/ASRIMG/new/hop.TIF/', 'SKETCHIMG/001.png']

In [50]: new_val = []

In [51]: for val in split_values:
    ...:     ext = os.path.splitext(val.rstrip('/'))[-1]
    ...:     new_ext = ext
    ...:     if ext.lower() == '.tif':
    ...:         new_ext += '.pdf'
    ...:     new_val.append(val.rstrip('/').replace(ext, new_ext))
    ...:
This strips the trailing "/" from each value in split_values and, when the extension is .tif or .TIF, replaces it with .tif.pdf or .TIF.pdf; all other values are left unchanged.
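To also answer the "convert" part of the question: in the sample data only the .tif/.TIF paths start with /convert/, so that prefix can be rewritten inside the same if branch. A minimal sketch, using a few sample values from the question and assuming the /convert/ to /pdf/ swap should happen only for the converted TIFs:
import os

split_values = ['/ASRIMG/good.jpg', '/convert/ASRIMG/new/hop.TIF/', '/SKETCHIMG/001.png']

new_val = []
for val in split_values:
    path = val.rstrip('/')
    ext = os.path.splitext(path)[-1]
    if ext.lower() == '.tif':
        # only the .tif/.TIF paths get the .pdf suffix and the /pdf/ prefix
        path += '.pdf'
        if path.startswith('/convert/'):
            path = '/pdf/' + path[len('/convert/'):]
    new_val.append(path)

image_urls = ['http://example.com' + p for p in new_val]
print(image_urls)
# ['http://example.com/ASRIMG/good.jpg',
#  'http://example.com/pdf/ASRIMG/new/hop.TIF.pdf',
#  'http://example.com/SKETCHIMG/001.png']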

Related

Python Instagram Photographs Download

I'm a Python newbie who has been assigned the task of downloading and storing locally at least 100 to 200 photographs, preferably in .jpg format. The code was provided to me, but so far I haven't been able to get it to work. The code goes to https://www.instagram.com/explore/tags/feetphotos to get the photographs.
I've created an account to access the images. The code creates an insta_foot.csv file containing an index and a link to each photograph to be downloaded; sometimes the .csv file is created with the indices and links, and sometimes without them. The last part of the code downloads the photographs from each of the links into an images directory created locally by the script.
import time
import pandas as pd
import requests
import bs4 as bs
from selenium import webdriver

driver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver")
url = 'https://www.instagram.com/explore/tags/foot/'
driver.get(url)

img_sizes = ['150w', '240w', '320w', '480w', '640w']
df = pd.DataFrame(columns=img_sizes)

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    el = driver.find_element_by_tag_name('body')
    soup = bs.BeautifulSoup(el.get_attribute('innerHTML'), 'lxml')
    for t in soup.findAll('img', {"class": "FFVAD"}):
        a_series = pd.Series(['https://' + s.split(' ')[0] for s in
                              t['srcset'].split('https://')[1:]], index=df.columns)
        df = df.append(a_series, ignore_index=True)
    df.drop_duplicates(inplace=True)
    print('last_height: ', last_height, ' links: ', len(df))
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

df.to_csv('insta_foot.csv')

size = '640w'
for i, row in df.iterrows():
    link = row[size]
    n = 'images/' + [e for e in link.split('/') if '.jpg' in e][0].split('.jpg')[0] + '_' + size + '.jpg'
    with open(n, "wb") as f:
        f.write(requests.get(link).content)
    print('index: ', i)

driver.close()

How would I loop through each page and scrape the specific parameters?

#Import Needed Libraries
import requests
from bs4 import BeautifulSoup
import pprint

res = requests.get('https://news.ycombinator.com/news')
soup = BeautifulSoup(res.text, 'html.parser')
links = soup.select('.titlelink')
subtext = soup.select('.subtext')

def sort_stories_by_votes(hnlist):  # Sorting your create_custom_hn dict by votes(if)
    return sorted(hnlist, key=lambda k: k['votes'], reverse=True)

def create_custom_hn(links, subtext):  # Creates a list of links and subtext
    hn = []
    for idx, item in enumerate(links):  # Need to use this because not every link has a lot of votes
        title = links[idx].getText()
        href = links[idx].get('href', None)
        vote = subtext[idx].select('.score')
        if len(vote):
            points = int(vote[0].getText().replace(' points', ''))
            if points > 99:  # Only appends stories that are over 100 points
                hn.append({'title': title, 'link': href, 'votes': points})
    return sort_stories_by_votes(hn)

pprint.pprint(create_custom_hn(links, subtext))
My question is that this only covers the first page, which has just 30 stories.
How would I apply my scraping method to each page, say the next 10 pages, while keeping the formatted code above?
The URL for each page is like this
https://news.ycombinator.com/news?p=<page_number>
Use a for loop to scrape content from each page. The code below prints the contents of the first two pages; change the page_no range depending on your needs.
import requests
from bs4 import BeautifulSoup
import pprint

def sort_stories_by_votes(hnlist):  # Sorting your create_custom_hn dict by votes(if)
    return sorted(hnlist, key=lambda k: k['votes'], reverse=True)

def create_custom_hn(links, subtext, page_no):  # Creates a list of links and subtext
    hn = []
    for idx, item in enumerate(links):  # Need to use this because not every link has a lot of votes
        title = links[idx].getText()
        href = links[idx].get('href', None)
        vote = subtext[idx].select('.score')
        if len(vote):
            points = int(vote[0].getText().replace(' points', ''))
            if points > 99:  # Only appends stories that are over 100 points
                hn.append({'title': title, 'link': href, 'votes': points})
    return sort_stories_by_votes(hn)

for page_no in range(1, 3):
    print(f'Page: {page_no}')
    url = f'https://news.ycombinator.com/news?p={page_no}'
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    links = soup.select('.titlelink')
    subtext = soup.select('.subtext')
    pprint.pprint(create_custom_hn(links, subtext, page_no))
Page: 1
[{'link': 'https://www.thisworddoesnotexist.com/',
  'title': 'This word does not exist',
  'votes': 904},
 {'link': 'https://www.sparkfun.com/news/3970',
  'title': 'A patent troll backs off',
  'votes': 662},
.
.
Page: 2
[{'link': 'https://www.vice.com/en/article/m7vqkv/how-fbi-gets-phone-data-att-tmobile-verizon',
  'title': "The FBI's internal guide for getting data from AT&T, T-Mobile, "
           'Verizon',
  'votes': 802},
 {'link': 'https://www.dailymail.co.uk/news/article-10063665/Government-orders-Google-track-searching-certain-names-addresses-phone-numbers.html',
  'title': 'Feds order Google to track people searching certain names or '
           'details',
  'votes': 733},
.
.
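Note that in the loop above each page is sorted on its own. If you would rather have one list sorted across all the pages (the question mentions the next 10 pages), a small sketch that reuses the two functions defined above can collect everything first and sort once:
# sketch: accumulate stories from pages 1-10, then sort the combined list once
all_hn = []
for page_no in range(1, 11):
    url = f'https://news.ycombinator.com/news?p={page_no}'
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    links = soup.select('.titlelink')
    subtext = soup.select('.subtext')
    all_hn.extend(create_custom_hn(links, subtext, page_no))

pprint.pprint(sort_stories_by_votes(all_hn))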

How to keep these parsed elements from repeating. BeautifulSoup

Which function (or other approach) would be best so that these nicknames do not repeat in my parser? I don't know how to do that. I'd be very grateful for your help.
Source:
from urllib.request import urlopen as uReq
from urllib.request import Request
from bs4 import BeautifulSoup as soup

# save all the nicknames to 'CSV' file format
filename = "BattlePassNicknames.csv"
f = open(filename, "a", encoding="utf-8")
headers1 = "Member of JAZE Battle Pass 2019\n"
b = 1
if b < 2:
    f.write(headers1)
    b += 1

# start page
i = 1
while True:
    # disable jaze guard. turn off html 'mod_security'
    link = 'https://jaze.ru/forum/topic?id=50&page=' + str(i)
    my_url = Request(
        link,
        headers={'User-Agent': 'Mozilla/5.0'}
    )
    i += 1  # increment page no for next run
    uClient = uReq(my_url)
    # check if there was a redirect
    if uClient.url != link:
        break
    page_html = uClient.read()
    uClient.close()
    # html parsing
    page_soup = soup(page_html, "html.parser")
    # grabs each name of player
    containers = page_soup.findAll("div", {"class": "top-area"})
    for container in containers:
        playerName = container.div.a.text.strip()
        print("BattlePass PlayerName: " + playerName)
        f.write(playerName + "\n")
You can add all the names to a set.
A set object is an unordered collection of distinct hashable objects. Common uses include membership testing, removing duplicates from a sequence, and computing mathematical operations such as intersection, union, difference, and symmetric difference.
my_set = set()
# Lets add some elements to a set
my_set.add('a')
my_set.add('b')
print(my_set) # prints {'a', 'b'}
# Add one more 'a'
my_set.add('a')
print(my_set) # still prints {'a', 'b'} !
In your case, let's add all the names to a set and then write to the file after the for loop.
from urllib.request import urlopen as uReq
from urllib.request import Request
from bs4 import BeautifulSoup as soup

# save all the nicknames to 'CSV' file format
filename = "BattlePassNicknames.csv"
f = open(filename, "a", encoding="utf-8")
headers1 = "Member of JAZE Battle Pass 2019\n"
b = 1
if b < 2:
    f.write(headers1)
    b += 1

# start page
i = 1
names = set()
while True:
    # disable jaze guard. turn off html 'mod_security'
    link = 'https://jaze.ru/forum/topic?id=50&page=' + str(i)
    my_url = Request(
        link,
        headers={'User-Agent': 'Mozilla/5.0'}
    )
    i += 1  # increment page no for next run
    uClient = uReq(my_url)
    # check if there was a redirect
    if uClient.url != link:
        break
    page_html = uClient.read()
    uClient.close()
    # html parsing
    page_soup = soup(page_html, "html.parser")
    # grabs each name of player
    containers = page_soup.findAll("div", {"class": "top-area"})
    for container in containers:
        playerName = container.div.a.text.strip()
        names.add(playerName)

for name in names:
    f.write(name + "\n")
f.close()
EDIT
Sets do not preserve the order. If you want to retain the order, just use lists.
...
names = []
while True:
    ...
    for container in containers:
        playerName = container.div.a.text.strip()
        if playerName not in names:
            names.append(playerName)

for name in names:
    f.write(name + "\n")
f.close()
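Another option, if you are on Python 3.7+ (where regular dicts preserve insertion order), is to collect the names in a plain list and deduplicate once at the end with dict.fromkeys. A tiny illustrative example with made-up names:
# illustrative only: deduplicate while keeping first-seen order (Python 3.7+)
names = ['alpha', 'bravo', 'alpha', 'charlie']
unique_names = list(dict.fromkeys(names))
print(unique_names)  # ['alpha', 'bravo', 'charlie']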

Python Web Scrape Unknown Number of Pages

I have working code that scrapes a single Craigslist page for specific information, but what would I need to add in order to grab the data from ALL of the pages (not knowing how many pages ahead of time)?
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = "https://portland.craigslist.org/search/sss?query=electronics&sort=date"
uClient = uReq(my_url)      # sends GET request to URL
page_html = uClient.read()  # reads returned data and puts it in a variable
uClient.close()             # close the connection

# create a file that we will want later to write parsed data to
filename = "ScrapedData.csv"
f = open(filename, 'w')
headers = "date, location, title, price\n"
f.write(headers)

# use BS to parse the webpage
page_soup = soup(page_html, 'html.parser')  # applying BS to the obtained html
containers = page_soup.findAll('p', {'class', 'result-info'})
for container in containers:
    container_date = container.findAll('time', {'class', 'result-date'})
    date = container_date[0].text
    try:
        container_location = container.findAll('span', {'class', 'result-hood'})
        location = container_location[0].text
    except:
        try:
            container_location = container.findAll('span', {'class', 'nearby'})
            location = container_location[0].text
        except:
            location = 'NULL'
    container_title = container.findAll('a', {'class', 'result-title'})
    title = container_title[0].text
    try:
        container_price = container.findAll('span', {'class', 'result-price'})
        price = container_price[0].text
    except:
        price = 'NULL'
    # to print to screen
    print('date:' + date)
    print('location:' + location)
    print('title:' + title)
    print('price:' + price)
    # to write to csv
    f.write(date + ',' + location.replace(",", "-") + ',' + title.replace(",", " ") + ',' + price + '\n')
f.close()
Apart from what sir Andersson has already shown, you can also do it this way for this site:
import requests
from bs4 import BeautifulSoup
import csv

page_link = "https://portland.craigslist.org/search/sss?s={}&query=electronics&sort=date"

for link in [page_link.format(page) for page in range(0, 1147, 120)]:  # this is the fix
    res = requests.get(link)
    soup = BeautifulSoup(res.text, 'lxml')
    for container in soup.select('.result-info'):
        try:
            date = container.select('.result-date')[0].text
        except IndexError:
            date = ""
        try:
            title = container.select('.result-title')[0].text
        except IndexError:
            title = ""
        try:
            price = container.select('.result-price')[0].text
        except IndexError:
            price = ""
        print(date, title, price)
        with open("craigs_item.csv", "a", newline="", encoding="utf-8") as outfile:
            writer = csv.writer(outfile)
            writer.writerow([date, title, price])
You can try to loop through all the pages by adjusting the "s" parameter in the URL until you reach a page with no results (a page containing the text "search and you will find"):
import requests
from bs4 import BeautifulSoup as soup

# open the output file once, before the loop
filename = "ScrapedData.csv"
f = open(filename, 'w')
headers = "date, location, title, price\n"
f.write(headers)

results_counter = 0
while True:
    my_url = "https://portland.craigslist.org/search/sss?query=electronics&sort=date&s=%d" % results_counter
    page_html = requests.get(my_url).text
    if "search and you will find" in page_html:
        break
    else:
        results_counter += 120

    page_soup = soup(page_html, 'html.parser')  # applying BS to the obtained html
    containers = page_soup.findAll('p', {'class', 'result-info'})
    ...

Can't append base URL to create absolute links with BeautifulSoup Python 3

I get a list of links in the output file, but I need all of the links to be absolute. Some are absolute and others are relative. How do I prepend the base URL to the relative ones to ensure that I get only absolute links in the CSV output?
I get back all the links, but not all of them are absolute, e.g. /subpage instead of http://page.com/subpage.
from bs4 import BeautifulSoup
import requests
import csv

j = requests.get("http://cnn.com").content
soup = BeautifulSoup(j, "lxml")

# only return links to subpages, i.e. a tags that contain href
data = []
for url in soup.find_all('a', href=True):
    print(url['href'])
    data.append(url['href'])
print(data)

with open("file.csv", 'w') as csvfile:
    write = csv.writer(csvfile, delimiter=' ')
    write.writerows(data)

content = open('file.csv', 'r').readlines()
content_set = set(content)
cleandata = open('file.csv', 'w')
for line in content_set:
    cleandata.write(line)
With urljoin:
from urllib.parse import urljoin  # Python 3; in Python 2 this was: from urlparse import urljoin
...
base_url = "http://cnn.com"
absolute_url = urljoin(base_url, relative_url)
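A minimal sketch of how this might be wired into the loop from the question (base_url is assumed to be the page being scraped; urljoin returns already-absolute hrefs unchanged and resolves relative ones against base_url):
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import requests

base_url = "http://cnn.com"
j = requests.get(base_url).content
soup = BeautifulSoup(j, "lxml")

data = []
for url in soup.find_all('a', href=True):
    # relative hrefs like /subpage become http://cnn.com/subpage; absolute ones pass through
    data.append(urljoin(base_url, url['href']))
print(data)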
