How to remove duplicate titles while scraping them from a web page - python-3.x

I want duplicate titles to be removed from the output. I am using Beautiful Soup to scrape the titles.
#!/usr/bin/python
from bs4 import BeautifulSoup
import requests

source = requests.get('https://itrevolution.com/book-downloads-extra-materials/')
source = source.text
soup = BeautifulSoup(source, 'lxml')
for tl in soup.find_all('img', class_='responsive-img hover-img'):
    title = set()
    title = tl.get('title')
    print('{}'.format(title))
Output from the above script:
Accelerate
Team Topologies
Accelerate
Project to Product
War and Peace and IT
A Seat at the Table
The Art of Business Value
DevOps for the Modern Enterprise
Making Work Visible
Leading the Transformation
The DevOps Handbook
The Phoenix Project
Beyond the Phoenix Project
The title Accelerate appears twice; it should appear only once.

You were on the right track: taking advantage of a set() is a great idea. Just create it before the for loop and add titles to it using set.add(). See the following:
from bs4 import BeautifulSoup
import requests

source = requests.get('https://itrevolution.com/book-downloads-extra-materials/')
source = source.text
soup = BeautifulSoup(source, 'lxml')

titles = set()
for tl in soup.find_all('img', class_='responsive-img hover-img'):
    title = tl.get('title')
    titles.add(title)
print(titles)

If you need a distinct list, here is a slight modification to your code:
from bs4 import BeautifulSoup
import requests

source = requests.get('https://itrevolution.com/book-downloads-extra-materials/')
source = source.text
soup = BeautifulSoup(source, 'lxml')

title = []
for tl in soup.find_all('img', class_='responsive-img hover-img'):
    title.append(tl.get('title'))
distinctTitle = list(set(title))
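Note that a set() does not preserve the order in which the titles appear on the page. If order matters, dict.fromkeys() (Python 3.7+) deduplicates while keeping first-seen order; a sketch with sample titles standing in for the scraped list:

```python
# Sketch: set() loses page order; dict.fromkeys() deduplicates while
# keeping insertion order (guaranteed for dicts since Python 3.7).
titles = ['Accelerate', 'Team Topologies', 'Accelerate', 'Project to Product']
distinct_in_order = list(dict.fromkeys(titles))
print(distinct_in_order)  # ['Accelerate', 'Team Topologies', 'Project to Product']
```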

Related

Script is not returning proper output when trying to retrieve data from a newsletter

I am trying to write a script that can retrieve the album title and band name from a music store newsletter. The band name and album title are hidden in h3 and h4 elements. When executing the script I get a blank output in the CSV file.
from bs4 import BeautifulSoup
import requests
import pandas as pd

# Use the requests library to fetch the HTML content of the page
url = "https://www.musicmaniarecords.be/_sys/newsl_view?n=260&sub=Tmpw6Rij5D"
response = requests.get(url)
# Use the BeautifulSoup library to parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')
# Find all 'a' elements with the class 'row'
albums = soup.find_all('a', attrs={'class': 'row'})
# Iterate over the found elements and extract the album title and band name
album_title = []
band_name = []
for album in albums:
    album_title_element = album.find('td', attrs={'td_class': 'h3 class'})
    band_name_element = album.find('td', attrs={'td_class': 'h4 class'})
    album_title.append(album_title_element.text)
    band_name.append(band_name_element.text)
# Use the pandas library to save the extracted data to a CSV file
df = pd.DataFrame({'album_title': album_title, 'band_name': band_name})
df.to_csv('music_records.csv')
I think the error is in the attrs part, not sure how to fix it properly. Thanks in advance!
Looking at your code, I agree that the error lies in the attrs part. The problem is that the site you are trying to scrape does not contain any 'a' elements with the 'row' class, so find_all returns an empty list. There are plenty of 'div' elements with the 'row' class; maybe you meant to look for those?
You had the right idea in looking for 'td' elements and extracting their 'h3' and 'h4' elements, but since albums is an empty list, there are no elements to search.
I changed your code slightly to look for 'td' elements directly and extract their 'h3' and 'h4' elements. With these small changes your code found 29 albums.
from bs4 import BeautifulSoup
import requests
import pandas as pd

# Use the requests library to fetch the HTML content of the page
url = "https://www.musicmaniarecords.be/_sys/newsl_view?n=260&sub=Tmpw6Rij5D"
response = requests.get(url)
# Use the BeautifulSoup library to parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')
# Find all 'td' elements with the class 'block__cell'
albums = soup.find_all('td', attrs={'class': 'block__cell'})
# Iterate over the found elements and extract the album title and band name
album_title = []
band_name = []
for album in albums:
    album_title_element = album.find('h3')
    band_name_element = album.find('h4')
    album_title.append(album_title_element.text)
    band_name.append(band_name_element.text)
# Use the pandas library to save the extracted data to a CSV file
df = pd.DataFrame({'album_title': album_title, 'band_name': band_name})
df.to_csv('music_records.csv', index=False)
I also took the liberty of adding index=False to the last line of your code. This prevents each row from starting with its index followed by a comma.
Hope this helps.
from bs4 import BeautifulSoup
import requests
import pandas as pd

# Use the requests library to fetch the HTML content of the page
url = "https://www.musicmaniarecords.be/_sys/newsl_view?n=260&sub=Tmpw6Rij5D"
response = requests.get(url)
# Use the BeautifulSoup library to parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')
# Find all 'td' elements with the class 'block__cell'
albums = soup.find_all('td', attrs={'class': 'block__cell'})
# Iterate over the found elements and extract the album title and band name
album_title = []
band_name = []
for album in albums:
    album_title_element = album.find('h3', attrs={'class': 'header'})
    band_name_element = album.find('h4', attrs={'class': 'header'})
    album_title.append(album_title_element.text)
    band_name.append(band_name_element.text)
# Use the pandas library to save the extracted data to a CSV file
df = pd.DataFrame({'album_title': album_title, 'band_name': band_name})
df.to_csv('music_records.csv')
Thanks to the anonymous hero for helping out!
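One caveat for both versions: if a matched cell is missing an h3 or h4, find returns None and calling .text on it raises AttributeError. A defensive sketch, with inline HTML standing in for the fetched page (the block__cell class name is taken from the answers above):

```python
from bs4 import BeautifulSoup

# Sketch: skip cells that lack an h3 or h4 instead of crashing on None.text.
# The inline HTML stands in for the newsletter page fetched above.
html = """
<table>
  <td class="block__cell"><h3>Album A</h3><h4>Band A</h4></td>
  <td class="block__cell"><h3>Album B</h3></td>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

rows = []
for cell in soup.find_all('td', attrs={'class': 'block__cell'}):
    title = cell.find('h3')
    band = cell.find('h4')
    if title is None or band is None:
        continue  # incomplete cell: skip rather than raise AttributeError
    rows.append((title.get_text(strip=True), band.get_text(strip=True)))

print(rows)  # [('Album A', 'Band A')] - the second cell has no h4 and is skipped
```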

How to Add to an Index From a Web Scrape

Am I able to run a while loop and add to the index below to gather all the odd-numbered indexes on the page?
Basically, I want to skip all the even indexes and print the odd ones without repeating
the marked line below over and over with [1], [3], [5], etc.
Is there a way to write a while loop that adds to the index number?
Thanks!!
from bs4 import BeautifulSoup
import requests

vegas_insider = requests.get('https://www.vegasinsider.com/nfl/matchups/').text
soup = BeautifulSoup(vegas_insider, 'lxml')
team = soup.find_all('a', class_='tableText')[1].text  # <-- the line in question
print(team)
teams = [team.text for team in soup.find_all('a', class_ = 'tableText')[1::2]]
Or, to print each name:
for team in soup.find_all('a', class_='tableText')[1::2]:
    print(team.text)
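Since the question asks specifically about a while loop, here is a sketch of the same odd-index walk, with a plain list standing in for the scraped tags:

```python
# Sketch: stepping through odd indexes with a while loop. The sample list
# stands in for soup.find_all('a', class_='tableText').
links = ['header', 'Team A', 'spacer', 'Team B', 'spacer', 'Team C']

i = 1                      # start at the first odd index
teams = []
while i < len(links):
    teams.append(links[i])
    i += 2                 # jump to the next odd index
print(teams)  # ['Team A', 'Team B', 'Team C']
```

The [1::2] slice in the answer above does the same thing in one step; the while loop just makes the index arithmetic explicit.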

I am trying to import HTML text into a CSV with Beautiful Soup, but the script outputs a blank CSV

I have written the code below to get the data from the website into a CSV file.
Basically I am interested in text like this in entirety.
(Beskrivning:
PL-CH-DSPTC-AD10Semi-professional Technician / Administrator: Vocationally trained positions that need both practical and theoretical understanding and some significant advanced vocational experience to perform broad range of varying tasks and issues, in related field of work. Work performed is still procedurized, however issues and problems)
So my table should have one row for each description, please.
from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv

url = "http://www.altrankarlstad.com/wisp"
page = requests.get(url)
pagetext = page.text
soup = BeautifulSoup(pagetext, 'html.parser')
gdp_table = soup.find("table", attrs={"class": "table workitems-table mt-2"})

def table_to_df(table):
    return pd.DataFrame([[td.text for td in row.findAll('td')] for row in table.tbody.findAll('tr')])

file = open("data.csv", 'w')
for row in soup.find_all('tr'):
    for col in row.find_all('td'):
        print(col.text)
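One reason the file comes out blank: the posted script opens data.csv but never writes to it (and table_to_df is never called). A minimal sketch of writing the extracted rows with the csv module, using sample data in place of the scraped table (the column names are assumptions for illustration):

```python
import csv

# Sketch: csv.writer fills data.csv one row per table row. The nested lists
# stand in for the [[td.text for td in row] for row in rows] extraction above.
rows = [
    ['1', 'PL-CH-DSPTC-AD10', 'Semi-professional Technician / Administrator'],
    ['2', 'PL-CH-OTHER-AD11', 'Another description'],
]

with open('data.csv', 'w', newline='') as f:    # newline='' avoids blank lines on Windows
    writer = csv.writer(f)
    writer.writerow(['id', 'code', 'description'])  # header (column names assumed)
    writer.writerows(rows)
```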

I am trying to extract the text inside a span id, but getting blank output using Python BeautifulSoup

I am trying to extract the text inside a span-id tag but getting a blank output screen.
I have also tried using the parent div's text, but failed to extract it. Please, can anyone help me?
Below is my code.
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.paperplatemakingmachines.com/')
soup = BeautifulSoup(r.text,'lxml')
mob = soup.find('span',{"id":"tollfree"})
print(mob.text)
I want the text inside that span, which is the mobile number.
You'll have to use Selenium, as that text is not present in the initial response, or at least not without searching through <script> tags.
from bs4 import BeautifulSoup
from selenium import webdriver
import time

driver = webdriver.Chrome(r'C:\chromedriver_win32\chromedriver.exe')
url = 'https://www.paperplatemakingmachines.com/'
driver.get(url)
# It's better to use Selenium's WebDriverWait, but I'm still learning how to use that correctly
time.sleep(5)
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.close()
mob = soup.find('span', {"id": "tollfree"})
print(mob.text)
The data is actually rendered dynamically by a script. What you need to do is parse the number out of the script tag:
import requests
import re
from bs4 import BeautifulSoup
r = requests.get('https://www.paperplatemakingmachines.com/')
soup = BeautifulSoup(r.text,'lxml')
script= soup.find('script')
mob = re.search("(?<=pns_no = \")(.*)(?=\";)", script.text).group()
print(mob)
Another way of using a regex to find the number:
import requests
import re
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.paperplatemakingmachines.com/',)
soup = bs(r.content, 'lxml')
r = re.compile(r'var pns_no = "(\d+)"')
data = soup.find('script', text=r).text
script = r.findall(data)[0]
print('+91-' + script)

BeautifulSoup python: Get the text with no tags and get the adjacent links

I am trying to extract the movie titles and their links from this site.
from bs4 import BeautifulSoup
from requests import get
link = "https://tamilrockerrs.ch"
r = get(link).content
#r = open('json.html','rb').read()
b = BeautifulSoup(r,'html5lib')
a = b.findAll('p')[1]
But the problem is that there is no tag for the titles, so I can't extract them. And if I could, how can I bind the links and titles together?
Thanks in advance.
You can find the title and link this way:
from bs4 import BeautifulSoup
import requests

url = "http://tamilrockerrs.ch"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'html.parser')

data = soup.find_all('div', {"class": "title"})
for film in data:
    print("Title:", film.find('a').text)  # get the title here
    print("Link:", film.find('a').get("href"))  # get the link here
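To bind each title to its link (the second part of the question), the two values can be collected together as (title, href) pairs rather than printed separately. A sketch with inline HTML standing in for the fetched page, assuming the div.title structure shown in the answer above:

```python
from bs4 import BeautifulSoup

# Sketch: collect each film as a (title, href) tuple so the two values
# stay bound together. The inline HTML stands in for the fetched page.
html = """
<div class="title"><a href="/movie-one">Movie One</a></div>
<div class="title"><a href="/movie-two">Movie Two</a></div>
"""
soup = BeautifulSoup(html, 'html.parser')

films = [(div.a.text, div.a.get('href'))
         for div in soup.find_all('div', {'class': 'title'})]
print(films)  # [('Movie One', '/movie-one'), ('Movie Two', '/movie-two')]
```

From here, dict(films) gives a title-to-link mapping (assuming titles are unique).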