Beautiful Soup 4 Custom Attribute Order Output - python-3.x

I want to create a custom output formatter in BS4 that rearranges the attributes of tags in an XML document into a specific order that is not alphabetical.
For instance, I want to output the following tag:
<word form="συ" head="2610" id="2357" lemma="συ" postag="p-s----n-" relation="ExD_AP"/>
as:
<word id="2357" head="2610" postag="p-s----n-" form="συ" lemma="συ" relation="ExD_AP"/>
BS4's documentation offers a clue as to where to begin. They give the following example:
from bs4.formatter import HTMLFormatter

class UnsortedAttributes(HTMLFormatter):
    def attributes(self, tag):
        for k, v in tag.attrs.items():
            if k == 'm':
                continue
            yield k, v

print(attr_soup.p.encode(formatter=UnsortedAttributes()))
This makes a custom HTML output formatter that leaves attributes in the order they were input and also skips certain attributes, but I don't know how to alter it so that it outputs the attributes in whatever order I would like. Can anyone help me out?

How about this?
from simplified_scrapy.simplified_doc import SimplifiedDoc
html ='''
<word form="συ" head="2610" id="2357" lemma="συ" postag="p-s----n-" relation="ExD_AP"/>
'''
def toString(ele):
    order = ['id', 'head', 'postag', 'form', 'lemma', 'relation']
    result = '<' + ele.tag
    for p in order:
        result += ' {}="{}"'.format(p, ele[p])
    return result + '/>'

doc = SimplifiedDoc(html)
ele = doc.word
print(toString(ele))
Result:
<word id="2357" head="2610" postag="p-s----n-" from="None" lemma="συ" relation="ExD_AP"/>

Strictly speaking, I have an answer to my own question, but it's going to take more work to actually implement it in a way I'd like. Here's how to do it.
Make a subclass of XMLFormatter (or HTMLFormatter if you're working with HTML) and name it whatever you like; I chose "SortAttributes." Write the attributes method so that it returns a list of tuples, [(attribute1, value1), (attribute2, value2), ...], in the order you want. Mine may look verbose, but I do it this way because I work with very inconsistent XML.
from bs4 import BeautifulSoup
from bs4.formatter import XMLFormatter
class SortAttributes(XMLFormatter):
    def attributes(self, tag):
        """Reorder a tag's attributes however you want."""
        attrib_order = ['id', 'head', 'postag', 'relation', 'form', 'lemma']
        new_order = []
        for element in attrib_order:
            if element in tag.attrs:
                new_order.append((element, tag[element]))
        for pair in tag.attrs.items():
            if pair not in new_order:
                new_order.append(pair)
        return new_order
xml_string = '''
<word form="συ" head="2610" id="2357" lemma="συ" postag="p-s----n-" relation="ExD_AP"/>
'''
soup = BeautifulSoup(xml_string, 'xml')
print(soup.encode(formatter=SortAttributes()))
This will output what I want:
<word id="2357" head="2610" postag="p-s----n-" relation="ExD_AP" form="συ" lemma="συ"/>
Conveniently, I can do this for an entire document with the same encode method. But if I write that to a file as a string, all the tags end up strung together end to end. A sample looks like this:
<sentence id="783"><word id="2357" head="2610" postag="p-s----n-" relation="ExD_AP" form="συ" lemma="συ"/><word id="2358" head="2610" postag="p-s----n-" relation="ExD_AP" form="συ" lemma="συ"/><word id="2359" head="2610" postag="p-s----n-" relation="ExD_AP" form="συ" lemma="συ"/></sentence>
Instead of something I'd prefer:
<sentence id="783">
<word id="2357" head="2610" postag="p-s----n-" relation="ExD_AP" form="συ" lemma="συ"/>
<word id="2358" head="2610" postag="p-s----n-" relation="ExD_AP" form="συ" lemma="συ"/>
<word id="2359" head="2610" postag="p-s----n-" relation="ExD_AP" form="συ" lemma="συ"/>
</sentence>
To fix that, I can't just call .prettify() on it, because prettify() rearranges the attributes back into alphabetical order. I'll have to dig further into the XMLFormatter subclass instead. I hope someone finds this helpful in the future!
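One thing worth trying (a sketch, not something from the original post): prettify() accepts the same formatter argument as encode(), so passing the custom formatter there may give indented output while still keeping the custom attribute order.

soup = BeautifulSoup(xml_string, 'xml')
# prettify() takes the same formatter argument as encode(), so the
# SortAttributes order should survive the pretty-printing.
print(soup.prettify(formatter=SortAttributes()))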

Related

Script is not returning proper output when trying to retrieve data from a newsletter

I am trying to write a script that retrieves the album title and band name from a music store newsletter. The band name and album title are inside h3 and h4 elements. When I execute the script I get blank output in the CSV file.
from bs4 import BeautifulSoup
import requests
import pandas as pd
# Use the requests library to fetch the HTML content of the page
url = "https://www.musicmaniarecords.be/_sys/newsl_view?n=260&sub=Tmpw6Rij5D"
response = requests.get(url)
# Use the BeautifulSoup library to parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')
# Find all 'a' elements with the class 'row'
albums = soup.find_all('a', attrs={'class': 'row'})
# Iterate over the found elements and extract the album title and band name
album_title = []
band_name = []
for album in albums:
    album_title_element = album.find('td', attrs={'td_class': 'h3 class'})
    band_name_element = album.find('td', attrs={'td_class': 'h4 class'})
    album_title.append(album_title_element.text)
    band_name.append(band_name_element.text)
# Use the pandas library to save the extracted data to a CSV file
df = pd.DataFrame({'album_title': album_title, 'band_name': band_name})
df.to_csv('music_records.csv')
I think the error is in the attrs part, but I'm not sure how to fix it properly. Thanks in advance!
Looking at your code I agree that the error lies in the attrs part. The problem you are facing is that the site you are trying to scrape does not contain 'a' elements with the 'row' class. Thus find_all returns an empty list. There are plenty of 'div' elements with the 'row' class, maybe you meant to look for those?
You had the right idea by looking for 'td' elements and extracting their 'h3' and 'h4' elements, but since albums is an empty list, there are no elements to find.
I changed your code slightly to look for 'td' elements directly and extract their 'h3' and 'h4' elements. With these small changes your code found 29 albums.
from bs4 import BeautifulSoup
import requests
import pandas as pd
# Use the requests library to fetch the HTML content of the page
url = "https://www.musicmaniarecords.be/_sys/newsl_view?n=260&sub=Tmpw6Rij5D"
response = requests.get(url)
# Use the BeautifulSoup library to parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')
# Find all 'td' elements with the class 'block__cell'
albums = soup.find_all('td', attrs={'class': 'block__cell'})
# Iterate over the found elements and extract the album title and band name
album_title = []
band_name = []
for album in albums:
    album_title_element = album.find('h3')
    band_name_element = album.find('h4')
    album_title.append(album_title_element.text)
    band_name.append(band_name_element.text)
# Use the pandas library to save the extracted data to a CSV file
df = pd.DataFrame({'album_title': album_title, 'band_name': band_name})
df.to_csv('music_records.csv', index=False)
I also took the liberty of adding index=False to the last line of your code. This makes it so each row doesn't start with a comma.
Hope this helps.
from bs4 import BeautifulSoup
import requests
import pandas as pd
# Use the requests library to fetch the HTML content of the page
url = "https://www.musicmaniarecords.be/_sys/newsl_view?n=260&sub=Tmpw6Rij5D"
response = requests.get(url)
# Use the BeautifulSoup library to parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')
# Find all 'td' elements with the class 'block__cell'
albums = soup.find_all('td', attrs={'class': 'block__cell'})
# Iterate over the found elements and extract the album title and band name
album_title = []
band_name = []
for album in albums:
    album_title_element = album.find('h3', attrs={'class': 'header'})
    band_name_element = album.find('h4', attrs={'class': 'header'})
    album_title.append(album_title_element.text)
    band_name.append(band_name_element.text)
# Use the pandas library to save the extracted data to a CSV file
df = pd.DataFrame({'album_title': album_title, 'band_name': band_name})
df.to_csv('music_records.csv')
Thanks to the anonymous hero for helping out!
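One small robustness note (not from the original answers): if a matched cell ever lacks an h3 or h4, find() returns None and .text raises AttributeError. A minimal guard could look like this:

for album in albums:
    album_title_element = album.find('h3')
    band_name_element = album.find('h4')
    # Skip cells missing either heading instead of crashing
    if album_title_element is None or band_name_element is None:
        continue
    album_title.append(album_title_element.get_text(strip=True))
    band_name.append(band_name_element.get_text(strip=True))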

python: graphviz -- how to specify node render order

How can I ensure that nodes are rendered in the order they are defined in my source code? In my code, node 'A' is defined first, so I was hoping it would be rendered at the top of the graph, with the three 'child class' nodes appearing below it in a row (as they are in the embedded image). But as the image shows, the opposite has occurred. I am using Python 3.7.12 in a Jupyter Notebook.
from IPython.display import display
import graphviz
p = graphviz.Digraph(node_attr={'shape':'box'},format='png')
p.node(name='A', label='parent class')
with p.subgraph() as s:
    s.attr(rank='same')
    s.node(name='B', label='child class 1')
    s.node(name='C', label='child class 2')
    s.node(name='D', label='child class 3')
p.edges(['BA', 'CA', 'DA'])
p.render('a.gv', view=False)
with open('a.gv') as f:
    dot = f.read()
display(graphviz.Source(dot))
Does anyone know how to fix this?
OK, I just had to set the 'rankdir' graph attribute to 'BT', which I believe means 'bottom to top'.
Here's the updated code:
from IPython.display import display
import graphviz
p = graphviz.Digraph(graph_attr={'rankdir': 'BT'},
                     node_attr={'shape': 'box'},
                     format='png')
p.node(name='A', label='parent class')
with p.subgraph() as s:
    s.attr(rank='same')
    s.node(name='B', label='child class 1')
    s.node(name='C', label='child class 2')
    s.node(name='D', label='child class 3')
p.edges(['BA', 'CA', 'DA'])
p.render('a.gv', view=False)
with open('a.gv') as f:
    dot = f.read()
display(graphviz.Source(dot))
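As a side note (an assumption about the notebook workflow, not part of the original answer): in Jupyter the Digraph object can usually be displayed directly, so the render/re-read round trip through a.gv is only needed if you also want the file on disk.

# The graphviz.Digraph object has a notebook repr, so this renders inline.
display(p)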

How would I loop through each page and scrape the specific parameters?

#Import Needed Libraries
import requests
from bs4 import BeautifulSoup
import pprint
res = requests.get('https://news.ycombinator.com/news')
soup = BeautifulSoup(res.text, 'html.parser')
links = soup.select('.titlelink')
subtext = soup.select('.subtext')
def sort_stories_by_votes(hnlist):  # Sort the create_custom_hn list by votes
    return sorted(hnlist, key=lambda k: k['votes'], reverse=True)

def create_custom_hn(links, subtext):  # Creates a list of links and subtext
    hn = []
    for idx, item in enumerate(links):  # Needed because not every link has a score
        title = links[idx].getText()
        href = links[idx].get('href', None)
        vote = subtext[idx].select('.score')
        if len(vote):
            points = int(vote[0].getText().replace(' points', ''))
            if points > 99:  # Only append stories that are over 100 points
                hn.append({'title': title, 'link': href, 'votes': points})
    return sort_stories_by_votes(hn)

pprint.pprint(create_custom_hn(links, subtext))
My question is that this only covers the first page, which has just 30 stories.
How would I apply my web scraping method to each page, say the next 10 pages, while keeping the formatted code above?
The URL for each page is like this
https://news.ycombinator.com/news?p=<page_number>
Use a for loop to scrape the content from each page. The code below prints the contents of the first two pages; change the page_no range depending on your needs.
import requests
from bs4 import BeautifulSoup
import pprint
def sort_stories_by_votes(hnlist):  # Sort the create_custom_hn list by votes
    return sorted(hnlist, key=lambda k: k['votes'], reverse=True)

def create_custom_hn(links, subtext, page_no):  # Creates a list of links and subtext
    hn = []
    for idx, item in enumerate(links):  # Needed because not every link has a score
        title = links[idx].getText()
        href = links[idx].get('href', None)
        vote = subtext[idx].select('.score')
        if len(vote):
            points = int(vote[0].getText().replace(' points', ''))
            if points > 99:  # Only append stories that are over 100 points
                hn.append({'title': title, 'link': href, 'votes': points})
    return sort_stories_by_votes(hn)

for page_no in range(1, 3):
    print(f'Page: {page_no}')
    url = f'https://news.ycombinator.com/news?p={page_no}'
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    links = soup.select('.titlelink')
    subtext = soup.select('.subtext')
    pprint.pprint(create_custom_hn(links, subtext, page_no))
Page: 1
[{'link': 'https://www.thisworddoesnotexist.com/',
'title': 'This word does not exist',
'votes': 904},
{'link': 'https://www.sparkfun.com/news/3970',
'title': 'A patent troll backs off',
'votes': 662},
.
.
Page: 2
[{'link': 'https://www.vice.com/en/article/m7vqkv/how-fbi-gets-phone-data-att-tmobile-verizon',
'title': "The FBI's internal guide for getting data from AT&T, T-Mobile, "
'Verizon',
'votes': 802},
{'link': 'https://www.dailymail.co.uk/news/article-10063665/Government-orders-Google-track-searching-certain-names-addresses-phone-numbers.html',
'title': 'Feds order Google to track people searching certain names or '
'details',
'votes': 733},
.
.

Export scraped data to CSV

Can someone please explain to me how to export the scraped data from this script to a CSV through a Python script? It seems that I am successfully scraping the data, based on the output I am seeing, but I am not sure how to write it to a CSV efficiently. Thanks.
import scrapy
import scrapy.crawler as crawler

class RedditbotSpider(scrapy.Spider):
    name = 'redditbot'
    allowed_domains = ['www.reddit.com/r/gameofthrones/']
    start_urls = ['https://www.reddit.com/r/gameofthrones/']
    output = 'output.csv'

    def parse(self, response):
        yield {'a': 'b'}
        # Extract the content using css selectors
        titles = response.css('.title.may-blank::text').extract()
        votes = response.css('.score.unvoted::text').extract()
        times = response.css('time::attr(title)').extract()
        comments = response.css('.comments::text').extract()
        # Give the extracted content row wise
        for item in zip(titles, votes, times, comments):
            # Create a dictionary to store the scraped info
            scraped_info = {
                'title': item[0],
                'vote': item[1],
                'created_at': item[2],
                'comments': item[3],
            }
            # Yield the scraped info to scrapy
            yield scraped_info

def run_crawler(spider_cls):
    """
    spider_cls: Scrapy Spider class
    settings: Scrapy settings
    returns: Twisted Deferred
    """
    runner = crawler.CrawlerRunner()
    return runner.crawl(spider_cls)  # return Deferred

def test_scrapy_crawler():
    deferred = run_crawler(RedditbotSpider)

    @deferred.addCallback
    def success(results):
        """
        After the crawler completes, this function will execute.
        Do your assertions in this function.
        """

    @deferred.addErrback
    def error(failure):
        raise failure.value

    return deferred

test_scrapy_crawler()
You can include the feed exporter configuration in the settings before running the spider. So for your code, try changing:
runner = crawler.CrawlerRunner()
with
runner = crawler.CrawlerRunner({
    'FEED_URI': 'output_file.csv',
    'FEED_FORMAT': 'csv',
})
The output items should be inside the output_file.csv file in the same directory you run this script.
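As a side note, on newer Scrapy versions (2.1+) FEED_URI and FEED_FORMAT are deprecated in favor of the FEEDS setting; an equivalent configuration (a sketch, not from the original answer) would be:

runner = crawler.CrawlerRunner({
    # FEEDS maps each output URI to its exporter options
    'FEEDS': {
        'output_file.csv': {'format': 'csv'},
    },
})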

How to access the local path of a downloaded image Scrapy

Here is how I am downloading images. Now I have created one more pipeline to insert my scraped data.
class CmindexPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item['image_url']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        print("From Images Items", item)
        return item

class MysqlPipline(object):
    def process_item(self, item, spider):
        print("From Process Items", item['image_path'])
Here is my settings.py:
ITEM_PIPELINES = {
    'cmindex.pipelines.CmindexPipeline': 1,
    'cmindex.pipelines.MysqlPipline': 2,
}
IMAGES_STORE = 'E:\WorkPlace\python\cmindex\cmindex\img'
IMAGES_THUMBS = {
    '16X16': (16, 16),
}
But unfortunately I am still not able to access item['image_paths'] in process_item. It raises this error:
KeyError: 'image_paths'
If anyone knows what I am doing wrong, please let me know.
The process_item method is called before item_completed, so it does not have the image_paths yet.
If you want to access the image_paths, you will have to do it inside item_completed, or write another pipeline that is placed after the image pipeline.
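A rough sketch of that second approach (the class name and logging here are illustrative, assuming the ITEM_PIPELINES ordering from the question so that item_completed() has already attached 'image_paths' to the item):

class MysqlPipeline(object):
    """Runs after CmindexPipeline (higher number in ITEM_PIPELINES),
    so the item already carries 'image_paths'."""

    def process_item(self, item, spider):
        # Use .get() so an item without images doesn't raise KeyError
        image_paths = item.get('image_paths', [])
        for path in image_paths:
            # Hypothetical insert logic; replace with your actual MySQL code
            spider.logger.info("Saving image path to DB: %s", path)
        return item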
