Cleaning scraped data Python - python-3.x

I'm trying to learn how to scrape a website, but I'm not able to figure out how to "clean" the data when importing the output into Excel.
Here is the code I used:
However, when opening the Excel file, the output is in need of some cleaning:
I assume that I should put ".text" somewhere, but I don't know where. I tried adding .text as shown below, but it resulted in "AttributeError: 'NoneType' object has no attribute 'text'":
for i in links:
    index.append([i.attrs['title']]).text
    summary.append([i.attrs["aria-label"]]).text

The extra brackets tell Python to append a one-element list to the index and summary lists instead of the string itself. The AttributeError comes from the trailing .text: append() returns None, so you are calling .text on None.
Try this instead:
for i in links:
    index.append(i.attrs['title'])
    summary.append(i.attrs["aria-label"])
You're still going to have some ugly text in the summary column. You can use str.replace or a regular expression to clean it up. If you show the output you want, I can edit this answer to include the appropriate replacement code.
Also, index is the name of a built-in list method in Python, so I would choose a different name for that list.
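Since the question doesn't show the exact scraped strings, here is a sketch of the kind of clean-up the answer describes, using str.replace and re.sub on a made-up example string; adjust the patterns to your actual data:

```python
import re

# Made-up example of a messy scraped string.
raw = "  Question 12:  How to clean data?\nAsked 3 days ago  "

cleaned = raw.replace("\n", " ")        # drop embedded newlines
cleaned = re.sub(r"\s+", " ", cleaned)  # collapse runs of whitespace
cleaned = cleaned.strip()               # trim leading/trailing spaces
print(cleaned)  # Question 12: How to clean data? Asked 3 days ago
```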

Related

Add Bookmarks to pdf using Pymupdf

How do I add bookmarks to a PDF using PyMuPDF? I have seen many ways using PyPDF2, but since I'm already using PyMuPDF for other annotations, I would prefer PyMuPDF for adding bookmarks too. I would also like to highlight text and add bookmarks to it.
You cannot add single bookmarks like you can in other packages.
If you have looked at the details there, or rather in the respective PDF specification, you will see that this is an unnecessarily complex task.
PyMuPDF in contrast has this simple approach to offer:
Prepare a Python list that looks like a traditional table of contents (TOC):
Every entry in the list contains the hierarchy level, the text to display, and the page number, optionally plus some information about where on the target page the pointer should land.
Then use doc.set_toc(toc_list). All the pesky detail is taken care of for you.
If the PDF already has a TOC, extract it to a list of that same structure via toc_list = doc.get_toc().
Then modify as required.

Insert Hyperlink in Access Database (pyodbc)

Here is the situation: I'm 'having fun' using Microsoft Access for the first time, for some small personal project/tool ideas.
I don't know anything about VBA yet, and unless I can't do without it, I don't plan to learn it this time (there is already a lot else to cover).
So I tried to use Python to automate filling the main table. I found the pyodbc package and managed to connect to, read from, and write some data to my database.
However, I wanted to experiment a little further, and one of the fields could contain hyperlinks (it could be handled somewhere else in another script later, but I am curious about the functionality anyway).
I couldn't figure out how to insert hyperlink data into the table: I only get the display text set, but not the target.
Is this feasible using pyodbc, or am I on the wrong track?
Thanks in advance!
Emmanuel
The hyperlink field in MS Access consists of three parts, separated by #:
display text # filename or target # location within the document
So the data in such a field can look like this:
StackOverflow#http://www.stackoverflow.com#
See the docs: https://learn.microsoft.com/en-us/office/vba/api/access.application.hyperlinkpart
and samples here also: http://allenbrowne.com/casu-09.html
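So from Python you only need to build that three-part string and write it with an ordinary parameterized pyodbc statement. A small sketch (the table and column names in the comment are made up):

```python
def access_hyperlink(display, target, location=""):
    """Compose the display#target#location value Access stores in a Hyperlink field."""
    return f"{display}#{target}#{location}"

link = access_hyperlink("StackOverflow", "http://www.stackoverflow.com")
print(link)  # StackOverflow#http://www.stackoverflow.com#

# Then insert it like any other text value, e.g.:
# cursor.execute("UPDATE MyTable SET MyLink = ? WHERE Id = ?", link, 1)
# connection.commit()
```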

Python docx paragraph method is giving anomalous output

I am using python-docx for Word file processing. With larger files (50+ pages), the paragraph.text property returns a string that is inconsistent with my file.
from docx import Document

document = Document(f)
paratext = []
for paragraph in document.paragraphs:
    paratext.append(paragraph.text)
print(paratext[30])
Ideally this should print the 31st paragraph (index 30). But the output seems distorted: the first few characters are missing, and in some cases the printed output starts somewhere in the middle of the actual paragraph. However, it works fine if I copy the adjacent few paragraphs into a fresh one-page Word document and run the code with just the index of paratext changed. For example, I copied 3 adjacent paragraphs into a new document and used print(paratext[2]), and the output was perfect. How do I get rid of this inconsistency? I have to work with larger documents.
I expect this means that the missing text is in runs that are "enclosed" in some other XML element, like perhaps a field or a hyperlink.
The quickest way to discover specifically what's happening might be to modify your short script to temporarily capture the paragraph XML.
from docx import Document

document = Document(f)
p_xml = [paragraph._element.xml for paragraph in document.paragraphs]
print(p_xml[30])
Your choices at that point are likely to be editing the Word documents to remove the offending "enclosure" or to process XML for each paragraph yourself using lxml calls.
That might be easier than it sounds if you use the .xpath() method available on paragraph._element. In any case, that would be a separate question in which you show the XML you found with the method above.

List of all BibTex entries in a .bib file, to generate Hakyll publication list?

I'm making a personal website with Hakyll, and I'd like to list my publications.
I've found this module and this guide for how to print the references from a markdown document at the bottom.
The problem with this is, it assumes you've got some document, where you cite all the things you want printed.
What I want is to generate a document that lists every entry in my .bib file. In particular:
I don't want to have to manually write the bibtex name of each publication I want listed
I just want the "references" section printed, i.e. there's no place in the document where the publication is referenced, they're just listed at the end.
Is it possible to get this information from the Hakyll.Web.Pandoc.Biblio module, or do I need to parse the .bib file separately? And once I have it, how would I go about generating this page with Hakyll?
You could use this trick from pandoc's manual, the equivalent of biblatex's \nocite{*}:
It is possible to create a bibliography with all the citations,
whether or not they appear in the document, by using a wildcard:
---
nocite: |
  @*
---

Reading a file and looking for specific strings

Hey, so my question might be basic, but I am a little lost on how to implement it.
If I am reading a file, for example an HTML file, how do I grab specific parts of it? For example, given
blahblahblahblah<br>blahblahblah
how do I find the tag that starts with < and ends with >, and grab the string inside, which is br, in Python?
This is a very broad question; there are a couple of ways you could retrieve a single string from an HTML file.
The first option would be to parse the file with a library like BeautifulSoup; this option also works for XML files.
The second option: if the file is relatively small, you could use a regex to locate the string you want and return it.
The first option is what I would recommend. With a library like BeautifulSoup you get a lot of functionality, e.g. finding the parent element of a selected tag and so on.
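A standard-library-only sketch of the regex option, using the snippet from the question (fine for tiny, well-formed fragments; prefer a real parser such as BeautifulSoup for full pages):

```python
import re

html = "blahblahblahblah<br>blahblahblah"

# Capture the name inside the first <...> tag.
match = re.search(r"<(\w+)>", html)
if match:
    print(match.group(1))  # br
```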
