Python docx paragraph method is giving anomalous output

I am using python-docx for Word file processing. With larger files (50+ pages), the paragraph.text property returns a string that is inconsistent with my file.
import docx

document = docx.Document(f)  # f is the path to (or a file-like object for) the .docx file
paratext = []
for paragraph in document.paragraphs:
    paratext.append(paragraph.text)
print(paratext[30])
Ideally this should print the paragraph at index 30. But the output seems distorted: the first few characters are missing, and in some cases the printed output starts somewhere in the middle of the actual paragraph. However, it works fine if I copy the adjacent few paragraphs into a fresh MS Word document (1 page only) and run the code, changing only the index into paratext. For example, I copied 3 adjacent paragraphs into a new document and used print(paratext[2]), and the output was perfect. How do I get rid of this inconsistency, given that I have to work with larger documents?

I expect this means that the missing text is in runs that are "enclosed" in some other XML element, like perhaps a field or a hyperlink.
The quickest way to discover specifically what's happening might be to modify your short script to temporarily capture the paragraph XML.
import docx

document = docx.Document(f)
# Temporarily capture the raw XML of each paragraph to see what python-docx is skipping
p_xml = [paragraph._element.xml for paragraph in document.paragraphs]
print(p_xml[30])
Your choices at that point are likely to be either editing the Word documents to remove the offending "enclosure" or processing the XML for each paragraph yourself using lxml calls.
That might be easier than it sounds if you use the .xpath() method available on paragraph._element. In any case, that would be a separate question in which you show the XML you find with the method above.
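As an illustration of the .xpath() route, here is a minimal sketch (untested against your particular document) that collects the text of every w:t element in a paragraph, including text nested inside hyperlinks or fields, which plain paragraph.text can miss:
import docx

document = docx.Document(f)
paragraph = document.paragraphs[30]

# python-docx elements expose .xpath() with the "w:" namespace prefix
# pre-registered, so this gathers all text nodes no matter how deeply
# they are enclosed within the paragraph.
full_text = "".join(t.text or "" for t in paragraph._element.xpath(".//w:t"))
print(full_text)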

Related

Add Bookmarks to pdf using Pymupdf

How do I add bookmarks to a PDF using PyMuPDF? I have seen many ways using PyPDF2, but since I'm already using PyMuPDF for other annotations, I would prefer PyMuPDF for adding bookmarks. I would also like to highlight text and add bookmarks to it.
You cannot add single bookmarks like you can in other packages.
If you have looked at the details there, or rather in the respective PDF specification, you will see this is an overly and unnecessarily complex task.
PyMuPDF in contrast has this simple approach to offer:
Prepare a Python list that looks like a traditional table of contents (TOC):
Every entry in the list contains the hierarchy level, the text to display, and the page number; optionally also some information about where on the target page the pointer goes.
Then use doc.set_toc(toc_list). All the pesky detail is taken care of for you.
If the PDF already has a TOC, extract it to a list of that same structure via toc_list = doc.get_toc().
Then modify as required.
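A minimal sketch of that workflow (the file names are placeholders):
import fitz  # PyMuPDF

doc = fitz.open("input.pdf")

# Each TOC entry is [level, title, page]; levels and pages start at 1.
toc = doc.get_toc()                          # existing TOC, or [] if none
toc.append([1, "Appendix", doc.page_count])  # hypothetical new entry on the last page
doc.set_toc(toc)                             # PyMuPDF handles the PDF outline details

doc.save("output.pdf")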

Cleaning scraped data Python

I'm trying to learn how to scrape a website, but I'm not able to figure out how to "clean" the data when importing the output to Excel.
Here is the code I used:
However, when opening the Excel file, the output is in need of some cleaning:
I assume that I should put ".text" somewhere, but I don't know where. I tried adding .text as shown below, but it resulted in "AttributeError: 'NoneType' object has no attribute 'text'":
for i in links:
    index.append([i.attrs['title']]).text
    summary.append([i.attrs["aria-label"]]).text
The extra brackets tell Python to append a one-element list to the index and summary lists. Also, list.append() returns None, which is why calling .text on its result raises the AttributeError you saw.
Try this instead:
for i in links:
    index.append(i.attrs['title'])
    summary.append(i.attrs["aria-label"])
You're still going to have some ugly information in the summary column. You can use replace or a regex to clean this up. If you provide the output you desire, I can edit this to include the appropriate code for replacing characters.
Also, index is the name of a method in Python, so I would choose a different list name for that.
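For illustration, a minimal sketch of both cleanup approaches (the scraped value and the unwanted substrings here are hypothetical, since the desired output wasn't shown):
import re

raw = "Rating: 4.5 out of 5 stars"  # hypothetical scraped value

# str.replace for fixed substrings
cleaned = raw.replace(" out of 5 stars", "")

# or a regex to keep just the number
match = re.search(r"\d+(\.\d+)?", raw)
cleaned = match.group(0) if match else raw
print(cleaned)  # "4.5"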

Using a list for a feature in an ML model

I want to run a machine learning algorithm on some data, so I'm exporting the data into a file first.
But one of my features for the text I'm classifying is a list of tags,
and each text can have multiple tags, e.g. ["mystery", "thriller"].
Is it recommended that, when I write the data to my CSV file for export, I write that entire list as one feature (the "tags" feature)?
Or is it better to make a separate feature for each tag? The only problem then is that most examples will only have one tag, so the other feature columns for those would be blank.
So it seems like writing this list of tags as one feature makes the most sense, but then when parsing it for training, would I still treat every element of that list as its own feature or not?
If you do it as a single feature, just make sure to use a delimiter that won't occur in any of the tags and that also isn't a comma (as that will mess with the CSV format); something like | would probably do fine. When you go to build your models and read in that list of tags, you can then split it on that delimiter. In Java this would look like:
String[] tagList = inputString.split("\\|");
(Note the escaping: String.split() takes a regular expression, and | is a regex metacharacter, so a bare "|" would split on every character.)
I'm sure most languages have a similar method to do this.
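A minimal sketch of the same round trip in Python (the column names and file path are made up for illustration):
import csv

rows = [("some text", ["mystery", "thriller"]), ("other text", ["romance"])]

# Write: join the tag list into a single pipe-delimited field
with open("data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "tags"])
    for text, tags in rows:
        writer.writerow([text, "|".join(tags)])

# Read: split the field back into a list for training
with open("data.csv", newline="") as f:
    reader = csv.DictReader(f)
    for row in reader:
        tags = row["tags"].split("|")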

Reading a file and looking for specific strings

Hey, so my question might be basic, but I am a little lost on how to implement it.
If I am reading a file, for example an HTML file, how do I grab specific parts of it? For example, given
blahblahblahblah<br>blahblahblah
how do I find the tag that starts with < and ends with >, and grab the string inside (which is br), in Python?
This is a very broad question; there are a couple of ways you could retrieve a single string from an HTML file.
The first option would be to parse the file with a library like BeautifulSoup; this option is also valid for XML files.
The second option would be, if the file is relatively small, to use a regex to locate the string you want and return it.
The first option is what I would recommend: if you use a library like BeautifulSoup you get a lot of functionality, e.g. finding the parent element of a selected tag and so on.
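A minimal sketch of the BeautifulSoup option applied to the example snippet above:
from bs4 import BeautifulSoup

html = "blahblahblahblah<br>blahblahblah"
soup = BeautifulSoup(html, "html.parser")

# find_all(True) matches every tag; .name is the string inside < >
for tag in soup.find_all(True):
    print(tag.name)  # prints: br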

Scraping data into Stata

I have 40,000 HTML files. Each file has a table containing the profit & loss statement of a particular company.
I would like to scrape all these data into Stata. (Or alternatively, into an Excel/CSV file). The end product should be a Stata/Excel file containing a list of all companies and details of their balance sheet (revenue, profit, etc.)
May I know how this can be done? I tried Outwit but it doesn't seem good enough.
Stata is not exactly the best tool for the job. You would have to use low-level file commands to read the input text files and then parse out the relevant tables (again, using low-level string processing). Putting them into a data set is the easiest part; you can either
expand 2 in l
replace company = "parsed name" in l
replace revenue = parsed_revenue in l
etc., or use post mechanics. With some luck, you'd find some packages that may make it simpler, but I am not aware of any, and findit html does not seem to bring anything usable.
Stata is not the right tool for this job, though in principle it is possible. Personally, I have already done similar things: reading ASCII files into Stata, parsing them, and extracting information from them. I dumped the data into Stata using insheet and then treated the data with Stata's string functions. It was a bit cumbersome, and those files had quite a simple and clear structure. I don't want to imagine what happens when the files have a more complicated structure.
I think the best strategy is to use a scripting language such as Python, Perl, or Ruby to extract the information contained in the HTML tables. The results can easily be written into a CSV, Excel, or even a Stata (.dta) file.
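For example, a minimal Python sketch of that strategy using pandas (the directory name is a placeholder, and the exact table layout will vary across your 40,000 files):
import glob
import pandas as pd

frames = []
for path in glob.glob("statements/*.html"):
    # read_html returns one DataFrame per <table> in the file;
    # here we assume the first table is the profit & loss statement
    table = pd.read_html(path)[0]
    table["source_file"] = path
    frames.append(table)

combined = pd.concat(frames, ignore_index=True)
combined.to_stata("statements.dta")  # or combined.to_csv("statements.csv")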
You should use the Python BeautifulSoup package. It is very handy for extracting data from HTML files. Here is the link:
http://www.crummy.com/software/BeautifulSoup/
The documentation covers many commands, but only a few are important. The following are the important ones:
from bs4 import BeautifulSoup

# read the file
with open(file_name, 'r') as fp:
    data = fp.read()

# pass the data to BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')

# extract HTML elements by id (the id here is a placeholder) and write the result to a file
element = soup.find(id='target_id')