Unknown encoding of files in a resulting Beautiful Soup txt file

I downloaded 13,000 files (10-K reports from different companies) and I need to extract a specific part of each file (Section 1A, Risk Factors). The problem is that I can open these files in Word easily and they look perfect, while if I open them in a plain text editor, each document appears to be HTML with tons of encrypted-looking strings at the end (EDIT: I suspect this is due to the XBRL format of these files). The same thing happens in the output from BeautifulSoup.
I've tried an online decoder, because I thought this might be connected to Base64 encoding, but it seems that none of the known encodings help. I saw that at the beginning of some files there is something like "created with Certent Disclosure Management 6.31.0.1" (and other programs), so I thought maybe that causes the encoding. Nevertheless, Word is able to open these files, so I guess there must be a known key to it. This is a sample of the encoded data:
M1G2RBE#MN)T='1,SC4,]%$$Q71T3<XU#[AHMB9#*E1=E_U5CKG&(77/*(LY9
ME$N9MY/U9DC,- ZY:4Z0EWF95RMQY#J!ZIB8:9RWF;\"S+1%Z*;VZPV#(MO
MUCHFYAJ'V#6O8*[R9L<VI8[I8KYQB7WSC#DMFGR[E6+;7=2R)N)1Q\24XQ(K
MYQDS$>UJ65%MV4+(KBRHJ3HFIAR76#G/F$%=*9FOU*DM-6TSTC$Q\[C$YC$/
And here is a sample file from the 13,000 that I downloaded.
Below is the BeautifulSoup code that I use to extract the text. It does its job, but I need to find a clue to this encoded string and somehow decode it in the Python code below.
from bs4 import BeautifulSoup

# Parse the raw filing as HTML
with open("98752-TOROTEL INC-10-K-2019-07-23", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'html.parser')
print(soup.getText())

# Write the extracted text to a file (the with block closes it automatically)
with open("extracted_test.txt", "w", encoding="utf-8") as f:
    f.write(soup.getText())
What I want to achieve is decoding of this dummy string at the end of the file.

OK, this is going to be somewhat messy, but it will get you close enough to what you are looking for, without using regex (which is notoriously problematic with HTML). The fundamental problem you'll be facing is that EDGAR filings are VERY inconsistent in their formatting, so what works for one 10-Q (or 10-K or 8-K) filing may not work for a similar filing, even from the same filer. For example, the word 'item' may appear in lowercase, uppercase, or mixed case, hence the use of the .lower() string method, etc. So there's going to be some cleanup, under all circumstances.
Having said that, the code below should get you the RISK FACTORS sections from both filings (including the one which has none):
import requests
from bs4 import BeautifulSoup as bs

url = [one of these two]

response = requests.get(url)
soup = bs(response.content, 'html.parser')

# Filings link to their sections with anchor tags whose attributes mention the item number
risks = soup.find_all('a')
for risk in risks:
    if 'item' in str(risk.attrs).lower() and '1a' in str(risk.attrs).lower():
        # Print everything after the Item 1A anchor until the next item begins
        for i in risk.findAllNext():
            if 'item' in str(i.attrs).lower():
                break
            else:
                print(i.text.strip())
Good luck with your project!

Related

When I parse a large XML sitemap on Beautifulsoup in Python, it only parses part of the file

I have written code that pulls the URLs out of a very large sitemap XML file (10 MB) using Beautiful Soup, and it works exactly how I want it to, but it only seems to process a small amount of the overall file. This is my code:
sitemap = "sitemap1.xml"

from bs4 import BeautifulSoup as bs
import lxml

with open(sitemap, "r") as file:
    # Read each line in the file; readlines() returns a list of lines
    content = file.readlines()
    # Combine the lines in the list into a single string
    content = "".join(content)

bs_content = bs(content, "xml")
results = bs_content.find_all("loc")
for result in results:
    print(result.text)
I have changed my IDE settings to allow for larger files, but it just seems to start at a random point towards the end of the XML file and only extracts from there on.
I just wanted to say I ended up sorting this out. I used the read_xml function in pandas and it worked well; the original XML file was corrupted.
... I also realised that the console was only printing from a certain point because the file is so large, and it was actually still processing the whole file.
Sorry about this - I'm new :)
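For reference, here is a minimal sketch of that pandas route, assuming pandas 1.3 or later (where read_xml was added) and a sitemap that uses the standard sitemap namespace:
import pandas as pd

# Sitemaps declare a default XML namespace, so map a prefix to it for the XPath
df = pd.read_xml(
    "sitemap1.xml",
    xpath="//sm:url",
    namespaces={"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"},
)

# Each <url> element becomes a row; its <loc> child should end up in the "loc" column
for url in df["loc"]:
    print(url)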

Why does find_next_sibling in bs4 work on one line of code but not another, very similar, line of code?

I'm writing a simple web scraper to get data from the Texas Commission on Environmental Quality (TCEQ) website. The info I need is inside 'td' tags. I'm selecting the appropriate 'td' by referencing the preceding 'th', each of which has consistent text I can use to identify it. I'm using find_next_sibling to scrape the data into a variable.
Here is my code:
import requests
from bs4 import BeautifulSoup
URL = "https://www2.tceq.texas.gov/oce/eer/index.cfm?fuseaction=main.getDetails&target=323191"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html.parser')
###This one works
report = soup.find("th", text="Incident Tracking Number:").find_next_sibling("td").text
###This one doesn't
owner = soup.find("th", text="Name of Owner or Operator:").find_next_sibling("td").text
I'm getting this error: AttributeError: 'NoneType' object has no attribute 'find_next_sibling'. This code has several lines like the two above, and, like them, some of them work and some of them don't. I've looked into the HTML to see if there's another tag, but I'm not seeing it if it's there. Please and thank you for any help!
When using the text parameter, you should make sure you provide the text exactly. In your case, there's a space at the end.
soup.find('th', text='Name of Owner or Operator: ').find_next_sibling('td').text
This prints:
\n \n \n \n \n PHILLIPS 66 COMPANY\n \n \n
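If you would rather not have to reproduce the trailing space exactly, one alternative (a sketch, assuming a reasonably recent bs4 where string= is the preferred name for the older text= argument) is to match the header with a small function and strip the result:
import requests
from bs4 import BeautifulSoup

URL = "https://www2.tceq.texas.gov/oce/eer/index.cfm?fuseaction=main.getDetails&target=323191"
soup = BeautifulSoup(requests.get(URL).content, 'html.parser')

# Match the <th> by its stripped text so stray spaces or newlines don't matter
th = soup.find("th", string=lambda s: s and s.strip() == "Name of Owner or Operator:")
owner = th.find_next_sibling("td").get_text(strip=True)
print(owner)  # e.g. PHILLIPS 66 COMPANY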

PyPDF2 encoding issues

I'm having some trouble working out why the extracted text doesn't match what the PDF actually shows, and whether there are any tricks I could use to fix this, as it's not an isolated issue.
import PyPDF2

# file is the path to the PDF and x is the page index
with open(file, 'rb') as f:
    binary = PyPDF2.pdf.PdfFileReader(f)
    text = binary.getPage(x).extractText()
    print(text)
file: "I/O filters, 292–293"
output: "I/O Þlters, 292Ð293"
The Ð seems to stand in for all instances of '–' and Þ seems to be used for all instances of "fi".
I am using Windows CMD as my output for testing, and I know some characters don't show up right, but that leaves me baffled by something like the 'fi'.
The text extraction of PyPDF2 was massively improved in versions 2.x. The whole project moved to pypdf.
I recommend you give it another try: https://pypdf.readthedocs.io/en/latest/user/extract-text.html
from pypdf import PdfReader
reader = PdfReader("example.pdf")
page = reader.pages[0]
print(page.extract_text())

Read text files from website with Python

Hello, I have a problem. I want to get all the data from a web page, but it is too huge to save to a variable. I read the data like this:
from urllib.request import urlopen
from bs4 import BeautifulSoup

r = urlopen("http://download.cathdb.info/cath/releases/all-releases/v4_2_0/cath-classification-data/cath-domain-list-v4_2_0.txt")
r = BeautifulSoup(r, "lxml")
r = r.p.get_text()
# some operations
This was working fine until I had to get data from this website:
http://download.cathdb.info/cath/releases/all-releases/v4_2_0/cath-classification-data/cath-domain-description-file-v4_2_0.txt
When I run the same code as above on this page, my program stops at the line
r = BeautifulSoup(r, "lxml")
and it takes forever; nothing happens. I don't know how to get at all this data without saving it to a file, so that I can search it for keywords and print them. I can't save it to a file; I have to get it from the website.
I will be very thankful for any help.
I think the code below can do what you want. As mentioned in a comment by @alecxe, you don't need to use BeautifulSoup. This is really a question of retrieving the contents of a text file online, and it is answered in "Given a URL to a text file, what is the simplest way to read the contents of the text file?"
from urllib.request import urlopen

r = urlopen("http://download.cathdb.info/cath/releases/all-releases/v4_2_0/cath-classification-data/cath-domain-list-v4_2_0.txt")
for line in r:
    do_something(line)  # placeholder for your own processing
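If the goal is just to scan the big file for keywords and print matching lines, a minimal sketch of that loop might look like this (the keyword list below is only an example, not from the original question):
from urllib.request import urlopen

URL = ("http://download.cathdb.info/cath/releases/all-releases/v4_2_0/"
       "cath-classification-data/cath-domain-description-file-v4_2_0.txt")

keywords = ["DOMAIN", "CATHCODE"]  # example search terms, replace with your own

with urlopen(URL) as response:
    for raw_line in response:
        # The response yields bytes, so decode each line before searching it
        line = raw_line.decode("utf-8", errors="replace")
        if any(keyword in line for keyword in keywords):
            print(line.rstrip())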

Word Frequency in a WikiPedia Article

How can I get the frequency of a specified word in a Wikipedia article without storing the whole article and then processing it? For example, how many times does the word "India" occur in this article: https://simple.wikipedia.org/wiki/India
Here's a simple-minded example that reads the web page line by line. But there is no guarantee the HTML is broken into lines. (It is in this case, over 1300 of them.)
import re
import urllib.request
from collections import Counter

URL = 'https://simple.wikipedia.org/wiki/India'
counter = Counter()

with urllib.request.urlopen(URL) as source:
    for line in source:
        words = re.split(r"[^A-Z]+", line.decode('utf-8'), flags=re.I)
        counter.update(words)

for word in ['India', 'Indian', 'Indians']:
    print('{}: {}'.format(word, counter[word]))
OUTPUT
> python3 test.py
India: 547
Indian: 75
Indians: 11
>
This also counts terms if they appear in the HTML structure of the page, not just the content.
If you want to focus on content, consider the Pywikibot python library which uses the preferred MediaWiki API to extract content, though it appears to be based on a "complete page at a time" model which you noted you're trying to avoid. Regardless, that module's documentation points to a list of similar, but more advanced, packages that you might want to review.
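As a rough illustration of the Pywikibot route (a sketch only, assuming the library is installed and configured with a user-config.py; the counting is a plain word split over the article's wikitext, so the numbers will differ from the HTML-based count above):
import re
from collections import Counter

import pywikibot

site = pywikibot.Site('simple', 'wikipedia')
page = pywikibot.Page(site, 'India')

# page.text is the article's wikitext, not the rendered HTML
words = re.split(r"[^A-Za-z]+", page.text)
counter = Counter(words)

for word in ['India', 'Indian', 'Indians']:
    print('{}: {}'.format(word, counter[word]))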
