I have 40,000 HTML files. Each file has a table containing the profit & loss statement of a particular company.
I would like to scrape all these data into Stata. (Or alternatively, into an Excel/CSV file). The end product should be a Stata/Excel file containing a list of all companies and details of their balance sheet (revenue, profit, etc.)
May I know how this can be done? I tried Outwit but it doesn't seem good enough.
Stata is not exactly the best tool for the job. You would have to use low-level file commands to read the input text files, and then parse out the relevant tables (again, using low-level string processing). Putting them into a data set is the easiest part; you can either
expand 2 in l
replace company = "parsed name" in l
replace revenue = parsed_revenue in l
etc., or use post mechanics. With some luck, you'd find some packages that may make it simpler, but I am not aware of any, and findit html does not seem to bring anything usable.
Stata is not a good tool for this job, although in principle it is possible. I have done similar things myself: reading ASCII files into Stata, parsing them, and extracting information from them. I dumped the data into Stata using insheet and then processed it with Stata's string functions. It was a bit cumbersome, and those files had quite a simple and clear structure. I don't want to imagine what happens when the files have a more complicated structure.
I think that the best strategy is to use a scripting language such as Python, Perl or Ruby to extract the information contained in the HTML tables. The results can easily be written into a CSV, Excel or even a Stata (.dta) file.
You should use Python's BeautifulSoup package. It is very handy for extracting data from HTML files. Here is the link:
http://www.crummy.com/software/BeautifulSoup/
The documentation lists many commands, but only a few are important. These are the important ones:
from bs4 import BeautifulSoup
# read the file
with open(file_name, 'r') as fp:
    data = fp.read()
# pass the data to BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
#extract the html elements by id and write result into file
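One way the extraction and writing step could look for table data is sketched below. It uses the standard library's html.parser instead of BeautifulSoup so that it runs without installing anything. The names TableRows and html_table_to_csv and the sample markup are assumptions made for illustration; the real files will need to be inspected to see how their tables are laid out.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
            self._cell = None
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_table_to_csv(html_text, out):
    """Parse html_text and write its table rows to the file-like object out."""
    parser = TableRows()
    parser.feed(html_text)
    parser.close()
    csv.writer(out).writerows(parser.rows)
    return parser.rows


# Hypothetical sample input standing in for one of the HTML files.
sample = """
<table>
  <tr><th>Item</th><th>Amount</th></tr>
  <tr><td>Revenue</td><td>1,200</td></tr>
  <tr><td>Profit</td><td>300</td></tr>
</table>
"""
buffer = io.StringIO()
rows = html_table_to_csv(sample, buffer)
```

For the 40,000 files, the same function could be called in a loop over the file names, writing to one open CSV file instead of the in-memory buffer used here.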
For a university project, I am trying to parse the clinical recommendations for three or four diseases.
Basically, from https://www.uspreventiveservicestaskforce.org/BrowseRec/Index/browse-
I would like to parse and export into Excel the table head (Name, Type, Year, Age Group) and then populate it with the diseases and, more importantly, with the information available behind each link (Population, Recommendation, Grade).
The problem is that I do not know how to parse the information behind the links. For example, take the first disease link (Abdominal Aortic Aneurysm: Screening), which leads to the page with the information I need - https://www.uspreventiveservicestaskforce.org/Page/Document/UpdateSummaryFinal/abdominal-aortic-aneurysm-screening
Is Beautiful Soup the go to solution? I am a newbie to this, so any help is highly appreciated. Many thanks!
What you have to do is
use python-requests to get the index page
use BeautifulSoup to parse the page's content and extract the urls you're interested in
for each of those urls, use requests again to get "disease" page, then BeautifulSoup again to extract the data you're interested in
use the csv module to write those data into a .csv file, that can be opened by Excel (or any other similar program like OpenOffice etc).
So in pseudocode:
get the index content
for each disease_url in the index content:
    get the disease page content
    retrieve data from the page content
    write data to csv
All of those packages are rather well documented, so you shouldn't have too many issues implementing this in Python.
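The steps above could be sketched roughly as follows. To keep the example runnable without extra packages, it uses urllib in place of requests and two simple regular expressions in place of BeautifulSoup; that is a simplification for illustration, not a recommendation over the libraries named above. The function names, the url/title columns, and the example.test pages are all made up.

```python
import csv
import io
import re
from urllib.parse import urljoin
from urllib.request import urlopen


def fetch(url):
    """Download a page and return its text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_links(index_html, base_url):
    """Return absolute URLs for every double-quoted href in the index page."""
    hrefs = re.findall(r'href="([^"]+)"', index_html)
    return [urljoin(base_url, href) for href in hrefs]


def extract_title(page_html):
    """Stand-in for the real per-page extraction: grab the <title> text."""
    match = re.search(r"<title>(.*?)</title>", page_html, re.S | re.I)
    return match.group(1).strip() if match else ""


def scrape(index_url, out, fetch=fetch):
    """Follow each link on the index page and write one CSV row per page."""
    writer = csv.writer(out)
    writer.writerow(["url", "title"])
    for url in extract_links(fetch(index_url), index_url):
        writer.writerow([url, extract_title(fetch(url))])


# Offline demonstration with made-up pages, so nothing is downloaded.
fake_site = {
    "http://example.test/index": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": "<title>Page A</title>",
    "http://example.test/b": "<title>Page B</title>",
}
buffer = io.StringIO()
scrape("http://example.test/index", buffer, fetch=fake_site.__getitem__)
```

In a real run, extract_title would be replaced by code that pulls Population, Recommendation and Grade out of each disease page, and out would be a file opened with open("results.csv", "w", newline="").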
I want to run a machine learning algorithm on some data, so I'm exporting the data into a file first.
But one of my features for the text I'm classifying is a list of tags,
and each text can have multiple tags, e.g. ["mystery", "thriller"].
When I write my CSV file for exporting the data, is it recommended to write that entire list as one of the features (the "tags" feature)?
Or is it better to make a separate feature for each tag? The only problem then is that most examples will only have one tag, so the other feature columns for those will be blank.
So it seems like writing this list of tags as one feature makes the most sense, but then when parsing it for training, would I then treat every element of that list as its own feature still or no?
If you do it as a single feature, just make sure to separate the tags with a delimiter that won't occur in any of the tags and also isn't a comma (as that will mess with the CSV format). Something like | would probably do fine. When you go to build your models and read in that list of tags, you can then split it on that delimiter. In Java, String.split takes a regular expression, so the pipe has to be escaped:
String[] tagList = inputString.split("\\|");
I'm sure most languages will have a similar method to do this.
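In Python, for instance, the round trip might look like the sketch below. The sample texts and the DELIMITER constant are made up, and turning each tag into its own binary column (a multi-hot encoding) is one common way to answer the "own feature or not" question, not the only one.

```python
import csv
import io

DELIMITER = "|"  # assumed never to appear inside a tag

# Made-up examples: (text, list of tags)
examples = [
    ("A body is found at dawn", ["mystery", "thriller"]),
    ("Two strangers meet in Paris", ["romance"]),
]

# Export: join each tag list into a single CSV field.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text", "tags"])
for text, tags in examples:
    writer.writerow([text, DELIMITER.join(tags)])

# Import: split the field back into a list of tags.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
tag_lists = [row["tags"].split(DELIMITER) for row in rows]

# One binary feature per distinct tag (multi-hot encoding).
vocabulary = sorted({tag for tags in tag_lists for tag in tags})
features = [[int(tag in tags) for tag in vocabulary] for tags in tag_lists]
```

This keeps the CSV to a single "tags" column while still giving the model one column per tag at training time, so no blank columns need to be stored in the file.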
Hey so my question might be basic but I am a little lost on how to implement it.
Suppose I am reading a file, for example an HTML file. How do I grab a specific part of it? For example, given
blahblahblahblah<br>blahblahblah
how do I find the tag that starts with < and ends with > and grab the string inside, which is br, in Python?
This is a very broad question; there are a couple of ways you could retrieve a single string from an HTML file.
The first option would be to parse the file with a library like BeautifulSoup. This option works for XML files too.
The second option, if the file is relatively small, would be to use a regex to locate the string you want and return it.
The first option is what I would recommend. If you use a library like BeautifulSoup, you get a lot of functionality, e.g. finding the parent element of a selected tag and so on.
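Both options can be sketched on the sample string from the question. The parser variant below uses the standard library's html.parser rather than BeautifulSoup so that it runs as-is; TagCollector is a made-up class name.

```python
import re
from html.parser import HTMLParser

text = "blahblahblahblah<br>blahblahblah"

# Second option: a regular expression that captures the tag name
# between "<" and ">" (optionally preceded by "/" for closing tags).
tag_names = re.findall(r"</?([A-Za-z][A-Za-z0-9]*)[^>]*>", text)


# First option, using a parser: record every opening tag as it is encountered.
class TagCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(text)
collector.close()
```

Both tag_names and collector.tags end up holding the single name br for this input. The regex is fine for simple snippets like this one, but it tends to break on comments, scripts and attribute values containing >, which is where a real parser pays off.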
I am aware that Gutenberg (a company providing public domain books) does not allow automatic access of their website, they do however provide them in a 'machine readable format' just for that purpose, specifically RDF. I, being new, have never heard of this format, and googling hasn't helped much. I have acquired the rdflib module that I quite frankly have no idea what to do with.
What I am trying to do is extract the text which I assume is legally accessible through the RDF files that I downloaded. In the rdf file there is, among others, this line:
<dcterms:hasFormat rdf:resource="http://www.gutenberg.org/ebooks/100.txt.utf-8"/>
It leads to the Gutenberg page with the text file of the book, from where I assume the program can get the text, though I'm not sure since I don't see the distinction between directly scraping their site, and scraping it through the RDF file.
So, if the text is at all accessible programmatically, how would I do it?
You won't find the full text in the RDF catalog from Project Gutenberg. It does contain URLs for the text in several formats, though. Once you've downloaded the catalog zip file and unzipped it, here's how to get the HTML book URL from a particular RDF file.
filename = 'cache/epub/78/pg78.rdf'
from lxml import etree
rdf = open(filename, 'rb').read()  # read bytes: lxml rejects str input that carries an XML encoding declaration
tree = etree.fromstring(rdf)
resource_tag = '{http://www.w3.org/1999/02/22-rdf-syntax-ns#}resource'
hasFormat_tag = './/{http://purl.org/dc/terms/}hasFormat'
resources = [el.attrib[resource_tag] for el in tree.findall(hasFormat_tag)]
urls = [url for url in resources if url.endswith('htm')]
# urls[0] is 'http://www.gutenberg.org/files/78/78-h/78-h.htm'
Once you have the URL of the HTML version of the book you want, here's how to grab the text.
import requests
from lxml import etree
response = requests.get(urls[0])
html = etree.HTML(response.text)
text = '\n'.join(el.text for el in html.findall('.//p') if el.text)  # skip <p> elements with no leading text
text now contains the full text of Tarzan, minus the Project Gutenberg metadata, table of contents, and chapter headings.
>>> text[:100]
u'\r\nI had this story from one who had no business to tell it to me, or to\r\nany other. I may credit th'
Note that there are inconsistencies between books on Gutenberg so your results may vary.
I am new to this topic. My requirement is to parse documents of different types (HTML, PDF, TXT) using a crawler. Please suggest which crawler to use and point me to some tutorials or examples showing how to parse documents with it.
Thank you.
This is a very broad question, so my answer is also very broad and only touches the surface.
It all comes down to two steps, (1) extracting the data from its source, and (2) matching and parsing the relevant data.
1a. Extracting data from the web
There are many ways to scrape data from the web. Different strategies can be used depending on whether the source is static or dynamic.
If the data is on static pages, you can download the HTML source for all the pages (automated, not manually) and then extract the data out of the HTML source. Downloading the HTML source can be done with many different tools (in many different languages); even a simple wget or curl will do.
If the data is on a dynamic page (for example, if it sits behind a form that runs a database query before anything is shown), then a good strategy is to use an automated web scraping or testing tool. There are many of these.
See this list of Automated Data Collection resources [1]. If you use such a tool, you can extract the data right away; you usually don't have the intermediate step of explicitly saving the HTML source to disk and then parsing it afterwards.
1b. Extracting data from PDF
Try Tabula first. It's an open source web application that lets you visually extract tabular data from PDFs.
If your PDF doesn't have its data neatly structured in simple tables or you have way too much data for Tabula to be feasible, then I recommend using the *NIX command-line tool pdftotext for converting Portable Document Format (PDF) files to plain text.
Use the command man pdftotext to see the manual page for the tool. One useful option is the -layout option which tries to preserve the original layout in the text output. The default option is to "undo" the physical layout of the document, and instead output the text in reading order.
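If the conversion needs to run over many files, pdftotext can be driven from a script. The sketch below assumes pdftotext is installed and on the PATH; pdftotext_command and pdf_to_text are hypothetical helper names, and the file names are placeholders.

```python
import subprocess


def pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Build the argument list for pdftotext."""
    command = ["pdftotext"]
    if keep_layout:
        command.append("-layout")  # try to preserve the physical layout
    command += [pdf_path, txt_path]
    return command


def pdf_to_text(pdf_path, txt_path, keep_layout=True):
    """Run pdftotext; raises CalledProcessError if the tool reports failure."""
    subprocess.run(pdftotext_command(pdf_path, txt_path, keep_layout), check=True)


# Building the command does not require pdftotext to be installed.
command = pdftotext_command("report.pdf", "report.txt")
```

Calling pdf_to_text("report.pdf", "report.txt") would then write the plain text next to the PDF, ready for the parsing step.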
1c. Extracting data from spreadsheet
Try xls2text for converting to text.
2. Parsing the (HTML/text) data
For parsing the data, there are also many options. For example, you can use a combination of grep and sed, or the BeautifulSoup Python library if you're dealing with HTML source. Don't limit yourself to these options, though; you can use any language or tool that you're familiar with.
When you're parsing and extracting the data, you're essentially doing pattern matching.
Look for unique patterns that make it easy to isolate the data you're after.
One method of course is regular expressions. Say I want to extract email addresses from a text file named file.
egrep -io "\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b" file
The above command will print the email addresses [2]. If you instead want to save them to a file, append > filename to the end of the command.
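A pattern of the same shape can also be used from Python if the extraction is part of a larger script. This is only a sketch with a made-up sample string, and it shares the limitations of the simple pattern (for instance, top-level domains longer than four letters are not matched).

```python
import re

# Case-insensitive pattern for simple email addresses.
EMAIL = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b", re.IGNORECASE)

sample = "Contact alice@example.com or Bob.Smith@mail.example.org for details."
emails = EMAIL.findall(sample)
```

Here emails holds the two addresses found in the sample; to process a file, read its contents into a string and pass that to EMAIL.findall instead.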
[1] Note that this list is not an exhaustive list. It's missing many options.
[2] This regular expression isn't bulletproof, there are some extreme cases it will not cover.
Alternatively, you can use a script that I've created which is much better for extracting email addresses from text files. It's more accurate at finding email addresses, easier to use, and you can pass it multiple files at once. You can access it here: https://gist.github.com/dideler/5219706