I am aware that Project Gutenberg (a site providing public domain books) does not allow automated access to its website; however, it does provide its catalog in a 'machine readable format' for exactly that purpose, specifically RDF. Being new to this, I had never heard of the format, and googling hasn't helped much. I have installed the rdflib module, but I frankly have no idea what to do with it.
What I am trying to do is extract the text, which I assume is legally accessible through the RDF files that I downloaded. In the RDF file there is, among others, this line:
<dcterms:hasFormat rdf:resource="http://www.gutenberg.org/ebooks/100.txt.utf-8"/>
It leads to the Gutenberg page with the text file of the book, from which I assume the program can get the text, though I'm not sure, since I don't see the distinction between scraping their site directly and scraping it through the RDF file.
So, if the text is at all accessible programmatically, how would I do it?
You won't find the full text in the RDF catalog from Project Gutenberg. It does contain URLs for the text in several formats, though. Once you've downloaded the catalog zip file and unzipped it, here's how to get the HTML book URL from a particular RDF file.
from lxml import etree

filename = 'cache/epub/78/pg78.rdf'
rdf = open(filename).read()
tree = etree.fromstring(rdf)

# fully qualified (namespaced) names of the attribute and element we want
resource_tag = '{http://www.w3.org/1999/02/22-rdf-syntax-ns#}resource'
hasFormat_tag = './/{http://purl.org/dc/terms/}hasFormat'
resources = [el.attrib[resource_tag] for el in tree.findall(hasFormat_tag)]
urls = [url for url in resources if url.endswith('htm')]
# urls[0] is 'http://www.gutenberg.org/files/78/78-h/78-h.htm'
Once you have the URL of the HTML version of the book you want, here's how to grab the text.
import requests
from lxml import etree

response = requests.get(urls[0])
html = etree.HTML(response.text)
# el.text is None for paragraphs that start with nested markup, so filter those out
text = '\n'.join([el.text for el in html.findall('.//p') if el.text])
text now contains the full text of Tarzan, minus the Project Gutenberg metadata, table of contents, and chapter headings.
>>> text[:100]
u'\r\nI had this story from one who had no business to tell it to me, or to\r\nany other. I may credit th'
Note that there are inconsistencies between books on Gutenberg so your results may vary.
For a university project, I am trying to parse, for 3 to 4 diseases, the clinical recommendations to follow.
Basically, from https://www.uspreventiveservicestaskforce.org/BrowseRec/Index/browse-
I would like to parse and export into Excel the table head (Name, Type, Year, Age Group) and then populate it with the diseases, but also, more importantly, with the information available inside each link (Population, Recommendation, Grade).
The problem is that I do not know how to parse the information inside the links. For example, take the first disease link (Abdominal Aortic Aneurysm: Screening); that is the page with the information I need - https://www.uspreventiveservicestaskforce.org/Page/Document/UpdateSummaryFinal/abdominal-aortic-aneurysm-screening
Is Beautiful Soup the go-to solution? I am a newbie to this, so any help is highly appreciated. Many thanks!
What you have to do is:
use python-requests to get the index page
use BeautifulSoup to parse the page's content and extract the URLs you're interested in
for each of those URLs, use requests again to get the "disease" page, then BeautifulSoup again to extract the data you're interested in
use the csv module to write those data into a .csv file, which can be opened by Excel (or any other similar program like OpenOffice etc.).
So in pseudocode:
get the index content
for each disease_url in the index content:
    get the disease page content
    retrieve data from the page content
    write data to csv
All of those packages are rather well documented, so you shouldn't have too many issues implementing this in Python.
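As a starting point, the steps above could be sketched in Python roughly like this (the link selector and the Name/URL columns are assumptions to illustrate the flow - inspect the real page markup, and extend each row with Population, Recommendation and Grade once you know which elements hold them):

```python
import csv
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def extract_links(index_html, base_url):
    """Return (title, absolute URL) pairs for the links found in the index page."""
    soup = BeautifulSoup(index_html, "html.parser")
    return [(a.get_text(strip=True), urljoin(base_url, a["href"]))
            for a in soup.select("a[href]")]

def scrape_to_csv(index_url, out_path):
    """Fetch the index, follow every disease link, and dump the results to CSV."""
    links = extract_links(requests.get(index_url).text, index_url)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Name", "URL"])  # add Population, Recommendation, Grade here
        for title, url in links:
            detail_page = BeautifulSoup(requests.get(url).text, "html.parser")
            # ...pull Population/Recommendation/Grade out of detail_page here...
            writer.writerow([title, url])
```

The index parsing is kept in its own function so you can test it on saved HTML without hitting the network.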
I am trying to create a crawler that can read a PDF and extract certain information from it (to save in a database).
However, I am unsure which method/tool to use.
My initial thought was to use PhantomJS, but after reading a lot it doesn't seem to have the capabilities. If I wanted to use PhantomJS I would have to download the PDF, convert it into an HTML page and then crawl it using Phantom, which seems like a tedious task that should be doable faster.
So my question is: how can I read a PDF from an online source and gather these pieces of information?
If you are not limited in terms of programming language, consider using iText.
It can easily extract all the text from a given PDF document. It also offers utility methods to look for regular expressions within a file, giving you back the exact location (coordinates) and the matching text.
iText is available for both C# and Java lovers.
import java.io.File;
import com.itextpdf.kernel.pdf.*;
import com.itextpdf.kernel.pdf.canvas.parser.PdfTextExtractor;

File inputFile = new File("");  // path to your PDF file
PdfDocument pdfDocument = new PdfDocument(new PdfReader(inputFile));
String content = PdfTextExtractor.getTextFromPage(pdfDocument.getPage(1));
Have a look at the website to learn more.
http://developers.itextpdf.com/content/itext-7-examples/itext-7-content-extraction-and-redaction
I have 40,000 HTML files. Each file has a table containing the profit & loss statement of a particular company.
I would like to scrape all these data into Stata. (Or alternatively, into an Excel/CSV file). The end product should be a Stata/Excel file containing a list of all companies and details of their balance sheet (revenue, profit, etc.)
May I know how this can be done? I tried Outwit but it doesn't seem good enough.
Stata is not exactly the best tool for the job. You would have to use low-level file commands to read the input text files, and then parse out the relevant tables (again, using low-level string processing). Putting them into a data set is the easiest part; you can either
expand 2 in l
replace company = "parsed name" in l
replace revenue = parsed_revenue in l
etc., or use post mechanics. With some luck, you'd find some packages that may make it simpler, but I am not aware of any, and findit html does not seem to bring anything usable.
Stata is not the right tool for this job. In principle it is possible. Personally, I have done similar things before: reading ASCII files into Stata, parsing them and extracting information from them. I dumped the data into Stata using insheet, then treated it with Stata's string functions. It was a bit cumbersome, and those files had quite a simple and clear structure. I don't want to imagine what happens when the files have a more complicated structure.
I think the best strategy is to use a scripting language such as Python, Perl or Ruby to extract the information contained in the HTML tables. The results can easily be written into a CSV, Excel or even a Stata (.dta) file.
You should use Python's BeautifulSoup package. It is very handy for extracting data from HTML files. Here is the link:
http://www.crummy.com/software/BeautifulSoup/
The documentation describes many commands, but only a few are important. These are the important ones:
from bs4 import BeautifulSoup

# read the file
fp = open(file_name, 'r')
data = fp.read()
fp.close()

# pass the data to BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')

# extract the HTML elements by id and write the result into a file
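Building on that, here is a minimal sketch of pulling a table's cells out of each file and collecting them into one CSV for Stata/Excel (the flat table layout is an assumption - adjust the row/cell handling to match your actual statements):

```python
import csv
from bs4 import BeautifulSoup

def table_rows(html):
    """Return the first table in the HTML as a list of rows of cell text."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    return [[cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
            for tr in table.find_all("tr")]

def tables_to_csv(paths, out_path):
    """Append the table rows of every input file to one CSV, tagged by filename."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for path in paths:
            with open(path) as fp:
                for row in table_rows(fp.read()):
                    writer.writerow([path] + row)
```

Tagging each row with its source file keeps the 40,000 companies distinguishable after they are merged into one file.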
For one of my statistics project, I need to RANDOMLY download several files from a google patent page, and each file is a large zip file. The web link is the following:
http://www.google.com/googlebooks/uspto-patents-grants-text.html#2012
Specifically, I want to RANDOMLY select 5 years (the links at the top of the page) and download the corresponding files (i.e. 5 files). Do you guys know if there's a good package out there for this task?
Thank you.
That page contains mostly zip files, and looking at the HTML content it seems fairly easy to determine which links will yield a zip file by simply searching for ".zip" in a collection of candidate URLs, so here is what I would recommend:
fetch the page
parse the HTML
extract the anchor tags
for each anchor tag:
    if the href of the anchor tag contains ".zip":
        add the href to the list of file links
while more files are needed:
    generate a random index i, such that i is between 0 and the number of links in the list
    select the i-th element from the links list
    fetch the zip file
    save the file to disk or load it into memory
If you don't want to get the same file twice, just remove the URL from the list of links and then randomly select another index (until you have enough files or you run out of links). I don't know what programming language your team codes in, but it shouldn't be very difficult to write a small program that does the above.
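In Python, for instance, the above could look roughly like this (requests and BeautifulSoup are my choice here, not a requirement; random.sample already guarantees no link is picked twice):

```python
import random
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def zip_links(html, base_url):
    """Collect absolute URLs of all anchor tags pointing at .zip files."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"])
            for a in soup.find_all("a", href=True)
            if a["href"].endswith(".zip")]

def download_random_zips(index_url, count=5):
    """Pick `count` distinct zip links at random and save each to disk."""
    links = zip_links(requests.get(index_url).text, index_url)
    for link in random.sample(links, count):  # sample() never repeats an element
        name = link.rsplit("/", 1)[-1]
        with open(name, "wb") as f:
            f.write(requests.get(link).content)
```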
I want to export the MediaWiki markup for a number of articles (but not all articles) from a local MediaWiki installation. I want just the current article markup, not the history or anything else, with an individual text file for each article. I want to perform this export programmatically and ideally on the MediaWiki server, not remotely.
For example, if I am interested in the Apple, Banana and Cupcake articles I want to be able to:
article_list = ["Apple", "Banana", "Cupcake"]
for a in article_list:
get_article(a, a + ".txt")
My intention is to:
extract required articles
store MediaWiki markup in individual text files
parse and process in a separate program
Is this already possible with MediaWiki? It doesn't look like it. It also doesn't look like Pywikipediabot has such a script.
A fallback would be to be able to do this manually (using the Export special page) and easily parse the output into text files. Are there existing tools to do this? Is there a description of the MediaWiki XML dump format? (I couldn't find one.)
On the server side, you can just export from the database. Remotely, Pywikipediabot has a script called get.py which fetches the wikicode of a given article. It is also pretty simple to do manually, something like this (written from memory, errors might occur):
import wikipedia as pywikibot

site = pywikibot.getSite()  # assumes you have a user-config.py with default site/user
article_list = ["Apple", "Banana", "Cupcake"]
for title in article_list:
    page = pywikibot.Page(site, title)
    text = page.get()  # handling of not-found etc. exceptions omitted
    with open(title + ".txt", "wt") as f:
        f.write(text)
Since MediaWiki's language is not well-defined, the only reliable way to parse/process it is through MediaWiki itself; there is no support for that in Pywikipediabot, and the few tools which try to do it fail with complex templates.
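As for the Special:Export fallback: the dump is plain XML, with one <page> element per article and the current wikitext inside <revision><text>. A sketch of splitting such a dump into text files (the namespace version suffix, 0.10 here, varies between MediaWiki releases - check the root element of your dump):

```python
import xml.etree.ElementTree as ET

# namespace of the export schema; the version suffix differs between releases
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def pages_from_dump(xml_text):
    """Return {title: current wikitext} for every <page> in an Export dump."""
    root = ET.fromstring(xml_text)
    pages = {}
    for page in root.iter(NS + "page"):
        title = page.find(NS + "title").text
        text_el = page.find(NS + "revision/" + NS + "text")
        pages[title] = text_el.text or ""
    return pages

def dump_to_text_files(xml_text):
    """Write one .txt file per article, named after its title."""
    for title, text in pages_from_dump(xml_text).items():
        with open(title.replace("/", "_") + ".txt", "w") as f:
            f.write(text)
```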
It looks like getText.php is a built-in server-side maintenance script for exporting the wikitext of a specific article. (Easier than querying the database directly.)
Found it via Publishing from MediaWiki, which covers all angles on exporting from MediaWiki.