Hey so my question might be basic but I am a little lost on how to implement it.
If I was reading a file, for example an HTML File. How do I grab a specific parts of the file. For example what I want to do is
blahblahblahblah<br>blahblahblah
how do I find the tag that starts off with < and ends with > and grab the string inside which is br in Python?
This is a very broad question there are a couple of ways you could retrieve a single string from a html file.
First option would be to parse the file with a library like BeautifulSoup, this option is also valid for xml files too.
Second option would be, if the file is relatively small you could use regex to locate a string you want and return it.
First option is what I would recommend, if you use a library like BeautifulSoup you have a lot of functionality, eg. to find the parent element of a selected tag and so on.
Related
I am using python docx for word file processing. While using larger files(50+ pages), the paragraph.text method is returning string which is inconsistent with my file.
import docx
document=Document(f)
paratext=[]
paragraphs=document.paragraphs
for paragraph in paragraphs:
text=paragraph.text
paratext.append(text)
print(paratext[30])
Ideally this should print the 30th paragraph. But the output seems distorted (Beginning few characters are missing and the printed output starts from somewhere in the middle of the actual paragraph in some cases). However it works fine if I copy the adjacent few paragraphs in a fresh ms word document (1 page only) and run the code by just changing the index of paratext. For eg I copied 3 adjacent paras into a new doc and used print(paratext[2]), the output seems just perfect here. How do I get rid of this inconsistency as I have to work with larger documents.
I expect this means that the missing text is in runs that are "enclosed" in some other XML element, like perhaps a field or a hyperlink.
The quickest way to discover specifically what's happening might be to modify your short script to temporarily capture the paragraph XML.
import docx
document = Document(f)
p_xml = [paragraph._element.xml for paragraph in document.paragraphs]
print(p_xml[30])
Your choices at that point are likely to be editing the Word documents to remove the offending "enclosure" or to process XML for each paragraph yourself using lxml calls.
That might be easier that it sounds if you use the .xpath() method available on paragraph._element. In any case, that would be a separate question in which you show the XML you find with the method above.
i am trying to create a crawler that can read a pdf and extract certain information from it (to save in a database).
However, i am unsure which method / Tool to use.
My initial thought was to use PhantomJs but after reading a lot it doesn't seem that it has the capabilities. if I wanted to use Phantomjs I would have to download the pdf, convert it into an HTML page and then afterwards crawl it using Phantom which seems like a tedious task that should be able to be done faster.
So my question is, how can I read a pdf from an online source and gather these pieces of information?
If you are not limited in terms of programming language, consider using iText.
It can easily extract all the text from a given PDF document. It also offer utility methods to look for regular expressions within a file, giving you back the exact location (coordinates) and the matching text.
iText is available both for c# and java lovers.
File inputFile = new File("");
PdfDocument pdfDocument = new PdfDocument(new PdfReader(inputFile));
String content = PdfTextExtractor.getTextFromPage(pdfDocument.getPage(1));
Have a look at the website to learn more.
http://developers.itextpdf.com/content/itext-7-examples/itext-7-content-extraction-and-redaction
I am new to this topic, but my requirement is to parse documents of different types(Html, pdf,txt) using a crawlers. please suggest me what crawler to use for my requirement and provide me some tutorial s or some example how to parse the document using crawlers.
Thankyou.
This is a very broad question, so my answer is also very broad and only touches the surface.
It all comes down to two steps, (1) extracting the data from its source, and (2) matching and parsing the relevant data.
1a. Extracting data from the web
There are many ways to scrape data from the web. Different strategies can be used depending if the source is static or dynamic.
If the data is on static pages, you can download the HTML source for all the pages (automated, not manually) and then extract the data out of the HTML source. Downloading the HTML source can be done with many different tools (in many different languages), even a simple wget or curl will do.
If the data is on a dynamic page (for example, if the data is behind some forms that you need to do a database query to view it) then a good strategy is to use an automated web scraping or testing tool. There are many of these.
See this list of Automated Data Collection resources [1]. If you use such a tool, you can extract the data right away, you usually don't have the intermediate step of explicitly saving the HTML source to disk and then parsing it afterwards.
1b. Extracting data from PDF
Try Tabula first. It's an open source web application that lets you visually extract tabular data from PDFs.
If your PDF doesn't have its data neatly structured in simple tables or you have way too much data for Tabula to be feasible, then I recommend using the *NIX command-line tool pdftotext for converting Portable Document Format (PDF) files to plain text.
Use the command man pdftotext to see the manual page for the tool. One useful option is the -layout option which tries to preserve the original layout in the text output. The default option is to "undo" the physical layout of the document, and instead output the text in reading order.
1c. Extracting data from spreadsheet
Try xls2text for converting to text.
2. Parsing the (HTML/text) data
For parsing the data, there are also many options. For example, you can use a combination of grep and sed, or the BeautifulSoup Python library` if you're dealing with HTML source, but don't limit yourself to these options, you can use a language or tool that you're familiar with.
When you're parsing and extracting the data, you're essentially doing pattern matching.
Look for unique patterns that make it easy to isolate the data you're after.
One method of course is regular expressions. Say I want to extract email addresses from a text file named file.
egrep -io "\b[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}\b" file
The above command will print the email addresses [2]. If you instead want to save them to a file, append > filename to the end of the command.
[1] Note that this list is not an exhaustive list. It's missing many options.
[2] This regular expression isn't bulletproof, there are some extreme cases it will not cover.
Alternatively, you can use a script that I've created which is much better for extracting email addresses from text files. It's more accurate at finding email addresses, easier to use, and you can pass it multiple files at once. You can access it here: https://gist.github.com/dideler/5219706
I have 40,000 HTML files. Each file has a table containing the profit & loss statement of a particular company.
I would like to scrape all these data into Stata. (Or alternatively, into an Excel/CSV file). The end product should be a Stata/Excel file containing a list of all companies and details of their balance sheet (revenue, profit, etc.)
May I know how this can be done? I tried Outwit but it doesn't seem good enough.
Stata is not exactly the best tool for the job. You would have to use low-level file commands to read the input text files, and then parse out the relevant tables (again, using low level string processing). Putting them into data set is the easiest part; you can either
expand 2 in l
replace company = "parsed name" in l
replace revenue = parsed_revenue in l
etc., or use post mechanics. With some luck, you'd find some packages that may make it simpler, but I am not aware of any, and findit html does not seem to bring anything usable.
Stata is not the good tool for this job. In principle it is possible. Personally I have already done similar things: reading ascii files into Stata, parsing them and extracting information fro them. I have dumped the data into Stata using insheet. Then I have treated the data with Stata's string functions. It was a bit cumbersome. And the files had quite a simple and clear structure. I don't want to imagine what happens when the files have a more complicated structure.
I think that the best strategy is to use a scripting language such as Python, Perl or Ruby. to extract the information contained in the html tables. The results can easily be written into a csv, Excel or even a Stata (.dta) file.
You should use Python beautifulsoup package. It is very handy in extracting data from HTML files. Following is the link.
http://www.crummy.com/software/BeautifulSoup/
In the documentation, there are many commands, however only few commands are important. Following are the important commands:
from bs4 import BeautifulSoup
#read the file
fp=open(file_name,'r')
data=fp.read()
fp.close()
#pass the data to beautifulsoup
soup = BeautifulSoup(html_doc, 'html.parser')
#extract the html elements by id and write result into file
After installing my files using WIX 3.5 I would like to changes some values in one of my xml files.
Currently there are multiple entries like this:
<endpoint address="net.tcp://localhost/XYZ" .../>
I would like to change the localhost to the real servername wich is available due to a property. How can I perform this replacement on each entry inside this xml file? Is there a way to do this without writing an own CA?
Thanks in advance!
XmlConfig and/or XmlFile elements are your friends here.
UPDATE: Well, according to the comments below, it turns out that only part of the attribute (or element) value should be changed. This seems not to be supported by either of two referenced elements.
I would take one of the two approaches then:
Use third-party "read XML" actions, like this one
It's better than creating your own because you can rely on deeper testing in this case
Teach your build script to control the string pattern
Let's say you put `net.tcp://localhost/XYZ` to build file and your code is pointed out to take this value as a string pattern to use at install time. For instance, keep the string pattern as a Property in your MSI package. When it changes, e.g. to `net.tcp://localhost/ABC` you'll have to change nothing in your action. In this case from a XMLFile perspective you always know your FROM and TO attribute values.
If your XML configuration file is not large, you can load the file into memory and perform replace using JScript.
var s = "<endpoint address=\"net.tcp://localhost/XYZ\" .../>";
var re = /"net.tcp:\/\/localhost\//g;
var r = s.replace(re, "\"http://newhost.com/");
Here s is your complete XML file, re is the regular expression, and r would contain the result or replace.
You can read and write to public properties of Windows Installer using JScript. Yet there's still one problem: you have to read your XML file and to write it back to disk. To do it, you can use Win32_ReadFile and Win32_WriteFile custom actions from the AppSecInc. MSI Extensions library referenced by Yan in his answer.
However, it could be easier to write a complete Custom Action which will load your XML configuration file, do the replace, and write the file back to disk. To do it you can use XSLT and JScript (see an example code).
InstallShield has a built-in data driven custom action called Text Search. It basically allows for RegEx style replacements like what you are describing.
WiX doesn't have this functionality but you could write a custom action ( say using C#/DTF ) to do it for you.
There nothing in Wix, you can do to change something in a file without using a custom action. If you don't want to use CA, you can consider saving the settings in some other place e.g. User's registry and always read that setting from there