My requirement is to parse SEC tabular data. Please find the sample tabular data in the image below.
I'm using Python for it. I found that the tabular data is stored in XBRL format. At first, I tried to parse the XBRL data the way we parse XML, using the lxml module. Later I realized that XBRL is a complex model to parse and that there are many libraries for parsing XBRL documents. I've gone through different libraries, such as python-xbrl and xbrl, and installed a server (RaptorXML+XBRL Server) for parsing XBRL documents, but none worked as expected. As I mentioned earlier, my goal is to get the tabular data from the SEC. We can find sample documents in this link. Can you please suggest a process or module for parsing the tabular data? Thanks in advance.
Like you, I tried parsing XBRL documents using whatever tools are available in Python, without much success. One way to work around the problem is to go to the HTML filing underlying the XBRL filing.
So, to use your example link, the url of the first 10K there is
https://www.sec.gov/ix?doc=/Archives/edgar/data/1551152/000155115220000007/abbv-20191231x10k.htm
Simply strip the /ix?doc= string from the url, and you are left with
https://www.sec.gov/Archives/edgar/data/1551152/000155115220000007/abbv-20191231x10k.htm
which is the same 10k filing, but in html format. From there you can just use your normal html tools to extract whatever data you are interested in.
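The URL rewrite described above can be sketched in a few lines of Python. The function name plain_filing_url is made up for illustration; it only does the string replacement and does not download anything.

```python
def plain_filing_url(viewer_url: str) -> str:
    """Drop the '/ix?doc=' viewer prefix so the URL points at the HTML filing.

    URLs that do not contain the prefix are returned unchanged.
    """
    return viewer_url.replace("/ix?doc=", "", 1)


viewer = ("https://www.sec.gov/ix?doc=/Archives/edgar/data/1551152/"
          "000155115220000007/abbv-20191231x10k.htm")
html_url = plain_filing_url(viewer)
```

The resulting html_url can then be handed to whichever HTML tool you normally use.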
For a project at my university, I am trying to parse the clinical recommendations for 3 to 4 diseases.
Basically, from https://www.uspreventiveservicestaskforce.org/BrowseRec/Index/browse-
I would like to parse the table head (Name, Type, Year, Age Group) and export it into Excel, then populate it with the diseases and, more importantly, with the information available behind each link (Population, Recommendation, Grade).
The problem is that I do not know how to parse the information inside the links. For example, take the first disease link (Abdominal Aortic Aneurysm: Screening); the page with the information I need is https://www.uspreventiveservicestaskforce.org/Page/Document/UpdateSummaryFinal/abdominal-aortic-aneurysm-screening
Is Beautiful Soup the go-to solution? I am a newbie to this, so any help is highly appreciated. Many thanks!
What you have to do is:
use python-requests to get the index page
use BeautifulSoup to parse the page's content and extract the urls you're interested in
for each of those urls, use requests again to get the "disease" page, then BeautifulSoup again to extract the data you're interested in
use the csv module to write that data into a .csv file, which can be opened by Excel (or any similar program such as OpenOffice).
So in pseudocode:
get the index content
for each disease_url in the index content:
    get the disease page content
    retrieve data from the page content
    write data to csv
All of those packages are rather well documented, so you shouldn't have too many issues implementing this in Python.
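To make the loop above concrete, here is a minimal sketch. To keep it runnable without installing anything, it uses the standard library's html.parser as a stand-in for BeautifulSoup and takes the download step as a fetch callback. With requests installed, that callback could be lambda url: requests.get(url).text. The names LinkCollector, extract_links and scrape_to_csv are hypothetical, and parse_page is left to you, because the layout of the disease pages is not known here.

```python
import csv
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from every <a href=...> in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(index_html, base_url):
    """Return absolute (url, text) pairs for all links in the index page."""
    collector = LinkCollector()
    collector.feed(index_html)
    collector.close()
    return [(urljoin(base_url, href), text) for href, text in collector.links]


def scrape_to_csv(index_url, fetch, parse_page, out_file):
    """fetch(url) returns HTML text; parse_page(html) returns a dict of fields."""
    columns = ["Population", "Recommendation", "Grade"]
    writer = csv.writer(out_file)
    writer.writerow(["Name", "URL"] + columns)
    for url, name in extract_links(fetch(index_url), index_url):
        fields = parse_page(fetch(url))
        writer.writerow([name, url] + [fields.get(c, "") for c in columns])
```

In real use you would filter the links down to the disease pages before fetching them, and open the output with open("out.csv", "w", newline="") as the csv module documentation recommends.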
Does anybody know how to create a structured report using the DICOMscope toolkit from the console (Ubuntu 16.04), with a link to a related image?
The thing is that I have an image of some kind of trauma and I have to connect it with a report that is in a text file. The final file should be in .dcm format and contain the annotation and a link to the image. I have to use the DICOMscope program.
Maybe others refrain from answering because your question needs a very long answer. I cannot provide step-by-step instructions, only a few hints.
The way I would go is to:
(assuming that your image is available in DICOM format):
obtain a sample structured report. I think that the "simple" Basic Text SR is what you want to go for. You can find some samples here.
convert the SR to an XML file using dsr2xml
edit the contents in XML. Do not forget to include your image reference in (0040,a730) Content Sequence -> (0008,1199) Referenced SOP Sequence
convert the XML back to DICOM SR using xml2dsr
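If you want to script the round trip, the steps above might be wrapped like this. It is only a sketch: it assumes the DCMTK tools dsr2xml and xml2dsr are on your PATH and are called with plain input and output file arguments. The helper names and the edit_xml callback are hypothetical, and the actual XML editing (adding the image reference) is left to that callback.

```python
import subprocess


def sr_to_xml_cmd(sr_path, xml_path):
    """Command line that converts a DICOM SR file to XML."""
    return ["dsr2xml", sr_path, xml_path]


def xml_to_sr_cmd(xml_path, sr_path):
    """Command line that converts the edited XML back to DICOM SR."""
    return ["xml2dsr", xml_path, sr_path]


def round_trip(sample_sr, work_xml, final_sr, edit_xml, run=subprocess.run):
    """Convert SR to XML, let edit_xml(path) modify it, convert it back."""
    run(sr_to_xml_cmd(sample_sr, work_xml), check=True)
    edit_xml(work_xml)  # your own code: insert report text and image reference
    run(xml_to_sr_cmd(work_xml, final_sr), check=True)
```

The run parameter exists only so the function can be tried without DCMTK installed; by default it calls the real tools.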
By the way: From your question, I did not really understand why you want to use a structured report, as you wrote that your report is plain text. Instead of digging into the complex structure of SR, you may want to consider exporting the report to an Encapsulated PDF document which can reference images as well.
In COGNOS is there a way to get the definitions (filters, selected fields) from a number of reports in a folder?
I've inherited around 500 reports defined in a folder and they all need to be checked and fixed as they have business errors (not technical errors). If it was possible to get all their definitions in a single extract that would save an enormous amount of time having to click multiple times to get that information from each report one by one.
In ACCESS this can be done with VBA (for query definitions), but I'm not sure if there is a scripting language that can be used with COGNOS to achieve a similar result.
It sounds like you may want to "validate" each of these 500 reports (effectively equivalent to pressing the "validate" button on each individual report if it was open in the authoring studio).
Validation will ensure that a report specification XML is still syntactically correct, references a package which is still present in the content store, references only query items from that package which still exist, generates valid SQL against the underlying datasource, etc.
If that's what you're looking for, an easy way to do batch validation for all 500 reports would be to use MotioPI (it's a free admin tool for Cognos). Here's a short article which walks you through the process:
http://info.motio.com/Blog/bid/70357/Batch-Validation-of-Cognos-Reports
If you're wanting to retrieve the actual report specification (XML) for each of these 500 objects, then you'd need to write a program which utilizes the Cognos SDK to retrieve the specification XML from each of the 500 report objects. After that, you'd need to add logic which examines each of these 500 XML documents, looking for whatever it is you're looking for.
We solved this by exporting the XML of the reports using a SQL query on the content store.
The output is processed with a Python script to convert XML to table layout in CSV format.
This CSV file can easily be imported into Excel.
You might want to process the report XML directly in a SQL query with the xmltable function. In our situation this turned out to be a heavy process that we don't want to burden the content store database with. For a small set of reports it works fine, though.
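As an illustration of the XML-to-CSV step, here is a rough sketch, not our actual script. It assumes each report specification has already been exported as an XML string, and that selected fields appear as dataItem elements with an expression child and filters as filterExpression elements inside each query. Check those element names against your own specifications. The function names are made up for the example.

```python
import csv
import xml.etree.ElementTree as ET


def local(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def report_rows(report_name, spec_xml):
    """Yield one row per selected field or filter found in a report spec."""
    root = ET.fromstring(spec_xml)
    for query in root.iter():
        if local(query.tag) != "query":
            continue
        query_name = query.get("name", "")
        for node in query.iter():
            kind = local(node.tag)
            if kind == "dataItem":
                # The field's definition sits in its <expression> child.
                expr = next((child.text or "" for child in node
                             if local(child.tag) == "expression"), "")
                yield [report_name, query_name, "field",
                       node.get("name", ""), expr.strip()]
            elif kind == "filterExpression":
                yield [report_name, query_name, "filter", "",
                       (node.text or "").strip()]


def write_definitions(reports, out_file):
    """reports: iterable of (report name, specification XML) pairs."""
    writer = csv.writer(out_file)
    writer.writerow(["report", "query", "kind", "name", "expression"])
    for name, spec_xml in reports:
        writer.writerows(report_rows(name, spec_xml))
```

Matching on the local tag name keeps the sketch independent of the schema version in the specification's namespace.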
I am using example source code from the Lucene 4.2.0 demo API:
http://lucene.apache.org/core/4_2_0/demo/overview-summary.html
I run IndexFiles.java to create an index from a directory of rtf, pdf, doc, and docx files. I then run SearchFiles.java and encounter several instances where my searches fail, i.e. the search does not return a document that contains the word I searched for.
I suspect it has to do with Lucene 4.2.0 not being able to correctly index non .txt files without additional customization.
Question: Can the IndexFiles.java source code (Lucene 4.2.0) correctly index pdf, doc, docx files as it is written in the provided link? Does anyone have examples or references on how to code that functionality?
Thank You
No, it can't. IndexFiles is a demo, an example for you to learn from, but not really designed for production use. If you take a look at the code, you'll see it just uses a FileInputStream (wrapped with an InputStreamReader, wrapped with a BufferedReader). Generally, Lucene doesn't parse different file formats (except its own index files, of course). How to parse a file to provide meaningful content to Lucene is up to you to define.
Apache Tika might be a good place to look for this functionality. Here is a simple example using Tika with Lucene.
You might also consider using Solr.
I am new to this topic, but my requirement is to parse documents of different types (HTML, PDF, TXT) using a crawler. Please suggest which crawler to use for my requirement, and point me to some tutorials or examples of how to parse documents using crawlers.
Thank you.
This is a very broad question, so my answer is also very broad and only touches the surface.
It all comes down to two steps, (1) extracting the data from its source, and (2) matching and parsing the relevant data.
1a. Extracting data from the web
There are many ways to scrape data from the web. Different strategies can be used depending on whether the source is static or dynamic.
If the data is on static pages, you can download the HTML source for all the pages (automated, not manually) and then extract the data out of the HTML source. Downloading the HTML source can be done with many different tools (in many different languages); even a simple wget or curl will do.
If the data is on a dynamic page (for example, if the data sits behind a form that you have to submit, triggering a database query, before you can view it), then a good strategy is to use an automated web scraping or testing tool. There are many of these.
See this list of Automated Data Collection resources [1]. If you use such a tool, you can extract the data right away; you usually don't have the intermediate step of explicitly saving the HTML source to disk and then parsing it afterwards.
1b. Extracting data from PDF
Try Tabula first. It's an open source web application that lets you visually extract tabular data from PDFs.
If your PDF doesn't have its data neatly structured in simple tables or you have way too much data for Tabula to be feasible, then I recommend using the *NIX command-line tool pdftotext for converting Portable Document Format (PDF) files to plain text.
Use the command man pdftotext to see the manual page for the tool. One useful option is the -layout option which tries to preserve the original layout in the text output. The default option is to "undo" the physical layout of the document, and instead output the text in reading order.
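If you drive pdftotext from a script, a small Python wrapper could look like the sketch below. It assumes pdftotext is installed and on your PATH. The helper names are made up, and split_columns relies on the assumption that the -layout output separates table columns by at least two spaces, which will not hold for every PDF.

```python
import re
import subprocess


def pdftotext_cmd(pdf_path, txt_path, keep_layout=True):
    """Build the pdftotext command line, optionally with -layout."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")
    return cmd + [pdf_path, txt_path]


def convert(pdf_path, txt_path):
    """Run pdftotext; raises CalledProcessError if the tool reports failure."""
    subprocess.run(pdftotext_cmd(pdf_path, txt_path), check=True)


def split_columns(text):
    """Split layout-preserved text into rows of cells on runs of 2+ spaces."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            rows.append(re.split(r"\s{2,}", line.strip()))
    return rows
```

Splitting on two or more spaces keeps cells that contain single spaces in one piece.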
1c. Extracting data from spreadsheet
Try xls2text for converting to text.
2. Parsing the (HTML/text) data
For parsing the data, there are also many options. For example, you can use a combination of grep and sed, or the BeautifulSoup Python library if you're dealing with HTML source. Don't limit yourself to these options, though; you can use any language or tool that you're familiar with.
When you're parsing and extracting the data, you're essentially doing pattern matching.
Look for unique patterns that make it easy to isolate the data you're after.
One method of course is regular expressions. Say I want to extract email addresses from a text file named file.
egrep -io "\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b" file
The above command will print the email addresses [2]. If you instead want to save them to a file, append > filename to the end of the command.
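If you would rather stay in Python, the same kind of pattern works with the re module. This is a minimal sketch with the same limitations as the command-line version; find_emails is just an illustrative name.

```python
import re

# Local part, an @ sign, a domain, and a 2-4 letter top-level domain.
# Longer top-level domains are deliberately not covered by this simple pattern.
EMAIL_RE = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b",
                      re.IGNORECASE)


def find_emails(text):
    """Return every substring of text that looks like an email address."""
    return EMAIL_RE.findall(text)
```

For a file, something like find_emails(open("file").read()) would do.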
[1] Note that this list is not exhaustive; it's missing many options.
[2] This regular expression isn't bulletproof; there are some extreme cases it will not cover.
Alternatively, you can use a script that I've created which is much better for extracting email addresses from text files. It's more accurate at finding email addresses, easier to use, and you can pass it multiple files at once. You can access it here: https://gist.github.com/dideler/5219706