I'm writing a program in Haskell that needs to read metadata from media files, such as runtime, artist, size, name, copyright, height...
Basically I need to get this information and create some PDFs with it, but I can't find a way to get values like "60s", "AC/DC", "5000", "Thunderstruck", "copyright"...
Any ideas how to parse the info that exiftool gives? Which exiftool options are best to use? Should I use Text.Regex?
Since exiftool can produce XML or JSON output, you can pick one format and parse the output accordingly. Haskell has Text.XML.Light (and a bunch of others) for parsing XML, and aeson for JSON.
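Assuming the JSON route, here is a minimal sketch of consuming exiftool's -json output, shown in Python for brevity (in Haskell, the analogous step is running exiftool as a subprocess and feeding its stdout to aeson's decode). The file path is a placeholder; the tag names shown are typical exiftool output fields:

import json
import subprocess

# exiftool -json emits a JSON array with one object per input file.
# "song.mp3" is a placeholder path.
out = subprocess.run(
    ["exiftool", "-json", "song.mp3"],
    capture_output=True, text=True, check=True,
).stdout

tags = json.loads(out)[0]
# Tag names follow exiftool's output, e.g. Artist, Title, Duration.
print(tags.get("Artist"), tags.get("Title"), tags.get("Duration"))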
As for what tags are available in EXIF, take a look at this convenient list.
I am trying to limit the DICOM tags that are retained by using
for key in keys:  # keys were obtained from image_slice.GetMetaDataKeys()
    if key.upper() not in {'0028|0010', '0028|0011'}:
        image_slice.EraseMetaData(key)
in Python 3.6, where image_slice is of type SimpleITK.SimpleITK.Image.
I then use
image_slice.GetMetaDataKeys()
to see which tags remain, and they are only the tags I selected. I then save the image with
writer.SetFileName(outputDir+os.path.basename(sliceFileNames[i]))
writer.Execute(image_slice)
where outputDir is the output directory name and os.path.basename(sliceFileNames[i]) is the DICOM image name. However, when I open the image with Weasis or MIPAV, I notice that there are a lot more tags than were in image_slice. For example, there is
(0002,0001) [OB] FileMetaInformationVersion: binary data
(0002,0002) [UI] MediaStorageSOPClassUID:
(0002,0003) [UI] MediaStorageSOPInstanceUID:
(0008,0020) [DA] StudyDate: (set to the date the file was created)
I was wondering how and where these additional tags were added.
The group 2 tags you are seeing are meta data tags, which are always written when the dataset is written to file. Unlike "regular" tags, which start at group 8, these group 2 tags do not belong to the dataset itself, but contain information about the encoding/writing of the dataset, such as the transfer syntax; more information can be found in the DICOM standard, part 10. They are recreated on saving a dataset to a file, as otherwise the DICOM file would not be valid.
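If you want to see this split for yourself, here is a small sketch using pydicom (not SimpleITK; pydicom is used here only because it exposes the file meta information separately from the dataset). The filename is a placeholder for the file you just wrote:

import pydicom

ds = pydicom.dcmread("slice.dcm")  # placeholder path

# Group 2 elements live in the file meta information, which the
# writer (re)generates; they are not part of the dataset itself.
for elem in ds.file_meta:
    print(elem.tag, elem.keyword)

# The dataset proper starts at group 8 and above.
for elem in ds:
    print(elem.tag, elem.keyword)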
About the rest of the tags I can only guess, but they were probably written by the software because they are mandatory DICOM tags that were missing from the dataset. StudyDate is certainly a mandatory tag, so adding it if it is missing is correct if the data is seen as derived data (which it usually is if you are manipulating it with ITK). I guess the other tags you didn't mention are also mandatory tags.
Someone with more SimpleITK knowledge can probably add more specific information.
I'm working on a fan-made localization of a video game, and for now I have a TXT file with this structure:
<ID1>\t<ID2>\t<ID3>\t<ID4>\t"<stringToTranslate>"
<ID1>\t<ID2>\t<ID3>\t<ID4>\t"<stringToTranslate>"
<ID1>\t<ID2>\t<ID3>\t<ID4>\t"<stringToTranslate>"
<ID1>\t<ID2>\t<ID3>\t<ID4>\t"<stringToTranslate>"
I need to create a formatted file so I can translate it on the Crowdin platform...
But I don't know what kind of structure to create (a JSON, an INI, an XML?), because I then have to write a script to convert my TXT into this new format.
Thanks a lot for your help.
Any key-value file type would do the trick, I believe. The JSON format is a good choice (it looks like you have multiple string IDs, so nested JSON will be perfect in terms of structure).
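A minimal conversion sketch, assuming every line has exactly the four tab-separated IDs followed by the quoted string (filenames are placeholders):

import csv
import json

result = {}
with open("strings.txt", encoding="utf-8", newline="") as src:
    # Each row: <ID1> <ID2> <ID3> <ID4> "<stringToTranslate>", tab-separated.
    # csv strips the surrounding quotes from the string column.
    for id1, id2, id3, id4, text in csv.reader(src, delimiter="\t"):
        result.setdefault(id1, {}).setdefault(id2, {}).setdefault(id3, {})[id4] = text

with open("strings.json", "w", encoding="utf-8") as dst:
    json.dump(result, dst, ensure_ascii=False, indent=2)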
I'm trying to parse an XML file in order to retrieve the data in a list.
I need to extract the TITRE_N, the AUTEURS_N, and the RESUME_N. I know how to do it, but my problem is that for some references I don't have any data for AUTEURS_N: the tag is missing entirely, and as you can guess, all the data after it gets shifted! Do you know how I can parse this document and handle the fact that a tag I normally use is sometimes missing?
Thanks a lot!
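One common way to handle an optional element is to look fields up per record instead of reading values positionally. A sketch with ElementTree, where the filename and the record element name are assumptions:

import xml.etree.ElementTree as ET

tree = ET.parse("references.xml")  # placeholder filename

records = []
for ref in tree.getroot().iter("REFERENCE"):  # assumed record element
    records.append({
        "titre": ref.findtext("TITRE_N", default=""),
        # findtext returns the default when the tag is absent,
        # so a missing AUTEURS_N no longer shifts the fields after it.
        "auteurs": ref.findtext("AUTEURS_N", default=""),
        "resume": ref.findtext("RESUME_N", default=""),
    })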
I am new to Solr, but I suppose that there is an easy way to index SVG files with Solr. I have installed Solr 6.3.0 and I am using an example 'files' core. It works well, but it seems that it parses the SVG files as plain text.
Is there an easy way to take only the text between the <text> tags?
Ideally, I want to combine some meta data from a JSON file with the text from the SVG files. The JSON file looks like:
{
  "id": "000001",
  "title": "Some diagram",
  ...
}
...
The associated SVG file is 000001.svg. Is there a way to create a schema in Solr that can take the fields from the JSON and merge a field with the text from the SVG file?
The most flexible way that will do what you want is to write a custom indexing utility that parses your JSON, picks up the SVG and extracts the relevant elements, then submits the complete structure to Solr. Depending on your programming language of choice you'll do this with something like SolrJ, Solrnet or another client library.
This is way more flexible and maintainable than integrating it directly into Solr, but if you want to do custom SVG indexing (without the additional JSON), you could use the XSLT support in the regular update handler, or use an XPathEntityProcessor in a DataImportHandler configuration.
My choice would be the custom indexing code.
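A minimal sketch of such an indexing utility in Python, assuming a local Solr with a 'files' core and an svg_text field (the URLs, core name, field name, and file paths are all placeholders):

import json
import xml.etree.ElementTree as ET
import requests

# Load the metadata record (placeholder filename).
with open("000001.json", encoding="utf-8") as f:
    doc = json.load(f)

# Keep only the contents of <text> elements from the matching SVG.
svg = ET.parse("000001.svg").getroot()
ns = "{http://www.w3.org/2000/svg}"
doc["svg_text"] = " ".join(
    "".join(t.itertext()) for t in svg.iter(ns + "text")
)

# Submit the merged document to Solr.
resp = requests.post(
    "http://localhost:8983/solr/files/update?commit=true",
    json=[doc],
    timeout=30,
)
resp.raise_for_status()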
I am new to this topic, but my requirement is to parse documents of different types (HTML, PDF, TXT) using a crawler. Please suggest which crawler to use for my requirement, and point me to some tutorials or examples of how to parse documents using crawlers.
Thank you.
This is a very broad question, so my answer is also very broad and only touches the surface.
It all comes down to two steps: (1) extracting the data from its source, and (2) matching and parsing the relevant data.
1a. Extracting data from the web
There are many ways to scrape data from the web. Different strategies can be used depending if the source is static or dynamic.
If the data is on static pages, you can download the HTML source for all the pages (automated, not manually) and then extract the data out of the HTML source. Downloading the HTML source can be done with many different tools (in many different languages); even a simple wget or curl will do.
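For example, a minimal download step with Python's standard library (the URL and output filename are placeholders):

import urllib.request

url = "https://example.com/page.html"  # placeholder URL
with urllib.request.urlopen(url) as resp:
    html = resp.read().decode("utf-8", errors="replace")

# Save the source so it can be parsed in a separate step.
with open("page.html", "w", encoding="utf-8") as f:
    f.write(html)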
If the data is on a dynamic page (for example, if the data is behind a form and a database query has to run before it is shown), then a good strategy is to use an automated web scraping or testing tool. There are many of these.
See this list of Automated Data Collection resources [1]. If you use such a tool, you can extract the data right away; you usually don't have the intermediate step of explicitly saving the HTML source to disk and parsing it afterwards.
1b. Extracting data from PDF
Try Tabula first. It's an open source web application that lets you visually extract tabular data from PDFs.
If your PDF doesn't have its data neatly structured in simple tables or you have way too much data for Tabula to be feasible, then I recommend using the *NIX command-line tool pdftotext for converting Portable Document Format (PDF) files to plain text.
Use the command man pdftotext to see the manual page for the tool. One useful option is the -layout option which tries to preserve the original layout in the text output. The default option is to "undo" the physical layout of the document, and instead output the text in reading order.
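For example, invoking it from a script (filenames are placeholders):

import subprocess

# pdftotext [options] <PDF-file> [<text-file>]
subprocess.run(["pdftotext", "-layout", "report.pdf", "report.txt"], check=True)

with open("report.txt", encoding="utf-8") as f:
    text = f.read()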
1c. Extracting data from spreadsheet
Try xls2text for converting to text.
2. Parsing the (HTML/text) data
For parsing the data, there are also many options. For example, you can use a combination of grep and sed, or the BeautifulSoup Python library if you're dealing with HTML source, but don't limit yourself to these options; you can use any language or tool that you're familiar with.
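As a tiny illustration of the BeautifulSoup route (the inline HTML is a stand-in for a downloaded page):

from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = "<html><body><h1>Report</h1><p class='price'>$9.99</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.h1.get_text())                         # Report
print(soup.find("p", class_="price").get_text())  # $9.99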
When you're parsing and extracting the data, you're essentially doing pattern matching.
Look for unique patterns that make it easy to isolate the data you're after.
One method of course is regular expressions. Say I want to extract email addresses from a text file named file.
egrep -io "\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b" file
The above command will print the email addresses [2]. If you instead want to save them to a file, append > filename to the end of the command.
[1] Note that this list is not an exhaustive list. It's missing many options.
[2] This regular expression isn't bulletproof; there are some extreme cases it will not cover.
Alternatively, you can use a script that I've created which is much better for extracting email addresses from text files. It's more accurate at finding email addresses, easier to use, and you can pass it multiple files at once. You can access it here: https://gist.github.com/dideler/5219706