GATE - add annotation to entire document - nlp

I am trying to do document classification with GATE. For that I need to annotate the entire document with one type of annotation. Can anyone please tell me how to do that?

Usually I use XML for that purpose. Something like:
<document class="class-1">
The text of your document 1 is here..
</document>
<document class="class-2">
The text of your document 2 is here..
</document>
Then save these XML snippets as separate files (or as one document, in which case they need a common root element to stay well-formed).
In the GATE application you can use the Annotation Set Transfer PR to move the annotation from "Original markups" to the default annotation set. This is one of the options. Other options depend on the data format you have.
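If the texts and their classes start out as plain strings, the wrapper XML above can be generated rather than typed by hand. The following is a minimal sketch in Python, assuming the same `document` element and `class` attribute as in the example; the function name `labelled_document` is made up for illustration. Building the element through a library means characters such as `&` and `<` in the text are escaped automatically.

```python
import xml.etree.ElementTree as ET


def labelled_document(text: str, label: str) -> str:
    """Wrap raw text in a <document class="..."> element (hypothetical helper)."""
    doc = ET.Element("document", {"class": label})
    doc.text = text  # special characters are escaped on serialization
    return ET.tostring(doc, encoding="unicode")


if __name__ == "__main__":
    # Each result could be written to its own file and loaded into GATE.
    print(labelled_document("Profit & loss < 5%", "class-1"))
```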

If your source documents are HTML or XML, there will already be an annotation in the Original markups set that spans all the content. Otherwise, the simplest option would be to load the Groovy plugin and use the scripting PR with a one-line script like:
outputAS.add(doc.start(), doc.end(), "Document", Utils.featureMap())

Related

What is the disadvantage of manipulating XML files directly as strings?

If I want to change the text or add an element in an XML file, I can simply read the file into a string, replace or add elements as text, then write it back as XML.
In what use cases is that approach bad? Why do we need to manipulate XML with tools such as the XML DOM or XPath?
The disadvantage of manipulating XML via string operators is that achieving a parsing-dependent goal for even one particular XML document is already harder than using a proven XML parser. Achieving the goal for equivalent XML document variations will be nearly impossible, especially for anyone naive enough to be considering such an approach in the first place.
Not convinced?
Scan the table of contents of the Extensible Markup Language (XML) 1.0 (Fifth Edition), W3C Recommendation 26 November 2008. If you do not understand everything, your hand-written, poor imitation of an XML parser will fail, if not on your first test case, then on future variations which you're obligated to handle if you wish to claim your code works with XML. To mention just a few challenges, your program should:
Report if its input XML is not well-formed.
Handle character and entity references.
Handle comments and CDATA sections.
Tempted to parse XML via string operators, including regex? Don't do it.
Use a real XML parser.
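To make the challenges above concrete, here is a small sketch in Python using the standard library's `xml.etree.ElementTree`. The sample document and the function names are invented for illustration. The string-based rename rewrites text inside a comment and a CDATA section while missing the real element, whose content uses an entity reference; the parser-based version only touches actual `name` elements and reports malformed input.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: <name> markup also appears inside a comment and a
# CDATA section, and the real element contains an entity reference.
SOURCE = """<config>
  <!-- <name>old</name> is documented here -->
  <name>Tom &amp; Jerry</name>
  <note><![CDATA[<name>old</name>]]></note>
</config>"""


def rename_with_strings(xml_text: str, new_name: str) -> str:
    """Naive approach: treat the XML as plain text."""
    return xml_text.replace("<name>old</name>", f"<name>{new_name}</name>")


def rename_with_parser(xml_text: str, new_name: str) -> str:
    """Parser-based approach: only real <name> elements are changed."""
    root = ET.fromstring(xml_text)
    for name in root.iter("name"):
        name.text = new_name  # escaped correctly when serialized
    return ET.tostring(root, encoding="unicode")


def is_well_formed(xml_text: str) -> bool:
    """Report whether the input parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```

Note that `ElementTree` drops comments by default and serializes CDATA content as escaped text, which is equivalent XML but not byte-identical to the input.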

Semantically correct way to add a copyright notice to an SVG file?

I want to add a copyright notice to my SVG files, and it should only be "hidden" text, not a watermark.
This is no real protection, because anyone who opens an SVG file with a text editor can edit everything and delete the copyright. But I think it would be a simple way to show who made the file, and hidden information offers a chance to identify unlicensed graphics: if you are looking for it, you can easily find it.
My main question is: how should the copyright text be put into the file?
<title> element is for accessibility purposes, some user agents display the title element as a tooltip.
<desc> element generally improves accessibility and you should describe what a user would see.
ugly way: a text element with inline CSS to hide it. Don't even think about this! :)
<!--Copyright info here--> could also be a simple solution.
<metadata>: this would be the best way, but I did not find a detailed definition of which child elements can live inside it. Also https://developer.mozilla.org/en-US/DOM/SVGMetadataElement gives a 404.
Under https://www.w3.org/TR/SVG/struct.html#MetadataElement we can find more details. But is RDF really necessary?
I think a <metadata> element is the right place, but which child elements should be used and is just RDF the way to go?
I think the metadata element is the correct choice here. It has to contain XML, but it doesn’t have to be an RDF serialization (e.g., RDF/XML).
But I think it makes sense to use RDF here, because that’s exactly RDF’s job (providing metadata about resources, like SVG documents), and there is probably no other XML-based metadata language that has greater reach / better support.
A simple RDF statement (in RDF/XML) could look like this:
<metadata>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:schema="http://schema.org/">
<rdf:Description rdf:about="http://example.com/my-svg-file.svg">
<schema:license rdf:resource="https://creativecommons.org/licenses/by-sa/4.0/"/>
</rdf:Description>
</rdf:RDF>
</metadata>
The about attribute takes an IRI as value; for a stand-alone SVG document, you could provide an empty value (= the base IRI of the document).
In this example I use the license property from Schema.org:
A license document that applies to this content, typically indicated by URL.
(The vocabulary Schema.org is supported by several big search engines.)
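For anyone who later wants to look for this hidden information, the license can be read back with an ordinary XML parser. The sketch below uses Python's `xml.etree.ElementTree` and assumes the metadata has exactly the shape shown above, here with an empty `rdf:about` for a stand-alone file. It is a plain XML lookup, not an RDF parser, so other RDF/XML serializations of the same statement would not be found; `find_licenses` is a name made up for this example.

```python
import xml.etree.ElementTree as ET

NS = {
    "svg": "http://www.w3.org/2000/svg",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "schema": "http://schema.org/",
}
# Namespaced attributes use the {namespace}local form in ElementTree.
RDF_RESOURCE = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}resource"

# Hypothetical stand-alone SVG carrying the metadata from the answer.
SVG = """<svg xmlns="http://www.w3.org/2000/svg" width="10" height="10">
  <metadata>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:schema="http://schema.org/">
      <rdf:Description rdf:about="">
        <schema:license rdf:resource="https://creativecommons.org/licenses/by-sa/4.0/"/>
      </rdf:Description>
    </rdf:RDF>
  </metadata>
  <rect width="10" height="10"/>
</svg>"""


def find_licenses(svg_text: str) -> list:
    """Return the license IRIs found in the SVG's <metadata> element."""
    root = ET.fromstring(svg_text)
    path = "svg:metadata/rdf:RDF/rdf:Description/schema:license"
    return [el.get(RDF_RESOURCE) for el in root.findall(path, NS)]
```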

Using <xs:appinfo> to specify version information

I have found few examples of "standard" usage of <xs:appinfo>. This one is interesting: http://docstore.mik.ua/orelly/xml/schema/ch15_01.htm#ch15-77057. However, I would like to provide info like "used since v1.3" or "deprecated since v1.1". Any suggestions?
Both <xs:documentation> and <xs:appinfo> allow any other XML elements as children without limitations (along with plain text).
The XSD language does not specify what exactly that extra XML is or what it means.
Its purpose is simply to let anyone extend a particular schema or its components with some extra (structured) information, which can then be processed and used automatically.
So, it is completely up to you how to design that extra XML (which would extend your documentation) and how to process and use it.
For that matter, one usage of such extra XML is to format the annotation text with HTML. In that case, that custom XML will be simply XHTML.
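As one possible design for the version information asked about, the sketch below puts custom elements in a made-up namespace (`urn:example:versioning`, with `v:since` and `v:deprecated`) inside `<xs:appinfo>` and reads them back with Python's `xml.etree.ElementTree`. None of these names come from the XSD specification; they are assumptions chosen only to illustrate that the content of `<xs:appinfo>` is yours to define and process.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
NS = {"xs": XS, "v": "urn:example:versioning"}  # "v" namespace is hypothetical

SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:v="urn:example:versioning">
  <xs:element name="nickname" type="xs:string">
    <xs:annotation>
      <xs:appinfo>
        <v:since>1.3</v:since>
      </xs:appinfo>
    </xs:annotation>
  </xs:element>
  <xs:element name="alias" type="xs:string">
    <xs:annotation>
      <xs:appinfo>
        <v:deprecated since="1.1"/>
      </xs:appinfo>
    </xs:annotation>
  </xs:element>
</xs:schema>"""


def version_info(schema_text: str) -> dict:
    """Map each named element to the version info found in its appinfo."""
    root = ET.fromstring(schema_text)
    info = {}
    for element in root.iter(f"{{{XS}}}element"):
        entry = {}
        since = element.find("xs:annotation/xs:appinfo/v:since", NS)
        if since is not None:
            entry["since"] = since.text.strip()
        deprecated = element.find("xs:annotation/xs:appinfo/v:deprecated", NS)
        if deprecated is not None:
            entry["deprecated_since"] = deprecated.get("since")
        info[element.get("name")] = entry
    return info
```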

Generate po or xml file for language translation

I just need clarification from an expert. I need to translate my whole site into another language. The site consists of hundreds of articles, and I need every article translated in full. Should I create a .po or XML file for each article?
If that is the only way, please let me know an efficient way to create the .po and XML files, as these are not small messages.
I see you've tagged your post with 'expressionengine', so I'm assuming that your site is built on EE. In that case, neither .po files nor XML files are the way to go. Since EE offers completely customizable fields and channels, you can have your secondary language content managed just like your primary language content.
There are many different approaches to this in EE, each with its own pros and cons. The article linked below gives a great overview of the many approaches, and offers many links to additional reading. It's more than one answer on SO can properly cover.
Multi-language Solutions for ExpressionEngine on EE Insider
To export as XML:
http://devot-ee.com/add-ons/export-it
or
http://devot-ee.com/add-ons/ajw-export
Alternatively, you can simply build a template that outputs the XML using the standard {exp:channel:entries} tag pair, setting the template type to XML and adding the correct header and code for XML.
To re-import:
http://devot-ee.com/add-ons/datagrab
All of the above will involve knowing what fields you want to export out along with their table and row references so it can easily be re-imported.
Strongly suggest you thoroughly test the export and import facility you opt for to ensure it works before beginning any translation process.
Example XML Template (this is to build sitemap.xml but gives you a start on building your own XML structure):
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
{exp:channel:entries channel="pages" entry_id="not 117|104" limit="500" disable="member_data|pagination|trackbacks" rdf="off" dynamic="no" status="Open" sort="asc"}
<url>
<loc>{page_url}</loc>
<lastmod>{gmt_edit_date format='%Y-%m-%dT%H:%i:%s%Q'}</lastmod>
<changefreq>daily</changefreq>
<priority>1</priority>
</url>
{/exp:channel:entries}
</urlset>

All mandatory fields in an XSD file?

Is there a quick way to find all the mandatory fields in an XSD file?
I need to quickly see all the mandatory fields in the schema.
thanks
Not sure if you're looking to do this through code. If not, Altova XMLSpy, for example, provides an option to "Generate Sample XML File" - with options to generate only mandatory fields.
Otherwise, if you're working with Java, for example, you can use something like the Eclipse XSD project for programmatic access to the XSD. (It even works without Eclipse.) Some additional details are at "Are there any other frameworks that parse XSD other than XSOM?".
Take a look at this post; instead of exporting all fields, there's also an option to get only the mandatory ones... One significant difference compared with the answer you accepted is that you can also generate an Excel or CSV file, in addition to the XML file; not to mention that the sample XML approach is deficient by definition... I would pay attention to the way mandatory choices, abstract typed elements or abstract elements with substitution groups play out in your case.
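For a quick look through code without any of these tools, a rough first pass can be done with a plain XML parser. The sketch below, in Python with `xml.etree.ElementTree`, lists local element declarations whose `minOccurs` is not `0` and attributes with `use="required"`. It is deliberately simplistic and rests on stated assumptions: it ignores exactly the cases mentioned above (choices, abstract elements, substitution groups), does not resolve named types or imports, and does not account for a required element nested inside an optional parent. The sample schema and the function name are invented for illustration.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical schema used only to exercise the function below.
SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="order">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="id" type="xs:int"/>
        <xs:element name="note" type="xs:string" minOccurs="0"/>
        <xs:element name="item" type="xs:string" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="currency" type="xs:string" use="required"/>
      <xs:attribute name="channel" type="xs:string"/>
    </xs:complexType>
  </xs:element>
</xs:schema>"""


def mandatory_fields(schema_text: str) -> list:
    """List local elements with minOccurs != 0 and required attributes."""
    root = ET.fromstring(schema_text)
    top_level = list(root)  # global declarations carry no minOccurs
    required = []
    for el in root.iter(XS + "element"):
        if el in top_level:
            continue
        if el.get("minOccurs", "1") != "0":  # minOccurs defaults to 1
            required.append(el.get("name") or el.get("ref"))
    for attr in root.iter(XS + "attribute"):
        if attr.get("use") == "required":  # attributes default to optional
            required.append("@" + (attr.get("name") or attr.get("ref")))
    return required
```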
