Is it possible to parse sub-trees with Groovy XmlSlurper

Does anyone know whether it is possible to utilise XMLSlurper in a fashion that means individual sub-trees can be pulled from a very large XML document and processed individually?
Imagine you've got a huge XML feed containing a root element that has thousands of direct child elements that you can process individually. Obviously, reading the whole document into memory is a no-no but, as each child of the root is itself modestly sized, it would be nice to stream through the document but apply XMLSlurper niceness to each of the child elements in turn. As each child element is processed, garbage collection can clean up memory used to process it. In this way we get the great ease of XMLSlurper (such concise syntax) with the low memory footprint of streaming (e.g. SAX).
I'd be interested to know if anyone has ideas on this and/or whether you've come across this requirement yourselves.

Initializing an XmlSlurper instance means calling one of its overloaded parse(..) methods (or the parseText(String) method). On this call, XmlSlurper uses SAX events (at least) to construct an in-memory GPathResult that holds the complete information about the XML elements and attributes and their structure.
So no, XmlSlurper does not provide an API for parsing only portions of an XML document.
What you can do is extend XmlSlurper and override the parse*(..) methods: pre-process the XML with a custom SAX handler, gather the desired portions of XML, and forward these to one of the XmlSlurper.parse*(..) methods.

You can use StAX API together with XmlSlurper to parse subtrees.
// Example of using StAX to split a large XML document and parse a single element using XmlSlurper
import javax.xml.stream.XMLInputFactory
import javax.xml.stream.XMLStreamReader
import javax.xml.transform.Transformer
import javax.xml.transform.TransformerFactory
import javax.xml.transform.sax.SAXResult
import javax.xml.transform.stax.StAXSource

def url = new URL("http://repo2.maven.org/maven2/archetype-catalog.xml")
url.withInputStream { inputStream ->
    def xmlStreamReader = XMLInputFactory.newInstance().createXMLStreamReader(inputStream)
    def transformer = TransformerFactory.newInstance().newTransformer()
    while (xmlStreamReader.hasNext()) {
        xmlStreamReader.next()
        if (xmlStreamReader.isStartElement() && xmlStreamReader.getLocalName() == 'archetype') {
            // Parse one <archetype> element at a time with a fresh XmlSlurper
            def xmlSlurper = new XmlSlurper()
            transformer.transform(new StAXSource(xmlStreamReader), new SAXResult(xmlSlurper))
            def archetype = xmlSlurper.document
            println "${archetype.groupId} ${archetype.artifactId} ${archetype.version}"
        }
    }
}

Related

Insertion order is not preserved while converting from XML to JSON using org.json.XML.toJSONObject(xmlString)

I am using a dynamic data structure for my project. So instead of a predefined class I am using java.util.LinkedHashMap to store my dynamic data and preserve my insertion order as well.
I am able to convert the map to JSON and get the map back from JSON using Jackson's ObjectMapper.
com.fasterxml.jackson.databind.ObjectMapper mapper = new com.fasterxml.jackson.databind.ObjectMapper();
// Map -> JSON
String json = mapper.writeValueAsString(map);
// JSON -> Map
LinkedHashMap<String, Object> restored =
    mapper.readValue(json, new TypeReference<LinkedHashMap<String, Object>>() {});
I am trying to do some XSLT transformation on my map data, so I also need to convert from XML to map and from map to XML. As there is no direct method for these conversions, I wrote my own utility for map to XML.
To convert from XML to map I used org.json.JSONObject. I first convert the XML to JSON using
org.json.XML.toJSONObject(xmlString)
and can then convert the JSON to a map easily using the object mapper.
But the problem here is that I am losing the insertion order, which is crucial for my data.
How can I convert my data from XML to JSON so that the insertion order is preserved?
That's a superb idea to use LinkedHashMap for a dynamic data structure.
However, JSONObject internally uses a HashMap to build the JSON, so it loses the insertion order.
public JSONObject() {
    // HashMap is used on purpose to ensure that elements are unordered by
    // the specification.
    // JSON tends to be a portable transfer format to allows the container
    // implementations to rearrange their items for a faster element
    // retrieval based on associative access.
    // Therefore, an implementation mustn't rely on the order of the item.
    this.map = new HashMap<String, Object>();
}
So if you can override JSONObject, your problem will be solved.
Enjoy!!!
You don't need to change the jar. You just need to create a new class with the same name as the class inside the jar, and also create new copies of the classes that depend on it.
Copy the code from the jar class into your new class and tweak it (for example, so that it uses a LinkedHashMap).
Then access that class from your code by changing the import statement.
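If copying JSONObject feels too invasive, a different route (not the approach described above, and not using org.json at all) is to skip the JSON step and build the LinkedHashMap directly from the XML with the JDK's DOM parser. The sketch below is only an illustration under simplifying assumptions: the class name OrderedXmlToMap is made up, attributes are ignored, and a repeated sibling element name overwrites the earlier value.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper, not part of org.json or Jackson.
public class OrderedXmlToMap {

    // Returns the trimmed text for a leaf element, or an ordered map of its child elements.
    static Object toValue(Element element) {
        Map<String, Object> children = new LinkedHashMap<>();
        for (Node n = element.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                Element child = (Element) n;
                // Simplification: a repeated sibling name overwrites the earlier value.
                children.put(child.getTagName(), toValue(child));
            }
        }
        return children.isEmpty() ? element.getTextContent().trim() : children;
    }

    // Parses the XML text and wraps the root element in a single-entry ordered map.
    static Map<String, Object> parse(String xml) throws Exception {
        Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
        Map<String, Object> result = new LinkedHashMap<>();
        result.put(root.getTagName(), toValue(root));
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<order><zeta>1</zeta><alpha>2</alpha><mid><b>x</b><a>y</a></mid></order>";
        // Keys come out in document order: zeta before alpha, b before a.
        System.out.println(parse(xml));
    }
}
```

The resulting map can be handed to ObjectMapper.writeValueAsString, which writes the entries in the map's iteration order, so the document order should carry through to the JSON.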

Unmarshalling XML Fragments at multiple levels of a hierarchy with JAXB

I need to un-marshal multiple objects out of an XML Structure that looks like this:
<Control>
    <TotalCompanies>2</TotalCompanies>
    <TotalSales>100</TotalSales>
    <Company>
        <Name>ACME Ca</Name>
        <TotalSales>70</TotalSales>
        <TotalSalesPeople>2</TotalSalesPeople>
        <SalesPeople>
            <SalesPerson>
                <Name>John</Name>
                <Sales>40</Sales>
            </SalesPerson>
            <SalesPerson>
                <Name>Joe</Name>
                <Sales>30</Sales>
            </SalesPerson>
        </SalesPeople>
    </Company>
    <Company>
        <Name>ACME Va</Name>
        <TotalSales>30</TotalSales>
        <TotalSalesPeople>1</TotalSalesPeople>
        <SalesPeople>
            <SalesPerson>
                <Name>Janet</Name>
                <Sales>30</Sales>
            </SalesPerson>
        </SalesPeople>
    </Company>
</Control>
I need to be able to separately unmarshal a Control object that contains just the totals and not its children, and similarly I need to do the same thing at the other levels of the hierarchy. So ideally, my beans would look something like this:
class Control {
    int totalCompanies;
    int totalSales;
}
class Company {
    String name;
    int totalSales;
    int totalSalesPeople;
}
class SalesPerson {
    String name;
    int sales;
}
I'm doing this in the context of Spring Batch, but I am pretty sure that doesn't matter. If I restructure the XML some, then I can get it to work (I am pretty sure I won't be allowed to restructure the XML, though). That is, if the objects aren't nested, then it is fine. Similarly, I can get all the SalesPeople out pretty easily.
I can also get the entire tree as an object, and that might work in some cases. However, the real incoming file could be larger than the available memory, so that won't work in practice.
Is there any way to get JAXB, or some other out-of-the-box unmarshaller, to do this, or do I just need to roll my own based on SAX or StAX?
EDIT:
The system is using Spring Batch to read in large incoming files. The files are not as described above (domain is different), but the structure is the same. The architectural direction is to attempt to use out-of-the-box readers (StaxEventItemReader, e.g.) and unmarshallers (Jaxb2Marshaller, e.g.).
The system will operate in environments where we cannot absolutely guarantee there is sufficient memory to hold the entire file in memory.
I have approaches (custom Stax reader/pre-processing the file/requesting an XSD change) that work, but I wanted to make sure I wasn't missing a feature in the standard reader / unmarshaller implementations that could make this work easily out of the box.
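For what it's worth, the "custom StAX reader" option can stay fairly small. The sketch below is only an illustration, not a Spring Batch or JAXB feature: the class name FlatStaxReader is made up, and it reports each record as a string instead of populating the beans. It makes one pass with the JDK's XMLStreamReader and keeps only the currently open records in memory. Each Control, Company, and SalesPerson is reported with just its own leaf fields when its end tag is reached, so children come out before their parents.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration only.
public class FlatStaxReader {

    // Element names that should become separate flat objects.
    static boolean isRecord(String name) {
        return "Control".equals(name) || "Company".equals(name) || "SalesPerson".equals(name);
    }

    // Collects records into a list for the demo; real code would hand each one to a callback.
    static List<String> read(Reader in) throws Exception {
        List<String> out = new ArrayList<>();
        Deque<String> path = new ArrayDeque<>();                 // open element names, innermost first
        Deque<Map<String, String>> records = new ArrayDeque<>(); // fields of the open record elements
        StringBuilder text = new StringBuilder();
        boolean leaf = false; // true while the innermost open element has had no child elements
        XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(in);
        while (r.hasNext()) {
            switch (r.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    path.push(r.getLocalName());
                    if (isRecord(r.getLocalName())) {
                        records.push(new LinkedHashMap<>());
                    }
                    text.setLength(0);
                    leaf = true;
                    break;
                case XMLStreamConstants.CHARACTERS:
                    text.append(r.getText());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    String name = path.pop();
                    if (isRecord(name)) {
                        out.add(name + records.pop());           // record complete, emit it
                    } else if (leaf && isRecord(path.peek())) {
                        records.peek().put(name, text.toString().trim()); // scalar field of the parent record
                    }
                    text.setLength(0);
                    leaf = false;
                    break;
                default:
                    break;
            }
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Control><TotalCompanies>1</TotalCompanies><Company><Name>ACME Ca</Name>"
                + "<SalesPeople><SalesPerson><Name>John</Name><Sales>40</Sales></SalesPerson>"
                + "</SalesPeople></Company></Control>";
        read(new StringReader(xml)).forEach(System.out::println);
    }
}
```

Container elements such as SalesPeople are skipped because they are neither records nor leaves, which is what keeps the nested children out of the parent object.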

Partial objects with JAXB?

I'm working to create some services with JAX-RS, and am relatively new to JAXB (actually XML in general) so please don't assume I know the pre-requisites that I probably should know! Here's the question: I want to send and receive "partial" objects in XML. That is, imagine one has an object (Java form, obviously) with:
class Thing { int x; String y; Customer z; }
I want to be able to send an XML output that contains (dynamically chosen, so I can't use XmlTransient) just x, or just z, or x and y, but not z, or any other combination that suits my client. The point, obviously, is that sometimes the client doesn't need everything, so I can save some bandwidth (particularly with lists of deep, complex objects, which this example clearly doesn't illustrate!).
Also, for input, the same bandwidth argument applies; I would like to be able to have the client send just the particular fields that should be updated in, say, a PUT operation, and ignore the rest, then have the server "merge" those new values onto existing objects and leave the un-mentioned fields unchanged.
This seems to be supported in the Jackson JSON libraries (though I'm still working on it), but I'm having trouble finding it in JAXB. Any ideas?
One thought that I was pondering is whether one can do this in some way via Maps. If I created a Map (potentially nested Maps, for nested complex objects) of what I want to send, could JAXB send that with a plausible structure? And if it could create such a map on input, I guess I could work through it to make the updates. Not perfect, but maybe?
And yes, I know that the "documents" that will be flying around will probably fail to comply with schemas, having missing fields and all that, but I'm ok with that, provided the infrastructure can be made to work.
Oh, and I know I could do this "manually" with SAX, StAX, or DOM parsing, but I'm hoping there's a rather more automatic way, particularly since JAXB handles the whole objects so effortlessly.
Cheers,
Toby
Note: I'm the EclipseLink JAXB (MOXy) lead and a member of the JAXB (JSR-222) expert group.
EclipseLink JAXB (MOXy) offers this support through its object graph extension. Object graphs allow you to specify a subset of properties for the purposes of marshalling and unmarshalling. They may be created programmatically at runtime:
// Create the Object Graph
ObjectGraph contactInfo = JAXBHelper.getJAXBContext(jc).createObjectGraph(Customer.class);
contactInfo.addAttributeNodes("name");
Subgraph location = contactInfo.addSubgraph("billingAddress");
location.addAttributeNodes("city", "province");
Subgraph simple = contactInfo.addSubgraph("phoneNumbers");
simple.addAttributeNodes("value");
// Output XML - Based on Object Graph
marshaller.setProperty(MarshallerProperties.OBJECT_GRAPH, contactInfo);
marshaller.marshal(customer, System.out);
or statically on the class through annotations:
@XmlNamedObjectGraph(
    name="contact info",
    attributeNodes={
        @XmlNamedAttributeNode("name"),
        @XmlNamedAttributeNode(value="billingAddress", subgraph="location"),
        @XmlNamedAttributeNode(value="phoneNumbers", subgraph="simple")
    },
    subgraphs={
        @XmlNamedSubgraph(
            name="location",
            attributeNodes = {
                @XmlNamedAttributeNode("city"),
                @XmlNamedAttributeNode("province")
            }
        )
    }
)
@XmlRootElement
@XmlAccessorType(XmlAccessType.FIELD)
public class Customer {
For More Information
http://blog.bdoughan.com/2013/03/moxys-object-graphs-partial-models-on.html
http://blog.bdoughan.com/2013/03/moxys-object-graphs-inputoutput-partial.html
http://blog.bdoughan.com/2011/05/specifying-eclipselink-moxy-as-your.html

groovy domain objects in Db4O database

I'm using db4o with Groovy (actually Griffon). I'm saving dozens of objects into a db4o ObjectSet and see that the .yarv file size is about 11 MB. I've checked its content and found that it stores the metaClass, with all its nested fields, in every object. It's a waste of space.
I'm looking for a way to avoid storing the metaClass and therefore reduce the size of the resulting .yarv file, since I'm going to use db4o to store millions of entities.
Should I try the callConstructors(true) db4o configuration? Do you think it would help?
Any help would be highly appreciated.
As an alternative you can just store 'Groovy'-beans instances. Those are compiled down to regular Java-ish classes with no special Groovy specific code attached to them.
Just like this:
class Customer {
    // properties
    Integer id
    String name
    Address address
}
class Address {
    String street
}
def customer = new Customer(id: 1, name: "Gromit", address: new Address(street: "Fun"))
I don't know groovy but based on your description every groovy object carries metadata and you want to skip storing these objects.
If that is the case installing a "null translator" (TNull class) will cause the "translated" objects to not be stored.
PS: Call Constructor configuration has no effect on what gets stored in the db; it only affects how objects are instantiated when reading from db.
Hope this helps

how to read XML file using xml reader?

Question 1: Assume that I am reading an XmlNodeType.Text node and I would like to know the name of its enclosing tag. How do you do that without moving the cursor up or down? Also, how can I find out the parent tag of the current node?
Question 2: Assume that I am reading an XML file and I would like to start at a particular node tag. How can I do that?
Question 3: If you have an XSD file, is there an easy way to upload the XML file? I am using C# 3.5 .NET and SQL Server 2008.
This is what I wrote so far:
XmlTextReader reader = new XmlTextReader("datafile.xml");
while (reader.Read())
{
    if (reader.NodeType == XmlNodeType.Element)
    {
        Console.Write(reader.Name);
    }
    else if (reader.NodeType == XmlNodeType.Text)
    {
        Console.Write("/" + reader.Name + "/" + reader.Value + "/");
    }
    else
    {
        if (reader.NodeType == XmlNodeType.EndElement)
        {
            Console.WriteLine(reader.Name);
            Console.ReadLine();
        }
    }
}
reader.Close();
Please let me know if you need more clarification
XmlReader is forward-only and only exposes information about the current node, so if you are reading the content of an element and wish to know the element's name, you need to retain that name yourself when you read the start element node.
Likewise, if you want to know the name of the parent element, you need to keep track of this state yourself as you read through the XML document.
If you wish to start reading at a particular node, you have to read the XML document node by node until you reach the node you wish to start at.
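The bookkeeping described above is usually just a stack of element names. Here is a minimal sketch of the idea, written against Java's StAX XMLStreamReader, which is a comparable forward-only reader, rather than against .NET's XmlReader. The class and method names are made up for illustration. With XmlReader, the same stack would be pushed on Element nodes and popped on EndElement nodes.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class; Java StAX stands in for .NET's XmlReader here.
public class ElementStackDemo {

    // Returns one entry per non-blank text node, formatted as "parent/element=text".
    static List<String> describeText(String xml) throws Exception {
        List<String> lines = new ArrayList<>();
        Deque<String> open = new ArrayDeque<>(); // names of open elements, innermost first
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE); // one event per text run
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                open.push(reader.getLocalName());   // remember the name when the element opens
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                open.pop();                         // forget it when the element closes
            } else if (event == XMLStreamConstants.CHARACTERS && !reader.isWhiteSpace()) {
                Iterator<String> it = open.iterator();
                String element = it.next();         // enclosing element of this text
                String parent = it.hasNext() ? it.next() : "(none)";
                lines.add(parent + "/" + element + "=" + reader.getText());
            }
        }
        return lines;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(describeText("<a><b>hi</b><c>yo</c></a>"));
    }
}
```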
Ultimately, reading XML via the XmlReader class is more difficult than the alternatives. Generally speaking, you would only use XmlReader if the XML document is very large; in most other cases, use one of the alternatives:
Linq to XML
The XmlDocument class
Using XSD.exe to generate a .Net class from a XSD file that can be used to serialise and deserialise xml via the XmlSerializer class.
For more information see XML Serialization in the .NET Framework
If you really want to use XmlReader then you should read Using the XmlReader Class .

Resources