Split an XML file into multiple files - groovy

Suppose I have the following XML file:
<a>
<b>
....
</b>
<b>
....
</b>
<b>
....
</b>
</a>
I want to split this file into multiple XML files based on the number of <b> tags.
Like:
File01.xml
<a>
<b>
....
</b>
</a>
File02.xml
<a>
<b>
....
</b>
</a>
File03.xml
<a>
<b>
....
</b>
</a>
And so on...
I'm new to Groovy and I tried with the following piece of code.
import java.util.HashMap
import java.util.List
import javax.xml.parsers.DocumentBuilderFactory
import org.custommonkey.xmlunit.*
import org.w3c.dom.NodeList
import javax.xml.xpath.*
import javax.xml.transform.TransformerFactory
import org.w3c.dom.*
import javax.xml.transform.dom.DOMSource
import javax.xml.transform.stream.StreamResult
class file_split {
File input = new File("C:\\file\\input.xml")
def dbf = DocumentBuilderFactory.newInstance().newDocumentBuilder()
def doc = new XmlSlurper(dbf).parse(ClassLoader.getSystemResourceAsStream(input));
def xpath = XPathFactory.newInstance().newXPath()
NodeList nodes = (NodeList) xpath.evaluate("//a/b", doc, XPathConstants.NODESET)
def itemsPerFile = 5;
def fileNumber = 0;
def currentdoc = dbf.newDocument()
def rootNode = currentdoc.createElement("a")
def currentFile = new File(fileNumber + ".xml")
try{
for(i = 1; i <= nodes.getLength(); i++){
def imported = currentdoc.importNode(nodes.item(i-1), true)
rootNode.appendChild(imported)
if(i % itemsPerFile == 0){
writeToFile(rootNode, currentFile)
rootNode = currentdoc.createElement("a");
currentFile = new File((++fileNumber)+".xml");
}
}
}
catch(Exception ex){
logError(file.name,ex.getMessage());
ex.printStackTrace();
}
def writeToFile(Node node, File file) throws Exception {
def transformer = TransformerFactory.newInstance().newTransformer();
transformer.transform(new DOMSource(node), new StreamResult(new FileWriter(file)));
}
}
Any help would be greatly appreciated.

This should work (assuming file holds the XML text as a String):
import groovy.xml.*
new XmlSlurper().parseText( file ).b.eachWithIndex { element, index ->
new File( "/tmp/File${ "${index+1}".padLeft( 2, '0' ) }.xml" ).withWriter { w ->
w << XmlUtil.serialize( new StreamingMarkupBuilder().bind {
a {
mkp.yield element
}
} )
}
}
If you want to group them, you can use collate (this example groups 2 b tags per file):
import groovy.xml.*
new XmlSlurper().parseText( file )
.b
.toList()
.collate( 2 )
.eachWithIndex { elements, index ->
new File( "/tmp/File${ "${index+1}".padLeft( 2, '0' ) }.xml" ).withWriter { w ->
w << XmlUtil.serialize( new StreamingMarkupBuilder().bind {
a {
elements.each { element ->
mkp.yield element
}
}
} )
}
}

I don't know what problem you are experiencing, but it seems you're creating a new rootNode when needed, but not a new currentdoc. Try reinitializing currentdoc right before you reinitialize rootNode in your loop.
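The suggestion above can be sketched in plain Java (which Groovy also accepts almost verbatim). This is only an illustration under assumptions: the class name XmlSplitter and the method split are made up for this example, the input is taken as a String, and the chunks are returned as strings rather than written to files. It creates a fresh Document and root element for every chunk, and it also emits the last, partially filled chunk, which the loop in the question never writes because it only calls writeToFile when i % itemsPerFile == 0.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
public class XmlSplitter {

    /** Splits the <b> children of <a> into chunks of itemsPerFile, one XML string per chunk. */
    public static List<String> split(String xml, int itemsPerFile) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document source = builder.parse(new InputSource(new StringReader(xml)));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("/a/b", source, XPathConstants.NODESET);

        List<String> chunks = new ArrayList<>();
        Document current = null;
        Element root = null;
        for (int i = 0; i < nodes.getLength(); i++) {
            if (i % itemsPerFile == 0) {
                // Start a new document AND a new root for every chunk.
                current = builder.newDocument();
                root = current.createElement("a");
                current.appendChild(root);
            }
            root.appendChild(current.importNode(nodes.item(i), true));

            boolean chunkFull = (i + 1) % itemsPerFile == 0;
            boolean lastNode = i == nodes.getLength() - 1;
            if (chunkFull || lastNode) {
                chunks.add(serialize(current));
            }
        }
        return chunks;
    }

    /** Serializes a document to a string without the XML declaration. */
    private static String serialize(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```

To write files instead, each returned string could be saved under a numbered name, as in the question.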

Related

How to create a GSP tag with MarkupBuilder?

I want to create a GSP file like this:
but I can't figure out how to write the code with MarkupBuilder.
My code looks like this:
MarkupBuilder mb = new groovy.xml.MarkupBuilder(strXml);
def builderA = new StreamingMarkupBuilder()
def gsp = builderA.bind{
html{
g.uploadForm(action:"saveDataItem"){
table{
f.with{
tr{
td{
"Test"
}
}
}
}
}
}
}
println XmlUtil.serialize(gsp)
It does not work.
import groovy.xml.*
def mb = new StreamingMarkupBuilder()
def gsp = mb.bind {
html{
"g:uploadForm"(action:"saveDataItem"){
table{
tr{
td("Test")
}
}
}
}
}
println gsp.toString()

Download a zip file using Groovy

I need to download a zip file from a URL using Groovy.
Test url: https://gist.github.com/daicham/5ac8461b8b49385244aa0977638c3420/archive/17a929502e6dda24d0ecfd5bb816c78a2bd5a088.zip
What I've done so far:
def static downloadArtifacts(url,filename) {
new URL(url).openConnection().with { conn ->
conn.setRequestProperty("PRIVATE-TOKEN", "xxxx")
url = conn.getHeaderField( "Location" )
if( !url ) {
new File((String)filename ).withOutputStream { out ->
conn.inputStream.with { inp ->
out << inp
inp.close()
}
}
}
}
}
But while opening the downloaded zip file I get an error "An error occurred while loading the archive".
Any help is appreciated.
URL url2download = new URL(url)
File file = new File(filename)
file.bytes = url2download.bytes
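The snippet above reads the whole archive into memory before writing it. For larger files, a streaming copy may be preferable. Here is a minimal Java sketch, assuming no authentication header is needed; the class name ZipDownloader and the method download are invented for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of any library.
public class ZipDownloader {

    /** Streams the resource at source into target and returns the number of bytes written. */
    public static long download(URL source, Path target) throws IOException {
        try (InputStream in = source.openStream()) {
            // Copies the raw bytes unchanged, so a binary archive stays intact.
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

The returned byte count can be compared against the expected size as a quick sanity check on the download.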
You can do it with HttpBuilder-NG:
// https://http-builder-ng.github.io/http-builder-ng/
@Grab('io.github.http-builder-ng:http-builder-ng-core:1.0.3')
import groovyx.net.http.HttpBuilder
import groovyx.net.http.optional.Download
def target = 'https://gist.github.com/daicham/5ac8461b8b49385244aa0977638c3420/archive/17a929502e6dda24d0ecfd5bb816c78a2bd5a088.zip'
File file = HttpBuilder.configure {
request.uri = target
}.get {
Download.toFile(delegate, new File('a.zip'))
}
You can do it:
import java.util.zip.ZipEntry
import java.util.zip.ZipOutputStream
class SampleZipController {
def index() { }
def downloadSampleZip() {
response.setContentType('APPLICATION/OCTET-STREAM')
response.setHeader('Content-Disposition', 'Attachment;Filename="example.zip"')
ZipOutputStream zip = new ZipOutputStream(response.outputStream);
def file1Entry = new ZipEntry('first_file.txt');
zip.putNextEntry(file1Entry);
zip.write("This is the content of the first file".bytes);
def file2Entry = new ZipEntry('second_file.txt');
zip.putNextEntry(file2Entry);
zip.write("This is the content of the second file".bytes);
zip.close();
}
}

Modifying the file contents of a zipfile entry

I would like to update the contents of a text file located inside a zip file.
I cannot find out how to do this, and the code below is not working properly.
Many thanks for any help!
import java.util.zip.ZipFile
import java.util.zip.ZipEntry
import java.util.zip.ZipOutputStream
String zipFileFullPath = "C:/path/to/myzipfile/test.zip"
ZipFile zipFile = new ZipFile(zipFileFullPath)
ZipEntry entry = zipFile.getEntry ( "someFile.txt" )
if(entry){
InputStream input = zipFile.getInputStream(entry)
BufferedReader br = new BufferedReader(new InputStreamReader(input, "UTF-8"))
String s = null
StringBuffer sb = new StringBuffer()
while ((s=br.readLine())!=null){
sb.append(s)
}
sb.append("adding some text..")
ZipOutputStream out = new ZipOutputStream(new FileOutputStream(zipFileFullPath))
out.putNextEntry(new ZipEntry("someFile.txt"));
int length
InputStream fin = new ByteArrayInputStream(sb.toString().getBytes("UTF8"))
while((length = fin.read(sb)) > 0)
{
out.write(sb, 0, length)
}
out.closeEntry()
}
Just some slight modifications to @Opal's answer. I've just:
used Groovy methods where possible
packaged it in a method
Groovy Snippet
void updateZipEntry(String zipFile, String zipEntry, String newContent){
def zin = new ZipFile(zipFile)
def tmp = File.createTempFile("temp_${System.nanoTime()}", '.zip')
tmp.withOutputStream { os ->
def zos = new ZipOutputStream(os)
zin.entries().each { entry ->
def isReplaced = entry.name == zipEntry
zos.putNextEntry(isReplaced ? new ZipEntry(zipEntry) : entry)
zos << (isReplaced ? newContent.getBytes('UTF8') : zin.getInputStream(entry).bytes )
zos.closeEntry()
}
zos.close()
}
zin.close()
assert new File(zipFile).delete()
tmp.renameTo(zipFile)
}
Usage
updateZipEntry('/tmp/file.zip', 'META-INF/web.xml', '<foobar>new content!</foobar>')
What exactly isn't working? Is there any exception thrown?
As far as I know it's not possible to modify a zip file in situ. The following script rewrites the file and if desired entry is processed - modifies it.
import java.util.zip.*
def zipIn = new File('lol.zip')
def zip = new ZipFile(zipIn)
def zipTemp = File.createTempFile('out', 'zip')
zipTemp.deleteOnExit()
def zos = new ZipOutputStream(new FileOutputStream(zipTemp))
def toModify = 'lol.txt'
for(e in zip.entries()) {
if(!e.name.equalsIgnoreCase(toModify)) {
zos.putNextEntry(e)
zos << zip.getInputStream(e).bytes
} else {
zos.putNextEntry(new ZipEntry(toModify))
zos << 'lollol\n'.bytes
}
zos.closeEntry()
}
zos.close()
zipIn.delete()
zipTemp.renameTo(zipIn)
UPDATE
I was wrong. It is possible to modify a zip file in situ, but your solution will omit the other files that were zipped. The output file will contain only a single file: the one you wanted to modify. I also suspect that your file was corrupted because close() was never invoked on out.
Below is your script, slightly modified (groovier):
import java.util.zip.*
def zipFileFullPath = 'lol.zip'
def zipFile = new ZipFile(zipFileFullPath)
def entry = zipFile.getEntry('lol.txt')
if(entry) {
def input = zipFile.getInputStream(entry)
def br = new BufferedReader(new InputStreamReader(input, 'UTF-8'))
def sb = new StringBuffer()
sb << br.text
sb << 'adding some text..'
def out = new ZipOutputStream(new FileOutputStream(zipFileFullPath))
out.putNextEntry(new ZipEntry('lol.txt'))
out << sb.toString().getBytes('UTF8')
out.closeEntry()
out.close()
}
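To keep the other entries, the archive has to be rewritten entry by entry, as the first script in this answer does. The same idea can be sketched in Java on in-memory bytes. The class name ZipEntryUpdater and the method replaceEntry are made up for this example, and working on byte arrays rather than files is an assumption chosen to keep the sketch self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of any library.
public class ZipEntryUpdater {

    /** Returns a copy of the archive in which entryName holds newContent; all other entries are copied. */
    public static byte[] replaceEntry(byte[] zipBytes, String entryName, String newContent)
            throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes));
             ZipOutputStream zout = new ZipOutputStream(buffer)) {
            ZipEntry entry;
            byte[] chunk = new byte[8192];
            while ((entry = zin.getNextEntry()) != null) {
                // A fresh ZipEntry avoids carrying over the old sizes and CRC.
                zout.putNextEntry(new ZipEntry(entry.getName()));
                if (entry.getName().equals(entryName)) {
                    zout.write(newContent.getBytes(StandardCharsets.UTF_8));
                } else {
                    int read;
                    while ((read = zin.read(chunk)) > 0) {
                        zout.write(chunk, 0, read);
                    }
                }
                zout.closeEntry();
            }
        }
        // Both streams are closed here, so the archive is complete.
        return buffer.toByteArray();
    }
}
```

The result can then be written to a temporary file and moved over the original, as in the scripts above.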

How to read Nutch-generated content data in the segment folder using Java

I am trying to read the content data inside the segment folder. I think the content data file is written in a custom format.
I experimented with Nutch's Content class, but it does not recognize the format.
import java.io.IOException;
import org.apache.commons.cli.Options;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;
public class ContentReader {
public static void main(String[] args) throws IOException {
// Setup the parser
Configuration conf = NutchConfiguration.create();
Options opts = new Options();
GenericOptionsParser parser = new GenericOptionsParser(conf, opts, args);
String[] remainingArgs = parser.getRemainingArgs();
FileSystem fs = FileSystem.get(conf);
String segment = remainingArgs[0];
Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data");
SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
Text key = new Text();
Content content = new Content();
// Loop through sequence files
while (reader.next(key, content)) {
try {
System.out.write(content.getContent(), 0,
content.getContent().length);
} catch (Exception e) {
}
}
}
}
org.apache.nutch.segment.SegmentReader has a MapReduce implementation that reads content data in the segment directory.
Here is Spark/Scala code to read data from the segment's content folder.
This is how I read from the content folder in my project.
I have created a case class Page which holds the data read from the content folder:
case class Page(var url: String, var title: String = null
,var contentType: String = null, var rawHtml: String = null,var language: String = null
,var metadata: Map[String,String] = Map.empty) // default lets Page(url) compile
Code to read from content folder
import org.apache.commons.lang3.StringUtils
import org.apache.hadoop.io.{Text, Writable}
import org.apache.nutch.crawl.{CrawlDatum, Inlinks}
import org.apache.nutch.parse.ParseText
import org.apache.nutch.protocol.Content
import scala.util.Try
val contentDF = spark.sparkContext.sequenceFile(path.contentLocation, classOf[Text], classOf[Writable])
.map { case (x, y) => (x.toString, extract(y.asInstanceOf[Content])) }
/** converts Content object to Page **/
def extract(content: Content): Page = {
try {
val parsed = Page(content.getUrl)
var charset: String = getCharsetFromContentType(content.getContentType)
if (StringUtils.isBlank(charset)) {
charset = "UTF-8"
}
parsed.rawHtml = Try(new String(content.getContent, charset)).getOrElse(new String(content.getContent, "UTF-8"))
parsed.contentType = Try(content.getMetadata.get("Content-Type")).getOrElse("text/html")
// parsed.isHomePage = Boolean.valueOf(content.getMetadata.get("isHomePage"))
parsed.metadata = content.getMetadata.names().map(name => (name,content.getMetadata.get(name))).toMap
Try {
if (StringUtils.isNotBlank(content.getMetadata.get("Content-Language")))
parsed.language = content.getMetadata.get("Content-Language")
else if (StringUtils.isNotBlank(content.getMetadata.get("language")))
parsed.language = content.getMetadata.get("language")
else parsed.language = content.getMetadata.get("lang")
}
parsed
} catch {
case e: Exception =>
LOG.error("ERROR while extracting data from Content ", e)
null
}
}
/** Gets the charset from an HTML Content-Type value; charsetPattern is defined elsewhere in the project (not shown) **/
def getCharsetFromContentType(contentType: String): String = {
var result: String = "UTF-8"
Try {
if (StringUtils.isNotBlank(contentType)) {
val m = charsetPattern.matcher(contentType)
result = if (m.find) m.group(1).trim.toUpperCase else "UTF-8"
}
}
result
}

replace XmlSlurper tag with arbitrary XML

I am trying to replace specific XmlSlurper tags with arbitrary XML strings. The best way I have managed to come up with to do this is:
#!/usr/bin/env groovy
import groovy.xml.StreamingMarkupBuilder
def page=new XmlSlurper(new org.cyberneko.html.parsers.SAXParser()).parseText("""
<html>
<head></head>
<body>
<one attr1='val1'>asdf</one>
<two />
<replacemewithxml />
</body>
</html>
""".trim())
import groovy.xml.XmlUtil
def closure
closure={ bind,node->
if (node.name()=="REPLACEMEWITHXML") {
bind.mkp.yieldUnescaped "<replacementxml>sometext</replacementxml>"
} else {
bind."${node.name()}"(node.attributes()) {
mkp.yield node.text()
node.children().each { child->
closure(bind,child)
}
}
}
}
println XmlUtil.serialize(
new StreamingMarkupBuilder().bind { bind->
closure(bind,page)
}
)
However, the only problem is the text() element seems to capture all child text nodes, and thus I get:
<?xml version="1.0" encoding="UTF-8"?>
<HTML>asdf<HEAD/>
<BODY>asdf<ONE attr1="val1">asdf</ONE>
<TWO/>
<replacementxml>sometext</replacementxml>
</BODY>
</HTML>
Any ideas/help much appreciated.
Thank you!
Misha
p.s. Also, out of curiosity, if I change the above to the "Groovier" notation as follows, the groovy compiler thinks I am trying to access the ${node.name()} member of my test class. Is there a way to specify this is not the case while still not passing the actual builder object? Thank you! :)
def closure
closure={ node->
if (node.name()=="REPLACEMEWITHXML") {
mkp.yieldUnescaped "<replacementxml>sometext</replacementxml>"
} else {
"${node.name()}"(node.attributes()) {
mkp.yield node.text()
node.children().each { child->
closure(child)
}
}
}
}
println XmlUtil.serialize(
new StreamingMarkupBuilder().bind {
closure(page)
}
)
Ok here is what I came up with:
#!/usr/bin/env groovy
import groovy.xml.StreamingMarkupBuilder
import groovy.xml.XmlUtil
def printSlurper={page->
println XmlUtil.serialize(
new StreamingMarkupBuilder().bind { bind->
mkp.yield page
}
)
}
def saxParser=new org.cyberneko.html.parsers.SAXParser()
saxParser.setFeature('http://xml.org/sax/features/namespaces',false)
saxParser.setFeature("http://cyberneko.org/html/features/balance-tags/document-fragment",true)
def string="TEST"
def middleClosureHelper={ builder->
builder."${string}" {
mkp.yieldUnescaped "<inner>XML</inner>"
}
}
def middleClosure={
MiddleClosure {
middleClosureHelper(delegate)
}
}
def original=new XmlSlurper(saxParser).parseText("""
<original>
<middle>
</middle>
</original>
""")
original.depthFirst().find { it.name()=='MIDDLE' }.replaceNode { node->
mkp.yield middleClosure
}
printSlurper(original)
assert original.depthFirst().find { it.name()=='INNER' } == null
def modified=new XmlSlurper(saxParser).parseText(new StreamingMarkupBuilder().bind {mkp.yield original}.toString())
assert modified.depthFirst().find { it.name()=='INNER' } != null
You have to reload the slurper, but it works!
Misha
