How to read Nutch-generated content data in the segment folder using Java

I am trying to read the content data inside a Nutch segment folder. I think the content data file is written in a custom format.
I experimented with Nutch's Content class, but it does not recognize the format.

import java.io.IOException;
import org.apache.commons.cli.Options;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;
public class ContentReader {
    public static void main(String[] args) throws IOException {
        // Set up the parser
        Configuration conf = NutchConfiguration.create();
        Options opts = new Options();
        GenericOptionsParser parser = new GenericOptionsParser(conf, opts, args);
        String[] remainingArgs = parser.getRemainingArgs();
        FileSystem fs = FileSystem.get(conf);
        // The first remaining argument is the segment directory
        String segment = remainingArgs[0];
        Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
        try {
            Text key = new Text();
            Content content = new Content();
            // Loop through the records: the key is the URL, the value is the fetched content
            while (reader.next(key, content)) {
                byte[] raw = content.getContent();
                System.out.write(raw, 0, raw.length);
            }
            System.out.flush();
        } finally {
            // Release the underlying file handle
            reader.close();
        }
    }
}

org.apache.nutch.segment.SegmentReader has a MapReduce implementation that reads the content data in a segment directory.

Below is the Spark/Scala code I use in my project to read data from a segment's content folder.
I created a case class Page which holds the data read from the content folder:
case class Page(var url: String, var title: String = null,
                var contentType: String = null, var rawHtml: String = null,
                var language: String = null,
                var metadata: Map[String, String] = Map.empty)
Code to read from the content folder:
import java.util.regex.Pattern
import org.apache.commons.lang3.StringUtils
import org.apache.hadoop.io.{Text, Writable}
import org.apache.nutch.crawl.{CrawlDatum, Inlinks}
import org.apache.nutch.parse.ParseText
import org.apache.nutch.protocol.Content
import org.slf4j.LoggerFactory
import scala.util.Try
// LOG and charsetPattern were not shown in the original snippet; these definitions are assumptions
val LOG = LoggerFactory.getLogger("ContentExtractor")
val charsetPattern = Pattern.compile("(?i)\\bcharset=\\s*\"?([^\\s;\"]*)")
// path.contentLocation is the path to the segment's content directory
val contentDF = spark.sparkContext.sequenceFile(path.contentLocation, classOf[Text], classOf[Writable])
  .map { case (x, y) => (x.toString, extract(y.asInstanceOf[Content])) }
/** Converts a Content object to a Page; returns null if extraction fails **/
def extract(content: Content): Page = {
  try {
    val parsed = Page(content.getUrl)
    var charset: String = getCharsetFromContentType(content.getContentType)
    if (StringUtils.isBlank(charset)) {
      charset = "UTF-8"
    }
    // Fall back to UTF-8 when the declared charset is not supported
    parsed.rawHtml = Try(new String(content.getContent, charset)).getOrElse(new String(content.getContent, "UTF-8"))
    // Metadata.get returns null for a missing key, so wrap it in Option rather than Try
    parsed.contentType = Option(content.getMetadata.get("Content-Type")).getOrElse("text/html")
    // parsed.isHomePage = Boolean.valueOf(content.getMetadata.get("isHomePage"))
    parsed.metadata = content.getMetadata.names().map(name => (name, content.getMetadata.get(name))).toMap
    Try {
      if (StringUtils.isNotBlank(content.getMetadata.get("Content-Language")))
        parsed.language = content.getMetadata.get("Content-Language")
      else if (StringUtils.isNotBlank(content.getMetadata.get("language")))
        parsed.language = content.getMetadata.get("language")
      else parsed.language = content.getMetadata.get("lang")
    }
    parsed
  } catch {
    case e: Exception =>
      LOG.error("ERROR while extracting data from Content ", e)
      null
  }
}
/** Extracts the charset from a Content-Type value, defaulting to UTF-8 **/
def getCharsetFromContentType(contentType: String): String = {
  var result: String = "UTF-8"
  Try {
    if (StringUtils.isNotBlank(contentType)) {
      val m = charsetPattern.matcher(contentType)
      result = if (m.find) m.group(1).trim.toUpperCase else "UTF-8"
    }
  }
  result
}
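For anyone doing the same thing from Java, here is a minimal sketch of the charset step on its own: pull the charset out of a Content-Type value and fall back to UTF-8 when it is missing or not supported. The class name CharsetSniffer and the regular expression are my own assumptions, not part of Nutch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharsetSniffer {
    // Hypothetical pattern: captures the value of a charset=... parameter
    private static final Pattern CHARSET =
            Pattern.compile("(?i)\\bcharset\\s*=\\s*\"?([^\\s;\"]+)");

    /** Returns the declared charset, or UTF-8 when none is declared or it is unsupported. */
    static Charset charsetOf(String contentType) {
        if (contentType == null) {
            return StandardCharsets.UTF_8;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return StandardCharsets.UTF_8;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names
            return StandardCharsets.UTF_8;
        }
    }

    /** Decodes raw bytes using the charset declared in the Content-Type value. */
    static String decode(byte[] raw, String contentType) {
        return new String(raw, charsetOf(contentType));
    }

    public static void main(String[] args) {
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decode(latin1, "text/html; charset=ISO-8859-1"));
    }
}
```

Returning a Charset instead of a String means the decode call cannot fail on an unknown name later on.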

Related

Why can't I read a text file from my project (Android Studio)?

I created an assets folder in my project and put my text file there. But when I run my app, it crashes with the error:
"Caused by: android.system.ErrnoException: open failed: ENOENT (No such file or directory)"
I tried different ways of defining the filename variable, for example "assets/file.txt", "assets\file.txt", and "./file.txt", but I still get the same error.
package com.soft23.testfile
import android.os.Bundle
import androidx.activity.ComponentActivity
import androidx.activity.compose.setContent
import androidx.compose.material.Text
import java.io.File
class MainActivity : ComponentActivity() {
    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        var str = "str"
        var filename = "file.txt"
        File(filename).forEachLine { str = it }
        setContent {
            Text(text = str)
        }
    }
}
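Similar code works fine in IntelliJ IDEA.
What did I do wrong?
Assets are packaged inside the APK rather than stored as ordinary files on the device, so java.io.File cannot open them. Read them through the AssetManager instead. You can read an asset file in Kotlin this way: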
// Needs: import android.util.Log and import java.io.IOException
try {
    // Called from a Fragment here; in an Activity, use assets.open("test.txt") instead
    val text = activity?.assets?.open("test.txt")
        ?.bufferedReader()
        ?.use { it.readText() } // use {} closes the reader even if reading fails
    // Here is the result
    Log.d("test.txt", "string = $text")
} catch (e: IOException) {
    // Thrown when the asset does not exist or cannot be read
    Log.e("test.txt", "Could not read asset", e)
}
Make sure your .txt file is in this folder:
project/app/src/main/assets

Download a zip file using Groovy

I need to download a zip file from a URL using Groovy.
Test url: https://gist.github.com/daicham/5ac8461b8b49385244aa0977638c3420/archive/17a929502e6dda24d0ecfd5bb816c78a2bd5a088.zip
What I've done so far:
def static downloadArtifacts(url, filename) {
    new URL(url).openConnection().with { conn ->
        conn.setRequestProperty("PRIVATE-TOKEN", "xxxx")
        url = conn.getHeaderField("Location")
        if (!url) {
            new File((String) filename).withOutputStream { out ->
                conn.inputStream.with { inp ->
                    out << inp
                    inp.close()
                }
            }
        }
    }
}
But while opening the downloaded zip file I get an error "An error occurred while loading the archive".
Any help is appreciated.
URL url2download = new URL(url)
File file = new File(filename)
file.bytes = url2download.bytes
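The three-liner above reads the whole file into memory before writing it. For larger archives, a streaming copy may be preferable. A minimal plain-Java sketch of that idea (it also runs from Groovy); the class name UrlDownloader is made up for illustration, and it does not set request headers such as the PRIVATE-TOKEN from the question:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

class UrlDownloader {
    /** Streams the bytes behind url into target, replacing any existing file. */
    static long download(URL url, Path target) throws IOException {
        try (InputStream in = url.openStream()) {
            // Files.copy copies in chunks, so the file is never held in memory as a whole
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0] = source URL, args[1] = destination file (both supplied by the caller)
        long bytes = download(URI.create(args[0]).toURL(), Paths.get(args[1]));
        System.out.println("Wrote " + bytes + " bytes");
    }
}
```

Because the stream is copied byte for byte, the result should be identical to what the server sent. If the archive still fails to open, it is worth checking whether the server returned an HTML page instead of the zip.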
You can do it with HttpBuilder-NG:
// https://http-builder-ng.github.io/http-builder-ng/
@Grab('io.github.http-builder-ng:http-builder-ng-core:1.0.3')
import groovyx.net.http.HttpBuilder
import groovyx.net.http.optional.Download
def target = 'https://gist.github.com/daicham/5ac8461b8b49385244aa0977638c3420/archive/17a929502e6dda24d0ecfd5bb816c78a2bd5a088.zip'
File file = HttpBuilder.configure {
    request.uri = target
}.get {
    Download.toFile(delegate, new File('a.zip'))
}
If you instead need to generate a zip on the server and send it as a download (for example, from a Grails-style controller), you can do it like this:
import java.util.zip.ZipEntry
import java.util.zip.ZipOutputStream
class SampleZipController {
    def index() { }
    def downloadSampleZip() {
        response.setContentType('APPLICATION/OCTET-STREAM')
        response.setHeader('Content-Disposition', 'Attachment;Filename="example.zip"')
        ZipOutputStream zip = new ZipOutputStream(response.outputStream);
        def file1Entry = new ZipEntry('first_file.txt');
        zip.putNextEntry(file1Entry);
        zip.write("This is the content of the first file".bytes);
        // putNextEntry closes the previous entry before starting the next one
        def file2Entry = new ZipEntry('second_file.txt');
        zip.putNextEntry(file2Entry);
        zip.write("This is the content of the second file".bytes);
        zip.close();
    }
}
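To check that an archive built with ZipOutputStream is valid, you can read it back with ZipInputStream. A small self-contained sketch in plain Java (requires Java 9+ for readAllBytes); the class name ZipRoundTrip and the in-memory buffers are my own choices for illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ZipRoundTrip {
    /** Packs name -> text pairs into an in-memory zip archive. */
    static byte[] zip(Map<String, String> files) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (Map.Entry<String, String> f : files.entrySet()) {
                zip.putNextEntry(new ZipEntry(f.getKey()));
                zip.write(f.getValue().getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } // closing the stream writes the central directory
        return bytes.toByteArray();
    }

    /** Reads every entry back as name -> text, in archive order. */
    static Map<String, String> unzip(byte[] archive) throws IOException {
        Map<String, String> files = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(archive))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                // readAllBytes stops at the end of the current entry
                files.put(e.getName(), new String(zip.readAllBytes(), StandardCharsets.UTF_8));
            }
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("first_file.txt", "This is the content of the first file");
        files.put("second_file.txt", "This is the content of the second file");
        System.out.println(unzip(zip(files)));
    }
}
```

The important detail is that the ZipOutputStream must be closed before the bytes are used; an archive without its central directory will not open.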

How to get the list of named ranges, sheet names, and reference formulas using XSSF and SAX (Event API) for a large Excel file

I'm trying to read a large Excel file (about 10 MB, .xlsx).
I'm using the code below:
Workbook xmlworkbook = WorkbookFactory.create(OPCPackage.openOrCreate(root_path_name_file));
But it runs out of heap memory.
I have seen other solutions on StackOverflow; some of them suggest increasing the JVM heap size, but I don't want to do that.
Issue 1) We can't use SXSSF (Streaming Usermodel API) because it is only for writing or creating new workbooks.
My sole objective is to get the number of named ranges, the total number of sheets, and the sheet names for a large Excel file.
If the requirement is only to get the named ranges and sheet names, then only /xl/workbook.xml from the *.xlsx ZIP package needs to be parsed, since that information is all stored there.
This is possible by getting the appropriate PackagePart and parsing its XML. For parsing XML, my favorite is StAX.
Example code which gets all sheet names and defined names:
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackagePart;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.EndElement;
import javax.xml.stream.events.Characters;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.XMLEvent;
import javax.xml.namespace.QName;
import java.io.File;
import java.util.regex.Pattern;
import java.util.List;
import java.util.ArrayList;
import java.util.Map;
import java.util.HashMap;
class StaxReadOPCPackageParts {
    public static void main(String[] args) {
        try {
            File file = new File("file.xlsx");
            OPCPackage opcpackage = OPCPackage.open(file);
            //get the workbook package part
            PackagePart workbookpart = opcpackage.getPartsByName(Pattern.compile("/xl/workbook.xml")).get(0);
            //create reader for package part
            XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(workbookpart.getInputStream());
            List<String> sheetNames = new ArrayList<>();
            Map<String, String> definedNames = new HashMap<>();
            boolean isInDefinedName = false;
            String sheetName = "";
            String definedNameName = "";
            StringBuffer definedNameFormula = new StringBuffer();
            while (reader.hasNext()) { //loop over all XML in workbook.xml
                XMLEvent event = (XMLEvent) reader.next();
                if (event.isStartElement()) {
                    StartElement startElement = (StartElement) event;
                    QName startElementName = startElement.getName();
                    if (startElementName.getLocalPart().equalsIgnoreCase("sheet")) { //start element of sheet definition
                        Attribute attribute = startElement.getAttributeByName(new QName("name"));
                        sheetName = attribute.getValue();
                        sheetNames.add(sheetName);
                    } else if (startElementName.getLocalPart().equalsIgnoreCase("definedName")) { //start element of definedName
                        Attribute attribute = startElement.getAttributeByName(new QName("name"));
                        definedNameName = attribute.getValue();
                        isInDefinedName = true;
                    }
                } else if (event.isCharacters() && isInDefinedName) { //character content of definedName == the formula
                    definedNameFormula.append(((Characters) event).getData());
                } else if (event.isEndElement()) {
                    EndElement endElement = (EndElement) event;
                    QName endElementName = endElement.getName();
                    if (endElementName.getLocalPart().equalsIgnoreCase("definedName")) { //end element of definedName
                        definedNames.put(definedNameName, definedNameFormula.toString());
                        definedNameFormula = new StringBuffer();
                        isInDefinedName = false;
                    }
                }
            }
            opcpackage.close();
            System.out.println("Sheet names:");
            for (String shName : sheetNames) {
                System.out.println("Sheet name: " + shName);
            }
            System.out.println("Named ranges:");
            for (String defName : definedNames.keySet()) {
                System.out.println("Name: " + defName + ", Formula: " + definedNames.get(defName));
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
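Since an .xlsx file is an ordinary zip archive, the same parse can also be sketched with the JDK alone, without POI. This variant uses the StAX cursor API (XMLStreamReader) and assumes the workbook part sits at its usual location xl/workbook.xml; the class name WorkbookXmlScan is made up, and there is no handling for a missing entry.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipFile;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class WorkbookXmlScan {
    final List<String> sheetNames = new ArrayList<>();
    final Map<String, String> definedNames = new LinkedHashMap<>();

    /** Collects sheet names and defined names from the XML of a workbook part. */
    static WorkbookXmlScan parse(InputStream workbookXml) throws XMLStreamException {
        WorkbookXmlScan result = new WorkbookXmlScan();
        XMLStreamReader xml = XMLInputFactory.newInstance().createXMLStreamReader(workbookXml);
        try {
            while (xml.hasNext()) {
                if (xml.next() != XMLStreamConstants.START_ELEMENT) {
                    continue;
                }
                String name = xml.getAttributeValue(null, "name");
                if ("sheet".equals(xml.getLocalName())) {
                    result.sheetNames.add(name);
                } else if ("definedName".equals(xml.getLocalName())) {
                    // getElementText() reads the formula text and stops at the end tag
                    result.definedNames.put(name, xml.getElementText());
                }
            }
        } finally {
            xml.close();
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to an .xlsx file
        try (ZipFile zip = new ZipFile(args[0]);
             InputStream in = zip.getInputStream(zip.getEntry("xl/workbook.xml"))) {
            WorkbookXmlScan scan = parse(in);
            System.out.println("Sheets: " + scan.sheetNames);
            System.out.println("Defined names: " + scan.definedNames);
        }
    }
}
```

Only the workbook part is streamed, so the sheet data is never loaded, which is what keeps the memory footprint small.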

Excel File Upload with Kendo UI

While trying to upload an Excel file with Kendo UI, I found some code on the internet. It uses an identifier named "Constants" (as in Constants.xls for the ".xls" file extension), but it is not recognized in my project. I am stuck on this; I did some research but found no answer. Here is my code:
    public ActionResult Submit(IEnumerable<HttpPostedFileBase> files)
    {
        if (files != null)
        {
            string fileName;
            string filePath;
            string fileExtension;
            foreach (var f in files)
            {
                //Set file details
                SetFileDetails(f, out fileName, out filePath, out fileExtension);
                if (fileExtension == Constants.xls || fileExtension == Constants.xlsx)
                {
                    //Save the uploaded file to app folder
                    string savedExcelFiles = Constants.UploadedFolder + fileName;
                    f.SaveAs(Server.MapPath(savedExcelFiles));
                    ReadDataFromExcelFiles(savedExcelFiles);
                }
                else
                {
                    //file not supported send alert
                }
            }
        }
        return RedirectToActionPermanent("Index", "Connect");
    }
    private static void SetFileDetails(HttpPostedFileBase f, out string fileName, out string filePath, out string fileExtension)
    {
        fileName = Path.GetFileName(f.FileName);
        fileExtension = Path.GetExtension(f.FileName);
        filePath = Path.GetFullPath(f.FileName);
    }
    private void ReadDataFromExcelFiles(string savedExcelFiles)
    {
        var connectionString = string.Format("Provider=Microsoft.ACE.OLEDB.12.0;Data Source={0};Extended Properties=Excel 12.0;", Server.MapPath(savedExcelFiles));
        //fill the DataSet by the sheets
        var adapter = new OleDbDataAdapter("SELECT * FROM [Sheet1$]", connectionString);
        var ds = new DataSet();
        List<UploadExcel> uploadExl = new List<UploadExcel>();
        adapter.Fill(ds, "Subscriber");
        DataTable data = ds.Tables["Subscriber"];
        GetSetUploadExcelData(uploadExl, data);
    }
    private static void GetSetUploadExcelData(List<UploadExcel> uploadExl, DataTable data)
    {
        for (int i = 0; i < data.Rows.Count - 1; i++)
        {
            UploadExcel NewUpload = new UploadExcel();
            NewUpload.ID = Convert.ToInt16(data.Rows[i]["ID"]);
            NewUpload.CostCenter = Convert.ToString(data.Rows[i]["CostCenter"]);
            NewUpload.FirstName = Convert.ToString(data.Rows[i]["FirstName"]);
            NewUpload.LastName = Convert.ToString(data.Rows[i]["LastName"]);
            NewUpload.MobileNo = Convert.ToString(data.Rows[i]["MobileNo"]);
            NewUpload.EmailID = Convert.ToString(data.Rows[i]["EmailID"]);
            NewUpload.Services = Convert.ToString(data.Rows[i]["Services"]);
            NewUpload.UsageType = Convert.ToString(data.Rows[i]["UsageType"]);
            NewUpload.Network = Convert.ToString(data.Rows[i]["Network"]);
            NewUpload.UsageIncluded = Convert.ToInt16(data.Rows[i]["UsageIncluded"]);
            NewUpload.Unit = Convert.ToString(data.Rows[i]["Unit"]);
            uploadExl.Add(NewUpload);
        }
    }
}
I suspect that Constants.xls refers to a static class or enum that the original code author uses to hold the .xls/.xlsx extensions.
If you create a constants class, something like:
public static class Constants
{
    // Path.GetExtension returns the extension with its leading period
    public const string xls = ".xls";
    public const string xlsx = ".xlsx";
    // Hypothetical location; point this at your own upload folder
    public const string UploadedFolder = "~/Uploads/";
}
This should then help.
If you need any more assistance then please let me know.
Edit: Reviewing the code, it seems they also use a constant for the upload folder location, so I suspect this is just a static class holding application-specific details rather than an enum, a bit like using appSettings in web.config.
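The check itself is language-independent, so here is a rough Java sketch of the same idea: keep the allowed extensions in one place and compare case-insensitively. Note that .NET's Path.GetExtension returns the extension including the leading period, so the values you compare against need it too; this sketch keeps the period for the same reason. The class and method names are made up for illustration (Set.of needs Java 9+).

```java
import java.util.Locale;
import java.util.Set;

class ExcelExtensions {
    // Hypothetical whitelist; the leading period is part of each value
    static final Set<String> ALLOWED = Set.of(".xls", ".xlsx");

    /** Returns the lower-cased extension including the period, or "" if there is none. */
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot).toLowerCase(Locale.ROOT);
    }

    /** True when the file name ends in one of the allowed Excel extensions. */
    static boolean isExcel(String fileName) {
        return ALLOWED.contains(extensionOf(fileName));
    }

    public static void main(String[] args) {
        System.out.println(isExcel("subscribers.XLSX"));
        System.out.println(isExcel("notes.txt"));
    }
}
```

Lower-casing before the lookup means uploads named REPORT.XLS are accepted as well, which a plain == comparison would reject.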

How to find, read, and replace text inside ${} placeholders

Sample data keyed in by the user:
booking/${mm}/${yyyy}
${yyyy}/booking/${mm}
booking/${mm}${yyyy}/00
My problem is how to extract what is inside ${ }, read it, and then replace it with the month or year depending on the format.
So the output should be "booking/10/2013" when it is saved to the database.
I'm using Grails. I hope this can be solved using Java or Groovy.
I just solved the problem:
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
    public static void main(String[] args) {
        String bookingNo1 = "booking/${mm}/${yyyy}";
        String bookingNo2 = "${yyyy}/booking/${mm}";
        String bookingNo3 = "booking/${mm}${yyyy}/00";
        String newDate = null;
        newDate = getDataString(bookingNo1);
        System.out.println(newDate);
        newDate = getDataString(bookingNo2);
        System.out.println(newDate);
        newDate = getDataString(bookingNo3);
        System.out.println(newDate);
    }
    public static String getTimeString(String pattern) {
        SimpleDateFormat format = new SimpleDateFormat();
        format.applyPattern(pattern);
        return format.format(new Date());
    }
    public static String getDataString(String dateInput) {
        String dateString = dateInput;
        String regex = "\\$\\{(mm|yyyy|DD|MM)\\}";
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(dateInput);
        while (matcher.find()) {
            // group(1) is the token between ${ and }
            String token = matcher.group(1);
            // In SimpleDateFormat "mm" means minutes, so map the user's "mm" to "MM" (month)
            String datePattern = token.equals("mm") ? "MM" : token;
            // Replace the whole ${token} placeholder literally
            dateString = dateString.replace(matcher.group(), getTimeString(datePattern));
        }
        return dateString;
    }
}
You can do this with Groovy:
// Given these inputs
def inputs = [ 'booking/${mm}/${yyyy}',
               '${yyyy}/booking/${mm}',
               'booking/${mm}${yyyy}/00' ]
// Make a binding for 'mm' and 'yyyy'
def date = new Date()
def binding = [ mm   : date.format( 'MM' ),
                yyyy : date.format( 'yyyy' ) ]
// Then process each input with STE and print it out
inputs.each { input ->
    println new groovy.text.SimpleTemplateEngine()
        .createTemplate( input )
        .make( binding )
}
When run in October 2013, that prints:
booking/10/2013
2013/booking/10
booking/102013/00
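If you would rather stay in plain Java and keep the values in a map (similar to the Groovy binding), a rough sketch using Matcher.appendReplacement could look like this. The class name PlaceholderTemplate and the choice to leave unknown placeholders untouched are my own assumptions.

```java
import java.time.LocalDate;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderTemplate {
    // Matches ${name} and captures the name
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{(\\w+)\\}");

    /** Replaces each ${key} with values.get(key); unknown keys are left as they are. */
    static String render(String template, Map<String, String> values) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = values.get(m.group(1));
            String replacement = value != null ? value : m.group();
            // quoteReplacement keeps '$' and '\' in the replacement literal
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.now();
        Map<String, String> binding = new HashMap<>();
        binding.put("mm", String.format("%02d", today.getMonthValue()));
        binding.put("yyyy", String.valueOf(today.getYear()));
        System.out.println(render("booking/${mm}/${yyyy}", binding));
    }
}
```

The template is scanned once, so a replacement value that itself contains ${...} is not expanded again.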

Resources