Hive UDF - How to access column name - apache-spark

Would someone please let me know how to access the column name in a simple Hive UDF?
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
// Utils and AES are the asker's own helper classes (not shown in the question)
@Description(name = "Encrypt", value = "Encrypt the Given Column", extended = "SELECT Encrypt('Hello World!');")
public class Encrypt extends UDF {
private Text result = new Text();
public Text evaluate(Text str) {
if (str == null) {
return null;
}
//Access Column Name and pass to the function to get encryption key
String secretKey = Utils.getSecretKey(columnName); // columnName is the value the question asks how to obtain
String encryptedText = AES.encrypt(str.toString(), secretKey);
result.set(encryptedText);
return result;
}
}

Related

how to sort string characters alphabetically in pig latin

If I have a table with a single column, name: chararray,
for example:
acab
bca
das
desac
How can I sort the characters of each value in the column 'name' alphabetically in Pig?
Like this:
aabc
abc
ads
acdes
Write your own UDF. Pass the name to the UDF. In the UDF, convert the string to a char array and sort it, then return the sorted string.
PIG
REGISTER ORDER_UDF.jar;
A = LOAD 'data.txt' USING PigStorage(',') AS (name: chararray);
B = FOREACH A GENERATE ORDER_UDF.ORDER(name);
DUMP B;
UDF
import java.io.IOException;
import java.util.Arrays;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
public class ORDER extends EvalFunc<String>
{
public String exec(Tuple input) throws IOException {
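if (input == null || input.size() == 0 || input.get(0) == null)
return null;
try{
// The name is the first field of the tuple; the Tuple itself cannot be cast to String
char tempArray[] = ((String)input.get(0)).toCharArray();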
Arrays.sort(tempArray);
return new String(tempArray);
}catch(Exception e){
throw new IOException("Caught exception processing input row ", e);
}
}
}
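The sorting step inside the UDF can be tried without Pig on the classpath. The following is a minimal sketch in plain Java; the class name SortChars and the method sortChars are made up for illustration and are not part of the answer above.

```java
import java.util.Arrays;

class SortChars {
    // Returns the characters of s in ascending order, or null for null input.
    static String sortChars(String s) {
        if (s == null) {
            return null;
        }
        char[] chars = s.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    public static void main(String[] args) {
        // Sample values taken from the question
        for (String name : new String[] {"acab", "bca", "das", "desac"}) {
            System.out.println(sortChars(name));
        }
    }
}
```

For the sample values in the question this prints aabc, abc, ads and acdes.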

How to get the list of named ranges, sheet names and reference formulas using XSSF and SAX (Event API) for a large Excel file

I'm trying to read a large Excel file (about 10 MB, .xlsx).
I'm using the code below:
Workbook xmlworkbook =WorkbookFactory.create(OPCPackage.openOrCreate(root_path_name_file));
But it fails with a heap memory error.
I have seen other solutions on StackOverflow, some of which suggest increasing the JVM heap size, but I don't want to do that.
Issue 1) We can't use SXSSF (Streaming Usermodel API) because it is only for writing or creating a new workbook.
My sole objective is to get the number of named ranges, the total number of sheets and the sheet names for a large Excel file.
If the requirement is only to get the named ranges and sheet names, then only /xl/workbook.xml from the *.xlsx ZIP package must be parsed, since all of that information is stored there.
This is possible by getting the appropriate PackagePart and parsing its XML. For parsing XML, my favorite is StAX.
Example code which gets all sheet names and defined named ranges:
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackagePart;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.EndElement;
import javax.xml.stream.events.Characters;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.XMLEvent;
import javax.xml.namespace.QName;
import java.io.File;
import java.util.regex.Pattern;
import java.util.List;
import java.util.ArrayList;
import java.util.Map;
import java.util.HashMap;
class StaxReadOPCPackageParts {
public static void main(String[] args) {
try {
File file = new File("file.xlsx");
OPCPackage opcpackage = OPCPackage.open(file);
//get the workbook package part
PackagePart workbookpart = opcpackage.getPartsByName(Pattern.compile("/xl/workbook.xml")).get(0);
//create reader for package part
XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(workbookpart.getInputStream());
List<String> sheetNames = new ArrayList<>();
Map<String, String> definedNames = new HashMap<>();
boolean isInDefinedName = false;
String sheetName = "";
String definedNameName = "";
StringBuffer definedNameFormula = new StringBuffer();
while(reader.hasNext()){ //loop over all XML in workbook.xml
XMLEvent event = (XMLEvent)reader.next();
if(event.isStartElement()) {
StartElement startElement = (StartElement)event;
QName startElementName = startElement.getName();
if(startElementName.getLocalPart().equalsIgnoreCase("sheet")) { //start element of sheet definition
Attribute attribute = startElement.getAttributeByName(new QName("name"));
sheetName = attribute.getValue();
sheetNames.add(sheetName);
} else if (startElementName.getLocalPart().equalsIgnoreCase("definedName")) { //start element of definedName
Attribute attribute = startElement.getAttributeByName(new QName("name"));
definedNameName = attribute.getValue();
isInDefinedName = true;
}
} else if(event.isCharacters() && isInDefinedName) { //character content of definedName == the formula
definedNameFormula.append(((Characters)event).getData());
} else if(event.isEndElement()) {
EndElement endElement = (EndElement)event;
QName endElementName = endElement.getName();
if(endElementName.getLocalPart().equalsIgnoreCase("definedName")) { //end element of definedName
definedNames.put(definedNameName, definedNameFormula.toString());
definedNameFormula = new StringBuffer();
isInDefinedName = false;
}
}
}
opcpackage.close();
System.out.println("Sheet names:");
for (String shName : sheetNames) {
System.out.println("Sheet name: " + shName);
}
System.out.println("Named ranges:");
for (String defName : definedNames.keySet()) {
System.out.println("Name: " + defName + ", Formula: " + definedNames.get(defName));
}
} catch (Exception ex) {
ex.printStackTrace();
}
}
}
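The parsing step can also be tried on its own, without POI and without a real file. The following is a minimal sketch that uses the StAX cursor API (XMLStreamReader) from the JDK instead of the event API used above; the class name WorkbookXmlSketch and the SAMPLE document are invented for illustration, and a real workbook.xml contains more elements and attributes than this sample.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class WorkbookXmlSketch {
    final List<String> sheetNames = new ArrayList<>();
    final Map<String, String> definedNames = new LinkedHashMap<>();

    // Hypothetical, heavily reduced workbook.xml content
    static final String SAMPLE =
        "<workbook xmlns=\"http://schemas.openxmlformats.org/spreadsheetml/2006/main\">"
        + "<sheets><sheet name=\"Data\" sheetId=\"1\"/><sheet name=\"Summary\" sheetId=\"2\"/></sheets>"
        + "<definedNames><definedName name=\"Totals\">Data!$A$1:$B$10</definedName></definedNames>"
        + "</workbook>";

    // Collects sheet names and defined names from the given workbook.xml text.
    static WorkbookXmlSketch parse(String workbookXml) {
        WorkbookXmlSketch result = new WorkbookXmlSketch();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(workbookXml));
            while (reader.hasNext()) {
                if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                    continue;
                }
                String element = reader.getLocalName();
                if (element.equals("sheet")) {
                    result.sheetNames.add(reader.getAttributeValue(null, "name"));
                } else if (element.equals("definedName")) {
                    String name = reader.getAttributeValue(null, "name");
                    // getElementText() returns the text content up to the matching end tag
                    result.definedNames.put(name, reader.getElementText());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse workbook XML", e);
        }
        return result;
    }

    public static void main(String[] args) {
        WorkbookXmlSketch parsed = parse(SAMPLE);
        System.out.println("Sheet names: " + parsed.sheetNames);
        System.out.println("Named ranges: " + parsed.definedNames);
    }
}
```

With a real file, the reader would be created from workbookpart.getInputStream() as in the answer above, rather than from a string.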

How to find, remove and read text in this symbol ${}

Sample data key in by user.
booking/${mm}/${yyyy}
${yyyy}/booking/${mm}
booking/${mm}${yyyy}/00
My problem is how to find each ${ }, read what it contains, and then replace it with the month or year depending on the format.
So the output should be "booking/10/2013" after saving into the database.
I'm using Grails. I hope this can be solved using Java or Groovy.
I just solved the problem:
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
String bookingNo1 = "booking/${mm}/${yyyy}";
String bookingNo2 = "${yyyy}/booking/${mm}";
String bookingNo3 = "booking/${mm}${yyyy}/00";
String newDate = null;
newDate = getDataString(bookingNo1);
System.out.println(newDate);
newDate = getDataString(bookingNo2);
System.out.println(newDate);
newDate = getDataString(bookingNo3);
System.out.println(newDate);
}
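public static String getTimeString(String pattern) {
// In SimpleDateFormat "mm" means minutes; the ${mm} placeholder is meant as the month, so map it to "MM"
String javaPattern = pattern.equals("mm") ? "MM" : pattern;
SimpleDateFormat format = new SimpleDateFormat();
format.applyPattern(javaPattern);
return format.format(new Date());
}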
public static String getDataString(String dateInput) {
String dateString = dateInput;
String regex = "\\$\\{(mm|yyyy|DD|MM)\\}";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(dateInput);
while (matcher.find()) {
String datePattern = matcher.group()
.replaceAll("(\\$|\\{|\\})", "");
dateString = dateString.replaceAll("\\$\\{" + datePattern + "\\}",
getTimeString(datePattern));
}
return dateString;
}
}
You can do this with Groovy:
// Given these inputs
def inputs = [ 'booking/${mm}/${yyyy}',
'${yyyy}/booking/${mm}',
'booking/${mm}${yyyy}/00' ]
// Make a binding for 'mm' and 'yyyy'
def date = new Date()
def binding = [ mm : date.format( 'MM' ),
yyyy : date.format( 'yyyy' ) ]
// Then process each input with STE and print it out
inputs.each { input ->
println new groovy.text.SimpleTemplateEngine()
.createTemplate( input )
.make( binding )
}
That prints:
booking/10/2013
2013/booking/10
booking/102013/00
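The binding idea from the Groovy answer can also be sketched in plain Java with a map and Matcher.appendReplacement. This is a minimal sketch, not the method of either answer: the class name PlaceholderSketch and the method fill are invented for illustration, and the date is passed in explicitly so that the result does not depend on the day the code runs.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderSketch {
    // Matches ${name} and captures the name
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{(\\w+)\\}");

    // Replaces ${mm} and ${yyyy} in the template with the month and year of the given date.
    static String fill(String template, LocalDate date) {
        Map<String, String> binding = new HashMap<>();
        binding.put("mm", date.format(DateTimeFormatter.ofPattern("MM")));
        binding.put("yyyy", date.format(DateTimeFormatter.ofPattern("yyyy")));

        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            // Placeholders without a binding are left untouched
            String value = binding.getOrDefault(matcher.group(1), matcher.group());
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2013, 10, 15);
        System.out.println(fill("booking/${mm}/${yyyy}", date));
        System.out.println(fill("${yyyy}/booking/${mm}", date));
        System.out.println(fill("booking/${mm}${yyyy}/00", date));
    }
}
```

Keeping the placeholder names in a map means a new placeholder only needs one more binding entry; the regular expression stays the same.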

How do I create an OrderBy statement using a reflected value?

I would like to create a method that orders an IEnumerable list by a given property, where the property name is passed into the method as a string. The first code example below does not work; the second does, and it is what I am trying to emulate dynamically.
string sortName = "SerialNumber";
IEnumerable<PartSummary> partList = FunctionToCreateList();
partOrderedList = partList.OrderBy(what do I stick in here);
that would be equivalent to
IEnumerable<PartSummary> partList = FunctionToCreateList();
partOrderedList = partList.OrderBy(p => p.SerialNumber);
How can I accomplish this?
Are you saying you want to pass the order-by clause in to your method? If so, you can use this:
Func<PartSummary, object> orderByClause
Then you can do this:
partOrderedList = partList.OrderBy(orderByClause);
Then you can handle your order by in your business layer or wherever you wish.
Update: if you want to pass in the column name as a string, you can do something like the following.
Create a static class for an extension method (reference: http://social.msdn.microsoft.com/Forums/en-US/linqprojectgeneral/thread/39028ad2-452e-409f-bc9e-d1b263e921f6/):
static class LinqExtensions
{
public static IQueryable<T> OrderBy<T>(this IQueryable<T> source, string sortingColumn, bool isAscending)
{
if (String.IsNullOrEmpty(sortingColumn))
{
return source;
}
ParameterExpression parameter = Expression.Parameter(source.ElementType, String.Empty);
MemberExpression property = Expression.Property(parameter, sortingColumn);
LambdaExpression lambda = Expression.Lambda(property, parameter);
string methodName = isAscending ? "OrderBy" : "OrderByDescending";
Expression methodCallExpression = Expression.Call(typeof(Queryable), methodName,
new Type[] { source.ElementType, property.Type },
source.Expression, Expression.Quote(lambda));
return source.Provider.CreateQuery<T>(methodCallExpression);
}
}
Then you can create your method:
static IQueryable<PartSummary> FunctionToCreateList()
{
IList<PartSummary> list = new List<PartSummary>();
list.Add(new PartSummary
{
Id = 1,
SerialNumber = "A",
});
list.Add(new PartSummary
{
Id = 2,
SerialNumber = "B",
});
return list.AsQueryable();
}
And then call your method:
static void Main(string[] args)
{
IQueryable<PartSummary> partOrderedList = FunctionToCreateList();
PartSummary partSummary = new PartSummary();
string sortBy = "Id";
partOrderedList = partOrderedList.OrderBy(sortBy, false);
foreach (PartSummary summary in partOrderedList)
{
Console.WriteLine(summary.Id + ", " + summary.SerialNumber);
}
Console.ReadLine();
}
Now you can pass in the column name as a string and sort.
Hope this helps!
You can also avoid extending and just use a compiled expression tree to accomplish this:
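public Func<T, object> ResolveToProperty<T>(String propertyName)
{
Type t = typeof(T);
var paramExpression = Expression.Parameter(t, "element");
var propertyExpression = Expression.Property(paramExpression, propertyName);
// Box value-type properties (e.g. an int Id) so the lambda body matches the object return type
var boxedExpression = Expression.Convert(propertyExpression, typeof(object));
return Expression.Lambda<Func<T, object>>(boxedExpression, paramExpression).Compile();
}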
string sortName = "SerialNumber";
IEnumerable<PartSummary> partList = FunctionToCreateList();
var partOrderedList = partList.OrderBy(ResolveToProperty<PartSummary>(sortName));

How to read Nutch-generated content data in the segment folder using Java

I am trying to read the content data inside the segment folder. I think the content data file is written in a custom format.
I experimented with Nutch's Content class, but it does not recognize the format.
import java.io.IOException;
import org.apache.commons.cli.Options;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;
public class ContentReader {
public static void main(String[] args) throws IOException {
// Setup the parser
Configuration conf = NutchConfiguration.create();
Options opts = new Options();
GenericOptionsParser parser = new GenericOptionsParser(conf, opts, args);
String[] remainingArgs = parser.getRemainingArgs();
FileSystem fs = FileSystem.get(conf);
String segment = remainingArgs[0];
Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data");
SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
Text key = new Text();
Content content = new Content();
// Loop through sequence files
while (reader.next(key, content)) {
try {
System.out.write(content.getContent(), 0,
content.getContent().length);
} catch (Exception e) {
}
}
}
}
org.apache.nutch.segment.SegmentReader has a MapReduce implementation that reads content data in the segment directory.
Below is Spark/Scala code that reads data from the segment's content folder, as I do in my project.
I created a case class Page which holds the data read from the content folder:
case class Page(var url: String, var title: String = null,
var contentType: String = null, var rawHtml: String = null, var language: String = null,
var metadata: Map[String, String] = Map.empty)
Code to read from content folder
import org.apache.commons.lang3.StringUtils
import org.apache.hadoop.io.{Text, Writable}
import org.apache.nutch.crawl.{CrawlDatum, Inlinks}
import org.apache.nutch.parse.ParseText
import org.apache.nutch.protocol.Content
import scala.util.Try
// spark, path, LOG and charsetPattern are defined elsewhere in the project (not shown)
val contentDF = spark.sparkContext.sequenceFile(path.contentLocation, classOf[Text], classOf[Writable])
.map { case (x, y) => (x.toString, extract(y.asInstanceOf[Content])) }
/** converts Content object to Page **/
def extract(content: Content): Page = {
try {
val parsed = Page(content.getUrl)
var charset: String = getCharsetFromContentType(content.getContentType)
if (StringUtils.isBlank(charset)) {
charset = "UTF-8"
}
parsed.rawHtml = Try(new String(content.getContent, charset)).getOrElse(new String(content.getContent, "UTF-8"))
parsed.contentType = Try(content.getMetadata.get("Content-Type")).getOrElse("text/html")
// parsed.isHomePage = Boolean.valueOf(content.getMetadata.get("isHomePage"))
parsed.metadata = content.getMetadata.names().map(name => (name,content.getMetadata.get(name))).toMap
Try {
if (StringUtils.isNotBlank(content.getMetadata.get("Content-Language")))
parsed.language = content.getMetadata.get("Content-Language")
else if (StringUtils.isNotBlank(content.getMetadata.get("language")))
parsed.language = content.getMetadata.get("language")
else parsed.language = content.getMetadata.get("lang")
}
parsed
} catch {
case e: Exception =>
LOG.error("ERROR while extracting data from Content ", e)
null
}
}
/** Gets the charset from the Content-Type value, defaulting to UTF-8 **/
def getCharsetFromContentType(contentType: String): String = {
var result: String = "UTF-8"
Try {
if (StringUtils.isNotBlank(contentType)) {
val m = charsetPattern.matcher(contentType)
result = if (m.find) m.group(1).trim.toUpperCase else "UTF-8"
}
}
result
}
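The charsetPattern used above is not shown in the answer. The following is a minimal sketch of the same charset lookup in plain Java, assuming a pattern that picks the value after "charset=" in a Content-Type string; the pattern, the class name CharsetSketch and the method charsetOf are assumptions for illustration, not the answer's actual definitions.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharsetSketch {
    // Assumed pattern: captures the token after "charset=", with optional quotes
    private static final Pattern CHARSET_PATTERN =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the upper-cased charset named in the Content-Type value, or UTF-8 if none is found.
    static String charsetOf(String contentType) {
        if (contentType == null || contentType.trim().isEmpty()) {
            return "UTF-8";
        }
        Matcher m = CHARSET_PATTERN.matcher(contentType);
        return m.find() ? m.group(1).trim().toUpperCase(Locale.ROOT) : "UTF-8";
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=iso-8859-1"));
        System.out.println(charsetOf("text/html"));
    }
}
```

As in the Scala version, a missing or unparseable charset falls back to UTF-8.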
