Text, ByteWritable, VIntWritable of Hadoop equivalent in Spark?

Dear Members,
As we know, Hadoop's Text data type uses UTF-8, a variable-length encoding: a character that fits in one byte is stored in one byte, and any other character takes two to four bytes. In the same spirit, as a performance boost for the word count program in Hadoop's map phase, using ByteWritable as the mapper's value type reduces the amount of mapper output data, because IntWritable requires 4 bytes whereas ByteWritable needs 1 byte. If I use VIntWritable instead of IntWritable, an integer that fits in one byte is stored in one byte, and larger values take between two and five bytes, which reduces the memory footprint for small counts.
How can the following Java program be modified so that the mapper's key type is Text and its value type is ByteWritable, and the reducer's key type is Text and its value type is VIntWritable?
// Now we have non-empty lines, let's split them into words
JavaRDD<String> words = nonEmptyLines.flatMap(new FlatMapFunction<String, String>() {
@Override
public Iterable<String> call(String s) throws Exception {
return Arrays.asList(s.split(" "));
}
});
// Convert words to Pairs, remember the TextPair class in Hadoop world
JavaPairRDD<String, Integer> wordPairs = words.mapToPair(new PairFunction<String, String, Integer>() {
public Tuple2<String, Integer> call(String s) {
return new Tuple2<String, Integer>(s, 1);
}
});
JavaPairRDD<String, Integer> wordCount = wordPairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
@Override
public Integer call(Integer integer, Integer integer2) throws Exception {
return integer + integer2;
}
});
// Just for debugging, NOT FOR PRODUCTION
wordCount.foreach(new VoidFunction<Tuple2<String, Integer>>() {
@Override
public void call(Tuple2<String, Integer> stringIntegerTuple2) throws Exception {
System.out.println(String.format("%s - %d", stringIntegerTuple2._1(), stringIntegerTuple2._2()));
}
});
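To make the size claims in the question concrete, here is a minimal sketch that uses only the JDK. The class EncodedSizeSketch and its method names are hypothetical, and vIntSize reproduces the sizing rule of Hadoop's variable-length integer encoding as I understand it (one byte for values from -112 to 127, otherwise one length byte plus the minimal number of data bytes). It says nothing about how Spark serializes the pairs.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: EncodedSizeSketch is a hypothetical helper, not part of Hadoop or Spark.
final class EncodedSizeSketch {

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Assumed sizing rule of Hadoop's variable-length integer encoding:
    // values in [-112, 127] take one byte, anything else takes one
    // length byte followed by the minimal number of data bytes.
    static int vIntSize(long value) {
        if (value >= -112 && value <= 127) {
            return 1;
        }
        if (value < 0) {
            value = ~value; // one's complement, so only magnitude bits are counted
        }
        int dataBits = Long.SIZE - Long.numberOfLeadingZeros(value);
        return (dataBits + 7) / 8 + 1;
    }

    public static void main(String[] args) {
        // Unicode escapes: e-acute, euro sign, and an emoji outside the BMP.
        String[] samples = {"a", "\u00e9", "\u20ac", "\uD83D\uDE00"};
        for (String s : samples) {
            System.out.println("UTF-8 bytes: " + utf8Length(s));
        }
        long[] counts = {1, 127, 128, 65536, Integer.MAX_VALUE};
        for (long c : counts) {
            System.out.println(c + " -> " + vIntSize(c) + " byte(s)");
        }
    }
}
```

For a word count, the per-record value 1 always fits in a single byte, while the summed counts on the reduce side only grow beyond one byte once they exceed 127.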

Related

Cannot convert byte array back to Protobuf String

@Test
public void test() throws InvalidProtocolBufferException {
byte[] testString = StringValue.newBuilder()
.setValue("testString")
.build()
.getValueBytes()
.toByteArray();
StringValue stringValue = StringValue.parseFrom(testString);
System.out.println(stringValue);
}
I created a byte array from the Protobuf StringValue, but when I converted it back, I got:
com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag.
Any idea why this would happen?
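A likely explanation, offered as a sketch rather than a confirmed diagnosis: getValueBytes() returns the UTF-8 bytes of the string field, not the serialized message, so parseFrom interprets the first character as a field tag. The snippet below uses only the JDK (WireTagSketch and its methods are hypothetical names, and no protobuf classes are involved) to decode that first byte the way the wire format splits a tag: the low three bits are the wire type and the remaining bits are the field number. The letter 't' is 0x74, which decodes to field 14 with wire type 4 (end group), consistent with the reported message.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: WireTagSketch is a hypothetical helper and does not use the protobuf library.
final class WireTagSketch {

    // A single-byte tag is (fieldNumber << 3) | wireType.
    static int fieldNumber(int tagByte) {
        return (tagByte & 0xFF) >>> 3;
    }

    static int wireType(int tagByte) {
        return tagByte & 0x07;
    }

    // Hand-built wire bytes for a message whose field 1 is a string:
    // tag 0x0A (field 1, wire type 2), a one-byte length, then the payload.
    // This sketch only handles payloads shorter than 128 bytes.
    static byte[] encodeStringField1(String s) {
        byte[] payload = s.getBytes(StandardCharsets.UTF_8);
        if (payload.length > 127) {
            throw new IllegalArgumentException("sketch handles only single-byte lengths");
        }
        byte[] out = new byte[payload.length + 2];
        out[0] = 0x0A;
        out[1] = (byte) payload.length;
        System.arraycopy(payload, 0, out, 2, payload.length);
        return out;
    }

    public static void main(String[] args) {
        byte[] raw = "testString".getBytes(StandardCharsets.UTF_8);
        System.out.printf("raw first byte 0x%02X: field %d, wire type %d%n",
                raw[0] & 0xFF, fieldNumber(raw[0]), wireType(raw[0]));

        byte[] wire = encodeStringField1("testString");
        System.out.printf("wire first byte 0x%02X: field %d, wire type %d%n",
                wire[0] & 0xFF, fieldNumber(wire[0]), wireType(wire[0]));
    }
}
```

If this is indeed the cause, serializing the whole message with build().toByteArray() instead of build().getValueBytes().toByteArray() should produce bytes that parseFrom can read back.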

Saving POJO to Cassandra using Flink

I am new to Flink, and I want to store Kafka streaming data in Cassandra. I have converted each String into a POJO. My POJO is as follows:
@Table(keyspace = "sample", name = "contact")
public class Person implements Serializable {
private static final long serialVersionUID = 1L;
@Column(name = "name")
private String name;
@Column(name = "timeStamp")
private LocalDateTime timeStamp;
and my conversion takes place as follows:
stream.flatMap(new FlatMapFunction<String, Person>() {
public void flatMap(String value, Collector<Person> out) {
try {
out.collect(objectMapper.readValue(value, Person.class));
} catch (JsonProcessingException e) {
e.printStackTrace();
}
}
}).print(); // I need to use proper method to convert to Datastream.
env.execute();
I read the documentation at the link below for reference:
https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/cassandra.html
The Cassandra sink accepts a DataStream instance. I need to turn the result of my conversion into a DataStream and store it in Cassandra.
The question "Cant create Cassandra Pojo Sink" also gave me some ideas.
There is a method .forward() which returns DataStream<Reading> forward, and when I pass that instance to
CassandraSink.addSink(forward)
.setHost("localhost")
.build();
I get the error: cannot access org.apache.flink.streaming.api.scala.DataStream
How can I convert my POJO so that it can be stored in Cassandra?

Problem using FormatStringConverter in ComboBox.setConverter in javafx

I have to use a combobox to hold a list of (key, value) pairs. The key is the value to be stored in the database; the value is the description to be displayed in the combobox.
For example:
Key / value
C1 Code one
C2 Choice two
C3 Choice three
...
When retrieving the selected value, for example 'Choice two', I would like to receive the code C2.
For storing the elements in items I defined the ComboVal class.
Defining my combobox, I am stuck on the definition of the setConverter function.
The compiler gives me the following error:
Error: (1093, 49) java: constructor FormatStringConverter in class javafx.util.converter.FormatStringConverter cannot be applied to given types; required: java.text.Format; found: no arguments
reason: actual and formal argument lists differ in length
Code:
class ComboVal {
String[] dato = {null, null};
ComboVal (String Key, String Value)
{
setValue(Key, Value);
}
ComboVal ()
{
setValue(null, null);
}
String getValue ()
{
return dato[1];
}
String getKey ()
{
return dato[0];
}
void setValue (String Key, String Value)
{
dato[0] = Key;
dato[1] = Value;
}
}
class myclass {
....
/*
Parameter ctrl is a List containing information for dynamic creation of combobox
*/
void method (List ctrl)
{
VBox box = new VBox();
box.getChildren().add(new Label(ctrl.label));
ObservableList items = FXCollections.observableArrayList();
ComboBox<ComboVal> cb = new ComboBox<>();
cb.setId(ctrl.name);
cb.setItems(items);
//----->>> compiler report error in the next row <<<<<<<<
cb.setConverter(new FormatStringConverter<ComboVal>() {
@Override
public String toString (ComboVal object)
{
return (object.getValue());
}
@Override
public ComboVal fromString (String string)
{
return null;
}
});
ctrl.options.forEach((k, v) -> {items.add(new ComboVal(k, v));});
cb.setCellFactory(new Callback<ListView<ComboVal>, ListCell<ComboVal>>() {
@Override
public ListCell<ComboVal> call (ListView<ComboVal> p)
{
return new ListCell<ComboVal>() {
@Override
protected void updateItem (ComboVal item, boolean empty)
{
super.updateItem(item, empty);
if (item == null || empty)
{
setGraphic(null);
}
else
{
setGraphic(new Text(item.getValue()));
}
}
};
}});
box.getChildren().add(cb);
}

The FormatStringConverter class needs to be constructed with a Format argument. You, however, constructed it with no arguments.
Supply a Format, for example:
Format format = new MessageFormat("Bla bla");
cb.setConverter(new FormatStringConverter<ComboVal>(format));
FormatStringConverter already defines its own toString and fromString methods and uses the given Format to display and parse the values. I doubt this is what you want, as it is pretty limited.
So I think you'd be better off using a regular StringConverter and providing custom implementations of toString and fromString.

Mapping a structure inside a union in JNA

I am attempting to map the kstat library in Solaris 11.3 to Java using JNA. While I've managed to get most of the structures working, I've spent the last 24 hours fighting with a particularly difficult union-within-a-structure-within-a-union.
I am successfully retrieving a pointer to a kstat_named structure I need using kstat_data_lookup(). My code properly retrieves most of the data (name, data_type, and non-struct members of the union) in this C structure:
typedef struct kstat_named {
char name[KSTAT_STRLEN]; /* name of counter */
uchar_t data_type; /* data type */
union {
char c[16]; /* enough for 128-bit ints */
struct {
union {
char *ptr; /* NULL-terminated string */
} addr;
uint32_t len; /* length of string */
} str;
int32_t i32;
uint32_t ui32;
int64_t i64;
uint64_t ui64;
/* These structure members are obsolete */
int32_t l;
uint32_t ul;
int64_t ll;
uint64_t ull;
} value; /* value of counter */
} kstat_named_t;
I have mapped this in JNA as follows:
class KstatNamed extends Structure {
public static class UNION extends Union {
public byte[] charc = new byte[16]; // enough for 128-bit ints
public Pointer str; // KstatNamedString
public int i32;
public int ui32;
public long i64;
public long ui64;
}
public byte[] name = new byte[KSTAT_STRLEN]; // name of counter
public byte data_type; // data type
public UNION value; // value of counter
public KstatNamed() {
super();
}
public KstatNamed(Pointer p) {
super();
this.useMemory(p);
this.read();
}
@Override
public void read() {
super.read();
switch (data_type) {
case KSTAT_DATA_CHAR:
value.setType(byte[].class);
break;
case KSTAT_DATA_STRING:
value.setType(Pointer.class);
break;
case KSTAT_DATA_INT32:
case KSTAT_DATA_UINT32:
value.setType(int.class);
break;
case KSTAT_DATA_INT64:
case KSTAT_DATA_UINT64:
value.setType(long.class);
break;
default:
break;
}
value.read();
}
@Override
protected List<String> getFieldOrder() {
return Arrays.asList(new String[] { "name", "data_type", "value" });
}
}
This code works correctly for int32 types (KSTAT_DATA_INT32). However, when the data type is KSTAT_DATA_STRING, which corresponds to the str structure inside the union, I am not having any success in properly retrieving the data.
I have mapped the nested structure like this:
class KstatNamedString extends Structure {
public static class UNION extends Union {
public Pointer ptr; // NULL-terminated string
}
public UNION addr;
public int len; // length of string
public KstatNamedString() {
super();
}
public KstatNamedString(Pointer p) {
super();
this.useMemory(p);
this.read();
}
@Override
public void read() {
super.read();
addr.setType(Pointer.class);
addr.read();
}
@Override
protected List<String> getFieldOrder() {
return Arrays.asList(new String[] { "addr", "len" });
}
}
Ultimately I'm trying to replicate the behavior of this C macro:
#define KSTAT_NAMED_STR_PTR(knptr) ((knptr)->value.str.addr.ptr)
I've tried several ways of accessing the above structure, but it never seems to read the correct data (the len value is in the millions, and attempting to read the string ptr causes a segfault). I've tried:
Pointer p = LibKstat.INSTANCE.kstat_data_lookup(ksp, name);
KstatNamed data = new KstatNamed(p);
KstatNamedString str = new KstatNamedString(data.value.str);
return str.addr.ptr.getString(0); // <--- Segfault on C side
I've also tried:
Specifying KstatNamedString as the type instead of the Pointer type
Using various combinations of ByReference in both the structures and the unions
I've googled everywhere, including trying what I thought was a promising result here, but nothing seems to work.
I'm sure I'm missing something simple.

Use KstatNamedString instead of Pointer type.
Change your pointer-based constructors like this:
public KstatNamed(Pointer p) {
super(p);
this.read();
}
public KstatNamedString(Pointer p) {
super(p);
this.read();
}
Also change the addr field of the str struct to be a simple Pointer (there is no need for the union around it):
public Pointer /*UNION*/ addr;
Run your JVM with -Djna.dump_memory=true and print your newly-initialized Structure as a string. That will show you how JNA interprets the memory layout of the struct, and how the native memory is initialized. That should help you determine how to extract the string you're looking for (assuming it's there).
You can also tune your union read() method to initially read only the type field (using Structure.readField("data_type")) before setting the union type.

How to remove null data from JavaPairRDD

I am reading data from HBase and trying to run a Spark job on it. My table has around 70k rows, and each row has a column 'type', which can have the values post, comment, or reply. Based on the type, I want to extract different pair RDDs, as shown below (for post).
JavaPairRDD<ImmutableBytesWritable, FlumePost> postPairRDD = hBaseRDD.mapToPair(
new PairFunction<Tuple2<ImmutableBytesWritable, Result>, ImmutableBytesWritable, FlumePost>() {
private static final long serialVersionUID = 1L;
public Tuple2<ImmutableBytesWritable, FlumePost> call(Tuple2<ImmutableBytesWritable, Result> arg0)
throws Exception {
FlumePost flumePost = new FlumePost();
ImmutableBytesWritable key = arg0._1;
Result result = arg0._2;
String type = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("t")));
if (type.equals("post")) {
return new Tuple2<ImmutableBytesWritable, FlumePost>(key, flumePost);
} else {
return null;
}
}
}).distinct();
The problem here is that for every row with a type other than post I have to return null, which is undesirable. The iteration also runs over all 70k rows for each of the three types, wasting cycles. So my first question is:
1) What is an efficient way to do this?
After getting the 70k results, I call distinct() to collapse the duplicate null values, so I end up with one null entry. I expect 20327 results but I get 20328.
2) Is there a way to remove this null entry from the pair RDD?

You can use the filter operation on the RDD.
Simply call:
.filter(new Function<Tuple2<ImmutableBytesWritable, FlumePost>, Boolean>() {
@Override
public Boolean call(Tuple2<ImmutableBytesWritable, FlumePost> v1) throws Exception {
return v1 != null;
}
})
before calling distinct() to filter out the nulls.
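The difference between dropping nulls after the mapping and never producing them can be sketched without Spark. The example below uses java.util.stream as a stand-in for the RDD API so that it runs on a plain JDK; Row and both method names are hypothetical. The first method follows the answer above (map, drop nulls, then distinct), and the second filters on the type first, so the mapping step never has to return null. JavaPairRDD also offers filter, so the same ordering should be possible on hBaseRDD before mapToPair, though that is not tested here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

// Illustrative only: java.util.stream stands in for the RDD API, and Row is a hypothetical record.
final class FilterBeforeMapSketch {

    static final class Row {
        final String key;
        final String type;

        Row(String key, String type) {
            this.key = key;
            this.type = type;
        }
    }

    // Approach from the answer: the mapping may yield null, so nulls are dropped before distinct.
    static List<String> mapThenDropNulls(List<Row> rows, String wanted) {
        return rows.stream()
                .map(r -> wanted.equals(r.type) ? r.key : null)
                .filter(Objects::nonNull)
                .distinct()
                .collect(Collectors.toList());
    }

    // Alternative ordering: keep only rows of the wanted type, then map; no null is ever created.
    static List<String> filterThenMap(List<Row> rows, String wanted) {
        return rows.stream()
                .filter(r -> wanted.equals(r.type))
                .map(r -> r.key)
                .distinct()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row("r1", "post"),
                new Row("r2", "comment"),
                new Row("r3", "post"),
                new Row("r4", "reply"));
        System.out.println(mapThenDropNulls(rows, "post"));
        System.out.println(filterThenMap(rows, "post"));
    }
}
```

Both methods return the same keys; the second simply avoids creating placeholder nulls that have to be removed afterwards.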
