I am getting the HBase data and trying to run a Spark job on it. My table has around 70k rows, and each row has a column 'type', which can have the values post, comment, or reply. Based on the type, I want to extract different pair RDDs, as shown below (for post).
JavaPairRDD<ImmutableBytesWritable, FlumePost> postPairRDD = hBaseRDD.mapToPair(
        new PairFunction<Tuple2<ImmutableBytesWritable, Result>, ImmutableBytesWritable, FlumePost>() {
            private static final long serialVersionUID = 1L;

            public Tuple2<ImmutableBytesWritable, FlumePost> call(Tuple2<ImmutableBytesWritable, Result> arg0)
                    throws Exception {
                FlumePost flumePost = new FlumePost();
                ImmutableBytesWritable key = arg0._1;
                Result result = arg0._2;
                // read the 't' qualifier from column family 'cf'
                String type = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("t")));
                if (type.equals("post")) {
                    return new Tuple2<ImmutableBytesWritable, FlumePost>(key, flumePost);
                } else {
                    return null;
                }
            }
        }).distinct();
The problem here is that for all rows with a type other than post I have to return a null value, which is undesirable, and the iteration runs over all 70k rows for each of the three types, wasting cycles. So my first question is:
1) What is an efficient way to do this?
After getting the 70k results I call distinct() to deduplicate the null values, so I end up with a single null object in the RDD. I expect 20327 results but get 20328.
2) Is there a way to remove this null entry from the pair RDD?
You can use the filter operation on the RDD.
Simply call:
.filter(new Function<Tuple2<ImmutableBytesWritable, FlumePost>, Boolean>() {
    @Override
    public Boolean call(Tuple2<ImmutableBytesWritable, FlumePost> v1) throws Exception {
        return v1 != null;
    }
})
before calling distinct() to filter out the nulls.
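Put together with the question's code, the whole chain would look roughly like this (a sketch only, reusing the classes from the question):
JavaPairRDD<ImmutableBytesWritable, FlumePost> postPairRDD = hBaseRDD
        .mapToPair(new PairFunction<Tuple2<ImmutableBytesWritable, Result>, ImmutableBytesWritable, FlumePost>() {
            private static final long serialVersionUID = 1L;

            public Tuple2<ImmutableBytesWritable, FlumePost> call(Tuple2<ImmutableBytesWritable, Result> arg0) throws Exception {
                String type = Bytes.toString(arg0._2.getValue(Bytes.toBytes("cf"), Bytes.toBytes("t")));
                // null marks every row whose type is not "post"
                return "post".equals(type)
                        ? new Tuple2<ImmutableBytesWritable, FlumePost>(arg0._1, new FlumePost())
                        : null;
            }
        })
        .filter(new Function<Tuple2<ImmutableBytesWritable, FlumePost>, Boolean>() {
            @Override
            public Boolean call(Tuple2<ImmutableBytesWritable, FlumePost> v1) throws Exception {
                return v1 != null; // drop the null markers before distinct()
            }
        })
        .distinct();
Filtering this way removes the single null entry, so the count comes out at the expected 20327 instead of 20328.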
Related
I want to write a Dataset to a CSV file, but I don't want the columns to be ordered in ascending order (or any order for that matter).
For example, the model is: String id; String name; String age; +300 more fields.
The CSV produced has the schema: age, name, id, +300 more columns in alphabetical order,
but I want the CSV to have the same ordering as the model.
I could have used .select() or .selectExpr(), but there I would have to list the 300+ fields.
Is there any other easier way?
Currently using:
dataset.toDF().coalesce(1).selectExpr("templateId","batchId", +300 more fields ).write().format("com.databricks.spark.csv").option("nullValue","").mode(SaveMode.Overwrite).save(path);
A workaround I followed for the above question:
1. Added the fields to a properties file (column.properties) under a single key, comma-separated (see the example after this list).
2. Loaded that properties file into a broadcast map.
3. Used the broadcast map in the .selectExpr() method.
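For illustration, the properties file could look like this (the key and field names here are made up; the real file lists all 300+ fields in model order):
# column.properties (hypothetical example)
columns=templateId,batchId,id,name,age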
Code for loading the properties file into a broadcast map:
private static Properties prop = new Properties();
private static Map<String, String> colMap = new HashMap<String, String>();

public static Map<String, String> getColumnMap() {
    String propFileName = "column.properties";
    InputStream inputStream =
            ConfigurationLoader.class.getClassLoader().getResourceAsStream(propFileName);
    if (inputStream != null) {
        try {
            prop.load(inputStream);
            colMap = (Map) prop;
        } catch (IOException e) {
            // handle exception
        }
    }
    return colMap;
}
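The map returned above is presumably what gets broadcast below, e.g.:
Map<String, String> propertiesMap = ConfigurationLoader.getColumnMap();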
JavaSparkContext sc = JavaSparkContext.fromSparkContext(sparkSession.sparkContext());
Broadcast<Map<String, String>> broadcastProperty = sc.broadcast(propertiesMap);
Code for writing to CSV file:
dataset.toDF().coalesce(1)
        .selectExpr(broadcastColumn.getValue().get(TemplateConstants.COLUMN).split(","))
        .write()
        .format(ApplicationConstants.CSV_FORMAT)
        .option(ApplicationConstants.NULL_VALUE, "")
        .mode(SaveMode.Overwrite)
        .save(path);
Use Case:
We have an IMap with 70K entries. The majority of operations are GET calls for multiple keys: GET calls (90%) > PUT calls (10%). We use an EntryProcessor (EP) for the PUT calls.
Problem: What is the most efficient way to get the data for multiple keys that may be spread across multiple instances?
Possible solutions:
1. An EP implementing ReadOnly and Offloadable, invoked via the executeOnKeys method.
2. Execute map.get(key) in parallel for all keys.
Is there any other efficient way to get the data for multiple keys?
Predicate
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import com.hazelcast.query.Predicate;

import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExecuteOnSelectedKeyPredicateTest {

    public static void main(String[] args) {
        HazelcastInstance hazelcastInstance = Hazelcast.newHazelcastInstance();
        IMap<String, Integer> map = hazelcastInstance.getMap("testMap");
        map.put("testKey1", 1);
        map.put("test", 2);
        map.put("testKey3", 3);
        // Get entries whose key ends with a number
        Set<Map.Entry<String, Integer>> entries = map.entrySet(new ExecuteOnSelectedKeyPredicate("^.+?\\d$"));
        System.out.println(entries);
    }

    public static class ExecuteOnSelectedKeyPredicate implements Predicate<String, Integer> {

        private String keyRegex;
        private transient Pattern pattern;

        public ExecuteOnSelectedKeyPredicate(String keyRegex) {
            this.keyRegex = keyRegex;
        }

        @Override
        public boolean apply(Map.Entry<String, Integer> mapEntry) {
            // pattern is transient, so compile it lazily after deserialization on the member
            if (pattern == null) {
                pattern = Pattern.compile(keyRegex);
            }
            Matcher matcher = pattern.matcher(mapEntry.getKey());
            return matcher.matches();
        }
    }
}
When getting multiple keys, the request may have to go to multiple members. executeOnKeys is close to optimal here, since it calculates which partitions the operation needs to be executed on and sends the operation to those partitions only.
Offloadable makes sure you don't block the partition threads, and ReadOnly optimizes the processing further.
Multiple get operations generate more network traffic, since you issue one operation per key.
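A minimal sketch of the executeOnKeys approach, assuming Hazelcast 3.8+ (where ReadOnly and Offloadable are available); the class and map names are illustrative:
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import com.hazelcast.core.Offloadable;
import com.hazelcast.core.ReadOnly;
import com.hazelcast.map.AbstractEntryProcessor;

import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class MultiKeyReadExample {

    // Read-only entry processor that simply returns the value.
    // ReadOnly skips backup/store handling, Offloadable keeps the work off the partition threads.
    public static class ReadValueProcessor extends AbstractEntryProcessor<String, Integer>
            implements ReadOnly, Offloadable {

        public ReadValueProcessor() {
            super(false); // no backup processing for a pure read
        }

        @Override
        public Object process(Map.Entry<String, Integer> entry) {
            return entry.getValue();
        }

        @Override
        public String getExecutorName() {
            return Offloadable.OFFLOADABLE_EXECUTOR;
        }
    }

    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<String, Integer> map = hz.getMap("testMap");
        map.put("testKey1", 1);
        map.put("testKey3", 3);

        Set<String> keys = new HashSet<String>(Arrays.asList("testKey1", "testKey3"));
        // One operation per involved partition rather than one per key.
        Map<String, Object> values = map.executeOnKeys(keys, new ReadValueProcessor());
        System.out.println(values);

        hz.shutdown();
    }
}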
How to deal with the RDD structure: after calling the map function I get an RDD where MyObject is my own class wrapping a serialized XML document.
I want to merge every MyObject of the RDD into one.
I implemented it with the foreach function and called a specific merge function inside it, but the problem is that there are a lot of MyObject instances, so it takes a long time.
Edit: piece of code implementing 'reduce':
JavaRDD<MyObject> objects = input.map(new Function<String, MyObject>() {
    public MyObject call(String s) throws MalformedURLException, SAXException, ParserConfigurationException, IOException {
        machine.initializeReader(delimiter);
        return machine.Request(s);
    }
});

MyObject finalResult = objects.reduce(new Function2<MyObject, MyObject, MyObject>() {
    @Override
    public MyObject call(MyObject myObject, MyObject myObject2) {
        return myObject.merge(myObject2);
    }
});
The merge function loops through the elements of MyObject and merges the common ones: if I have a tag with a specific id and the same tag exists in the other MyObject, I create a result containing that tag and add the two together.
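For illustration only (this is not from the original post), a merge of that shape might look like the following, assuming MyObject keeps its parsed tags in a map keyed by tag id:
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

public class MyObject implements Serializable {

    // hypothetical representation: tag id -> count
    private final Map<String, Integer> tagCounts = new HashMap<String, Integer>();

    public void addTag(String id, int count) {
        Integer existing = tagCounts.get(id);
        tagCounts.put(id, existing == null ? count : existing + count);
    }

    // Combine two objects: tags with the same id are summed, the rest are copied over.
    public MyObject merge(MyObject other) {
        MyObject result = new MyObject();
        result.tagCounts.putAll(this.tagCounts);
        for (Map.Entry<String, Integer> e : other.tagCounts.entrySet()) {
            result.addTag(e.getKey(), e.getValue());
        }
        return result;
    }
}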
The problem with both 'reduce' and 'foreach' is the time they take.
Thanks
I am using Hazelcast 3.6.1 and implementing distinct aggregate functionality using a custom MapReduce job, to get Solr-facet-like results.
public class DistinctMapper implements Mapper<String, Employee, String, Long> {

    private transient SimpleEntry<String, Employee> entry = new SimpleEntry<String, Employee>();
    private static final Long ONE = Long.valueOf(1L);

    private Supplier<String, Employee, String> supplier;

    public DistinctMapper(Supplier<String, Employee, String> supplier) {
        this.supplier = supplier;
    }

    @Override
    public void map(String key, Employee value, Context<String, Long> context) {
        System.out.println("Object " + entry + " and key " + key);
        entry.setKey(key);
        entry.setValue(value);
        String fieldValue = (String) supplier.apply(entry);
        // getValue(value, fieldName);
        if (null != fieldValue) {
            context.emit(fieldValue, ONE);
        }
    }
}
The mapper is failing with a NullPointerException, and the sysout statement shows that the entry object is null.
SimpleEntry: https://github.com/hazelcast/hazelcast/blob/v3.7-EA/hazelcast/src/main/java/com/hazelcast/mapreduce/aggregation/impl/SimpleEntry.java
Can you point me to the issue in the above code? Thanks.
The entry field is transient. This means it is not serialized, so when the DistinctMapper object is deserialized on the Hazelcast node, its value is null.
Removing the transient modifier will solve the NullPointerException.
On a side note:
Why do you need this entry field? It doesn't seem to have any use.
Here is a link to my earlier question: http://stackoverflow.com/questions/32448987/how-to-retrieve-a-very-big-cassandra-table-and-delete-some-unuse-data-from-it#comment52844466_32464409
After I get the Cassandra data row by row in my program, I am confused by the conversion from a Cassandra row to a Java class. In Java the Cassandra table comes back as a ResultSet class; when I iterate over it and read the row data, it returns an NPE. In fact, I can see the object (the data) while debugging the program. Here is my iterator code:
ResultSet rs = CassandraTools.getInstance().execute(cql);
Iterator<Row> iterator = rs.iterator();
while (iterator.hasNext()) {
    Row row = iterator.next();
    row.getString(); // ----> returns NPE
}
The CassandraTools class is:
public class CassandraTools {

    private static CassandraTools instance;

    Cluster cluster;
    Session session;

    private CassandraTools() {
    }

    public static synchronized CassandraTools getInstance() {
        if (instance == null) {
            instance = new CassandraTools();
            instance.init();
        }
        return instance;
    }

    public void init() {
        if (cluster == null) {
            cluster = new Cluster.Builder().addContactPoint("10.16.34.96")
                    .build();
            if (session == null) {
                session = cluster.connect("uc_passport");
            }
        }
    }

    public ResultSet execute(String cql) {
        ResultSet rs = session.execute(cql);
        // rs.forEach(n -> {
        //     System.out.println(n);
        // });
        return rs;
    }
}
So how can I convert the data in the row to a Java class? I have read about the converter classes in the Spring Data Cassandra API, but they are complicated for me to use. Who can help?
IMHO, if you want to map the rows of Cassandra to a Java class, you should try to use an object-datastore mapper, which does these things for you.
If you try to do this by yourself, you need to handle the Java-Cassandra datatype mappings, validations, etc. all by yourself, which is a very hectic job.
There are a few open-source object-datastore mappers available (Kundera, Hibernate OGM, etc.). I suggest you try Kundera and check this for getting started with Cassandra.
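If you stay with the plain DataStax driver instead, each Row can be read by column name and copied into your own class; a minimal sketch (the target class, table columns, and names are made up for illustration):
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;

import java.util.ArrayList;
import java.util.List;

public class UserRowMapper {

    // Hypothetical target class for one row.
    public static class User {
        String id;
        String name;
        int age;
    }

    public static List<User> mapRows(ResultSet rs) {
        List<User> users = new ArrayList<User>();
        for (Row row : rs) {                        // ResultSet is Iterable<Row>
            User u = new User();
            u.id = row.getString("id");             // getString needs a column name or index
            u.name = row.getString("name");
            u.age = row.getInt("age");
            users.add(u);
        }
        return users;
    }
}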