Spark: merging RDD - apache-spark

How should I deal with the following RDD structure: after calling the "map" function I get an RDD<MyObject>, where MyObject is my own class wrapping a serialized XML document.
I want to merge all the MyObject elements of the RDD into a single one.
I have implemented this with the foreach function, calling a specific merge function inside it, but the problem is that there are many MyObject instances, so it takes a lot of time.
Edit: piece of code implementing 'reduce':
JavaRDD<MyObject> objects = input.map(new Function<String, MyObject>() {
    public MyObject call(String s) throws MalformedURLException, SAXException, ParserConfigurationException, IOException {
        machine.initializeReader(delimiter);
        return machine.Request(s);
    }
});
MyObject finalResult = objects.reduce(new Function2<MyObject, MyObject, MyObject>() {
    @Override
    public MyObject call(MyObject myObject, MyObject myObject2) {
        return myObject.merge(myObject2);
    }
});
The 'merge' function loops through the elements of both MyObject instances and merges the common ones: if one MyObject has a tag with a specific id and the other MyObject has a tag with the same id, the result contains that tag once, with the two counts added together.
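For illustration, a minimal sketch of what such a merge could look like, assuming MyObject keeps its tags in a Map from tag id to occurrence count (the 'tags' field and its type are hypothetical, not from the original code):

public MyObject merge(MyObject other) {
    MyObject result = new MyObject();
    // Start from this object's tags (hypothetical field mapping tag id -> count).
    result.tags.putAll(this.tags);
    // For ids present in both objects, add the two counts; otherwise keep the single count.
    for (Map.Entry<String, Integer> entry : other.tags.entrySet()) {
        result.tags.merge(entry.getKey(), entry.getValue(), Integer::sum);
    }
    return result;
}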
The problem with both 'reduce' and 'foreach' is the time they take.
Thanks

Related

Cucumber V5-V6 - passing complex object in feature file step

So I have recently migrated to v6, and I will try to simplify my question.
I have the following class:
@AllArgsConstructor
public class Songs {
    String title;
    List<String> genres;
}
In my scenario I want to have something like:
Then The results are as follows:
| title      | genre          |
| happy song | romance, happy |
And the implementation should be something like:
@Then("The results are as follows:")
public void theResultsAreAsFollows(Songs song) {
    // Some code here
}
I have the default transformer:
@DefaultParameterTransformer
@DefaultDataTableEntryTransformer(replaceWithEmptyString = "[blank]")
@DefaultDataTableCellTransformer
public Object transformer(Object fromValue, Type toValueType) {
    ObjectMapper objectMapper = new ObjectMapper();
    return objectMapper.convertValue(fromValue, objectMapper.constructType(toValueType));
}
My current issue is that I get the following error: Cannot construct instance of java.util.ArrayList (although at least one Creator exists)
How can I tell Cucumber to interpret specific cells as lists, while keeping everything in the same step rather than splitting it apart? Or, better, how can I send an object in a step containing different variable types such as List, HashSet, etc.?
If I replace the list with a String, everything works as expected.
@M.P.Korstanje thank you for your idea. If anyone is trying to find a solution for this, here is how I did it, following the suggestions received. I inspected the type fromValue has and updated the transform method into something like:
if (fromValue instanceof LinkedHashMap) {
    Map<String, Object> map = (LinkedHashMap<String, Object>) fromValue;
    Set<String> keys = map.keySet();
    for (String key : keys) {
        if (key.equals("genres")) {
            List<String> genres = Arrays.asList(map.get(key).toString().split(",", -1));
            map.put("genres", genres);
        }
    }
    return objectMapper.convertValue(map, objectMapper.constructType(toValueType));
}
It is somewhat specific, but I could not find a better solution :)
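Putting the pieces together, the whole transformer could then look like the sketch below. The trimming of whitespace after the split is an added assumption (a cell like "romance, happy" otherwise yields " happy" with a leading space):

@DefaultParameterTransformer
@DefaultDataTableEntryTransformer(replaceWithEmptyString = "[blank]")
@DefaultDataTableCellTransformer
public Object transformer(Object fromValue, Type toValueType) {
    ObjectMapper objectMapper = new ObjectMapper();
    if (fromValue instanceof LinkedHashMap) {
        Map<String, Object> map = (LinkedHashMap<String, Object>) fromValue;
        for (String key : map.keySet()) {
            if (key.equals("genres")) {
                // Split the comma-separated cell into a list so Jackson can build List<String>.
                // Trimming each entry is an assumption, not part of the original snippet.
                List<String> genres = Arrays.stream(map.get(key).toString().split(",", -1))
                        .map(String::trim)
                        .collect(Collectors.toList());
                map.put(key, genres);
            }
        }
        return objectMapper.convertValue(map, objectMapper.constructType(toValueType));
    }
    return objectMapper.convertValue(fromValue, objectMapper.constructType(toValueType));
}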

Empty set after collectAsList, even though it is not empty inside the transformation operator

I am trying to figure out if I can work with Kotlin and Spark,
and use the former's data classes instead of Scala's case classes.
I have the following data class:
data class Transaction(var context: String = "", var epoch: Long = -1L, var items: HashSet<String> = HashSet()) :
    Serializable {
    companion object {
        @JvmStatic
        private val serialVersionUID = 1L
    }
}
And the relevant part of the main routine looks like this:
val transactionEncoder = Encoders.bean(Transaction::class.java)
val transactions = inputDataset
    .groupByKey(KeyExtractor(), KeyExtractor.getKeyEncoder())
    .mapGroups(TransactionCreator(), transactionEncoder)
    .collectAsList()
transactions.forEach { println("collected Transaction=$it") }
With TransactionCreator defined as:
class TransactionCreator : MapGroupsFunction<Tuple2<String, Timestamp>, Row, Transaction> {
    companion object {
        @JvmStatic
        private val serialVersionUID = 1L
    }

    override fun call(key: Tuple2<String, Timestamp>, values: MutableIterator<Row>): Transaction {
        val seq = generateSequence { if (values.hasNext()) values.next().getString(2) else null }
        val items = seq.toCollection(HashSet())
        return Transaction(key._1, key._2.time, items).also { println("inside call Transaction=$it") }
    }
}
However, I think I'm running into some sort of serialization problem,
because the set ends up empty after collection.
I see the following output:
inside call Transaction=Transaction(context=context1, epoch=1000, items=[c])
inside call Transaction=Transaction(context=context1, epoch=0, items=[a, b])
collected Transaction=Transaction(context=context1, epoch=0, items=[])
collected Transaction=Transaction(context=context1, epoch=1000, items=[])
I've tried a custom KryoRegistrator to see if it was a problem with Kotlin's HashSet:
class MyRegistrator : KryoRegistrator {
    override fun registerClasses(kryo: Kryo) {
        kryo.register(HashSet::class.java, JavaSerializer()) // kotlin's HashSet
    }
}
But it doesn't seem to help.
Any other ideas?
Full code here.
It does seem to be a serialization issue.
The documentation of Encoders.bean states (Spark v2.4.0): "collection types: only array and java.util.List currently, map support is in progress".
Porting the Transaction data class to Java and changing items to a java.util.List seems to help.
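For reference, a sketch of the ported class, assuming a plain Java bean (Encoders.bean needs a public no-argument constructor and getter/setter pairs) with java.util.List in place of the HashSet:

import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

public class Transaction implements Serializable {
    private static final long serialVersionUID = 1L;

    private String context = "";
    private long epoch = -1L;
    // java.util.List instead of HashSet, per the Encoders.bean documentation quoted above.
    private List<String> items = new ArrayList<>();

    public Transaction() {}

    public Transaction(String context, long epoch, List<String> items) {
        this.context = context;
        this.epoch = epoch;
        this.items = items;
    }

    public String getContext() { return context; }
    public void setContext(String context) { this.context = context; }

    public long getEpoch() { return epoch; }
    public void setEpoch(long epoch) { this.epoch = epoch; }

    public List<String> getItems() { return items; }
    public void setItems(List<String> items) { this.items = items; }
}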

How to remove null data from JavaPairRDD

I am reading HBase data and trying to run a Spark job on it. My table has around 70k rows, and each row has a column 'type', which can have the values post, comment, or reply. Based on the type, I want to extract different pair RDDs, as shown below (for post).
JavaPairRDD<ImmutableBytesWritable, FlumePost> postPairRDD = hBaseRDD.mapToPair(
    new PairFunction<Tuple2<ImmutableBytesWritable, Result>, ImmutableBytesWritable, FlumePost>() {
        private static final long serialVersionUID = 1L;

        public Tuple2<ImmutableBytesWritable, FlumePost> call(Tuple2<ImmutableBytesWritable, Result> arg0)
                throws Exception {
            FlumePost flumePost = new FlumePost();
            ImmutableBytesWritable key = arg0._1;
            Result result = arg0._2;
            String type = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("t")));
            if (type.equals("post")) {
                return new Tuple2<ImmutableBytesWritable, FlumePost>(key, flumePost);
            } else {
                return null;
            }
        }
    }).distinct();
The problem here is that for all the rows with a type other than post I have to return a null value, which is undesirable. And the iteration runs over the 70k rows once for each of the three types, wasting cycles. So my first question is:
1) What is the efficient way to do this?
Now, after getting the 70k results, I apply the distinct() method to deduplicate the null values, so I end up with a single null object in the RDD. I expect 20327 results but I get 20328.
2) Is there a way to remove this null entry from the pair RDD?
You can use the filter operation on the RDD.
Simply call:
.filter(new Function<Tuple2<ImmutableBytesWritable, FlumePost>, Boolean>() {
    @Override
    public Boolean call(Tuple2<ImmutableBytesWritable, FlumePost> v1) throws Exception {
        return v1 != null;
    }
})
before calling distinct() to filter out the nulls.
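In context, the chain from the question would then become something like this (a sketch; the mapToPair call is unchanged from the question):

JavaPairRDD<ImmutableBytesWritable, FlumePost> postPairRDD = hBaseRDD
    .mapToPair(/* the PairFunction from the question, returning null for non-post rows */)
    .filter(new Function<Tuple2<ImmutableBytesWritable, FlumePost>, Boolean>() {
        @Override
        public Boolean call(Tuple2<ImmutableBytesWritable, FlumePost> v1) throws Exception {
            return v1 != null; // drop the null placeholders before deduplicating
        }
    })
    .distinct();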

Is a private method in a Spring service implementation class thread safe

I have a service in a project using the Spring framework.
public class MyServiceImpl implements IMyService {
    public MyObject foo(SomeObject obj) {
        MyObject myobj = this.mapToMyObject(obj);
        myobj.setLastUpdatedDate(new Date());
        return myobj;
    }

    private MyObject mapToMyObject(SomeObject obj) {
        MyObject myojb = new MyObject();
        ConvertUtils.register(new MyNullConvertor(), String.class);
        ConvertUtils.register(new StringConvertorForDateType(), Date.class);
        BeanUtils.copyProperties(myojb, obj);
        ConvertUtils.deregister(Date.class);
        return myojb;
    }
}
Then I have a class that calls foo() from multiple threads.
There goes the problem. In some of the threads, I get an error when calling
BeanUtils.copyProperties(myojb, obj);
saying: Cannot invoke com.my.MyObject.setStartDate - java.lang.ClassCastException#2da93171
Obviously, this is caused by ConvertUtils.deregister(Date.class), which is supposed to be called after BeanUtils.copyProperties(myojb, obj);.
It looks like one thread deregistered the Date converter while another thread was just about to call BeanUtils.copyProperties(myojb, obj);.
So my question is: how do I make the private method mapToMyObject() thread safe?
Or simply make BeanUtils thread safe when it's used in a private method.
And will the problem still be there if I keep the code this way but instead call this foo() method from a servlet? If many servlets call it at the same time, would that be a multi-threaded case as well?
Edit: Removed the synchronized keyword since it is not necessary, see comments below.
Instead of using the static methods in the BeanUtils class, use a private BeanUtilsBean instance (http://commons.apache.org/proper/commons-beanutils/apidocs/org/apache/commons/beanutils/BeanUtilsBean.html). This way, you don't need to register/deregister your converters each time the method is called.
public class MyServiceImpl implements IMyService {
    private final BeanUtilsBean beanUtilsBean = createBeanUtilsBean();

    private static BeanUtilsBean createBeanUtilsBean() {
        ConvertUtilsBean convertUtilsBean = new ConvertUtilsBean();
        convertUtilsBean.register(new MyNullConvertor(), String.class);
        convertUtilsBean.register(new StringConvertorForDateType(), Date.class);
        return new BeanUtilsBean(convertUtilsBean);
    }

    public MyObject foo(SomeObject obj) {
        MyObject myobj = this.mapToMyObject(obj);
        myobj.setLastUpdatedDate(new Date());
        return myobj;
    }

    private MyObject mapToMyObject(SomeObject obj) {
        MyObject myojb = new MyObject();
        beanUtilsBean.copyProperties(myojb, obj);
        return myojb;
    }
}
Add a synchronized block to the sensitive portion of your code, or synchronize the method:
http://docs.oracle.com/javase/tutorial/essential/concurrency/sync.html
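For example, synchronizing the method from the question would look like the sketch below. Note this only helps if no other code touches the shared ConvertUtils registry, and it serializes all callers, which may cost throughput; the throws clause is added because the commons-beanutils copyProperties declares checked exceptions:

private synchronized MyObject mapToMyObject(SomeObject obj)
        throws IllegalAccessException, InvocationTargetException {
    MyObject myojb = new MyObject();
    // The register/copy/deregister sequence is now atomic with respect to other
    // calls of this method, so no thread sees the Date converter removed mid-copy.
    ConvertUtils.register(new MyNullConvertor(), String.class);
    ConvertUtils.register(new StringConvertorForDateType(), Date.class);
    BeanUtils.copyProperties(myojb, obj);
    ConvertUtils.deregister(Date.class);
    return myojb;
}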

JavaFX PropertyValueFactory not populating TableView

This has baffled me for a while now and I cannot seem to get the grasp of it. I'm using a cell value factory to populate a simple one-column table, and it does not populate the table.
It does add the rows, and I can click the rows that are populated, but I do not see any values in them, in this case String values. [I just edited this to make it clearer]
I have a different project in which this works with the same kind of data model. What am I doing wrong?
Here's the code. The commented code at the end seems to work, though. I've checked for the usual mistakes, like creating a new column instance or a new TableView instance. Nothing. Please help!
//Simple Data Model
Stock.java
public class Stock {
    private SimpleStringProperty stockTicker;

    public Stock(String stockTicker) {
        this.stockTicker = new SimpleStringProperty(stockTicker);
    }

    public String getstockTicker() {
        return stockTicker.get();
    }

    public void setstockTicker(String stockticker) {
        stockTicker.set(stockticker);
    }
}
//Controller class
MainGuiController.java
private ObservableList<Stock> data;
@FXML
private TableView<Stock> stockTableView;// = new TableView<>(data);
@FXML
private TableColumn<Stock, String> tickerCol;

private void setTickersToCol() {
    try {
        Statement stmt = conn.createStatement(); //conn is defined and works
        ResultSet rsltset = stmt.executeQuery("SELECT ticker FROM tickerlist order by ticker");
        data = FXCollections.observableArrayList();
        Stock stockInstance;
        while (rsltset.next()) {
            stockInstance = new Stock(rsltset.getString(1).toUpperCase());
            data.add(stockInstance);
        }
    } catch (SQLException ex) {
        Logger.getLogger(WriteToFile.class.getName()).log(Level.SEVERE, null, ex);
        System.out.println("Connection Failed! Check output console");
    }
    tickerCol.setCellValueFactory(new PropertyValueFactory<Stock, String>("stockTicker"));
    stockTableView.setItems(data);
}
/*THIS, ON THE OTHER HAND, WORKS*/
/*
Callback<CellDataFeatures<Stock, String>, ObservableValue<String>> cellDataFeat =
        new Callback<CellDataFeatures<Stock, String>, ObservableValue<String>>() {
            @Override
            public ObservableValue<String> call(CellDataFeatures<Stock, String> p) {
                return new SimpleStringProperty(p.getValue().getstockTicker());
            }
        };
*/
Suggested solution (use a Lambda, not a PropertyValueFactory)
Instead of:
aColumn.setCellValueFactory(new PropertyValueFactory<Appointment,LocalDate>("date"));
Write:
aColumn.setCellValueFactory(cellData -> cellData.getValue().dateProperty());
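Applied to the question's code, and assuming the Stock class gains the stockTickerProperty() accessor shown further below, this becomes a one-liner:

// Observes the property directly; no reflection, and typos fail at compile time.
tickerCol.setCellValueFactory(cellData -> cellData.getValue().stockTickerProperty());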
For more information, see this answer:
Java: setCellValuefactory; Lambda vs. PropertyValueFactory; advantages/disadvantages
Solution using PropertyValueFactory
The lambda solution outlined above is preferred, but if you wish to use PropertyValueFactory, this alternate solution provides information on that.
How to Fix It
The case of your getter and setter methods is wrong.
getstockTicker should be getStockTicker
setstockTicker should be setStockTicker
Some Background Information
Your PropertyValueFactory remains the same with:
new PropertyValueFactory<Stock,String>("stockTicker")
The naming convention will seem more obvious when you also add a property accessor to your Stock class:
public class Stock {
    private SimpleStringProperty stockTicker;

    public Stock(String stockTicker) {
        this.stockTicker = new SimpleStringProperty(stockTicker);
    }

    public String getStockTicker() {
        return stockTicker.get();
    }

    public void setStockTicker(String stockticker) {
        stockTicker.set(stockticker);
    }

    public StringProperty stockTickerProperty() {
        return stockTicker;
    }
}
The PropertyValueFactory uses reflection to find the relevant accessors (these should be public). First, it will try to use the stockTickerProperty accessor and, if that is not present, fall back to getters and setters. Providing a property accessor is recommended, as it automatically enables your table to observe the property in the underlying model, dynamically updating its data as the underlying model changes.
Put the getter and setter methods in your data class for all the elements.
