How to merge data frames in Rcpp

I have a requirement to merge the contents of two data frames: given data frames A and B, I need C = A + B. I tried push_back(), but it is very, very slow.
// [[Rcpp::export]]
DataFrame mergeEx(DataFrame &dfSrcA, DataFrame &dfSrcB)
{
    Rcpp::IntegerVector col = dfSrcA["A1"];
    Rcpp::IntegerVector col1 = dfSrcA["A2"];
    Rcpp::IntegerVector Bcol = dfSrcB["A1"];
    Rcpp::IntegerVector Bcol1 = dfSrcB["A2"];
    // col.push_back(7);
    // col.push_back(8);
    // col1.push_back(9);
    // col1.push_back(10);
    // return DataFrame::create(l, l1);
    return DataFrame::create(Named("A1") = col, Named("A2") = col1);
}
I also tried std::copy() as in the function below, but I get an error:
// [[Rcpp::export]]
DataFrame mergeEx(DataFrame &dfSrcA, DataFrame &dfSrcB)
{
    Rcpp::IntegerVector col = dfSrcA["A1"];
    Rcpp::IntegerVector col1 = dfSrcA["A2"];
    Rcpp::IntegerVector Bcol = dfSrcB["A1"];
    Rcpp::IntegerVector Bcol1 = dfSrcB["A2"];
    long lsize = col.size();
    std::copy(Bcol.begin(), Bcol.end(), col.begin() + lsize);
    lsize = col1.size();
    std::copy(Bcol1.begin(), Bcol1.end(), col1.begin() + lsize);
    return DataFrame::create(Named("A1") = col, Named("A2") = col1);
}
C = mergeEx(A, B)

I get the error below:

Error in mergeEx(A, B, C) :
  could not convert using R function : as.data.frame
Please suggest the fastest way to merge two data frames using Rcpp; latency is the concern here, so speed matters.
Also, could you point me to good documentation on DataFrame? The documents I have found provide only very basic information.
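For context on why both attempts fail: Rcpp vectors are fixed-length, so push_back() copies the whole vector on every call (hence the slowness), and std::copy starting at col.begin() + col.size() writes past the end of a buffer that was never enlarged (hence the conversion error / undefined behavior). The usual fix is to allocate a result column of size nA + nB once, then copy both sources into it. Below is a minimal sketch of that pattern in plain C++ using std::vector; the same iterator-based copy applies to Rcpp::IntegerVector once it is constructed with the combined length (e.g. Rcpp::IntegerVector out(nA + nB)). The function name concatColumn is my own, not from the question.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Concatenate two fixed-size columns into a freshly allocated result.
// Key point: allocate the full nA + nB length up front, then copy each
// source into its half; never write past an existing vector's end.
std::vector<int> concatColumn(const std::vector<int> &a,
                              const std::vector<int> &b) {
    std::vector<int> out(a.size() + b.size());             // preallocate once
    std::copy(a.begin(), a.end(), out.begin());            // A's rows first
    std::copy(b.begin(), b.end(), out.begin() + a.size()); // then B's rows
    return out;
}
```

In the Rcpp function you would build one such combined vector per column and hand them to DataFrame::create as before.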

Related

I have a data frame whose columns have no names, and I want to name them in Rcpp. How can I do that?

I am very new to Rcpp. I have a data frame whose columns have no names, and I want to name them in Rcpp as a first step after receiving the data frame as input. Please let me know how I can do that.
Welcome to StackOverflow. We can modify the existing example in the RcppExamples package (which you may find helpful, as with other parts of the Rcpp documentation) to show this.
In essence, we just reassign the names attribute.
Code
#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
List DataFrameExample(const DataFrame &DF) {
    // access each column by name
    IntegerVector a = DF["a"];
    CharacterVector b = DF["b"];
    DateVector c = DF["c"];

    // do something
    a[2] = 42;
    b[1] = "foo";
    c[0] = c[0] + 7;            // move up a week

    // create a new data frame
    DataFrame NDF = DataFrame::create(Named("a") = a,
                                      Named("b") = b,
                                      Named("c") = c);

    // and reassign names
    NDF.attr("names") = CharacterVector::create("tic", "tac", "toe");

    // and return old and new in a list
    return List::create(Named("origDataFrame") = DF,
                        Named("newDataFrame") = NDF);
}

/*** R
D <- data.frame(a=1:3,
                b=LETTERS[1:3],
                c=as.Date("2011-01-01")+0:2)
rl <- DataFrameExample(D)
print(rl)
*/
Demo
R> Rcpp::sourceCpp("~/git/stackoverflow/61616170/answer.cpp")
R> D <- data.frame(a=1:3,
+ b=LETTERS[1:3],
+ c=as.Date("2011-01-01")+0:2)
R> rl <- DataFrameExample(D)
R> print(rl)
$origDataFrame
a b c
1 1 A 2011-01-08
2 2 foo 2011-01-02
3 42 C 2011-01-03
$newDataFrame
tic tac toe
1 1 A 2011-01-08
2 2 foo 2011-01-02
3 42 C 2011-01-03
R>
If you comment that line out, you keep the old names.

treeAggregate use case explanation

I am trying to understand treeAggregate, but there aren't many examples online.
Does the following code merge the elements of each partition and then call makeSummary, doing the same for every partition in parallel (summing the results and summarizing them again)? And with depth set to, say, 5, is that repeated 5 times?
What I want is to keep summarizing the arrays until only one remains.
val summary = input.transform(rdd => {
  rdd.treeAggregate(initialSet)(addToSet, mergePartitionSets, 5)
  // this returns Array[Double], not an RDD, but still
})

val initialSet = Array.empty[Double]

def addToSet = (s: Array[Double], v: (Int, Array[Double])) => {
  val p = s ++ v._2
  makeSummary(p, 10000)
}

val mergePartitionSets = (p1: Array[Double], p2: Array[Double]) => {
  val p = p1 ++ p2
  makeSummary(p, 10000)
}

// makeSummary selects half of the points of p randomly
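As far as I understand treeAggregate's contract: the seqOp (addToSet) folds the elements of each partition once, and the combOp (mergePartitionSets) merges the per-partition results. The depth parameter does not repeat the aggregation 5 times; it only controls how many levels the combOp merge tree uses before the driver sees the final value (a deeper tree offloads merging work from the driver). A minimal local simulation of that contract in plain Scala (no Spark; the partition data here is made up):

```scala
// Simulate treeAggregate's two operators on hand-made "partitions".
val partitions = Seq(Seq(1.0, 2.0), Seq(3.0, 4.0), Seq(5.0))
val zero = 0.0
val seqOp  = (acc: Double, v: Double) => acc + v // folds within a partition
val combOp = (a: Double, b: Double)   => a + b   // merges partition results

// Step 1: seqOp runs once over each partition's elements.
val perPartition = partitions.map(_.foldLeft(zero)(seqOp))
// Step 2: combOp merges the partial results
// (in Spark, arranged as a tree with `depth` levels).
val result = perPartition.reduce(combOp)
```

The analogous requirement in your code is that mergePartitionSets be associative, since Spark is free to merge the partial summaries in any tree shape.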

Spark reduce by

I have streaming data coming in as follows:

id, date, value
i1, 12-01-2016, 10
i2, 12-02-2016, 20
i1, 12-01-2016, 30
i2, 12-05-2016, 40

I want to reduce by id to get aggregated value info by date. The output required from the RDD is, for a given id, a list of 365 day slots; the value goes into the list position given by the day of the year (12-01-2016 is day 336), and since there are two records for device i1 with the same date, they should be aggregated:

id, List [0|1|2|3|... |336| 337| |340| |365]
i1, |10+30| -- this goes to position 336
i2, 20 40 -- these go to positions 337 and 340

Please guide me on the reduceByKey or groupBy transformation to do this.
I'll provide a basic code snippet with a few assumptions, as you haven't specified the language, data source, or data format.
JavaDStream<String> lineStream = // your streaming data source
JavaPairDStream<String, Long> firstReduce = lineStream.mapToPair(line -> {
    String[] fields = line.split(",");
    String idDate = fields[0] + "," + fields[1]; // keep a separator so the key can be split later
    Long value = Long.valueOf(fields[2]);
    return new Tuple2<String, Long>(idDate, value);
}).reduceByKey((v1, v2) -> v1 + v2);

firstReduce.map(idDateValueTuple -> {
    String idDate = idDateValueTuple._1();
    Long valueSum = idDateValueTuple._2();
    String id = idDate.split(",")[0];
    String date = idDate.split(",")[1];
    // TODO parse the date and put valueSum into the array position you want
});
I can only get this far; I am not sure how to add each element of the arrays in the final step. Hope this helps! If you work out the last step or an alternate way, please post it here.
def getDateDifference(dateStr: String): Int = {
  val startDate = "01-01-2016"
  val formatter = DateTimeFormatter.ofPattern("MM-dd-yyyy")
  val oldDate = LocalDate.parse(startDate, formatter)
  val newDate = LocalDate.parse(dateStr, formatter)
  newDate.toEpochDay().toInt - oldDate.toEpochDay().toInt
}

def getArray(numberOfDays: Int, data: Int): Iterable[Int] = {
  val daysArray = new Array[Int](366)
  daysArray(numberOfDays) = data
  daysArray
}

val idRDD = <read from stream>
val idRDDMap = idRDD.map { rec => ((rec.split(",")(0), rec.split(",")(1)),
  (getDateDifference(rec.split(",")(1)), rec.split(",")(2).toInt)) }
val idRDDconsiceMap = idRDDMap.map { rec => (rec._1._1, getArray(rec._2._1, rec._2._2)) }
val finalRDD = idRDDconsiceMap.reduceByKey((acc, value) => ??? /* add each element of the arrays */)
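For the missing last step, the reduce function just needs an element-wise sum of the two day-indexed arrays, which zip gives directly. A sketch (mergeDayArrays is my own name, not from the answer above):

```scala
// Element-wise sum of two equal-length day-indexed arrays:
// position i of the result is a(i) + b(i).
def mergeDayArrays(a: Array[Int], b: Array[Int]): Array[Int] =
  a.zip(b).map { case (x, y) => x + y }
```

With getArray returning Array[Int] rather than Iterable[Int], the final line would then read idRDDconsiceMap.reduceByKey(mergeDayArrays).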

I want to collect the values of a data frame column into an array to carry out some computations. Is that possible?

I am loading data from Phoenix with:

val tableDF = sqlContext.phoenixTableAsDataFrame("Hbtable", Array("ID", "distance"), conf = configuration)

and I want to carry out the following computation on the values of the column distance:
val list = Array(10, 20, 30, 40, 10, 20, 0, 10, 20, 30, 40, 50, 60) // values from the column distance
val first = list(0)
val last = list(list.length - 1)
var m = 0
for (a <- 0 to list.length - 2) {
  if (list(a + 1) < list(a) && list(a + 1) >= 0) {
    m = m + list(a)
  }
}
val totalDist = m + last - first
You can do something like this; it returns an Array[Any]:

val array = df.select("distance").rdd.map(r => r(0)).collect()

If you want the proper data type, you can cast; this returns an Array[Int]:

val array = df.select("distance").rdd.map(r => r(0).asInstanceOf[Int]).collect()
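Once the values are collected, the loop from the question can be packaged as an ordinary function over the array, which makes it easy to test off-cluster. This is just the question's logic reworded, with totalDist as the function name:

```scala
// Sum each value that is followed by a smaller non-negative value,
// then add the last element and subtract the first, exactly as in
// the question's loop.
def totalDist(list: Array[Int]): Int = {
  var m = 0
  for (a <- 0 to list.length - 2) {
    if (list(a + 1) < list(a) && list(a + 1) >= 0) m += list(a)
  }
  m + list.last - list.head
}
```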

Accessing rows outside of window while aggregating in Spark dataframe

In short, in the example below I want to pin 'b to the value it has in the row where the result will appear.
Given:
a,b
1,2
4,6
3,7 ==> 'special would be: (1-7 + 4-7 + 3-7) == -13 in this row
val baseWin = Window.partitionBy("something_I_forgot").orderBy("whatever")
val sumWin = baseWin.rowsBetween(-2, 0)
frame.withColumn("special",sum( 'a - 'b ).over(win) )
Another way to think of it: I want to close over the current row when calculating the sum, so that I can pass in its value of 'b (in this case 7).
Update
Here is what I want to accomplish as a UDF. In short, I used a foldLeft.

def mad(field: Column, numPeriods: Integer): Column = {
  val baseWin = Window.partitionBy("exchange", "symbol").orderBy("datetime")
  val win = baseWin.rowsBetween(numPeriods + 1, 0)
  val subFunc: (Seq[Double], Int) => Double = { (input: Seq[Double], numPeriods: Int) =>
    val agg = grizzled.math.stats.mean(input: _*)
    (1.0 / -numPeriods) * input.foldLeft(0.0)((a, b) => a + Math.abs(b - agg))
  }
  val myUdf = udf(subFunc)
  myUdf(collect_list(field.cast(DoubleType)).over(win), lit(numPeriods))
}
If I understood correctly what you're trying to do, I think you can refactor your logic a bit to achieve it. The way you have it right now, you're probably getting -7 instead of -13.
For the "special" column, (1-7) + (4-7) + (3-7), you can compute it as sum(a) - count(*) * b:

dfA.withColumn("special", sum('a).over(win) - count("*").over(win) * 'b)
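The refactoring works because 'b is constant within the result row: summing (a_i - b) over an n-row window equals sum(a_i) - n * b. A quick plain-Scala check of that identity on the example values (no Spark needed):

```scala
val a = Seq(1, 4, 3) // the window's 'a values
val b = 7            // 'b in the current (last) row of the window
val direct     = a.map(_ - b).sum     // (1-7) + (4-7) + (3-7)
val refactored = a.sum - a.length * b // sum(a) - count * b
```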
