Spark - Performing union of DataFrames inside a for loop starting from an empty DataFrame

I have a DataFrame with a column called "generationId" and other fields. The "generationId" field takes integer values from 1 to N (the upper bound N is known and small, between 10 and 15), and I want to process the DataFrame in the following way (pseudocode):
results = emptyDataFrame <=== how do I do this ?
for (i <- 0 until getN(df)) {
val input = df.filter($"generationId" === i)
results.union(getModel(i).transform(input))
}
Here getN(df) gives the N for that data frame based on some criteria. In the loop, input is filtered based on matching against "i" and then fed to some model (some internal library) which transforms the input by adding 3 more columns to it.
Ultimately I would like to get the union of all those transformed data frames, so that for each row I have all columns of the original data frame plus the 3 additional columns added by the model. I am not able to figure out how to initialize results and union the results in each iteration. I do know the exact schema of the result ahead of time, so I did
val newSchema = ...
but I am not sure how to pass that to the emptyRDD function to build an empty DataFrame and use it inside the loop.
Also, if there is a more efficient way to do this inside a map operation, please suggest it.

You can do something like this:
(0 until getN(df))
  .map(i => {
    val input = df.filter($"generationId" === i)
    getModel(i).transform(input)
  })
  .reduce(_ union _)
That way you don't need to worry about the empty DataFrame. Note that reduce throws on an empty collection, so this assumes getN(df) is at least 1.
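The same map-then-reduce shape carries over to Python, where functools.reduce plays the role of Scala's reduce. The sketch below is a plain-Python model rather than Spark code: lists of dicts stand in for DataFrames, and transform_generation is a hypothetical stand-in for getModel(i).transform. With real PySpark DataFrames, the last step would presumably become reduce(DataFrame.union, frames).

```python
from functools import reduce

# Stand-in data: each dict models one DataFrame row.
rows = [
    {"generationId": 0, "x": 1.0},
    {"generationId": 1, "x": 2.0},
    {"generationId": 0, "x": 3.0},
]

def transform_generation(i, subset):
    """Hypothetical replacement for getModel(i).transform(input): adds one column."""
    return [dict(row, score=row["x"] * (i + 1)) for row in subset]

def union(left, right):
    """Models DataFrame.union: appends the rows of right to those of left."""
    return left + right

def process(rows, n):
    # One transformed "frame" per generation; no empty seed is needed.
    frames = [
        transform_generation(i, [r for r in rows if r["generationId"] == i])
        for i in range(n)
    ]
    # reduce raises TypeError on an empty list, so n must be at least 1.
    return reduce(union, frames)

result = process(rows, 2)
```

Because the first transformed frame acts as the starting value, the schema of the result never has to be declared up front.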

Related

how to efficiently parse dataframe object into a map of key-value pairs

I'm working with a DataFrame with the columns basketID and itemID. Is there a way to efficiently parse through the dataset and generate a map where the keys are basketID and the value is a set of all the itemID values contained within each basket?
My current implementation uses a for loop over the data frame, which isn't very scalable. Is it possible to do this more efficiently? Any help would be appreciated, thanks!
screen shot of sample data
The goal is to obtain basket = Map("b1" -> Set("i1", "i2", "i3"), "b2" -> Set("i2", "i4"), "b3" -> Set("i3", "i5"), "b4" -> Set("i6")). Here's the implementation I have using a for loop:
// create empty container
val basket = scala.collection.mutable.Map[String, Set[String]]()
// loop over all numerical indexes for baskets (b<i>)
for (i <- 1 to 4) {
  basket("b" + i.toString) = Set()
}
// loop over every row in df and store the items to the set
df.collect().foreach(row =>
  basket(row(0).toString) += row(1).toString
)
You can simply use aggregateByKey, and then collectAsMap will give you the desired result directly. It is generally more efficient than a plain groupByKey, because values are combined within each partition before the shuffle.
import scala.collection.mutable
import spark.implicits._
case class Items(basketID: String, itemID: String)
// output is the DataFrame with the basketID and itemID columns
val result = output.as[Items].rdd.map(x => (x.basketID, x.itemID))
  // a Set (rather than a Buffer) also removes duplicates within a single partition
  .aggregateByKey(mutable.Set.empty[String])(
    (items, item) => items += item, // add an item within a partition
    (s1, s2) => s1 ++= s2)          // merge the sets of two partitions
  .collectAsMap()
You can also check the other aggregation APIs, such as reduceByKey and groupByKey, and the differences between aggregateByKey, groupByKey, and reduceByKey.
Your current approach is efficient as long as your dataset is small enough to fit into the driver's memory: .collect gives you an array of rows, and iterating over it is fine. If you want scalability, then instead of a Map[String, Set[String]] (which resides in driver memory) you can use an RDD[(String, Set[String])] (which is distributed).
// NOT TESTED
// Assuming df is a DataFrame with 2 columns: the first is your basketId and the second is itemId
df.rdd.map(row => (row.getAs[String](0), row.getAs[String](1))).groupByKey().mapValues(x => x.toSet)
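To make the two functions passed to aggregateByKey more concrete, here is a plain-Python model of what it computes. This is a sketch, not Spark code: aggregate_by_key is a hypothetical helper, and the two lists stand in for partitions of (basketID, itemID) pairs. The first function folds values inside one partition; the second merges the per-partition results for the same key.

```python
from collections import defaultdict

def aggregate_by_key(partitions, zero_factory, seq_op, comb_op):
    """Plain-Python model of aggregateByKey over a list of partitions."""
    partials = []
    for partition in partitions:
        acc = defaultdict(zero_factory)
        for key, value in partition:
            acc[key] = seq_op(acc[key], value)  # fold values within one partition
        partials.append(acc)
    merged = {}
    for acc in partials:
        for key, value in acc.items():
            # Combine the per-partition results that share a key.
            merged[key] = comb_op(merged[key], value) if key in merged else value
    return merged

# (basketID, itemID) pairs split across two hypothetical partitions.
partitions = [
    [("b1", "i1"), ("b1", "i2"), ("b2", "i2"), ("b3", "i3")],
    [("b1", "i3"), ("b2", "i4"), ("b3", "i5"), ("b4", "i6"), ("b1", "i1")],
]

basket = aggregate_by_key(
    partitions,
    zero_factory=set,
    seq_op=lambda items, item: items | {item},
    comb_op=lambda left, right: left | right,
)
```

Using a set as the accumulator removes duplicates in both steps, so the result matches the Map of Sets the question asks for.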

Use data in Spark Dataframe column as condition or input in another column expression

I have an operation that I want to perform within PySpark 2.0 that would be easy to perform as a df.rdd.map, but since I would prefer to stay inside the Dataframe execution engine for performance reasons, I want to find a way to do this using Dataframe operations only.
The operation, in RDD-style, is something like this:
def precision_formatter(row):
    formatter = "%.{}f".format(row.precision)
    # Parenthesize the division: % and / have equal precedence and bind left to right
    return tuple(row) + (formatter % (row.amount_raw / 10 ** row.precision),)
df = df.rdd.map(precision_formatter)
Basically, I have a column that tells me, for each row, what the precision for my string formatting operation should be, and I want to selectively format the 'amount_raw' column as a string depending on that precision.
I don't know of a way to use the contents of one or more columns as input to another Column operation. The closest I can come is suggesting the use of Column.when with an externally-defined set of boolean operations that correspond to the set of possible boolean conditions/cases within the column or columns.
In this specific case, for instance, if you can obtain (or better yet, already have) all possible values of row.precision, then you can iterate over that set and apply a Column.when operation for each value in the set. I believe this set can be obtained with df.select('precision').distinct().collect(); note that collect() returns Row objects, so the precision value has to be read from each Row.
Because the pyspark.sql.functions.when and Column.when operations themselves return a Column object, you can iterate over the items in the set (however it was obtained) and keep 'appending' when operations to each other programmatically until you have exhausted the set:
import pyspark.sql.functions as PSF
from pyspark.sql.types import StringType

def format_amounts_with_precision(df, all_precisions_set):
    amt_col = PSF.when(df['precision'] == 0, df['amount_raw'].cast(StringType()))
    for precision in all_precisions_set:
        if precision != 0:  # this is a messy way of having a base case above
            fmt_str = '%.{}f'.format(precision)
            amt_col = amt_col.when(df['precision'] == precision,
                                   PSF.format_string(fmt_str, df['amount_raw'] / 10 ** precision))
    return df.withColumn('amount', amt_col)
You can do it with a Python UDF. A UDF can take any number of input values (values from the columns of a Row) and returns a single output value. It would look something like this:
from pyspark.sql import types as T, functions as F
from pyspark.sql.functions import udf, col
# Create example data frame
schema = T.StructType([
    T.StructField('precision', T.IntegerType(), False),
    T.StructField('value', T.FloatType(), False)
])
data = [
    (1, 0.123456),
    (2, 0.123456),
    (3, 0.123456)
]
rdd = sc.parallelize(data)
df = sqlContext.createDataFrame(rdd, schema)
# Define UDF and apply it
def format_func(precision, value):
    format_str = "{:." + str(precision) + "f}"
    return format_str.format(value)
format_udf = F.udf(format_func, T.StringType())
new_df = df.withColumn('formatted', format_udf('precision', 'value'))
new_df.show()
Also, if instead of the column precision value you wanted to use a global one, you could use the lit(..) function when you call it like this:
new_df = df.withColumn('formatted', format_udf(F.lit(2), 'value'))
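For the question's own columns, the per-row logic can be written and checked as an ordinary function before it is wrapped in a UDF. This is a minimal sketch assuming Python 3 (true division), that amount_raw is an integer, and that precision is a non-negative integer; format_amount is a hypothetical name.

```python
def format_amount(amount_raw, precision):
    """Scale amount_raw by 10**precision and format it with that many decimals."""
    formatter = "%.{}f".format(precision)
    # Parentheses matter: '%' and '/' share precedence and bind left to right.
    return formatter % (amount_raw / 10 ** precision)

examples = [
    format_amount(12345, 2),
    format_amount(12345, 0),
    format_amount(5, 3),
]
```

Following the pattern in the answer above, this function could then be registered with F.udf(format_amount, T.StringType()) and applied to the 'amount_raw' and 'precision' columns.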

Iterating over CompactBuffer in Spark

I have a PairedRDD where values are grouped based on key, from which I fetch a particular key:
pairsgrouped.lookup("token")
Output:
res6: Seq[Iterable[String]] = ArrayBuffer(CompactBuffer(EC-17A5206955089011B, EC-17A5206955089011A))
I want to iterate over these values and filter results even further. But I am not able to iterate over it.
I tried the following way:
pairsgrouped.lookup("token").foreach(println)
But this gives me this:
CompactBuffer(EC-17A5206955089011B, EC-17A5206955089011A)
I want to iterate over these values and filter the results using these values.
The values of your grouped data are Iterables, not individual values, so each element you print is a whole group. Flatten the groups and collect the values to the driver before printing them. To get only the values for the group with key "token", filter on the key first:
val result = groupedData.filter { case (key, _) => key == "token" }
  .values
  .flatMap(i => i.toList)
  .collect()
result foreach println
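The shape of the problem can be seen without Spark. The sketch below is a plain-Python model, assuming the lookup result is a sequence that holds one iterable per matching key, as in the output shown in the question; the final filter is a hypothetical example of further filtering. In Scala, calling .flatten on the result of lookup should flatten the groups in the same way.

```python
from itertools import chain

# Models the result of lookup("token"): a sequence holding one iterable per matching key.
lookup_result = [["EC-17A5206955089011B", "EC-17A5206955089011A"]]

# Flatten the outer sequence so that each token is visited individually.
tokens = list(chain.from_iterable(lookup_result))

# Hypothetical follow-up filter: keep only the tokens that end in "A".
filtered = [t for t in tokens if t.endswith("A")]
```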

Spark DataFrame created from JavaRDD<Row> copies all columns data into first column

I have a DataFrame which I need to convert into a JavaRDD<Row> and back to a DataFrame. I have the following code:
DataFrame sourceFrame = hiveContext.read().format("orc").load("/path/to/orc/file");
//I do order by in above sourceFrame and then I convert it into JavaRDD
JavaRDD<Row> modifiedRDD = sourceFrame.toJavaRDD().map(new Function<Row, Row>() {
    public Row call(Row row) throws Exception {
        if (row != null) {
            // updated row by creating new Row
            return RowFactory.create(updateRow);
        }
        return null;
    }
});
//now I convert above JavaRDD<Row> into DataFrame using the following
DataFrame modifiedFrame = sqlContext.createDataFrame(modifiedRDD,schema);
The schemas of sourceFrame and modifiedFrame are the same. When I call sourceFrame.show(), the output is as expected: every column has its corresponding values and no column is empty. But when I call modifiedFrame.show(), all the column values are merged into the first column. For example, assume the source DataFrame has 3 columns, as shown below:
_col1 _col2 _col3
ABC 10 DEF
GHI 20 JKL
When I print modifiedFrame, which I converted from the JavaRDD, it shows the following:
_col1 _col2 _col3
ABC,10,DEF
GHI,20,JKL
As shown above, _col1 holds all the values and _col2 and _col3 are empty. I don't know what is wrong.
As I mentioned in a comment on the question:
This probably happens because the list is passed as a single argument.
return RowFactory.create(updateRow);
Looking at the Apache Spark docs and source code: in the example that specifies a schema programmatically, the values are passed one by one, one per column. From a quick look at the RowFactory.java and GenericRow classes, a single list argument is not expanded into separate columns; RowFactory.create takes a variable number of values, so a list passed on its own becomes a single field. So try passing the value for each column individually:
return RowFactory.create(updateRow.get(0),updateRow.get(1),updateRow.get(2)); // List Example
Alternatively, you can convert your list to an array and pass that as the parameter (this assumes updateRow is a java.util.List):
Object[] updatedRowArray = updateRow.toArray();
return RowFactory.create(updatedRowArray);
By the way, the RowFactory.create() method creates Row objects. The Apache Spark documentation says the following about the Row object and the RowFactory.create() method:
Represents one row of output from a relational operator. Allows both generic access by ordinal, which will incur boxing overhead for
primitives, as well as native primitive access. It is invalid to use
the native primitive interface to retrieve a value that is null,
instead a user must check isNullAt before attempting to retrieve a
value that might be null.
To create a new Row, use RowFactory.create() in Java or Row.apply() in
Scala.
A Row object can be constructed by providing field values. Example:
import org.apache.spark.sql._
// Create a Row from values.
Row(value1, value2, value3, ...)
// Create a Row from a Seq of values.
Row.fromSeq(Seq(value1, value2, ...))
According to the documentation, you can also apply your own logic to separate the column values while creating the Row objects. But I think converting the list to an array and passing the array as the parameter will work for you (I couldn't try it; please post your feedback, thanks).
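The difference between passing a list and passing its elements is easy to reproduce in any language with variable-length argument lists. The following is only a Python analogy, not the Spark API: create_row is a hypothetical function that models a varargs factory such as RowFactory.create(Object... values).

```python
def create_row(*values):
    """Models a varargs factory: every argument becomes one field of the row."""
    return tuple(values)

update_row = ["ABC", 10, "DEF"]

# Passing the list itself yields a single field that holds the whole list.
merged = create_row(update_row)

# Unpacking the list yields one field per element.
separate = create_row(*update_row)
```

Here merged has one field and separate has three, which mirrors the symptom in the question: all values end up in the first column.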

What is the most effective way to get elements of RDD in spark

I need to get the values of two columns of a DataFrame that has been converted to an RDD.
The first solution I thought of is this:
First, convert the RDD to a list of Rows with RDD.collect(),
then, for each element of the list, get the values with row.getInt(column_index).
This solution works fine with small and medium-sized data, but with large data I run out of memory.
My temporary solution is to create a new RDD that contains only the two columns instead of all columns, and then apply the solution above; this may reduce most of the memory needed.
Current implementation is like this:
Row[] rows = sparkDataFrame.collect();
for (int i = 0; i < rows.length; i++) { //about 50 million rows
int yTrue = rows[i].getInt(0);
int yPredict = rows[i].getInt(1);
}
Could you help me to improve my solution, or suggest me other solutions!
Thanks!
ps: I'm a new spark's user!
First convert your big RDD into a DataFrame, and then you can directly select whichever columns you require.
// Create the DataFrame
DataFrame df = sqlContext.jsonFile("examples/src/main/resources/people.json");
// Select only the "name" and "age" columns
df.select(df.col("name"), df.col("age")).show();
For more detail you can follow this link
