Apache Spark - Map function returning empty dataset in Java

My Code:
finalJoined.show();
Encoder<Row> rowEncoder = Encoders.bean(Row.class);
Dataset<Row> validatedDS = finalJoined.map(row -> validationRowMap(row), rowEncoder);
validatedDS.show();
Map function:
public static Row validationRowMap(Row row) {
    //PART-A validateTxn()
    System.out.println("Inside map");
    //System.out.println("Value of CIS_DIVISION is " + row.getString(7));
    //1. CIS_DIVISION
    if ((row.getString(7)) == null || (row.getString(7)).trim().isEmpty()) {
        System.out.println("CIS_DIVISION cannot be blank.");
    }
    return row;
}
Output:
The finalJoined Dataset<Row> is shown properly, with all columns and rows holding the right values; however, the validatedDS Dataset<Row> is shown with only one column, and its values are empty.
Expected output:
validatedDS should show the same values as the finalJoined dataset, because I am only performing validation inside the map function and not changing the dataset itself.
Please let me know if you need more information.

Encoders.bean is intended for use with bean classes. Row is not one of these (it doesn't define setters and getters for specific fields, only generic getters).
To return a Row object you have to use RowEncoder and provide the expected output schema.
Check, for example, Encoder for Row Type Spark Datasets.
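For the dataset in the question, a minimal sketch could look like this (assuming the mapped rows keep the schema of finalJoined; RowEncoder.apply(schema) is the usual way to build an Encoder<Row> in the Spark 2.x Java API):
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;

// Reuse the input schema, since validationRowMap returns the row unchanged.
Encoder<Row> rowEncoder = RowEncoder.apply(finalJoined.schema());

Dataset<Row> validatedDS = finalJoined.map(
        (MapFunction<Row, Row>) row -> validationRowMap(row),
        rowEncoder);
validatedDS.show();
The cast to MapFunction is only there to disambiguate the Java and Scala map overloads when a lambda is used.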

Related

How to get the value of a Spark dataset column to use it dynamically in a SQL query?

I have a Dataset DS1 with a column "LEVEL". I want to check this column's value and, based on some business logic, update another column "COMPANIES", which is an array.
For this update operation, I am using the withColumn() method.
DS1.withColumn("COMPANIES", functions.when(functions.col("LEVEL").gt(1), someMethod(sparkSession, functions.col("COMPANIES"), functions.col("LEVEL"))).otherwise(functions.col("value")));
Inside someMethod(), I am trying to use the Column objects as parameters.
private int[] someMethod(SparkSession sparkSession, Column companies, Column Level) {
    String query = "Select cs.level from DS1 cs inner join DS2 cp on cs.level=" + (Level.minus(1)) + " and cs.company_private_id=ANY(" + companies + ")";
    sparkSession.sql(query);
    List<Integer> list = sparkSession.sql(query).collectAsList().get(0).getList(0);
    return list.stream().mapToInt(i -> i).toArray();
}
I could not get the values of Level and companies because they are of Column type. How can I implement this logic?
Assuming the data type of levels is Integer. If the type is something else, change row.getInt(0) accordingly, e.g. row.getDecimal(0) for a BigDecimal column.
List<Row> dataSet = sparkSession.sql(query).collectAsList();
List<Integer> levels = dataSet.stream().map(row -> row.getInt(0)).collect(Collectors.toList());
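A rough sketch of how the collected values could then be used on the driver, assuming DS1 and DS2 are registered as temporary views and COMPANIES holds integer ids (the table and column names follow the question; getList reads the array column):
import java.util.List;
import org.apache.spark.sql.Row;

// Pull LEVEL and COMPANIES to the driver, then build the follow-up query
// with plain Java values instead of Column objects.
List<Row> rows = sparkSession.sql("SELECT LEVEL, COMPANIES FROM DS1").collectAsList();
for (Row r : rows) {
    int level = r.getInt(0);                 // LEVEL column
    List<Integer> companies = r.getList(1);  // COMPANIES array column
    if (level > 1) {
        String query = "SELECT cs.level FROM DS1 cs INNER JOIN DS2 cp ON cs.level = " + (level - 1);
        List<Row> result = sparkSession.sql(query).collectAsList();
        // apply the business logic with level, companies and result here
    }
}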

Spark dataset: Casting Columns of dataset

This is my dataset:
Dataset<Row> myResult = pot.select(col("number")
, col("document")
, explode(col("mask")).as("mask"));
I now need to create a new dataset from the existing myResult, something like below:
Dataset<Row> myResultNew = myResult.select(col("number")
    , col("name")
    , col("age")
    , col("class")
    , col("mask"));
name, age and class are created from the column document of Dataset myResult.
I guess I can call functions on the column document and then perform any operation on that.
myResult.select(extract(col("document")));
private String extract(final Column document) {
    //TODO ADD NEW COLUMNS name, age, class TO THE NEW DATASET.
    // PARSE DOCUMENT AND GET THEM.
    XMLParser doc = (XMLParser) document; // this doesn't work???????
}
My question is: document is of type Column and I need to convert it into a different object type and parse it to extract name, age and class. How can I do that? document is an XML string and I need to parse it to get the other 3 columns, so I can't avoid converting it to XML.
Converting the extract method into a UDF would be a solution that is as close as possible to what you are asking. A UDF can take the values of one or more columns and execute any logic with this input.
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.expressions.UserDefinedFunction;
import org.apache.spark.sql.types.DataTypes;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.udf;
[...]
UserDefinedFunction extract = udf(
    (String document) -> {
        List<String> result = new ArrayList<>();
        XMLParser doc = XMLParser.parse(document);
        String name = ...  //read name from xml document
        String age = ...   //read age from xml document
        String clazz = ... //read class from xml document
        result.add(name);
        result.add(age);
        result.add(clazz);
        return result;
    }, DataTypes.createArrayType(DataTypes.StringType)
);
A restriction of UDFs is that they can only return one column. Therefore the function returns a String array that has to be unpacked afterwards.
Dataset<Row> myResultNew = myResult
    .withColumn("extract", extract.apply(col("document"))) //1
    .withColumn("name", col("extract").getItem(0)) //2
    .withColumn("age", col("extract").getItem(1)) //2
    .withColumn("class", col("extract").getItem(2)) //2
    .drop("document", "extract"); //3
1. Call the UDF and use the column that contains the xml document as the parameter of the apply function.
2. Create the result columns out of the array returned in step 1.
3. Drop the intermediate columns.
Note: the udf is executed once per row in the dataset. If the creation of the xml parser is expensive this might slow down the execution of the Spark job as one parser is instantiated per row. Due to the parallel nature of Spark it is not possible to reuse the parser for the next row. If this is an issue, another (at least in the Java world slightly more complex) option would be to use mapPartitions. Here one would not need one parser per row but only one parser per partition of the dataset.
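A rough sketch of that mapPartitions variant, staying with the hypothetical XMLParser from the UDF above (the output schema, the column types and the parser's accessor methods are assumptions):
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.function.MapPartitionsFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

StructType outputSchema = new StructType()
        .add("number", DataTypes.StringType) // assumed type
        .add("mask", DataTypes.StringType)   // assumed type
        .add("name", DataTypes.StringType)
        .add("age", DataTypes.StringType)
        .add("class", DataTypes.StringType);

Dataset<Row> myResultNew = myResult.mapPartitions(
        (MapPartitionsFunction<Row, Row>) rows -> {
            XMLParser parser = new XMLParser(); // created once per partition, not once per row
            List<Row> out = new ArrayList<>();
            while (rows.hasNext()) {
                Row row = rows.next();
                String document = row.getAs("document");
                // hypothetical accessors standing in for the xml parsing logic above
                String name = parser.readName(document);
                String age = parser.readAge(document);
                String clazz = parser.readClass(document);
                out.add(RowFactory.create(
                        row.getAs("number"), row.getAs("mask"), name, age, clazz));
            }
            return out.iterator();
        },
        RowEncoder.apply(outputSchema));
Buffering each partition in a list keeps the sketch short; for very large partitions a lazily evaluated iterator would be preferable.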
A completely different approach would be to use spark-xml.

Unable to get the Row from ResultSet

The following function saves data in Cassandra. It calls the abstract rowToModel method, defined in the same class, to convert the data returned from Cassandra into the respective data model.
def saveDataToDatabase(data: M): Option[M] = { //TODOM - should this also be part of the Repository trait?
  println("inserting in table " + tablename + " with partition key " + partitionKeyColumns + " and values " + data)
  val insertQuery = insertValues(tablename, data)
  println("insert query is " + insertQuery)
  try {
    val resultSet: ResultSet = session.execute(insertQuery) //execute can take a Statement. Insert is derived from Statement so I can use Insert.
    println("resultset after insert: " + resultSet)
    println("resultset applied: " + resultSet.wasApplied())
    println(s"columns definition ${resultSet.getColumnDefinitions}")
    if (resultSet.wasApplied()) {
      println(s"saved row ${resultSet.one()}")
      val savedData = rowToModel(resultSet.one())
      Some(savedData)
    } else {
      None
    }
  } catch {
    case e: Exception => {
      println("cassandra exception " + e)
      None
    }
  }
}
The abstract rowToModel is defined as follows
override def rowToModel(row: Row): PracticeQuestionTag = {
  PracticeQuestionTag(row.getLong("year"), row.getLong("month"), row.getLong("creation_time_hour"),
    row.getLong("creation_time_minute"), row.getUUID("question_id"), row.getString("question_description"))
}
But the print statements I have defined in saveDataToDatabase are not printing the data. I expected to see something like PracticeQuestionTag(2018,6,1,1,11111111-1111-1111-1111-111111111111,some description1) when I print one from the ResultSet, but what I see instead is:
resultset after insert: ResultSet[ exhausted: false, Columns[[applied](boolean)]]
resultset applied: true
columns definition Columns[[applied](boolean)]
saved row Row[true]
row to Model called for row null
cassandra exception java.lang.NullPointerException
Why are ResultSet, one and getColumnDefinitions not showing me the values from the data model?
This is by design. The result set of an insert only contains a single row, which tells whether the insert was applied or not.
When executing a conditional statement, the ResultSet will contain a single Row with a column named “applied” of type boolean. This tells whether the conditional statement was successful or not.
It also makes sense: ResultSet is supposed to return the result of the query, and why would you want to make the result set object heavy by returning all the inputs in it? More details can be found here.
Of course, read (SELECT) queries will return the detailed result set.

Define UDF in Spark Scala

I need to use a UDF in Spark that takes in a timestamp, an Integer and another dataframe and returns a tuple of 3 values.
I keep hitting error after error and I'm no longer sure I'm fixing it the right way.
Here is the function:
def determine_price(view_date: org.apache.spark.sql.types.TimestampType, product_id: Int, price_df: org.apache.spark.sql.DataFrame): (Double, java.sql.Timestamp, Double) = {
  var price_df_filtered = price_df.filter($"mkt_product_id" === product_id && $"created" <= view_date)
  var price_df_joined = price_df_filtered.groupBy("mkt_product_id").agg("view_price" -> "min", "created" -> "max").withColumn("last_view_price_change", lit(1))
  var price_df_final = price_df_joined.join(price_df_filtered, price_df_joined("max(created)") === price_df_filtered("created")).filter($"last_view_price_change" === 1)
  var result = (price_df_final.select("view_price").head().getDouble(0), price_df_final.select("created").head().getTimestamp(0), price_df_final.select("min(view_price)").head().getDouble(0))
  return result
}
val det_price_udf = udf(determine_price)
The error it gives me is:
error: missing argument list for method determine_price
Unapplied methods are only converted to functions when a function type is expected.
You can make this conversion explicit by writing `determine_price _` or `determine_price(_,_,_)` instead of `determine_price`.
If I start adding the arguments I keep running into other errors, such as Int expected, Int.type found or object DataFrame is not a member of package org.apache.spark.sql.
To give some context:
The idea is that I have a dataframe of prices (with a product id and a creation date) and another dataframe containing product IDs and view dates.
I need to determine the price based on which was the last created price entry that is older than the view date.
Since each product ID has multiple view dates in the second dataframe, I thought a UDF would be faster than a cross join. If anyone has a different idea, I'd be grateful.
You cannot pass a DataFrame into a UDF, because the UDF runs on the workers, on a particular partition. Just as you cannot use an RDD on a worker (see Is it possible to create nested RDDs in Apache Spark?), you cannot use a DataFrame on a worker either.
You need to find a workaround for this!

Spark DataFrame created from JavaRDD<Row> copies all columns' data into the first column

I have a DataFrame which I need to convert into JavaRDD<Row> and back to a DataFrame. I have the following code:
DataFrame sourceFrame = hiveContext.read().format("orc").load("/path/to/orc/file");
//I do order by in above sourceFrame and then I convert it into JavaRDD
JavaRDD<Row> modifiedRDD = sourceFrame.toJavaRDD().map(new Function<Row, Row>() {
    public Row call(Row row) throws Exception {
        if (row != null) {
            //updated row by creating new Row
            return RowFactory.create(updateRow);
        }
        return null;
    }
});
//now I convert above JavaRDD<Row> into DataFrame using the following
DataFrame modifiedFrame = sqlContext.createDataFrame(modifiedRDD,schema);
sourceFrame and modifiedFrame have the same schema. When I call sourceFrame.show() the output is as expected: every column has its corresponding values and no column is empty. But when I call modifiedFrame.show() I see all the column values merged into the first column. For example, assume the source DataFrame has 3 columns as shown below:
_col1 _col2 _col3
ABC 10 DEF
GHI 20 JKL
When I print modifiedFrame, which I converted from the JavaRDD, it shows up as follows:
_col1 _col2 _col3
ABC,10,DEF
GHI,20,JKL
As shown above, _col1 has all the values and _col2 and _col3 are empty. I don't know what is wrong.
As I mentioned in the question's comments:
It probably occurs because the list is passed as a single parameter.
return RowFactory.create(updateRow);
Looking at the Apache Spark docs and source code: in the programmatically specifying the schema example, the parameters are assigned one by one, one per column. If you look at the source code, the RowFactory.java class and the GenericRow class do not split a single list parameter into separate columns. So try to pass the row's column values individually:
return RowFactory.create(updateRow.get(0),updateRow.get(1),updateRow.get(2)); // List Example
Alternatively, you can convert your list to an array and then pass it as the parameter:
YourObject[] updatedRowArray= new YourObject[updateRow.size()];
updateRow.toArray(updatedRowArray);
return RowFactory.create(updatedRowArray);
By the way, the RowFactory.create() method creates Row objects. From the Apache Spark documentation about the Row object and the RowFactory.create() method:
Represents one row of output from a relational operator. Allows both generic access by ordinal, which will incur boxing overhead for primitives, as well as native primitive access. It is invalid to use the native primitive interface to retrieve a value that is null, instead a user must check isNullAt before attempting to retrieve a value that might be null.
To create a new Row, use RowFactory.create() in Java or Row.apply() in Scala.
A Row object can be constructed by providing field values. Example:
import org.apache.spark.sql._
// Create a Row from values.
Row(value1, value2, value3, ...)
// Create a Row from a Seq of values.
Row.fromSeq(Seq(value1, value2, ...))
According to the documentation, you can also apply your own logic to split the row into separate column values while creating the Row objects. But I think converting the list to an array and passing it as a parameter will work for you (I couldn't try it, please post your feedback, thanks).
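Putting this together, a minimal sketch of the corrected map function, reusing sourceFrame, sqlContext and schema from the question (assuming updateRow holds the updated column values as a java.util.List; only the RowFactory.create call really changes):
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;

JavaRDD<Row> modifiedRDD = sourceFrame.toJavaRDD().map(new Function<Row, Row>() {
    public Row call(Row row) throws Exception {
        if (row != null) {
            // Collect the (possibly updated) values column by column.
            List<Object> updateRow = new ArrayList<>();
            for (int i = 0; i < row.length(); i++) {
                updateRow.add(row.get(i)); // apply any per-column update here
            }
            // toArray() expands to varargs, so each element becomes its own
            // column instead of the whole list landing in _col1.
            return RowFactory.create(updateRow.toArray());
        }
        return null;
    }
});
DataFrame modifiedFrame = sqlContext.createDataFrame(modifiedRDD, schema);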
