Java Spark convert Empty values in Dataframe to null - apache-spark

I'm trying to convert all empty values in my spark Dataframe to null using:
df.withColumn(colname, when(df.col(colname).equalTo(""), null)
.otherwise(df.col(colname)));
It's working, but I'd have to do this for every column. Is there any other way in Java Spark to go over all the columns in the DataFrame and replace the empty values with null?

If you want to apply that transformation to all your columns, you could use df.columns() to list all the columns and use the same construct on all of them with a for loop or a stream like below:
List<Column> list = Arrays
    .stream(df.columns())
    .map(colname -> functions
        .when(df.col(colname).equalTo(""), null)
        .otherwise(df.col(colname))
        .alias(colname)) // keep the original column name
    .collect(Collectors.toList());
Dataset<Row> result = df.select(list.toArray(new Column[0]));

You can loop over the DataFrame's columns and apply the same replace-with-null operation to each one. Note that withColumn returns a new Dataset rather than modifying the existing one, so the result has to be reassigned on every iteration:
Dataset<Row> ds = ...; // input dataframe
for (String c : ds.columns()) {
    ds = ds.withColumn(c, when(col(c).equalTo(""), null).otherwise(col(c)));
}

Related

How to add column to a DataFrame where value is fetched from a map with other column from row as key

I'm new to Spark, and I'm trying to figure out how I can add a column to a DataFrame where its value is fetched from a HashMap, using another column's value on the same row as the key.
For example, I have a map defined as follows:
var myMap: Map<Integer,Integer> = generateMap();
I want to add a new column to my DataFrame where its value is fetched from this map, using a current column's value as the key. A solution might look like this:
val newDataFrame = dataFrame.withColumn("NEW_COLUMN", lit(myMap.get(col("EXISTING_COLUMN"))))
My issue with this code is that the col function doesn't return an Int, which is what the keys in my HashMap are.
Any suggestions?
I would create a DataFrame from the map and then do a join operation. It should be faster, and the mapping DataFrame can be reused.
A UDF (user-defined function) can also be used, but UDFs are black boxes to Catalyst, so I would be prudent about using them. Depending on where the content of the map lives, it may also be complicated to pass it to a UDF.
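A rough sketch of that join approach in Java (since the parent question is Java Spark), assuming myMap is a java.util.Map<Integer, Integer>, spark is your SparkSession, and the usual org.apache.spark.sql imports are in place:
// Turn the in-memory map into a two-column DataFrame: (key, NEW_COLUMN).
StructType mapSchema = new StructType()
        .add("key", DataTypes.IntegerType)
        .add("NEW_COLUMN", DataTypes.IntegerType);
List<Row> rows = new ArrayList<>();
for (Map.Entry<Integer, Integer> e : myMap.entrySet()) {
    rows.add(RowFactory.create(e.getKey(), e.getValue()));
}
Dataset<Row> mapDF = spark.createDataFrame(rows, mapSchema);

// Left join on the existing column; rows with no match in the map get null in NEW_COLUMN.
Dataset<Row> newDataFrame = dataFrame
        .join(mapDF, dataFrame.col("EXISTING_COLUMN").equalTo(mapDF.col("key")), "left")
        .drop("key");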
As of the next version of the Kotlin API for Apache Spark, you will be able to simply create a UDF that is usable in roughly this way:
val mapUDF by udf { input: Int -> myMap[input] }
dataFrame.withColumn("NEW_COLUMN", mapUDF(col("EXISTING_COLUMN")))
You need to use a UDF.
val mapUDF = udf((i: Int) => myMap.getOrElse(i, 0))
val newDataFrame = dataFrame.withColumn("NEW_COLUMN", mapUDF(col("EXISTING_COLUMN")))
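For the Java API (which the parent question uses), a hedged equivalent would register the UDF and call it by name; this sketch assumes myMap is a serializable java.util.Map<Integer, Integer>, spark is the SparkSession, and the usual functions.* static imports:
// the lambda captures myMap, so the map must be serializable to ship to executors
spark.udf().register("mapUDF",
        (UDF1<Integer, Integer>) key -> myMap.getOrDefault(key, 0),
        DataTypes.IntegerType);

Dataset<Row> newDataFrame = dataFrame
        .withColumn("NEW_COLUMN", callUDF("mapUDF", col("EXISTING_COLUMN")));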

How to filter dataset rows using variables

I'm trying to filter the rows of a dataset using variables, like so:
Dataset<Row> dataset = dF.select(dF.col("*")).filter(col(list.get(0)) == lit(list.get(1)));
But I get a compilation error:
Cannot resolve filter(boolean)
What's the solution to this?
filter takes a Column instead of a boolean as its parameter, so to compare columns you should use the equalTo method, which returns a Column, instead of ==:
Dataset<Row> dataset = dF.select(dF.col("*")).filter(col(list.get(0)).equalTo(lit(list.get(1))));

How to create an expression or condition from a string value inside spark dataframe

I am trying to filter a column in a dataframe using the filter() function.
The condition for the filter is saved in a string variable, like below.
val condition = ">10"
val outDF = df.filter(col("value") > expr(condition))
In the above code, is it possible to use expr or any SQL function to convert the condition string ">10" into an actual condition in the filter function?
Try the code below.
val condition = "> 10"
df.filter(s"value ${condition}")
OR
df
.filter(expr(s"value ${condition}"))
.show(false)
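The same idea carries over to the Java API; a sketch assuming df has a value column, as in the question:
String condition = "> 10";
// filter accepts a SQL expression string, so the condition can simply be concatenated in
df.filter("value " + condition).show(false);
// or, equivalently, wrapped in expr()
df.filter(functions.expr("value " + condition)).show(false);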

Spark - Performing union of Dataframes inside a for loop starting from empty DataFrame

I have a DataFrame with a column called "generationId" and other fields. The field "generationId" takes integer values from 1 to N (the upper bound N is known and is small, between 10 and 15), and I want to process the DataFrame in the following way (pseudo code):
results = emptyDataFrame <=== how do I do this ?
for (i <- 0 until getN(df)) {
val input = df.filter($"generationId" === i)
results.union(getModel(i).transform(input))
}
Here getN(df) gives the N for that data frame based on some criteria. In the loop, input is filtered based on matching against "i" and then fed to some model (some internal library) which transforms the input by adding 3 more columns to it.
Ultimately I would like to get the union of all those transformed data frames, so I have all the columns of the original data frame plus the 3 additional columns added by the model for each row. I am not able to figure out how to initialize results and union the results in each iteration. I do know the exact schema of the result ahead of time. So I did
val newSchema = ...
but I am not sure how to pass that to the emptyRDD function, build an empty DataFrame from it, and use it inside the loop.
Also, if there is a more efficient way to do this inside a map operation, please suggest it.
You can do something like this:
(0 until getN(df))
  .map(i => {
    val input = df.filter($"generationId" === i)
    getModel(i).transform(input)
  })
  .reduce(_ union _)
That way you don't need to worry about the empty DataFrame at all.
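If you do want to start from an explicitly empty DataFrame with the known schema (for example from Java), a minimal sketch might look like the following; spark is an assumed SparkSession, and newSchema, getN and getModel are placeholders taken from the question:
StructType newSchema = ...; // the known result schema
Dataset<Row> results = spark.createDataFrame(new ArrayList<Row>(), newSchema); // empty DataFrame
for (int i = 0; i < getN(df); i++) {
    Dataset<Row> input = df.filter(col("generationId").equalTo(i));
    // union returns a new Dataset, so reassign results on every iteration
    results = results.union(getModel(i).transform(input));
}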

Spark DataFrame created from JavaRDD<Row> copies all columns data into first column

I have a DataFrame which I need to convert into a JavaRDD<Row> and back to a DataFrame. I have the following code:
DataFrame sourceFrame = hiveContext.read().format("orc").load("/path/to/orc/file");
// I do order by on the above sourceFrame and then I convert it into a JavaRDD
JavaRDD<Row> modifiedRDD = sourceFrame.toJavaRDD().map(new Function<Row, Row>() {
    public Row call(Row row) throws Exception {
        if (row != null) {
            // updated row by creating new Row
            return RowFactory.create(updateRow);
        }
        return null;
    }
});
// now I convert the above JavaRDD<Row> into a DataFrame using the following
DataFrame modifiedFrame = sqlContext.createDataFrame(modifiedRDD, schema);
sourceFrame and modifiedFrame have the same schema. When I call sourceFrame.show() the output is as expected: every column has its corresponding values and no column is empty. But when I call modifiedFrame.show() I see all the column values merged into the first column. For example, assume the source DataFrame has 3 columns as shown below:
_col1 _col2 _col3
ABC 10 DEF
GHI 20 JKL
When I print modifiedFrame, which I converted from the JavaRDD, it shows the following:
_col1 _col2 _col3
ABC,10,DEF
GHI,20,JKL
As shown above, _col1 has all the values while _col2 and _col3 are empty. I don't know what is wrong.
As I mentioned in the question's comments:
It probably occurs because the list is passed as a single parameter.
return RowFactory.create(updateRow);
Looking at the Apache Spark docs and source code: in the example for programmatically specifying the schema, the values are passed one by one, one per column. If you inspect the RowFactory.java and GenericRow source, a single list parameter is not expanded into separate columns. So try passing one value per column of the row:
return RowFactory.create(updateRow.get(0), updateRow.get(1), updateRow.get(2)); // list example
You can also try converting your list to an array and passing that as the parameter:
YourObject[] updatedRowArray = new YourObject[updateRow.size()];
updateRow.toArray(updatedRowArray);
return RowFactory.create(updatedRowArray);
By the way, the RowFactory.create() method creates Row objects. From the Apache Spark documentation on the Row object and RowFactory.create():
Represents one row of output from a relational operator. Allows both generic access by ordinal, which will incur boxing overhead for primitives, as well as native primitive access. It is invalid to use the native primitive interface to retrieve a value that is null; instead a user must check isNullAt before attempting to retrieve a value that might be null.
To create a new Row, use RowFactory.create() in Java or Row.apply() in Scala.
A Row object can be constructed by providing field values. Example:
import org.apache.spark.sql._
// Create a Row from values.
Row(value1, value2, value3, ...)
// Create a Row from a Seq of values.
Row.fromSeq(Seq(value1, value2, ...))
According to the documentation, you can also apply your own logic to separate the row's columns while creating the Row objects. But I think converting the list to an array and passing it as the parameter will work for you (I couldn't try it myself, so please post your feedback, thanks).
