Spark DataFrame created from JavaRDD<Row> copies all columns data into first column - apache-spark

I have a DataFrame which I need to convert into JavaRDD<Row> and back to DataFrame I have the following code
DataFrame sourceFrame = hiveContext.read().format("orc").load("/path/to/orc/file");
//I do order by in above sourceFrame and then I convert it into JavaRDD
JavaRDD<Row> modifiedRDD = sourceFrame.toJavaRDD().map(new Function<Row,Row>({
public Row call(Row row) throws Exception {
if(row != null) {
//updated row by creating new Row
return RowFactory.create(updateRow);
}
return null;
});
//now I convert above JavaRDD<Row> into DataFrame using the following
DataFrame modifiedFrame = sqlContext.createDataFrame(modifiedRDD,schema);
sourceFrame and modifiedFrame schema is same when I call sourceFrame.show() output is expected I see every column has corresponding values and no column is empty but when I call modifiedFrame.show() I see all the columns values gets merged into first column value for e.g. assume source DataFrame has 3 column as shown below
_col1 _col2 _col3
ABC 10 DEF
GHI 20 JKL
When I print modifiedFrame which I converted from JavaRDD it shows in the following order
_col1 _col2 _col3
ABC,10,DEF
GHI,20,JKL
As shown above all the _col1 has all the values and _col2 and _col3 is empty. I don't know what is wrong.

As I mentioned in question's comment ;
It might occurs because of giving list as a one parameter.
return RowFactory.create(updateRow);
When investigated Apache Spark docs and source codes ; In that specifying schema example They assign parameters one by one for all columns respectively. Just investigate the some source code roughly RowFactory.java class and GenericRow class doesn't allocate that one parameter. So Try to give parameters respectively for row's column's.
return RowFactory.create(updateRow.get(0),updateRow.get(1),updateRow.get(2)); // List Example
You may try to convert your list to array and then pass as a parameter.
YourObject[] updatedRowArray= new YourObject[updateRow.size()];
updateRow.toArray(updatedRowArray);
return RowFactory.create(updatedRowArray);
By the way RowFactory.create() method is creating Row objects. In Apache Spark documentation about Row object and RowFactory.create() method;
Represents one row of output from a relational operator. Allows both generic access by ordinal, which will incur boxing overhead for
primitives, as well as native primitive access. It is invalid to use
the native primitive interface to retrieve a value that is null,
instead a user must check isNullAt before attempting to retrieve a
value that might be null.
To create a new Row, use RowFactory.create() in Java or Row.apply() in
Scala.
A Row object can be constructed by providing field values. Example:
import org.apache.spark.sql._
// Create a Row from values.
Row(value1, value2, value3, ...)
// Create a Row from a Seq of values.
Row.fromSeq(Seq(value1, value2, ...))
According to documentation; You can also apply your own required algorithm to seperate rows columns while creating Row objects respectively. But i think converting list to array and pass parameter as an array will work for you(I couldn't try please post your feedbacks, thanks).

Related

How to add column to a DataFrame where value is fetched from a map with other column from row as key

I'm new to Spark, and trying to figure out how I can add a column to a DataFrame where its value is fetched from a HashMap, where the key is another value on the same row which where the value is being set.
For example, I have a map defined as follows:
var myMap: Map<Integer,Integer> = generateMap();
I want to add a new column to my DataFrame where its value is fetched from this map, with the key a current column value. A solution might look like this:
val newDataFrame = dataFrame.withColumn("NEW_COLUMN", lit(myMap.get(col("EXISTING_COLUMN"))))
My issue with this code is that using the col function doesn't return a type of Int, like the keys in my HashMap.
Any suggestions?
I would create a dataframe from the map. Then do a join operation. It should be faster and can be reused.
A UDF (user-defined function) can also be used but they are black boxes to Catalyst, so I would be prudent in using them. Depending on where the content of the map is, it may also be complicated to pass it to a UDF.
As of the next version of Kotlin API for Apache Spark you will be able to simply create a udf which will be usable in almost this way.
val mapUDF by udf { input: Int -> myMap[input] }
dataFrame.withColumn("NEW_COLUMN", mapUDF(col("EXISTING_COLUMN")))
You need to use UDF.
val mapUDF = udf((i:Int)=>myMap.getOrElse(i,0))
val newDataFrame = dataFrame.withColumn("NEW_COLUMN", mapUDF(col("EXISTING_COLUMN")))

How to validate large csv file either column wise or row wise in spark dataframe

I have a large data file of 10GB or more with 150 columns in which we need to validate each of its data (datatype/format/null/domain value/primary key ..) with different rule and finally create 2 output file one is having success data and another having error data with error details. we need to move the row in error file if any of column having error at very first time no need to validate further.
I am reading a file in spark data frame does we validate it column-wise or row-wise by which way we got the best performance?
To answer your question
I am reading a file in spark data frame do we validate it column-wise or row-wise by which way we got the best performance?
DataFrame is a distributed collection of data that is organized as set of rows distributed across the cluster and most of the transformation which is defined in spark is applied on the rows which work on Row object .
Psuedo code
import spark.implicits._
val schema = spark.read.csv(ip).schema
spark.read.textFile(inputFile).map(row => {
val errorInfo : Seq[(Row,String,Boolean)] = Seq()
val data = schema.foreach(f => {
// f.dataType //get field type and have custom logic on field type
// f.name // get field name i.e., column name
// val fieldValue = row.getAs(f.name) //get field value and have check's on field value on field type
// if any error in field value validation then populate #errorInfo info object i.e (row,"error_info",false)
// otherwise i.e (row,"",true)
})
data.filter(x => x._3).write.save(correctLoc)
data.filter(x => !x._3).write.save(errorLoc)
})

Spark dataset : Casting Columns of dataset

This is my dataset :
Dataset<Row> myResult = pot.select(col("number")
, col("document")
, explode(col("mask")).as("mask"));
I need to now create a new dataset from the existing myResult . something like below:
Dataset<Row> myResultNew = myResult.select(col("number")
, col("name")
, col("age")
, col("class")
, col("mask");
name , age and class are created from column document from Dataset myResult .
I guess I can call functions on the column document and then perform any operation on that.
myResult.select(extract(col("document")));
private String extract(final Column document) {
//TODO ADD A NEW COLUMN nam, age, class TO THE NEW DATASET.
// PARSE DOCUMENT AND GET THEM.
XMLParser doc= (XMLParser) document // this doesnt work???????
}
My question is: document is of type column and I need to convert it into a different Object Type and parse it for extracting name , age ,class. How can I do that. document is an xml and i need to do parsing for getting the other 3 columns so cant avoid converting it to XML .
Converting the extract method into an UDF would be a solution that is as close as possible to what you are asking. An UDF can take the value of one or more columns and execute any logic with this input.
import org.apache.spark.sql.expressions.UserDefinedFunction;
import org.apache.spark.sql.types.DataTypes;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.udf;
[...]
UserDefinedFunction extract = udf(
(String document) -> {
List<String> result = new ArrayList<>();
XMLParser doc = XMLParser.parse(document);
String name = ... //read name from xml document
String age = ... //read age from xml document
String clazz = ... //read class from xml document
result.add(name);
result.add(age);
result.add(clazz);
return result;
}, DataTypes.createArrayType(DataTypes.StringType)
);
A restriction of UDFs is that they can only return one column. Therefore the function returns a String array that has to be unpacked afterwards.
Dataset<Row> myResultNew = myResult
.withColumn("extract", extract.apply(col("document"))) //1
.withColumn("name", col("extract").getItem(0)) //2
.withColumn("age", col("extract").getItem(1)) //2
.withColumn("class", col("extract").getItem(2)) //2
.drop("document", "extract"); //3
call the UDF and use the column that contains the xml document as parameter of the apply function
create the result columns out of the returned array from step 1
drop the intermediate columns
Note: the udf is executed once per row in the dataset. If the creation of the xml parser is expensive this might slow down the execution of the Spark job as one parser is instantiated per row. Due to the parallel nature of Spark it is not possible to reuse the parser for the next row. If this is an issue, another (at least in the Java world slightly more complex) option would be to use mapPartitions. Here one would not need one parser per row but only one parser per partition of the dataset.
A completely different approach would be to use spark-xml.

Apache Spark - Map function returning empty dataset in java

My Code:
finalJoined.show();
Encoder<Row> rowEncoder = Encoders.bean(Row.class);
Dataset<Row> validatedDS = finalJoined.map(row -> validationRowMap(row), rowEncoder);
validatedDS.show();
Map function :
public static Row validationRowMap(Row row) {
//PART-A validateTxn()
System.out.println("Inside map");
//System.out.println("Value of CIS_DIVISION is " + row.getString(7));
//1. CIS_DIVISION
if ((row.getString(7)) == null || (row.getString(7)).trim().isEmpty()) {
System.out.println("CIS_DIVISION cannot be blank.");
}
return row;
}
Output :
finalJoined Dataset<Row> is properly shown with all columns and rows with proper values, however validatedDS Dataset<Row>is shown with only one column with empty values.
*Expected output : *
validatedDS should also show same values as finalJoined dataset because I am only performing validation inside the map function and not changing the dataset itself.
Please let me know if you need more information.
Encoders.bean is intended for usage with Bean classes. Row is not one of these (doesn't define setter and getters for specific fields, only generic getters).
To return Row object you have to use RowEncoder and provide expected output schema.
Check for example Encoder for Row Type Spark Datasets

Spark - Performing union of Dataframes inside a for loop starting from empty DataFrame

I have a Dataframe with a column called "generationId" and other fields. Field "generationId" takes a range of integer values from 1 to N (upper bound to N is known and is small, between 10 and 15) and I want to process the DataFrame in the following way (pseudo code):
results = emptyDataFrame <=== how do I do this ?
for (i <- 0 until getN(df)) {
val input = df.filter($"generationId" === i)
results.union(getModel(i).transform(input))
}
Here getN(df) gives the N for that data frame based on some criteria. In the loop, input is filtered based on matching against "i" and then fed to some model (some internal library) which transforms the input by adding 3 more columns to it.
Ultimately I would like to get union of all those transformed data frames, so I have all columns of the original data frame plus the 3 additional columns added by the model for each row. I am not able to figure out how to initialize results and unionize the results in each iteration. I do know the exact schema of the result ahead of time. So I did
val newSchema = ...
but I am not sure how to pass that to emptyRDD function and build a empty Dataframe and use it inside the loop.
Also, if there is a much efficient way to do this inside map operation, please suggest.
you can do something like this:
(0 until getN(df))
.map(i => {
val input = df.filter($"generationId" === i)
getModel(i).transform(input)
})
.reduce(_ union _)
that way you don't need to worry about the empty df

Resources