Spark dataset : Casting Columns of dataset - apache-spark

This is my dataset :
Dataset<Row> myResult = pot.select(col("number")
, col("document")
, explode(col("mask")).as("mask"));
I now need to create a new dataset from the existing myResult, something like below:
Dataset<Row> myResultNew = myResult.select(col("number")
    , col("name")
    , col("age")
    , col("class")
    , col("mask"));
name, age and class are created from the column document of Dataset myResult.
I guess I can call a function on the column document and then perform some operation on it.
myResult.select(extract(col("document")));
private String extract(final Column document) {
    //TODO ADD NEW COLUMNS name, age, class TO THE NEW DATASET.
    // PARSE DOCUMENT AND GET THEM.
    XMLParser doc = (XMLParser) document; // this doesn't work???????
}
My question is: document is of type Column and I need to convert it into a different object type and parse it to extract name, age and class. How can I do that? document is an XML string and I need to parse it to get the other 3 columns, so I can't avoid converting it to XML.

Converting the extract method into a UDF would be a solution that is as close as possible to what you are asking. A UDF can take the values of one or more columns and execute any logic with this input.
import org.apache.spark.sql.expressions.UserDefinedFunction;
import org.apache.spark.sql.types.DataTypes;

import java.util.ArrayList;
import java.util.List;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.udf;

[...]

UserDefinedFunction extract = udf(
    (String document) -> {
        List<String> result = new ArrayList<>();
        XMLParser doc = XMLParser.parse(document);
        String name = ...  // read name from xml document
        String age = ...   // read age from xml document
        String clazz = ... // read class from xml document
        result.add(name);
        result.add(age);
        result.add(clazz);
        return result;
    }, DataTypes.createArrayType(DataTypes.StringType)
);
A restriction of UDFs is that they can only return a single column. Therefore the function returns a string array that has to be unpacked into separate columns afterwards.
Dataset<Row> myResultNew = myResult
    .withColumn("extract", extract.apply(col("document"))) //1
    .withColumn("name", col("extract").getItem(0))         //2
    .withColumn("age", col("extract").getItem(1))          //2
    .withColumn("class", col("extract").getItem(2))        //2
    .drop("document", "extract");                          //3
1. Call the UDF and use the column that contains the xml document as the parameter of the apply function.
2. Create the result columns out of the array returned in step 1.
3. Drop the intermediate columns.
Note: the UDF is executed once per row of the dataset. If the creation of the XML parser is expensive, this might slow down the Spark job, as one parser is instantiated per row; due to the parallel nature of Spark it is not possible to reuse the parser for the next row. If this is an issue, another (at least in the Java world slightly more complex) option would be to use mapPartitions. There, one would not need one parser per row but only one parser per partition of the dataset.
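For illustration, here is a minimal sketch of that mapPartitions variant. It assumes the Spark 2.x Java API (mapPartitions with a MapPartitionsFunction plus RowEncoder.apply), that the hypothetical XMLParser from the question can be instantiated once and reused, and that all columns are plain strings; the parsing itself is left elided just like in the UDF above.

import org.apache.spark.api.java.function.MapPartitionsFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

import java.util.ArrayList;
import java.util.List;

// schema of the result: number, name, age, class, mask (all assumed to be strings here)
StructType resultSchema = new StructType()
    .add("number", DataTypes.StringType)
    .add("name", DataTypes.StringType)
    .add("age", DataTypes.StringType)
    .add("class", DataTypes.StringType)
    .add("mask", DataTypes.StringType);

Dataset<Row> myResultNew = myResult.mapPartitions(
    (MapPartitionsFunction<Row, Row>) rows -> {
        XMLParser parser = new XMLParser(); // hypothetical parser, created once per partition
        List<Row> output = new ArrayList<>();
        while (rows.hasNext()) {
            Row row = rows.next();
            String document = row.getAs("document"); // xml content of the current row
            String name = ...  // read name from the xml document using the shared parser
            String age = ...   // read age from the xml document
            String clazz = ... // read class from the xml document
            output.add(RowFactory.create(row.getAs("number"), name, age, clazz, row.getAs("mask")));
        }
        return output.iterator();
    },
    RowEncoder.apply(resultSchema));

Compared to the UDF version, the setup cost of the parser is paid once per partition instead of once per row.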
A completely different approach would be to use spark-xml.

Related

How to validate large csv file either column wise or row wise in spark dataframe

I have a large data file of 10 GB or more with 150 columns, in which we need to validate each field (datatype/format/null/domain value/primary key ...) against different rules, and finally create two output files: one containing the successful data and another containing the error data with error details. We need to move a row to the error file as soon as any of its columns fails validation; there is no need to validate that row further.
I am reading the file into a Spark data frame. Should we validate it column-wise or row-wise, and which way gives the best performance?
To answer your question
I am reading a file into a Spark data frame. Do we validate it column-wise or row-wise, and which way gives the best performance?
A DataFrame is a distributed collection of data organized as a set of rows distributed across the cluster, and most of the transformations defined in Spark are applied on the rows, working on the Row object.
Pseudo code:
import spark.implicits._

// infer the schema once from the csv file
val schema = spark.read.csv(ip).schema

// validate row-wise: map every row to (row, errorInfo, isValid)
val validated = spark.read.textFile(inputFile).map(row => {
  val errors: Seq[String] = schema.flatMap(f => {
    // f.dataType // get the field type and apply custom validation logic per type
    // f.name     // get the field name, i.e. the column name
    // val fieldValue = row.getAs(f.name) // get the field value and check it against the field type
    // if the field value fails validation return Some("error_info"), otherwise None
    None
  })
  (row, errors.mkString(";"), errors.isEmpty)
})

// write the valid rows and the error rows to separate locations
validated.filter(x => x._3).write.save(correctLoc)
validated.filter(x => !x._3).write.save(errorLoc)

Apache Spark - Map function returning empty dataset in java

My Code:
finalJoined.show();
Encoder<Row> rowEncoder = Encoders.bean(Row.class);
Dataset<Row> validatedDS = finalJoined.map(row -> validationRowMap(row), rowEncoder);
validatedDS.show();
Map function :
public static Row validationRowMap(Row row) {
    //PART-A validateTxn()
    System.out.println("Inside map");
    //System.out.println("Value of CIS_DIVISION is " + row.getString(7));
    //1. CIS_DIVISION
    if ((row.getString(7)) == null || (row.getString(7)).trim().isEmpty()) {
        System.out.println("CIS_DIVISION cannot be blank.");
    }
    return row;
}
Output :
The finalJoined Dataset<Row> is shown properly, with all columns and rows containing the correct values; however, the validatedDS Dataset<Row> is shown with only one column, with empty values.
Expected output:
validatedDS should show the same values as the finalJoined dataset, because I am only performing validation inside the map function and not changing the dataset itself.
Please let me know if you need more information.
Encoders.bean is intended for use with bean classes. Row is not one of these (it doesn't define setters and getters for specific fields, only generic getters).
To return a Row object you have to use RowEncoder and provide the expected output schema.
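A minimal sketch of that change, assuming the Spark 2.x Java API where RowEncoder.apply(StructType) is available and reusing the validationRowMap method from the question; the mapped rows simply keep the schema of the input dataset.

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;
import org.apache.spark.sql.types.StructType;

// reuse the schema of the input dataset so the mapped rows keep all columns
StructType schema = finalJoined.schema();

Dataset<Row> validatedDS = finalJoined.map(
    (MapFunction<Row, Row>) row -> validationRowMap(row),
    RowEncoder.apply(schema));
validatedDS.show();

The cast to MapFunction<Row, Row> just picks the Java-friendly overload of map; without it the lambda is ambiguous between the Java and Scala variants.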
Check for example Encoder for Row Type Spark Datasets

Define UDF in Spark Scala

I need to use a UDF in Spark that takes in a timestamp, an Integer and another dataframe and returns a tuple of 3 values.
I keep hitting error after error and I'm no longer sure I'm fixing it the right way.
Here is the function:
def determine_price(view_date: org.apache.spark.sql.types.TimestampType, product_id: Int, price_df: org.apache.spark.sql.DataFrame): (Double, java.sql.Timestamp, Double) = {
  var price_df_filtered = price_df.filter($"mkt_product_id" === product_id && $"created" <= view_date)
  var price_df_joined = price_df_filtered.groupBy("mkt_product_id").agg("view_price" -> "min", "created" -> "max").withColumn("last_view_price_change", lit(1))
  var price_df_final = price_df_joined.join(price_df_filtered, price_df_joined("max(created)") === price_df_filtered("created")).filter($"last_view_price_change" === 1)
  var result = (price_df_final.select("view_price").head().getDouble(0), price_df_final.select("created").head().getTimestamp(0), price_df_final.select("min(view_price)").head().getDouble(0))
  return result
}
val det_price_udf = udf(determine_price)
the error it gives me is:
error: missing argument list for method determine_price
Unapplied methods are only converted to functions when a function type is expected.
You can make this conversion explicit by writing `determine_price _` or `determine_price(_,_,_)` instead of `determine_price`.
If I start adding the arguments I keep running into other errors, such as Int expected, Int.type found or object DataFrame is not a member of package org.apache.spark.sql.
To give some context:
The idea is that I have a dataframe of prices, a product id and a date of creation and another dataframe containing product IDs and view dates.
I need to determine the price based on which was the last created price entry that is older than the view date.
Since each product ID has multiple view dates in the second dataframe, I thought a UDF would be faster than a cross join. If anyone has a different idea, I'd be grateful.
You cannot pass a DataFrame into a UDF, because the UDF runs on the workers, on a particular partition. And just as you cannot use an RDD on a worker (see Is it possible to create nested RDDs in Apache Spark?), you cannot use a DataFrame on a worker either.
You need a workaround for this!

CreateDataFrame or SaveAsTable intuitively encode in pyspark 1.6

I am trying to save a table in Spark 1.6 using pyspark. All of the table's columns are saved as text; I'm wondering if I can change this:
product = sc.textFile('s3://path/product.txt')
product = m3product.map(lambda x: x.split("\t"))
product = sqlContext.createDataFrame(product, ['productid', 'marketID', 'productname', 'prod'])
product.saveAsTable("product", mode='overwrite')
Is there something in the last 2 commands that could automatically recognize productid and marketid as numeric? I have a lot of files and a lot of fields to upload, so ideally it would be automatic.
Is there something in the last 2 commands that could automatically recognize productid and marketid as numerics
If you pass int or float (depending on what you need) pyspark will convert the data type for you.
In your case, changing the lambda function in
product = m3product.map(lambda x: x.split("\t"))
product = sqlContext.createDataFrame(product, ['productid', 'marketID', 'productname', 'prod'])
to
from pyspark.sql.types import Row
def split_product_line(line):
    fields = line.split('\t')
    return Row(
        productid=int(fields[0]),
        marketID=int(fields[1]),
        ...
    )
product = m3product.map(split_product_line).toDF()
You will find it much easier to control data types and to add error/exception checks.
Try to avoid lambda functions if possible :)

Spark DataFrame created from JavaRDD<Row> copies all columns data into first column

I have a DataFrame which I need to convert into JavaRDD<Row> and back to a DataFrame. I have the following code:
DataFrame sourceFrame = hiveContext.read().format("orc").load("/path/to/orc/file");
//I do order by in above sourceFrame and then I convert it into JavaRDD
JavaRDD<Row> modifiedRDD = sourceFrame.toJavaRDD().map(new Function<Row, Row>() {
    public Row call(Row row) throws Exception {
        if (row != null) {
            //updated row by creating new Row
            return RowFactory.create(updateRow);
        }
        return null;
    }
});
//now I convert above JavaRDD<Row> into DataFrame using the following
DataFrame modifiedFrame = sqlContext.createDataFrame(modifiedRDD, schema);
The schemas of sourceFrame and modifiedFrame are the same. When I call sourceFrame.show() the output is as expected: every column has its corresponding values and no column is empty. But when I call modifiedFrame.show() I see all the column values merged into the first column. For example, assume the source DataFrame has 3 columns as shown below:
_col1 _col2 _col3
ABC 10 DEF
GHI 20 JKL
When I print modifiedFrame, which I converted from the JavaRDD, it shows the values in the following way:
_col1 _col2 _col3
ABC,10,DEF
GHI,20,JKL
As shown above, _col1 contains all the values while _col2 and _col3 are empty. I don't know what is wrong.
As I mentioned in the question's comments:
It might occur because the list is passed as one single parameter.
return RowFactory.create(updateRow);
Looking into the Apache Spark docs and source code: in the schema-specifying example there, the parameters are assigned one by one for all columns respectively. A rough look at the source code (the RowFactory.java and GenericRow classes) shows that a single list parameter is not spread over the columns. So try to pass the parameters for the row's columns individually:
return RowFactory.create(updateRow.get(0), updateRow.get(1), updateRow.get(2)); // List Example
You may also try to convert your list to an array and then pass it as a parameter.
YourObject[] updatedRowArray = new YourObject[updateRow.size()];
updateRow.toArray(updatedRowArray);
return RowFactory.create(updatedRowArray);
By the way, the RowFactory.create() method creates Row objects. From the Apache Spark documentation about the Row object and the RowFactory.create() method:
Represents one row of output from a relational operator. Allows both generic access by ordinal, which will incur boxing overhead for primitives, as well as native primitive access. It is invalid to use the native primitive interface to retrieve a value that is null, instead a user must check isNullAt before attempting to retrieve a value that might be null.
To create a new Row, use RowFactory.create() in Java or Row.apply() in Scala.
A Row object can be constructed by providing field values. Example:
import org.apache.spark.sql._
// Create a Row from values.
Row(value1, value2, value3, ...)
// Create a Row from a Seq of values.
Row.fromSeq(Seq(value1, value2, ...))
According to the documentation, you can also apply your own algorithm to separate the row's columns while creating the Row objects. But I think converting the list to an array and passing it as a parameter will work for you (I couldn't try it, please post your feedback, thanks).
