Field data validation using spark dataframe - apache-spark

I have a bunch of columns; a sample of my data is shown below.
I need to check the columns for errors and will have to generate two output files.
I'm using Apache Spark 2.0 and I would like to do this in an efficient way.
Schema Details
---------------
EMPID - (NUMBER)
ENAME - (STRING,SIZE(50))
GENDER - (STRING,SIZE(1))
Data
----
EMPID,ENAME,GENDER
1001,RIO,M
1010,RICK,MM
1015,123MYA,F
My expected output files should be as shown below:
1.
EMPID,ENAME,GENDER
1001,RIO,M
1010,RICK,NULL
1015,NULL,F
2.
EMPID,ERROR_COLUMN,ERROR_VALUE,ERROR_DESCRIPTION
1010,GENDER,"MM","OVERSIZED"
1010,GENDER,"MM","VALUE INVALID FOR GENDER"
1015,ENAME,"123MYA","NAME SHOULD BE A STRING"
Thanks

I have not really worked with Spark 2.0, so I'll try answering your question with a solution in Spark 1.6.
// Load your base data
val input = <<your input dataframe>>
// Extract the schema of your base data
val originalSchema = input.schema
// Modify your existing schema with the additional metadata fields
val modifiedSchema = originalSchema.add("ERROR_COLUMN", StringType, true)
  .add("ERROR_VALUE", StringType, true)
  .add("ERROR_DESCRIPTION", StringType, true)
// Write a custom validation function
def validateColumns(row: Row): Row = {
  var err_col: String = null
  var err_val: String = null
  var err_desc: String = null
  val empId = row.getAs[String]("EMPID")
  val ename = row.getAs[String]("ENAME")
  val gender = row.getAs[String]("GENDER")
  // Do the checking here and populate (err_col, err_val, err_desc) with values if applicable
  Row.merge(row, Row(err_col), Row(err_val), Row(err_desc))
}
// Call your custom validation function
val validateDF = input.map { row => validateColumns(row) }
// Reconstruct the DataFrame with the additional columns
val checkedDf = sqlContext.createDataFrame(validateDF, modifiedSchema)
// Filter out rows having errors
val errorDf = checkedDf.filter($"ERROR_COLUMN".isNotNull && $"ERROR_VALUE".isNotNull && $"ERROR_DESCRIPTION".isNotNull)
// Filter out rows having no errors
val errorFreeDf = checkedDf.filter($"ERROR_COLUMN".isNull && $"ERROR_VALUE".isNull && $"ERROR_DESCRIPTION".isNull)
I have used this approach personally and it works for me. I hope it points you in the right direction.
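Since the question targets Spark 2.0, a similar result can also be sketched without mapping over rows, by building both outputs with when/otherwise column expressions. The concrete validation rules below (GENDER must be M or F, ENAME must be letters only and at most 50 characters) are only my reading of the schema and are an assumption:
import org.apache.spark.sql.functions._

val genderOk = col("GENDER").isin("M", "F")
val enameOk  = col("ENAME").rlike("^[A-Za-z]{1,50}$")

// Output 1: keep every row, but null out invalid values
val cleanedDf = input
  .withColumn("GENDER", when(genderOk, col("GENDER")).otherwise(lit(null)))
  .withColumn("ENAME", when(enameOk, col("ENAME")).otherwise(lit(null)))

// Output 2: one error record per failed check
val genderErrors = input.filter(!genderOk).select(
  col("EMPID"), lit("GENDER").as("ERROR_COLUMN"),
  col("GENDER").as("ERROR_VALUE"), lit("VALUE INVALID FOR GENDER").as("ERROR_DESCRIPTION"))
val enameErrors = input.filter(!enameOk).select(
  col("EMPID"), lit("ENAME").as("ERROR_COLUMN"),
  col("ENAME").as("ERROR_VALUE"), lit("NAME SHOULD BE A STRING").as("ERROR_DESCRIPTION"))
val errorDf = genderErrors.union(enameErrors)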

Related

Join apache spark dataframes properly with scala avoiding null values

Hello everyone!
I have two DataFrames in Apache Spark (2.3) and I want to join them properly. I will explain below what I mean by 'properly'. First of all, the two dataframes hold the following information:
nodeDf: ( id, year, title, authors, journal, abstract )
edgeDf: ( srcId, dstId, label )
The label is 1 if node1 is connected with node2, and 0 otherwise.
I want to combine these two dataframes to get one dataframe with the following information:
JoinedDF: ( id_from, year_from, title_from, journal_from, abstract_from, id_to, year_to, title_to, journal_to, abstract_to, time_dist )
time_dist = abs(year_from - year_to)
When I said 'properly' I meant that the query must be as fast as possible and I don't want it to contain null rows or cells (values in a row).
I have tried the following, but it took me 500-540 sec to execute the query and the final dataframe contains null values. I don't even know if the dataframes were joined correctly.
I want to mention that the node file from which I create the nodeDF has 27770 rows and the edge file (edgeDf) has 615512 rows.
Code:
val spark = SparkSession.builder().master("local[*]").appName("Logistic Regression").getOrCreate()
val sc = spark.sparkContext
val data = sc.textFile("resources/data/training_set.txt").map(line => {
  val fields = line.split(" ")
  (fields(0), fields(1), fields(2).toInt)
})
val data2 = sc.textFile("resources/data/test_set.txt").map(line => {
  val fields = line.split(" ")
  (fields(0), fields(1))
})
import spark.implicits._
val trainingDF = data.toDF("srcId","dstId", "label")
val testDF = data2.toDF("srcId","dstId")
val infoRDD = spark.read.option("header","false").option("inferSchema","true").format("csv").load("resources/data/node_information.csv")
val infoDF = infoRDD.toDF("srcId","year","title","authors","jurnal","abstract")
println("Showing linksDF sample...")
trainingDF.show(5)
println("Rows of linksDF: ",trainingDF.count())
println("Showing infoDF sample...")
infoDF.show(2)
println("Rows of infoDF: ",infoDF.count())
println("Joining linksDF and infoDF...")
var joinedDF = trainingDF.as("a").join(infoDF.as("b"),$"a.srcId" === $"b.srcId")
println(joinedDF.count())
joinedDF = joinedDF.select($"a.srcId",$"a.dstId",$"a.label",$"b.year",$"b.title",$"b.authors",$"b.jurnal",$"b.abstract")
joinedDF.show(5)
val graphX = new GraphX()
val pageRankDf =graphX.computePageRank(spark,"resources/data/training_set.txt",0.0001)
println("Joining joinedDF and pageRankDf...")
joinedDF = joinedDF.as("a").join(pageRankDf.as("b"),$"a.srcId" === $"b.nodeId")
var dfWithRanks = joinedDF.select("srcId","dstId","label","year","title","authors","jurnal","abstract","rank").withColumnRenamed("rank","pgRank")
dfWithRanks.show(5)
println("Renameming joinedDF...")
dfWithRanks = dfWithRanks
.withColumnRenamed("srcId","id_from")
.withColumnRenamed("dstId","id_to")
.withColumnRenamed("year","year_from")
.withColumnRenamed("title","title_from")
.withColumnRenamed("authors","authors_from")
.withColumnRenamed("jurnal","jurnal_from")
.withColumnRenamed("abstract","abstract_from")
var infoDfRenamed = dfWithRanks
.withColumnRenamed("id_from","id_from")
.withColumnRenamed("id_to","id_to")
.withColumnRenamed("year_from","year_to")
.withColumnRenamed("title_from","title_to")
.withColumnRenamed("authors_from","authors_to")
.withColumnRenamed("jurnal_from","jurnal_to")
.withColumnRenamed("abstract_from","abstract_to").select("id_to","year_to","title_to","authors_to","jurnal_to","jurnal_to")
var finalDF = dfWithRanks.as("a").join(infoDF.as("b"),$"a.id_to" === $"b.srcId")
finalDF = finalDF
.withColumnRenamed("year","year_to")
.withColumnRenamed("title","title_to")
.withColumnRenamed("authors","authors_to")
.withColumnRenamed("jurnal","jurnal_to")
.withColumnRenamed("abstract","abstract_to")
println("Dropping unused columns from joinedDF...")
finalDF = finalDF.drop("srcId")
finalDF.show(5)
Here are my results!
Please ignore all calculations and code related to pgRank! Is there any proper way to make this join work?
You can filter your data first and then join; that way you will avoid nulls:
df.filter($"ColumnName".isNotNull)
Or use the null-safe equality operator <=> in your join condition:
var joinedDF = trainingDF.as("a").join(infoDF.as("b"),$"a.srcId" <=> $"b.srcId")
In Spark 2.1 or greater there is also the equivalent eqNullSafe function:
var joinedDF = trainingDF.join(infoDF,trainingDF("srcId").eqNullSafe(infoDF("srcId")))
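Putting the filtering advice and the two joins together, here is a rough sketch of producing the _from/_to columns and time_dist the question asks for (the exact select list is an assumption based on the column names above):
import spark.implicits._
import org.apache.spark.sql.functions.abs

// Drop rows with null join keys first, so the joined result contains no null cells from the keys
val cleanLinks = trainingDF.filter($"srcId".isNotNull && $"dstId".isNotNull)
val cleanInfo  = infoDF.filter($"srcId".isNotNull)

val joinedDF = cleanLinks.as("e")
  .join(cleanInfo.as("src"), $"e.srcId" === $"src.srcId")
  .join(cleanInfo.as("dst"), $"e.dstId" === $"dst.srcId")
  .select(
    $"e.srcId".as("id_from"), $"src.year".as("year_from"), $"src.title".as("title_from"),
    $"src.jurnal".as("journal_from"), $"src.abstract".as("abstract_from"),
    $"e.dstId".as("id_to"), $"dst.year".as("year_to"), $"dst.title".as("title_to"),
    $"dst.jurnal".as("journal_to"), $"dst.abstract".as("abstract_to"),
    abs($"src.year" - $"dst.year").as("time_dist"),
    $"e.label")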

How to set the property name when converting an array column to json in spark? (w/o udf) [duplicate]

Is there a simple way to convert a given Row object to JSON?
I found this about converting a whole DataFrame to JSON output:
Spark Row to JSON
But I just want to convert a single Row to JSON.
Here is pseudo code for what I am trying to do.
More precisely I am reading json as input in a Dataframe.
I am producing a new output that is mainly based on columns, but with one json field for all the info that does not fit into the columns.
My question is: what is the easiest way to write this function, convertRowToJson()?
def convertRowToJson(row: Row): String = ???
def transformVenueTry(row: Row): Try[Venue] = {
  Try({
    val name = row.getString(row.fieldIndex("name"))
    val metadataRow = row.getStruct(row.fieldIndex("meta"))
    val score: Double = calcScore(row)
    val combinedRow: Row = metadataRow ++ ("score" -> score)
    val jsonString: String = convertRowToJson(combinedRow)
    Venue(name = name, json = jsonString)
  })
}
Psidom's solution:
def convertRowToJSON(row: Row): String = {
  val m = row.getValuesMap(row.schema.fieldNames)
  JSONObject(m).toString()
}
only works if the Row has a single level, not with nested Rows. This is the schema:
StructType(
  StructField(indicator,StringType,true),
  StructField(range,
    StructType(
      StructField(currency_code,StringType,true),
      StructField(maxrate,LongType,true),
      StructField(minrate,LongType,true)),true))
I also tried Artem's suggestion, but that did not compile:
def row2DataFrame(row: Row, sqlContext: SQLContext): DataFrame = {
  val sparkContext = sqlContext.sparkContext
  import sparkContext._
  import sqlContext.implicits._
  import sqlContext._
  val rowRDD: RDD[Row] = sqlContext.sparkContext.makeRDD(row :: Nil)
  val dataFrame = rowRDD.toDF() // XXX does not compile
  dataFrame
}
You can use getValuesMap to convert the row object to a Map and then convert it to JSON:
import scala.util.parsing.json.JSONObject
import org.apache.spark.sql._
val df = Seq((1,2,3),(2,3,4)).toDF("A", "B", "C")
val row = df.first() // this is an example row object
def convertRowToJSON(row: Row): String = {
  val m = row.getValuesMap(row.schema.fieldNames)
  JSONObject(m).toString()
}
convertRowToJSON(row)
// res46: String = {"A" : 1, "B" : 2, "C" : 3}
I need to read JSON input and produce JSON output.
Most fields are handled individually, but a few JSON sub-objects need to just be preserved.
When Spark reads a dataframe it turns each record into a Row. The Row is a JSON-like structure that can be transformed and written out to JSON.
But I need to take some sub-JSON structures out to a string to use as a new field.
This can be done like this:
dataFrameWithJsonField = dataFrame.withColumn("address_json", to_json($"location.address"))
location.address is the path to the sub-JSON object in the incoming JSON-based dataframe; address_json is the column name for that object converted to a string version of the JSON.
to_json is implemented in Spark 2.1.
If generating the output JSON using json4s, address_json should be parsed into an AST representation; otherwise the output JSON will have the address_json part escaped.
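For the json4s point, here is a minimal sketch of parsing the stringified column back into an AST before embedding it in the final output (the embedAddress helper and its field names are hypothetical, just for illustration):
import org.json4s._
import org.json4s.jackson.JsonMethods.{compact, parse, render}

// Hypothetical helper: addressJson is the string value taken from the address_json column
def embedAddress(name: String, addressJson: String): String = {
  val addressAst: JValue = parse(addressJson) // re-parse so it is embedded as JSON, not as an escaped string
  val output: JValue = JObject(
    "name"    -> JString(name),
    "address" -> addressAst
  )
  compact(render(output))
}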
Note that the Scala class scala.util.parsing.json.JSONObject is deprecated and does not support null values.
@deprecated("This class will be removed.", "2.11.0")
"JSONFormat.defaultFormat doesn't handle null values"
https://issues.scala-lang.org/browse/SI-5092
JSON has a schema but a Row doesn't carry one by itself, so you need to apply a schema to the Row and then convert it to JSON. Here is how you can do it.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
def convertRowToJson(row: Row): String = {
  val schema = StructType(
    StructField("name", StringType, true) ::
    StructField("meta", StringType, false) :: Nil)
  // applySchema expects an RDD of rows, and toJSON yields one JSON string per row
  sqlContext.applySchema(sqlContext.sparkContext.parallelize(Seq(row)), schema).toJSON.first()
}
Essentially, you can have a dataframe which contains just one row. Thus, you can filter your initial dataframe down to that row and then convert it to JSON.
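A minimal sketch of that idea (the filter predicate here is hypothetical; any condition that isolates the single row you care about will do):
import org.apache.spark.sql.functions.col

// Narrow the dataframe down to the one row of interest, then let Spark serialize it
val jsonString: String = df
  .filter(col("name") === "some venue") // hypothetical predicate
  .toJSON
  .first()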
I had the same issue: I had parquet files with a canonical schema (no arrays), and I only wanted to get JSON events. I did it as follows, and it seems to work just fine (Spark 2.1):
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import scala.util.parsing.json.JSONFormat.ValueFormatter
import scala.util.parsing.json.{JSONArray, JSONFormat, JSONObject}
def getValuesMap[T](row: Row, schema: StructType): Map[String, Any] = {
  schema.fields.map { field =>
    try {
      if (field.dataType.typeName.equals("struct")) {
        field.name -> getValuesMap(row.getAs[Row](field.name), field.dataType.asInstanceOf[StructType])
      } else {
        field.name -> row.getAs[T](field.name)
      }
    } catch { case e: Exception => field.name -> null.asInstanceOf[T] }
  }.filter(xy => xy._2 != null).toMap
}

def convertRowToJSON(row: Row, schema: StructType): JSONObject = {
  val m: Map[String, Any] = getValuesMap(row, schema)
  JSONObject(m)
}

// Since the map values are typed as Any, the regular ValueFormatter did not handle nested maps,
// so the case `jmap: Map[String, Any] => JSONObject(jmap).toString(defaultFormatter)` had to be added.
val defaultFormatter: ValueFormatter = (x: Any) => x match {
  case s: String => "\"" + JSONFormat.quoteString(s) + "\""
  case jo: JSONObject => jo.toString(defaultFormatter)
  case jmap: Map[String, Any] => JSONObject(jmap).toString(defaultFormatter)
  case ja: JSONArray => ja.toString(defaultFormatter)
  case other => other.toString
}
val someFile = "s3a://bucket/file"
val df: DataFrame = sqlContext.read.load(someFile)
val schema: StructType = df.schema
val jsons: Dataset[JSONObject] = df.map(row => convertRowToJSON(row, schema))
If you are iterating through a data frame, you can directly convert the data frame to a new dataframe with a JSON object inside and iterate over that:
val df_json = df.toJSON
I combined the suggestions from Artem, KiranM and Psidom. After a lot of trial and error I came up with this solution, which I tested for nested structures:
def row2Json(row: Row, sqlContext: SQLContext): String = {
  import sqlContext.implicits._
  val rowRDD: RDD[Row] = sqlContext.sparkContext.makeRDD(row :: Nil)
  val dataframe = sqlContext.createDataFrame(rowRDD, row.schema)
  dataframe.toJSON.first
}
This solution worked, but only while running in driver mode.

Applying a specific schema on an Apache Spark data frame

I am trying to apply a particular schema to a dataframe. The schema seems to have been applied, but all dataframe operations like count, show, etc. always fail with a NullPointerException, as shown below:
java.lang.NullPointerException was thrown.
java.lang.NullPointerException
at org.apache.spark.sql.catalyst.expressions.AttributeReference.hashCode(namedExpressions.scala:218)
at scala.runtime.ScalaRunTime$.hash(ScalaRunTime.scala:210)
Here is my code:
var fieldSchema = ListBuffer[StructField]()
val columns = mydf.columns
for (i <- 0 until columns.length) {
  val columns = mydf.columns
  val colName = columns(i)
  fieldSchema += StructField(colName, mydf.schema(i).dataType, true, null)
}
val schema = StructType(fieldSchema.toList)
val newdf = sqlContext.createDataFrame(df.rdd, schema) // df is the original dataframe
newdf.printSchema()                                    // this prints the new applied schema
println("newdf count:" + newdf.count())                // this fails with a null pointer exception
In short, there are actually 3 dataframes:
df - the original data frame
mydf - the dataframe the schema I'm trying to apply to df comes from
newdf - a new dataframe, the same as df, but with the different schema
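One thing worth checking: the fourth argument of StructField is a Metadata object, so passing null there (as in the loop above) leaves null metadata on every field. Below is a hedged sketch of building the same schema with Metadata.empty instead; whether this is actually the cause of the NullPointerException in this particular setup is an assumption.
import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.types.{Metadata, StructField, StructType}

val fieldSchema = ListBuffer[StructField]()
for (i <- 0 until mydf.columns.length) {
  // Metadata.empty (or simply leaving the argument out) instead of null
  fieldSchema += StructField(mydf.columns(i), mydf.schema(i).dataType, nullable = true, Metadata.empty)
}
val schema = StructType(fieldSchema.toList)
val newdf = sqlContext.createDataFrame(df.rdd, schema)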

Filtering a Dataframe based on another Dataframe in Spark

I have a dataframe df with columns
date: timestamp
status : String
name : String
I'm trying to find the last status of all the names.
val users = df.select("name").distinct
val final_status = users.map( t => {
  val _name = t.getString(0)
  val record = df.where(col("name") === _name)
  val lastRecord = record.sort(desc("date")).first
  lastRecord
})
This works with an array, but with a Spark dataframe it throws a java.lang.NullPointerException.
Update1 : Using removeDuplicates
df.sort(desc("date")).removeDuplicates("name")
Is this a good solution?
This
df.sort(desc("date")).removeDuplicates("name")
is not guaranteed to work. The solutions in response to this question should work for you
spark: How to do a dropDuplicates on a dataframe while keeping the highest timestamped row
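For completeness, here is a minimal sketch of the window-function approach from the linked question, keeping the highest-timestamped row per name (column names follow the question):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc, row_number}

// Rank the records of each name by date descending and keep only the newest one
val w = Window.partitionBy("name").orderBy(desc("date"))
val final_status = df
  .withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")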

how to introduce the schema in a Row in Spark?

In the Row Java API there is a row.schema(); however, there is no row.set(StructType schema).
I also tried RowFactory.create(objects), but I don't know how to proceed.
UPDATE:
The problem is how to generate a new dataframe when I modify the structure in the workers. Here is the example:
DataFrame sentenceData = jsql.createDataFrame(jrdd, schema);
List<Row> resultRows2 = sentenceData.toJavaRDD()
    .map(new MyFunction<Row, Row>(parameters) {
        /** my map function */
        public Row call(Row row) {
            // I want to change the Row definition, adding new columns
            Row newRow = functionAddnewNewColumns(row);
            StructType newSchema = functionGetNewSchema(row.schema());
            // Here I want to insert the structure
            return newRow;
        }
    }).collect();
JavaRDD<Row> newJrdd = jsc.parallelize(resultRows2);
// Here is the problem: I don't know how to get the new schema to create the new modified dataframe
DataFrame newDataframe = jsql.createDataFrame(newJrdd, newSchema);
You can create a row with Schema by using:
Row newRow = new GenericRowWithSchema(values, newSchema);
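Note that GenericRowWithSchema lives in a Catalyst (internal) package, so you need the corresponding import; a small Scala sketch with made-up values and schema:
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val newSchema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)
))

// The row now carries its schema: newRow.schema returns newSchema
val newRow: Row = new GenericRowWithSchema(Array[Any](1, "alice"), newSchema)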
You do not set a schema on a row - that makes no sense. You can, however, create a DataFrame (or pre-Spark 1.3 a JavaSchemaRDD) with a given schema using the sqlContext.
DataFrame dataFrame = sqlContext.createDataFrame(rowRDD, schema);
The dataframe will have the schema you have provided.
For further information, please consult the documentation at http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
EDIT: According to the updated question
You can generate new rows in your map function, which will give you a new RDD of type JavaRDD<Row>:
DataFrame sentenceData = jsql.createDataFrame(jrdd, schema);
JavaRDD<Row> newRowRDD = sentenceData
    .toJavaRDD()
    .map(row -> functionAddnewNewColumns(row)); // Assuming functionAddnewNewColumns returns a Row
You then define the new schema
StructField[] fields = new StructField[] {
    new StructField("column1", ...),
    new StructField("column2", ...),
    ...
};
StructType newSchema = new StructType(fields);
Create a new DataFrame from your newRowRDD with newSchema as the schema:
DataFrame newDataframe = jsql.createDataFrame(newRowRDD, newSchema);
This is a pretty old thread, but I just had a use case where I needed to generate data with Spark, quickly work with the data at the row level, and then build a new dataframe from the rows. It took me a bit to put it together, so maybe it will help someone.
Here we're taking a "template" row, modifying some data, adding a new column with the appropriate "row-level" schema and then using that new row and schema to create a new DF with the appropriate "new schema", so going "bottom up" :) This builds on @Christian's answer originally, so I am contributing a simplified snippet back.
def fillTemplateRow(row: Row, newUUID: String) = {
  var retSeq = Seq[Any]()
  (row.schema, row.toSeq).zipped.foreach((s, r) => {
    // println(s"s=${s},r=${r}")
    val retval = s.name match {
      case "uuid" => newUUID
      case _ => r
    }
    retSeq = retSeq :+ retval
  })
  var moreSchema = StructType(List(
    StructField("metadata_id", StringType, true)
  ))
  var newSchema = StructType(row.schema ++ moreSchema)
  retSeq = retSeq :+ "newid"
  var retRow = new GenericRowWithSchema(
    retSeq.toArray,
    newSchema
  ): Row
  retRow
}
var newRow = fillTemplateRow(templateRow, "test-user-1")
var usersDF = spark.createDataFrame(
spark.sparkContext.parallelize(Seq(newRow)),
newRow.schema
)
usersDF.select($"uuid",$"metadata_id").show()
