How to introduce the schema in a Row in Spark? - apache-spark

In the Row Java API there is a row.schema() method, but there is no row.set(StructType schema).
I also tried RowFactory.create(objects), but I don't know how to proceed.
UPDATE:
The problem is how to generate a new DataFrame when I modify the structure in the workers. Here is an example:
DataFrame sentenceData = jsql.createDataFrame(jrdd, schema);
List<Row> resultRows2 = sentenceData.toJavaRDD()
    .map(new MyFunction<Row, Row>(parameters) {
        /** my map function */
        public Row call(Row row) {
            // I want to change the Row definition by adding new columns
            Row newRow = functionAddnewNewColumns(row);
            StructType newSchema = functionGetNewSchema(row.schema());
            // Here I want to insert the structure
            return newRow;
        }
    }).collect();

JavaRDD<Row> jrdd2 = jsc.parallelize(resultRows2);
// Here is the problem: I don't know how to get the new schema to create the new, modified DataFrame
DataFrame newDataframe = jsql.createDataFrame(jrdd2, newSchema);

You can create a Row with a schema by using:
Row newRow = new GenericRowWithSchema(values, newSchema);
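For reference, a minimal Java sketch of that approach (the column names, types, and values below are made up for illustration):
import org.apache.spark.sql.Row;
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Schema with two example columns
StructType newSchema = new StructType(new StructField[] {
    new StructField("id", DataTypes.LongType, false, Metadata.empty()),
    new StructField("name", DataTypes.StringType, true, Metadata.empty())
});

// Values must line up with the schema fields, in order
Object[] values = new Object[] { 1L, "alice" };
Row newRow = new GenericRowWithSchema(values, newSchema);

// newRow.schema() now returns newSchema, and getAs works by field name
String name = newRow.getAs("name");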

You do not set a schema on a row - that makes no sense. You can, however, create a DataFrame (or pre-Spark 1.3 a JavaSchemaRDD) with a given schema using the sqlContext.
DataFrame dataFrame = sqlContext.createDataFrame(rowRDD, schema);
The DataFrame will have the schema you have provided.
For further information, please consult the documentation at http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
EDIT: According to the updated question
You can generate new rows in your map function, which will give you a new RDD of type JavaRDD<Row>:
DataFrame sentenceData = jsql.createDataFrame(jrdd, schema);
JavaRDD<Row> newRowRDD = sentenceData
    .toJavaRDD()
    .map(row -> functionAddnewNewColumns(row)); // Assuming functionAddnewNewColumns returns a Row
You then define the new schema:
StructField[] fields = new StructField[] {
    new StructField("column1", ...),
    new StructField("column2", ...),
    ...
};
StructType newSchema = new StructType(fields);
Finally, create a new DataFrame from your newRowRDD with newSchema as the schema:
DataFrame newDataframe = jsql.createDataFrame(newRowRDD, newSchema);
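Putting the pieces together, one possible end-to-end sketch (the extra column name and its value are placeholders; this assumes the jsql SQLContext and sentenceData DataFrame from the question):
import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Map each old row to a new row with one extra column appended
JavaRDD<Row> newRowRDD = sentenceData.toJavaRDD().map(row -> {
    Object[] extended = new Object[row.length() + 1];
    for (int i = 0; i < row.length(); i++) {
        extended[i] = row.get(i);
    }
    extended[row.length()] = "some new value"; // placeholder for the computed column
    return RowFactory.create(extended);
});

// The new schema is the old schema plus the extra column
StructField[] oldFields = sentenceData.schema().fields();
StructField[] newFields = Arrays.copyOf(oldFields, oldFields.length + 1);
newFields[oldFields.length] = new StructField("newColumn", DataTypes.StringType, true, Metadata.empty());
StructType newSchema = new StructType(newFields);

DataFrame newDataframe = jsql.createDataFrame(newRowRDD, newSchema);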

This is a pretty old thread, but I just had a use case where I needed to generate data with Spark, quickly work with data at the row level, and then build a new dataframe from the rows. It took me a bit to put together, so maybe it will help someone.
Here we take a "template" row, modify some data, add a new column with the appropriate "row-level" schema, and then use that new row and schema to create a new DF with the appropriate "new schema", going "bottom up" :) This builds on @Christian's answer, so I'm contributing a simplified snippet back.
def fillTemplateRow(row: Row, newUUID: String) = {
  var retSeq = Seq[Any]()
  (row.schema, row.toSeq).zipped.foreach(
    (s, r) => {
      // println(s"s=${s},r=${r}")
      val retval = s.name match {
        case "uuid" => newUUID
        case _      => r
      }
      retSeq = retSeq :+ retval
    })

  // Extend the row-level schema with the new column
  val moreSchema = StructType(List(
    StructField("metadata_id", StringType, true)
  ))
  val newSchema = StructType(row.schema ++ moreSchema)

  // Append the value for the new column
  retSeq = retSeq :+ "newid"

  val retRow = new GenericRowWithSchema(
    retSeq.toArray,
    newSchema
  ): Row
  retRow
}

val newRow = fillTemplateRow(templateRow, "test-user-1")

val usersDF = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(newRow)),
  newRow.schema
)

usersDF.select($"uuid", $"metadata_id").show()

Related

Spark Streaming reach dataframe columns and add new column looking up to Redis

In my previous question (Spark Structured Streaming dynamic lookup with Redis), I succeeded in reaching Redis with mapPartitions, thanks to https://stackoverflow.com/users/689676/fe2s
I tried to use mapPartitions, but I could not solve one point: how can I access each row's columns in the code below while iterating?
I want to enrich each row against my lookup fields kept in Redis.
I found something like this, but how can I access the DataFrame columns and add a new column by looking it up in Redis?
Any help is much appreciated. Thanks.
import org.apache.spark.sql.types._

def transformRow(row: Row): Row = {
  Row.fromSeq(row.toSeq ++ Array[Any]("val1", "val2"))
}

def transformRows(iter: Iterator[Row]): Iterator[Row] = {
  val redisConn = new RedisClient("xxx.xxx.xx.xxx", 6379, 1, Option("Secret123"))
  println(redisConn.get("ModelValidityPeriodName").getOrElse(""))
  // want to reach DataFrame column here
  redisConn.close()
  iter.map(transformRow)
}

val newSchema = StructType(raw_customer_df.schema.fields ++
  Array(
    StructField("ModelValidityPeriod", StringType, false),
    StructField("ModelValidityPeriod2", StringType, false)
  )
)

spark.sqlContext.createDataFrame(raw_customer_df.rdd.mapPartitions(transformRows), newSchema).show
The iterator iter represents an iterator over the dataframe rows. So if I understood your question correctly, you can access column values by iterating over iter and calling
row.getAs[Column_Type](column_name)
Something like this
def transformRows(iter: Iterator[Row]): Iterator[Row] = {
  val redisConn = new RedisClient("xxx.xxx.xx.xxx", 6379, 1, Option("Secret123"))
  println(redisConn.get("ModelValidityPeriodName").getOrElse(""))
  // reach the DataFrame column here
  val res = iter.map { row =>
    val columnValue = row.getAs[String]("column_name")
    // lookup in redis
    val valueFromRedis = redisConn.get(...)
    Row.fromSeq(row.toSeq ++ Array[Any](valueFromRedis))
  }.toList
  redisConn.close()
  res.iterator
}

How to convert Java ArrayList to Apache Spark Dataset?

I have a list like this:
List<String> dataList = new ArrayList<>();
dataList.add("A");
dataList.add("B");
dataList.add("C");
I need to convert it to a Dataset, something like Dataset<Row> dataDs = Seq(dataList).toDs();
List<String> data = Arrays.asList("abc", "abc", "xyz");
Dataset<String> dataDs = spark.createDataset(data, Encoders.STRING());
Dataset<String> dataListDs = spark.createDataset(dataList, Encoders.STRING());
dataDs.show();
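If a Dataset<Row> is specifically needed (as in the question), one option, not shown in the snippet above, is to convert the typed Dataset with toDF():
// Convert the typed Dataset<String> into an untyped Dataset<Row> with a named column
Dataset<Row> dataDf = dataListDs.toDF("value");
dataDf.show();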
You can convert a List<String> to Dataset<Row> like so:
1. Get a List<Object> from the List<String>, with each element having the correct object class, e.g. Integer, String, etc.
2. Generate a List<Row> from the List<Object>.
3. Build the datatype list and header list you want for the Dataset<Row> schema.
4. Construct the schema object.
5. Create the dataset.
List<Object> data = new ArrayList<>();
data.add("hello");
data.add(null);

List<Row> ls = new ArrayList<Row>();
Row row = RowFactory.create(data.toArray());
ls.add(row);

List<DataType> datatype = new ArrayList<DataType>();
datatype.add(DataTypes.StringType);
datatype.add(DataTypes.IntegerType);

List<String> headerList = new ArrayList<String>();
headerList.add("Field_1_string");
headerList.add("Field_1_integer");

StructField structField1 = new StructField(headerList.get(0), datatype.get(0), true, org.apache.spark.sql.types.Metadata.empty());
StructField structField2 = new StructField(headerList.get(1), datatype.get(1), true, org.apache.spark.sql.types.Metadata.empty());

List<StructField> structFieldsList = new ArrayList<>();
structFieldsList.add(structField1);
structFieldsList.add(structField2);

StructType schema = new StructType(structFieldsList.toArray(new StructField[0]));

Dataset<Row> dataset = sparkSession.createDataFrame(ls, schema);
dataset.show();
dataset.printSchema();
This is the derived answer that worked for me. It is inspired by NiharGht's answer.
Suppose we have a nested list like this:
List<List<Integer>> data = Arrays.asList(
    Arrays.asList(1, 2, 3),
    Arrays.asList(2, 3, 4),
    Arrays.asList(3, 4, 5)
);
Now convert each inner List to a Row so that it can be used to make the DataFrame:
List<Row> rows = new ArrayList<>();
for (List<Integer> that_line : data) {
    Row row = RowFactory.create(that_line.toArray());
    rows.add(row);
}
Then just make the DataFrame! (Note: instead of using an RDD, use the List.)
Dataset<Row> r2DF = sparkSession.createDataFrame(rows, schema); // supposing you have schema already.
r2DF.show();
The catch is in this line:
Dataset<Row> r2DF = sparkSession.createDataFrame(rows, schema);
It is where we are usually using RDD instead of the List.
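For completeness, a minimal sketch of what that schema could look like for the three-integer rows above (the column names are made up):
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Three nullable integer columns matching the rows built above
StructType schema = new StructType(new StructField[] {
    new StructField("col1", DataTypes.IntegerType, true, Metadata.empty()),
    new StructField("col2", DataTypes.IntegerType, true, Metadata.empty()),
    new StructField("col3", DataTypes.IntegerType, true, Metadata.empty())
});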

Field data validation using spark dataframe

I have a bunch of columns; a sample of my data is shown below.
I need to check the columns for errors and will have to generate two output files.
I'm using Apache Spark 2.0 and I would like to do this in an efficient way.
Schema Details
---------------
EMPID - (NUMBER)
ENAME - (STRING,SIZE(50))
GENDER - (STRING,SIZE(1))
Data
----
EMPID,ENAME,GENDER
1001,RIO,M
1010,RICK,MM
1015,123MYA,F
My expected output files should be as shown below:
1.
EMPID,ENAME,GENDER
1001,RIO,M
1010,RICK,NULL
1015,NULL,F
2.
EMPID,ERROR_COLUMN,ERROR_VALUE,ERROR_DESCRIPTION
1010,GENDER,"MM","OVERSIZED"
1010,GENDER,"MM","VALUE INVALID FOR GENDER"
1015,ENAME,"123MYA","NAME SHOULD BE A STRING"
Thanks
I have not really worked with Spark 2.0, so I'll try answering your question with a solution in Spark 1.6.
// Load your base data
val input = <<your input dataframe>>

// Extract the schema of your base data
val originalSchema = input.schema

// Modify your existing schema with the additional metadata fields
val modifiedSchema = originalSchema.add("ERROR_COLUMN", StringType, true)
  .add("ERROR_VALUE", StringType, true)
  .add("ERROR_DESCRIPTION", StringType, true)

// Write a custom validation function
def validateColumns(row: Row): Row = {
  var err_col: String = null
  var err_val: String = null
  var err_desc: String = null
  val empId = row.getAs[String]("EMPID")
  val ename = row.getAs[String]("ENAME")
  val gender = row.getAs[String]("GENDER")

  // do the checking here and populate (err_col, err_val, err_desc) with values if applicable

  Row.merge(row, Row(err_col), Row(err_val), Row(err_desc))
}

// Call your custom validation function
val validatedRDD = input.map { row => validateColumns(row) }

// Reconstruct the DataFrame with the additional columns
val checkedDf = sqlContext.createDataFrame(validatedRDD, modifiedSchema)

// Filter out rows having errors
val errorDf = checkedDf.filter($"ERROR_COLUMN".isNotNull && $"ERROR_VALUE".isNotNull && $"ERROR_DESCRIPTION".isNotNull)

// Filter out rows having no errors
val errorFreeDf = checkedDf.filter($"ERROR_COLUMN".isNull && $"ERROR_VALUE".isNull && $"ERROR_DESCRIPTION".isNull)
I have used this approach personally and it works for me. I hope it points you in the right direction.

Applying a specific schema on an Apache Spark data frame

I am trying to apply a particular schema on a dataframe. The schema seems to have been applied, but all dataframe operations like count, show, etc. always fail with a NullPointerException, as shown below:
java.lang.NullPointerException was thrown.
java.lang.NullPointerException
at org.apache.spark.sql.catalyst.expressions.AttributeReference.hashCode(namedExpressions.scala:218)
at scala.runtime.ScalaRunTime$.hash(ScalaRunTime.scala:210)
Here is my code:
var fieldSchema = ListBuffer[StructField]()
val columns = mydf.columns
for (i <- 0 until columns.length) {
  val columns = mydf.columns
  val colName = columns(i)
  fieldSchema += StructField(colName, mydf.schema(i).dataType, true, null)
}
val schema = StructType(fieldSchema.toList)
val newdf = sqlContext.createDataFrame(df.rdd, schema) << df is the original dataframe
newdf.printSchema() << this prints the new applied schema
println("newdf count:" + newdf.count()) << this fails with null pointer exception
In short, there are actually 3 dataframes:
df - the original data frame
mydf - the schema that I'm trying to apply on df comes from this dataframe
newdf - a new dataframe, the same as df, but with a different schema

Execute SQL on Ignite cache of BinaryObjects

I am creating a cache of BinaryObject from a Spark DataFrame, and then I want to perform SQL on that Ignite cache.
Here is my code, where bank is the DataFrame containing three fields (id, name, and age):
val ic = new IgniteContext(sc, () => new IgniteConfiguration())

val cacheConfig = new CacheConfiguration[BinaryObject, BinaryObject]()
cacheConfig.setName("test123")
cacheConfig.setStoreKeepBinary(true)
cacheConfig.setIndexedTypes(classOf[BinaryObject], classOf[BinaryObject])

val qe = new QueryEntity()
qe.setKeyType("TestKey")
qe.setValueType("TestValue")
val fields = new java.util.LinkedHashMap[String, String]()
fields.put("id", "java.lang.Long")
fields.put("name", "java.lang.String")
fields.put("age", "java.lang.Int")
qe.setFields(fields)

val qes = new java.util.ArrayList[QueryEntity]()
qes.add(qe)
cacheConfig.setQueryEntities(qes)

val cache = ic.fromCache[BinaryObject, BinaryObject](cacheConfig)
cache.savePairs(bank.rdd, (row: Bank, iContext: IgniteContext) => {
  val keyBuilder = iContext.ignite().binary().builder("TestKey")
  keyBuilder.setField("id", row.id)
  val key = keyBuilder.build()

  val valueBuilder = iContext.ignite().binary().builder("TestValue")
  valueBuilder.setField("name", row.name)
  valueBuilder.setField("age", row.age)
  val value = valueBuilder.build()

  (key, value)
}, true)
Now I am trying to execute an SQL query like this:
cache.sql("select age from TestValue")
This fails with the following exception:
Caused by: org.h2.jdbc.JdbcSQLException: Column "AGE" not found; SQL statement:
select age from TestValue [42122-191]
at org.h2.message.DbException.getJdbcSQLException(DbException.java:345)
at org.h2.message.DbException.get(DbException.java:179)
at org.h2.message.DbException.get(DbException.java:155)
at org.h2.expression.ExpressionColumn.optimize(ExpressionColumn.java:147)
at org.h2.command.dml.Select.prepare(Select.java:852)
What am I doing wrong here?
The type of the age field is incorrect; it should be the following:
fields.put("age", "java.lang.Integer")
