How to convert Java ArrayList to Apache Spark Dataset? - apache-spark

I have a list like this:
List<String> dataList = new ArrayList<>();
dataList.add("A");
dataList.add("B");
dataList.add("C");
I need to convert it to a Dataset, i.e. the Java equivalent of Dataset<Row> dataDs = Seq(dataList).toDS();

List<String> data = Arrays.asList("abc", "abc", "xyz");
Dataset<String> dataDs = spark.createDataset(data, Encoders.STRING());
Dataset<String> dataListDs = spark.createDataset(dataList, Encoders.STRING());
dataDs.show();
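If you specifically need a Dataset<Row> rather than a Dataset<String>, you can call toDF() on the result (a minimal sketch, assuming an existing SparkSession named spark; a Dataset<String> built with Encoders.STRING() exposes a single column named "value"):
import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
List<String> dataList = Arrays.asList("A", "B", "C");
// build a Dataset<String> first, then convert it to a Dataset<Row> (DataFrame)
Dataset<Row> dataDf = spark.createDataset(dataList, Encoders.STRING()).toDF();
dataDf.show();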

You can convert a List<String> to Dataset<Row> like so:
Build a List<Object> from your input, where each element has the correct Java type (e.g. Integer, String, etc.).
Generate a List<Row> from the List<Object>.
Decide the datatype list and header list you want for the Dataset<Row> schema.
Construct the schema object.
Create the dataset.
// values for a single row (types must line up with the schema below)
List<Object> data = new ArrayList<>();
data.add("hello");
data.add(null);
// wrap the values into a Row
List<Row> ls = new ArrayList<>();
Row row = RowFactory.create(data.toArray());
ls.add(row);
// column datatypes
List<DataType> datatype = new ArrayList<>();
datatype.add(DataTypes.StringType);
datatype.add(DataTypes.IntegerType);
// column names
List<String> headerList = new ArrayList<>();
headerList.add("Field_1_string");
headerList.add("Field_1_integer");
// build the schema
StructField structField1 = new StructField(headerList.get(0), datatype.get(0), true, org.apache.spark.sql.types.Metadata.empty());
StructField structField2 = new StructField(headerList.get(1), datatype.get(1), true, org.apache.spark.sql.types.Metadata.empty());
List<StructField> structFieldsList = new ArrayList<>();
structFieldsList.add(structField1);
structFieldsList.add(structField2);
StructType schema = new StructType(structFieldsList.toArray(new StructField[0]));
// create the dataset
Dataset<Row> dataset = sparkSession.createDataFrame(ls, schema);
dataset.show();
dataset.printSchema();

This is the derived answer that worked for me, inspired by NiharGht's answer.
Suppose we have a list like this (pseudocode, just to show the idea):
List<List<Integer>> data = [
[1, 2, 3],
[2, 3, 4],
[3, 4, 5]
];
Now convert each inner List to a Row, so that it can be used to build the DataFrame:
List<Row> rows = new ArrayList<>();
for (List<Integer> that_line : data){
Row row = RowFactory.create(that_line.toArray());
rows.add(row);
}
Then just make the DataFrame (note: instead of using an RDD, use the List):
Dataset<Row> r2DF = sparkSession.createDataFrame(rows, schema); // supposing you have schema already.
r2DF.show();
The catch is in this line:
Dataset<Row> r2DF = sparkSession.createDataFrame(rows, schema);
It is where we would usually pass an RDD, but here we pass the List instead.
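The snippet above assumes the schema already exists; here is a minimal sketch of building one for the three integer columns in this example (the column names col1/col2/col3 are made up for illustration):
import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
// one nullable IntegerType field per position in the inner lists
List<StructField> fields = Arrays.asList(
    DataTypes.createStructField("col1", DataTypes.IntegerType, true),
    DataTypes.createStructField("col2", DataTypes.IntegerType, true),
    DataTypes.createStructField("col3", DataTypes.IntegerType, true));
StructType schema = DataTypes.createStructType(fields);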

Related

Field data validation using spark dataframe

I have a bunch of columns; a sample of my data is displayed below.
I need to check the columns for errors and will have to generate two output files.
I'm using Apache Spark 2.0 and I would like to do this in an efficient way.
Schema Details
---------------
EMPID - (NUMBER)
ENAME - (STRING,SIZE(50))
GENDER - (STRING,SIZE(1))
Data
----
EMPID,ENAME,GENDER
1001,RIO,M
1010,RICK,MM
1015,123MYA,F
My expected output files should be as shown below:
1.
EMPID,ENAME,GENDER
1001,RIO,M
1010,RICK,NULL
1015,NULL,F
2.
EMPID,ERROR_COLUMN,ERROR_VALUE,ERROR_DESCRIPTION
1010,GENDER,"MM","OVERSIZED"
1010,GENDER,"MM","VALUE INVALID FOR GENDER"
1015,ENAME,"123MYA","NAME SHOULD BE A STRING"
Thanks
I have not really worked with Spark 2.0, so I'll try answering your question with a solution in Spark 1.6.
// Load your base data
val input = <<your input dataframe>>
// Extract the schema of your base data
val originalSchema = input.schema
// Modify your existing schema with your additional metadata fields
val modifiedSchema= originalSchema.add("ERROR_COLUMN", StringType, true)
.add("ERROR_VALUE", StringType, true)
.add("ERROR_DESCRIPTION", StringType, true)
// write a custom validation function
def validateColumns(row: Row): Row = {
var err_col: String = null
var err_val: String = null
var err_desc: String = null
val empId = row.getAs[String]("EMPID")
val ename = row.getAs[String]("ENAME")
val gender = row.getAs[String]("GENDER")
// do checking here and populate (err_col,err_val,err_desc) with values if applicable
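// For example (a hedged sketch, not part of the original answer), checks derived
// from the schema rules and expected output in the question:
if (gender != null && gender.length > 1) {
  err_col = "GENDER"; err_val = gender; err_desc = "OVERSIZED"
} else if (ename != null && !ename.matches("[A-Za-z]{1,50}")) {
  err_col = "ENAME"; err_val = ename; err_desc = "NAME SHOULD BE A STRING"
}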
Row.merge(row, Row(err_col),Row(err_val),Row(err_desc))
}
// Call your custom validation function
val validateDF = input.map { row => validateColumns(row) }
// Reconstruct the DataFrame with additional columns
val checkedDf = sqlContext.createDataFrame(validateDF, modifiedSchema)
// Filter out rows having errors
val errorDf = checkedDf.filter($"ERROR_COLUMN".isNotNull && $"ERROR_VALUE".isNotNull && $"ERROR_DESCRIPTION".isNotNull)
// Filter out rows having no errors
val errorFreeDf = checkedDf.filter($"ERROR_COLUMN".isNull && $"ERROR_VALUE".isNull && $"ERROR_DESCRIPTION".isNull)
I have used this approach personally and it works for me. I hope it points you in the right direction.

Applying a specific schema on an Apache Spark data frame

I am trying to apply a particular schema on a dataframe. The schema seems to have been applied, but all dataframe operations like count, show, etc. always fail with a NullPointerException, as shown below:
java.lang.NullPointerException was thrown.
java.lang.NullPointerException
at org.apache.spark.sql.catalyst.expressions.AttributeReference.hashCode(namedExpressions.scala:218)
at scala.runtime.ScalaRunTime$.hash(ScalaRunTime.scala:210)
Here is my code:
var fieldSchema = ListBuffer[StructField]()
val columns = mydf.columns
for (i <- 0 until columns.length) {
val columns = mydf.columns
val colName = columns(i)
fieldSchema += StructField(colName, mydf.schema(i).dataType, true, null)
}
val schema = StructType(fieldSchema.toList)
val newdf = sqlContext.createDataFrame(df.rdd, schema) << df is the original dataframe
newdf.printSchema() << this prints the new applied schema
println("newdf count:"+newdf.count()) << this fails with null pointer exception
In short, there are actually 3 dataframes:
df - the original data frame
mydf- the schema that I'm trying to apply on df is coming from this dataframe
newdf- creating a new dataframe same as that of df, but with different schema
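One hedged observation, not from the original thread: the StructField in the loop above is built with null as its metadata argument, and AttributeReference.hashCode (the frame in the stack trace) hashes the field metadata, so passing Metadata.empty() instead of null is worth trying. A minimal Java sketch of building a field that way:
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
// avoid null metadata; use Metadata.empty() (DataTypes.createStructField does this for you)
StructField field = new StructField("colName", DataTypes.StringType, true, Metadata.empty());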

Spark ALS with strings labels - Conversion back to string

I have this code:
val userIndexer: StringIndexer = new StringIndexer()
.setInputCol("userKey")
.setOutputCol("user")
val alsRatings = userIndexerModel.transform(ratings)
val matrixFactorizationModel = ALS.trainImplicit(alsRatings.rdd, rank = 10, iterations = 10)
val rec = matrixFactorizationModel.recommendProductsForUsers(20)
This gives me back recommendations with user ids. I want to have my user key strings back. What is the most efficient way to do it? Thanks.
PS: I certainly cannot understand why the ALS library developers don't accept string labels. It's extremely painful and expensive to handle the conversions (string to int and then int to string) from the outside. I hope there is an issue or something for this in their backlog.
I generally run the StringIndexer, collect the labels in the driver, and parallelize the labels with an index. Then, instead of calling transform on the StringIndexer, I join the DataFrames to get the same result as a StringIndexer would give.
val swidConverter = new StringIndexer()
.setInputCol("id")
.setOutputCol("idIndex").fit(df)
val idDf = spark.sparkContext.parallelize(
swidConverter.labels.zipWithIndex
).toDF("id", "idIndex").repartition(PARTITION_SIZE) // set the partition size depending on your data size.
// Joining the idDf(DataFrame) with the actual Data.
val indexedDF = df.join(idDf,idDf.col("id")===df.col("id")).select("idIndex","product_id","rating")
val als = new ALS()
.setMaxIter(5)
.setRegParam(0.01)
.setUserCol("idIndex")
.setItemCol("product_id")
.setRatingCol("rating")
val model = als.fit(indexedDF)
val resultRaw = model.recommendForAllUsers(4)
// Joining the idDf(DataFrame) with the Result to get the original ID from the indexed Id.
val resultDf = resultRaw.join(idDf,resultRaw.col("idIndex")===idDf.col("idIndex")).select("id","recommendations")

how to introduce the schema in a Row in Spark?

In the Row Java API there is a row.schema(), however there is not a row.set(StructType schema).
Also I tried RowFactory.create(objects), but I don't know how to proceed.
UPDATE:
The problem is how to generate a new dataframe when I modify the structure in the workers. Here is the example:
DataFrame sentenceData = jsql.createDataFrame(jrdd, schema);
List<Row> resultRows2 = sentenceData.toJavaRDD()
    .map(new MyFunction<Row, Row>(parameters) {
        /** my map function **/
        public Row call(Row row) {
            // I want to change the Row definition, adding new columns
            Row newRow = functionAddnewNewColumns(row);
            StructType newSchema = functionGetNewSchema(row.schema());
            // Here I want to insert the structure
            return newRow;
        }
    }).collect();
JavaRDD<Row> newJrdd = jsc.parallelize(resultRows2);
// Here is the problem: I don't know how to get the new schema to create the new modified dataframe
DataFrame newDataframe = jsql.createDataFrame(newJrdd, newSchema);
You can create a Row with a schema by using GenericRowWithSchema (from org.apache.spark.sql.catalyst.expressions):
Row newRow = new GenericRowWithSchema(values, newSchema);
You do not set a schema on a row - that makes no sense. You can, however, create a DataFrame (or pre-Spark 1.3 a JavaSchemaRDD) with a given schema using the sqlContext.
DataFrame dataFrame = sqlContext.createDataFrame(rowRDD, schema)
The dataframe will have the schema you have provided.
For further information, please consult the documentation at http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
EDIT: According to updated question
You can generate new rows in your map function, which will give you a new RDD of type JavaRDD<Row>:
DataFrame sentenceData = jsql.createDataFrame(jrdd, schema);
JavaRDD<Row> newRowRDD = sentenceData
.toJavaRDD()
.map(row -> functionAddnewNewColumns(row)) // Assuming functionAddnewNewColumns returns a Row
You then define the new schema
StructField[] fields = new StructField[] {
new StructField("column1",...),
new StructField("column2",...),
...
};
StructType newSchema = new StructType(fields);
Create a new DataFrame from your rowRDD with newSchema as schema
DataFrame newDataframe = jsql.createDataFrame(newRowRDD, newSchema)
This is a pretty old thread, but I just had a use case where I needed to generate data with Spark, quickly work with data on the row level, and then build a new dataframe from the rows. It took me a bit to put together, so maybe it will help someone.
Here we take a "template" row, modify some data, add a new column with the appropriate "row-level" schema, and then use that new row and schema to create a new DF with the appropriate "new schema", so going "bottom up" :) This builds on @Christian's answer, so I'm contributing a simplified snippet back.
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types.{StringType, StructField, StructType}
def fillTemplateRow(row: Row, newUUID: String) = {
var retSeq = Seq[Any]()
(row.schema,row.toSeq).zipped.foreach(
(s,r)=> {
// println(s"s=${s},r=${r}")
val retval = s.name match {
case "uuid" => {
newUUID
}
case _ => r
}
retSeq = retSeq :+ retval
})
var moreSchema = StructType(List(
StructField("metadata_id", StringType, true)
))
var newSchema = StructType(row.schema ++ moreSchema)
retSeq = retSeq :+ "newid"
var retRow = new GenericRowWithSchema(
retSeq.toArray,
newSchema
): Row
retRow
}
var newRow = fillTemplateRow(templateRow, "test-user-1")
var usersDF = spark.createDataFrame(
spark.sparkContext.parallelize(Seq(newRow)),
newRow.schema
)
usersDF.select($"uuid",$"metadata_id").show()

How to convert List to JavaRDD

We know that in Spark there is a method rdd.collect() which converts an RDD to a List:
List<String> f = rdd.collect();
String[] array = f.toArray(new String[f.size()]);
I am trying to do exactly the opposite in my project. I have an ArrayList of String which I want to convert to a JavaRDD. I have been looking for this solution for quite some time but have not found the answer. Can anybody please help me out here?
You're looking for JavaSparkContext.parallelize(List) and similar. This is just like in the Scala API.
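For example (a minimal sketch, assuming an existing JavaSparkContext named sc):
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
List<String> list = new ArrayList<>();
list.add("A");
list.add("B");
list.add("C");
// parallelize distributes the local list into an RDD
JavaRDD<String> rdd = sc.parallelize(list);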
Adding to Sean Owen's and the other solutions:
You can use JavaSparkContext#parallelizePairs for a List of Tuple2:
List<Tuple2<Integer, Integer>> pairs = new ArrayList<>();
pairs.add(new Tuple2<>(0, 5));
pairs.add(new Tuple2<>(1, 3));
JavaSparkContext sc = new JavaSparkContext();
JavaPairRDD<Integer, Integer> rdd = sc.parallelizePairs(pairs);
There are two ways to convert a collection to an RDD with the Scala SparkContext:
1) sc.parallelize(collection)
2) sc.makeRDD(collection)
Both methods are identical (makeRDD simply delegates to parallelize), so we can use either of them. Note that JavaSparkContext only exposes parallelize.
If you are using a .scala file, or you don't want to or cannot use JavaSparkContext, then you could:
use SparkContext instead of JavaSparkContext
convert your Java List to a Scala List
use SparkContext's parallelize method
For example:
import scala.collection.JavaConverters._
val javaList = new java.util.ArrayList[String]()
javaList.add("abc")
javaList.add("def")
sc.parallelize(javaList.asScala)
This will generate an RDD for you.
List<StructField> fields = new ArrayList<>();
fields.add(DataTypes.createStructField("fieldx1", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("fieldx2", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("fieldx3", DataTypes.LongType, true));
// build the schema from the field list
StructType schema = DataTypes.createStructType(fields);
List<Row> data = new ArrayList<>();
// fieldx3 is LongType, so the third value must be a Long (or null)
data.add(RowFactory.create("", "", 0L));
Dataset<Row> rawDataSet = spark.createDataFrame(data, schema).toDF();
