Spark CSV load - custom schema - custom object - apache-spark

Encoder<Transaction> encoder = Encoders.bean(Transaction.class);
Dataset<Transaction> transactionDS = sparkSession
.read()
.format("csv")
.option("header", true)
.option("delimiter", ",")
.option("enforceSchema", false)
.option("multiLine", false)
.schema(encoder.schema())
.load("s3a://xxx/testSchema.csv")
.as(encoder);
System.out.println("==============schema starts============");
transactionDS.printSchema();
System.out.println("==============schema ends============");
transactionDS.show(10, true); // this is the line that bombs.
My CSV is this -
transactionId,accountId
1,2
10,44
I'm printing my schema in the logs - (you see, the columns are now flipped, or sorted - Ah!)
==============schema starts============
root
|-- accountId: integer (nullable = true)
|-- transactionId: long (nullable = true)
==============schema ends============
I'm getting the error below:
Caused by: java.lang.IllegalArgumentException: CSV header does not conform to the schema.
Header: transactionId, accountId
Schema: accountId, transactionId
Expected: accountId but found: transactionId
This is what my Transaction class looks like.
public class Transaction implements Serializable {
private static final long serialVersionUID = 7648268336292069686L;
private Long transactionId;
private Integer accountId;
public Long getTransactionId() {
return transactionId;
}
public void setTransactionId(Long transactionId) {
this.transactionId = transactionId;
}
public Integer getAccountId() {
return accountId;
}
public void setAccountId(Integer accountId) {
this.accountId = accountId;
}
}
Question - Why is Spark not able to match my schema? The ordering is messed up. In my CSV I'm passing transactionId, accountId, but Spark takes my schema as accountId, transactionId. Ah!

Do not use encoder.schema() to load the CSV file; its column order may not match the column order in the CSV.

Unlike Parquet, a CSV file doesn't carry a schema, so Spark will not put the columns in the correct order for you. What you can do is read the CSV without:
.schema(encoder.schema())
Then apply the schema to the dataset that you just created.
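To see where the flipped order most likely comes from: as far as I know, Encoders.bean discovers fields through JavaBeans introspection rather than through the order in which the fields are declared. The sketch below uses only the JDK (no Spark) to list the properties of a bean shaped like the Transaction class in the question. BeanPropertyOrder and propertyNames are names made up for this illustration. In common JDK implementations the names come back sorted alphabetically, although the java.beans API does not promise any particular order.

```java
import java.beans.BeanInfo;
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of Spark.
class BeanPropertyOrder {

    // Same shape as the bean in the question: transactionId is declared first.
    public static class Transaction {
        private Long transactionId;
        private Integer accountId;

        public Long getTransactionId() { return transactionId; }
        public void setTransactionId(Long transactionId) { this.transactionId = transactionId; }
        public Integer getAccountId() { return accountId; }
        public void setAccountId(Integer accountId) { this.accountId = accountId; }
    }

    // Returns the bean property names in the order the introspector reports them.
    static List<String> propertyNames(Class<?> beanClass) {
        try {
            // Stopping at Object.class leaves out the "class" property from getClass().
            BeanInfo info = Introspector.getBeanInfo(beanClass, Object.class);
            List<String> names = new ArrayList<>();
            for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
                names.add(pd.getName());
            }
            return names;
        } catch (IntrospectionException e) {
            throw new IllegalStateException("Could not introspect " + beanClass, e);
        }
    }

    public static void main(String[] args) {
        // Typically prints [accountId, transactionId], not the declaration order.
        System.out.println(propertyNames(Transaction.class));
    }
}
```

If the encoder's schema is built from this kind of listing, it will not follow the CSV header order, which is why an explicitly ordered StructType (or no schema at all on read) avoids the mismatch.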

This is what I ended up doing -
Encoder<Transaction> encoder = Encoders.bean(Transaction.class);
// read data from S3
System.out.println("Going to read file......................................");
Dataset<Transaction> transactionDS = sparkSession
.read()
.format("csv")
.option("header", true)
.option("delimiter", ",")
//.option("enforceSchema", false)
.option("inferSchema", false)
.option("dateFormat", "yyyy-MM-dd")
//.option("multiLine", false)
//.schema(encoder.schema())
.schema(_createSchema())
.csv("s3a://xxx/transactions_4_with_column_names.csv")
.as(encoder);
The _createSchema() function is below -
private static StructType _createSchema() {
List<StructField> list = new ArrayList<StructField>() {
private static final long serialVersionUID = -4953991596584287923L;
{
add(DataTypes.createStructField("transactionId", DataTypes.LongType, true));
add(DataTypes.createStructField("accountId", DataTypes.IntegerType, true));
add(DataTypes.createStructField("destAccountId", DataTypes.IntegerType, true));
add(DataTypes.createStructField("destPostDate", DataTypes.DateType, true));
}
};
return new StructType(list.toArray(new StructField[0]));
}

Related

Spark java dataframe String cannot be converted to struct

I have the below spark schema defined
StructType state = DataTypes.createStructType(
new StructField[] {
DataTypes.createStructField("version", DataTypes.IntegerType, false),
DataTypes.createStructField("value", DataTypes.StringType, false)
});
ArrayType relationship = DataTypes.createArrayType(DataTypes.createStructType(
new StructField[] {
DataTypes.createStructField("cid", DataTypes.StringType, false),
DataTypes.createStructField("state", state, false),
}));
StructType cr = DataTypes.createStructType(
new StructField[] {
DataTypes.createStructField("cmg", relationship, false)
});
StructType schema = DataTypes.createStructType(
new StructField[] {
DataTypes.createStructField("cr", cr, false)
});
If I create the dataframe as
Row r1 = RowFactory.create("{cr:{cmg:[{cid:\"B06XW5BXJZ\",state:{version:19,value:\"approved\"}}]}}");
List<Row> rowList = ImmutableList.of(r1);
Dataset<Row> df = spark.sqlContext().createDataFrame(rowList, schema);
The code gives the error below:
The value ({cr:{cmg:[{cid:"B06XW5BXJZ",state:{version:19,value:"approved"}}]}}) of the type (java.lang.String) cannot be converted to struct<cmg:array<struct<cid:string,state:struct<version:int,value:string>>>>
What am I missing?
When you execute createDataFrame(rowList, schema) Spark tries to interpret the content of each element in rowList using the provided schema.
However, the values in rowList are strings, and not structured objects, so Spark is unable to apply the schema.
You have various options to load that object into a dataframe in structured form.
Load the data as json string and use spark to parse it
String jsonRow = "{cr:{cmg:[{cid:\"B06XW5BXJZ\",state:{version:19,value:\"approved\"}}]}}";
Dataset<Row> df = spark.createDataset(List.of(jsonRow), Encoders.STRING())
.select(functions.from_json(functions.col("value"), schema, Map.of("allowUnquotedFieldNames", "true")));
In this case it first creates a Dataset<String> in which each row contains a single String column (value), and then uses the from_json Spark SQL function to parse the JSON using your schema.
Also note the use of the allowUnquotedFieldNames=true option, required because in the input string the field names are not quoted.
Manually create structured rows and load them in a Dataframe
Row structuredRow = RowFactory.create(RowFactory.create(List.of(RowFactory.create("B06XW5BXJZ", RowFactory.create(19, "approved")))));
Dataset<Row> df = spark.createDataFrame(List.of(structuredRow), schema);
This extends your initial attempt to use the RowFactory to manually create the rows. The rows must reflect the structure defined in the schema (or rather, the schema must respect the structure of the rows).
Use a custom Java bean class
Class definitions
public static class State implements Serializable {
private Integer version;
private String value;
// getters, setters, constructors
}
public static class Relationship implements Serializable {
private String cid;
private State state;
// getters, setters, constructors
}
public static class Cr implements Serializable {
private List<Relationship> cmg;
// getters, setters, constructors
}
public static class RowBean implements Serializable {
private Cr cr;
// getters, setters, constructors
}
Use the bean class to create a Dataset
RowBean row = new RowBean(new Cr(List.of(new Relationship("B06XW5BXJZ", new State(19, "approved")))));
Dataset<RowBean> ds = spark.createDataset(List.of(row), Encoders.bean(RowBean.class));
In this case, using a custom Java bean / Scala case class, the schema is extracted directly from the class structure using Encoders.bean()

Spark dataframe from dynamic schema Or filter out rows that do not satisfy the schema

I have written a Kafka producer that tails the contents of a log file (format: CSV). The Kafka consumer is a streaming application that creates a JavaDStream.
Using the foreachRDD method, I'm splitting each line of the file on the delimiter ',' and creating a Row object. I have specified a schema that has 7 columns.
Then I am creating a dataframe using the JavaRDD and the schema.
But the problem here is that not all the rows in the log file have the same number of columns.
Thus, is there any way to filter out the rows that do not satisfy the schema, or to create the schema dynamically based on the row content?
Following is the part of the code:
JavaDStream<String> msgDataStream = directKafkaStream.map(new Function<Tuple2<String, String>, String>() {
@Override
public String call(Tuple2<String, String> tuple2) {
return tuple2._2();
}
});
msgDataStream.foreachRDD(new VoidFunction<JavaRDD<String>>() {
@Override
public void call(JavaRDD<String> rdd) {
JavaRDD<Row> rowRDD = rdd.map(new Function<String, Row>() {
@Override
public Row call(String msg) {
String[] splitMsg=msg.split(",");
Object[] vals = new Object[splitMsg.length];
for(int i=0;i<splitMsg.length;i++)
{
vals[i]=splitMsg[i].replace("\"","").trim();
}
Row row = RowFactory.create(vals);
return row;
}
});
//Create Schema
StructType schema = DataTypes.createStructType(new StructField[] {
DataTypes.createStructField("timeIpReq", DataTypes.StringType, true),DataTypes.createStructField("SrcMac", DataTypes.StringType, true),
DataTypes.createStructField("Proto", DataTypes.StringType, true),DataTypes.createStructField("ACK", DataTypes.StringType, true),
DataTypes.createStructField("srcDst", DataTypes.StringType, true),DataTypes.createStructField("NATSrcDst", DataTypes.StringType, true),
DataTypes.createStructField("len", DataTypes.StringType, true)});
//Get Spark 2.0 session
Dataset<Row> msgDataFrame = session.createDataFrame(rowRDD, schema);
A simple way to remove rows that do not match the expected schema is to use flatMap with an Option type. If your target is to build a DataFrame, the same flatMap step can also apply a schema to the data. In Scala this is made easy by case classes.
// Create Schema
case class NetInfo(timeIpReq: String, srcMac: String, proto: String, ack: String, srcDst: String, natSrcDst: String, len: String)
val netInfoStream = msgDataStream.flatMap{msg =>
val parts = msg.split(",")
if (parts.size == 7) { //filter out messages with unmatching set of fields
val Array(time, src, proto, ack, srcDst, natSrcDst, len) = parts // use an extractor to bind the parts to variables
Some(NetInfo(time, src, proto, ack, srcDst, natSrcDst, len)) // return a valid record
} else {
None // not a valid record, so return None
}
}
netInfoStream.foreachRDD{rdd =>
import sparkSession.implicits._
val df = rdd.toDF() // DataFrame transformation is possible on RDDs with a schema (based on a case class)
// do stuff with the dataframe
}
Regarding:
all the rows in the log file do not have same number of columns.
Assuming that they all represent the same kind of data but with potentially some columns missing, the right strategy would be to either filter out the incomplete data (as exemplified here) or use optional values in a defined schema, if there is a deterministic way to know which fields are missing. This requirement should be posed to the upstream applications that generate the data. It's common to represent missing values in CSV with empty comma sequences (e.g. field0,,field2,,,field5).
A dynamic schema to handle per-row differences would not make sense, as there would be no way to apply it to a complete DataFrame composed of rows with different schemas.
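Since the question is in Java, here is a minimal plain-Java sketch of the same "parse to an optional, keep only the valid ones" idea, with no Spark dependency. LogLineFilter, parse and keepValid are hypothetical names for this illustration. One detail worth noting: String.split(",") silently drops trailing empty fields, so a line such as field0,,field2,,,field5, would be miscounted; passing a limit of -1 keeps them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

// Hypothetical helper class, not part of Spark.
class LogLineFilter {

    // The schema in the question has 7 columns.
    static final int EXPECTED_FIELDS = 7;

    // Returns the cleaned fields, or Optional.empty() if the field count is wrong.
    static Optional<List<String>> parse(String line) {
        // A limit of -1 keeps trailing empty strings, so "a,,c,,,f," yields 7 fields.
        String[] parts = line.split(",", -1);
        if (parts.length != EXPECTED_FIELDS) {
            return Optional.empty();
        }
        List<String> cleaned = new ArrayList<>(parts.length);
        for (String part : parts) {
            // Same cleanup as in the question: strip quotes and surrounding whitespace.
            cleaned.add(part.replace("\"", "").trim());
        }
        return Optional.of(cleaned);
    }

    // Keeps only the lines that parse into exactly EXPECTED_FIELDS fields.
    static List<List<String>> keepValid(List<String> lines) {
        return lines.stream()
                .map(LogLineFilter::parse)
                .filter(Optional::isPresent)
                .map(Optional::get)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "t,mac,tcp,ack,src,nat,60", // complete record
                "t,mac,tcp",                // too few fields: dropped
                "t,,tcp,,,nat,");           // missing values as empty fields: kept
        System.out.println(keepValid(lines));
    }
}
```

In the streaming job, the same field-count check could presumably be applied with rdd.filter(...) before the RowFactory.create call, so that only rows matching the 7-column schema reach createDataFrame.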

Merge multiple columns in a Spark DataFrame [Java]

How to combine multiple columns (say 3) from a DataFrame in a single column (in a new DataFrame) where each row becomes a Spark DenseVector? Similar to this thread but in Java and with a few tweaks mentioned below.
I tried using a UDF like this:
private UDF3<Double, Double, Double, Row> toColumn = new UDF3<Double, Double, Double, Row>() {
private static final long serialVersionUID = 1L;
public Row call(Double first, Double second, Double third) throws Exception {
Row row = RowFactory.create(Vectors.dense(first, second, third));
return row;
}
};
And then register the UDF:
sqlContext.udf().register("toColumn", toColumn, dataType);
Where the dataType is:
StructType dataType = DataTypes.createStructType(new StructField[]{
new StructField("bla", new VectorUDT(), false, Metadata.empty()),
});
When I call this UDF on a DataFrame with 3 columns and print out the schema of the new DataFrame, I get this:
root
|-- features: struct (nullable = true)
| |-- bla: vector (nullable = false)
The problem here is that I need a vector to be outside, not within a struct.
Something like this:
root
|-- features: vector (nullable = true)
I don't know how to get this since the register function requires the return type of UDF to be DataType (which, in turn, doesn't provide a VectorType)
You actually nested the vector type into a struct manually by using this data type:
new StructField("bla", new VectorUDT(), false, Metadata.empty()),
If you remove the outer StructField, you will get what you want. Of course, in this case you also need to change the signature of your function slightly: it has to return the type Vector.
Please see below my concrete example of what I mean in the form of a simple JUnit test.
package sample.spark.test;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.VectorUDT;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.api.java.UDF3;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.junit.Test;
import java.io.Serializable;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;
public class ToVectorTest implements Serializable {
private static final long serialVersionUID = 2L;
private UDF3<Double, Double, Double, Vector> toColumn = new UDF3<Double, Double, Double, Vector>() {
private static final long serialVersionUID = 1L;
public Vector call(Double first, Double second, Double third) throws Exception {
return Vectors.dense(first, second, third);
}
};
@Test
public void testUDF() {
// context
final JavaSparkContext sc = new JavaSparkContext("local", "ToVectorTest");
final SQLContext sqlContext = new SQLContext(sc);
// test input
final DataFrame input = sqlContext.createDataFrame(
sc.parallelize(
Arrays.asList(
RowFactory.create(1.0, 2.0, 3.0),
RowFactory.create(4.0, 5.0, 6.0),
RowFactory.create(7.0, 8.0, 9.0),
RowFactory.create(10.0, 11.0, 12.0)
)),
DataTypes.createStructType(
Arrays.asList(
new StructField("feature1", DataTypes.DoubleType, false, Metadata.empty()),
new StructField("feature2", DataTypes.DoubleType, false, Metadata.empty()),
new StructField("feature3", DataTypes.DoubleType, false, Metadata.empty())
)
)
);
input.registerTempTable("input");
// expected output
final Set<Vector> expectedOutput = new HashSet<>(Arrays.asList(
Vectors.dense(1.0, 2.0, 3.0),
Vectors.dense(4.0, 5.0, 6.0),
Vectors.dense(7.0, 8.0, 9.0),
Vectors.dense(10.0, 11.0, 12.0)
));
// processing
sqlContext.udf().register("toColumn", toColumn, new VectorUDT());
final DataFrame outputDF = sqlContext.sql("SELECT toColumn(feature1, feature2, feature3) AS x FROM input");
final Set<Vector> output = new HashSet<>(outputDF.toJavaRDD().map(r -> r.<Vector>getAs("x")).collect());
// evaluation
assertEquals(expectedOutput.size(), output.size());
for (Vector x : output) {
assertTrue(expectedOutput.contains(x));
}
// show the schema and the content
System.out.println(outputDF.schema());
outputDF.show();
sc.stop();
}
}

How to create a Spark DataFrame from an Integer RDD

How can I create a DataFrame from a JavaRDD that contains Integers? I have done something like the code below, but it is not working.
List<Integer> input = Arrays.asList(101, 103, 105);
JavaRDD<Integer> inputRDD = sc.parallelize(input);
DataFrame dataframe = sqlcontext.createDataFrame(inputRDD, Integer.class);
I got ClassCastException saying org.apache.spark.sql.types.IntegerType$ cannot be cast to org.apache.spark.sql.types.StructType
How can I achieve this?
Apparently (although not intuitively), this createDataFrame overload can only work for "Bean" types, which means types that do not correspond to any built-in Spark SQL type.
You can see in the source code that the class you pass is matched with a Spark SQL type in JavaTypeInference.inferDataType, and the result is cast into a StructType (see dataType.asInstanceOf[StructType] in SQLContext.getSchema), but the built-in "primitive" types (like IntegerType) are NOT StructTypes. Looks like a bug or undocumented behavior to me.
WORKAROUNDS:
Wrap your Integers with a "bean" class (that's ugly, I know):
public static class MyBean {
final int value;
MyBean(int value) {
this.value = value;
}
public int getValue() {
return value;
}
}
List<MyBean> input = Arrays.asList(new MyBean(101), new MyBean(103), new MyBean(105));
JavaRDD<MyBean> inputRDD = sc.parallelize(input);
DataFrame dataframe = sqlcontext.createDataFrame(inputRDD, MyBean.class);
dataframe.show(); // this works...
Convert to RDD<Row> yourself:
// convert to Rows:
JavaRDD<Row> rowRdd = inputRDD.map(new Function<Integer, Row>() {
@Override
public Row call(Integer v1) throws Exception {
return RowFactory.create(v1);
}
});
// create schema (this looks nicer in Scala...):
StructType schema = new StructType(new StructField[]{new StructField("number", DataTypes.IntegerType, false, Metadata.empty())});
DataFrame dataframe = sqlcontext.createDataFrame(rowRdd, schema);
dataframe.show(); // this works...
Now in Spark 2.2 you can do the following to create a Dataset.
Dataset<Integer> dataSet = sqlContext().createDataset(javardd.rdd(), Encoders.INT());

Spark SQL: How to call UDF from DataFrame operation using JAVA

I would like to know how to call a UDF from the domain-specific language (DSL) functions of Spark SQL using Java.
I have a UDF (just for example):
UDF2<String, String, Boolean> equals = new UDF2<String, String, Boolean>() {
@Override
public Boolean call(String first, String second) throws Exception {
return first.equals(second);
}
};
I've registered it to sqlContext
sqlContext.udf().register("equals", equals, DataTypes.BooleanType);
When I run the following query, my UDF is called and I get a result.
sqlContext.sql("SELECT p0.value FROM values p0 WHERE equals(p0.value, 'someString')");
I would like to transform this query using the domain-specific language functions of Spark SQL, and I am not sure how to do it.
valuesDF.select("value").where(???);
I found that there is a callUDF() function, but one of its parameters is a Function2 fnctn, not a UDF2.
How can I use UDF and functions from DSL?
I found a solution with which I am half-satisfied.
It is possible to call UDF as a Column Condition such as:
valuesDF.filter("equals(columnName, 'someString')").select("columnName");
But I still wonder if it is possible to call UDF directly.
Edit:
Btw, it is possible to call udf directly e.g:
df.where(callUdf("equals", scala.collection.JavaConversions.asScalaBuffer(
Arrays.asList(col("columnName"), col("otherColumnName"))
).seq())).select("columnName");
An import of org.apache.spark.sql.functions is required.
When querying a dataframe, you should just be able to execute the UDF using something like this:
sourceDf.filter(equals(col("columnName"), "someString")).select("columnName")
where col("columnName") is the column you want to compare.
Here is a working code example. It works with Spark 1.5.x and 1.6.x. The trick to calling UDFs from within a pipeline transformer is to use the sqlContext() on the DataFrame to register your UDF.
@Test
public void test() {
// https://issues.apache.org/jira/browse/SPARK-12484
logger.info("BEGIN");
DataFrame df = createData();
final String tableName = "myTable";
sqlContext.registerDataFrameAsTable(df, tableName);
logger.info("print schema");
df.printSchema();
logger.info("original data before we applied UDF");
df.show();
MyUDF udf = new MyUDF();
final String udfName = "myUDF";
sqlContext.udf().register(udfName, udf, DataTypes.StringType);
String fmt = "SELECT *, %s(%s) as transformedByUDF FROM %s";
String stmt = String.format(fmt, udfName, tableName+".labelStr", tableName);
logger.info("AEDWIP stmt:{}", stmt);
DataFrame udfDF = sqlContext.sql(stmt);
Row[] results = udfDF.head(3);
for (Row row : results) {
logger.info("row returned by applying UDF {}", row);
}
logger.info("AEDWIP udfDF schema");
udfDF.printSchema();
logger.info("AEDWIP udfDF data");
udfDF.show();
logger.info("END");
}
DataFrame createData() {
Features f1 = new Features(1, category1);
Features f2 = new Features(2, category2);
ArrayList<Features> data = new ArrayList<Features>(2);
data.add(f1);
data.add(f2);
//JavaRDD<Features> rdd = javaSparkContext.parallelize(Arrays.asList(f1, f2));
JavaRDD<Features> rdd = javaSparkContext.parallelize(data);
DataFrame df = sqlContext.createDataFrame(rdd, Features.class);
return df;
}
class MyUDF implements UDF1<String, String> {
private static final long serialVersionUID = 1L;
@Override
public String call(String s) throws Exception {
logger.info("AEDWIP s:{}", s);
String ret = s.equalsIgnoreCase(category1) ? category1 : category3;
return ret;
}
}
public class Features implements Serializable{
private static final long serialVersionUID = 1L;
int id;
String labelStr;
Features(int id, String l) {
this.id = id;
this.labelStr = l;
}
public int getId() {
return id;
}
public void setId(int id) {
this.id = id;
}
public String getLabelStr() {
return labelStr;
}
public void setLabelStr(String labelStr) {
this.labelStr = labelStr;
}
}
This is the output:
+---+--------+
| id|labelStr|
+---+--------+
| 1| noise|
| 2| ack|
+---+--------+
root
|-- id: integer (nullable = false)
|-- labelStr: string (nullable = true)
|-- transformedByUDF: string (nullable = true)
+---+--------+----------------+
| id|labelStr|transformedByUDF|
+---+--------+----------------+
| 1| noise| noise|
| 2| ack| signal|
+---+--------+----------------+

Resources