I have the below Spark schema defined
StructType state = DataTypes.createStructType(
new StructField[] {
DataTypes.createStructField("version", DataTypes.IntegerType, false),
DataTypes.createStructField("value", DataTypes.StringType, false)
});
ArrayType relationship = DataTypes.createArrayType(DataTypes.createStructType(
new StructField[] {
DataTypes.createStructField("cid", DataTypes.StringType, false),
DataTypes.createStructField("state", state, false),
}));
StructType cr = DataTypes.createStructType(
new StructField[] {
DataTypes.createStructField("cmg", relationship, false)
});
StructType schema = DataTypes.createStructType(
new StructField[] {
DataTypes.createStructField("cr", cr, false)
});
If I create the dataframe as
Row r1 = RowFactory.create("{cr:{cmg:[{cid:\"B06XW5BXJZ\",state:{version:19,value:\"approved\"}}]}}");
List<Row> rowList = ImmutableList.of(r1);
Dataset<Row> df = spark.sqlContext().createDataFrame(rowList, schema);
The code gives the error below:
The value ({cr:{cmg:[{cid:"B06XW5BXJZ",state:{version:19,value:"approved"}}]}}) of the type (java.lang.String) cannot be converted to struct<cmg:array<struct<cid:string,state:struct<version:int,value:string>>>>
What am I missing?
When you execute createDataFrame(rowList, schema), Spark tries to interpret the content of each element in rowList using the provided schema.
However, the values in rowList are plain strings, not structured objects, so Spark is unable to apply the schema.
You have several options for loading that object into a DataFrame in structured form.
Load the data as a JSON string and use Spark to parse it
String jsonRow = "{cr:{cmg:[{cid:\"B06XW5BXJZ\",state:{version:19,value:\"approved\"}}]}}";
Dataset<Row> df = spark.createDataset(List.of(jsonRow), Encoders.STRING())
.select(functions.from_json(functions.col("value"), schema, Map.of("allowUnquotedFieldNames", "true")));
In this case it first creates a Dataset<String> in which each row contains a single String column (value), and then uses the from_json Spark SQL function to parse the JSON using your schema.
Also note the use of the allowUnquotedFieldNames=true option, required because the field names in the input string are not quoted.
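If you then want to work with the nested fields directly, you can alias the parsed struct and expand it. A small sketch building on the snippet above (the parsed alias is just an illustrative name):
Dataset<Row> expanded = spark.createDataset(List.of(jsonRow), Encoders.STRING())
        .select(functions.from_json(functions.col("value"), schema,
                Map.of("allowUnquotedFieldNames", "true")).alias("parsed"))
        .select("parsed.*");                           // expands the struct into the top-level "cr" column
expanded.select(functions.col("cr.cmg")).show(false);  // reach into the nested array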
Manually create structured rows and load them into a DataFrame
Row structuredRow = RowFactory.create(RowFactory.create(List.of(RowFactory.create("B06XW5BXJZ", RowFactory.create(19, "approved")))));
Dataset<Row> df = spark.createDataFrame(List.of(structuredRow), schema);
This extends your initial attempt to use the RowFactory to manually create the rows. The rows must reflect the structure defined in the schema (or rather, the schema must respect the structure of the rows).
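To sanity-check that the manually built rows line up with the schema, you can print both:
df.printSchema();   // the nested struct layout should match the schema defined above
df.show(false);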
Use a custom Java bean class
Class definitions
public static class State implements Serializable {
private Integer version;
private String value;
// getters, setters, constructors
}
public static class Relationship implements Serializable {
private String cid;
private State state;
// getters, setters, constructors
}
public static class Cr implements Serializable {
private List<Relationship> cmg;
// getters, setters, constructors
}
public static class RowBean implements Serializable {
private Cr cr;
// getters, setters, constructors
}
Use the bean class to create a Dataset
RowBean row = new RowBean(new Cr(List.of(new Relationship("B06XW5BXJZ", new State(19, "approved")))));
Dataset<RowBean> ds = spark.createDataset(List.of(row), Encoders.bean(RowBean.class));
In this case, using a custom Java bean (or Scala case class), the schema is extracted directly from the class structure by Encoders.bean().
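For reference, one of the bean classes written out in full might look like the sketch below. Encoders.bean() typically relies on a public no-argument constructor plus getters and setters, so all three are spelled out here:
public static class State implements Serializable {
    private Integer version;
    private String value;

    public State() { }                                   // no-arg constructor used by the encoder
    public State(Integer version, String value) {
        this.version = version;
        this.value = value;
    }
    public Integer getVersion() { return version; }
    public void setVersion(Integer version) { this.version = version; }
    public String getValue() { return value; }
    public void setValue(String value) { this.value = value; }
}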
Related
Encoder<Transaction> encoder = Encoders.bean(Transaction.class);
Dataset<Transaction> transactionDS = sparkSession
.read()
.format("csv")
.option("header", true)
.option("delimiter", ",")
.option("enforceSchema", false)
.option("multiLine", false)
.schema(encoder.schema())
.load("s3a://xxx/testSchema.csv");
.as(encoder);
System.out.println("==============schema starts============");
transactionDS.printSchema();
System.out.println("==============schema ends============");
transactionDS.show(10, true); // this is the line that bombs.
My CSV is this -
transactionId,accountId
1,2
10,44
I'm printing my schema in the logs - (you see, the columns are now flipped, or sorted - Ah!)
==============schema starts============
root
|-- accountId: integer (nullable = true)
|-- transactionId: long (nullable = true)
==============schema ends============
I'm getting the below error
Caused by: java.lang.IllegalArgumentException: CSV header does not conform to the schema.
Header: transactionId, accountId
Schema: accountId, transactionId
Expected: accountId but found: transactionId
This is what my Transaction class looks like.
public class Transaction implements Serializable {
private static final long serialVersionUID = 7648268336292069686L;
private Long transactionId;
private Integer accountId;
public Long getTransactionId() {
return transactionId;
}
public void setTransactionId(Long transactionId) {
this.transactionId = transactionId;
}
public Integer getAccountId() {
return accountId;
}
public void setAccountId(Integer accountId) {
this.accountId = accountId;
}
}
Question - Why is Spark not able to match my schema? The ordering is messed up. In my CSV I'm passing transactionId, accountId, but Spark takes my schema as accountId, transactionId. Ah!
Do not use encoder.schema() to load a CSV file; its column order may not match the CSV.
Unlike Parquet, CSV doesn't carry a schema, so Spark will not apply the correct order. What you can do is read the CSV without:
.schema(encoder.schema())
Then apply the schema to the dataset that you just created.
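A minimal sketch of that idea with the same Transaction bean: read the CSV letting Spark take the columns in file order, cast them to the bean's field types, and only then attach the encoder (as(encoder) matches columns by name, not by position):
Encoder<Transaction> encoder = Encoders.bean(Transaction.class);

Dataset<Transaction> transactionDS = sparkSession
    .read()
    .option("header", true)
    .csv("s3a://xxx/testSchema.csv")                        // columns arrive as strings, in CSV order
    .select(functions.col("transactionId").cast("long"),
            functions.col("accountId").cast("int"))         // cast to the bean's field types
    .as(encoder);                                           // matched by column name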
This is what I ended up doing -
Encoder<Transaction> encoder = Encoders.bean(Transaction.class);
// read data from S3
System.out.println("Going to read file......................................");
Dataset<Transaction> transactionDS = sparkSession
.read()
.format("csv")
.option("header", true)
.option("delimiter", ",")
//.option("enforceSchema", false)
.option("inferSchema", false)
.option("dateFormat", "yyyy-MM-dd")
//.option("multiLine", false)
//.schema(encoder.schema())
.schema(_createSchema())
.csv("s3a://xxx/transactions_4_with_column_names.csv")
.as(encoder);
The _createSchema() function is below -
private static StructType _createSchema() {
List<StructField> list = new ArrayList<StructField>() {
private static final long serialVersionUID = -4953991596584287923L;
{
add(DataTypes.createStructField("transactionId", DataTypes.LongType, true));
add(DataTypes.createStructField("accountId", DataTypes.IntegerType, true));
add(DataTypes.createStructField("destAccountId", DataTypes.IntegerType, true));
add(DataTypes.createStructField("destPostDate", DataTypes.DateType, true));
}
};
return new StructType(list.toArray(new StructField[0]));
}
I have the following class that reads CSV data into a Spark Dataset. Everything works fine if I just simply read and return the data.
However, if I apply a MapFunction to the data before returning from function, I get
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
Caused by: java.io.NotSerializableException: com.Workflow.
I understand how Spark works and its need to serialize objects for distributed processing; however, I'm NOT using any reference to the Workflow class in my mapping logic. I'm not calling any Workflow class function in my mapping logic. So why is Spark trying to serialize the Workflow class? Any help will be appreciated.
public class Workflow {
private final SparkSession spark;
private final String dataPath;
public Dataset<Row> readData(){
final StructType schema = new StructType()
.add("text", "string", false)
.add("category", "string", false);
Dataset<Row> data = spark.read()
.schema(schema)
.csv(dataPath);
/*
* works fine till here if I call
* return data;
*/
Dataset<Row> cleanedData = data.map(new MapFunction<Row, Row>() {
public Row call(Row row){
/* some mapping logic */
return row;
}
}, RowEncoder.apply(schema));
cleanedData.printSchema();
/* .... ERROR .... */
cleanedData.show();
return cleanedData;
}
}
Anonymous inner classes have a hidden/implicit reference to the enclosing class. Use a lambda expression or go with Roma Anankin's solution.
You could make Workflow implement Serializable and mark the SparkSession field as transient.
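A minimal sketch of the lambda variant mentioned above (the cast to MapFunction is needed because Dataset.map is overloaded in the Java API; a lambda does not capture the enclosing Workflow instance):
Dataset<Row> cleanedData = data.map(
        (MapFunction<Row, Row>) row -> {
            /* some mapping logic */
            return row;
        },
        RowEncoder.apply(schema));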
We are currently exploring Apache Spark (with Hadoop) for performing large scale
data transformation (in Java).
We are using the new looking (and experimental) DataSourceV2 interfaces to build our custom
output data files. A component of this is an implementation of the org.apache.spark.sql.sources.v2.writer.DataWriter
interface. It all works beautifully, except for one problem:
The org.apache.spark.sql.sources.v2.writer.DataWriter.write(record) method is often (but not always)
called twice for the same input record.
Here is what I hope is enough code for you to get the gist of what we're doing:
Basically we have many large sets of input data that we land via a Spark application
into Hadoop tables using code that looks something like:
final Dataset<Row> jdbcTableDataset = sparkSession.read()
.format("jdbc")
.option("url", sqlServerUrl)
.option("dbtable", tableName)
.option("user", jdbcUser)
.option("password", jdbcPassword)
.load();
final DataFrameWriter<Row> dataFrameWriter = jdbcTableDataset.write();
dataFrameWriter.save(hdfsDestination + "/" + tableName);
There are roughly fifty of these tables, for what it is worth. I know that there are no duplicates
in the data because jdbcTableDataset.count() and jdbcTableDataset.distinct().count()
return the same value.
The transformation process involves performing join operations on these tables and writing
the result to files in the (shared) file system in a custom format. The resulting rows contain a unique key,
a dataGroup column, a dataSubGroup column and about 40 other columns. The selected records are
ordered by dataGroup, dataSubGroup and key.
Each output file is distinguished by the dataGroup column, which is used to partition the write operation:
final Dataset<Row> selectedData = dataSelector.selectData();
selectedData
.write()
.partitionBy("dataGroup")
.format("au.com.mycompany.myformat.DefaultSource")
.save("/path/to/shared/directory/");
To give you an idea of the scale, the resulting selected data consists of fifty-sixty million
records, unevenly split between roughly 3000 dataGroup files. Large, but not enormous.
The partitionBy("dataGroup") neatly ensures that each dataGroup file is processed by a
single executor. So far so good.
My datasource implements the new looking (and experimental) DataSourceV2 interface:
package au.com.mycompany.myformat;
import java.io.Serializable;
import java.util.Optional;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.sources.DataSourceRegister;
import org.apache.spark.sql.sources.v2.DataSourceOptions;
import org.apache.spark.sql.sources.v2.WriteSupport;
import org.apache.spark.sql.sources.v2.writer.DataSourceWriter;
import org.apache.spark.sql.types.StructType;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class DefaultSource implements DataSourceRegister, WriteSupport, Serializable {
private static final Logger logger = LoggerFactory.getLogger(DefaultSource.class);
public DefaultSource() {
logger.info("created");
}
@Override
public String shortName() {
logger.info("shortName");
return "myformat";
}
@Override
public Optional<DataSourceWriter> createWriter(String writeUUID, StructType schema, SaveMode mode, DataSourceOptions options) {
return Optional.of(new MyFormatSourceWriter(writeUUID, schema, mode, options));
}
}
There's a DataSourceWriter implementation:
public class MyFormatSourceWriter implements DataSourceWriter, Serializable {
...
}
and a DataSourceWriterFactory implementation:
public class MyDataWriterFactory implements DataWriterFactory<InternalRow> {
...
}
and finally a DataWriter implementation. It seems that a DataWriter is created and sent to
each executor. Therefore each DataWriter will process many of the dataGroups.
Each record has a unique key column.
public class MyDataWriter implements DataWriter<InternalRow>, Serializable {
private static final Logger logger = LoggerFactory.getLogger(MyDataWriter.class);
...
MyDataWriter(File buildDirectory, StructType schema, int partitionId) {
this.buildDirectory = buildDirectory;
this.schema = schema;
this.partitionId = partitionId;
logger.debug("Created MyDataWriter for partition {}", partitionId);
}
private String getFieldByName(InternalRow row, String fieldName) {
return Optional.ofNullable(row.getUTF8String(schema.fieldIndex(fieldName)))
.orElse(UTF8String.EMPTY_UTF8)
.toString();
}
/**
* Rows are written here. Each row has a unique key column as well as a dataGroup
* column. Right now we are frequently getting called with the same record twice.
*/
@Override
public void write(InternalRow record) throws IOException {
String nextDataFileName = getFieldByName(record, "dataGroup") + ".myExt";
// some non-trivial logic for determining the right output file
...
// write the output record
outputWriter.append(getFieldByName(record, "key")).append(',')
.append(getFieldByName(record, "prodDate")).append(',')
.append(getFieldByName(record, "nation")).append(',')
.append(getFieldByName(record, "plant")).append(',')
...
}
@Override
public WriterCommitMessage commit() throws IOException {
...
outputWriter.close();
...
logger.debug("Committed partition {} with {} data files for zip file {} for a total of {} zip files",
partitionId, dataFileCount, dataFileName, dataFileCount);
return new MyWriterCommitMessage(partitionId, dataFileCount);
}
@Override
public void abort() throws IOException {
logger.error("Failed to collect data for schema: {}", schema);
...
}
}
Right now I'm working around this by keeping track of the last key that was processed and ignoring
duplicates.
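A rough sketch of that guard inside the DataWriter (assuming duplicate deliveries of the same key arrive back to back, which the ordering by dataGroup, dataSubGroup and key should make true):
private String lastKey = null;

@Override
public void write(InternalRow record) throws IOException {
    String key = getFieldByName(record, "key");
    if (key.equals(lastKey)) {
        return;                    // already written this record, skip the duplicate call
    }
    lastKey = key;
    // ... existing logic that picks the output file and appends the record ...
}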
I have written a Kafka producer that tails the contents of a log file (format: CSV). The Kafka consumer is a streaming application that creates a JavaDStream.
Using the foreachRDD method, I'm splitting each line of the file on the delimiter ',' and creating Row objects. I have specified a schema that has 7 columns.
Then I am creating a DataFrame using the JavaRDD and the schema.
But the problem here is that not all the rows in the log file have the same number of columns.
Thus, is there any way to filter out the rows that do not satisfy the schema, or to create the schema dynamically based on the row content?
Following is the part of the code:
JavaDStream<String> msgDataStream =directKafkaStream.map(new Function<Tuple2<String, String>, String>() {
@Override
public String call(Tuple2<String, String> tuple2) {
return tuple2._2();
}
});
msgDataStream.foreachRDD(new VoidFunction<JavaRDD<String>>() {
@Override
public void call(JavaRDD<String> rdd) {
JavaRDD<Row> rowRDD = rdd.map(new Function<String, Row>() {
@Override
public Row call(String msg) {
String[] splitMsg=msg.split(",");
Object[] vals = new Object[splitMsg.length];
for(int i=0;i<splitMsg.length;i++)
{
vals[i]=splitMsg[i].replace("\"","").trim();
}
Row row = RowFactory.create(vals);
return row;
}
});
//Create Schema
StructType schema = DataTypes.createStructType(new StructField[] {
DataTypes.createStructField("timeIpReq", DataTypes.StringType, true),DataTypes.createStructField("SrcMac", DataTypes.StringType, true),
DataTypes.createStructField("Proto", DataTypes.StringType, true),DataTypes.createStructField("ACK", DataTypes.StringType, true),
DataTypes.createStructField("srcDst", DataTypes.StringType, true),DataTypes.createStructField("NATSrcDst", DataTypes.StringType, true),
DataTypes.createStructField("len", DataTypes.StringType, true)});
//Get Spark 2.0 session
Dataset<Row> msgDataFrame = session.createDataFrame(rowRDD, schema);
A simple way to remove rows that do not match the expected schema is to use flatMap with an Option type. Also, if your target is to build a DataFrame, we can use the same flatMap step to apply a schema to the data. This is facilitated in Scala by the use of case classes.
// Create Schema
case class NetInfo(timeIpReq: String, srcMac: String, proto: String, ack: String, srcDst: String, natSrcDst: String, len: String)
val netInfoStream = msgDataStream.flatMap{msg =>
val parts = msg.split(",")
if (parts.size == 7) { // filter out messages with a mismatched set of fields
val Array(time, src, proto, ack, srcDst, natSrcDst, len) = parts // use an extractor to get the different parts into variables
Some(NetInfo(time, src, proto, ack, srcDst, natSrcDst, len)) // return a valid record
} else {
None // We don't have a valid record. Return None
}
}
netInfoStream.foreachRDD{rdd =>
import sparkSession.implicits._
val df = rdd.toDF() // DataFrame transformation is possible on RDDs with a schema (based on a case class)
// do stuff with the dataframe
}
Regarding:
all the rows in the log file do not have same number of columns.
Assuming that they all represent the same kind of data but with potentially some columns missing, the right strategy would be to either filter out the incomplete data (as exemplified here) or use optional values in a defined schema if there is a deterministic way to know which fields are missing. This requirement should be posed to the upstream applications that generate the data. It's common to represent missing values in CSV with empty comma sequences (e.g. field0,,field2,,,field5).
A dynamic schema to handle per-row differences would not make sense, as there would be no way to apply it to a complete DataFrame composed of rows with different schemas.
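Since the original question is in Java, roughly the same filtering can be done there with flatMap before building the Row objects. A sketch that drops any line without exactly 7 fields, written in the anonymous-class style of the original code (needs the org.apache.spark.api.java.function.FlatMapFunction, java.util.Collections and java.util.Iterator imports):
JavaRDD<Row> rowRDD = rdd.flatMap(new FlatMapFunction<String, Row>() {
    @Override
    public Iterator<Row> call(String msg) {
        String[] parts = msg.split(",");
        if (parts.length != 7) {
            return Collections.emptyIterator();   // drop lines that don't match the 7-column schema
        }
        Object[] vals = new Object[parts.length];
        for (int i = 0; i < parts.length; i++) {
            vals[i] = parts[i].replace("\"", "").trim();
        }
        return Collections.singletonList(RowFactory.create(vals)).iterator();
    }
});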
How can I create a DataFrame from a JavaRDD containing Integers? I have done something like the below, but it is not working.
List<Integer> input = Arrays.asList(101, 103, 105);
JavaRDD<Integer> inputRDD = sc.parallelize(input);
DataFrame dataframe = sqlcontext.createDataFrame(inputRDD, Integer.class);
I got ClassCastException saying org.apache.spark.sql.types.IntegerType$ cannot be cast to org.apache.spark.sql.types.StructType
How can I achieve this?
Apparently (although not intuitively), this createDataFrame overload can only work for "Bean" types, which means types that do not correspond to any built-in Spark SQL type.
You can see in the source code that the class you pass is matched with a Spark SQL type in JavaTypeInference.inferDataType, and the result is cast into a StructType (see dataType.asInstanceOf[StructType] in SQLContext.getSchema), but the built-in "primitive" types (like IntegerType) are NOT StructTypes... Looks like a bug or undocumented behavior to me.
WORKAROUNDS:
Wrap your Integers with a "bean" class (that's ugly, I know):
public static class MyBean {
final int value;
MyBean(int value) {
this.value = value;
}
public int getValue() {
return value;
}
}
List<MyBean> input = Arrays.asList(new MyBean(101), new MyBean(103), new MyBean(105));
JavaRDD<MyBean> inputRDD = sc.parallelize(input);
DataFrame dataframe = sqlcontext.createDataFrame(inputRDD, MyBean.class);
dataframe.show(); // this works...
Convert to RDD<Row> yourself:
// convert to Rows:
JavaRDD<Row> rowRdd = inputRDD.map(new Function<Integer, Row>() {
@Override
public Row call(Integer v1) throws Exception {
return RowFactory.create(v1);
}
});
// create schema (this looks nicer in Scala...):
StructType schema = new StructType(new StructField[]{new StructField("number", IntegerType$.MODULE$, false, Metadata.empty())});
DataFrame dataframe = sqlcontext.createDataFrame(rowRdd, schema);
dataframe.show(); // this works...
Now in Spark 2.2 you can do the following to create a Dataset.
Dataset<Integer> dataSet = sqlContext().createDataset(javardd.rdd(), Encoders.INT());
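And if you don't specifically need to go through an RDD, the list can be passed straight to createDataset (a sketch assuming a SparkSession named spark; the single column is named "value" unless you rename it):
Dataset<Integer> ds = spark.createDataset(Arrays.asList(101, 103, 105), Encoders.INT());
Dataset<Row> df = ds.toDF("number");   // rename the default "value" column
df.show();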