How to create a Spark DataFrame from an Integer RDD - apache-spark

How can I create a DataFrame from a JavaRDD containing Integers? I have done something like the following, but it is not working.
List<Integer> input = Arrays.asList(101, 103, 105);
JavaRDD<Integer> inputRDD = sc.parallelize(input);
DataFrame dataframe = sqlcontext.createDataFrame(inputRDD, Integer.class);
I get a ClassCastException saying org.apache.spark.sql.types.IntegerType$ cannot be cast to org.apache.spark.sql.types.StructType.
How can I achieve this?

Apparently (although not intuitively), this createDataFrame overload only works for "bean" types, meaning types that do not correspond to any built-in Spark SQL type.
You can see this in the source code: the class you pass is matched to a Spark SQL type in JavaTypeInference.inferDataType, and the result is cast to a StructType (see dataType.asInstanceOf[StructType] in SQLContext.getSchema), but the built-in "primitive" types (like IntegerType) are NOT StructTypes. This looks like a bug or undocumented behavior to me.
WORKAROUNDS:
Wrap your Integers with a "bean" class (that's ugly, I know):
public static class MyBean {
    final int value;
    MyBean(int value) {
        this.value = value;
    }
    public int getValue() {
        return value;
    }
}
List<MyBean> input = Arrays.asList(new MyBean(101), new MyBean(103), new MyBean(105));
JavaRDD<MyBean> inputRDD = sc.parallelize(input);
DataFrame dataframe = sqlcontext.createDataFrame(inputRDD, MyBean.class);
dataframe.show(); // this works...
Convert to RDD<Row> yourself:
// convert to Rows:
JavaRDD<Row> rowRdd = inputRDD.map(new Function<Integer, Row>() {
    @Override
    public Row call(Integer v1) throws Exception {
        return RowFactory.create(v1);
    }
});
// create schema (this looks nicer in Scala...):
StructType schema = new StructType(new StructField[]{new StructField("number", IntegerType$.MODULE$, false, Metadata.empty())});
DataFrame dataframe = sqlcontext.createDataFrame(rowRdd, schema);
dataframe.show(); // this works...

Since Spark 2.2 you can do the following to create a Dataset:
Dataset<Integer> dataSet = sqlContext().createDataset(javardd.rdd(), Encoders.INT());
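For reference, a minimal end-to-end sketch of the same idea, assuming a SparkSession named spark and a JavaSparkContext named jsc (both names are illustrative):
// Build a Dataset<Integer> with Encoders.INT() and turn it into a single-column DataFrame.
List<Integer> input = Arrays.asList(101, 103, 105);
JavaRDD<Integer> javardd = jsc.parallelize(input);
Dataset<Integer> dataSet = spark.createDataset(javardd.rdd(), Encoders.INT());
dataSet.toDF("number").show(); // one IntegerType column named "number"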

Related

How to get the taskID or mapperID (something like partitionID in Spark) in a Hive UDF?

As the title says: how can I get the taskID or mapperID (something like the partitionID in Spark) in a Hive UDF?
You can access task information using TaskContext:
import org.apache.spark.TaskContext
sc.parallelize(Seq[Int](), 4).mapPartitions(_ => {
  val ctx = TaskContext.get
  val stageId = ctx.stageId
  val partId = ctx.partitionId
  val hostname = java.net.InetAddress.getLocalHost().getHostName()
  Iterator(s"Stage: $stageId, Partition: $partId, Host: $hostname")
}).collect.foreach(println)
A similar functionality has been added to PySpark in Spark 2.2.0 (SPARK-18576):
from pyspark import TaskContext
import socket

def task_info(*_):
    ctx = TaskContext()
    return ["Stage: {0}, Partition: {1}, Host: {2}".format(
        ctx.stageId(), ctx.partitionId(), socket.gethostname())]

for x in sc.parallelize([], 4).mapPartitions(task_info).collect():
    print(x)
I think this will provide the task information, including the map ID, that you are looking for.
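For completeness, a rough Java sketch of the same TaskContext approach (assuming an existing JavaSparkContext named jsc; illustrative only):
import org.apache.spark.TaskContext;
import java.net.InetAddress;
import java.util.Collections;

// Emit one line per partition describing the stage, partition and host it ran on.
jsc.parallelize(Collections.<Integer>emptyList(), 4)
        .mapPartitions(it -> {
            TaskContext ctx = TaskContext.get();
            String host = InetAddress.getLocalHost().getHostName();
            return Collections.singletonList(
                    "Stage: " + ctx.stageId() + ", Partition: " + ctx.partitionId() + ", Host: " + host
            ).iterator();
        })
        .collect()
        .forEach(System.out::println);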
I found the correct answer on my own; we can get the taskID in a Hive UDF as below:
public class TestUDF extends GenericUDF {
    private Text result = new Text();
    private String tmpStr = "";

    @Override
    public void configure(MapredContext context) {
        // get the total number of tasks
        int numTasks = context.getJobConf().getNumMapTasks();
        // get the current taskID
        String taskID = context.getJobConf().get("mapred.task.id");
        this.tmpStr = numTasks + "_h_xXx_h_" + taskID;
    }

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments)
            throws UDFArgumentException {
        return PrimitiveObjectInspectorFactory.writableStringObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) {
        result.set(this.tmpStr);
        return this.result;
    }

    @Override
    public String getDisplayString(String[] children) {
        return "RowSeq-func()";
    }
}
But this is effective only with the MapReduce execution engine; it does not work with the Spark SQL engine.
Test code as below:
add jar hdfs:///home/dp/tmp/shaw/my_udf.jar;
create temporary function seqx AS 'com.udf.TestUDF';
with core as (
select
device_id
from
test_table
where
p_date = '20210309'
and product = 'google'
distribute by
device_id
)
select
seqx() as seqs,
count(1) as cc
from
core
group by
seqx()
order by
seqs asc
With the MR engine, the query returns the task count and taskID successfully.
With the Spark engine and the same SQL, the UDF is not effective and we get nothing about the taskID.
If you run your HQL on the Spark engine, call the Hive UDF there, and really need to get the partitionId in Spark, see the code below:
import org.apache.spark.TaskContext;

public class TestUDF extends GenericUDF {
    private Text result = new Text();
    private String tmpStr = "";

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments)
            throws UDFArgumentException {
        // get the Spark partitionId
        this.tmpStr = TaskContext.getPartitionId() + "-initial-pid";
        return PrimitiveObjectInspectorFactory.writableStringObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) {
        // get the Spark partitionId
        this.tmpStr = TaskContext.getPartitionId() + "-evaluate-pid";
        result.set(this.tmpStr);
        return this.result;
    }
}
As shown above, you can get the Spark partitionId by calling TaskContext.getPartitionId() in the overridden initialize or evaluate method of the UDF class.
Notice: your UDF must take parameters, e.g. select my_udf(param); this causes the UDF to be initialized in multiple tasks. If your UDF has no parameters, it is initialized on the driver, and the driver has no TaskContext or partitionId, so you would get nothing.
Running the above UDF on the Spark engine, we get the partitionIds successfully.

Spark java dataframe String cannot be converted to struct

I have the following Spark schema defined:
StructType state = DataTypes.createStructType(
new StructField[] {
DataTypes.createStructField("version", DataTypes.IntegerType, false),
DataTypes.createStructField("value", DataTypes.StringType, false)
});
ArrayType relationship = DataTypes.createArrayType(DataTypes.createStructType(
new StructField[] {
DataTypes.createStructField("cid", DataTypes.StringType, false),
DataTypes.createStructField("state", state, false),
}));
StructType cr = DataTypes.createStructType(
new StructField[] {
DataTypes.createStructField("cmg", relationship, false)
});
StructType schema = DataTypes.createStructType(
new StructField[] {
DataTypes.createStructField("cr", cr, false)
});
If I create the dataframe as
Row r1 = RowFactory.create("{cr:{cmg:[{cid:\"B06XW5BXJZ\",state:{version:19,value:\"approved\"}}]}}");
List<Row> rowList = ImmutableList.of(r1);
Dataset<Row> df = spark.sqlContext().createDataFrame(rowList, schema);
The code gives the error below:
The value ({cr:{cmg:[{cid:"B06XW5BXJZ",state:{version:19,value:"approved"}}]}}) of the type (java.lang.String) cannot be converted to struct<cmg:array<struct<cid:string,state:struct<version:int,value:string>>>>
What am I missing?
When you execute createDataFrame(rowList, schema) Spark tries to interpret the content of each element in rowList using the provided schema.
However, the values in rowList are strings, and not structured objects, so Spark is unable to apply the schema.
You have various options to load that object into a dataframe in structured form.
Load the data as a JSON string and use Spark to parse it
String jsonRow = "{cr:{cmg:[{cid:\"B06XW5BXJZ\",state:{version:19,value:\"approved\"}}]}}";
Dataset<Row> df = spark.createDataset(List.of(jsonRow), Encoders.STRING())
.select(functions.from_json(functions.col("value"), schema, Map.of("allowUnquotedFieldNames", "true")));
In this case it first creates a Dataset<String> in which each row contains a single String column (value), and then uses the from_json Spark SQL function to parse the JSON using your schema.
Also note the use of the allowUnquotedFieldNames=true option, required because the field names in the input string are not quoted.
Manually create structured rows and load them in a Dataframe
Row structuredRow = RowFactory.create(RowFactory.create(List.of(RowFactory.create("B06XW5BXJZ", RowFactory.create(19, "approved")))));
Dataset<Row> df = spark.createDataFrame(List.of(structuredRow), schema);
This extends your initial attempt to use the RowFactory to manually create the rows. The rows must reflect the structure defined in the schema (or rather, the schema must respect the structure of the rows).
Use a custom Java bean class
Class definitions
public static class State implements Serializable {
private Integer version;
private String value;
// getters, setters, constructors
}
public static class Relationship implements Serializable {
private String cid;
private State state;
// getters, setters, constructors
}
public static class Cr implements Serializable {
private List<Relationship> cmg;
// getters, setters, constructors
}
public static class RowBean implements Serializable {
private Cr cr;
// getters, setters, constructors
}
Use the bean class to create a Dataset
RowBean row = new RowBean(new Cr(List.of(new Relationship("B06XW5BXJZ", new State(19, "approved")))));
Dataset<RowBean> ds = spark.createDataset(List.of(row), Encoders.bean(RowBean.class));
In this case, using a custom Java bean / Scala case class, the schema is extracted directly from the class structure using Encoders.bean().

Cast spark Row to String

I want to read a timestamp column from a Spark Dataset and cast it to a String using an appropriate format. The code looks like this:
import static org.apache.spark.sql.functions.*;
...
String result;
for (Row groupedRow : datasetGrouped.collectAsList()) {
    for (StructField sf : groupedRow.schema().fields()) {
        result = getDatasetFromRow(groupedRow)
                .withColumn("fieldName", functions.date_format(col("fieldToGet"), "dd.MM.yyyy"))
                .collectAsList().stream().findFirst().get().getAs("fieldName");
    }
}
...
private static Dataset<Row> getDatasetFromRow(Row row) {
    List<Row> rowListToGetDataset = new ArrayList<>();
    List<String> strListToGetDataset = new ArrayList<>();
    for (StructField sf : row.schema().fields()) {
        strListToGetDataset.add(row.getAs(sf.name()));
    }
    rowListToGetDataset.add(RowFactory.create(strListToGetDataset.toArray()));
    return SparkService.sqlContext().createDataFrame(rowListToGetDataset, row.schema());
}
This is ugly, and I'm looking for a solution that doesn't create an additional Dataset just to cast the timestamp field to a String in the format I need.
The app uses the Java Spark API, so please give any suggestions in Java.
Spark ver: 2.3.1
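Not a definitive answer, but a sketch of a common approach: apply date_format once to the whole Dataset instead of rebuilding a Dataset per Row. The column names "fieldToGet" and "fieldName" come from the question; the Dataset name ds is illustrative.
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.date_format;

// Add a String column holding the formatted timestamp; no per-row Dataset is needed.
Dataset<Row> formatted = ds.withColumn("fieldName", date_format(col("fieldToGet"), "dd.MM.yyyy"));
// Individual values can then be read back as Strings if required:
String first = formatted.select("fieldName").first().getString(0);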

Stop Ordering of dataset columns while writing to CSV

I want to write a dataset to a CSV file, but I don't want the columns to be sorted in ascending order (or any order, for that matter).
For example, the model is: String id; String name; String age; +300 more fields.
The CSV produced has the schema age, name, id, +300 more columns in alphabetical order,
but I want the CSV to keep the same ordering as the model.
I could have used .select() or .selectExpr(), but then I would have to list 300+ fields.
Is there any other easier way?
Currently using:
dataset.toDF().coalesce(1).selectExpr("templateId","batchId", +300 more fields ).write().format("com.databricks.spark.csv").option("nullValue","").mode(SaveMode.Overwrite).save(path);
A workaround I followed for the above question:
added the fields to a properties file (column.properties) under a single key, with the fields comma-separated.
loaded that properties file into a broadcast map.
used the broadcast map in the .selectExpr() method.
Code for loading properties file in broadcast map:
public static Map<String, String> getColumnMap() {
    // prop and colMap may also be class-level fields in the original code
    Properties prop = new Properties();
    Map<String, String> colMap = new HashMap<>();
    String propFileName = "column.properties";
    InputStream inputStream =
            ConfigurationLoader.class.getClassLoader().getResourceAsStream(propFileName);
    if (inputStream != null) {
        try {
            prop.load(inputStream);
            colMap = (Map) prop;
        } catch (IOException e) {
            // handle exception
        }
    }
    return colMap;
}
JavaSparkContext sc = JavaSparkContext.fromSparkContext(sparkSession.sparkContext());
Broadcast<Map<String, String>> broadcastProperty = sc.broadcast(propertiesMap);
Code for writing to CSV file:
dataset.toDF().coalesce(1).selectExpr(broadcastColumn.getValue().get(TemplateConstants.COLUMN).split(",")).write().format(ApplicationConstants.CSV_FORMAT).option(ApplicationConstants.NULL_VALUE, "").mode(SaveMode.Overwrite).save(path);
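As an alternative sketch: if the model class's declared field order is acceptable as the source of truth, the column list could be derived by reflection instead of a properties file. The Model class and the path variable are illustrative, and getDeclaredFields() order is not guaranteed by the JVM specification, so the properties-file approach above remains the safer option.
import java.lang.reflect.Field;
import java.util.Arrays;

// Derive the column order from the model's declared fields (assumed, not guaranteed, to match declaration order).
String[] orderedCols = Arrays.stream(Model.class.getDeclaredFields())
        .map(Field::getName)
        .toArray(String[]::new);
dataset.toDF().selectExpr(orderedCols).coalesce(1)
        .write().format("com.databricks.spark.csv")
        .option("nullValue", "")
        .mode(SaveMode.Overwrite)
        .save(path);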

Spark dataframe from dynamic schema Or filter out rows that do not satisfy the schema

I have written a Kafka producer that tails the contents of a log file (format: CSV). The Kafka consumer is a streaming application that creates a JavaDStream.
Using the foreachRDD method, I split each line of the file on the delimiter ',' and create a Row object. I have specified a schema that has 7 columns.
Then I create a DataFrame using the JavaRDD<Row> and the schema.
But the problem is that not all the rows in the log file have the same number of columns.
Thus, is there any way to filter out rows that do not satisfy the schema, or to create the schema dynamically based on the row content?
Following is the part of the code:
JavaDStream<String> msgDataStream = directKafkaStream.map(new Function<Tuple2<String, String>, String>() {
    @Override
    public String call(Tuple2<String, String> tuple2) {
        return tuple2._2();
    }
});
msgDataStream.foreachRDD(new VoidFunction<JavaRDD<String>>() {
    @Override
    public void call(JavaRDD<String> rdd) {
        JavaRDD<Row> rowRDD = rdd.map(new Function<String, Row>() {
            @Override
            public Row call(String msg) {
                String[] splitMsg = msg.split(",");
                Object[] vals = new Object[splitMsg.length];
                for (int i = 0; i < splitMsg.length; i++) {
                    vals[i] = splitMsg[i].replace("\"", "").trim();
                }
                Row row = RowFactory.create(vals);
                return row;
            }
        });
        // Create schema
        StructType schema = DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField("timeIpReq", DataTypes.StringType, true),
            DataTypes.createStructField("SrcMac", DataTypes.StringType, true),
            DataTypes.createStructField("Proto", DataTypes.StringType, true),
            DataTypes.createStructField("ACK", DataTypes.StringType, true),
            DataTypes.createStructField("srcDst", DataTypes.StringType, true),
            DataTypes.createStructField("NATSrcDst", DataTypes.StringType, true),
            DataTypes.createStructField("len", DataTypes.StringType, true)
        });
        // Get Spark 2.0 session
        Dataset<Row> msgDataFrame = session.createDataFrame(rowRDD, schema);
A simple way to remove rows that do not match the expected schema is to use flatMap with an Option type. Also, if your goal is to build a DataFrame, we can use the same flatMap step to apply a schema to the data. This is facilitated in Scala by the use of case classes.
// Create Schema
case class NetInfo(timeIpReq: String, srcMac: String, proto: String, ack: String, srcDst: String, natSrcDst: String, len: String)
val netInfoStream = msgDataStream.flatMap { msg =>
  val parts = msg.split(",")
  if (parts.size == 7) { // filter out messages with a non-matching set of fields
    val Array(time, src, proto, ack, srcDst, natSrcDst, len) = parts // use an extractor to bind the parts to variables
    Some(NetInfo(time, src, proto, ack, srcDst, natSrcDst, len)) // return a valid record
  } else {
    None // we don't have a valid record; return None
  }
}
netInfoStream.foreachRDD { rdd =>
  import sparkSession.implicits._
  val df = rdd.toDF() // DataFrame conversion is possible for RDDs with a schema (based on a case class)
  // do stuff with the dataframe
}
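Since the question uses the Java API, here is a rough Java sketch of the same filtering idea (reusing rdd, schema, and session from the question's code; illustrative only, not a drop-in replacement):
import java.util.Collections;

// Drop lines that do not split into exactly 7 fields, then build Rows and apply the schema.
JavaRDD<Row> rowRDD = rdd.flatMap((String msg) -> {
    String[] parts = msg.split(",");
    if (parts.length != 7) {
        return Collections.<Row>emptyIterator(); // skip rows that do not satisfy the schema
    }
    Object[] vals = new Object[parts.length];
    for (int i = 0; i < parts.length; i++) {
        vals[i] = parts[i].replace("\"", "").trim();
    }
    return Collections.singletonList(RowFactory.create(vals)).iterator();
});
Dataset<Row> msgDataFrame = session.createDataFrame(rowRDD, schema);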
Regarding:
all the rows in the log file do not have same number of columns.
Assuming that they all represent the same kind of data but with some columns potentially missing, the right strategy would be either to filter out the incomplete data (as exemplified here) or to use optional values in a defined schema, if there is a deterministic way to know which fields are missing. This requirement should be posed to the upstream applications that generate the data. It is common to represent missing values in CSV with empty comma sequences (e.g. field0,,field2,,,field5).
A dynamic schema to handle per-row differences would not make sense, as there would be no way to apply it to a complete DataFrame composed of rows with different schemas.
