Cast spark Row to String - apache-spark

I want to read a timestamp column from a Spark Dataset and cast it to a String in an appropriate format. The code looks like this:
import static org.apache.spark.sql.functions.*;
...
String result;
for (Row groupedRow : datasetGrouped.collectAsList()) {
    for (StructField sf : groupedRow.schema().fields()) {
        result = getDatasetFromRow(groupedRow)
                .withColumn("fieldName", functions.date_format(col("fieldToGet"), "dd.MM.yyyy"))
                .collectAsList().stream().findFirst().get().getAs("fieldName");
    }
}
...
private static Dataset<Row> getDatasetFromRow(Row row) {
    List<Row> rowListToGetDataset = new ArrayList<>();
    List<String> strListToGetDataset = new ArrayList<>();
    for (StructField sf : row.schema().fields()) {
        strListToGetDataset.add(row.getAs(sf.name()));
    }
    rowListToGetDataset.add(RowFactory.create(strListToGetDataset.toArray()));
    return SparkService.sqlContext().createDataFrame(rowListToGetDataset, row.schema());
}
This is ugly, and I'm looking for a solution that doesn't create an additional Dataset just to cast the timestamp field to a String in the format I need.
The app uses the Java Spark API, so any suggestions in Java, please.
Spark ver: 2.3.1
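One way to avoid building the extra Dataset is to format the timestamp on the driver, straight from the collected Row, using java.time; a minimal sketch, assuming "fieldToGet" is the TimestampType column from the code above:
import java.sql.Timestamp;
import java.time.format.DateTimeFormatter;

import org.apache.spark.sql.Row;

// Spark hands back TimestampType values as java.sql.Timestamp via Row.getAs(...)
private static final DateTimeFormatter DATE_FORMAT = DateTimeFormatter.ofPattern("dd.MM.yyyy");

private static String timestampAsString(Row row, String fieldName) {
    Timestamp ts = row.getAs(fieldName);
    return ts == null ? null : ts.toLocalDateTime().format(DATE_FORMAT);
}

// Usage with the loop from the question:
// for (Row groupedRow : datasetGrouped.collectAsList()) {
//     String result = timestampAsString(groupedRow, "fieldToGet");
// }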

Related

Custom Payload class in Python for precombine and combineAndGet in Apache Hudi And Pyspark

We are migrating our code base from spark-java to PySpark. We were handling custom aggregations for merging data using preCombine() and combineAndGetUpdateValue() and had implemented this in our Spark-Java code. Example below:
package com.paytm.sparkjobs.utils.hudi;

public class MergeMdrPayloadAndPersist extends BaseAvroPayload implements HoodieRecordPayload<MergeMdrPayloadAndPersist> {

    public static final Logger logger = LoggerFactory.getLogger(MergeMdrPayloadAndPersist.class);
    private GenericRecord record = null;

    public MergeMdrPayloadAndPersist(GenericRecord record, Comparable orderingVal) {
        super(record, orderingVal);
        this.record = record;
    }

    @Override
    public MergeMdrPayloadAndPersist preCombine(MergeMdrPayloadAndPersist mergeMdrPayloadAndPersist) {
        // custom logic for aggregations
        return new MergeMdrPayloadAndPersist(mergeMdrPayloadAndPersist.record, mergeMdrPayloadAndPersist.orderingVal);
    }

    @Override
    public Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord indexedRecord, Schema schema) throws IOException {
        // custom logic for aggregations
        MergeMdrPayloadAndPersist mergedDoc = new MergeMdrPayloadAndPersist(this.record, this.orderingVal);
        return mergedDoc.getInsertValue(schema);
    }

    @Override
    public Option<IndexedRecord> getInsertValue(Schema schema) throws IOException {
        if (this.recordBytes.length == 0) {
            return Option.empty();
        } else {
            IndexedRecord indexedRecord = HoodieAvroUtils.bytesToAvro(this.recordBytes, schema);
            return this.isDeleteRecord((GenericRecord) indexedRecord) ? Option.empty() : Option.of(indexedRecord);
        }
    }

    private boolean isDeleteRecord(GenericRecord genericRecord) {
        Object deleteMarker = genericRecord.get("_hoodie_is_deleted");
        return deleteMarker instanceof Boolean && (Boolean) deleteMarker;
    }
}
How do we write a custom payload class/function in Python to handle our aggregation and merging logic? Some code examples would help.
There is no way to achieve this with PySpark. Hudi doesn't have its own Python API; it uses the Spark Python API, which is based on Py4J, to interact with its Java/Scala classes, and you cannot create a Java class through Py4J because the class has to exist before the Java code is compiled.
The best way is to create a small Java jar containing your classes and add it to your PySpark shell/submit.
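To illustrate that jar-based approach, here is a hedged sketch (shown with the Java writer API; the same options apply from pyspark once the jar is passed via --jars): the custom payload class above is referenced through Hudi's hoodie.datasource.write.payload.class option. The table name, record key, precombine field and paths below are assumptions.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class HudiPayloadWriteExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("hudi-custom-payload-example")
                .getOrCreate();

        // Hypothetical input; in practice this is the DataFrame being upserted.
        Dataset<Row> df = spark.read().parquet("/tmp/mdr_input");

        df.write()
          .format("hudi")
          .option("hoodie.table.name", "mdr_table")                  // assumed table name
          .option("hoodie.datasource.write.recordkey.field", "id")   // assumed key field
          .option("hoodie.datasource.write.precombine.field", "ts")  // assumed ordering field
          .option("hoodie.datasource.write.payload.class",
                  "com.paytm.sparkjobs.utils.hudi.MergeMdrPayloadAndPersist")
          .mode(SaveMode.Append)
          .save("/tmp/hudi/mdr_table");                              // assumed base path
    }
}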

How to get the taskID or mapperID(something like partitionID in Spark) in a hive UDF?

As the title says: how do I get the task ID or mapper ID (something like the partition ID in Spark) in a Hive UDF?
You can access task information using TaskContext:
import org.apache.spark.TaskContext

sc.parallelize(Seq[Int](), 4).mapPartitions(_ => {
  val ctx = TaskContext.get
  val stageId = ctx.stageId
  val partId = ctx.partitionId
  val hostname = java.net.InetAddress.getLocalHost().getHostName()
  Iterator(s"Stage: $stageId, Partition: $partId, Host: $hostname")
}).collect.foreach(println)
Similar functionality was added to PySpark in Spark 2.2.0 (SPARK-18576):
from pyspark import TaskContext
import socket

def task_info(*_):
    ctx = TaskContext()
    return ["Stage: {0}, Partition: {1}, Host: {2}".format(
        ctx.stageId(), ctx.partitionId(), socket.gethostname())]

for x in sc.parallelize([], 4).mapPartitions(task_info).collect():
    print(x)
I think this gives you the task information, including the map ID, that you are looking for.
I found the answer on my own; we can get the task ID in a Hive UDF as follows:
import org.apache.hadoop.hive.ql.exec.MapredContext;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.Text;

public class TestUDF extends GenericUDF {
    private Text result = new Text();
    private String tmpStr = "";

    @Override
    public void configure(MapredContext context) {
        // get the total number of tasks
        int numTasks = context.getJobConf().getNumMapTasks();
        // get the current task ID
        String taskID = context.getJobConf().get("mapred.task.id");
        this.tmpStr = numTasks + "_h_xXx_h_" + taskID;
    }

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        return PrimitiveObjectInspectorFactory.writableStringObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) {
        result.set(this.tmpStr);
        return this.result;
    }

    @Override
    public String getDisplayString(String[] children) {
        return "RowSeq-func()";
    }
}
But this is effective only with the MapReduce execution engine; it does not work with the Spark SQL engine.
Test code below:
add jar hdfs:///home/dp/tmp/shaw/my_udf.jar;
create temporary function seqx AS 'com.udf.TestUDF';

with core as (
    select
        device_id
    from
        test_table
    where
        p_date = '20210309'
        and product = 'google'
    distribute by
        device_id
)
select
    seqx() as seqs,
    count(1) as cc
from
    core
group by
    seqx()
order by
    seqs asc
Result on the MR engine: we get the task count and the task ID successfully. With the same SQL on the Spark engine, the UDF is not effective and we get nothing about the task ID.
If you run your HQL on the Spark engine, call the Hive UDF, and really need the partition ID in Spark, see the code below:
import org.apache.spark.TaskContext;

public class TestUDF extends GenericUDF {
    private Text result = new Text();
    private String tmpStr = "";

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        // get the Spark partition ID
        this.tmpStr = TaskContext.getPartitionId() + "-initial-pid";
        return PrimitiveObjectInspectorFactory.writableStringObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) {
        // get the Spark partition ID
        this.tmpStr = TaskContext.getPartitionId() + "-evaluate-pid";
        result.set(this.tmpStr);
        return this.result;
    }
}
As above, you can get the Spark partition ID by calling TaskContext.getPartitionId() in the overridden initialize or evaluate method of the UDF class.
Note: your UDF must take parameters, e.g. select my_udf(param); this causes the UDF to be initialized in multiple tasks. If your UDF has no parameters, it is initialized on the driver, and the driver has no TaskContext or partition ID, so you would get nothing.
Executed on the Spark engine, the UDF above returns the partition IDs successfully.

Stop Ordering of dataset columns while writing to CSV

I want to write a dataset to a CSV file, but I don't want the columns ordered in ascending order (or any order, for that matter).
For example, the model is: String id; String name; String age; plus 300 more fields.
The CSV produced has the schema age, name, id, plus 300 more columns in alphabetical order, but I want the CSV to keep the same ordering as the model.
I could have used .select() or .selectExpr(), but then I would have to list the 300+ fields.
Is there any other, easier way?
Currently using:
dataset.toDF().coalesce(1).selectExpr("templateId","batchId", +300 more fields ).write().format("com.databricks.spark.csv").option("nullValue","").mode(SaveMode.Overwrite).save(path);
A workaround I followed for the above question:
- added the fields to a properties file (column.properties) under a single key, with the fields comma-separated (a sketch of the file is shown below);
- loaded that properties file into a broadcast map;
- used the broadcast map in the .selectExpr() method.
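For illustration, the properties file could look something like this (the key name and field list are hypothetical; the code below reads the key via TemplateConstants.COLUMN):
# column.properties (hypothetical): one key, fields comma-separated in model order
column=templateId,batchId,id,name,age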
Code for loading the properties file into a broadcast map:
private static Properties prop = new Properties();
private static Map<String, String> colMap = new HashMap<>();

public static Map<String, String> getColumnMap() {
    String propFileName = "column.properties";
    InputStream inputStream =
            ConfigurationLoader.class.getClassLoader().getResourceAsStream(propFileName);
    if (inputStream != null) {
        try {
            prop.load(inputStream);
            colMap = (Map) prop;
        } catch (IOException e) {
            // handle exception
        }
    }
    return colMap;
}
JavaSparkContext sc = JavaSparkContext.fromSparkContext(sparkSession.sparkContext());
Broadcast<Map<String, String>> broadcastProperty = sc.broadcast(propertiesMap);
Code for writing to CSV file:
dataset.toDF().coalesce(1)
    .selectExpr(broadcastColumn.getValue().get(TemplateConstants.COLUMN).split(","))
    .write().format(ApplicationConstants.CSV_FORMAT)
    .option(ApplicationConstants.NULL_VALUE, "")
    .mode(SaveMode.Overwrite)
    .save(path);

How to create a Spark DataFrame from an Integer RDD

How can I create a DataFrame from a JavaRDD containing Integers? I have tried something like the code below, but it is not working.
List<Integer> input = Arrays.asList(101, 103, 105);
JavaRDD<Integer> inputRDD = sc.parallelize(input);
DataFrame dataframe = sqlcontext.createDataFrame(inputRDD, Integer.class);
I get a ClassCastException saying org.apache.spark.sql.types.IntegerType$ cannot be cast to org.apache.spark.sql.types.StructType.
How can I achieve this?
Apparently (although not intuitively), this createDataFrame overload only works for "bean" types, meaning types that do not correspond to any built-in Spark SQL type.
You can see in the source code that the class you pass is matched to a Spark SQL type in JavaTypeInference.inferDataType, and the result is cast to a StructType (see dataType.asInstanceOf[StructType] in SQLContext.getSchema). But the built-in "primitive" types (like IntegerType) are NOT StructTypes... Looks like a bug or undocumented behavior to me.
WORKAROUNDS:
Wrap your Integers with a "bean" class (that's ugly, I know):
public static class MyBean {
    final int value;

    MyBean(int value) {
        this.value = value;
    }

    public int getValue() {
        return value;
    }
}

List<MyBean> input = Arrays.asList(new MyBean(101), new MyBean(103), new MyBean(105));
JavaRDD<MyBean> inputRDD = sc.parallelize(input);
DataFrame dataframe = sqlcontext.createDataFrame(inputRDD, MyBean.class);
dataframe.show(); // this works...
Convert to RDD<Row> yourself:
// convert to Rows:
JavaRDD<Row> rowRdd = inputRDD.map(new Function<Integer, Row>() {
    @Override
    public Row call(Integer v1) throws Exception {
        return RowFactory.create(v1);
    }
});

// create schema (this looks nicer in Scala...):
StructType schema = new StructType(new StructField[]{
    new StructField("number", IntegerType$.MODULE$, false, Metadata.empty())
});

DataFrame dataframe = sqlcontext.createDataFrame(rowRdd, schema);
dataframe.show(); // this works...
In Spark 2.2 and later you can do the following to create a Dataset directly:
Dataset<Integer> dataSet = sqlContext.createDataset(inputRDD.rdd(), Encoders.INT());
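If a DataFrame (Dataset<Row>) is what you ultimately need, the typed Dataset can then be converted with a column name; a minimal sketch ("number" is an arbitrary column name):
Dataset<Row> df = dataSet.toDF("number");
df.show();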

How to convert the cassandra row to my java class?

Here is [a link](http://stackoverflow.com/questions/32448987/how-to-retrieve-a-very-big-cassandra-table-and-delete-some-unuse-data-from-it#comment52844466_32464409) to my earlier question.
After I get the Cassandra data row by row in my program, I'm confused about how to convert a Cassandra row into a Java class. In Java the Cassandra table comes back as a ResultSet; when I iterate over it and read the row data, it returns an NPE, even though I can see the object (the data) while debugging the program. Here is my iterator code:
ResultSet rs = CassandraTools.getInstance().execute(cql);
Iterator<Row> iterator = rs.iterator();
while (iterator.hasNext()) {
    Row row = iterator.next();
    row.getString(...); // ----> returns NPE
}
The CassandraTools class is:
public class CassandraTools {

    private static CassandraTools instance;

    private CassandraTools() {
    }

    public static synchronized CassandraTools getInstance() {
        if (instance == null) {
            instance = new CassandraTools();
            instance.init();
        }
        return instance;
    }

    Cluster cluster;
    Session session;

    public void init() {
        if (cluster == null) {
            cluster = new Cluster.Builder().addContactPoint("10.16.34.96")
                    .build();
            if (session == null) {
                session = cluster.connect("uc_passport");
            }
        }
    }

    public ResultSet execute(String cql) {
        ResultSet rs = session.execute(cql);
        // rs.forEach(n -> {
        //     System.out.println(n);
        // });
        return rs;
    }
}
So how could I convert the data in the row to a Java class? I have read about the converter classes in the Spring Data Cassandra API, but they are complicated for me to use. Who can help?
IMHO, if you want to map Cassandra rows to a Java class, you should use an object-datastore mapper which does these things for you.
If you try to do this by yourself, you need to handle the Java-Cassandra datatype mappings, validations etc. all on your own, which is a very tedious job.
There are a few open-source object-datastore mappers available (Kundera, Hibernate OGM, etc.) that you can use. I suggest you try Kundera and check its getting-started guide for Cassandra.
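If you would rather not pull in a mapper library, the conversion can also be done by hand with the driver's typed getters; a minimal sketch, assuming a hypothetical user table with id, name and age columns:
import com.datastax.driver.core.Row;

public class User {
    private final String id;
    private final String name;
    private final int age;

    public User(String id, String name, int age) {
        this.id = id;
        this.name = name;
        this.age = age;
    }

    // Maps one Cassandra row to a User; the column names must match the table definition.
    public static User fromRow(Row row) {
        return new User(row.getString("id"), row.getString("name"), row.getInt("age"));
    }
}

// Usage:
// ResultSet rs = CassandraTools.getInstance().execute("SELECT id, name, age FROM user");
// List<User> users = new ArrayList<>();
// for (Row row : rs) {
//     users.add(User.fromRow(row));
// }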
