How to return JSON data in Dataset<Row> with an encoder (StructType) in Spark? - apache-spark

I am trying to return the required parameters in a Dataset<Row>. Whenever I return the data as a Row I am not able to encode it with a StructType, and if I use a Map/JSONObject instead it throws an error saying the Map/JSONObject is not a valid external type for the schema. Below is the code I tried. Any help will be appreciated, thanks in advance.
// Building a Dataset<Row> (inside a FlatMapFunction)
Row rowdat = RowFactory.create(jsondata);
return Collections.singletonList(rowdat).iterator();
// Dataset content will be: [[{"employees":"accountant","firstname":"walter","age":"54"}]]
StructType schema = DataTypes.createStructType(new StructField[] {
    DataTypes.createStructField("employees", DataTypes.StringType, true),
    DataTypes.createStructField("firstname", DataTypes.StringType, true),
    DataTypes.createStructField("age", DataTypes.StringType, true)
});
ExpressionEncoder<Row> express = RowEncoder.apply(schema);
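For context, a minimal sketch of how the pieces can line up, assuming Spark 2.x/3.x (where RowEncoder.apply(schema) is available) and Jackson for parsing the JSON string; the class, method, and input Dataset<String> below are illustrative, not taken from the question:

```java
import java.util.Collections;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class JsonToRowSketch {
  public static Dataset<Row> toRows(Dataset<String> jsonLines) {
    StructType schema = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("employees", DataTypes.StringType, true),
        DataTypes.createStructField("firstname", DataTypes.StringType, true),
        DataTypes.createStructField("age", DataTypes.StringType, true)
    });

    // RowEncoder.apply(schema) yields an ExpressionEncoder<Row> for that StructType.
    ExpressionEncoder<Row> encoder = RowEncoder.apply(schema);

    // ObjectMapper is Serializable, so it can be captured by the lambda below.
    ObjectMapper mapper = new ObjectMapper();

    // Each returned Row must carry one value per schema field, in field order;
    // passing a Map or JSONObject as a single value is what triggers the
    // "not a valid external type" error.
    FlatMapFunction<String, Row> parse = line -> {
      JsonNode node = mapper.readTree(line);
      Row row = RowFactory.create(
          node.get("employees").asText(),
          node.get("firstname").asText(),
          node.get("age").asText());
      return Collections.singletonList(row).iterator();
    };

    return jsonLines.flatMap(parse, encoder);
  }
}
```

The key point of the sketch is that each Row handed to the encoder carries one value per schema field, rather than a single Map/JSONObject value.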

Related

Spark Java API to create schema from config file

 
I am looking for an approach to read a table schema from a config file, to avoid hardcoding it in Spark (Java). For example, to read two .csv files I create schemas as below:
#1
StructType schema = DataTypes.createStructType(new StructField[] {
    DataTypes.createStructField("emp_dept", StringType, true),
    DataTypes.createStructField("empid", IntegerType, true),
    DataTypes.createStructField("empdesignation", StringType, true),
    DataTypes.createStructField("emp_salary", IntegerType, true)
});
Dataset<Row> df1 = spark.read().format("csv")
    .option("header", "true")
    .schema(schema)
    .csv(path);
#2
StructType schema = DataTypes.createStructType(new StructField[] {
    DataTypes.createStructField("emp_details", StringType, true),
    DataTypes.createStructField("empid", IntegerType, true),
    DataTypes.createStructField("empfistname", StringType, true),
    DataTypes.createStructField("emplastname", StringType, true)
});
Dataset<Row> df2 = spark.read().format("csv")
    .option("header", "true")
    .schema(schema)
    .csv(path);
Instead of creating multiple schemas like this, I'd like to create them from a config file.
Perhaps you can use one of the static factory methods, StructType.fromDDL(String ddl) or DataType.fromJson(String json), depending on what you want your config file to look like. For example, simple DDL:
scala> import org.apache.spark.sql.types._
scala> val struct = StructType.fromDDL("id int, descr string")
struct: org.apache.spark.sql.types.StructType = StructType(StructField(id,IntegerType,true),StructField(descr,StringType,true))
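For a Java-side sketch of the config-file part, here is a minimal assumption of how it could look; the .properties file name, property key, and class name are illustrative, not taken from the question:

```java
// schemas.properties (illustrative):
//   employees=emp_dept STRING, empid INT, empdesignation STRING, emp_salary INT
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;

public class SchemaFromConfig {
  public static Dataset<Row> readCsv(SparkSession spark, String configPath,
                                     String schemaKey, String csvPath) throws IOException {
    Properties props = new Properties();
    try (FileInputStream in = new FileInputStream(configPath)) {
      props.load(in);
    }
    // Parse the DDL string from the config file into a StructType.
    StructType schema = StructType.fromDDL(props.getProperty(schemaKey));
    return spark.read()
        .option("header", "true")
        .schema(schema)
        .csv(csvPath);
  }
}
```

The same idea should work with DataType.fromJson if you would rather keep the schemas as JSON in the config file.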

How to return GenericInternalRow from spark udf

I have a Spark UDF written in Scala that takes a couple of columns, applies some logic, and outputs an InternalRow. A Spark schema (StructType) is also present.
But when I try to return the InternalRow from the UDF, I get this exception:
java.lang.UnsupportedOperationException: Schema for type
org.apache.spark.sql.catalyst.GenericInternalRow is not supported
val getData = (hash: String, `type`: String) => {
  val schema = hash match {
    case "people" => peopleSchema
    case "empl"   => emplSchema
  }
  getGenericInternalRow(schema)
}
val data = udf(getData)
Spark Version : 2.4.5
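For illustration, a minimal Java sketch of one common workaround (an assumption, not code from the question): return a plain external Row instead of a catalyst GenericInternalRow, and register the UDF with a StructType as its declared return type. The schema fields and column names below are illustrative.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class StructUdfSketch {
  public static Dataset<Row> withStructColumn(SparkSession spark, Dataset<Row> df) {
    // Stand-in for the peopleSchema/emplSchema mentioned in the question.
    StructType peopleSchema = new StructType()
        .add("name", DataTypes.StringType)
        .add("age", DataTypes.IntegerType);

    // The UDF builds a plain Row whose values match the declared return schema, field by field.
    UDF2<String, String, Row> getData =
        (hash, type) -> RowFactory.create("someName", 42);

    // Registering with a StructType return type makes the result a struct column.
    spark.udf().register("getData", getData, peopleSchema);

    // Assumes df has string columns named hash and type.
    return df.selectExpr("getData(hash, `type`) AS data");
  }
}
```

Note that a registered UDF has a single declared return type, so the per-row schema switching in the question (the hash match) would need either one schema that covers both cases or two separate UDFs.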

Inserting multiple rows into SQL Server from Node.js

I am working on a project that will upload some records to SQL Server from a node.js program. Right now, this is my approach (inside an async function):
con = await sql.connect(`mssql://${SQL.user}:${SQL.password}@${SQL.server}/${SQL.database}?encrypt=true`);
for (const r of RECORDS) {
  const columns = `([column1], [column2], [column3])`;
  const values = `(@col1, @col2, @col3)`;
  await con
    .request()
    .input("col1", sql.Int, r.col1)
    .input("col2", sql.VarChar, r.col2)
    .input("col3", sql.VarChar, r.col3)
    .query(`INSERT INTO [dbo].[table1] ${columns} VALUES ${values}`);
}
Where RECORDS is an array of objects of the form:
RECORDS = [
  { col1: 1, col2: "asd", col3: "A" },
  { col1: 2, col2: "qwerty", col3: "B" },
  // ...
];
This code works; nevertheless, I have the feeling that it is not efficient at all. An upload of around 4k records takes roughly 10 minutes, which does not look good.
I believe that if I can create a single query with all the record values, instead of wrapping single inserts inside a for loop, it will be faster, and I know there is a syntax for achieving that in SQL:
INSERT INTO table1 (column1, column2, column3) VALUES (1, 'asd', 'A'), (2, 'qwerty', 'B'), (...);
However, I cannot find any documentation in the mssql module for Node on how to prepare the parameterized inputs to do everything in a single transaction.
Can anyone point me in the right direction?
Thanks in advance.
Also, very similar to the bulk insert, you can use a table-valued parameter.
sql.connect("mssql://${SQL.user}:${SQL.password}#${SQL.server}/${SQL.database}?encrypt=true")
.then(() => {
const table = new sql.Table();
table.columns.add('col1', sql.Int);
table.columns.add('col2', sql.VarChar(20));
table.columns.add('col3', sql.VarChar(20));
// add data
table.rows.add(1, 'asd', 'A');
table.rows.add(2, 'qwerty', 'B');
const request = new sql.Request();
request.input('table1', table);
request.execute('procMyProcedure', function (err, recordsets, returnValue) {
console.dir(JSON.stringify(recordsets[0][0]));
res.end(JSON.stringify(recordsets[0][0]));
});
});
And then for the SQL side, create a user-defined table type
CREATE TYPE typeMyType AS TABLE
(
    Col1 int,
    Col2 varchar(20),
    Col3 varchar(20)
)
And then use this in the stored procedure
CREATE PROCEDURE procMyProcedure
    @table1 typeMyType READONLY
AS
BEGIN
    INSERT INTO table1 (Col1, Col2, Col3)
    SELECT Col1, Col2, Col3
    FROM @table1
END
This gives you more control over the data and lets you do more with the data in SQL before you actually insert.
As pointed out by @JoaquinAlvarez, bulk insert should be used, as replied here: Bulk inserting with Node mssql package
In my case, the code looked like this:
return await sql.connect(`mssql://${SQL.user}:${SQL.password}@${SQL.server}/${SQL.database}?encrypt=true`).then(() => {
  const table = new sql.Table("table1");
  table.create = true;
  table.columns.add("column1", sql.Int, { nullable: false });
  table.columns.add("column2", sql.VarChar, { length: Infinity, nullable: true });
  table.columns.add("column3", sql.VarChar(250), { nullable: true });
  // add here rows to insert into the table
  for (const r of RECORDS) {
    table.rows.add(r.col1, r.col2, r.col3);
  }
  return new sql.Request().bulk(table);
});
The SQL data types have to match (obviously) the column types of the existing table table1. Note the case of column2, which is defined in SQL as varchar(max).
Thanks Joaquin! The time went down significantly, from 10 minutes to a few seconds.

Convert a Spark dataframe Row to Avro and publish to Kafka

I have a Spark dataframe with the below schema and am trying to stream this dataframe to Kafka using Avro.
```
root
 |-- clientTag: struct (nullable = true)
 |    |-- key: string (nullable = true)
 |-- contactPoint: struct (nullable = true)
 |    |-- email: string (nullable = true)
 |    |-- type: string (nullable = true)
 |-- performCheck: string (nullable = true)
```
Sample Record: {"performCheck" : "N", "clientTag" : {"key":"value"}, "contactPoint" : {"email":"abc@gmail.com", "type":"EML"}}
Avro Schema:
{
  "name": "Message",
  "namespace": "kafka.sample.avro",
  "type": "record",
  "fields": [
    {"type": "string", "name": "id"},
    {"type": "string", "name": "email"},
    {"type": "string", "name": "type"}
  ]
}
I have a couple of questions.
What is the best way to convert an org.apache.spark.sql.Row to an Avro message? I want to extract email and type from the dataframe for each Row and use those values to construct the Avro message.
Eventually, all the Avro messages will be sent to Kafka. So, if there is an error while producing, how can I collect all the Rows that failed to be produced to Kafka and return them as a dataframe?
Thanks for the help
You can try this.
Q#1: You can extract child elements of the dataframe using dot notation:
val dfJSON = spark.read.json("/json/path/sample_avro_data_as_json.json") //can read from schema registry
  .withColumn("id", $"clientTag.key")
  .withColumn("email", $"contactPoint.email")
  .withColumn("type", $"contactPoint.type")
Then you can directly use these columns while assigning values to the Avro record that you serialize and send to Kafka.
Q#2: You can keep track of success and failure with something like this. This is not fully working code, but it can give you an idea.
dfJSON.foreachPartition( currentPartition => {
  val producer = new KafkaProducer[String, Array[Byte]](props)
  val schema: Schema = ... // Get schema from schema registry or avsc file
  val schemaRegProps = Map("schema.registry.url" -> schemaRegistryUrl)
  val client = new CachedSchemaRegistryClient(schemaRegistryUrl, Int.MaxValue)
  val valueSerializer = new KafkaAvroSerializer(client)
  valueSerializer.configure(schemaRegProps, false)
  val failedRecDF = currentPartition.map(rec => {
    try {
      val avroRecord: GenericRecord = new GenericData.Record(schema)
      avroRecord.put("id", rec.getAs[String]("id"))
      avroRecord.put("email", rec.getAs[String]("email"))
      avroRecord.put("type", rec.getAs[String]("type"))
      // Serialize the record into a ProducerRecord and send it to Kafka
      producer.send(new ProducerRecord[String, Array[Byte]](kafkaTopic, rec.getAs[String]("id"), valueSerializer.serialize(kafkaTopic, avroRecord)))
      (rec.getAs[String]("id"), rec.getAs[String]("email"), rec.getAs[String]("type"), "Success")
    } catch {
      case e: Exception =>
        println("*** Exception *** ")
        e.printStackTrace()
        (rec.getAs[String]("id"), rec.getAs[String]("email"), rec.getAs[String]("type"), "Failed")
    }
  }) //.toDF("id", "email", "type", "sent_status")
  failedRecDF.foreach(println)
  // You can retry or log them
})
Response would be:
(111,abc@gmail.com,EML,Success)
You can do whatever you want to do with it.

Exporting nested fields with invalid characters from Spark 2 to Parquet [duplicate]

This question already has answers here:
Spark Dataframe validating column names for parquet writes
(7 answers)
Closed 4 years ago.
I am trying to use Spark 2.0.2 to convert a JSON file into Parquet.
The JSON file comes from an external source and therefore the schema can't be changed before it arrives.
The file contains a map of attributes. The attribute names aren't known before I receive the file.
The attribute names contain characters that can't be used in Parquet.
{
  "id" : 1,
  "name" : "test",
  "attributes" : {
    "name=attribute" : 10,
    "name=attribute with space" : 100,
    "name=something else" : 10
  }
}
Neither the space nor the equals character can be used in Parquet; I get the following error:
org.apache.spark.sql.AnalysisException: Attribute name "name=attribute" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
As these are nested fields, I can't rename them using an alias. Is this true?
I have tried renaming the fields within the schema as suggested here: How to rename fields in a DataFrame corresponding to nested JSON. This works for some files; however, I now get the following stack overflow:
java.lang.StackOverflowError
at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:65)
at org.apache.spark.scheduler.DAGScheduler.getCacheLocs(DAGScheduler.scala:258)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1563)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:1579)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply(DAGScheduler.scala:1578)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply(DAGScheduler.scala:1578)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1578)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1576)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1576)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:1579)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply(DAGScheduler.scala:1578)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply(DAGScheduler.scala:1578)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1578)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1576)
at scala.collection.immutable.List.foreach(List.scala:381)
...
repeat
...
I want to do one of the following:
Strip invalid characters from the field names as I load the data into spark
Change the column names in the schema without causing stack overflows
Somehow change the schema to load the original data but use the following internally:
{
  "id" : 1,
  "name" : "test",
  "attributes" : [
    {"key" : "name=attribute", "value" : 10},
    {"key" : "name=attribute with space", "value" : 100},
    {"key" : "name=something else", "value" : 10}
  ]
}
I solved the problem this way:
df.toDF(df
  .schema
  .fieldNames
  .map(name => "[ ,;{}()\\n\\t=]+".r.replaceAllIn(name, "_")): _*)
where I replaced all invalid symbols with "_".
The only solution I have found to work, so far, is to reload the data with a modified schema. The new schema loads the attributes into a map.
Dataset<Row> newData = sql.read().json(path);
StructType newSchema = (StructType) toMapType(newData.schema(), null, "attributes");
newData = sql.read().schema(newSchema).json(path);

private DataType toMapType(DataType dataType, String fullColName, String col) {
    if (dataType instanceof StructType) {
        StructType structType = (StructType) dataType;
        List<StructField> renamed = Arrays.stream(structType.fields()).map(
            f -> toMapType(f, fullColName == null ? f.name() : fullColName + "." + f.name(), col)).collect(Collectors.toList());
        return new StructType(renamed.toArray(new StructField[renamed.size()]));
    }
    return dataType;
}

private StructField toMapType(StructField structField, String fullColName, String col) {
    if (fullColName.equals(col)) {
        return new StructField(col, new MapType(DataTypes.StringType, DataTypes.LongType, true), true, Metadata.empty());
    } else if (col.startsWith(fullColName)) {
        return new StructField(structField.name(), toMapType(structField.dataType(), fullColName, col), structField.nullable(), structField.metadata());
    }
    return structField;
}
I have the same problem with # and :.
In our case, we solved it by flattening the DataFrame.
import scala.util.matching.Regex
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

val ALIAS_RE: Regex = "[_.:#]+".r
val FIRST_AT_RE: Regex = "^_".r

def getFieldAlias(field_name: String): String = {
  FIRST_AT_RE.replaceAllIn(ALIAS_RE.replaceAllIn(field_name, "_"), "")
}

def selectFields(df: DataFrame, fields: List[String]): DataFrame = {
  var fields_to_select = List[Column]()
  for (field <- fields) {
    val alias = getFieldAlias(field)
    fields_to_select +:= col(field).alias(alias)
  }
  df.select(fields_to_select: _*)
}
So the following JSON:
{
  object: 'blabla',
  schema: {
    #type: 'blabla',
    name#id: 'blabla'
  }
}
will be transformed into [object, schema.#type, schema.name#id].
# and dots (in your case =) will create problems for Spark SQL.
So after our selectFields you end up with
[object, schema_type, schema_name_id]: a flattened DataFrame.
