Spark SQL expand array to multiple columns - apache-spark

I am storing JSON messages for each row update from an Oracle source in S3.
The JSON structure is as below:
{
  "tableName": "ORDER",
  "action": "UPDATE",
  "timeStamp": "2016-09-04 20:05:08.000000",
  "uniqueIdentifier": "31200477027942016-09-05 20:05:08.000000",
  "columnList": [{
    "columnName": "ORDER_NO",
    "newValue": "31033045",
    "oldValue": ""
  }, {
    "columnName": "ORDER_TYPE",
    "newValue": "N/B",
    "oldValue": ""
  }]
}
I am using Spark SQL to find the latest record for each key based on the max value of uniqueIdentifier.
columnList is an array with the list of columns for the table. I want to join multiple tables and fetch the records that are latest.
How can I join the columns from the JSON array of one table with columns from another table? Is there a way to explode the JSON array into multiple columns? For example, the JSON above would have ORDER_NO as one column and ORDER_TYPE as another. How can I create a DataFrame with multiple columns based on the columnName field?
For example, the new RDD should have the columns (tableName, action, timeStamp, uniqueIdentifier, ORDER_NO, ORDER_TYPE).
The values of the ORDER_NO and ORDER_TYPE fields should be mapped from the newValue field in the JSON.

I have found a solution for this by programmatically creating the schema using the RDD APIs:
Dataset<Row> dataFrame = spark.read().json(inputPath);
dataFrame.printSchema();
JavaRDD<Row> rdd = dataFrame.toJavaRDD();
SchemaBuilder schemaBuilder = new SchemaBuilder();
// get the schema column names in appended format
String columnNames = schemaBuilder.populateColumnSchema(rdd.first(), dataFrame.columns());
SchemaBuilder is a custom class that takes the RDD details and returns delimiter-separated column names.
Then, using the RowFactory.create call, the JSON values are mapped to the schema.
Doc reference http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
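As an alternative to the custom SchemaBuilder, a possible sketch in Scala (not part of the original solution; it reuses the same inputPath) is to explode columnList and pivot on columnName, so that each column name in the array (ORDER_NO, ORDER_TYPE, ...) becomes its own column holding the corresponding newValue:
import org.apache.spark.sql.functions.{col, explode, first}

val raw = spark.read.json(inputPath)

// flatten the array of {columnName, newValue, oldValue} structs into rows
val exploded = raw
  .withColumn("c", explode(col("columnList")))
  .select(
    col("tableName"), col("action"), col("timeStamp"), col("uniqueIdentifier"),
    col("c.columnName").as("columnName"),
    col("c.newValue").as("newValue"))

// turn each distinct columnName into a real column holding its newValue
val pivoted = exploded
  .groupBy("tableName", "action", "timeStamp", "uniqueIdentifier")
  .pivot("columnName")
  .agg(first("newValue"))
The latest record per key can then be taken with a groupBy/max on uniqueIdentifier, and the flattened DataFrames of different tables can be joined on their key columns.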

Related

Cosmos db null value

I have two kinds of records, shown below, in my table staudentdetail in Cosmos DB. In the example below, previousSchooldetail is a nullable field and may or may not be present for a student.
Sample records below:
{
  "empid": "1234",
  "empname": "ram",
  "schoolname": "high school ,bankur",
  "class": "10",
  "previousSchooldetail": {
    "prevSchoolName": "1763440",
    "YearLeft": "2001"
  } --(Nullable)
}
{
  "empid": "12345",
  "empname": "shyam",
  "schoolname": "high school",
  "class": "10"
}
I am trying to access the above records from Azure Databricks using PySpark or Scala code. But when we build the DataFrame by reading from Cosmos DB, it does not bring the previousSchooldetail detail into the DataFrame. But when we change the query to include the id for which previousSchooldetail exists, it does show in the DataFrame.
Case 1:
val Query = "SELECT * FROM c "
Result (columns returned) when the query is fired directly:
empid
empname
schoolname
class
Case 2:
val Query = "SELECT * FROM c where c.empid=1234"
Result (columns returned) when the query is fired with the where clause:
empid
empname
schoolname
class
previousSchooldetail
prevSchoolName
YearLeft
Could you please tell me why I am not able to get previousSchooldetail in Case 1 and how I should proceed?
As @Jayendran mentioned in the comments, the first query will give you the previousSchooldetail document wherever it is available; otherwise, the column will not be present.
You can have this column present for all the scenarios by using the IS_DEFINED function. Try tweaking your query as below:
SELECT c.empid,
c.empname,
IS_DEFINED(c.previousSchooldetail) ? c.previousSchooldetail : null
as previousSchooldetail,
c.schoolname,
c.class
FROM c
If you are looking to get the result as a flat structure, it can be tricky and you would need to use two separate queries, such as:
Query 1
SELECT c.empid,
c.empname,
c.schoolname,
c.class,
p.prevSchoolName,
p.YearLeft
FROM c JOIN c.previousSchooldetail p
Query 2
SELECT c.empid,
c.empname,
c.schoolname,
c.class,
null as prevSchoolName,
null as YearLeft
FROM c
WHERE NOT IS_DEFINED(c.previousSchooldetail) OR
c.previousSchooldetail = null
Unfortunately, Cosmos DB does not support LEFT JOIN or UNION. Hence, I'm not sure if you can achieve this in a single query.
Alternatively, you can create a stored procedure to return the desired result.
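If the documents are loaded as-is into Spark instead, another possibility (a sketch in Scala, assuming a DataFrame df has already been read from Cosmos DB; this is not part of the answer above) is to flatten previousSchooldetail on the Spark side, null-filling when the connector did not surface the column:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

def flattenPreviousSchool(df: DataFrame): DataFrame =
  if (df.columns.contains("previousSchooldetail"))
    // the nested struct is present: project its fields into flat columns
    df.withColumn("prevSchoolName", col("previousSchooldetail.prevSchoolName"))
      .withColumn("YearLeft", col("previousSchooldetail.YearLeft"))
  else
    // the column never surfaced: add null placeholders so downstream code sees a stable schema
    df.withColumn("prevSchoolName", lit(null).cast("string"))
      .withColumn("YearLeft", lit(null).cast("string"))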

How to validate large csv file either column wise or row wise in spark dataframe

I have a large data file of 10 GB or more with 150 columns, in which we need to validate each piece of data (datatype/format/null/domain value/primary key ...) against different rules, and finally create two output files: one containing the successful records and one containing the error records with error details. We need to move a row to the error file as soon as any column has an error; no further validation of that row is needed.
I am reading the file into a Spark DataFrame. Should we validate it column-wise or row-wise, and which way gives the best performance?
To answer your question:
"I am reading the file into a Spark DataFrame. Do we validate it column-wise or row-wise, and which way gives the best performance?"
A DataFrame is a distributed collection of data organized as a set of rows spread across the cluster, and most of the transformations defined in Spark are applied to the rows, working on the Row object. Row-wise validation therefore fits Spark's execution model naturally.
Pseudocode (structural sketch; the actual validation logic is left as comments):
import spark.implicits._

// infer the expected schema once from the csv itself
val schema = spark.read.option("header", "true").option("inferSchema", "true").csv(inputFile).schema

// validate row by row: each row becomes (rowText, errorInfo, isValid)
val validated = spark.read.option("header", "true").csv(inputFile).map(row => {
  var errorInfo = ""
  schema.fields.foreach(f => {
    if (errorInfo.isEmpty) {
      // f.dataType // get the field type and apply custom logic per data type
      // f.name     // get the field name, i.e. the column name
      // val fieldValue = row.getAs[String](f.name) // get the field value and check it against the field type
      // if the value fails validation, set errorInfo to the error details for this row
      // otherwise leave it empty
    }
  })
  (row.mkString(","), errorInfo, errorInfo.isEmpty)
}).toDF("row", "errorInfo", "isValid")

// write the two outputs outside the map
validated.filter($"isValid").select("row").write.text(correctLoc)
validated.filter(!$"isValid").write.csv(errorLoc)

Split out JSON input and apply same JSON field name as column name in Alteryx workflow

I'm using Alteryx 2019.3 and looking to build a workflow which uses JSON as input. When it reads the JSON, it puts the JSON key-value pairs into columns called JSON_Name and JSON_ValueString.
In an example I have mocked up, the field names in the JSON below look like this in the JSON_Name column:
customer.0.name
customer.0.contactDetails.0.company
customer.0.contactDetails.0.addressDetails.0.address
customer.0.contactDetails.0.addressDetails.0.addressType
customer.0.departments.0.name
What I want to do is split it out into different tables and use the last part of the JSON_Name value as the column name, so it looks something like this (caps show the table name):
CUSTOMER
customerId
CONTACTDETAILS
customerId
company
ADDRESSDETAILS
customerId
address
addressType
DEPARTMENTS
customerId
name
How do I do this in Alteryx, and how can I get it to work when there can be multiple entries in the JSON lists?
Thanks for any help
JSON input (mock-up for the example):
{
  "id": "1234",
  "contactDetails": [{
    "company": "company1",
    "addressDetails": [{
      "address": "City1",
      "addressType": "Business"
    }]
  }],
  "departments": [{
    "name": "dept1"
  }]
}
You can do this with a Text To Columns tool and then a series of Filter tools to split it into different datasets (tables). You probably want to use Cross Tab tools to get the format of the tables right.

Defining DataFrame Schema for a table with 1500 columns in Spark

I have a table with around 1500 columns in SQL Server. I need to read the data from this table, convert it to the proper datatype format, and then insert the records into an Oracle DB.
What is the best way to define the schema for this type of table with more than 1500 columns? Is there any option other than hard-coding the column names along with the datatypes?
Using Case class
Using StructType.
Spark Version used is 1.4
For this type of requirement, I'd suggest the case class approach to prepare a DataFrame.
Yes, there are some limitations, like product arity, but we can overcome them.
You can do it like the example below for Scala versions < 2.11:
Prepare a class which extends Product and overrides the methods below, like:
productArity(): Int: This returns the size of the attributes. In our case, it's 33. So our implementation looks like this:
productElement(n: Int): Any: Given an index, this returns the attribute. As protection, we also have a default case, which throws an IndexOutOfBoundsException:
canEqual(that: Any): Boolean: This is the last of the three functions, and it serves as a boundary condition when an equality check is being done against the class:
For an example implementation, you can refer to this Student case class, which has 33 fields in it.
An example student dataset description is here.
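A minimal sketch of the shape of such a class, with only three illustrative fields instead of 33 (the field names here are made up):
// plain class extending Product for Scala < 2.11, where case classes are limited to 22 fields
class Record(val id: Long, val name: String, val age: Int)
  extends Product with Serializable {

  // the number of attributes (33 in the Student example referenced above)
  override def productArity: Int = 3

  // return the attribute at the given index, with a guard for bad indexes
  override def productElement(n: Int): Any = n match {
    case 0 => id
    case 1 => name
    case 2 => age
    case _ => throw new IndexOutOfBoundsException(n.toString)
  }

  // boundary condition for equality checks against this class
  override def canEqual(that: Any): Boolean = that.isInstanceOf[Record]
}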
Another option:
Use StructType to define the schema and create the DataFrame (if you don't want to use the spark-csv API).
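A minimal sketch of that route (the column names are made up, and rowRDD is an assumed RDD[Row] already built from the source records):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("ORDER_NO", StringType, nullable = true),
  StructField("ORDER_DATE", TimestampType, nullable = true),
  StructField("AMOUNT", DoubleType, nullable = true)
  // ... the remaining columns, ideally generated rather than typed by hand
))

// apply the schema to the assumed RDD[Row] of source records
val df = sqlContext.createDataFrame(rowRDD, schema)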
The options for reading a table with 1500 columns
1) Using Case class
A case class would not work because it is limited to 22 fields (for Scala versions < 2.11).
2) Using StructType
You can use the StructType to define the schema and create the dataframe.
Third option
You can use the spark-csv package. With it, you can use .option("inferSchema", "true"), which will automatically infer the schema from the file.
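For instance, a minimal sketch (assuming Spark 1.x with a sqlContext and a placeholder file path):
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/data/table_export.csv")   // placeholder path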
You can have your schema with hundreds of columns in JSON format, and then you can read this JSON file to construct your custom schema.
For example, your schema JSON could be:
[
{
"columnType": "VARCHAR",
"columnName": "NAME",
"nullable": true
},
{
"columnType": "VARCHAR",
"columnName": "AGE",
"nullable": true
},
.
.
.
]
Now you can read the JSON and parse it into a case class to form the StructType.
case class Field(columnName: String, columnType: String, nullable: Boolean)
You can create a Map from the Oracle column-type strings in the JSON schema to the corresponding Spark DataTypes.
val dataType = Map(
"VARCHAR" -> StringType,
"NUMERIC" -> LongType,
"TIMESTAMP" -> TimestampType,
.
.
.
)
// assumes json4s (org.json4s._ and org.json4s.jackson.JsonMethods.parse) and scala.io.Source are imported
def parseJsonForSchema(jsonFilePath: String): StructType = {
  implicit val formats = DefaultFormats
  val jsonString = Source.fromFile(jsonFilePath).mkString
  val parsedJson = parse(jsonString)
  val fields = parsedJson.extract[List[Field]]
  val schemaColumns = fields.map(field => StructField(field.columnName, dataType(field.columnType), field.nullable))
  StructType(schemaColumns)
}
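One possible way to apply the resulting StructType (a sketch; the file paths are placeholders and the spark-csv reader from the third option above is assumed):
val tableSchema = parseJsonForSchema("/schemas/table_schema.json")   // placeholder path

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .schema(tableSchema)
  .load("/data/table_export.csv")                                    // placeholder path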

cloudant-spark connector creates duplicate column name with nested JSON schema

I'm using the following JSON Schema in my cloudant database:
{...
departureWeather:{
temp:30,
otherfields:xyz
},
arrivalWeather:{
temp:45,
otherfields: abc
}
...
}
I'm then loading the data into a dataframe using the cloudant-spark connector. If I try to select fields like so:
df.select("departureWeather.temp", "arrivalWeather.temp")
I end up with a dataframe that has two columns with the same name, e.g. temp. It looks like the Spark datasource framework is flattening the name using only the last part.
Is there an easy way to deduplicate the column names?
You can use aliases:
import org.apache.spark.sql.functions.col

df.select(
col("departureWeather.temp").alias("departure_temp"),
col("arrivalWeather.temp").alias("arrival_temp")
)
