cloudant-spark connector creates duplicate column names with nested JSON schema - apache-spark

I'm using the following JSON schema in my Cloudant database:
{
  ...
  "departureWeather": {
    "temp": 30,
    "otherfields": "xyz"
  },
  "arrivalWeather": {
    "temp": 45,
    "otherfields": "abc"
  },
  ...
}
I'm then loading the data into a dataframe using the cloudant-spark connector. If I try to select fields like so:
df.select("departureWeather.temp", "arrivalWeather.temp")
I end up with a dataframe that has two columns with the same name, e.g. temp. It looks like the Spark data source framework flattens the name using only the last part.
Is there an easy way to deduplicate the column names?

You can use aliases:
df.select(
col("departureWeather.temp").alias("departure_temp"),
col("arrivalWeather.temp").alias("arrival_temp")
)
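If there are many nested fields to pull up, you can also build the aliased columns programmatically instead of listing each one by hand. A minimal sketch in Scala, assuming the struct and field names from the question:
import org.apache.spark.sql.functions.col

// Prefix each nested field with its parent struct so the flattened names stay unique,
// e.g. departureWeather.temp becomes departureWeather_temp
val structs = Seq("departureWeather", "arrivalWeather")
val fields  = Seq("temp")
val aliased = for (s <- structs; f <- fields) yield col(s"$s.$f").alias(s"${s}_$f")
val flat    = df.select(aliased: _*)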

Related

Need help in extracting the object from nested JSON in pyspark

My JSON column values are as below
[{"item":"54509485","id":"1234","rule":"9383","issue_type":[],"rule_message":"this is json data.","sample_attributes":["shicode","measurement"],"impacted":[["Child"],[]],"type_of_blocker":[]}]
I want to get only the "item", "rule", and "sample_attributes" objects using PySpark dataframe code.
If you have a pyspark.sql.dataframe.DataFrame, you can do it like this:
data.select(data.column_name.item, data.column_name.rule, data.column_name.sample_attributes)
where data is your dataframe and column_name is the name of your column.

pyspark dataframe column name

What are the limitations on PySpark dataframe column names? I have an issue with the following code:
%livy.pyspark
df_context_spark.agg({'spatialElementLabel.value': 'count'})
It gives ...
u'Cannot resolve column name "spatialElementLabel.value" among (lightFixtureID.value, spatialElementLabel.value);'
The column name is evidently typed correctly. I got the dataframe by converting a pandas dataframe. Is there any issue with the dot in the column name string?
Dots are used for nested fields inside a structure type. So if you had a column called "address" of type StructType, and inside that you had street1, street2, etc., you would access the individual fields like this:
df.select("address.street1", "address.street2", ..)
Because of that, if you want to use a dot in your field name, you need to quote the field whenever you refer to it. For example:
from pyspark.sql.types import *
schema = StructType([StructField("my.field", StringType())])
rdd = sc.parallelize([('hello',), ('world',)])
df = sqlContext.createDataFrame(rdd, schema)
# Using backticks to quote the field name
df.select("`my.field`").show()

Defining DataFrame Schema for a table with 1500 columns in Spark

I have a table with around 1500 columns in SQL Server. I need to read the data from this table, convert it to the proper datatype format, and then insert the records into an Oracle DB.
What is the best way to define the schema for a table with more than 1500 columns? Is there any option other than hard-coding the column names along with their datatypes?
Using a case class
Using StructType
The Spark version used is 1.4.
For this type of requirement, I'd suggest the case class approach to prepare a dataframe.
Yes, there are some limitations, like product arity, but we can overcome them.
You can do it as in the example below for Scala versions < 2.11:
Prepare a class which extends Product and overrides the following methods:
productArity(): Int returns the number of attributes; in the referenced example it is 33.
productElement(n: Int): Any returns the attribute at the given index; as protection, there is also a default case which throws an IndexOutOfBoundsException.
canEqual(that: Any): Boolean is the last of the three functions; it serves as a boundary condition when an equality check is done against the class.
For an example implementation, you can refer to the Student class, which has 33 fields, and its accompanying example student dataset description; a minimal sketch of the pattern is shown below.
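The sketch uses only 3 fields for brevity and a made-up name (Record); the idea is the same with 33 or more fields:
class Record(val id: Int, val name: String, val score: Double)
  extends Product with Serializable {

  // number of attributes
  override def productArity: Int = 3

  // return the attribute at the given index, guarding against bad indices
  override def productElement(n: Int): Any = n match {
    case 0 => id
    case 1 => name
    case 2 => score
    case _ => throw new IndexOutOfBoundsException(n.toString)
  }

  // boundary condition for equality checks against the class
  override def canEqual(that: Any): Boolean = that.isInstanceOf[Record]
}
An RDD of such objects can then be turned into a dataframe the same way as an RDD of case class instances (for example via sqlContext.createDataFrame or toDF).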
Another option:
Use StructType to define the schema and create the dataframe (if you don't want to use the spark-csv API).
The options for reading a table with 1500 columns
1) Using a case class
A case class would not work because it is limited to 22 fields (for Scala versions < 2.11).
2) Using StructType
You can use the StructType to define the schema and create the dataframe.
3) Using the spark-csv package
You can use the spark-csv package. With it, you can use .option("inferSchema", "true"), which will automatically infer the schema from the file.
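For example, a rough sketch of that option in Spark 1.x Scala (the path is a placeholder):
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")   // sample the data and infer all 1500 column types
  .load("/path/to/export.csv")     // placeholder path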
You can also keep your schema, with its hundreds of columns, in JSON format, and then read that JSON file to construct your custom schema.
For example, your schema JSON could be:
[
  {
    "columnType": "VARCHAR",
    "columnName": "NAME",
    "nullable": true
  },
  {
    "columnType": "VARCHAR",
    "columnName": "AGE",
    "nullable": true
  },
  ...
]
Now you can read the JSON and parse it into a case class to form the StructType. The case class fields need to match the keys in the schema JSON:
case class Field(columnName: String, columnType: String, nullable: Boolean)
You can create a Map from the Oracle column-type strings in the schema JSON to Spark DataTypes.
import org.apache.spark.sql.types._

val dataType = Map(
  "VARCHAR" -> StringType,
  "NUMERIC" -> LongType,
  "TIMESTAMP" -> TimestampType
  // ... remaining Oracle type to Spark type mappings
)
import scala.io.Source
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

implicit val formats: Formats = DefaultFormats  // needed by json4s extract
def parseJsonForSchema(jsonFilePath: String): StructType = {
  val jsonString = Source.fromFile(jsonFilePath).mkString
  val parsedJson = parse(jsonString)
  // the schema file is a JSON array, so extract a List of Field
  val fields = parsedJson.extract[List[Field]]
  val schemaColumns = fields.map(f => StructField(f.columnName, dataType(f.columnType), f.nullable))
  StructType(schemaColumns)
}
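A sketch of how the generated schema might then be applied when reading the exported data with spark-csv (paths are placeholders):
val oracleSchema = parseJsonForSchema("/path/to/schema.json")
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .schema(oracleSchema)            // use the 1500-column schema instead of inferring it
  .load("/path/to/data.csv")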

Spark SQL expand array to multiple columns

I am storing JSON messages for each row update from an Oracle source in S3.
The JSON structure is as below:
{
  "tableName": "ORDER",
  "action": "UPDATE",
  "timeStamp": "2016-09-04 20:05:08.000000",
  "uniqueIdentifier": "31200477027942016-09-05 20:05:08.000000",
  "columnList": [{
    "columnName": "ORDER_NO",
    "newValue": "31033045",
    "oldValue": ""
  }, {
    "columnName": "ORDER_TYPE",
    "newValue": "N/B",
    "oldValue": ""
  }]
}
I am using Spark SQL to find the latest record for each key, based on the max value of the unique identifier.
columnList is an array with the list of columns for the table. I want to join multiple tables and fetch the latest records.
How can I join the columns from the JSON array of one table with the columns from another table? Is there a way to explode the JSON array into multiple columns? For example, the above JSON would have ORDER_NO as one column and ORDER_TYPE as another column. How can I create a dataframe with multiple columns based on the columnName field?
For example, the new RDD should have the columns (tableName, action, timeStamp, uniqueIdentifier, ORDER_NO, ORDER_TYPE).
The values of the ORDER_NO and ORDER_TYPE fields should be mapped from the newValue field in the JSON.
I have found a solution for this by programmatically creating the schema using the RDD APIs:
Dataset<Row> dataFrame = spark.read().json(inputPath);
dataFrame.printSchema();
JavaRDD<Row> rdd = dataFrame.toJavaRDD();
SchemaBuilder schemaBuilder = new SchemaBuilder();
// get the schema column names in appended format
String columnNames = schemaBuilder.populateColumnSchema(rdd.first(), dataFrame.columns());
SchemaBuilder is a custom class which takes the RDD details and returns delimiter-separated column names.
Then, using RowFactory.create, the JSON values are mapped to the schema.
Doc reference http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
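Alternatively, a sketch using only DataFrame operations: explode columnList, then pivot on columnName so each entry becomes its own column (groupBy/pivot needs Spark 1.6+; field names follow the JSON above):
import org.apache.spark.sql.functions.{col, explode, first}

val df = spark.read.json(inputPath)
val exploded = df
  .select(col("tableName"), col("action"), col("timeStamp"), col("uniqueIdentifier"),
          explode(col("columnList")).alias("c"))
  .select(col("tableName"), col("action"), col("timeStamp"), col("uniqueIdentifier"),
          col("c.columnName").alias("columnName"),
          col("c.newValue").alias("newValue"))

// one row per update, with ORDER_NO, ORDER_TYPE, ... as separate columns
val wide = exploded
  .groupBy("tableName", "action", "timeStamp", "uniqueIdentifier")
  .pivot("columnName")
  .agg(first("newValue"))
The pivoted frames for the different tables can then be joined on their keys and filtered to the latest uniqueIdentifier per key.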

I have to insert records into Cassandra using a Pojo Object Mapping

I need to insert records into Cassandra, so I wrote a function whose input is a CSV file. Say the CSV file's name is test.csv. In Cassandra I have a table test. I need to store each row of the CSV file in the test table. Since I am using the Spark Java API, I am also creating a POJO (DTO) class for mapping the fields of the POJO to the columns of Cassandra.
The problem here is that test.csv has some 50 comma-separated values that have to be stored in 50 columns of the test table in Cassandra, which has a total of 400 columns. So in my test POJO class I created a constructor with those 50 fields.
JavaRDD<String> fileRdd = ctx.textFile("home/user/test.csv");
JavaRDD<Object> fileObjectRdd = fileRdd.map(
    new Function<String, Object>() {
        @Override
        public Object call(String line) {
            // do some transformation with the data
            switch (fileName) {
                case "test": return new TestPojo(1, 3, 4, /* ... 50 fields ... */); // calling the constructor with 50 fields
            }
        }
    });

switch (fileName) {
    case "test": javaFunctions(fileObjectRdd).writerBuilder("testKeyspace", "test", mapToRow(TestPojo.class)).saveToCassandra();
}
So here I am always returning a TestPojo object for each row of the test.csv file, into an RDD of objects. Once that is done, I am saving that RDD to the Cassandra table test using the TestPojo mapping.
My problem is that if, in the future, test.csv has, say, 60 columns, my code will not work, because I am invoking the constructor with only 50 fields.
My question is: how do I create a constructor with all 400 fields in TestPojo, so that no matter how many fields test.csv has, my code is able to handle it?
I tried to create a general constructor with all 400 fields, but ended up with a compilation error saying the limit is only 255 constructor parameters.
Or is there a better way to handle this use case?
Question 2: what if the data from test.csv goes to multiple tables in Cassandra, say 5 columns of test.csv going to the test table and 5 other columns going to the test2 table?
The problem here is that when I am doing
JavaRDD<Object> fileObjectRdd = fileRdd.map(
    new Function<String, Object>() {
        @Override
        public Object call(String line) {
            // do some transformation with the data
            switch (fileName) {
                case "test": return new TestPojo(1, 3, 4, /* ... 50 fields ... */);
            }
        }
    });
I am returning only one TestPojo object. If the data from test.csv goes to both the test table and the test2 table, I will need to return two objects: one TestPojo and one Test2Pojo.
