I have a table with around 1500 columns in SQL Server. I need to read the data from this table, convert it to the proper datatype format, and then insert the records into an Oracle DB.
What is the best way to define the schema for a table with more than 1500 columns? Is there any option other than hard coding the column names along with their datatypes?
Using a case class
Using StructType
The Spark version used is 1.4.
For this type of requirement, I'd suggest the case class approach to prepare a dataframe.
Yes, there are some limitations, like product arity (the 22-field limit), but these can be overcome...
For Scala versions < 2.11, you can do it like the example below:
prepare a case class which extends Product and overrides the following methods.
productArity(): Int: this returns the number of attributes. In our case, it's 33.
productElement(n: Int): Any: given an index, this returns the attribute. As protection, we also add a default case which throws an IndexOutOfBoundsException.
canEqual(that: Any): Boolean: this is the last of the three methods, and it serves as a boundary condition when an equality check is done against the class.
For an example implementation, refer to a Student case class which has 33 fields in it, together with an example student dataset description. A minimal sketch of the pattern is shown below.
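A minimal sketch, assuming a trimmed-down Student with only three fields for brevity (a real 33-field, or 1500-column, class repeats the same boilerplate for every attribute):
// Sketch only: a non-case class that extends Product, so it is not bound by the
// 22-field case class limit of Scala < 2.11.
class Student(val id: Long, val name: String, val grade: Int)
  extends Product with Serializable {

  // Number of attributes (33 in the full Student example).
  override def productArity: Int = 3

  // Return the attribute at the given index; anything else is out of bounds.
  override def productElement(n: Int): Any = n match {
    case 0 => id
    case 1 => name
    case 2 => grade
    case _ => throw new IndexOutOfBoundsException(n.toString)
  }

  // Boundary condition for equality checks against this class.
  override def canEqual(that: Any): Boolean = that.isInstanceOf[Student]
}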
Another option:
Use StructType to define the schema and create the dataframe (if you don't want to use the spark-csv API), for example:
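A minimal sketch, assuming rowRdd is an RDD[Row] already read from the SQL Server table, and showing only three of the ~1500 columns:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Hypothetical column names/types; the real schema has ~1500 entries.
val schema = StructType(Seq(
  StructField("NAME", StringType, nullable = true),
  StructField("AGE", LongType, nullable = true),
  StructField("CREATED_AT", TimestampType, nullable = true)
))

// rowRdd: RDD[Row] built from the source table (assumed to exist).
val df = sqlContext.createDataFrame(rowRdd, schema)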
The options for reading a table with 1500 columns:
1) Using a case class
A case class would not work because it's limited to 22 fields (for Scala versions < 2.11).
2) Using StructType
You can use StructType to define the schema and create the dataframe.
3) Using the spark-csv package
You can use the spark-csv package with .option("inferSchema", "true"), which will automatically infer the schema from the file, for example:
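A sketch of what that looks like on Spark 1.4 (the package coordinates and path below are placeholders; spark-csv is an external package at this version):
// Launch with e.g. --packages com.databricks:spark-csv_2.10:1.5.0 (placeholder version).
val csvDf = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")  // sample the file and guess each column's type
  .load("/path/to/table_export.csv")

// Alternatively, skip inference and supply an explicit StructType:
// val csvDf2 = sqlContext.read.format("com.databricks.spark.csv").schema(customSchema).load("/path/to/table_export.csv")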
You can keep your schema, with its hundreds of columns, in json format, and then read this json file to construct your custom schema.
For example, your schema json could be:
[
{
"columnType": "VARCHAR",
"columnName": "NAME",
"nullable": true
},
{
"columnType": "VARCHAR",
"columnName": "AGE",
"nullable": true
},
.
.
.
]
Now you can read the json and parse it into a case class to form the StructType. Note that the case class field names must match the json keys so that json4s can extract them:
case class Field(columnName: String, columnType: String, nullable: Boolean)
You can create a Map from the Oracle column-type strings used in the json schema to the corresponding Spark DataTypes.
import scala.io.Source
import org.apache.spark.sql.types._
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

val dataType = Map(
  "VARCHAR" -> StringType,
  "NUMERIC" -> LongType,
  "TIMESTAMP" -> TimestampType
  // ... remaining Oracle-type-to-Spark-type mappings
)

def parseJsonForSchema(jsonFilePath: String): StructType = {
  implicit val formats: Formats = DefaultFormats        // needed by json4s extract
  val jsonString = Source.fromFile(jsonFilePath).mkString
  val parsedJson = parse(jsonString)
  val fields = parsedJson.extract[List[Field]]          // the json is an array of Field objects
  val schemaColumns = fields.map(field =>
    StructField(field.columnName, dataType(field.columnType), field.nullable))
  StructType(schemaColumns)
}
This question already has answers here: How to convert a simple DataFrame to a DataSet Spark Scala with case class?
I am trying to run this small Spark program (Spark version 2.1.1):
val rdd = sc.parallelize(List((2012, "Tesla", "S"), (1997, "Ford", "E350"), (2015, "Chevy", "Volt")))
import spark.implicits._
val carDetails: Dataset[CarDetails] = spark.createDataset(rdd).as[CarDetails] // Error Line
carDetails.map(car => {
val name = if (car.name == "Tesla") "S" else car.name
CarDetails(car.year, name, car.model)
}).collect().foreach(print)
It is throwing error on this line:
val carDetails: Dataset[CarDetails] = spark.createDataset(rdd).as[CarDetails]
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`year`' given input columns: [_1, _2, _3];
There is no compilation error!
I tried many changes, like using a List instead of an RDD, and converting to a Dataset first and then calling as[CarDetails], but it didn't work. Now I am clueless.
Why is it taking the columns as _1, _2 and _3 when I have already given the case class?
case class CarDetails(year: Int, name: String, model: String)
I tried to change from Int to Long for year in case class. It still did not work.
Edit:
I changed this line after referring to the probable duplicate question, and it worked:
val carDetails: Dataset[CarDetails] = spark.createDataset(rdd)
.withColumnRenamed("_1","year")
.withColumnRenamed("_2","name")
.withColumnRenamed("_3","model")
.as[CarDetails]
But, I am still not clear as to why I need to rename the columns even after explicitly mapping to a case class.
The rules of as conversion are explained in detail in the API docs:
The method used to map columns depend on the type of U:
When U is a class, fields for the class will be mapped to columns of the same name (case sensitivity is determined by spark.sql.caseSensitive).
When U is a tuple, the columns will be mapped by ordinal (i.e. the first column will be assigned to _1).
When U is a primitive type (i.e. String, Int, etc), then the first column of the DataFrame will be used.
If the schema of the Dataset does not match the desired U type, you can use select along with alias or as to rearrange or rename as required.
To explain this with code: conversion from a case class to a Tuple* is valid (fields are matched by position):
Seq(CarDetails(2012, "Tesla", "S")).toDF.as[(Int, String, String)]
but conversion from a Tuple* to an arbitrary case class is not (fields are matched by name). You have to rename the fields first, as in the question's edit:
Seq((2012, "Tesla", "S")).toDF("year", "name", "model").as[CarDetails]
This has quite interesting practical implications:
A tuple-typed object cannot contain extraneous fields:
case class CarDetailsWithColor(
year: Int, name: String, model: String, color: String)
Seq(
CarDetailsWithColor(2012, "Tesla", "S", "red")
).toDF.as[(Int, String, String)]
// org.apache.spark.sql.AnalysisException: Try to map struct<year:int,name:string,model:string,color:string> to Tuple3, but failed as the number of fields does not line up.;
While a case-class-typed Dataset can:
Seq(
(2012, "Tesla", "S", "red")
).toDF("year", "name", "model", "color").as[CarDetails]
Of course, starting with the case-class-typed variant would save you all the trouble:
sc.parallelize(Seq(CarDetails(2012, "Tesla", "S"))).toDS
I need help with a use case I have encountered: filtering records against a set of rules with Apache Spark.
Since the actual data has too many fields, you can, for simplicity, think of the data as looking like the JSON below:
records : [{
"recordId": 1,
"messages": [{"name": "Tom","city": "Mumbai"},
{"name": "Jhon","address": "Chicago"}, .....]
},....]
rules : [{
ruleId: 1,
ruleName: "rule1",
criterias: {
name: "xyz",
address: "Chicago, Boston"
}
}, ....]
I want to match all records against all rules. Here is the pseudocode:
var matchedRecords = []
for(record <- records)
  for(rule <- rules)
    for(message <- record.messages)
      if(!isMatch(message, rule.criterias))
        break
    if(allMessagesMatched) // if the inner loop completed without break
      matchedRecords.put((record.recordId, rule.ruleId))

def isMatch(message, criteria) =
  for(each field in criteria)
    if(field.value contains comma)
      if(! message.field containsAny field.value)
        return false
    else if(! message.field equals field.value) // value doesn't contain a comma
      return false
  return true // if the loop completed, all criteria matched
There are thousands of records containing thousands of messages, and there are hundreds of such rules.
What are the approaches to solve this kind of problem? Would any specific module help (Spark SQL, Spark MLlib, Spark GraphX)? Do I need to use any third-party lib?
Approach 1:
Have a List[Rules] and an RDD[Records].
Broadcast the List[Rules], as the rules are fewer in number.
Match each record with all the rules (a sketch is shown below).
Still, in this case there is no parallel computation happening for matching each message with the criteria.
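A minimal sketch of that approach, assuming hypothetical Record/Rule classes and an isMatch helper along the lines of the pseudocode above:
// Hypothetical types; the real ones come from the actual data model.
case class Rule(ruleId: Long, criterias: Map[String, String])
case class Record(recordId: Long, messages: Seq[Map[String, String]])

def isMatch(message: Map[String, String], criterias: Map[String, String]): Boolean =
  criterias.forall { case (field, value) =>
    val expected = value.split(",").map(_.trim).toSet   // a comma means "any of these values"
    message.get(field).exists(expected.contains)
  }

val rulesBroadcast = sc.broadcast(rules)   // rules: List[Rule], assumed small enough to broadcast

// For every record, emit (recordId, ruleId) for each rule whose criteria all messages satisfy.
val matched = recordsRdd.flatMap { record =>
  rulesBroadcast.value
    .filter(rule => record.messages.forall(m => isMatch(m, rule.criterias)))
    .map(rule => (record.recordId, rule.ruleId))
}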
I think your suggested approach is a good direction. If I had to solve this task, I would start by implementing a generic trait with a method responsible for matching:
trait FilterRule extends Serializable {
  // `match` is a reserved word in Scala, so the method is named `matches` here
  def matches(record: Record): Boolean
}
Then I would implement specific filters e.g.:
class EqualsRule extends FilterRule
class RegexRule extends FilterRule
Then I would implement composite filters e.g.:
class AndRule extends FilterRule
class OrRule extends FilterRule
...
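A rough sketch of what the concrete and composite rules could look like (the Record/Message shapes, the field access, and the comma-separated "any of" semantics are assumptions based on the question):
// Assumed data model for the sketch.
case class Message(fields: Map[String, String])
case class Record(recordId: Long, messages: Seq[Message])

// A record matches when every message satisfies the criterion, as in the pseudocode.
case class EqualsRule(field: String, expected: String) extends FilterRule {
  def matches(record: Record): Boolean =
    record.messages.forall(_.fields.get(field).contains(expected))
}

// Comma-separated criterion: the message field must equal any one of the listed values.
case class ContainsAnyRule(field: String, values: Set[String]) extends FilterRule {
  def matches(record: Record): Boolean =
    record.messages.forall(m => m.fields.get(field).exists(values.contains))
}

// Composite rules combine other rules.
case class AndRule(rules: FilterRule*) extends FilterRule {
  def matches(record: Record): Boolean = rules.forall(_.matches(record))
}

case class OrRule(rules: FilterRule*) extends FilterRule {
  def matches(record: Record): Boolean = rules.exists(_.matches(record))
}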
Then you can filter your rdd or DataSet with:
// constructing rule - in reality reading json from configuration, parsing json and creating FilterRule object
val rule = AndRule(EqualsRule(...), EqualsRule(...), ...)
// applying rule
rdd.filter(record => rule.matches(record))
The second option is to try using the existing Spark SQL functions and DataFrame for filtering, where you can build pretty complex expressions using and, or, and multiple columns, as sketched below. The drawback of this approach is that it is not type-safe and unit testing would be more complex.
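For illustration, a small sketch of that second option, assuming the messages have already been flattened into a hypothetical messagesDf with name and address columns:
import org.apache.spark.sql.functions._

// One rule expressed as a Column predicate; a comma-separated criterion becomes isin(...).
val rule1 = col("name") === "xyz" && col("address").isin("Chicago", "Boston")

// Keep only the rows that satisfy the rule.
val matchedDf = messagesDf.filter(rule1)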
When I look for ways to parse json within a string column of a dataframe, I keep finding results that simply read json file sources instead. My source is actually a Hive ORC table with, in one of the columns, some strings that are in a json format. I'd really like to convert that to something parsed, like a map.
I'm having trouble finding a way to do this:
import java.util.Date
import org.apache.spark.sql.Row
import scala.util.parsing.json.JSON
val items = sql("select * from db.items limit 10")
//items.printSchema
val internal = items.map {
case Row(externalid: Long, itemid: Long, locale: String,
internalitem: String, version: Long,
createdat: Date, modifiedat: Date)
=> JSON.parseFull(internalitem)
}
I thought this should work, but maybe there's a more Spark way of doing this instead because I get the following error:
java.lang.ClassNotFoundException: scala.Any
at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)
Specifically, my input data looks approximately like this:
externalid, itemid, locale, internalitem, version, createdat, modifiedat
123, 321, "en_us", "{'name':'thing','attr':{
21:{'attrname':'size','attrval':'big'},
42:{'attrname':'color','attrval':'red'}
}}", 1, 2017-05-05…, 2017-05-06…
Yes, it's not strictly RFC-compliant JSON.
The attr keys can be 5 to 30 of any of 80,000 values, so I wanted to get to something like this instead:
externalid, itemid, locale, internalitem, version, createdat, modifiedat
123, 321, "en_us", "{"name':"thing","attr":[
{"id":21,"attrname':"size","attrval":"big"},
{"id":42,"attrname":"color","attrval":"red"}
]}", 1, 2017-05-05…, 2017-05-06…
Then flatten the internalitem to fields and explode the attr array:
externalid, itemid, locale, name, attrid, attrname, attrval, version, createdat, modifiedat
123, 321, "en_us", "thing", 21, "size", "big", 1, 2017-05-05…, 2017-05-06…
123, 321, "en_us", "thing", 42, "color", "red", 1, 2017-05-05…, 2017-05-06…
I've never used such computations, but I have some advice for you:
Before doing any operation on columns on your own, check the sql.functions package, which contains a whole bunch of helpful functions for working with columns, like date extraction and formatting, string concatenation and splitting, ... and it also provides a couple of functions for working with json objects, such as from_json and json_tuple.
To use those methods you simply need to import them and call them inside a select method, like this:
import spark.implicits._
import org.apache.spark.sql.functions._
val transformedDf = df.select($"externalid", $"itemid", …, from_json($"internalitem", schema), $"version", …)
First of all, you have to create a schema for your json column and put it in the schema variable; a sketch is shown below.
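A rough sketch, assuming the internalitem strings keep the shape shown in the question (the schema, the parser options for tolerating the single quotes, and the attr-as-map layout are all assumptions):
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Assumed structure: a name plus a map of attr-id -> {attrname, attrval}.
val itemSchema = StructType(Seq(
  StructField("name", StringType),
  StructField("attr", MapType(StringType, StructType(Seq(
    StructField("attrname", StringType),
    StructField("attrval", StringType)))))
))

// The strings are not strict JSON, so relax the parser via the JSON source options;
// the exact options needed depend on how far the input deviates from strict JSON.
val jsonOptions = Map("allowSingleQuotes" -> "true", "allowUnquotedFieldNames" -> "true")

val parsed = items.select(
  $"externalid", $"itemid", $"locale",
  from_json($"internalitem", itemSchema, jsonOptions).as("item"),
  $"version", $"createdat", $"modifiedat")

// Flatten, then explode the attr map into one row per attribute.
val flattened = parsed
  .select($"externalid", $"itemid", $"locale", $"item.name".as("name"),
    explode($"item.attr"), $"version", $"createdat", $"modifiedat")
  .select($"externalid", $"itemid", $"locale", $"name",
    $"key".as("attrid"), $"value.attrname".as("attrname"), $"value.attrval".as("attrval"),
    $"version", $"createdat", $"modifiedat")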
Hope it helps.
I'm using the following JSON Schema in my cloudant database:
{...
departureWeather:{
temp:30,
otherfields:xyz
},
arrivalWeather:{
temp:45,
otherfields: abc
}
...
}
I'm then loading the data into a dataframe using the cloudant-spark connector. If I try to select fields like so:
df.select("departureWeather.temp", "arrivalWeather.temp")
I end up with a dataframe that has 2 columns with the same name, e.g. temp. It looks like the Spark datasource framework is flattening the name using only the last part.
Is there an easy way to deduplicate the column names?
You can use aliases:
df.select(
col("departureWeather.temp").alias("departure_temp"),
col("arrivalWeather.temp").alias("arrival_temp")
)
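If there are many such nested fields, the select list could also be built programmatically. A small sketch, assuming the hypothetical convention of deriving the prefix by stripping the "Weather" suffix from the parent name:
import org.apache.spark.sql.functions.col

val parents = Seq("departureWeather", "arrivalWeather")
val children = Seq("temp")   // add the other nested field names here

// Build one aliased column per (parent, child) pair, e.g. departureWeather.temp -> departure_temp.
val selection = for {
  p <- parents
  c <- children
} yield col(s"$p.$c").alias(s"${p.stripSuffix("Weather").toLowerCase}_$c")

val flatDf = df.select(selection: _*)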
I am trying to make a function that will pull the name of the column out of a dataframe schema. So what I have is the initial function defined:
val df = sqlContext.parquetFile(inputVal.toString)
val dfSchema = df.schema
def schemaMatchP(schema: StructType) : Map[String,List[Int]] =
schema
// get the 1st word (column type) in upper cases
.map(columnDescr => columnDescr
If I do something like this:
.map(columnDescr => columnDescr.toString.split(',')(0).toUpperCase)
I will get STRUCTFIELD(HH_CUST_GRP_MBRP_ID,BINARYTYPE,TRUE)
How do you handle a StructField so I can grab the first element (the column name) out of each field in the schema, i.e. my column names: HH_CUST_GRP_MBRP_ID, etc.?
When in doubt, look at what the source does itself. DataFrame.toString has the answer :). StructField is a case class with a name property, so just do:
schema.map(f => s"${f.name}")
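Applied to the dfSchema from the question, that might look like this (schema.fieldNames is an equivalent built-in shortcut):
// Every StructField in the schema exposes its column name.
val columnNames: Seq[String] = dfSchema.map(_.name)

// Equivalent, using StructType's built-in helper:
val columnNamesToo: Array[String] = dfSchema.fieldNames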