spark read schema from separate file - apache-spark

I have my data in HDFS and its schema in MySQL. I'm able to fetch the schema into a DataFrame, and it looks like this:
col1,string
col2,date
col3,int
col4,string
How can I read this schema and apply it to the data while reading from HDFS?
I will be reading the schema from MySQL, and it will be different for each dataset. I need a dynamic approach where, for any dataset, I can fetch the schema details from MySQL, convert them into a schema, and then apply it to the dataset.

You can use the PySpark helper _parse_datatype_string (note the leading underscore: it is a private function in pyspark.sql.types):
from pyspark.sql.types import _parse_datatype_string
# Stand-in for the schema DataFrame fetched from MySQL: one "name,type" string per row
df = spark.createDataFrame([
    ["col1,string"],
    ["col2,int"],
    ["col3,int"]
], ["schema"])
# Turn each "name,type" row into "name:type" and join the fields with commas
str_schema = ",".join(map(lambda c: c["schema"].replace(",", ":"), df.collect()))
# col1:string,col2:int,col3:int
final_schema = _parse_datatype_string(str_schema)
# StructType(List(StructField(col1,StringType,true),StructField(col2,IntegerType,true),StructField(col3,IntegerType,true)))
_parse_datatype_string expects a DDL-formatted string, e.g. col1:string, col2:int, so we first replace the , in each row with : and then join the rows together, separated by commas. The function returns an instance of StructType, which is your final schema.
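For the dynamic case in the question, a minimal sketch of the same idea follows. It assumes the MySQL schema table has already been collected into a list of (column name, type name) pairs, and it relies on DataFrameReader.schema accepting a DDL string such as "col1 string, col2 date" (Spark 2.3 or later). The helper names rows_to_ddl and read_with_schema are hypothetical, as is the choice of CSV as the file format.

```python
def rows_to_ddl(rows):
    """Build a DDL schema string from (column_name, type_name) pairs."""
    fields = []
    for name, type_name in rows:
        # Backticks keep column names with spaces or reserved words valid.
        fields.append("`{}` {}".format(name.strip(), type_name.strip()))
    return ", ".join(fields)


def read_with_schema(spark, path, rows):
    """Read CSV data from `path` using the schema described by `rows`.

    `spark` is an existing SparkSession; CSV is an assumed file format.
    """
    return spark.read.schema(rows_to_ddl(rows)).csv(path)


# Schema rows as they might come back from MySQL (values from the question).
schema_rows = [("col1", "string"), ("col2", "date"),
               ("col3", "int"), ("col4", "string")]
ddl = rows_to_ddl(schema_rows)
print(ddl)
```

With a live session, read_with_schema(spark, path, schema_rows) would apply the schema at read time, so a different dataset only needs a different list of rows.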

Related

How to query data from the PySpark SQL context if a key is not present in a JSON file, and how to handle the resulting SQL AnalysisException

I am using PySpark to transform JSON into a DataFrame, and the transformation itself works. The problem is that one key is present in some JSON files and absent from others. When I flatten the JSON with the PySpark SQL context and the key is missing from a file, creating the DataFrame fails with a SQL AnalysisException.
For example, my sample JSON:
{
"_id" : ObjectId("5eba227a0bce34b401e7899a"),
"origin" : "inbound",
"converse" : "72412952",
"Start" : "2020-04-20T06:12:20.89Z",
"End" : "2020-04-20T06:12:53.919Z",
"ConversationMos" : 4.88228940963745,
"ConversationRFactor" : 92.4383773803711,
"participantId" : "bbe4de4c-7b3e-49f1-8",
}
In the JSON above, participantId will be present in some JSON files and absent from others.
My PySpark code snippet:
fetchFile = spark.read.format(file_type)\
    .option("inferSchema", "true")\
    .option("header","true")\
    .load(generated_FileLocation)
fetchFile.registerTempTable("CreateDataFrame")
tempData = sqlContext.sql("select origin,converse,start,end,participantId from CreateDataFrame")
When participantId is not present in a JSON file, an exception is thrown. How can I handle this so that, if the key is not present, the column contains null? Or is there another way to handle it?
You can simply check whether the column is there and, if it is not, add it with empty values.
The code for this looks like:
from pyspark.sql import functions as f
fetchFile = spark.read.format(file_type)\
    .option("inferSchema", "true")\
    .option("header","true")\
    .load(generated_FileLocation)
# Add the column only when the inferred schema lacks it.
# Use f.lit(None).cast('string') instead of f.lit('') to get nulls.
if 'participantId' not in fetchFile.columns:
    fetchFile = fetchFile.withColumn('participantId', f.lit(''))
fetchFile.registerTempTable("CreateDataFrame")
tempData = sqlContext.sql("select origin,converse,start,end,participantId from CreateDataFrame")
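The same check can be generalised to any list of expected columns. The sketch below is one way to do that, assuming null string columns are an acceptable filler; missing_columns and with_required_columns are hypothetical helper names, and the inferred column list is taken from the sample JSON in the question.

```python
def missing_columns(existing, required):
    """Return the required column names absent from `existing`.

    The comparison ignores case, matching Spark SQL's default
    (spark.sql.caseSensitive=false).
    """
    present = {name.lower() for name in existing}
    return [name for name in required if name.lower() not in present]


def with_required_columns(df, required):
    """Add every missing required column to `df` as a null string column."""
    # Imported here so the pure helper above works without PySpark installed.
    from pyspark.sql import functions as F

    for name in missing_columns(df.columns, required):
        df = df.withColumn(name, F.lit(None).cast("string"))
    return df


required = ["origin", "converse", "start", "end", "participantId"]
inferred = ["_id", "origin", "converse", "Start", "End",
            "ConversationMos", "ConversationRFactor"]
print(missing_columns(inferred, required))  # ['participantId']
```

Calling with_required_columns(fetchFile, required) before registering the temp table would let the select statement run whether or not a given file carries participantId.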
I think you're calling Spark to read one file at a time and inferring the schema each time.
What Spark is telling you with the SQL AnalysisException is that your file, and therefore your inferred schema, doesn't have the key you're looking for. What you have to do is get a good schema and apply it to all of the files you want to process, ideally processing all of your files at once.
There are three strategies:
1. Infer your schema from lots of files. You should get the aggregate of all of the keys. Spark will run two passes over the data.
df = spark.read.json('/path/to/your/directory/full/of/json/files')
schema = df.schema
print(schema)
2. Create a schema object.
I find this tedious to do, but it will speed up your code. Here is a reference: https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.types.StructType
3. Read the schema from a well-formed file, then use that to read your whole directory. Also, by printing the schema object, you can copy and paste it back into your code for option #2.
# .schema extracts the StructType; spark.read.json alone returns a DataFrame
schema = spark.read.json('path/to/well/formed/file.json').schema
print(schema)
my_df = spark.read.schema(schema).json('path/to/entire/folder/full/of/json')
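Strategy 2 can be made less tedious by keeping the schema as data instead of hand-writing StructField objects. The sketch below builds the JSON form of a schema, in the same layout that df.schema.jsonValue() produces, and hands it to StructType.fromJson. It assumes every field is nullable and reads every field as a string; struct_schema and to_struct_type are hypothetical helper names.

```python
import json


def struct_schema(fields):
    """Build the JSON form of a Spark StructType from (name, type) pairs.

    The layout matches what `df.schema.jsonValue()` returns; every field is
    marked nullable, which is an assumption of this sketch.
    """
    return {
        "type": "struct",
        "fields": [
            {"name": name, "type": type_name, "nullable": True, "metadata": {}}
            for name, type_name in fields
        ],
    }


def to_struct_type(schema_dict):
    """Turn the dict into a StructType (needs PySpark installed)."""
    from pyspark.sql.types import StructType

    return StructType.fromJson(schema_dict)


# Field names taken from the sample JSON in the question.
schema_dict = struct_schema([
    ("origin", "string"),
    ("converse", "string"),
    ("Start", "string"),
    ("End", "string"),
    ("participantId", "string"),
])
# The dict survives a JSON round trip, so it can live in a file or a table.
restored = json.loads(json.dumps(schema_dict))
print(restored == schema_dict)
```

Reading with spark.read.schema(to_struct_type(schema_dict)).json(path) then gives every file the same columns, with null wherever a key such as participantId is absent.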

Expressing spark `StructType` in avro schema

How would you describe the Spark StructType data type in an Avro schema? I am generating a Parquet file whose format is described by an Avro schema. This file is then loaded from S3 into Spark. Avro has array and map data types, but these do not correspond to StructType.
Using the package org.apache.spark.sql.avro (Spark 2.4), you can convert Spark SQL schemas to Avro schemas and vice versa.
You can try it this way:
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types.StructType
// toSqlType returns a SchemaType(dataType, nullable) for the given Avro schema
val sqlType = SchemaConverters.toSqlType(avroSchema)
// genericRecordToRow is your own helper that maps a GenericRecord to a Row
val rowRDD = yourGenericRecordRDD.map(record => genericRecordToRow(record, sqlType))
val df = sqlContext.createDataFrame(rowRDD, sqlType.dataType.asInstanceOf[StructType])
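In Avro itself, the counterpart of a StructType is the record type: SchemaConverters maps an Avro record to a StructType and each record field to a StructField. A small sketch of such a schema, built as plain JSON, is shown below. The record and field names (Person, Address and so on) are made up for illustration.

```python
import json

# An Avro record nested inside another record. When converted to a Spark
# schema, the outer record becomes the top-level StructType and the
# "address" field becomes a nested StructType column.
avro_schema = {
    "type": "record",
    "name": "Person",            # hypothetical record name
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
        {
            "name": "address",
            "type": {
                "type": "record",
                "name": "Address",   # hypothetical nested record
                "fields": [
                    {"name": "city", "type": "string"},
                    # A union with "null" marks the field as optional.
                    {"name": "zip", "type": ["null", "string"], "default": None},
                ],
            },
        },
    ],
}

# Serialise to the .avsc text that Avro tools expect.
avsc_text = json.dumps(avro_schema, indent=2)
print(avsc_text)
```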

Spark infer schema with limit during a read.csv

I'd like to infer a Spark.DataFrame schema from a directory of CSV files using a small subset of the rows (say limit(100)).
However, setting inferSchema to True means that the Input Size / Records for the FileScanRDD seems to always be equal to the number of rows in all the CSV files.
Is there a way to make the FileScan more selective, such that Spark looks at fewer rows when inferring a schema?
Note: setting the samplingRatio option to be < 1.0 does not have the desired behaviour, though it is clear that inferSchema uses only the sampled subset of rows.
You could read a subset of your input data into a Dataset[String].
The csv method allows you to pass this as a parameter.
Here is a simple example (I'll leave reading the sample of rows from the input file to you):
val data = List("1,2,hello", "2,3,what's up?")
val csvRDD = sc.parallelize(data)
val df = spark.read.option("inferSchema","true").csv(csvRDD.toDS)
df.schema
When run in spark-shell, the final line from the above prints (I reformatted it for readability):
res4: org.apache.spark.sql.types.StructType =
StructType(
StructField(_c0,IntegerType,true),
StructField(_c1,IntegerType,true),
StructField(_c2,StringType,true)
)
Which is the correct Schema for my limited input data set.
Assuming you are only interested in the schema, here is a possible approach based on a post by cipri.l:
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.execution.datasources.csv.{CSVOptions, TextInputCSVDataSource}
import org.apache.spark.sql.types.StructType
// Note: CSVOptions and TextInputCSVDataSource are Spark internals, so their signatures can differ between Spark versions
def inferSchemaFromSample(sparkSession: SparkSession, fileLocation: String, sampleSize: Int, isFirstRowHeader: Boolean): StructType = {
  // Build a Dataset composed of the first sampleSize lines from the input files as plain text strings
  val dataSample: Array[String] = sparkSession.read.textFile(fileLocation).head(sampleSize)
  import sparkSession.implicits._
  val sampleDS: Dataset[String] = sparkSession.createDataset(dataSample)
  // Provide information about the CSV files' structure
  val firstLine = dataSample.head
  val extraOptions = Map("inferSchema" -> "true", "header" -> isFirstRowHeader.toString)
  val csvOptions: CSVOptions = new CSVOptions(extraOptions, sparkSession.sessionState.conf.sessionLocalTimeZone)
  // Infer the CSV schema based on the sample data
  val schema = TextInputCSVDataSource.inferFromDataset(sparkSession, sampleDS, Some(firstLine), csvOptions)
  schema
}
Unlike GMc's answer above, this approach infers the schema directly, the same way DataFrameReader.csv() does in the background, but without building an additional Dataset with that schema only to read the schema back from it.
The schema is inferred from a Dataset[String] containing only the first sampleSize lines from the input files as plain text strings.
When retrieving samples from data, Spark has only two types of methods:
1. Methods that retrieve a given percentage of the data. This operation takes random samples from all partitions. It benefits from higher parallelism, but it must read all the input files.
2. Methods that retrieve a specific number of rows. This operation must collect the data on the driver, but it could read a single partition (if the required row count is low enough).
Since you mentioned you want to use a specific small number of rows, and since you want to avoid touching all the data, I provided a solution based on option 2.
PS: The DataFrameReader.textFile method accepts paths to files and folders, and it also has a varargs variant, so you could pass in one or more files or folders.
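For PySpark users, a rough equivalent of the row-count approach is sketched below. It stays on the public API, so, like the first answer, it builds a small DataFrame from the sampled lines and reads the schema back from it. The function name infer_csv_schema_from_sample is hypothetical, and spark is assumed to be an existing SparkSession.

```python
def infer_csv_schema_from_sample(spark, path, sample_size, header=True):
    """Infer a CSV schema from the first `sample_size` lines under `path`.

    `spark` is an existing SparkSession. Only the sampled lines are parsed
    for type inference.
    """
    # take() collects the first lines on the driver, like head() in Scala.
    rows = spark.read.text(path).take(sample_size)
    lines = spark.sparkContext.parallelize([row.value for row in rows])
    # csv() also accepts an RDD of strings in place of a path.
    sample_df = spark.read.csv(lines, header=header, inferSchema=True)
    return sample_df.schema
```

The returned StructType can then be passed to spark.read.schema(...).csv(path) so that the full read skips inference altogether.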

Convert JSON ArrayType to MapType in Spark

I have a JSON file in the format below that I am trying to read and query from a Spark job.
{
"dimensions": [
{"id1":"val1"},
{"id2":"val2"},
{"id3":"val"}
...
]
}
val schema = (new StructType).
  add("dimensions",
    new ArrayType(MapType(StringType, StringType), true))
val df = spark.read.schema(schema).json(file)
'dimensions' is a JSON array that contains key-value pairs. I want to read it as a single key-value map so that it's easy to query in Spark SQL.
I tried the schema above, but it gives me an array in which each item is of MapType.
Is there a way to read the above JSON array as a MapType?
In the end, I want to be able to write Spark SQL like the following, where I can select individual keys:
val result = spark.sql("SELECT dimensions.id1, dimensions.id3 FROM table")
Thanks!
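This question has no answer in the thread. One possible approach, sketched here in PySpark and not taken from the original post, is to keep the array-of-maps schema for reading and then collapse the array into a single map with a UDF. The names merge_maps and collapse_dimensions are hypothetical, and the sketch assumes that a later entry should win when a key repeats.

```python
def merge_maps(maps):
    """Collapse a list of small dicts into one dict (later keys win)."""
    merged = {}
    for entry in maps or []:   # a null array yields an empty map
        if entry:              # skip null or empty elements
            merged.update(entry)
    return merged


def collapse_dimensions(df, column="dimensions"):
    """Replace an array<map<string,string>> column with one map column."""
    # Imported here so merge_maps can be used without PySpark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import MapType, StringType

    merge_udf = F.udf(merge_maps, MapType(StringType(), StringType()))
    return df.withColumn(column, merge_udf(F.col(column)))


sample = [{"id1": "val1"}, {"id2": "val2"}, {"id3": "val"}]
print(merge_maps(sample))
```

After registering the result of collapse_dimensions(df) as a table, individual keys can be selected with map lookups such as SELECT dimensions['id1'], dimensions['id3'] FROM table.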

Spark SQL - Exact difference between Creating schema implicitly & Programmatically

I am trying to understand the exact difference between creating a schema implicitly and creating it programmatically, and which method suits which scenario.
The information on the Databricks site is not very elaborate or explanatory.
With the reflection approach (implicit RDD to DataFrame), we create a case class and choose specific columns from a text file using the map function.
In the programmatic style, we load the dataset from a text file (similar to reflection).
We create a schema string: "knowing the file, we can specify the columns we need" (similar to the case class in the reflection approach).
We import the Row API, which again maps to the specific columns and data types used in the schema string (similar to case classes).
Then we create the DataFrame, and after this everything is the same.
So what is the exact difference between these two approaches?
http://spark.apache.org/docs/1.5.2/sql-programming-guide.html#inferring-the-schema-using-reflection
http://spark.apache.org/docs/1.5.2/sql-programming-guide.html#programmatically-specifying-the-schema
Please Explain...
The produced schemas are the same, so from that point of view, there's no difference. In both cases, you're supplying a schema for your data, but in one case, you're doing it from a case class, in the other you can use collections, since a schema is built as a StructType(Array[StructField]).
So it's basically a choice between tuples and collections. The way I see it, the biggest difference is that case classes have to be in the code, while programmatically specifying the schema can be done at runtime, so you could, for instance, build a schema based on another DataFrame that you're reading at runtime.
As an example, I wrote a generic tool to "nest" data, reading from CSV and transforming a set of prefixed fields into an array of structs.
Since the tool is generic, and the schema is known only at runtime, I used the programmatic approach.
On the other hand, it's generally easier to code with reflection, since you don't have to deal with all the StructField objects; because these are derived from the Hive metastore, their data types have to be mapped to your Scala types.
Programmatically Specifying the Schema
When case classes cannot be defined ahead of time (for example, the structure of records is encoded in a string, or a text dataset will be parsed and fields will be projected differently for different users), a DataFrame can be created programmatically with three steps.
1. Create an RDD of Rows from the original RDD.
2. Create the schema represented by a StructType matching the structure of Rows in the RDD created in Step 1.
3. Apply the schema to the RDD of Rows via the createDataFrame method provided by SQLContext.
For example:
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Create an RDD
val people = sc.textFile("examples/src/main/resources/people.txt")
// The schema is encoded in a string
val schemaString = "name age"
// Import Row.
import org.apache.spark.sql.Row;
// Import Spark SQL data types
import org.apache.spark.sql.types.{StructType,StructField,StringType};
// Generate the schema based on the string of schema
val schema =
  StructType(
    schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
// Convert records of the RDD (people) to Rows.
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
// Apply the schema to the RDD.
val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
// Register the DataFrames as a table.
peopleDataFrame.registerTempTable("people")
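A PySpark counterpart of the three steps above might look like the sketch below. It assumes the records are available as an iterable of text lines rather than an RDD, treats every column as a string as the Scala example does, and relies on createDataFrame accepting a datatype string as its schema. The helper names are hypothetical.

```python
def schema_from_names(schema_string):
    """Turn "name age" into the DDL string "name string, age string"."""
    return ", ".join("{} string".format(name) for name in schema_string.split())


def parse_line(line):
    """Split one "Michael, 29" style record into a tuple of trimmed strings."""
    return tuple(part.strip() for part in line.split(","))


def people_dataframe(spark, lines, schema_string):
    """Apply a schema known only at runtime to an iterable of text records."""
    rows = [parse_line(line) for line in lines]    # step 1: build the rows
    schema = schema_from_names(schema_string)      # step 2: build the schema
    return spark.createDataFrame(rows, schema)     # step 3: apply it


print(schema_from_names("name age"))   # name string, age string
print(parse_line("Michael, 29"))       # ('Michael', '29')
```

Because schema_string is an ordinary runtime value, the same function works for any set of column names without a class being defined in the code.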
Inferring the Schema Using Reflection
The Scala interface for Spark SQL supports automatically converting an RDD containing case classes to a DataFrame. The case class defines the schema of the table. The names of the arguments to the case class are read using reflection and become the names of the columns. Case classes can also be nested or contain complex types such as Sequences or Arrays. This RDD can be implicitly converted to a DataFrame and then be registered as a table. Tables can be used in subsequent SQL statements.
For example:
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
// Define the schema using a case class.
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface.
case class Person(name: String, age: Int)
// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
people.registerTempTable("people")
