Writing a Spark SQL query on data without a header or schema - apache-spark

I want to write a generic script that can run SQL queries on a file that doesn't have a header or pre-defined schema. For example, a file could look like:
Bob,32
Alice, 24
Jane,65
Doug,33
Peter,19
And the SQL query might be:
SELECT COUNT(DISTINCT ??)
FROM temp_table
WHERE ?? > 32
I am wondering what to put in the ??.

You can define a custom schema while reading, like this:
import org.apache.spark.sql.types._

val schema = StructType(
  StructField("field1", StringType, true) ::
  StructField("field2", IntegerType, true) :: Nil
)
val df = spark.read.format("csv")
  .option("sep", ",")
  .option("header", "false")
  .schema(schema)
  .load("examples/src/main/resources/people.csv")
You can also skip the schema, which results in default column names (not preferred):
val df = spark.read.format("csv")
  .option("sep", ",")
  .option("header", "false")
  .load("examples/src/main/resources/people.csv")
+-----+-----+
| _c0| _c1|
+-----+-----+
| Bob| 32 |
| .. | ... |
+-----+-----+
With those column names you can fill in the ?? placeholders in your Spark SQL query.
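For example, a minimal sketch using the field1/field2 names defined above, with temp_table as the view name from the question:
df.createOrReplaceTempView("temp_table")
spark.sql("SELECT COUNT(DISTINCT field1) FROM temp_table WHERE field2 > 32").show()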

It seems the default schema has column names _c0, _c1, etc.
val df = spark.read.format("csv").load("test.txt")
scala> df.printSchema
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
In Spark 2.0,
df.createOrReplaceTempView("temp_table")
spark.sql("SELECT COUNT(DISTINCT _c1) FROM temp_table WHERE cast(_c1 as int) > 32")

Related

Spark: load csv file with a different schema

I have a CSV file like this:
product price,product origin,phone number
20,US,200200
I would like to load the CSV file using a new schema so that my dataset looks like this:
|price | origin | number |
|20 | US | 200200 |
I tried to create a schema using StructField:
sparkSession.read().format("csv")
  .option("header", "false")
  .option("delimiter", ",")
  .schema(myScheme)
  .load(csv)
but what I got is like this:
|price | origin | number |
|200200 | US | 20 |
What is the correct way to load the CSV with a new schema and the correct column order?
Using a CSV file with the exact contents that you posted in your question:
product price,product origin,phone number
20,US,200200
You should be able to create a schema by using types from org.apache.spark.sql.types._. You could do something like this:
import org.apache.spark.sql.types._
val mySchema = new StructType()
  .add("product price", IntegerType)
  .add("product origin", StringType)
  .add("phone number", StringType)

val df = spark
  .read
  .option("header", "true")
  .schema(mySchema)
  .csv("./simpleCSV.csv")
df.show
+-------------+--------------+------------+
|product price|product origin|phone number|
+-------------+--------------+------------+
| 20| US| 200200|
+-------------+--------------+------------+
df.printSchema
root
|-- product price: integer (nullable = true)
|-- product origin: string (nullable = true)
|-- phone number: string (nullable = true)
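If you also want the shorter column names from the question (price, origin, number), one hedged option is to simply rename the columns after reading, for example:
val renamed = df.toDF("price", "origin", "number")
renamed.show()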
Hope this helps!

Reading a .txt file with colon (:) in spark 2.4

I am trying to read a .txt file in Spark 2.4 and load it into a DataFrame.
The file data looks like this (under a single manager there are many employees):
Manager_21: Employee_575,Employee_2703,
Manager_11: Employee_454,Employee_158,
Manager_4: Employee_1545,Employee_1312
The code I have written in Scala Spark 2.4:
val df = spark.read
  .format("csv")
  .option("header", "true") // first line in file has headers
  .option("mode", "DROPMALFORMED")
  .load("D:/path/myfile.txt")
df.printSchema()
Unfortunately, when printing the schema I can see that every employee becomes its own column, with the first line (Manager_21's) treated as the header.
root
|-- Manager_21: Employee_575: string (nullable = true)
|-- Employee_454: string (nullable = true)
|-- Employee_1312: string (nullable = true)
... etc.
I am not sure if this is possible in Spark Scala.
Expected output:
All employees of a manager in the same column.
For example, Manager_21 has two employees and both should be in the same column.
Or: how can we see which employees are under a particular manager?
Manager_21    | Manager_11   | Manager_4
Employee_575  | Employee_454 | Employee_1545
Employee_2703 | Employee_158 | Employee_1312
Is it possible to do this some other way? Please suggest.
Thanks
Try using spark.read.text, then groupBy and pivot, to get the desired result.
Example:
val df = spark.read.text("<path>")
df.show(10,false)
//+--------------------------------------+
//|value |
//+--------------------------------------+
//|Manager_21: Employee_575,Employee_2703|
//|Manager_11: Employee_454,Employee_158 |
//|Manager_4: Employee_1545,Employee_1312|
//+--------------------------------------+
import org.apache.spark.sql.functions._
df.withColumn("mid",monotonically_increasing_id).
withColumn("col1",split(col("value"),":")(0)).
withColumn("col2",split(split(col("value"),":")(1),",")).
groupBy("mid").
pivot(col("col1")).
agg(min(col("col2"))).
select(max("Manager_11").alias("Manager_11"),max("Manager_21").alias("Manager_21") ,max("Manager_4").alias("Manager_4")).
selectExpr("explode(arrays_zip(Manager_11,Manager_21,Manager_4))").
select("col.*").
show()
//+-------------+-------------+--------------+
//| Manager_11| Manager_21| Manager_4|
//+-------------+-------------+--------------+
//| Employee_454| Employee_575| Employee_1545|
//| Employee_158|Employee_2703| Employee_1312|
//+-------------+-------------+--------------+
UPDATE:
val df = spark.read.text("<path>")

val df1 = df.withColumn("mid", monotonically_increasing_id).
  withColumn("col1", split(col("value"), ":")(0)).
  withColumn("col2", split(split(col("value"), ":")(1), ",")).
  groupBy("mid").
  pivot(col("col1")).
  agg(min(col("col2"))).
  select(max("Manager_11").alias("Manager_11"), max("Manager_21").alias("Manager_21"), max("Manager_4").alias("Manager_4")).
  selectExpr("explode(arrays_zip(Manager_11,Manager_21,Manager_4))")
//create temp table
df1.createOrReplaceTempView("tmp_table")
sql("select col.* from tmp_table").show(10,false)
//+-------------+-------------+--------------+
//|Manager_11 |Manager_21 |Manager_4 |
//+-------------+-------------+--------------+
//| Employee_454| Employee_575| Employee_1545|
//|Employee_158 |Employee_2703|Employee_1312 |
//+-------------+-------------+--------------+

Parquet data and partition issue in Spark Structured streaming

I am using Spark Structured Streaming; my DataFrame has the following schema:
root
|-- data: struct (nullable = true)
| |-- zoneId: string (nullable = true)
| |-- deviceId: string (nullable = true)
| |-- timeSinceLast: long (nullable = true)
|-- date: date (nullable = true)
How can I do a writeStream in Parquet format, writing the data (zoneId, deviceId, timeSinceLast; everything except date) and partitioning it by date? I tried the following code and the partitionBy clause did not work:
val query1 = df1
  .writeStream
  .format("parquet")
  .option("path", "/Users/abc/hb_parquet/data")
  .option("checkpointLocation", "/Users/abc/hb_parquet/checkpoint")
  .partitionBy("data.zoneId")
  .start()
If you want to partition by date, then you have to use the date column in the partitionBy() method.
val query1 = df1
  .writeStream
  .format("parquet")
  .option("path", "/Users/abc/hb_parquet/data")
  .option("checkpointLocation", "/Users/abc/hb_parquet/checkpoint")
  .partitionBy("date")
  .start()
If you want the data partitioned in a <year>/<month>/<day> structure, make sure that the date column is of DateType and then create appropriately formatted columns:
val df = dataset.withColumn("date", dataset.col("date").cast(DataTypes.DateType))

df.withColumn("year", functions.date_format(df.col("date"), "yyyy")) // "yyyy" is the calendar year; "YYYY" would be the week-based year
  .withColumn("month", functions.date_format(df.col("date"), "MM"))
  .withColumn("day", functions.date_format(df.col("date"), "dd"))
  .writeStream
  .format("parquet")
  .option("path", "/Users/abc/hb_parquet/data")
  .option("checkpointLocation", "/Users/abc/hb_parquet/checkpoint")
  .partitionBy("year", "month", "day")
  .start()
I think you should try the repartition method, which can take two kinds of arguments:
a column (or columns)
the number of desired partitions
I suggest using repartition("date") to partition your data by date.
A great link on the subject: https://hackernoon.com/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4
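A minimal sketch of that suggestion applied to the streaming job from the question (df1 and the paths are taken from the question; note that repartition controls the in-memory partitioning, while the writer's partitionBy controls the output directory layout):
import org.apache.spark.sql.functions.col

val query2 = df1
  .repartition(col("date"))   // shuffle rows so each date ends up in its own in-memory partition
  .writeStream
  .format("parquet")
  .option("path", "/Users/abc/hb_parquet/data")
  .option("checkpointLocation", "/Users/abc/hb_parquet/checkpoint")
  .partitionBy("date")        // output directories laid out as date=<value>
  .start()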

inferSchema using spark.read.format("com.crealytics.spark.excel") is inferring double for a date type column

I am working on PySpark (Python 3.6 and Spark 2.1.1) and trying to fetch data from an Excel file using spark.read.format("com.crealytics.spark.excel"), but it is inferring double for a date type column.
Example:
Input -
df = spark.read.format("com.crealytics.spark.excel").\
    option("location", "D:\\Users\\ABC\\Desktop\\TmpData\\Input.xlsm").\
    option("spark.read.simpleMode", "true").\
    option("treatEmptyValuesAsNulls", "true").\
    option("addColorColumns", "false").\
    option("useHeader", "true").\
    option("inferSchema", "true").\
    load("com.databricks.spark.csv")
Result:
Name | Age | Gender | DateOfApplication
________________________________________
X | 12 | F | 5/20/2015
Y | 15 | F | 5/28/2015
Z | 14 | F | 5/29/2015
Printing Schema -
df.printSchema()
root
|-- Name: string (nullable = true)
|-- Age: double (nullable = true)
|-- Gender: string (nullable = true)
|-- DateOfApplication: double (nullable = true)
Doing .show -
df.show()
Name | Age | Gender | DateOfApplication
________________________________________
X | 12.0 | F | 42144.0
Y | 15.0 | F | 16836.0
Z | 14.0 | F | 42152.0
While reading the dataset, dates (and other numeric values) get converted to double. The particular problem with dates is that the value changes completely, which makes it hard to recover the original dates.
Can anyone help?
Author of the plugin here :)
Inferring column types is done in the plugin itself.
That code was taken from spark-csv. As you can see from the code, only String, Numeric, Boolean and Blank cell types are currently inferred.
The best option would be to create a PR which properly infers date columns by using the corresponding DateUtil API.
The second-best option would be to specify the schema manually similar to how #addmeaning described. Note that I've just released version 0.9.0 which makes some required parameters optional and changes the way the path to the file needs to be specified.
yourSchema = StructType() \
    .add("Name", StringType(), True) \
    .add("Age", DoubleType(), True) \
    .add("Gender", StringType(), True) \
    .add("DateOfApplication", DateType(), True)

df = spark.read.format("com.crealytics.spark.excel") \
    .schema(yourSchema) \
    .option("useHeader", "true") \
    .load("D:\\Users\\ABC\\Desktop\\TmpData\\Input.xlsm")
Spark can't infer the Date type. You can specify the schema manually, read DateOfApplication as a string, and then convert it to a date. Read your df this way:
yourSchema = StructType() \
    .add("Name", StringType(), True) \
    .add("Age", DoubleType(), True) \
    .add("Gender", StringType(), True) \
    .add("DateOfApplication", StringType(), True)

df = spark.read.format("com.crealytics.spark.excel") \
    .schema(yourSchema) \
    .option("location", "D:\\Users\\ABC\\Desktop\\TmpData\\Input.xlsm") \
    .option("spark.read.simpleMode", "true") \
    .option("treatEmptyValuesAsNulls", "true") \
    .option("addColorColumns", "false") \
    .option("useHeader", "true") \
    .load("com.databricks.spark.csv")  # no inferSchema
Specifying the schema might fix this issue.
from pyspark.sql.types import *

schema = StructType([
    StructField("Name", StringType(), False),
    StructField("Age", DoubleType(), False),
    StructField("Gender", StringType(), False),
    StructField("DateOfApplication", DateType(), True)
])
Add the schema to spark.read:
df_excel = spark.read \
    .format("com.crealytics.spark.excel") \
    .schema(schema) \
    .option("useHeader", "true") \
    .option("treatEmptyValuesAsNulls", "false") \
    .option("inferSchema", "false") \
    .option("addColorColumns", "false") \
    .load(file_path)

display(df_excel)

How to parse Nested Json string from DynamoDB table in spark? [duplicate]

I have a Cassandra table that for simplicity looks something like:
key: text
jsonData: text
blobData: blob
I can create a basic data frame for this using spark and the spark-cassandra-connector using:
val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "mytable", "keyspace" -> "ks1"))
  .load()
I'm struggling though to expand the JSON data into its underlying structure. I ultimately want to be able to filter based on the attributes within the json string and return the blob data. Something like jsonData.foo = "bar" and return blobData. Is this currently possible?
Spark >= 2.4
If needed, schema can be determined using schema_of_json function (please note that this assumes that an arbitrary row is a valid representative of the schema).
import org.apache.spark.sql.functions.{lit, schema_of_json, from_json}
import collection.JavaConverters._
val schema = schema_of_json(lit(df.select($"jsonData").as[String].first))
df.withColumn("jsonData", from_json($"jsonData", schema, Map[String, String]().asJava))
Spark >= 2.1
You can use the from_json function:
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
val schema = StructType(Seq(
  StructField("k", StringType, true),
  StructField("v", DoubleType, true)
))
df.withColumn("jsonData", from_json($"jsonData", schema))
Spark >= 1.6
You can use get_json_object which takes a column and a path:
import org.apache.spark.sql.functions.get_json_object
val exprs = Seq("k", "v").map(
  c => get_json_object($"jsonData", s"$$.$c").alias(c))

df.select($"*" +: exprs: _*)
This extracts the fields as individual strings, which can then be cast to the expected types.
The path argument is expressed using dot syntax, with leading $. denoting document root (since the code above uses string interpolation $ has to be escaped, hence $$.).
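For example, a small sketch that casts the extracted v string to a double after the select above (assuming v is meant to be numeric, as in the sample data below):
df.select($"*" +: exprs: _*)
  .withColumn("v", $"v".cast("double"))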
Spark <= 1.5:
Is this currently possible?
As far as I know it is not directly possible. You can try something similar to this:
val df = sc.parallelize(Seq(
  ("1", """{"k": "foo", "v": 1.0}""", "some_other_field_1"),
  ("2", """{"k": "bar", "v": 3.0}""", "some_other_field_2")
)).toDF("key", "jsonData", "blobData")
I assume that the blob field cannot be represented in JSON. Otherwise you can omit the splitting and joining:
import org.apache.spark.sql.Row

val blobs = df.drop("jsonData").withColumnRenamed("key", "bkey")
val jsons = sqlContext.read.json(df.drop("blobData").map {
  case Row(key: String, json: String) =>
    s"""{"key": "$key", "jsonData": $json}"""
})
val parsed = jsons.join(blobs, $"key" === $"bkey").drop("bkey")
parsed.printSchema
// root
// |-- jsonData: struct (nullable = true)
// | |-- k: string (nullable = true)
// | |-- v: double (nullable = true)
// |-- key: long (nullable = true)
// |-- blobData: string (nullable = true)
An alternative (cheaper, although more complex) approach is to use a UDF to parse the JSON and output a struct or map column. For example, something like this:
import net.liftweb.json.parse

case class KV(k: String, v: Int)

val parseJson = udf((s: String) => {
  implicit val formats = net.liftweb.json.DefaultFormats
  parse(s).extract[KV]
})

val parsed = df.withColumn("parsedJSON", parseJson($"jsonData"))
parsed.show
// +---+--------------------+------------------+----------+
// |key| jsonData| blobData|parsedJSON|
// +---+--------------------+------------------+----------+
// | 1|{"k": "foo", "v":...|some_other_field_1| [foo,1]|
// | 2|{"k": "bar", "v":...|some_other_field_2| [bar,3]|
// +---+--------------------+------------------+----------+
parsed.printSchema
// root
// |-- key: string (nullable = true)
// |-- jsonData: string (nullable = true)
// |-- blobData: string (nullable = true)
// |-- parsedJSON: struct (nullable = true)
// | |-- k: string (nullable = true)
// | |-- v: integer (nullable = false)
zero323's answer is thorough but misses one approach that is available in Spark 2.1+ and is simpler and more robust than using schema_of_json():
import org.apache.spark.sql.functions.from_json
val json_schema = spark.read.json(df.select("jsonData").as[String]).schema
df.withColumn("jsonData", from_json($"jsonData", json_schema))
Here's the Python equivalent:
from pyspark.sql.functions import from_json
json_schema = spark.read.json(df.select("jsonData").rdd.map(lambda x: x[0])).schema
df.withColumn("jsonData", from_json("jsonData", json_schema))
The problem with schema_of_json(), as zero323 points out, is that it inspects a single string and derives a schema from that. If you have JSON data with varied schemas, then the schema you get back from schema_of_json() will not reflect what you would get if you were to merge the schemas of all the JSON data in your DataFrame. Parsing that data with from_json() will then yield a lot of null or empty values where the schema returned by schema_of_json() doesn't match the data.
By using Spark's ability to derive a comprehensive JSON schema from an RDD of JSON strings, we can guarantee that all the JSON data can be parsed.
Example: schema_of_json() vs. spark.read.json()
Here's an example (in Python, the code is very similar for Scala) to illustrate the difference between deriving the schema from a single element with schema_of_json() and deriving it from all the data using spark.read.json().
>>> df = spark.createDataFrame(
... [
... (1, '{"a": true}'),
... (2, '{"a": "hello"}'),
... (3, '{"b": 22}'),
... ],
... schema=['id', 'jsonData'],
... )
a has a boolean value in one row and a string value in another. The merged schema for a would set its type to string. b would be an integer.
Let's see how the different approaches compare. First, the schema_of_json() approach:
>>> json_schema = schema_of_json(df.select("jsonData").take(1)[0][0])
>>> parsed_json_df = df.withColumn("jsonData", from_json("jsonData", json_schema))
>>> parsed_json_df.printSchema()
root
|-- id: long (nullable = true)
|-- jsonData: struct (nullable = true)
| |-- a: boolean (nullable = true)
>>> parsed_json_df.show()
+---+--------+
| id|jsonData|
+---+--------+
| 1| [true]|
| 2| null|
| 3| []|
+---+--------+
As you can see, the JSON schema we derived was very limited. "a": "hello" couldn't be parsed as a boolean and returned null, and "b": 22 was just dropped because it wasn't in our schema.
Now with spark.read.json():
>>> json_schema = spark.read.json(df.select("jsonData").rdd.map(lambda x: x[0])).schema
>>> parsed_json_df = df.withColumn("jsonData", from_json("jsonData", json_schema))
>>> parsed_json_df.printSchema()
root
|-- id: long (nullable = true)
|-- jsonData: struct (nullable = true)
| |-- a: string (nullable = true)
| |-- b: long (nullable = true)
>>> parsed_json_df.show()
+---+--------+
| id|jsonData|
+---+--------+
| 1| [true,]|
| 2|[hello,]|
| 3| [, 22]|
+---+--------+
Here we have all our data preserved, and with a comprehensive schema that accounts for all the data. "a": true was cast as a string to match the schema of "a": "hello".
The main downside of using spark.read.json() is that Spark will scan through all your data to derive the schema. Depending on how much data you have, that overhead could be significant. If you know that all your JSON data has a consistent schema, it's fine to go ahead and just use schema_of_json() against a single element. If you have schema variability but don't want to scan through all your data, you can set samplingRatio to something less than 1.0 in your call to spark.read.json() to look at a subset of the data.
Here are the docs for spark.read.json(): Scala API / Python API
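For instance, a minimal Scala sketch of the samplingRatio approach mentioned above (the 0.1 value is just an illustrative choice):
val json_schema = spark.read
  .option("samplingRatio", "0.1")  // infer the schema from roughly 10% of the JSON rows
  .json(df.select("jsonData").as[String])
  .schema

df.withColumn("jsonData", from_json($"jsonData", json_schema))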
The from_json function is exactly what you're looking for. Your code will look something like:
val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "mytable", "keyspace" -> "ks1"))
  .load()

// You can define whatever struct type your JSON contains
val schema = StructType(Seq(
  StructField("key", StringType, true),
  StructField("value", DoubleType, true)
))

df.withColumn("jsonData", from_json(col("jsonData"), schema))
The underlying JSON string is:
"{ \"column_name1\":\"value1\",\"column_name2\":\"value2\",\"column_name3\":\"value3\",\"column_name5\":\"value5\"}";
Below is the script to filter the JSON and load the required data into Cassandra.
sqlContext.read.json(rdd)
  .select("column_name1 or fields name in Json", "column_name2", "column_name2")
  .write.format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "Table_name", "keyspace" -> "Key_Space_name"))
  .mode(SaveMode.Append)
  .save()
I use the following (available since 2.2.0, and I am assuming that your JSON string column is at column index 0):
import org.apache.spark.sql.{DataFrame, Encoders, Row, SparkSession}

def parse(df: DataFrame, spark: SparkSession): DataFrame = {
  val stringDf = df.map((value: Row) => value.getString(0), Encoders.STRING)
  spark.read.json(stringDf)
}
It will automatically infer the schema in your JSON. Documented here:
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/DataFrameReader.html
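A usage sketch of that helper, assuming df is the Cassandra DataFrame from the question with only the jsonData column selected:
val jsonOnly = parse(df.select("jsonData"), spark)  // returns a DataFrame with the parsed JSON fields
jsonOnly.printSchema()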
