How to access nested schema column? - apache-spark

I have a Kafka streaming source with JSONs, e.g. {"type":"abc","1":"23.2"}.
The query gives the following exception:
org.apache.spark.sql.catalyst.parser.ParseException: extraneous
input '.1' expecting {<EOF>, .......}
== SQL ==
person.1
What is the correct syntax to access "person.1"?
I even changed DoubleType to StringType, but that didn't work either. The example works fine if I keep only person.type and remove person.1 from selectExpr:
val personJsonDf = inputDf.selectExpr("CAST(value AS STRING)")
val struct = new StructType()
.add("type", DataTypes.StringType)
.add("1", DataTypes.DoubleType)
val personNestedDf = personJsonDf
.select(from_json($"value", struct).as("person"))
val personFlattenedDf = personNestedDf
.selectExpr("person.type", "person.1")
val consoleOutput = personNestedDf.writeStream
.outputMode("update")
.format("console")
.start()

Interesting, since select($"person.1") should work fine (but you used selectExpr, which could have confused the Spark SQL parser).
StructField(1,DoubleType,true) won't work, however, since the JSON value "23.2" is a quoted string, so the type should actually be StringType.
Let's see...
$ cat input.json
{"type":"abc","1":"23.2"}
val input = spark.read.text("input.json")
scala> input.show(false)
+-------------------------+
|value |
+-------------------------+
|{"type":"abc","1":"23.2"}|
+-------------------------+
import org.apache.spark.sql.types._
val struct = new StructType()
.add("type", DataTypes.StringType)
.add("1", DataTypes.StringType)
val q = input.select(from_json($"value", struct).as("person"))
scala> q.show
+-----------+
| person|
+-----------+
|[abc, 23.2]|
+-----------+
val q = input.select(from_json($"value", struct).as("person")).select($"person.1")
scala> q.show
+----+
| 1|
+----+
|23.2|
+----+
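For completeness, the selectExpr variant from the question should also work once the numeric field name is quoted with backticks, so the parser no longer sees a stray ".1". A minimal sketch against the same input and struct as above (not re-run against the original streaming job):

import org.apache.spark.sql.functions.from_json
import spark.implicits._

// Backticks make "1" an ordinary identifier instead of the token ".1"
// that triggered the ParseException.
val flattened = input
  .select(from_json($"value", struct).as("person"))
  .selectExpr("person.type", "person.`1`")
flattened.show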

I have solved this problem by using person.*
+-----+--------+
|type | 1 |
+-----+--------+
|abc |23.2 |
+-----+--------+
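Applied to the streaming code in the question, that flattening step presumably becomes the following (a sketch, not re-run against Kafka):

// person.* expands the struct into its fields, here the columns type and 1.
val personFlattenedDf = personNestedDf.selectExpr("person.*")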

Related

how to handle this in spark

I am using spark-sql 2.4.x and the datastax-spark-cassandra-connector for Cassandra 3.x, along with Kafka.
I have a scenario where some finance data comes from a Kafka topic. The data (base dataset) contains companyId, year and prev_year fields.
If year === prev_year, I need to join with a different table, i.e. exchange_rates.
If year =!= prev_year, I need to return the base dataset itself.
How do I do this in spark-sql?
You can refer to the approach below for your case.
scala> Input_df.show
+---------+----+---------+----+
|companyId|year|prev_year|rate|
+---------+----+---------+----+
| 1|2016| 2017| 12|
| 1|2017| 2017|21.4|
| 2|2018| 2017|11.7|
| 2|2018| 2018|44.6|
| 3|2016| 2017|34.5|
| 4|2017| 2017| 56|
+---------+----+---------+----+
scala> exch_rates.show
+---------+----+
|companyId|rate|
+---------+----+
| 1|12.3|
| 2|12.5|
| 3|22.3|
| 4|34.6|
| 5|45.2|
+---------+----+
scala> val equaldf = Input_df.filter(col("year") === col("prev_year"))
scala> val notequaldf = Input_df.filter(col("year") =!= col("prev_year"))
scala> val joindf = notequaldf.alias("n").drop("rate").join(exch_rates.alias("e"), List("companyId"), "left")
scala> val finalDF = equaldf.union(joindf)
scala> finalDF.show()
+---------+----+---------+----+
|companyId|year|prev_year|rate|
+---------+----+---------+----+
| 1|2017| 2017|21.4|
| 2|2018| 2018|44.6|
| 4|2017| 2017| 56|
| 1|2016| 2017|12.3|
| 2|2018| 2017|12.5|
| 3|2016| 2017|22.3|
+---------+----+---------+----+
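For comparison, the same output can be produced in a single pass with a left join plus when/otherwise. This is my own variant reusing the Input_df and exch_rates names from above, not part of the original answer:

import org.apache.spark.sql.functions.{col, when}

// Join every row to exchange_rates, keep the base rate only when
// year === prev_year, otherwise take the joined exchange rate.
val joined = Input_df.join(
  exch_rates.withColumnRenamed("rate", "exch_rate"), Seq("companyId"), "left")
val finalDF2 = joined
  .withColumn("rate", when(col("year") === col("prev_year"), col("rate"))
    .otherwise(col("exch_rate")))
  .drop("exch_rate")
finalDF2.show()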

How to deal with white space in column names to use spark coalesce function in expr method

I am working on Spark's coalesce functionality in my project. The code works fine on columns without spaces but fails on columns that contain spaces.
e1.csv
id,code,type,no root
1,,A,1
2,,,0
3,123,I,1
e2.csv
id,code,type,no root
1,456,A,1
2,789,A1,0
3,,C,0
logic code
Dataset<Row> df1 = spark.read().format("csv").option("header", "true").load("/home/user/Videos/<folder>/e1.csv");
Dataset<Row> df2 = spark.read().format("csv").option("header", "true").load("/home/user/Videos/<folder>/e2.csv");
Dataset<Row> newDS = df1.as("a").join(df2.as("b")).where("a.id== b.id").selectExpr("coalesce(`a.no root`,`b.no root`) AS `a.no root`");
newDS.show();
What I have tried
Dataset<Row> newDS = df1.as("a").join(df2.as("b")).where("a.id== b.id").selectExpr("""coalesce(`a.no root`,`b.no root`) AS `a.no root`""");
The expected result would be:
no root
1
0
1
Using the following expression, where the backticks wrap only the column name and the alias qualifier stays outside them,
val newDS = df1.as("a").join(df2.as("b")).where("a.id==b.id").selectExpr("coalesce(a.`no root`,b.`no root`) AS `a.no root`")
will generate the expected output
+---------+
|a.no root|
+---------+
| 1|
| 0|
| 1|
+---------+
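The same fix expressed through the DataFrame API, for reference; a sketch reusing the df1/df2 datasets from the question (again, the backticks wrap only the column name, and the alias qualifier stays outside them):

import org.apache.spark.sql.functions.{coalesce, col}

// col("a.`no root`") resolves the column "no root" under the alias "a".
val result = df1.as("a")
  .join(df2.as("b"), col("a.id") === col("b.id"))
  .select(coalesce(col("a.`no root`"), col("b.`no root`")).as("no root"))
result.show()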

Spark SQL query giving data type mismatch error

I have a small SQL query which works perfectly fine in plain SQL, and the same query works in Hive as expected, but not in Spark.
Table has user information and below is the query
spark.sql("select * from users where (id,id_proof) not in ((1232,345))").show;
I am getting the below exception in Spark:
org.apache.spark.sql.AnalysisException: cannot resolve '(named_struct('age', deleted_inventory.`age`, 'id_proof', deleted_inventory.`id_proof`) IN (named_struct('col1',1232, 'col2', 345)))' due to data type mismatch: Arguments must be same type but were: StructType(StructField(id,IntegerType,true), StructField(id_proof,IntegerType,true)) != StructType(StructField(col1,IntegerType,false), StructField(col2,IntegerType,false));
Both id and id_proof are of integer type.
Try using a WITH clause (a CTE); it works.
scala> val df = Seq((101,121), (1232,345),(222,2242)).toDF("id","id_proof")
df: org.apache.spark.sql.DataFrame = [id: int, id_proof: int]
scala> df.show(false)
+----+--------+
|id |id_proof|
+----+--------+
|101 |121 |
|1232|345 |
|222 |2242 |
+----+--------+
scala> df.createOrReplaceTempView("girish")
scala> spark.sql("with t1( select 1232 id,345 id_proof ) select id, id_proof from girish where (id,id_proof) not in (select id,id_proof from t1) ").show(false)
+---+--------+
|id |id_proof|
+---+--------+
|101|121 |
|222|2242 |
+---+--------+
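If you only need to exclude a single known (id, id_proof) pair, a plain filter avoids the struct comparison altogether. A sketch of that alternative (not the CTE approach above), using the same df:

import org.apache.spark.sql.functions.col

// Keep every row except the one where both columns match the excluded pair.
val filtered = df.filter(!(col("id") === 1232 && col("id_proof") === 345))
filtered.show(false)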

Spark DataFrame losing string data in yarn-client mode

For some reason, if I add a new column, append a string to existing data/columns, or create a new DataFrame from code, it misinterprets string data, so show() doesn't work properly, and filters (such as withColumn, where, when, etc.) don't work either.
Here is example code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object MissingValue {
  def hex(str: String): String =
    str.getBytes("UTF-8").map(f => Integer.toHexString(f & 0xFF).toUpperCase).mkString("-")

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("MissingValue")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val list = List((101,"ABC"),(102,"BCD"),(103,"CDE"))
    val rdd = sc.parallelize(list).map(f => Row(f._1, f._2))
    val schema = StructType(StructField("COL1",IntegerType,true) :: StructField("COL2",StringType,true) :: Nil)
    val df = sqlContext.createDataFrame(rdd, schema)
    df.show()

    val str = df.first().getString(1)
    println(s"${str} == ${hex(str)}")
    sc.stop()
  }
}
If I run it in local mode then everything works as expected:
+----+----+
|COL1|COL2|
+----+----+
| 101| ABC|
| 102| BCD|
| 103| CDE|
+----+----+
ABC == 41-42-43
But when I run the same code in yarn-client mode it produces:
+----+----+
|COL1|COL2|
+----+----+
| 101| ^E^#^#|
| 102| ^E^#^#|
| 103| ^E^#^#|
+----+----+
^E^#^# == 5-0-0
This problem exists only for string values; the first column (Integer) is fine.
Also, if I create an RDD from the DataFrame, everything is fine, i.e. df.rdd.take(1).apply(0).getString(1) returns the correct value.
I'm using Spark 1.5.0 from CDH 5.5.2
EDIT:
It seems that this happens when the difference between driver memory and executor memory is too high (--driver-memory xxG --executor-memory yyG), i.e. when I decrease the executor memory or increase the driver memory, the problem disappears.
This is a bug related to executor memory and the JVM's oops (ordinary object pointer) size:
https://issues.apache.org/jira/browse/SPARK-9725
https://issues.apache.org/jira/browse/SPARK-10914
https://issues.apache.org/jira/browse/SPARK-17706
It is fixed in Spark version 1.5.2
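If upgrading is not immediately possible, the linked JIRAs point at the driver and executors disagreeing on compressed oops (one heap above roughly 32 GB, the other below). One workaround often suggested in that context, which I have not verified for this exact setup, is to pin the same oops mode on both sides:

spark-submit \
  --conf "spark.driver.extraJavaOptions=-XX:-UseCompressedOops" \
  --conf "spark.executor.extraJavaOptions=-XX:-UseCompressedOops" \
  ...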

Spark csv to dataframe skip first row

I am loading a CSV into a DataFrame using:
sqlContext.read.format("com.databricks.spark.csv").option("header", "true").
option("delimiter", ",").load("file.csv")
but my input file contains a date in the first row and the header in the second row.
Example:
20160612
id,name,age
1,abc,12
2,bcd,33
How can I skip this first row while converting the CSV to a DataFrame?
Here are several options that I can think of, since the databricks csv module doesn't seem to provide a skip-lines option:
Option one: Add a "#" character in front of the first line, and the line will automatically be treated as a comment and ignored by the databricks csv module;
Option two: Create your customized schema and specify the mode option as DROPMALFORMED, which will drop the first line since it contains fewer tokens than expected by the customSchema:
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType};
val customSchema = StructType(Array(StructField("id", IntegerType, true),
StructField("name", StringType, true),
StructField("age", IntegerType, true)))
val df = sqlContext.read.format("com.databricks.spark.csv").
option("header", "true").
option("mode", "DROPMALFORMED").
schema(customSchema).load("test.txt")
df.show
16/06/12 21:24:05 WARN CsvRelation$: Number format exception. Dropping
malformed line: id,name,age
+---+----+---+
| id|name|age|
+---+----+---+
| 1| abc| 12|
| 2| bcd| 33|
+---+----+---+
Note the warning message above about the dropped malformed line.
Option three: Write your own parser to drop any line that doesn't split into three fields:
val file = sc.textFile("pathToYourCsvFile")
val df = file.map(line => line.split(",")).
  filter(lines => lines.length == 3 && lines(0) != "id").
  map(row => (row(0), row(1), row(2))).
  toDF("id", "name", "age")
df.show
+---+----+---+
| id|name|age|
+---+----+---+
| 1| abc| 12|
| 2| bcd| 33|
+---+----+---+
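A further variant of option three (my own sketch, not from the original answer): drop the first two physical lines by index, so the filter does not depend on what the lines contain:

import sqlContext.implicits._   // already in scope in spark-shell

// zipWithIndex attaches a line index; dropping indexes 0 and 1 removes the
// date line and the header line, and the rest is parsed as before.
val df = sc.textFile("pathToYourCsvFile")
  .zipWithIndex()
  .filter { case (_, idx) => idx > 1 }
  .map { case (line, _) => line.split(",") }
  .map(cols => (cols(0), cols(1), cols(2)))
  .toDF("id", "name", "age")
df.show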
