Spark SQL not equals not working on column created using lag - apache-spark

I am trying to find missing sequences in a CSV file, and here is the code I have:
val customSchema = StructType(
  Array(
    StructField("MessageId", StringType, false),
    StructField("msgSeqID", LongType, false),
    StructField("site", StringType, false),
    StructField("msgType", StringType, false)
  )
)
val logFileDF = sparkSession.sqlContext.read.format("csv")
  .option("delimiter", ",")
  .option("header", false)
  .option("mode", "DROPMALFORMED")
  .schema(customSchema)
  .load(logFilePath)
  .toDF()
logFileDF.printSchema()
logFileDF.repartition(5000)
logFileDF.createOrReplaceTempView("LogMessageData")
sparkSession.sqlContext.cacheTable("LogMessageData")
val selectQuery: String = "SELECT MessageId,site,msgSeqID,msgType,lag(msgSeqID,1) over (partition by site,msgType order by site,msgType,msgSeqID) as prev_val FROM LogMessageData order by site,msgType,msgSeqID"
val logFileLagRowNumDF = sparkSession.sqlContext.sql(selectQuery).toDF()
logFileLagRowNumDF.repartition(1000)
logFileLagRowNumDF.printSchema()
logFileLagRowNumDF.createOrReplaceTempView("LogMessageDataUpdated")
sparkSession.sqlContext.cacheTable("LogMessageDataUpdated")
val errorRecordQuery: String = "select * from LogMessageDataUpdated where prev_val!=null and msgSeqID != prev_val+1 order by site,msgType,msgSeqID";
val errorRecordQueryDF = sparkSession.sqlContext.sql(errorRecordQuery).toDF()
logger.info("Total. No.of Missing records =[" + errorRecordQueryDF.count() + "]")
val noOfMissingRecords = errorRecordQueryDF.count()
Here is the sample data I have:
Msg_1,S1,A,10000000
Msg_2,S1,A,10000002
Msg_3,S2,A,10000003
Msg_4,S3,B,10000000
Msg_5,S3,B,10000001
Msg_6,S3,A,10000003
Msg_7,S3,A,10000001
Msg_8,S3,A,10000002
Msg_9,S4,A,10000000
Msg_10,S4,A,10000001
Msg_11,S4,A,10000000
Msg_12,S4,A,10000005
And here is the output I am getting:
root
|-- MessageId: string (nullable = true)
|-- site: string (nullable = true)
|-- msgType: string (nullable = true)
|-- msgSeqID: long (nullable = true)
INFO EimComparisonProcessV2 - Total No.Of Records =[12]
+---------+----+-------+--------+
|MessageId|site|msgType|msgSeqID|
+---------+----+-------+--------+
| Msg_1| S1| A|10000000|
| Msg_2| S1| A|10000002|
| Msg_3| S2| A|10000003|
| Msg_4| S3| B|10000000|
| Msg_5| S3| B|10000001|
| Msg_6| S3| A|10000003|
| Msg_7| S3| A|10000001|
| Msg_8| S3| A|10000002|
| Msg_9| S4| A|10000000|
| Msg_10| S4| A|10000001|
| Msg_11| S4| A|10000000|
| Msg_12| S4| A|10000005|
+---------+----+-------+--------+
root
|-- MessageId: string (nullable = true)
|-- site: string (nullable = true)
|-- msgSeqID: long (nullable = true)
|-- msgType: string (nullable = true)
|-- prev_val: long (nullable = true)
+---------+----+--------+-------+--------+
|MessageId|site|msgSeqID|msgType|prev_val|
+---------+----+--------+-------+--------+
| Msg_1| S1|10000000| A| null|
| Msg_2| S1|10000002| A|10000000|
| Msg_3| S2|10000003| A| null|
| Msg_7| S3|10000001| A| null|
| Msg_8| S3|10000002| A|10000001|
+---------+----+--------+-------+--------+
only showing top 5 rows
INFO TestProcess - Total No.Of Records Updated DF=[12]
INFO TestProcess - Total. No.of Missing records =[0]

In SQL, any comparison with NULL (including !=) evaluates to NULL rather than true, so prev_val != null never matches a single row; use IS NOT NULL instead. Just replace your query with this line of code and it will show you the results:
val errorRecordQuery: String = "select * from LogMessageDataUpdated where prev_val is not null and msgSeqID <> prev_val+1 order by site,msgType,msgSeqID";
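For reference, here is a minimal sketch of why the original filter matched nothing (hypothetical two-row data, Scala; column names borrowed from the question). Comparing against the NULL literal yields NULL, never true, so prev_val != null filters out every row.
import spark.implicits._

// Two hypothetical rows: prev_val is NULL for the first row, as lag() produces
// for the first row of each partition.
val demo = Seq[(Long, Option[Long])](
  (10000000L, None),
  (10000002L, Some(10000000L))
).toDF("msgSeqID", "prev_val")
demo.createOrReplaceTempView("demo")

// Any comparison with NULL evaluates to NULL, not true, so this returns 0 rows.
spark.sql("select * from demo where prev_val != null").count()

// IS NOT NULL is the correct NULL test; this returns 1 row (10000002 != 10000000 + 1).
spark.sql("select * from demo where prev_val is not null and msgSeqID != prev_val + 1").count()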

Related

Can I create a dataframe from another dataframe's rows

Can I create a new DataFrame whose columns are the rows of the DataFrame below, using PySpark?
+------------+
| col|
+------------+
|created_meta|
| updated_at|
|updated_meta|
| meta|
| Year|
| First Name|
| County|
| Sex|
| Count|
+------------+
Two ways.
Using pivot:
from pyspark.sql import functions as F

df1 = df.groupBy().pivot('col').agg(F.lit(None)).limit(0)
df1.show()
+-----+------+---------+---+----+------------+----+----------+------------+
|Count|County|FirstName|Sex|Year|created_meta|meta|updated_at|updated_meta|
+-----+------+---------+---+----+------------+----+----------+------------+
+-----+------+---------+---+----+------------+----+----------+------------+
Creating it from scratch:
df2 = df.select([F.lit(r[0]) for r in df.collect()]).limit(0)
df2.show()
+------------+----------+------------+----+----+---------+------+---+-----+
|created_meta|updated_at|updated_meta|meta|Year|FirstName|County|Sex|Count|
+------------+----------+------------+----+----+---------+------+---+-----+
+------------+----------+------------+----+----+---------+------+---+-----+
// sorry in Scala + Spark
import spark.implicits._
import org.apache.spark.sql.functions._
val lst = List("created_meta",
  "updated_at",
  "updated_meta",
  "meta",
  "Year",
  "First Name",
  "County",
  "Sex",
  "Count")
val source = lst.toDF("col")
source.show(false)
// +------------+
// |col |
// +------------+
// |created_meta|
// |updated_at |
// |updated_meta|
// |meta |
// |Year |
// |First Name |
// |County |
// |Sex |
// |Count |
// +------------+
val l = source.select('col).as[String].collect.toList
val df1 = l.foldLeft(source)((acc, col) => {
  acc.withColumn(col, lit(""))
})
val df2 = df1.drop("col")
df2.printSchema()
// root
// |-- created_meta: string (nullable = false)
// |-- updated_at: string (nullable = false)
// |-- updated_meta: string (nullable = false)
// |-- meta: string (nullable = false)
// |-- Year: string (nullable = false)
// |-- First Name: string (nullable = false)
// |-- County: string (nullable = false)
// |-- Sex: string (nullable = false)
// |-- Count: string (nullable = false)
df2.show(1, false)
// +------------+----------+------------+----+----+----------+------+---+-----+
// |created_meta|updated_at|updated_meta|meta|Year|First Name|County|Sex|Count|
// +------------+----------+------------+----+----+----------+------+---+-----+
// | | | | | | | | | |
// +------------+----------+------------+----+----+----------+------+---+-----+

Using a where clause within a nested structure

I'm creating a DataFrame of structs.
I want to create another two structs depending on the value of my field x2.field3. The idea is: if x2.field3 == 4, then "struct_1" should be created; if x2.field3 == 3, then "struct_2" should be created.
when(col("x2.field3").cast(IntegerType())== lit(4), struct(col("x1.field1").alias("index2")).alias("struct_1"))\
.when(col("x2.field3").cast(IntegerType())==lit(3), struct(col("x1.field1").alias("Index1")).alias("struct_2"))
I tried different solutions and didn't succeed, because I always get the same error:
Py4JJavaError: An error occurred while calling o21058.withColumn. :
org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN
(CAST(x2.field3 AS INT) = 4) THEN named_struct('index2',
x1.field1) WHEN (CAST(x2.field3 AS INT) = 3) THEN
named_struct('Index1', x1.field1) END' due to data type mismatch:
THEN and ELSE expressions should all be same type or coercible to a
common type;; 'Project [x1#5751, x2#5752, named_struct(gen1,
x1#5751.field1, gen2, x1#5751.field1, NamePlaceholder,
named_struct(gen3.1, x1#5751.field1, gen3.2, x1#5751.field1, gen3.3,
x1#5751.field1, gen3.4, x1#5751.field1, gen3.5, x1#5751.field1,
gen3.6, x1#5751.field1, NamePlaceholder, named_struct(gen3.7.1,
named_struct(gen3.7.1.1, 11, gen3.7.1.2, 40), col2, CASE WHEN
(cast(x2#5752.field3 as int) = 4) THEN named_struct(index2,
x1#5751.field1) WHEN (cast(x2#5752.field3 as int) = 3) THEN
named_struct(Index1, x1#5751.field1) END))) AS General#5772]
+- LogicalRDD [x1#5751, x2#5752], false
My entire code is below
schema = StructType(
    [
        StructField('x1',
            StructType([
                StructField('field1', IntegerType(), True),
                StructField('field2', IntegerType(), True),
                StructField('x12',
                    StructType([
                        StructField('field5', IntegerType(), True)
                    ])
                ),
            ])
        ),
        StructField('x2',
            StructType([
                StructField('field3', IntegerType(), True),
                StructField('field4', BooleanType(), True)
            ])
        )
    ])
df1 = sqlCtx.createDataFrame([Row(Row(1, 3, Row(23)), Row(3,True))], schema)
df1.printSchema()
df = df1.withColumn("General",
struct(
col("x1.field1").alias("gen1"),
col("x1.field1").alias("gen2"),
struct(col("x1.field1").alias("gen3.1"),
col("x1.field1").alias("gen3.2"),
col("x1.field1").alias("gen3.3"),
col("x1.field1").alias("gen3.4"),
col("x1.field1").alias("gen3.5"),
col("x1.field1").alias("gen3.6"),
struct(struct(lit(11).alias("gen3.7.1.1"),
lit(40).alias("gen3.7.1.2")).alias("gen3.7.1"),
when(col("x2.field3").cast(IntegerType())== lit(4), struct(col("x1.field1").alias("index2")).alias("struct_1"))\
.when(col("x2.field3").cast(IntegerType())==lit(3), struct(col("x1.field1").alias("Index1")).alias("struct_2"))
).alias("gen3.7")).alias("gen3")
)).drop('x1','x2')
df.printSchema()
Since struct_1 and struct_2 are exclusive, I recommend the following piece of code: give each when its own alias so each branch becomes its own column, which is simply NULL when its condition is not met, instead of two differently-typed structs competing inside one CASE expression:
from pyspark.sql.types import *
from pyspark.sql import Row
from pyspark.sql.functions import *
schema = StructType(
    [
        StructField('x1',
            StructType([
                StructField('field1', IntegerType(), True),
                StructField('field2', IntegerType(), True),
                StructField('x12',
                    StructType([
                        StructField('field5', IntegerType(), True)
                    ])
                ),
            ])
        ),
        StructField('x2',
            StructType([
                StructField('field3', IntegerType(), True),
                StructField('field4', BooleanType(), True)
            ])
        )
    ])
df1 = sqlCtx.createDataFrame([Row(Row(1, 3, Row(23)), Row(1,True))], schema)
df = df1.withColumn("General",
struct(
col("x1.field1").alias("gen1"),
col("x1.field1").alias("gen2"),
struct(col("x1.field1").alias("gen3.1"),
col("x1.field1").alias("gen3.2"),
col("x1.field1").alias("gen3.3"),
col("x1.field1").alias("gen3.4"),
col("x1.field1").alias("gen3.5"),
col("x1.field1").alias("gen3.6"),
struct(struct(lit(11).alias("gen3.7.1.1"),
lit(40).alias("gen3.7.1.2")).alias("gen3.7.1"),
when(col("x2.field3").cast(IntegerType())== lit(4), struct(col("x1.field1").alias("index2"))).alias("struct_1"),
when(col("x2.field3").cast(IntegerType())==lit(3), struct(col("x1.field1").alias("Index1"))).alias("struct_2")
).alias("gen3.7")).alias("gen3")
)).drop('x1','x2')
df.printSchema()
Output :
root
|-- General: struct (nullable = false)
| |-- gen1: integer (nullable = true)
| |-- gen2: integer (nullable = true)
| |-- gen3: struct (nullable = false)
| | |-- gen3.1: integer (nullable = true)
| | |-- gen3.2: integer (nullable = true)
| | |-- gen3.3: integer (nullable = true)
| | |-- gen3.4: integer (nullable = true)
| | |-- gen3.5: integer (nullable = true)
| | |-- gen3.6: integer (nullable = true)
| | |-- gen3.7: struct (nullable = false)
| | | |-- gen3.7.1: struct (nullable = false)
| | | | |-- gen3.7.1.1: integer (nullable = false)
| | | | |-- gen3.7.1.2: integer (nullable = false)
| | | |-- struct_1: struct (nullable = true)
| | | | |-- index2: integer (nullable = true)
| | | |-- struct_2: struct (nullable = true)
| | | | |-- Index1: integer (nullable = true)
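The key point is that a when without an otherwise evaluates to NULL for rows where its condition does not hold, so each branch can live in its own nullable column rather than sharing one CASE expression that would need a common type. A minimal standalone sketch of that pattern (Scala, hypothetical flat columns standing in for x1.field1 and x2.field3):
import spark.implicits._
import org.apache.spark.sql.functions._

val demo = Seq((1, 4), (2, 3), (3, 7)).toDF("field1", "field3")
demo.select(
  when($"field3" === 4, struct($"field1".as("index2"))).as("struct_1"),
  when($"field3" === 3, struct($"field1".as("Index1"))).as("struct_2")
).show(false)
// struct_1 is non-null only where field3 = 4 and struct_2 only where field3 = 3;
// everywhere else the column is NULL, so the two structs never need a common type.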

How to convert RDD[Array[Any]] to DataFrame?

I have RDD[Array[Any]] as follows,
1556273771,Mumbai,1189193,1189198,0.56,-1,India,Australia,1571215104,1571215166
8374749403,London,1189193,1189198,0,1,India,England,4567362933,9374749392
7439430283,Dubai,1189193,1189198,0.76,-1,Pakistan,Sri Lanka,1576615684,4749383749
I need to convert this to a data frame of 10 columns, but I am new to spark. Please let me know how to do this in the simplest way.
I am trying something similar to this code:
rdd_data.map{case Array(a,b,c,d,e,f,g,h,i,j) => (a,b,c,d,e,f,g,h,i,j)}.toDF()
When you create a dataframe, Spark needs to know the data type of each column. "Any" type is just a way of saying that you don't know the variable type. A possible solution is to cast each value to a specific type. This will of course fail if the specified cast is invalid.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val rdd1 = spark.sparkContext.parallelize(
Array(
Array(1556273771L,"Mumbai",1189193,1189198 ,0.56,-1,"India", "Australia",1571215104L,1571215166L),
Array(8374749403L,"London",1189193,1189198 ,0 , 1,"India", "England", 4567362933L,9374749392L),
Array(7439430283L,"Dubai" ,1189193,1189198 ,0.76,-1,"Pakistan","Sri Lanka",1576615684L,4749383749L)
),1)
//rdd1: org.apache.spark.rdd.RDD[Array[Any]]
val rdd2 = rdd1.map(r => Row(
r(0).toString.toLong,
r(1).toString,
r(2).toString.toInt,
r(3).toString.toInt,
r(4).toString.toDouble,
r(5).toString.toInt,
r(6).toString,
r(7).toString,
r(8).toString.toLong,
r(9).toString.toLong
))
val schema = StructType(
List(
StructField("col0", LongType, false),
StructField("col1", StringType, false),
StructField("col2", IntegerType, false),
StructField("col3", IntegerType, false),
StructField("col4", DoubleType, false),
StructField("col5", IntegerType, false),
StructField("col6", StringType, false),
StructField("col7", StringType, false),
StructField("col8", LongType, false),
StructField("col9", LongType, false)
)
)
val df = spark.createDataFrame(rdd2, schema)
df.show
+----------+------+-------+-------+----+----+--------+---------+----------+----------+
| col0| col1| col2| col3|col4|col5| col6| col7| col8| col9|
+----------+------+-------+-------+----+----+--------+---------+----------+----------+
|1556273771|Mumbai|1189193|1189198|0.56| -1| India|Australia|1571215104|1571215166|
|8374749403|London|1189193|1189198| 0.0| 1| India| England|4567362933|9374749392|
|7439430283| Dubai|1189193|1189198|0.76| -1|Pakistan|Sri Lanka|1576615684|4749383749|
+----------+------+-------+-------+----+----+--------+---------+----------+----------+
df.printSchema
root
|-- col0: long (nullable = false)
|-- col1: string (nullable = false)
|-- col2: integer (nullable = false)
|-- col3: integer (nullable = false)
|-- col4: double (nullable = false)
|-- col5: integer (nullable = false)
|-- col6: string (nullable = false)
|-- col7: string (nullable = false)
|-- col8: long (nullable = false)
|-- col9: long (nullable = false)
Hope it helps
As the other posts mention, a DataFrame requires explicit types for each column, so you can't use Any. The easiest way I can think of would be to turn each row into a tuple of the right types, then use implicit DF creation to convert to a DataFrame. You were pretty close in your code; you just need to cast the elements to an acceptable type.
Basically toDF knows how to convert tuples (with accepted types) into a DF Row, and you can pass the column names into the toDF call.
For example:
val data = Array(1556273771, "Mumbai", 1189193, 1189198, 0.56, -1, "India,Australia", 1571215104, 1571215166)
val rdd = sc.parallelize(Seq(data))
val df = rdd.map {
case Array(a,b,c,d,e,f,g,h,i) => (
a.asInstanceOf[Int],
b.asInstanceOf[String],
c.asInstanceOf[Int],
d.asInstanceOf[Int],
e.toString.toDouble,
f.asInstanceOf[Int],
g.asInstanceOf[String],
h.asInstanceOf[Int],
i.asInstanceOf[Int]
)
}.toDF("int1", "city", "int2", "int3", "float1", "int4", "country", "int5", "int6")
df.printSchema
df.show(100, false)
scala> df.printSchema
root
|-- int1: integer (nullable = false)
|-- city: string (nullable = true)
|-- int2: integer (nullable = false)
|-- int3: integer (nullable = false)
|-- float1: double (nullable = false)
|-- int4: integer (nullable = false)
|-- country: string (nullable = true)
|-- int5: integer (nullable = false)
|-- int6: integer (nullable = false)
scala> df.show(100, false)
+----------+------+-------+-------+------+----+---------------+----------+----------+
|int1 |city |int2 |int3 |float1|int4|country |int5 |int6 |
+----------+------+-------+-------+------+----+---------------+----------+----------+
|1556273771|Mumbai|1189193|1189198|0.56 |-1 |India,Australia|1571215104|1571215166|
+----------+------+-------+-------+------+----+---------------+----------+----------+
Edit for 0 -> Double:
As André pointed out, if you start off with 0 as an Any it will be a java Integer, not a scala Int, and therefore not castable to a scala Double. Converting it to a string first lets you then convert it into a double as desired.
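A quick standalone Scala illustration of that behaviour (my own sketch, not Spark-specific):
val x: Any = 0                 // boxed as java.lang.Integer once it is typed as Any
// x.asInstanceOf[Double]      // would throw ClassCastException:
                               //   java.lang.Integer cannot be cast to java.lang.Double
val d = x.toString.toDouble    // 0.0 -- going through String works as described above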
You can try the approach below; it's a bit tricky, but it avoids dealing with a schema explicitly.
Map each Any to a String, create a DataFrame of string arrays with toDF(), then create new columns by selecting each element from the array column.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Column
import spark.implicits._

val rdd: RDD[Array[Any]] = spark.range(5).rdd.map(s => Array(s, s + 1, s % 2))
val size = rdd.first().length

// For the array column, yield one (name, element) pair per position.
def splitCol(col: Column): Seq[(String, Column)] = {
  for (i <- 0 to size - 1) yield ("_" + i, col(i))
}

rdd.map(s => s.map(_.toString))
  .toDF("x")
  .select(splitCol('x).map(_._2): _*)
  .toDF(splitCol('x).map(_._1): _*)
  .show()
+---+---+---+
| _0| _1| _2|
+---+---+---+
| 0| 1| 0|
| 1| 2| 1|
| 2| 3| 0|
| 3| 4| 1|
| 4| 5| 0|
+---+---+---+

Parse JSON in Spark containing reserve character

I have a JSON input.txt file with data as follows:
2018-05-30.txt:{"Message":{"eUuid":"6e7d4890-9279-491a-ae4d-70416ef9d42d","schemaVersion":"1.0-AB1","timestamp":1527539376,"id":"XYZ","location":{"dim":{"x":2,"y":-7},"towards":121.0},"source":"a","UniqueId":"test123","code":"del","signature":"xyz","":{},"vel":{"ground":15},"height":{},"next":{"dim":{}},"sub":"del1"}}
2018-05-30.txt:{"Message":{"eUuid":"5e7d4890-9279-491a-ae4d-70416ef9d42d","schemaVersion":"1.0-AB1","timestamp":1627539376,"id":"ABC","location":{"dim":{"x":1,"y":-8},"towards":132.0},"source":"b","UniqueId":"hello123","code":"fra","signature":"abc","":{},"vel":{"ground":16},"height":{},"next":{"dim":{}},"sub":"fra1"}}
.
.
I tried to load the JSON into a DataFrame as follows:
val df = spark.read.json("<full path of input.txt file>")
I am receiving a DataFrame that contains only a _corrupt_record column.
I am aware that the JSON line contains "." (in 2018-05-30.txt) as a reserved character, which I believe is causing the issue. How may I resolve this?
Strip the file-name prefix first, then parse the remainder as JSON:
val rdd = sc.textFile("/Users/kishore/abc.json")
val jsonRdd = rdd.map(x => x.split("txt:")(1))
Reading jsonRdd with sqlContext.read.json gives a single Message struct column:
scala> df.show
+--------------------+
| Message|
+--------------------+
|[test123,del,6e7d...|
|[hello123,fra,5e7...|
+--------------------+
import org.apache.spark.sql.functions._
import sqlContext.implicits._
// val df = sqlContext.read.json(jsonRdd)
// df.show(false)
val df = sqlContext.read.json(jsonRdd).withColumn("eUuid", $"Message"("eUuid"))
.withColumn("schemaVersion", $"Message"("schemaVersion"))
.withColumn("timestamp", $"Message"("timestamp"))
.withColumn("id", $"Message"("id"))
.withColumn("source", $"Message"("source"))
.withColumn("UniqueId", $"Message"("UniqueId"))
.withColumn("location", $"Message"("location"))
.withColumn("dim", $"location"("dim"))
.withColumn("x", $"dim"("x"))
.withColumn("y", $"dim"("y"))
.drop("dim")
.withColumn("vel", $"Message"("vel"))
.withColumn("ground", $"vel"("ground"))
.withColumn("sub", $"Message"("sub"))
.drop("Message")
df.show()
+--------------------+-------------+----------+---+------+--------+------------+---+---+----+------+----+
| eUuid|schemaVersion| timestamp| id|source|UniqueId| location| x| y| vel|ground| sub|
+--------------------+-------------+----------+---+------+--------+------------+---+---+----+------+----+
|6e7d4890-9279-491...| 1.0-AB1|1527539376|XYZ| a| test123|[[2,-7],121]| 2| -7|[15]| 15|del1|
+--------------------+-------------+----------+---+------+--------+------------+---+---+----+------+----+
The problem is not a reserved character; it is that the file does not contain valid JSON, because each line is prefixed with the file name. So you can strip the prefix and parse the rest:
import spark.implicits._

val df = spark.read.textFile(...)
// Drop the leading "2018-05-30.txt:" file-name prefix (15 characters) before parsing.
val json = spark.read.json(df.map(v => v.drop(15)))
json.printSchema()
root
|-- Message: struct (nullable = true)
| |-- UniqueId: string (nullable = true)
| |-- code: string (nullable = true)
| |-- eUuid: string (nullable = true)
| |-- id: string (nullable = true)
| |-- location: struct (nullable = true)
| | |-- dim: struct (nullable = true)
| | | |-- x: long (nullable = true)
| | | |-- y: long (nullable = true)
| | |-- towards: double (nullable = true)
| |-- schemaVersion: string (nullable = true)
| |-- signature: string (nullable = true)
| |-- source: string (nullable = true)
| |-- sub: string (nullable = true)
| |-- timestamp: long (nullable = true)
| |-- vel: struct (nullable = true)
| | |-- ground: long (nullable = true)
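If the length of the file-name prefix ever varies, a slightly more defensive variant (my own sketch, not part of the answer above; the path placeholder is taken from the question) is to drop everything before the first '{':
import spark.implicits._

val lines = spark.read.textFile("<full path of input.txt file>")
// Keep only lines that actually contain JSON, then strip everything before the first '{'.
val json = spark.read.json(lines.filter(_.contains("{")).map(v => v.substring(v.indexOf('{'))))
json.printSchema()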

org.apache.spark.sql.AnalysisException: Can't extract value from sum(_c9#30);

I am using Spark SQL to select a column along with the sum of another column.
Below is my query:
scala> spark.sql("select distinct _c3,sum(_c9).as(sumAadhar) from aadhar group by _c3 order by _c9 desc LIMIT 3").show
And my schema is :
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: string (nullable = true)
|-- _c3: string (nullable = true)
|-- _c4: string (nullable = true)
|-- _c5: string (nullable = true)
|-- _c6: string (nullable = true)
|-- _c7: string (nullable = true)
|-- _c8: string (nullable = true)
|-- _c9: double (nullable = true)
|-- _c10: string (nullable = true)
|-- _c11: string (nullable = true)
|-- _c12: string (nullable = true)
And I am getting the error below:
org.apache.spark.sql.AnalysisException: Can't extract value from sum(_c9#30);
at org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:613)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:605)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:308)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:308)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:269)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:279)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:283)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
Any idea what I am doing wrong, or is there any other way to sum the values of a column?
Check the following, which I tried on a reduced schema:
scala> val df = Seq(("a", 2), ("a", 3), ("b", 4), ("a", 9), ("b", 1), ("c", 100)).toDF("_c3", "_c9") df: org.apache.spark.sql.DataFrame = [_c3: string, _c9: int]
scala> df.createOrReplaceTempView("aadhar")
scala> spark.sql("select _c3,sum(_c9) as sumAadhar from aadhar group by _c3 order by sumAadhar desc LIMIT 3").show
+---+---------+
|_c3|sumAadhar|
+---+---------+
| c| 100|
| a| 14|
| b| 5|
+---+---------+
Removed distinct, since it's not necessary: your original query already groups by _c3.
Changed sum(_c9).as(sumAadhar) to sum(_c9) as sumAadhar. In Spark SQL the dot after sum(_c9) is parsed as field extraction, so the analyzer tries to extract a value from the result of sum, which is exactly what the AnalysisException complains about.
Also ordered by the aggregated alias sumAadhar instead of _c9, which is no longer available after the group by.
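For reference, the equivalent with the DataFrame API (a sketch using the same column names as the reduced example above) sidesteps the parsing ambiguity entirely:
import org.apache.spark.sql.functions._

df.groupBy("_c3")
  .agg(sum("_c9").as("sumAadhar"))
  .orderBy(desc("sumAadhar"))
  .limit(3)
  .show()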
