How to convert RDD[Array[Any]] to DataFrame? - apache-spark

I have an RDD[Array[Any]] as follows:
1556273771,Mumbai,1189193,1189198,0.56,-1,India,Australia,1571215104,1571215166
8374749403,London,1189193,1189198,0,1,India,England,4567362933,9374749392
7439430283,Dubai,1189193,1189198,0.76,-1,Pakistan,Sri Lanka,1576615684,4749383749
I need to convert this to a DataFrame with 10 columns, but I am new to Spark. Please let me know the simplest way to do this.
I am trying something similar to this code:
rdd_data.map{case Array(a,b,c,d,e,f,g,h,i,j) => (a,b,c,d,e,f,g,h,i,j)}.toDF()

When you create a DataFrame, Spark needs to know the data type of each column. The Any type is just a way of saying that you don't know the variable type, so a possible solution is to cast each value to a specific type. This will, of course, fail if the specified cast is invalid.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val rdd1 = spark.sparkContext.parallelize(
  Array(
    Array(1556273771L, "Mumbai", 1189193, 1189198, 0.56, -1, "India", "Australia", 1571215104L, 1571215166L),
    Array(8374749403L, "London", 1189193, 1189198, 0, 1, "India", "England", 4567362933L, 9374749392L),
    Array(7439430283L, "Dubai", 1189193, 1189198, 0.76, -1, "Pakistan", "Sri Lanka", 1576615684L, 4749383749L)
  ), 1)
//rdd1: org.apache.spark.rdd.RDD[Array[Any]]
val rdd2 = rdd1.map(r => Row(
  r(0).toString.toLong,
  r(1).toString,
  r(2).toString.toInt,
  r(3).toString.toInt,
  r(4).toString.toDouble,
  r(5).toString.toInt,
  r(6).toString,
  r(7).toString,
  r(8).toString.toLong,
  r(9).toString.toLong
))
val schema = StructType(
  List(
    StructField("col0", LongType, false),
    StructField("col1", StringType, false),
    StructField("col2", IntegerType, false),
    StructField("col3", IntegerType, false),
    StructField("col4", DoubleType, false),
    StructField("col5", IntegerType, false),
    StructField("col6", StringType, false),
    StructField("col7", StringType, false),
    StructField("col8", LongType, false),
    StructField("col9", LongType, false)
  )
)
val df = spark.createDataFrame(rdd2, schema)
df.show
+----------+------+-------+-------+----+----+--------+---------+----------+----------+
| col0| col1| col2| col3|col4|col5| col6| col7| col8| col9|
+----------+------+-------+-------+----+----+--------+---------+----------+----------+
|1556273771|Mumbai|1189193|1189198|0.56| -1| India|Australia|1571215104|1571215166|
|8374749403|London|1189193|1189198| 0.0| 1| India| England|4567362933|9374749392|
|7439430283| Dubai|1189193|1189198|0.76| -1|Pakistan|Sri Lanka|1576615684|4749383749|
+----------+------+-------+-------+----+----+--------+---------+----------+----------+
df.printSchema
root
|-- col0: long (nullable = false)
|-- col1: string (nullable = false)
|-- col2: integer (nullable = false)
|-- col3: integer (nullable = false)
|-- col4: double (nullable = false)
|-- col5: integer (nullable = false)
|-- col6: string (nullable = false)
|-- col7: string (nullable = false)
|-- col8: long (nullable = false)
|-- col9: long (nullable = false)
Hope it helps

As the other posts mention, a DataFrame requires explicit types for each column, so you can't use Any. The easiest way I can think of is to turn each row into a tuple of the right types and then use implicit DataFrame creation to convert it to a DataFrame. You were pretty close in your code; you just need to cast the elements to an acceptable type.
Basically toDF knows how to convert tuples (with accepted types) into a DF Row, and you can pass the column names into the toDF call.
For example:
val data = Array(1556273771, "Mumbai", 1189193, 1189198, 0.56, -1, "India,Australia", 1571215104, 1571215166)
val rdd = sc.parallelize(Seq(data))
val df = rdd.map {
  case Array(a, b, c, d, e, f, g, h, i) => (
    a.asInstanceOf[Int],
    b.asInstanceOf[String],
    c.asInstanceOf[Int],
    d.asInstanceOf[Int],
    e.toString.toDouble,
    f.asInstanceOf[Int],
    g.asInstanceOf[String],
    h.asInstanceOf[Int],
    i.asInstanceOf[Int]
  )
}.toDF("int1", "city", "int2", "int3", "float1", "int4", "country", "int5", "int6")
df.printSchema
df.show(100, false)
scala> df.printSchema
root
|-- int1: integer (nullable = false)
|-- city: string (nullable = true)
|-- int2: integer (nullable = false)
|-- int3: integer (nullable = false)
|-- float1: double (nullable = false)
|-- int4: integer (nullable = false)
|-- country: string (nullable = true)
|-- int5: integer (nullable = false)
|-- int6: integer (nullable = false)
scala> df.show(100, false)
+----------+------+-------+-------+------+----+---------------+----------+----------+
|int1 |city |int2 |int3 |float1|int4|country |int5 |int6 |
+----------+------+-------+-------+------+----+---------------+----------+----------+
|1556273771|Mumbai|1189193|1189198|0.56 |-1 |India,Australia|1571215104|1571215166|
+----------+------+-------+-------+------+----+---------------+----------+----------+
Edit for 0 -> Double:
As André pointed out, if you start off with 0 as an Any it will be a java.lang.Integer, not a Scala Int, and therefore not castable to a Scala Double. Converting it to a String first lets you then convert it to a Double as desired.
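A minimal sketch of that behaviour (just an illustration; it runs in any Scala REPL or spark-shell):
val zero: Any = 0                      // at runtime this is a boxed java.lang.Integer
// zero.asInstanceOf[Double]           // throws ClassCastException: Integer cannot be cast to Double
val asDouble = zero.toString.toDouble  // 0.0 -- the String round trip works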

You can try the approach below. It's a bit hacky, but it avoids having to spell out a schema.
Map each Any to a String and call toDF() to create a DataFrame with a single array column, then create new columns by selecting each element from that array column.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Column

val rdd: RDD[Array[Any]] = spark.range(5).rdd.map(s => Array(s, s + 1, s % 2))
val size = rdd.first().length

def splitCol(col: Column): Seq[(String, Column)] =
  for (i <- 0 until size) yield ("_" + i, col(i))

import spark.implicits._

rdd.map(_.map(_.toString))
  .toDF("x")
  .select(splitCol('x).map(_._2): _*)
  .toDF(splitCol('x).map(_._1): _*)
  .show()
+---+---+---+
| _0| _1| _2|
+---+---+---+
| 0| 1| 0|
| 1| 2| 1|
| 2| 3| 0|
| 3| 4| 1|
| 4| 5| 0|
+---+---+---+
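One caveat: every column produced this way is a string. If you need proper types afterwards you can cast the columns back; a rough sketch, assuming the chained result above is assigned to a val named stringDf (a name used here only for illustration) and that all values are integral:
import org.apache.spark.sql.functions.col

val typedDf = stringDf.select(stringDf.columns.map(c => col(c).cast("long").as(c)): _*)
typedDf.printSchema()  // every column is now long instead of string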

Related

Pyspark 'from_json': data frame returns null for all the json values

I have the logs below, which contain text and a JSON string:
2020-09-24T08:03:01.633Z 11.21.23.1 {"EventTime":"2020-09-24 13:33:01","Hostname":"abc-cde.india.local","Keywords":-1234}
I created a DF for the above logs, as seen below:
+----------+----------+------------------+
|      Date| Source IP|        Event Type|
+----------+----------+------------------+
|2020-09-24|11.21.23.1| {"EventTime":"202|
+----------+----------+------------------+
I created a schema for converting the JSON string to another data frame:
json_schema = StructType([
StructField("EventTime", StringType()),
StructField("Hostname", StringType()),
StructField("Keywords", IntegerType())
])
json_converted_df= df.select(F.from_json(F.col('Event Type'), json_schema).alias("data")).select("data.*").show()
but the data frame returns null for all the new JSON schema columns:
+---------+--------+--------+
|EventTime|Hostname|Keywords|
+---------+--------+--------+
|     null|    null|    null|
+---------+--------+--------+
How to resolve this issue?
Works fine with me ...
# Preparation of test dataset
a = [
(
"2020-09-24T08:03:01.633Z",
"11.21.23.1",
'{"EventTime":"2020-09-24 13:33:01","Hostname":"abc-cde.india.local","Keywords":-1234}',
),
]
b = ["Date", "Source IP", "Event Type"]
df = spark.createDataFrame(a, b)
df.show()
#+--------------------+----------+--------------------+
#| Date| Source IP| Event Type|
#+--------------------+----------+--------------------+
#|2020-09-24T08:03:...|11.21.23.1|{"EventTime":"202...|
#+--------------------+----------+--------------------+
df.printSchema()
#root
# |-- Date: string (nullable = true)
# |-- Source IP: string (nullable = true)
# |-- Event Type: string (nullable = true)
# Your code executed
import pyspark.sql.functions as F
from pyspark.sql.types import *
json_schema = StructType(
[
StructField("EventTime", StringType()),
StructField("Hostname", StringType()),
StructField("Keywords", IntegerType()),
]
)
json_converted_df = df.select(
F.from_json(F.col("Event Type"), json_schema).alias("data")
).select("data.*")
json_converted_df.show()
#+-------------------+-------------------+--------+
#| EventTime| Hostname|Keywords|
#+-------------------+-------------------+--------+
#|2020-09-24 13:33:01|abc-cde.india.local| -1234|
#+-------------------+-------------------+--------+
json_converted_df.printSchema()
#root
# |-- EventTime: string (nullable = true)
# |-- Hostname: string (nullable = true)
# |-- Keywords: integer (nullable = true)

Aggregate one column, but show all columns in select

I am trying to show the maximum value of a column while grouping rows by a date column.
So I tried this code:
maxVal = dfSelect.select('*')\
.groupBy('DATE')\
.agg(max('CLOSE'))
But the output looks like this:
+----------+----------+
| DATE|max(CLOSE)|
+----------+----------+
|1987-05-08| 43.51|
|1987-05-29| 39.061|
+----------+----------+
I want the output to look like this:
+------+---+----------+------+------+------+------+------+---+----------+
|TICKER|PER| DATE| TIME| OPEN| HIGH| LOW| CLOSE|VOL|max(CLOSE)|
+------+---+----------+------+------+------+------+------+---+----------+
| CDG| D|1987-01-02|000000|50.666|51.441|49.896|50.666| 0| 50.666|
| ABC| D|1987-01-05|000000|51.441| 52.02|51.441|51.441| 0| 51.441|
+------+---+----------+------+------+------+------+------+---+----------+
So my question is: how do I change the code to get all the columns in the output along with the aggregated 'CLOSE' column?
The schema of my data looks like this:
root
|-- TICKER: string (nullable = true)
|-- PER: string (nullable = true)
|-- DATE: date (nullable = true)
|-- TIME: string (nullable = true)
|-- OPEN: float (nullable = true)
|-- HIGH: float (nullable = true)
|-- LOW: float (nullable = true)
|-- CLOSE: float (nullable = true)
|-- VOL: integer (nullable = true)
|-- OPENINT: string (nullable = true)
If you want the same aggregation on all the columns in your original dataframe, then you can do something like:
import pyspark.sql.functions as F
expr = [F.max(coln).alias(coln) for coln in df.columns if 'date' not in coln]  # df is your dataframe
df_res = df.groupby('date').agg(*expr)
If you want multiple aggregations, then you can do something like:
sub_col1 = [...]  # define: the columns to aggregate with max
sub_col2 = [...]  # define: the columns to aggregate with first
expr1 = [F.max(coln).alias(coln) for coln in sub_col1 if 'date' not in coln]
expr2 = [F.first(coln).alias(coln) for coln in sub_col2 if 'date' not in coln]
expr=expr1+expr2
df_res = df.groupby('date').agg(*expr)
If you want only one of the columns aggregated and added to your original dataframe, then you can do a self-join after aggregating:
df_agg = df.groupby('date').agg(F.max('close').alias('close_agg')).withColumn("dummy", F.lit("dummmy"))  # the dummy column is a workaround for Spark self-join issues
df_join = df.join(df_agg,on='date',how='left')
Or you can use a window function:
from pyspark.sql import Window
w= Window.partitionBy('date')
df_res = df.withColumn("max_close",F.max('close').over(w))

Spark SQL not equals not working on column created using lag

I am trying to find missing sequences in a CSV file, and here is the code I have:
val customSchema = StructType(
  Array(
    StructField("MessageId", StringType, false),
    StructField("msgSeqID", LongType, false),
    StructField("site", StringType, false),
    StructField("msgType", StringType, false)
  )
)
val logFileDF = sparkSession.sqlContext.read.format("csv")
  .option("delimiter", ",")
  .option("header", false)
  .option("mode", "DROPMALFORMED")
  .schema(customSchema)
  .load(logFilePath)
  .toDF()
logFileDF.printSchema()
logFileDF.repartition(5000)
logFileDF.createOrReplaceTempView("LogMessageData")
sparkSession.sqlContext.cacheTable("LogMessageData")
val selectQuery: String = "SELECT MessageId,site,msgSeqID,msgType,lag(msgSeqID,1) over (partition by site,msgType order by site,msgType,msgSeqID) as prev_val FROM LogMessageData order by site,msgType,msgSeqID"
val logFileLagRowNumDF = sparkSession.sqlContext.sql(selectQuery).toDF()
logFileLagRowNumDF.repartition(1000)
logFileLagRowNumDF.printSchema()
logFileLagRowNumDF.createOrReplaceTempView("LogMessageDataUpdated")
sparkSession.sqlContext.cacheTable("LogMessageDataUpdated")
val errorRecordQuery: String = "select * from LogMessageDataUpdated where prev_val!=null and msgSeqID != prev_val+1 order by site,msgType,msgSeqID";
val errorRecordQueryDF = sparkSession.sqlContext.sql(errorRecordQuery).toDF()
logger.info("Total. No.of Missing records =[" + errorRecordQueryDF.count() + "]")
val noOfMissingRecords = errorRecordQueryDF.count()
Here is the sample data I have:
Msg_1,S1,A,10000000
Msg_2,S1,A,10000002
Msg_3,S2,A,10000003
Msg_4,S3,B,10000000
Msg_5,S3,B,10000001
Msg_6,S3,A,10000003
Msg_7,S3,A,10000001
Msg_8,S3,A,10000002
Msg_9,S4,A,10000000
Msg_10,S4,A,10000001
Msg_11,S4,A,10000000
Msg_12,S4,A,10000005
And here is the output I am getting:
root
|-- MessageId: string (nullable = true)
|-- site: string (nullable = true)
|-- msgType: string (nullable = true)
|-- msgSeqID: long (nullable = true)
INFO EimComparisonProcessV2 - Total No.Of Records =[12]
+---------+----+-------+--------+
|MessageId|site|msgType|msgSeqID|
+---------+----+-------+--------+
| Msg_1| S1| A|10000000|
| Msg_2| S1| A|10000002|
| Msg_3| S2| A|10000003|
| Msg_4| S3| B|10000000|
| Msg_5| S3| B|10000001|
| Msg_6| S3| A|10000003|
| Msg_7| S3| A|10000001|
| Msg_8| S3| A|10000002|
| Msg_9| S4| A|10000000|
| Msg_10| S4| A|10000001|
| Msg_11| S4| A|10000000|
| Msg_12| S4| A|10000005|
+---------+----+-------+--------+
root
|-- MessageId: string (nullable = true)
|-- site: string (nullable = true)
|-- msgSeqID: long (nullable = true)
|-- msgType: string (nullable = true)
|-- prev_val: long (nullable = true)
+---------+----+--------+-------+--------+
|MessageId|site|msgSeqID|msgType|prev_val|
+---------+----+--------+-------+--------+
| Msg_1| S1|10000000| A| null|
| Msg_2| S1|10000002| A|10000000|
| Msg_3| S2|10000003| A| null|
| Msg_7| S3|10000001| A| null|
| Msg_8| S3|10000002| A|10000001|
+---------+----+--------+-------+--------+
only showing top 5 rows
INFO TestProcess - Total No.Of Records Updated DF=[12]
INFO TestProcess - Total. No.of Missing records =[0]
Just replace your query with this line of code and it will show you the results. In SQL, prev_val != null is never true, because any comparison with null yields null; the filter has to use prev_val is not null instead.
val errorRecordQuery: String = "select * from LogMessageDataUpdated where prev_val is not null and msgSeqID <> prev_val+1 order by site,msgType,msgSeqID";
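For reference, roughly the same check written with the DataFrame API instead of SQL (a sketch only, assuming the column names from the schema above):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag}

val w = Window.partitionBy("site", "msgType").orderBy("msgSeqID")
val withPrev = logFileDF.withColumn("prev_val", lag("msgSeqID", 1).over(w))
val missingDF = withPrev
  .filter(col("prev_val").isNotNull && col("msgSeqID") =!= col("prev_val") + 1)
  .orderBy("site", "msgType", "msgSeqID")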

Convert datatype of column from StringType to StructType in dataframe in spark scala

+-------+-----+---------------------------------+
|     ID|CO_ID|                             DATA|
+-------+-----+---------------------------------+
|ABCD123|abc12| [{"month":"Jan","day":"monday"}]|
|BCHG345|wed34|[{"month":"Jul","day":"tuessay"}]|
+-------+-----+---------------------------------+
I have the dataframe above, in which the column DATA is of StringType. I want to convert it to StructType. How can I do this?
Use from_json
df.withColumn("data_struct",from_json($"data",StructType(Array(StructField("month", StringType),StructField("day", StringType)))))
On Spark 2.4.0, I get the following
import org.apache.spark.sql.types.{StructType, StructField, StringType}
import org.apache.spark.sql.functions.from_json

val df = List("[{\"month\":\"Jan\",\"day\":\"monday\"}]").toDF("data")
val df2 = df.withColumn("data_struct", from_json($"data", StructType(Array(StructField("month", StringType), StructField("day", StringType)))))
df2.show
+--------------------+-------------+
| data| data_struct|
+--------------------+-------------+
|[{"month":"Jan","...|[Jan, monday]|
+--------------------+-------------+
df2.printSchema
root
|-- data: string (nullable = true)
|-- data_struct: struct (nullable = true)
| |-- month: string (nullable = true)
| |-- day: string (nullable = true)
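Since the sample DATA values are JSON arrays, you may want the full array of structs rather than a single struct; a variant (a sketch, assuming Spark 2.4+) wraps the schema in an ArrayType:
import org.apache.spark.sql.types.{ArrayType, StructType, StructField, StringType}
import org.apache.spark.sql.functions.from_json

val elementSchema = StructType(Array(StructField("month", StringType), StructField("day", StringType)))
val df3 = df.withColumn("data_struct", from_json($"data", ArrayType(elementSchema)))
// data_struct is now an array<struct<month:string,day:string>>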

org.apache.spark.sql.AnalysisException: Can't extract value from sum(_c9#30);

I am using Spark SQL to select a column along with the sum of another column:
Below is my query:
scala> spark.sql("select distinct _c3,sum(_c9).as(sumAadhar) from aadhar group by _c3 order by _c9 desc LIMIT 3").show
And my schema is :
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: string (nullable = true)
|-- _c3: string (nullable = true)
|-- _c4: string (nullable = true)
|-- _c5: string (nullable = true)
|-- _c6: string (nullable = true)
|-- _c7: string (nullable = true)
|-- _c8: string (nullable = true)
|-- _c9: double (nullable = true)
|-- _c10: string (nullable = true)
|-- _c11: string (nullable = true)
|-- _c12: string (nullable = true)
And I am getting the below error:
org.apache.spark.sql.AnalysisException: Can't extract value from sum(_c9#30);
at org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:613)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:605)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:308)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:308)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:269)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:279)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:283)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
Any idea what I am doing wrong, or is there any other way to sum the values of a column?
Check the below, which was tried on a reduced schema:
scala> val df = Seq(("a", 2), ("a", 3), ("b", 4), ("a", 9), ("b", 1), ("c", 100)).toDF("_c3", "_c9")
df: org.apache.spark.sql.DataFrame = [_c3: string, _c9: int]
scala> df.createOrReplaceTempView("aadhar")
scala> spark.sql("select _c3,sum(_c9) as sumAadhar from aadhar group by _c3 order by sumAadhar desc LIMIT 3").show
+---+---------+
|_c3|sumAadhar|
+---+---------+
| c| 100|
| a| 14|
| b| 5|
+---+---------+
Removed distinct since it's not necessary, as your original query already groups by _c3.
Changed sum(_c9).as(sumAadhar) to sum(_c9) as sumAadhar: the .as(...) call is DataFrame API syntax, not SQL, so inside the SQL string the dot is interpreted as an attempt to extract a field from the result of sum(_c9), which is a plain double with no fields to extract, hence the "Can't extract value from sum(_c9)" error.
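If you would rather stay in the DataFrame API, where .as/.alias is the right tool, a rough equivalent of the corrected query is (a sketch, using the reduced df from above):
import org.apache.spark.sql.functions.{sum, desc}

val top3 = df.groupBy("_c3")
  .agg(sum("_c9").alias("sumAadhar"))
  .orderBy(desc("sumAadhar"))
  .limit(3)
top3.show()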
