I need help with a nested structure in Spark SQL using the sql method. I created a DataFrame on top of an existing RDD (dataRDD) with a schema like this:
schema = StructType([
    StructField("m", LongType()),
    StructField("field2", StructType([
        StructField("st", StringType()),
        StructField("end", StringType()),
        StructField("dr", IntegerType())
    ]))
])
printSchema() returns this:
root
|-- m: long (nullable = true)
|-- field2: struct (nullable = true)
| |-- st: string (nullable = true)
| |-- end: string (nullable = true)
| |-- dr: integer (nullable = true)
Creating the data frame from the data RDD and applying the schema works well.
df= sqlContext.createDataFrame( dataRDD, schema )
df.registerTempTable( "logs" )
But retrieving the data is not working:
res = sqlContext.sql("SELECT m, field2.st FROM logs") # <- This fails
...org.apache.spark.sql.AnalysisException: cannot resolve 'field.st' given input columns msisdn, field2;
res = sqlContext.sql("SELECT m, field2[0] FROM logs") # <- Also fails
...org.apache.spark.sql.AnalysisException: unresolved operator 'Project [field2#1[0] AS c0#2];
res = sqlContext.sql("SELECT m, st FROM logs") # <- Also not working
...cannot resolve 'st' given input columns m, field2;
So how can I access the nested structure in the SQL syntax?
Thanks
You had something else happening in your testing, because field2.st is the correct syntax:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

case class field2(st: String, end: String, dr: Int)

val schema = StructType(Array(
  StructField("m", LongType),
  StructField("field2", StructType(Array(
    StructField("st", StringType),
    StructField("end", StringType),
    StructField("dr", IntegerType)
  )))
))

val df2 = sqlContext.createDataFrame(
  sc.parallelize(Array(
    Row(1L, field2("this", "is", 1234)),
    Row(2L, field2("a", "test", 5678))
  )),
  schema
)
// register so the SQL query below can reference the table as df2
df2.registerTempTable("df2")
/* df2.printSchema
root
|-- m: long (nullable = true)
|-- field2: struct (nullable = true)
| |-- st: string (nullable = true)
| |-- end: string (nullable = true)
| |-- dr: integer (nullable = true)
*/
val results = sqlContext.sql("select m,field2.st from df2")
/* results.show
m st
1 this
2 a
*/
Look back at your error message: cannot resolve 'field.st' given input columns msisdn, field2 -- field vs. field2, and m vs. msisdn. Check your code again -- the names are not lining up.
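For what it's worth, here is a minimal PySpark sketch of the same thing (reusing the sc, sqlContext and schema from your question; the sample rows below are made up):

# Plain tuples line up with the schema by position, including the nested struct
dataRDD = sc.parallelize([
    (1, ("2015-01-01", "2015-01-02", 10)),
    (2, ("2015-02-01", "2015-02-03", 20)),
])
df = sqlContext.createDataFrame(dataRDD, schema)
df.registerTempTable("logs")
sqlContext.sql("SELECT m, field2.st FROM logs").show()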
Schema
root
|-- userId: string (nullable = true)
|-- languageknowList: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- code: string (nullable = false)
| | |-- description: string (nullable = false)
| | |-- name: string (nullable = false)
The df has userId and languageknowList. Every user should know English, so when English is not present in languageknowList I have to add it.
English
code: 10
description: English Language
name: English
Can anyone please help me?
You can create a new array-of-structs column and concatenate it with the existing column:
import pyspark.sql.functions as F

# Struct literal for the English entry
english = F.struct(
    F.lit('10').alias('code'),
    F.lit('English Language').alias('description'),
    F.lit('English').alias('name')
)

# Append the English struct only when it is not already present in the array
df2 = df.withColumn(
    'languageknowList',
    F.when(
        ~F.array_contains(F.col('languageknowList'), english),
        F.concat(
            F.col('languageknowList'),
            F.array(english)
        )
    ).otherwise(
        F.col('languageknowList')
    )
)
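A quick way to eyeball the result (using the df2 from the snippet above):

df2.select('userId', 'languageknowList').show(truncate=False)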
I have a column that contains an array of structs. It looks like this:
|-- Network: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Code: string (nullable = true)
| | |-- Signal: string (nullable = true)
This is just a small sample; there are many more columns inside the struct than this. Is there a way to take the arrays in the column for each row, concatenate them, and make them into one string? For example, we could have something like this:
[["example", 2], ["example2", 3]]
Is there a way to turn that into:
"example2example3"?
Assuming a dataframe df with the schema shown in the question (df.printSchema) and some sample data (df.show(false)), you first need to explode the Network array so that you can select the struct elements Code and Signal.
var myDf = df.select(explode($"Network").as("Network"))
Then you need to concatenate the two columns using the concat() function and pass the output to the collect_list() function, which will aggregate all rows into one row of type array<string>:
myDf = myDf.select(collect_list(concat($"Network.code",$"Network.signal")).as("data"))
Finally, you need to concatenate into the required format, which can be done with the concat_ws() function. It takes two arguments: the separator to be placed between two strings, and a column of type array<string>, which is the output from the previous step. Since your use case does not need any separator between the concatenated strings, pass an empty string as the separator.
myDf = myDf.select(concat_ws("",$"data").as("data"))
All the above steps can be done in one line:
myDf = df.select(explode($"Network").as("Network")).select(concat_ws("", collect_list(concat($"Network.code", $"Network.signal"))).as("data"))
myDf.show(false)
If you want the output directly in a String variable, then use:
val myStr = myDf.first.get(0).toString
print(myStr)
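For reference, a rough PySpark equivalent of the whole chain (just a sketch; the tiny sample dataframe is made up to match the question's schema, and a SparkSession named spark is assumed):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data with the question's Network array of (Code, Signal) structs
sample = spark.createDataFrame(
    [([("example", "2"), ("example2", "3")],)],
    "Network: array<struct<Code: string, Signal: string>>"
)

result = (
    sample
    .select(F.explode("Network").alias("Network"))
    .select(F.concat_ws("", F.collect_list(
        F.concat(F.col("Network.Code"), F.col("Network.Signal"))
    )).alias("data"))
)
myStr = result.first()["data"]
print(myStr)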
There is a library called spark-hats (GitHub, small article) that you might find very useful in these situations.
With it, you can easily map over the array and output the concatenation next to the existing elements, or even somewhere else if you provide a fully qualified name.
Setup
import org.apache.spark.sql.functions._
import za.co.absa.spark.hats.Extensions._
scala> df.printSchema
root
|-- info: struct (nullable = true)
| |-- drivers: struct (nullable = true)
| | |-- carName: string (nullable = true)
| | |-- carNumbers: string (nullable = true)
| | |-- driver: string (nullable = true)
|-- teamName: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- team1: string (nullable = true)
| | |-- team2: string (nullable = true)
scala> df.show(false)
+---------------------------+------------------------------+
|info |teamName |
+---------------------------+------------------------------+
|[[RB7, 33, Max Verstappen]]|[[Redbull, rb], [Monster, mt]]|
+---------------------------+------------------------------+
The command you are looking for:
scala> val dfOut = df.nestedMapColumn(inputColumnName = "teamName", outputColumnName = "nextElementInArray", expression = a => concat(a.getField("team1"), a.getField("team2")) )
dfOut: org.apache.spark.sql.DataFrame = [info: struct<drivers: struct<carName: string, carNumbers: string ... 1 more field>>, teamName: array<struct<team1:string,team2:string,nextElementInArray:string>>]
Output
scala> dfOut.printSchema
root
|-- info: struct (nullable = true)
| |-- drivers: struct (nullable = true)
| | |-- carName: string (nullable = true)
| | |-- carNumbers: string (nullable = true)
| | |-- driver: string (nullable = true)
|-- teamName: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- team1: string (nullable = true)
| | |-- team2: string (nullable = true)
| | |-- nextElementInArray: string (nullable = true)
scala> dfOut.show(false)
+---------------------------+----------------------------------------------------+
|info |teamName |
+---------------------------+----------------------------------------------------+
|[[RB7, 33, Max Verstappen]]|[[Redbull, rb, Redbullrb], [Monster, mt, Monstermt]]|
+---------------------------+----------------------------------------------------+
I'm working on a project where every day I need to deal with tons of AVRO files. To extract the data from AVRO I use Spark SQL. To achieve this, I first need to printSchema and then select the fields to see the data. I want to automate this process. Given any input AVRO, I want to write a script which will automatically generate the Spark SQL query (considering the structs and arrays in the avsc file). I'm okay with writing the script in Java or Python.
-- Sample input AVRO
root
|-- identifier: struct (nullable = true)
| |-- domain: string (nullable = true)
| |-- id: string (nullable = true)
| |-- version: long (nullable = true)
| |-- alternativeIdentifiers: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- identifier: struct (nullable = true)
| | | | |-- domain: string (nullable = true)
| | | | |-- id: string (nullable = true)
-- Output I'm expecting
SELECT identifier.domain, identifier.id, identifier.version
You can use something like this to generate a list of columns based on the schema:
import org.apache.spark.sql.types.{StructField, StructType}
def getStructFieldName(f: StructField, baseName: String = ""): Seq[String] = {
  val bname = if (baseName.isEmpty) "" else baseName + "."
  f.dataType match {
    case StructType(s) =>
      s.flatMap(x => getStructFieldName(x, bname + f.name))
    case _ => Seq(bname + f.name)
  }
}
Then it could be used on the real dataframe, like this:
val data = spark.read.json("some_data.json")
val cols = data.schema.flatMap(x => getStructFieldName(x))
As a result, we get a sequence of strings that we can use either to do a select:
import org.apache.spark.sql.functions.col
data.select(cols.map(col): _*)
or we can generate a comma-separated list that we can use in the spark.sql:
spark.sql(s"select ${cols.mkString(", ")} from table")
I currently have a struct field created this way:
df = df.withColumn('my_struct', struct(
    col('id').alias('id_test'),
    col('value').alias('value_test')
).alias('my_struct'))
The thing is that now I need to add an extra field to my_struct called "optional". This field must be there when it exists and be removed when it's not. Sadly, values like null/None are not an option.
So far I have two different dataframes: one with the desired value and the column, keyed by id, and another one without the value/column but with all the information.
df_optional = df_optional.select('id','optional')
df = df.select('id','value','my_struct')
I want to add the optional value into df.my_struct for the rows where df_optional.id matches df.id, and keep the rest unchanged.
Up to this point I have this:
df_with_option = df.join(df_optional, on=['id'], how='inner') \
    .withColumn('my_struct', struct(
        col('id').alias('id_test'),
        col('value').alias('value_test'),
        col('optional').alias('optional')
    ).alias('my_struct')).drop('optional')
df_without = df.join(df_optional, on=['id'], how='leftanti')  # it already has my_struct
But a union requires both sides to have matching columns, so my code breaks.
df_result = df_without.unionByName(df_with_option)
I want to union both dataframes because at the end I write a json file partitioned by id:
df_result.repartitionByRange(df_result.count(), df_result['id']).write.format('json').mode('overwrite').save('my_path')
Those json files should have the 'optional' column when it has values, otherwise it should be out of the schema.
Any help will be appreciated.
--ADDITIONAL INFO.
Schema input:
df_optional
|-- id: string (nullable = true)
|-- optional: string (nullable = true)
df
|-- id: string (nullable = true)
|-- value: string (nullable = true)
|-- my_struct: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- value: string (nullable = true)
Schema output:
df_result
|-- id: string (nullable = true)
|-- value: string (nullable = true)
|-- my_struct: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- value: string (nullable = true)
| |-- optional: string (nullable = true) (*)
(*) Only when it exists.
--UPDATE
I think that it's just not possible that way. I probably need to keep both dataframes apart and just write them out twice. Something like this:
df_without.repartitionByRange(df_without.count(), df_without['id']).write.format('json').mode('overwrite').save('my_path')
df_with_option.repartitionByRange(df_with_option.count(), df_with_option['id']).write.format('json').mode('append').save('my_path')
Then I will have the files in my path, each written in its own way.
I have the following dataframes:
accumulated_results_df
|-- company_id: string (nullable = true)
|-- max_dd: string (nullable = true)
|-- min_dd: string (nullable = true)
|-- count: string (nullable = true)
|-- mean: string (nullable = true)
computed_df
|-- company_id: string (nullable = true)
|-- min_dd: date (nullable = true)
|-- max_dd: date (nullable = true)
|-- mean: double (nullable = true)
|-- count: long (nullable = false)
I am trying to do a join using Spark SQL as below:
val resultDf = accumulated_results_df.as("a").join(computed_df.as("c"),
( $"a.company_id" === $"c.company_id" ) && ( $"c.min_dd" > $"a.max_dd" ), "left")
It gives the following error:
org.apache.spark.sql.AnalysisException: Reference 'company_id' is ambiguous, could be: a.company_id, c.company_id.;
What am I doing wrong here and how do I fix this?
It should work using the col function to correctly reference the aliased dataframes and columns:
val resultDf = accumulated_results_df.as("a")
  .join(
    computed_df.as("c"),
    (col("a.company_id") === col("c.company_id")) && (col("c.min_dd") > col("a.max_dd")),
    "left"
  )
I fixed it with something like the following:
val resultDf = accumulated_results_df.join(
  computed_df.withColumnRenamed("company_id", "right_company_id").as("c"),
  accumulated_results_df("company_id") === $"c.right_company_id" && ($"c.min_dd" > accumulated_results_df("max_dd")),
  "left"
)