Accessing column names with periods - Spark SQL 1.3 - apache-spark

I have a DataFrame with field names that contain a period. When I attempt to use select() on them, Spark cannot resolve them, likely because '.' is used for accessing nested fields.
Here's the error:
enrichData.select("google.com")
org.apache.spark.sql.AnalysisException: cannot resolve 'google.com' given input columns google.com, yahoo.com, ....
Is there a way to access these columns? Or an easy way to change the column names (since I can't select them, how would I rename them)?

A period in a column name makes Spark treat it as a nested field (a field within a field). To work around that, wrap the name in backticks (`). This should work:
scala> val df = Seq(("yr", 2000), ("pr", 12341234)).toDF("x.y", "e")
df: org.apache.spark.sql.DataFrame = [x.y: string, e: int]
scala> df.select("`x.y`").show
+---+
|x.y|
+---+
| yr|
| pr|
+---+
In short, wrap the column name in backticks (`) whenever you reference it.
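Applied to the DataFrame in the question, the same fix would be (a sketch reusing the column name from the error message):
enrichData.select("`google.com`")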

You can drop the schema and recreate it without the periods like this:
import org.apache.spark.sql.types.{StructField, StructType}

val newEnrichData = sqlContext.createDataFrame(
  enrichData.rdd,
  StructType(enrichData.schema.fields.map(sf =>
    StructField(sf.name.replace(".", ""), sf.dataType, sf.nullable)
  ))
)
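If your Spark version has DataFrame.toDF(colNames: String*), another sketch is to rename every column up front, here replacing the dots with underscores rather than removing them (the new names are illustrative):
val renamed = enrichData.toDF(enrichData.columns.map(_.replace(".", "_")): _*)
renamed.select("google_com").show()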

Related

SPARK Parquet file loading with different schema

We have parquet files generated with two different schemas, both containing ID and AMOUNT fields.
File:
file1.snappy.parquet
ID: INT
AMOUNT: DECIMAL(15,6)
Content:
1,19500.00
2,198.34
file2.snappy.parquet
ID: INT
AMOUNT: DECIMAL(15,2)
Content:
1,19500.00
3,198.34
When I load both files together with df3 = spark.read.parquet("output/") and try to read the data, Spark applies the Decimal(15,6) schema to the file whose amounts were written as Decimal(15,2), and that file's data is interpreted incorrectly. Is there a way to retrieve the data properly in this case?
Final output I could see after executing df3.show()
+---+------------+
| ID|      AMOUNT|
+---+------------+
|  1|    1.950000|
|  3|    0.019834|
|  1|19500.000000|
|  2|  198.340000|
+---+------------+
As you can see, the amounts in the first and second rows have been interpreted incorrectly.
Looking for some suggestions on this. I know the issue will go away if we regenerate the files with the same schema, but that requires regenerating and replacing files that have already been delivered. Is there a temporary workaround we can use in the meantime while we work on regenerating those files?
You can try setting the mergeSchema option to true.
So instead of
df3 = spark.read.parquet("output/")
Try this:
df3 = spark.read.option("mergeSchema","true").parquet("output/")
However, this can still give inconsistent records if the two parquet files were written by different Spark versions. In that case, the newer Spark version should set the following property to true when writing:
spark.sql.parquet.writeLegacyFormat
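For reference, a minimal Scala sketch of setting that property on the job that rewrites the files (spark is the writer's SparkSession and df the dataframe being rewritten; both names are illustrative):
// Set before (re)writing the parquet files
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
df.write.mode("overwrite").parquet("output/")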
Alternatively, try reading the column as a string by providing the schema manually while reading the file:
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("flag_piece", StringType(), True)
])
spark.read.format("parquet").schema(schema).load(path)
The following worked for me:
df = spark.read.parquet("data_file/")
for col in df.columns:
    df = df.withColumn(col, df[col].cast("string"))

Spark-Scala Try Select Statement

I'm trying to incorporate a Try().getOrElse() statement in the select statement for a Spark DataFrame. The project I'm working on will be applied to multiple environments; however, each environment names the raw data slightly differently for ONLY one field. I do not want to write several different functions to handle each variant. Is there an elegant way to handle exceptions, like the one below, in a DataFrame select statement?
val dfFilter = dfRaw
  .select(
    Try($"some.field.nameOption1").getOrElse($"some.field.nameOption2"),
    $"some.field.abc",
    $"some.field.def"
  )
dfFilter.show(33, false)
However, I keep getting the following error, which makes sense because the field does not exist in this environment's raw data, but I'd expect the getOrElse statement to catch that exception.
org.apache.spark.sql.AnalysisException: No such struct field nameOption1 in...
Is there a good way to handle exceptions in Scala Spark for select statements? Or will I need to code up different functions for each case?
val selectedColumns = if (dfRaw.columns.contains("some.field.nameOption1")) $"some.field.nameOption1" else $"some.field.nameOption2"
val dfFilter = dfRaw
  .select(selectedColumns, ...)
I'm revisiting this question after a year. I believe the solution below is much more elegant; thoughts are welcome:
// Generate a fake DataFrame
import scala.util.Try
import spark.implicits._ // needed for toDF outside the spark-shell

val df = Seq(
  ("1234", "A", "AAA"),
  ("1134", "B", "BBB"),
  ("2353", "C", "CCC")
).toDF("id", "name", "nameAlt")

// Extract the column names
val columns = df.columns

// Add a "new" column name that is NOT present in the above DataFrame
val columnsAdd = columns ++ Array("someNewColumn")

// Let's then "try" to select all of the columns
df.select(columnsAdd.flatMap(c => Try(df(c)).toOption): _*).show(false)

// Let's reduce the DF again...should yield the same results
val dfNew = df.select("id", "name")
dfNew.select(columnsAdd.flatMap(c => Try(dfNew(c)).toOption): _*).show(false)
// Results
columns: Array[String] = Array(id, name, nameAlt)
columnsAdd: Array[String] = Array(id, name, nameAlt, someNewColumn)
+----+----+-------+
|id  |name|nameAlt|
+----+----+-------+
|1234|A   |AAA    |
|1134|B   |BBB    |
|2353|C   |CCC    |
+----+----+-------+
dfNew: org.apache.spark.sql.DataFrame = [id: string, name: string]
+----+----+
|id  |name|
+----+----+
|1234|A   |
|1134|B   |
|2353|C   |
+----+----+
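Packaging the Try-based selection into a reusable helper might look like this (a sketch; selectExisting is a hypothetical name, not part of the original answer):
import scala.util.Try
import org.apache.spark.sql.DataFrame

// Select only the requested columns that actually resolve on df
def selectExisting(df: DataFrame, names: Seq[String]): DataFrame =
  df.select(names.flatMap(c => Try(df(c)).toOption): _*)

// Usage, following the example above
selectExisting(dfNew, columnsAdd).show(false)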

INSERT IF NOT EXISTS ELSE UPDATE in Spark SQL

Is there any provision for doing "INSERT IF NOT EXISTS ELSE UPDATE" in Spark SQL?
I have a Spark SQL table "ABC" that has some records.
I then have another batch of records that I want to insert/update in this table based on whether they already exist in it.
Is there a SQL command I can use in a query to make this happen?
In regular Spark this could be achieved with a join followed by a map like this:
import spark.implicits._
val df1 = spark.sparkContext.parallelize(List(("id1", "original"), ("id2", "original"))).toDF("df1_id", "df1_status")
val df2 = spark.sparkContext.parallelize(List(("id1", "new"), ("id3", "new"))).toDF("df2_id", "df2_status")
val df3 = df1
  .join(df2, 'df1_id === 'df2_id, "outer")
  .map(row => {
    if (row.isNullAt(2))
      (row.getString(0), row.getString(1))
    else
      (row.getString(2), row.getString(3))
  })
This yields:
scala> df3.show
+---+--------+
| _1|      _2|
+---+--------+
|id3|     new|
|id1|     new|
|id2|original|
+---+--------+
You could also use select with udfs instead of map, but in this particular case with null-values, I personally prefer the map variant.
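For reference, a select-based variant of the same upsert using the built-in coalesce function instead of a UDF (a sketch under the column names from the example above, not the answerer's exact approach; spark.implicits._ is already imported there):
import org.apache.spark.sql.functions.coalesce

val upserted = df1
  .join(df2, $"df1_id" === $"df2_id", "outer")
  .select(
    coalesce($"df2_id", $"df1_id").as("id"),
    coalesce($"df2_status", $"df1_status").as("status")
  )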
You can use Spark SQL like this:
SELECT * FROM (
  SELECT c.*, row_number() OVER (PARTITION BY tac ORDER BY tag DESC) AS TAG_NUM
  FROM (
    SELECT a.tac, a.name, 0 AS tag FROM tableA a
    UNION ALL
    SELECT b.tac, b.name, 1 AS tag FROM tableB b
  ) c
) d
WHERE TAG_NUM = 1
tac is the column you want to insert/update by.
I know it's a bit late to share my code, but to add or update records in my database, I wrote a function that looks like this:
import pandas as pd

# Returns a spark dataframe with added and updated data
# key is the primary key of the dataframes
# dfToUpdate and dfToAddAndUpdate are spark dataframes
def AddOrUpdateDf(dfToUpdate, dfToAddAndUpdate, key):
    # Convert the spark dataframe dfToUpdate to a pandas dataframe
    dfToUpdatePandas = dfToUpdate.toPandas()
    # Convert the spark dataframe dfToAddAndUpdate to a pandas dataframe
    dfToAddAndUpdatePandas = dfToAddAndUpdate.toPandas()
    # Update existing records with the latest ones, and add records that are new
    AddOrUpdatePandasDf = pd.concat([dfToUpdatePandas, dfToAddAndUpdatePandas]).drop_duplicates([key], keep='last').sort_values(key)
    # Convert back to get a spark dataframe
    AddOrUpdateDf = spark.createDataFrame(AddOrUpdatePandasDf)
    return AddOrUpdateDf
As you can see, we need to convert the Spark dataframes to pandas dataframes in order to use pd.concat and, especially, drop_duplicates with keep='last'; then we convert back to a Spark dataframe and return it.
I don't think this is the best way to handle the AddOrUpdate, but at least, it works.

How to write just the `row` value of a DataFrame to a file in spark?

I have a dataframe that has just one column, whose value is a JSON string. I'm trying to write just the values to a file with one record per line.
scala> selddf.printSchema
root
|-- raw_event: string (nullable = true)
The data looks like this:
scala> selddf.show(1)
+--------------------+
| raw_event|
+--------------------+
|{"event_header":{...|
+--------------------+
only showing top 1 row
I am running the following to save it to file:
selddf.select("raw_event").write.json("/data/test")
The output looks like:
{"raw_event":"{\"event_header\":{\"version\":\"1.0\"...}"}
I would like the output to just say:
{\"event_header\":{\"version\":\"1.0\"...}
What am I missing?
The reason this happens is that when you write JSON you are writing the whole dataframe, in which the value sits inside a column named raw_event.
Your first option is to simply write it as text:
df.write.text(filename)
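Applied to the dataframe in the question, that would be something like (the output path is illustrative):
selddf.select("raw_event").write.text("/data/test_text")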
Another option (if your JSON schema is the same for all elements) is to use the from_json function to convert the string into a proper struct column, select its elements (which include all members of the JSON), and only then save it:
val df = Seq("{\"a\": \"str\", \"b\": [1,2,3], \"c\": {\"d\": 1, \"e\": 2}}").toDF("raw_event")
import org.apache.spark.sql.types._
val schema = StructType(Seq(StructField("a", StringType), StructField("b", ArrayType(IntegerType)), StructField("c", StructType(Seq(StructField("d", IntegerType), StructField("e", IntegerType))))))
df.withColumn("jsonData", from_json($"raw_event", schema)).select("jsonData.*").write.json("bla.json")
The advantage of the second option is that you can test for malformed rows (which result in null) and add a filter to remove them.
Note that in both cases the " characters are not escaped. If you want that, you would need to use the first option and first apply a UDF that adds the escaping.
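A rough sketch of such an escaping UDF (the function and path names are illustrative, not from the original answer):
import org.apache.spark.sql.functions.{col, udf}

// Escape the double quotes, then write the strings as plain text
val escapeQuotes = udf((s: String) => if (s == null) null else s.replace("\"", "\\\""))
selddf.withColumn("raw_event", escapeQuotes(col("raw_event"))).write.text("/data/test_escaped")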

How to read a nested collection in Spark

I have a parquet table where one of the columns is
array<struct<col1,col2,..colN>>
I can run queries against this table in Hive using the LATERAL VIEW syntax.
How do I read this table into an RDD, and more importantly, how do I filter, map, etc. this nested collection in Spark?
I could not find any references to this in the Spark documentation. Thanks in advance for any information!
P.S. I thought it might be helpful to give some stats on the table.
Number of columns in main table ~600. Number of rows ~200m.
Number of "columns" in nested collection ~10. Avg number of records in nested collection ~35.
There is no magic in the case of a nested collection. Spark handles an RDD[(String, String)] and an RDD[(String, Seq[String])] the same way.
Reading such a nested collection from Parquet files can be tricky, though.
Let's take an example from the spark-shell (1.3.1):
scala> import sqlContext.implicits._
import sqlContext.implicits._
scala> case class Inner(a: String, b: String)
defined class Inner
scala> case class Outer(key: String, inners: Seq[Inner])
defined class Outer
Write the parquet file:
scala> val outers = sc.parallelize(List(Outer("k1", List(Inner("a", "b")))))
outers: org.apache.spark.rdd.RDD[Outer] = ParallelCollectionRDD[0] at parallelize at <console>:25
scala> outers.toDF.saveAsParquetFile("outers.parquet")
Read the parquet file:
scala> import org.apache.spark.sql.catalyst.expressions.Row
import org.apache.spark.sql.catalyst.expressions.Row
scala> val dataFrame = sqlContext.parquetFile("outers.parquet")
dataFrame: org.apache.spark.sql.DataFrame = [key: string, inners: array<struct<a:string,b:string>>]
scala> val outers = dataFrame.map { row =>
| val key = row.getString(0)
| val inners = row.getAs[Seq[Row]](1).map(r => Inner(r.getString(0), r.getString(1)))
| Outer(key, inners)
| }
outers: org.apache.spark.rdd.RDD[Outer] = MapPartitionsRDD[8] at map at DataFrame.scala:848
The important part is row.getAs[Seq[Row]](1). The internal representation of a nested sequence of structs is ArrayBuffer[Row]; you could use any super-type of it instead of Seq[Row]. The 1 is the column index in the outer row. I used the method getAs here, but there are alternatives in the latest versions of Spark. See the source code of the Row trait.
Now that you have a RDD[Outer], you can apply any wanted transformation or action.
// Filter the outers
outers.filter(_.inners.nonEmpty)
// Filter the inners
outers.map(outer => outer.copy(inners = outer.inners.filter(_.a == "a")))
Note that we used the Spark SQL library only to read the parquet file. You could, for example, select only the wanted columns directly on the DataFrame before mapping it to an RDD.
dataFrame.select('col1, 'col2).map { row => ... }
I'll give a Python-based answer since that's what I'm using. I think Scala has something similar.
The explode function was added in Spark 1.4.0 to handle nested arrays in DataFrames, according to the Python API docs.
Create a test dataframe:
from pyspark.sql import Row
df = sqlContext.createDataFrame([Row(a=1, intlist=[1,2,3]), Row(a=2, intlist=[4,5,6])])
df.show()
## +-+--------------------+
## |a| intlist|
## +-+--------------------+
## |1|ArrayBuffer(1, 2, 3)|
## |2|ArrayBuffer(4, 5, 6)|
## +-+--------------------+
Use explode to flatten the list column:
from pyspark.sql.functions import explode
df.select(df.a, explode(df.intlist)).show()
## +-+---+
## |a|_c0|
## +-+---+
## |1| 1|
## |1| 2|
## |1| 3|
## |2| 4|
## |2| 5|
## |2| 6|
## +-+---+
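The Scala equivalent is indeed similar (a sketch; explode lives in org.apache.spark.sql.functions from Spark 1.4, and the implicits import matches the 1.x spark-shell used earlier in this thread — use spark.implicits._ on Spark 2+):
import org.apache.spark.sql.functions.explode
import sqlContext.implicits._

val df = Seq((1, Seq(1, 2, 3)), (2, Seq(4, 5, 6))).toDF("a", "intlist")
df.select($"a", explode($"intlist")).show()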
Another approach would be using pattern matching like this:
val rdd: RDD[(String, List[(String, String)])] = dataFrame.map(_.toSeq.toList match {
  case List(key: String, inners: Seq[Row]) => key -> inners.map(_.toSeq.toList match {
    case List(a: String, b: String) => (a, b)
  }).toList
})
You can pattern match directly on Row but it is likely to fail for a few reasons.
The answers above are all great and tackle this question from different sides; Spark SQL is also quite a useful way to access nested data.
Here's an example of how to use explode() directly in SQL to query a nested collection.
SELECT hholdid, tsp.person_seq_no
FROM (
  SELECT hholdid, explode(tsp_ids) AS tsp
  FROM disc_mrt.unified_fact uf
)
tsp_ids is a nested array of structs, which has many attributes, including person_seq_no, which I'm selecting in the outer query above.
The above was tested in Spark 2.0; in a small test it did not work in Spark 1.6. This question was asked before Spark 2 was around, so this answer adds nicely to the list of available options for dealing with nested structures.
Also have a look at the following JIRA for a Hive-compatible way to query nested data using the LATERAL VIEW OUTER syntax, since Spark 2.2 also supports OUTER explode (e.g. when a nested collection is empty but you still want the attributes from the parent record):
SPARK-13721: Add support for LATERAL VIEW OUTER explode()
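For example, a sketch of that OUTER form, reusing the table and column names from the SQL answer above (Spark 2.2+):
spark.sql("""
  SELECT hholdid, tsp.person_seq_no
  FROM disc_mrt.unified_fact uf
  LATERAL VIEW OUTER explode(tsp_ids) t AS tsp
""")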
A notable unresolved JIRA on explode() for SQL access:
SPARK-7549: Support aggregating over nested fields
