How to convert a table into a Spark DataFrame

In Spark SQL, a dataframe can be queried as a table using this:
sqlContext.registerDataFrameAsTable(df, "mytable")
Assuming what I have is mytable, how can I get or access this as a DataFrame?

The cleanest way:
df = sqlContext.table("mytable")

You can also query the table and assign the result to a variable; SQLContext's sql method returns a DataFrame.
df = sqlContext.sql("SELECT * FROM mytable")

Related

How to convert sql output to Dataframe?

I have a DataFrame from which I create a temporary view in order to run SQL queries. After a couple of SQL queries, I'd like to convert the output of a SQL query to a new DataFrame. The reason I want the data back in a DataFrame is so that I can save it to blob storage.
So, the question is: what is the proper way to convert SQL query output to a DataFrame?
Here's the code I have so far:
%scala
//read data from Azure blob
...
var df = spark.read.parquet(some_path)
// create temp view
df.createOrReplaceTempView("data_sample")
%sql
-- some SQL queries; the one below is just an example
SELECT
date,
count(*) as cnt
FROM
data_sample
GROUP BY
date
-- Now I want a DataFrame that holds the output of the SQL above. How do I do that?
Preferably the code would be in python or scala.
Scala:
var df = spark.sql(s"""
SELECT
date,
count(*) as cnt
FROM
data_sample
GROUP BY
date
""")
PySpark:
df = spark.sql(f'''
SELECT
date,
count(*) as cnt
FROM
data_sample
GROUP BY
date
''')
You can create a temporary view in a %sql cell and then reference it from PySpark or Scala code like this:
%sql
create temporary view sql_result as
SELECT ...
%scala
var df = spark.sql("SELECT * FROM sql_result")
Since May 2022, the result of a SQL query in a notebook is also available as the _sqldf variable, which is the corresponding DataFrame object and is accessible from Python code. See the documentation for more details and a description of its behavior.

How to insert a DataFrame with a map column into a Hive table

I have a DataFrame with multiple columns, one of which is of type map(string,string). I can print this DataFrame, and the map column shows data such as Map("PUN" -> "Pune"). I want to write this DataFrame to a Hive table (stored as Avro) that has the same column with a map type.
import org.apache.spark.sql.functions.{col, lit, map}
val dfWithMap = Df
  .withColumn("cname", lit("Pune"))
  .withColumn("city_code_name", map(lit("PUN"), col("cname")))
dfWithMap.show(false)
// table: external Hive table, stored as Avro, created with an Avro schema
After removing this map type column I'm able to save the dataframe to hive avro table.
How I save to the Hive table:
spark.save - saving the Avro file
spark.sql - creating a partition on the Hive table with the Avro file location
See this test case from the Spark test suite as an example:
test("Insert MapType.valueContainsNull == false") {
  val schema = StructType(Seq(
    StructField("m", MapType(StringType, StringType, valueContainsNull = false))))
  val rowRDD = spark.sparkContext.parallelize(
    (1 to 100).map(i => Row(Map(s"key$i" -> s"value$i"))))
  val df = spark.createDataFrame(rowRDD, schema)
  df.createOrReplaceTempView("tableWithMapValue")
  sql("CREATE TABLE hiveTableWithMapValue(m Map <STRING, STRING>)")
  sql("INSERT OVERWRITE TABLE hiveTableWithMapValue SELECT m FROM tableWithMapValue")
  checkAnswer(
    sql("SELECT * FROM hiveTableWithMapValue"),
    rowRDD.collect().toSeq)
  sql("DROP TABLE hiveTableWithMapValue")
}
Also, if you want to use the save API, you can try saveAsTable as shown here:
Seq(9 -> "x").toDF("i", "j")
  .write.format("hive").mode(SaveMode.Overwrite).option("fileFormat", "avro").saveAsTable("t")
yourdataframewithmapcolumn.write.partitionBy is the way to create partitions.
You can achieve that with saveAsTable
Example:
Df\
    .write\
    .saveAsTable(name='tableName',
                 format='com.databricks.spark.avro',
                 mode='append',
                 path='avroFileLocation')
Change the mode option to whatever suits you.

Spark DataFrame string to Hive varchar

I read data from Oracle via a Spark JDBC connection into a DataFrame. I have a column which is StringType in the DataFrame.
Now I want to persist it in Hive, but with datatype varchar(5). I know the string would be truncated, but that is OK.
I tried using UDFs, which didn't work since a DataFrame does not have varchar or char types. I also created a temporary view using:
df.createOrReplaceTempView("t_name")
val casted = spark.sql("select cast(col_name as varchar(5)) from t_name")
But when I call printSchema, I still see a string type.
How can I save it as a varchar column in a Hive table?
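Try creating the Hive table ("dbName.tableName") with the required schema (varchar(5) in this case) and insert into it directly from the DataFrame, as below. This is PySpark syntax; in Scala, insertInto takes only the table name and the save mode is set with mode(...).
df.write.insertInto("dbName.tableName", overwrite=False)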

INSERT IF NOT EXISTS ELSE UPDATE in Spark SQL

Is there any provision for doing "INSERT IF NOT EXISTS ELSE UPDATE" in Spark SQL?
I have a Spark SQL table "ABC" that has some records.
I then have another batch of records that I want to insert or update in this table, depending on whether they already exist in it.
Is there a SQL command that I can use in a SQL query to make this happen?
In regular Spark this could be achieved with a join followed by a map like this:
import spark.implicits._
val df1 = spark.sparkContext.parallelize(List(("id1", "original"), ("id2", "original"))).toDF("df1_id", "df1_status")
val df2 = spark.sparkContext.parallelize(List(("id1", "new"), ("id3", "new"))).toDF("df2_id", "df2_status")
val df3 = df1
  .join(df2, 'df1_id === 'df2_id, "outer")
  .map(row => {
    // Column 2 is df2_id: if it is null there is no incoming row, so keep the df1 side
    if (row.isNullAt(2))
      (row.getString(0), row.getString(1))
    else
      (row.getString(2), row.getString(3))
  })
This yields:
scala> df3.show
+---+--------+
| _1|      _2|
+---+--------+
|id3|     new|
|id1|     new|
|id2|original|
+---+--------+
You could also use select with udfs instead of map, but in this particular case with null-values, I personally prefer the map variant.
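To make the row-selection rule easier to follow, here is a minimal sketch in plain Python (no Spark involved) that mimics the outer join and the map step on the same sample data. The helper names full_outer_join and pick_side are hypothetical and exist only for this illustration; the index-2 check mirrors row.isNullAt(2) above.

```python
def full_outer_join(left, right):
    """Join two lists of (id, status) pairs on id; a missing side becomes (None, None)."""
    left_by_id, right_by_id = dict(left), dict(right)
    rows = []
    for key in sorted(set(left_by_id) | set(right_by_id)):
        left_part = (key, left_by_id[key]) if key in left_by_id else (None, None)
        right_part = (key, right_by_id[key]) if key in right_by_id else (None, None)
        rows.append(left_part + right_part)
    return rows


def pick_side(row):
    """Mirror of the map step: index 2 is df2_id, so None there means no incoming row."""
    if row[2] is None:
        return (row[0], row[1])
    return (row[2], row[3])


df1_rows = [("id1", "original"), ("id2", "original")]
df2_rows = [("id1", "new"), ("id3", "new")]
merged = [pick_side(row) for row in full_outer_join(df1_rows, df2_rows)]
print(merged)
```

Keys present on both sides take the incoming value, keys only in df1 are kept unchanged, and keys only in df2 are inserted.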
You can use Spark SQL like this:
select * from (
  select c.*, row_number() over (partition by tac order by tag desc) as TAG_NUM
  from (
    select a.tac, a.name, 0 as tag
    from tableA a
    union all
    select b.tac, b.name, 1 as tag
    from tableB b
  ) c
) d where TAG_NUM = 1
tac is the key column you want to insert or update by. Rows from tableB carry tag 1, so they win over tableA rows that have the same tac.
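As a quick way to check what this query returns, the sketch below runs the same union-all and row_number pattern against an in-memory SQLite database from Python's standard library. SQLite is only a stand-in here, not Spark, and it needs SQLite 3.25 or later for window functions. The sample rows are hypothetical, and the outer select is narrowed to tac and name with an ORDER BY added so the printed result is deterministic.

```python
import sqlite3

# Hypothetical sample data: tableA is the existing table, tableB the incoming batch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (tac TEXT, name TEXT);
    CREATE TABLE tableB (tac TEXT, name TEXT);
    INSERT INTO tableA VALUES ('id1', 'original'), ('id2', 'original');
    INSERT INTO tableB VALUES ('id1', 'new'), ('id3', 'new');
""")

# Same shape as the query above: tag 1 marks incoming rows, and row_number()
# ordered by tag descending keeps the incoming row when a tac exists in both.
UPSERT_QUERY = """
SELECT tac, name FROM (
    SELECT c.*, row_number() OVER (PARTITION BY tac ORDER BY tag DESC) AS tag_num
    FROM (
        SELECT a.tac, a.name, 0 AS tag FROM tableA a
        UNION ALL
        SELECT b.tac, b.name, 1 AS tag FROM tableB b
    ) c
) d
WHERE tag_num = 1
ORDER BY tac
"""

merged = conn.execute(UPSERT_QUERY).fetchall()
print(merged)
conn.close()
```

The result contains one row per tac: the tableB row where one exists, otherwise the tableA row.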
I know it's a bit late to share my code, but to add or update records in my database, I wrote a function that looks like this:
import pandas as pd
# Returns a Spark DataFrame with added and updated data.
# key is the primary key of the DataFrames.
# dfToUpdate and dfToAddAndUpdate are Spark DataFrames.
def AddOrUpdateDf(dfToUpdate, dfToAddAndUpdate, key):
    # Convert the Spark DataFrame dfToUpdate to a pandas DataFrame
    dfToUpdatePandas = dfToUpdate.toPandas()
    # Convert the Spark DataFrame dfToAddAndUpdate to a pandas DataFrame
    dfToAddAndUpdatePandas = dfToAddAndUpdate.toPandas()
    # Update existing records with the latest ones and add any new records
    AddOrUpdatePandasDf = pd.concat([dfToUpdatePandas, dfToAddAndUpdatePandas]).drop_duplicates([key], keep='last').sort_values(key)
    # Convert back to a Spark DataFrame (assumes a SparkSession named spark is in scope)
    AddOrUpdateDf = spark.createDataFrame(AddOrUpdatePandasDf)
    return AddOrUpdateDf
As you can see, both Spark DataFrames are converted to pandas DataFrames so that we can use pd.concat and, especially, drop_duplicates with keep='last'; the result is then converted back to a Spark DataFrame and returned.
I don't think this is the best way to handle the AddOrUpdate (toPandas collects all the data on the driver, so it only suits data that fits in memory), but at least it works.

Save and append a file in HDFS using PySpark

I have a data frame in PySpark called df. I have registered this df as a temptable like below.
df.registerTempTable('mytempTable')
date=datetime.now().strftime('%Y-%m-%d %H:%M:%S')
Now from this temp table I will get certain values, like max_id of a column id
min_id = sqlContext.sql("select nvl(min(id),0) as minval from mytempTable").collect()[0].asDict()['minval']
max_id = sqlContext.sql("select nvl(max(id),0) as maxval from mytempTable").collect()[0].asDict()['maxval']
Now I will collect all these values like below.
test = ("{},{},{}".format(date,min_id,max_id))
I found that test is not a DataFrame but a str:
>>> type(test)
<type 'str'>
Now I want to save this test string as a file in HDFS. I would also like to append data to the same file in HDFS.
How can I do that using PySpark?
FYI I am using Spark 1.6 and don't have access to Databricks spark-csv package.
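Here you go: you just need to concatenate your data with concat_ws and write it as text. Note that append mode adds new part files under the target directory rather than extending a single file.
# date is the Python string built earlier, so it is inserted as a SQL literal
query = """select concat_ws(',', '{}', nvl(min(id), 0), nvl(max(id), 0))
from mytempTable""".format(date)
sqlContext.sql(query).write.format("text").mode("append").save("/tmp/fooo")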
Or, even better:
from pyspark.sql import functions as f
(sqlContext
    .table("mytempTable")
    .select(f.concat_ws(",", f.first(f.lit(date)), f.min("id"), f.max("id")))
    .coalesce(1)
    .write.format("text").mode("append").save("/tmp/fooo"))
