I execute "ALTER TABLE table_name CHANGE COLUMN col_name col_name column_type COMMENT col_comment;" to add comment to my table, and I succeeded.In hive I desc my table and I got like this:
hive> desc mytable;
+---------+----------+---------+
|col_name |data_type |comment |
+---------+----------+---------+
|col1 |string |name |
+---------+----------+---------+
but in spark-sql the comment is gone:
spark-sql> desc mytable;
+---------+----------+---------+
|col_name |data_type |comment |
+---------+----------+---------+
|col1 |string |null |
+---------+----------+---------+
By the way, I use MySQL for the metastore.
How can I get the comment in spark-sql?
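If you are on Spark 2.x, one quick way to check whether Spark's own catalog sees the comment at all (a diagnostic sketch, not a fix; mytable is the table from above) is to inspect the column metadata from a spark-shell:
// Lists each column with the description (comment) Spark's catalog reports;
// a null description here means Spark is not reading the comment back from
// the metastore, matching what spark-sql's DESC shows.
spark.catalog.listColumns("mytable").select("name", "description").show(false)
// DESCRIBE FORMATTED often surfaces more metadata than a plain DESC.
spark.sql("describe formatted mytable").show(200, false)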
I have a bunch of CSV files for which I am using PySpark for faster processing. However, I am a total noob with Spark (PySpark). So far I have been able to create an RDD, a subsequent data frame and a temporary view (country_name) to easily query the data.
Input Data
+---+--------------------------+-------+--------------------------+-------------------+
|ID |NAME |COUNTRY|ADDRESS |DESCRIPTION |
+---+--------------------------+-------+--------------------------+-------------------+
|1 | |QAT | |INTERIOR DECORATING|
|2 |S&T |QAT |AL WAAB STREET |INTERIOR DECORATING|
|3 | |QAT | |INTERIOR DECORATING|
|4 |THE ROSA BERNAL COLLECTION|QAT | |INTERIOR DECORATING|
|5 | |QAT |AL SADD STREET |INTERIOR DECORATING|
|6 |AL MANA |QAT |SALWA ROAD |INTERIOR DECORATING|
|7 | |QAT |SUHAIM BIN HAMAD STREET |INTERIOR DECORATING|
|8 |INTERTEC |QAT |AL MIRQAB AL JADEED STREET|INTERIOR DECORATING|
|9 | |EGY | |HOTELS |
|10 | |EGY |QASIM STREET |HOTELS |
|11 |AIRPORT HOTEL |EGY | |HOTELS |
|12 | |EGY |AL SOUQ |HOTELS |
+---+--------------------------+-------+--------------------------+-------------------+
I am stuck trying to convert this particular PostgreSQL query into Spark SQL.
select country,
name as 'col_name',
description,
ct,
ct_desc,
(ct*100/ct_desc)
from
(select description,
country,
count(name) over (PARTITION by description) as ct,
count(description) over (PARTITION by description) as ct_desc
from country_table
) x
group by 1,2,3,4,5,6
Correct output from PostgreSQL -
+-------+--------+-------------------+--+-------+----------------+
|country|col_name|description |ct|ct_desc|(ct*100/ct_desc)|
+-------+--------+-------------------+--+-------+----------------+
|QAT |name |INTERIOR DECORATING|7 |14 |50.0 |
+-------+--------+-------------------+--+-------+----------------+
Here is the Spark SQL query I am using -
df_fill_by_col = spark.sql("select country,
name as 'col_name',
description,
ct,
ct_desc,
(ct*100/ct_desc)
from
( Select description,
country,
count(name) over (PARTITION by description) as ct,
count(description) over (PARTITION by description) as ct_desc
from country_name
)x
group by 1,2,3,4,5,6 ")
df_fill_by_col.show()
From Spark SQL -
+-------+--------+-------------------+--+-------+----------------+
|country|col_name|description |ct|ct_desc|(ct*100/ct_desc)|
+-------+--------+-------------------+--+-------+----------------+
|QAT |name |INTERIOR DECORATING|14|14 |100.0 |
+-------+--------+-------------------+--+-------+----------------+
The Spark SQL query is giving odd outputs, especially where some values are blank or null in the dataframe.
For the same file and record, the ct column gives double the value: 14 instead of 7.
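The 14 vs. 7 difference is consistent with how count(col) works in SQL: it skips NULLs but does count empty strings, and the blank NAME fields parsed out of the pipe-delimited file most likely arrive as empty strings rather than NULLs, so count(name) stops skipping them. A tiny spark-shell sketch of just that behaviour (Scala here only for brevity; the SQL is identical from PySpark):
import spark.implicits._

// Blank NAME values that come out of a string split are "" (empty string),
// not NULL, and count() only skips NULLs.
val demo = Seq(("QAT", "S&T"), ("QAT", ""), ("QAT", null)).toDF("country", "name")
demo.createOrReplaceTempView("count_demo")
spark.sql("select count(name) as ct_name, count(country) as ct_all from count_demo").show()
// ct_name = 2 (the NULL row is skipped, the empty string is counted), ct_all = 3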
Below is the entire code, from reading the CSV files to creating the dataframe and querying the data.
from __future__ import print_function
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit
import csv, copy, os, sys, unicodedata, string, time, glob
from pyspark.sql.types import Row, StructField, StructType, StringType, IntegerType
if __name__ == "__main__":
    spark = SparkSession.builder.appName("PythonSQL").config("spark.some.config.option", "some-value").getOrCreate()
    sc = spark.sparkContext

    lines = sc.textFile("path_to_csvfiles")
    parts = lines.map(lambda l: l.split("|"))
    country_name = parts.map(lambda p: (p[0], p[1], p[2], p[3], p[4].strip()))

    schemaString = "ID NAME COUNTRY ADDRESS DESCRIPTION"
    fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
    df_schema = StructType(fields)

    df_schema1 = spark.createDataFrame(country_name, df_schema)
    df_schema1.createOrReplaceTempView("country_name")
    df_schema1.cache()

    df_fill_by_col = spark.sql("select country, name as 'col_name', description, ct, ct_desc, (ct*100/ct_desc) from ( Select description, country, count(name) over (PARTITION by description) as ct, count(description) over (PARTITION by description) as ct_desc from country_name )x group by 1,2,3,4,5,6 ")
    df_fill_by_col.show()
Please let me know if there is a way of getting the Spark SQL query to work.
Thanks,
Pankaj
Edit - This code will run on multiple countries and columns
I am running a pretty simple query in databricks notebook which involves a subquery.
select recorddate, count(*)
from( select record_date as recorddate, column1
from table1
where record_date >= date_sub(current_date(), 1)
)t
group by recorddate
order by recorddate
I get the following exception:
Error in SQL statement: package.TreeNodeException: Binding attribute, tree: recorddate
And when I remove the order by clause, the query runs fine. I see some posts talking about similar issues, but none exactly the same. Is this known behavior? Is there any workaround or fix for this?
This works well for me (Spark 2.4.5); I think the problem is something different:
val df = spark.sql("select current_date() as record_date, '1' column1")
df.show(false)
/**
* +-----------+-------+
* |record_date|column1|
* +-----------+-------+
* |2020-07-29 |1 |
* +-----------+-------+
*/
df.createOrReplaceTempView("table1")
spark.sql(
"""
|select recorddate, count(*)
|from( select record_date as recorddate, column1
| from table1
| where record_date >= date_sub(current_date(), 1)
| )t
|group by recorddate
|order by recorddate
|
""".stripMargin)
.show(false)
/**
* +----------+--------+
* |recorddate|count(1)|
* +----------+--------+
* |2020-07-29|1 |
* +----------+--------+
*/
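If the original query still fails with the binding error in your environment, one pragmatic workaround (an assumption on my side, not verified on that exact Databricks runtime) is to drop the order by from the SQL text and sort the resulting DataFrame instead:
// Same aggregation as above, but the ordering is applied to the resulting
// DataFrame rather than inside the SQL statement that triggers the error.
val grouped = spark.sql(
  """
    |select recorddate, count(*) as cnt
    |from( select record_date as recorddate, column1
    |      from table1
    |      where record_date >= date_sub(current_date(), 1)
    |    )t
    |group by recorddate
  """.stripMargin)
grouped.orderBy("recorddate").show(false)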
I have a dataset of SQL queries in raw text and another with a regular expression of all the possible table names:
# queries
+-----+----------------------------------------------+
| id | query |
+-----+----------------------------------------------+
| 1 | select * from table_a, table_b |
| 2 | select * from table_c join table_d... |
+-----+----------------------------------------------+
# regexp
'table_a|table_b|table_c|table_d'
And I wanted the following result:
# expected result
+-----+----------------------------------------------+
| id | tables |
+-----+----------------------------------------------+
| 1 | [table_a, table_b] |
| 2 | [table_c, table_d] |
+-----+----------------------------------------------+
But using the following SQL in Spark, all I get is the first match...
select
id,
regexp_extract(query, 'table_a|table_b|table_c|table_d') as tables
from queries
# actual result
+-----+----------------------------------------------+
| id | tables |
+-----+----------------------------------------------+
| 1 | table_a |
| 2 | table_c |
+-----+----------------------------------------------+
Is there any way to do this using only Spark SQL? This is the function I am using: https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/sql/#regexp_extract
EDIT
I would also accept a solution that returned the following:
# alternative solution
+-----+----------------------------------------------+
| id | tables |
+-----+----------------------------------------------+
| 1 | table_a |
| 1 | table_b |
| 2 | table_c |
| 2 | table_d |
+-----+----------------------------------------------+
SOLUTION
@chlebek solved this below. I reformatted his SQL using CTEs for better readability:
with
split_queries as (
select
id,
explode(split(query, ' ')) as col
from queries
),
extracted_tables as (
select
id,
regexp_extract(col, 'table_a|table_b|table_c|table_d', 0) as rx
from split_queries
)
select
id,
collect_set(rx) as tables
from extracted_tables
where rx != ''
group by id
Bear in mind that the split(query, ' ') part of the query will split your SQL only by spaces. If you have other things such as tabs, line breaks, comments, etc., you should deal with these before or when splitting.
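For example, splitting on a whitespace regex instead of a single space (a small variation of the first CTE, assuming any run of spaces, tabs, or line breaks should act as a delimiter) looks like this:
// split()'s second argument is a Java regex, so '\\s+' also covers tabs
// and line breaks; the rest of the CTE chain stays unchanged.
spark.sql(
  """
    |select id, explode(split(query, '\\s+')) as col
    |from queries
  """.stripMargin)
  .show(false)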
If you have only a few values to check, you can achieve it using the contains function instead of a regexp:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import spark.implicits._ // for the 'id / 'query column syntax

val names = Seq("table_a","table_b","table_c","table_d")
def c(col: Column) = names.map(n => when(col.contains(n),n).otherwise(""))
df.select('id,array_remove(array(c('query):_*),"").as("result")).show(false)
but using a regexp it will look like the following (Spark DataFrame API):
df.select('id,explode(split('query," ")))
.select('id,regexp_extract('col,"table_a|table_b|table_c|table_d",0).as("rx"))
.filter('rx=!="")
.groupBy('id)
.agg(collect_list('rx))
and it could be translated to the SQL query below:
select id, collect_list(rx) from
(select id, regexp_extract(col,'table_a|table_b|table_c|table_d',0) as rx from
(select id, explode(split(query,' ')) as col from df) q1
) q2
where rx != '' group by id
so output will be:
+---+------------------+
| id| collect_list(rx)|
+---+------------------+
| 1|[table_a, table_b]|
| 2|[table_c, table_d]|
+---+------------------+
As you are using spark-sql, you can use Spark's SQL parser and it will do the job for you.
def getTables(query: String): Seq[String] = {
val logicalPlan = spark.sessionState.sqlParser.parsePlan(query)
import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
logicalPlan.collect { case r: UnresolvedRelation => r.tableName }
}
val query = "select * from table_1 as a left join table_2 as b on
a.id=b.id"
scala> getTables(query).foreach(println)
table_1
table_2
You can register getTables as a UDF and use it in a query.
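A minimal driver-side sketch of using it (assuming the queries data is registered as a temp view named queries, as in the question): the parser belongs to the driver's SparkSession, so this collects the rows and parses them on the driver rather than inside a distributed UDF.
// Collect the raw query strings and extract their tables with getTables.
spark.table("queries").collect().foreach { row =>
  val id = row.getAs[Any]("id")
  val tables = getTables(row.getAs[String]("query"))
  println(s"$id -> ${tables.mkString(", ")}")
}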
You can use another SQL function available in Spark called collect_list: https://docs.databricks.com/spark/latest/spark-sql/language-manual/functions.html#collect_list. You can find another sample at https://mungingdata.com/apache-spark/arraytype-columns/
Basically, applied to your code it should be:
val df = spark.sql("select 1 id, 'select * from table_a, table_b' query" )
val df1 = spark.sql("select 2 id, 'select * from table_c join table_d' query" )
val df3 = df.union(df1)
df3.createOrReplaceTempView("tabla")
spark.sql("""
select id, collect_list(tables) from (
select id, explode(split(query, ' ')) as tables
from tabla)
where tables like 'table%' group by id""").show
The output will be
+---+--------------------+
| id|collect_list(tables)|
+---+--------------------+
| 1| [table_a,, table_b]|
| 2| [table_c, table_d]|
+---+--------------------+
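Note the stray comma in [table_a,, table_b]: splitting on a single space keeps table_a, (with the trailing comma) as one token from the first query. A small variation (assuming commas should also act as delimiters) splits on spaces and commas together:
spark.sql("""
  select id, collect_list(tables) from (
    -- '[ ,]+' splits on any run of spaces and/or commas,
    -- so 'table_a,' is no longer kept as a single token
    select id, explode(split(query, '[ ,]+')) as tables
    from tabla)
  where tables like 'table%' group by id""").show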
Hope this helps
If you are on Spark >= 2.4, you can avoid the explode-and-collect steps by using higher-order functions on arrays, without any subqueries:
Load the test data
import spark.implicits._ // needed for .toDS() outside the spark-shell

val data =
"""
|id | query
|1 | select * from table_a, table_b
|2 | select * from table_c join table_d on table_c.id=table_d.id
""".stripMargin
val stringDS = data.split(System.lineSeparator())
.map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(";"))
.toSeq.toDS()
val df = spark.read
.option("sep", ";")
.option("inferSchema", "true")
.option("header", "true")
.option("nullValue", "null")
.csv(stringDS)
df.printSchema()
df.show(false)
/**
* root
* |-- id: integer (nullable = true)
* |-- query: string (nullable = true)
*
* +---+-----------------------------------------------------------+
* |id |query |
* +---+-----------------------------------------------------------+
* |1 |select * from table_a, table_b |
* |2 |select * from table_c join table_d on table_c.id=table_d.id|
* +---+-----------------------------------------------------------+
*/
Extract the tables from query
// spark >= 2.4.0
df.createOrReplaceTempView("queries")
spark.sql(
"""
|select id,
| array_distinct(
| FILTER(
| split(query, '\\.|=|\\s+|,'), x -> x rlike 'table_a|table_b|table_c|table_d'
| )
| )as tables
|FROM
| queries
""".stripMargin)
.show(false)
/**
* +---+------------------+
* |id |tables |
* +---+------------------+
* |1 |[table_a, table_b]|
* |2 |[table_c, table_d]|
* +---+------------------+
*/
I would like to maintain a streaming dataframe that gets updated.
To do so I will use dropDuplicates.
But dropDuplicates drops the latest change.
How can I retain only the last one?
Assuming you need to select the last record for each id by removing the other duplicates, you can use window functions and filter on row_number = count. Check this out:
scala> val df = Seq((120,34.56,"2018-10-11"),(120,65.73,"2018-10-14"),(120,39.96,"2018-10-20"),(122,11.56,"2018-11-20"),(122,24.56,"2018-10-20")).toDF("id","amt","dt")
df: org.apache.spark.sql.DataFrame = [id: int, amt: double ... 1 more field]
scala> val df2=df.withColumn("dt",'dt.cast("date"))
df2: org.apache.spark.sql.DataFrame = [id: int, amt: double ... 1 more field]
scala> df2.show(false)
+---+-----+----------+
|id |amt |dt |
+---+-----+----------+
|120|34.56|2018-10-11|
|120|65.73|2018-10-14|
|120|39.96|2018-10-20|
|122|11.56|2018-11-20|
|122|24.56|2018-10-20|
+---+-----+----------+
scala> df2.createOrReplaceTempView("ido")
scala> spark.sql(""" select id,amt,dt,row_number() over(partition by id order by dt) rw, count(*) over(partition by id) cw from ido """).show(false)
+---+-----+----------+---+---+
|id |amt |dt |rw |cw |
+---+-----+----------+---+---+
|122|24.56|2018-10-20|1 |2 |
|122|11.56|2018-11-20|2 |2 |
|120|34.56|2018-10-11|1 |3 |
|120|65.73|2018-10-14|2 |3 |
|120|39.96|2018-10-20|3 |3 |
+---+-----+----------+---+---+
scala> spark.sql(""" select id,amt,dt from (select id,amt,dt,row_number() over(partition by id order by dt) rw, count(*) over(partition by id) cw from ido) where rw=cw """).show(false)
+---+-----+----------+
|id |amt |dt |
+---+-----+----------+
|122|11.56|2018-11-20|
|120|39.96|2018-10-20|
+---+-----+----------+
scala>
If you want to sort on dt descending you can just give "order by dt desc" in the over() clause. Does this help?
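For instance, a sketch of the descending variant against the same ido view; with order by dt desc the latest row per id gets rw = 1, so the filter becomes rw = 1 instead of rw = cw, and it should return the same two rows as above:
scala> spark.sql(""" select id,amt,dt from (select id,amt,dt,row_number() over(partition by id order by dt desc) rw from ido) where rw=1 """).show(false)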
I have a table that has a data distribution like:
sqlContext.sql( """ SELECT
count(to_Date(PERIOD_DT)), to_date(PERIOD_DT)
from dbname.tablename group by to_date(PERIOD_DT) """).show
+-------+----------+
| _c0| _c1|
+-------+----------+
|1067177|2016-09-30|
|1042566|2017-07-07|
|1034333|2017-07-31|
+-------+----------+
However, when I run a query like the following:
sqlContext.sql(""" SELECT COUNT(*)
from dbname.tablename
where PERIOD_DT = '2017-07-07' """).show
Surprisingly, it returns:
+-------+
| _c0|
+-------+
|3144076|
+-------+
But if I change PERIOD_DT to lowercase, i.e. period_dt, it returns the correct result:
sqlContext.sql("""
SELECT COUNT(*)
from dbname.table
where period_dt='2017-07-07' """).show
+-------+
| _c0|
+-------+
|1042566|
+-------+
period_dt is the column on which the table is partitioned, and its type is char(10).
The table data is stored as Parquet:
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
What might be causing this inconsistency?
It is a case-sensitivity issue. Because of limitations of the Hive metastore schema, table and column names are always stored in lowercase, while Parquet preserves the original case; using the lowercase name as stored in the metastore (as you found with period_dt) avoids the mismatch.
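A quick way to confirm how the metastore stores the column (a diagnostic sketch; dbname.tablename is the placeholder from the question):
// DESCRIBE FORMATTED prints the columns exactly as the Hive metastore stores
// them (lowercase), including the partition column section with period_dt.
sqlContext.sql("describe formatted dbname.tablename").show(200, false)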