I have a table with a data distribution like this:
sqlContext.sql( """ SELECT
count(to_date(PERIOD_DT)), to_date(PERIOD_DT)
from dbname.tablename group by to_date(PERIOD_DT) """).show
+-------+----------+
| _c0| _c1|
+-------+----------+
|1067177|2016-09-30|
|1042566|2017-07-07|
|1034333|2017-07-31|
+-------+----------+
However, when I run a query like the following:
sqlContext.sql(""" SELECT COUNT(*)
from dbname.tablename
where PERIOD_DT = '2017-07-07' """).show
Surprisingly, it returns:
+-------+
| _c0|
+-------+
|3144076|
+-------+
But if I change PERIOD_DT to lowercase, i.e., period_dt, it returns the correct result:
sqlContext.sql("""
SELECT COUNT(*)
from dbname.tablename
where period_dt='2017-07-07' """).show
+-------+
| _c0|
+-------+
|1042566|
+-------+
period_dt is the column on which the table is partitioned, and its type is char(10).
The table data is stored as Parquet:
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
What might be causing this inconsistency?
This is a case-sensitivity issue. Because of limitations of the Hive metastore schema, table and column names are always stored in lowercase, while Parquet handles column names case-sensitively. Referencing the partition column by its lowercase metastore name (period_dt) resolves the issue.
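As a quick sanity check, you can print the schema that Spark reads from the metastore and confirm the column name comes back lowercased (a sketch in PySpark; dbname.tablename is the table from the question):
# The metastore reports the partition column as lowercase period_dt,
# so predicates should reference it in lowercase.
sqlContext.table("dbname.tablename").printSchema()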
Related
I was just dropping a column from a DataFrame, and it appeared to be dropped. But after calling the show method again, it seems the column is not dropped from the DataFrame.
Code:
df.drop('Salary').show()
+-----+
| Name|
+-----+
| Arun|
| Joe|
|Jerry|
+-----+
df.show()
+-----+------+
| Name|Salary|
+-----+------+
| Arun| 5000|
| Joe| 6300|
|Jerry| 9600|
+-----+------+
I am using Spark version 2.4.4. Could you please tell me why it's not dropped? I thought it would be like dropping a column from a table in an Oracle database.
The drop method returns a new DataFrame. DataFrames are immutable, so the original df is not changed by this transformation, and calling df.show() a second time will still show the Salary column.
You need to assign the result of the drop to a new variable:
df2 = df.drop('Salary')
df2.show()
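If you no longer need the original, a minimal sketch of simply rebinding the same name:
df = df.drop('Salary')  # df now refers to the new DataFrame without Salary
df.show()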
This is Spark 2.4.4 and Delta Lake 0.5.0.
I'm trying to create a table using the delta data source, and it seems I'm missing something. Although the CREATE TABLE USING delta command worked fine, the table directory is not created and insertInto does not work.
The following CREATE TABLE USING delta worked fine, but insertInto failed.
scala> sql("""
create table t5
USING delta
LOCATION '/tmp/delta'
""").show
scala> spark.catalog.listTables.where('name === "t5").show
+----+--------+-----------+---------+-----------+
|name|database|description|tableType|isTemporary|
+----+--------+-----------+---------+-----------+
| t5| default| null| EXTERNAL| false|
+----+--------+-----------+---------+-----------+
scala> spark.range(5).write.option("mergeSchema", true).insertInto("t5")
org.apache.spark.sql.AnalysisException: `default`.`t5` requires that the data to be inserted have the same number of columns as the target table: target table has 0 column(s) but the inserted data has 1 column(s), including 0 partition column(s) having constant value(s).;
at org.apache.spark.sql.execution.datasources.PreprocessTableInsertion.org$apache$spark$sql$execution$datasources$PreprocessTableInsertion$$preprocess(rules.scala:341)
...
I thought I'd create the table with the columns defined, but that didn't work either.
scala> sql("""
create table t6
(id LONG, name STRING)
USING delta
LOCATION '/tmp/delta'
""").show
org.apache.spark.sql.AnalysisException: delta does not allow user-specified schemas.;
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:325)
at org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:78)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:194)
at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:3370)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3370)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:194)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:79)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
... 54 elided
The OSS version of Delta Lake does not have the SQL CREATE TABLE syntax yet. This will be implemented in future versions using Spark 3.0.
To create a Delta table, you must write out a DataFrame in Delta format. An example in Python:
df.write.format("delta").save("/some/data/path")
Here's a link to the create table documentation for Python, Scala, and Java.
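Since there is no usable metastore table in this setup, you read the data back by path rather than by table name. A minimal sketch in Python, using the same path as the save above:
df = spark.read.format("delta").load("/some/data/path")
df.show()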
An example with PySpark 3.0.0 and Delta 0.7.0:
print(f"LOCATION '{location}")
spark.sql(f"""
CREATE OR REPLACE TABLE {TABLE_NAME} (
CD_DEVICE INT,
FC_LOCAL_TIME TIMESTAMP,
CD_TYPE_DEVICE STRING,
CONSUMTION DOUBLE,
YEAR INT,
MONTH INT,
DAY INT )
USING DELTA
PARTITIONED BY (YEAR , MONTH , DAY, FC_LOCAL_TIME)
LOCATION '{location}'
""")
Where "location" is a dir HDFS for spark cluster mode save de delta table.
tl;dr CREATE TABLE USING delta is not supported by Spark before 3.0.0 and Delta Lake before 0.7.0.
Delta Lake 0.7.0 with Spark 3.0.0 (both just released) do support CREATE TABLE SQL command.
Be sure to "install" Delta SQL by setting the spark.sql.catalog.spark_catalog configuration property to org.apache.spark.sql.delta.catalog.DeltaCatalog.
$ ./bin/spark-shell \
--packages io.delta:delta-core_2.12:0.7.0 \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
scala> spark.version
res0: String = 3.0.0
scala> sql("CREATE TABLE delta_101 (id LONG) USING delta").show
++
||
++
++
scala> spark.table("delta_101").show
+---+
| id|
+---+
+---+
scala> sql("DESCRIBE EXTENDED delta_101").show(truncate = false)
+----------------------------+---------------------------------------------------------+-------+
|col_name |data_type |comment|
+----------------------------+---------------------------------------------------------+-------+
|id |bigint | |
| | | |
|# Partitioning | | |
|Not partitioned | | |
| | | |
|# Detailed Table Information| | |
|Name |default.delta_101 | |
|Location |file:/Users/jacek/dev/oss/spark/spark-warehouse/delta_101| |
|Provider |delta | |
|Table Properties |[] | |
+----------------------------+---------------------------------------------------------+-------+
I have a DataFrame with a bigint column. How do I convert a bigint column to a timestamp in Scala Spark?
You can use the from_unixtime/to_timestamp functions in Spark to convert a bigint column to a timestamp.
Example:
spark.sql("select timestamp(from_unixtime(1563853753,'yyyy-MM-dd HH:mm:ss')) as ts").show(false)
+-------------------+
|ts |
+-------------------+
|2019-07-22 22:49:13|
+-------------------+
(or)
spark.sql("select to_timestamp(1563853753) as ts").show(false)
+-------------------+
|ts |
+-------------------+
|2019-07-22 22:49:13|
+-------------------+
Schema:
spark.sql("select to_timestamp(1563853753) as ts").printSchema
root
|-- ts: timestamp (nullable = false)
Refer to this link for more details on converting different timestamp formats in Spark.
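Since the question asks about an existing bigint column rather than a literal, the same conversion can be applied per column by casting epoch seconds to timestamp. A sketch in PySpark (df and epoch are hypothetical names; the Scala API is analogous):
from pyspark.sql.functions import col

df = spark.createDataFrame([(1563853753,)], ["epoch"])
# Casting a bigint holding seconds since the Unix epoch to timestamp
df.withColumn("ts", col("epoch").cast("timestamp")).show(truncate=False)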
I execute "ALTER TABLE table_name CHANGE COLUMN col_name col_name column_type COMMENT col_comment;" to add comment to my table, and I succeeded.In hive I desc my table and I got like this:
hive> desc mytable;
+---------+----------+---------+
|col_name |data_type |comment |
+---------+----------+---------+
|col1 |string |name |
+---------+----------+---------+
but in spark-sql the comment is gone:
spark-sql> desc mytable;
+---------+----------+---------+
|col_name |data_type |comment |
+---------+----------+---------+
|col1 |string |null |
+---------+----------+---------+
By the way, I use MySQL for the metastore.
How can I get the comment in spark-sql?
I load a Parquet file into the SQL context like this:
sqlCtx = SQLContext(sc)
rdd_file = sqlCtx.read.parquet("hdfs:///my_file.parquet")
rdd_file.registerTempTable("type_table")
Then I run this simple query:
sqlCtx.sql('SELECT count(name), name from type_table group by name order by count(name)').show()
The result:
+----------------+----------+
|count(name) |name |
+----------------+----------+
| 0| null|
| 226307| x|
+----------------+----------+
However, if I use groupBy on the DataFrame, I get a different result:
sqlCtx.sql("SELECT name FROM type_table").groupBy("name").count().show()
+----------+------+
| name | count|
+----------+------+
| x|226307|
| null|586822|
+----------+------+
The count of x is the same for the two methods, but the counts for null are quite different. It seems the SQL statement doesn't count nulls correctly with group by. Can you point out what I did wrong?
count(name) excludes null values; if you use count(*) instead, it will count the null rows as well.
Try the query below:
sqlCtx.sql('SELECT count(*), name from type_table group by name order by count(*)').show()
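To see the difference side by side, here is a minimal illustration with toy data (the explicit schema avoids type-inference problems with the null row; count comes from pyspark.sql.functions):
from pyspark.sql.functions import count
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([StructField("name", StringType(), True)])
df = sqlCtx.createDataFrame([("x",), (None,)], schema)
# count("name") skips the null row; count("*") counts every row
df.groupBy("name").agg(count("*").alias("cnt_all"),
                       count("name").alias("cnt_name")).show()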