Writing Hive table from Spark specifying CSV as the format - apache-spark

I'm having an issue writing a Hive table from Spark. The following code works just fine; I can write the table (which defaults to the Parquet format) and read it back in Hive:
df.write.mode('overwrite').saveAsTable("db.table")
hive> describe table;
OK
val string
Time taken: 0.021 seconds, Fetched: 1 row(s)
However, if I specify the format should be csv:
df.write.mode('overwrite').format('csv').saveAsTable("db.table")
then I can save the table, but Hive doesn't recognize the schema:
hive> describe table;
OK
col array<string> from deserializer
Time taken: 0.02 seconds, Fetched: 1 row(s)
It's also worth noting that I can create a Hive table manually and then insertInto it:
spark.sql("create table db.table(val string)")
df.select('val').write.mode("overwrite").insertInto("db.table")
Doing so, Hive seems to recognize the schema. But that's clunky and I can't figure a way to automate the schema string anyway.

That is because Hive SerDes do not support CSV by default.
If you insist on using the CSV format, create the table as below:
CREATE TABLE my_table(a string, b string, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "\t",
"quoteChar" = "'",
"escapeChar" = "\\"
)
STORED AS TEXTFILE;
Then insert the data through df.write.insertInto.
For more info:
https://cwiki.apache.org/confluence/display/Hive/CSV+Serde
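Since the question mentions that writing the schema string by hand is hard to automate, here is a minimal sketch of generating the CREATE TABLE statement from a list of column names. The helper name build_csv_table_ddl is hypothetical, and the sketch sets only separatorChar and leaves the other SerDe properties at their defaults. It relies on the fact that OpenCSVSerde exposes every column as a string, so only the names are needed.

```python
def build_csv_table_ddl(table, column_names, separator=","):
    """Return a Hive CREATE TABLE statement backed by OpenCSVSerde.

    Hypothetical helper: OpenCSVSerde treats every column as a string,
    so the column types are not taken from the DataFrame.
    """
    if not column_names:
        raise ValueError("at least one column name is required")
    # Backquote the names so unusual column names stay valid identifiers.
    columns = ", ".join(f"`{name}` string" for name in column_names)
    return (
        f"CREATE TABLE {table} ({columns})\n"
        "ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'\n"
        f"WITH SERDEPROPERTIES ('separatorChar' = '{separator}')\n"
        "STORED AS TEXTFILE"
    )


ddl = build_csv_table_ddl("db.table", ["val"])
print(ddl)
```

In a PySpark session the column names could come from df.columns, and the resulting string could be passed to spark.sql(...) before calling insertInto.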

You are creating a table with text format and trying to insert CSV data into it, which may run into problems. So, as suggested in the answer by Zhang Tong, create the Hive table using the Hive OpenCSVSerde.
After that, if you are more comfortable with Hive query language than dataframes, you can try this.
df.createOrReplaceTempView("temp")  # registerTempTable is deprecated since Spark 2.0
spark.sql("insert overwrite table db.table select * from temp")

This happens because the SerDe Hive uses for CSV differs from the one Spark uses. By default Hive uses the TEXTFILE format, and the delimiter has to be specified when creating the table.
One option is to use the insertInto API instead of saveAsTable when writing from Spark. With insertInto, Spark writes the contents of the DataFrame to the specified table, but it requires the schema of the DataFrame to match the schema of the table. The position of the columns matters here, because insertInto ignores the column names.
Seq((5, 6)).toDF("a", "b").write.insertInto("t1")
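Because insertInto matches columns by position, it can help to reorder the DataFrame columns to the table's order first. The following is only a sketch in plain Python, and align_columns is a hypothetical helper name. It works on lists of column names, so it can be checked without a Spark session.

```python
def align_columns(df_columns, table_columns):
    """Return df_columns reordered to match table_columns (case-insensitive).

    Hypothetical helper: raises if the two column sets do not match,
    because a positional insert would otherwise write values into the wrong columns.
    """
    by_lower = {name.lower(): name for name in df_columns}
    wanted = [c.lower() for c in table_columns]
    missing = [c for c in table_columns if c.lower() not in by_lower]
    extra = sorted(set(by_lower) - set(wanted))
    if missing or extra:
        raise ValueError(f"column mismatch: missing={missing}, extra={extra}")
    return [by_lower[c] for c in wanted]


ordered = align_columns(["b", "a"], ["a", "b"])
print(ordered)  # ['a', 'b']
```

In PySpark this could be used as df.select(*align_columns(df.columns, spark.table("t1").columns)).write.insertInto("t1").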

Related

Databricks - How to change a partition of an existing Delta table?

I have a table in Databricks delta which is partitioned by transaction_date. I want to change the partition column to view_date. I tried to drop the table and then create it with a new partition column using PARTITIONED BY (view_date).
However, my attempt failed because the actual files reside in S3, and even if I drop the Hive table the partitions remain the same.
Is there any way to change the partition of an existing Delta table? Or is the only solution to drop the actual data and reload it with a newly specified partition column?
There's actually no need to drop tables or remove files. All you need to do is read the current table, overwrite the contents AND the schema, and change the partition column:
val input = spark.read.table("mytable")
input.write.format("delta")
.mode("overwrite")
.option("overwriteSchema", "true")
.partitionBy("colB") // different column
.saveAsTable("mytable")
UPDATE: There previously was a bug with time travel and changes in partitioning that has now been fixed.
As Silvio pointed out, there is no need to drop the table. In fact, the approach strongly recommended by Databricks is to replace the table.
https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html#parameters
In Spark SQL, this can be done easily with:
REPLACE TABLE <tablename>
USING DELTA
PARTITIONED BY (view_date)
AS
SELECT * FROM <tablename>
Modded example from:
https://docs.databricks.com/delta/best-practices.html#replace-the-content-or-schema-of-a-table
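If the partition column changes from case to case, the REPLACE TABLE statement above can be generated rather than written by hand. This is a minimal sketch, and replace_table_sql is a hypothetical helper name. It only builds the SQL text and does not run it.

```python
def replace_table_sql(table_name, partition_columns):
    """Build a REPLACE TABLE ... AS SELECT statement that repartitions a Delta table.

    Hypothetical helper mirroring the statement shown in the answer.
    """
    if not partition_columns:
        raise ValueError("at least one partition column is required")
    partition_list = ", ".join(partition_columns)
    return (
        f"REPLACE TABLE {table_name}\n"
        "USING DELTA\n"
        f"PARTITIONED BY ({partition_list})\n"
        "AS\n"
        f"SELECT * FROM {table_name}"
    )


sql = replace_table_sql("mytable", ["view_date"])
print(sql)
```

The resulting string could then be passed to spark.sql(sql) on a cluster where the table exists.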
Python solution:
If you need more than one column in the partition, use
partitionBy(column, column_2, ...)
def change_partition_of(table_name, column):
    # Read the current table, then overwrite both data and schema with the new partitioning.
    df = spark.read.table(table_name)
    df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").partitionBy(column).saveAsTable(table_name)
change_partition_of("i.love_python", "column_a")

Spark: Record count mismatch

I am quite confused because I am facing a weird situation.
My Spark application reads data from an Oracle database and loads it into a DataFrame using this instruction:
private val df = spark.read.jdbc(
url = [the jdbc url],
table="(" + [the query] + ") qry",
properties= [the oracle driver]
)
Then, I save in a variable the number of records in this dataframe:
val records = df.count()
Then I create a Hive table ([my table]) with the DataFrame schema, and I dump the content of the DataFrame into it:
df.write
.mode(SaveMode.Append)
.insertInto([my hive db].[my table])
Well, here is where I am lost: when I run a select count(*) on the Hive table that the DataFrame is loaded into, there are "sometimes" a few more records in Hive than in the records variable.
Can you think of what could be the source of this mismatch?
*Related to the possible duplicate, my question is different. I am not counting my dataframe many times with different values. I count the records on my dataframe once. I dump the dataframe into hive, and I count the records in the hive table, and sometimes there are more in hive than in my count.*
Thank you very much in advance.

drop column in a table/view using spark sql only

I have 30 columns in a table, table_old.
I want to use 29 of those columns, all except one. The column to exclude is dynamic.
I am using string interpolation.
This is the Spark SQL query I am using:
drop_column=now_current_column
var table_new=spark.sql(s"""alter table table_old drop $drop_column""")
but it throws this error:
mismatched input expecting 'partition'
I don't want to drop the column using the DataFrame API. My requirement is to drop the column in the table using Spark SQL only.
As mentioned in the previous answer, DROP COLUMN is not supported by Spark yet.
But, there is a workaround to achieve the same, without much overhead. This trick works for both EXTERNAL and InMemory tables. The code snippet below works for EXTERNAL table, you can easily modify it and use it for InMemory tables as well.
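import org.apache.spark.sql.types.StructType
val dropColumnName = "column_name"
val tableIdentifier = "table_name"
val tablePath = "table_path"
// Keep every field of the current schema except the one to drop.
val newSchema = StructType(spark.read.table(tableIdentifier).schema.filter(col => col.name != dropColumnName))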
spark.sql(s"drop table ${tableIdentifier}")
spark.catalog.createTable(tableIdentifier, "orc", newSchema, Map("path" -> tablePath))
orc is the file format, it should be replaced with the required format. For InMemory tables, remove the tablePath and you are good to go. Hope this helps.
DROP COLUMN (and, in general, the majority of ALTER TABLE commands) is not supported in Spark SQL.
If you want to drop a column, you should create a new table:
CREATE TABLE tmp_table AS
SELECT ... -- all columns except the one to drop
FROM table_old
and then drop the old table or view, and reclaim the name.
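Because the column to drop is dynamic, the column list for the CREATE TABLE ... AS SELECT statement above can be generated. This is a minimal sketch, and ctas_without_column is a hypothetical helper name. It only builds the SQL string from a list of column names.

```python
def ctas_without_column(old_table, new_table, all_columns, drop_column):
    """Build a CREATE TABLE AS SELECT statement that omits one column.

    Hypothetical helper: all_columns is the full column list of old_table.
    """
    kept = [c for c in all_columns if c != drop_column]
    if len(kept) == len(all_columns):
        raise ValueError(f"column {drop_column!r} not found in {old_table}")
    if not kept:
        raise ValueError("cannot drop the only column of a table")
    # Backquote the names so unusual column names stay valid identifiers.
    column_list = ", ".join(f"`{c}`" for c in kept)
    return f"CREATE TABLE {new_table} AS SELECT {column_list} FROM {old_table}"


statement = ctas_without_column("table_old", "tmp_table", ["a", "b", "c"], "b")
print(statement)
```

In a Spark session the column list could be taken from spark.table("table_old").columns, and the statement run with spark.sql(statement).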
Dropping columns is now supported by Spark if you're using v2 tables. You can check this link:
https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-alter-table.html

spark to hive data types

Is there a way to convert an input string field into an ORC table column specified as varchar(xx) in a Spark SQL select query? Or do I have to use some workaround? I'm using Spark 1.6.
I found on the Cloudera forum that Spark does not care about the length; it saves the value as a string with no size limit.
The table is inserted into Hive OK, but I'm a little bit worried about data quality.
temp_table = sqlContext.table(ext)
df = temp_table.select(temp_table.day.cast('string'))
I would like to see something like that :)))
df = temp_table.select(temp_table.day.cast('varchar(100)'))
Edit:
df.write.partitionBy(part).mode('overwrite').insertInto(int)
The table I'm inserting into is saved as an ORC file (the line above probably should have .format('orc')).
I found here that if I specify a column as a varchar(xx) type, then the input string will be cut off to the xx length.
Thx

Impala table from spark partitioned parquet files

I have generated some partitioned parquet data using Spark, and I'm wondering how to map it to an Impala table... Sadly, I haven't found any solution yet.
The schema of parquet is like :
{ key: long,
value: string,
date: long }
and I partitioned it with key and date, that gives me this kind of directories on my hdfs :
/data/key=1/date=20170101/files.parquet
/data/key=1/date=20170102/files.parquet
/data/key=2/date=20170101/files.parquet
/data/key=2/date=20170102/files.parquet
...
Do you know how I could tell Impala to create a table from this dataset with corresponding partitions (and without having to loop on each partition as I could have read) ? Is it possible ?
Thank you in advance
Assuming that by the schema of parquet you meant the schema of the dataset, and that you then partitioned it by key and date, only the value column will be present in the actual files.parquet files. You can then proceed as follows.
The solution is to use an Impala external table:
create external table mytable (value STRING) partitioned by (key BIGINT,
`date` BIGINT) stored as parquet location '....../data/';
Note that in the statement above, the location must point to the data folder itself.
alter table mytable recover partitions;
refresh mytable;
These two commands detect the partitions from the table's partition columns and pick up the parquet files in the subdirectories.
Now you can start querying the data.
Hope it helps
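To make it clearer what partition discovery relies on, here is a purely illustrative sketch of how the partition values are encoded in the directory names from the question. It is not how Impala implements it, and parse_partition_path is a hypothetical helper name.

```python
def parse_partition_path(path, root):
    """Extract partition column/value pairs from a Hive-style directory path.

    Hypothetical helper: every "column=value" segment below root is treated
    as one partition column; other segments (such as file names) are ignored.
    """
    if not path.startswith(root):
        raise ValueError(f"{path} is not under {root}")
    spec = {}
    for segment in path[len(root):].strip("/").split("/"):
        if "=" in segment:
            column, value = segment.split("=", 1)
            spec[column] = value
    return spec


spec = parse_partition_path("/data/key=1/date=20170101/files.parquet", "/data")
print(spec)  # {'key': '1', 'date': '20170101'}
```

The values come back as strings here; the table definition is what gives them their types (BIGINT in the example above).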
