Keeping Special Characters in Spark Table Column Name - apache-spark

Is there any way to keep special characters in a column name for a Spark 3.0 table?
I need to do something like
CREATE TABLE schema.table
AS
SELECT id=abc
FROM tbl1
I was reading that in Hadoop you would put backticks around the column name, but this does not work in Spark.
If there is a way to do this in PySpark, that would work as well.

It turns out the parquet and delta formats do not accept special characters in column names under any circumstance. You must use ROW FORMAT DELIMITED:
spark.sql("""CREATE TABLE schema.test
ROW FORMAT DELIMITED
SELECT 1 AS `brand=one` """)
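Once the table exists, the column can be queried back by wrapping its name in backticks (a small sketch, assuming the table created above):
spark.sql("SELECT `brand=one` FROM schema.test").show()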

Related

Reading from DB2 tables with columns with names having special characters into Spark

I need to read data from a DB2 table into a Spark DataFrame.
However, the DB2 table, named 'TAB#15', has 2 columns whose names contain special characters, MYCRED# and MYCRED$.
My PySpark code looks like this:
query = '''select count(1) as cnt from {table} as T'''.format(table=table)
my_val = spark.read.jdbc(url, table=query, properties=properties).collect()
My spark-submit, however, throws an error that looks like this:
ERROR: u"\nextraneous input '#' expecting ..."
My question is:
Is it possible to read data into a Spark DataFrame from a DB2 table whose table name and column names contain special characters like '#' and '$'?
If there are any code samples or similar questions that illustrate reading DB2 data from columns with special characters in their names, please point me to them.
Try to use something like
table = '"MYDB2Specifier.TAB#15"'
Identifiers take double quotes. If you leave them out, everything will be folded to uppercase. If the string has special characters like $, you might need to escape them.
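For illustration, a minimal PySpark sketch along those lines (the JDBC URL, schema name, driver, and credentials are placeholders, not taken from the original post):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder connection details
url = "jdbc:db2://dbhost:50000/MYDB"
properties = {"user": "db2user", "password": "secret", "driver": "com.ibm.db2.jcc.DB2Driver"}

# Double-quote the identifiers so DB2 keeps the exact names, including '#' and '$'
query = '(select "MYCRED#", "MYCRED$" from "MYSCHEMA"."TAB#15") as T'
df = spark.read.jdbc(url=url, table=query, properties=properties)
df.show()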

Spark SQL ignoring dynamic partition filter value

Running into an issue on Spark 2.4 on EMR 5.20 in AWS.
I have a string column as a partition, which holds date values. My goal is to reference the max value of this column as a filter. The values look like 2019-01-01 for January 1st, 2019.
In this query, I am trying to filter to a certain date value (which is a string data type), and Spark ends up reading all directories, not just the resulting max(value).
spark.sql("select mypartitioncolumn, column1, column2 from mydatabase.mytable where mypartitioncolumn= (select max(mypartitioncolumn) from myothertable) group by 1,2,3 ").show
However, in this instance, If I hardcode the value, it only reads the proper directory.
spark.sql("select mypartitioncolumn, column1, column2 from mydatabase.mytable where mypartitioncolumn= '2019-01-01' group by 1,2,3 ").show
Why does Spark not treat both methods the same way? I made sure that if I run the select max(mypartitioncolumn) from myothertable query on its own, it shows exactly the same value as my hardcoded method (as well as the same data type).
I can't find anything in the documentation that differentiates partition querying other than data type differences. I checked that the schema in both the source table and the value are string types, and I also tried casting the value to a string, cast((select max(mypartitioncolumn) from myothertable) as string); it doesn't make any difference.
Workaround: change the following configuration setting
sql("set spark.sql.hive.convertMetastoreParquet = false")
From the Spark docs:
"When reading from and writing to Hive metastore Parquet tables, Spark SQL will try to use its own Parquet support instead of Hive SerDe for better performance. This behavior is controlled by the spark.sql.hive.convertMetastoreParquet configuration, and is turned on by default."

Spark SQL Insert Select with a column list?

As I read the Spark/Hive SQL documentation, it appears that INSERT INTO a table with a column list is not supported in Spark 2.4 and earlier versions.
I have a source table and a destination table with a different number of columns and different column names, between which I need to copy data.
Does this mean I have to code this in PySpark, since Spark SQL will not be able to do it?
Example:
input_table( cola, colb, colc, cold, cole)
output_table(fieldx, fieldy, fieldz)
In SQL (assuming an RDBMS such as MS-SQL, PostgreSQL, etc.) I would do the following:
insert into output_table(fieldx, fieldy, fieldz) select cola, colb, colc from input_table
Spark SQL does not allow this; it does not accept a column list in an INSERT statement.
Question: how can I do this task with a minimum of code and maximum performance, in either PySpark or (ideally) Spark SQL (I am using Spark 2.4)?
thank you
In the SELECT, specify the output_table columns that won't be copied from input_table as null. (This is what would happen if only a subset of columns, not all, were inserted with a column list, if that were allowed.)
insert into output_table
select cola, colb, colc, null as other1 -- specify non-copied column values as null
from input_table
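If you would rather avoid the SQL limitation altogether, here is a PySpark sketch of the same copy (column names follow the example above; insertInto matches columns by position, so the select order must line up with output_table):
from pyspark.sql import functions as F

df = (spark.table("input_table")
        .select(F.col("cola").alias("fieldx"),
                F.col("colb").alias("fieldy"),
                F.col("colc").alias("fieldz")))

df.write.insertInto("output_table")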

drop column in a table/view using spark sql only

I have 30 columns in a table, i.e. table_old.
I want to use 29 columns of that table, all except one. That column is dynamic.
I am using string interpolation.
The Spark SQL query I am using is below:
drop_column=now_current_column
var table_new=spark.sql(s"""alter table table_old drop $drop_column""")
but it's throwing the error
mismatched input expecting 'partition'
I don't want to drop the column using a DataFrame. My requirement is to drop the column in the table using Spark SQL only.
As mentioned in the previous answer, DROP COLUMN is not supported by Spark yet.
But there is a workaround to achieve the same result without much overhead. This trick works for both EXTERNAL and InMemory tables. The code snippet below works for an EXTERNAL table; you can easily modify it and use it for InMemory tables as well.
import org.apache.spark.sql.types.StructType

val dropColumnName = "column_name"
val tableIdentifier = "table_name"
val tablePath = "table_path"

// Rebuild the schema without the unwanted column
val newSchema = StructType(spark.read.table(tableIdentifier).schema.filter(col => col.name != dropColumnName))

// Drop the old table and re-register it over the same path with the reduced schema
spark.sql(s"drop table ${tableIdentifier}")
spark.catalog.createTable(tableIdentifier, "orc", newSchema, Map("path" -> tablePath))
Here orc is the file format; replace it with whatever format you need. For InMemory tables, remove the tablePath option and you are good to go. Hope this helps.
DROP COLUMN (and, in general, the majority of ALTER TABLE commands) is not supported in Spark SQL.
If you want to drop a column you should create a new table:
CREATE TABLE tmp_table AS
SELECT ... -- all columns except the one to drop
FROM table_old
and then drop the old table or view, and reclaim the name.
Dropping columns is now supported by Spark if you're using v2 tables. You can check this link:
https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-alter-table.html
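For example, on a v2 table (a sketch; this only works when the underlying catalog/format, e.g. Delta or Iceberg, supports the command):
spark.sql("ALTER TABLE table_old DROP COLUMN now_current_column")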

spark to hive data types

Is there a way to convert an input string field into an ORC table column specified as varchar(xx) in a Spark SQL select query? Or do I have to use some workaround? I'm using Spark 1.6.
I found on a Cloudera forum that Spark does not care about length; it saves the value as a string with no size limit.
The table is inserted into Hive OK, but I'm a little bit worried about data quality.
temp_table = sqlContext.table(ext)
df = temp_table.select(temp_table.day.cast('string'))
I would like to be able to write something like this :)))
df = temp_table.select(temp_table.day.cast('varchar(100)'))
Edit:
df.write.partitionBy(part).mode('overwrite').insertInto(int)
The table I'm inserting into is stored as ORC (the line above probably should also have .format('orc')).
I found here that if I specify the column as a varchar(xx) type, then the input string will be cut off at xx characters.
Thx
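A rough sketch of that approach, with placeholder table names and assuming the truncation behaviour described above: declare the length on the Hive table itself and insert the string column into it.
sqlContext.sql("CREATE TABLE mydb.target_orc (day VARCHAR(100)) STORED AS ORC")
sqlContext.sql("INSERT INTO TABLE mydb.target_orc SELECT day FROM mydb.source")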
