Spark SQL Insert Select with a column list? - apache-spark

As I read the Spark/Hive SQL documentation, it appears that INSERT INTO a table with a column list is not supported in Spark 2.4 and earlier versions.
I have a source table and a destination table with a different number of columns and different column names, and I need to copy data from one to the other.
Does this mean I have to code this in PySpark, since Spark SQL will not be able to do it?
Example:
input_table( cola, colb, colc, cold, cole)
output_table(fieldx, fieldy, fieldz)
In SQL (assuming RDBMS such as MS-SQL, PostgreSQL etc) I would do the following:
insert into output_table(fieldx, fieldy, fieldz) select cola, colb, colc from input_table
Spark SQL does not allow this: it does not accept a column list in an INSERT statement.
Question: how can I do this task with a minimum of code and maximum performance in either PySpark or (ideally) Spark SQL? I am using Spark 2.4.
thank you

Spark SQL matches the selected columns to the target table's columns by position, so select a value for every column of output_table and use null for the columns that are not copied from input_table. (This is what an insert with a partial column list would do, if it were allowed.)
insert into output_table
select cola, colb, colc, null as other1 --..specify each remaining non-copied column as null
from input_table
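If the target table has many columns, writing the null padding by hand gets tedious. Here is a minimal sketch that generates the positional statement from a column mapping. The helper name build_insert_select and the extra column other1 are hypothetical, and it assumes you pass the target columns in the table's actual order.

```python
def build_insert_select(target_table, target_columns, source_table, mapping):
    """Build a positional INSERT ... SELECT statement as a string.

    target_columns: every column of the target table, in table order.
    mapping: target column name -> source column name.
    Target columns missing from the mapping are filled with NULL.
    """
    select_items = []
    for col in target_columns:
        if col in mapping:
            select_items.append(f"{mapping[col]} AS {col}")
        else:
            select_items.append(f"NULL AS {col}")
    return (
        f"INSERT INTO {target_table} "
        f"SELECT {', '.join(select_items)} "
        f"FROM {source_table}"
    )


sql_text = build_insert_select(
    "output_table",
    ["fieldx", "fieldy", "fieldz", "other1"],  # other1: hypothetical extra column
    "input_table",
    {"fieldx": "cola", "fieldy": "colb", "fieldz": "colc"},
)
print(sql_text)
# With a SparkSession named `spark`, the statement could then be run as:
# spark.sql(sql_text)
```

The aliases are only there for readability; Spark still assigns the values by position, so the order of target_columns is what matters.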

Related

SparkSQL cannot find existing rows in a Hive table that Trino(Presto) does

We are using Trino (Presto) and SparkSQL to query Hive tables on S3, but they give different results for the same query on the same tables. We found the main problem: a problematic Hive table contains rows that Trino finds with a simple WHERE filter on a specific column, but SparkSQL does not. The SQL statements are the same in both.
On the other hand, SparkSQL can find these rows in the source table of that problematic table when filtering on the same column.
Create sql statement:
CREATE TABLE problematic_hive_table AS SELECT c1,c2,c3 FROM source_table
The SELECT statement that finds the missing rows in Trino but not in SparkSQL:
SELECT * FROM problematic_hive_table WHERE c1='missing_rows_value_in_column'
And this is the select query which can find these missing rows in SparkSQL:
SELECT * FROM source_table WHERE c1='missing_rows_value_in_column'
We execute the CTAS in Trino (Presto). If we use ...WHERE trim(c1) = 'missing_key', then Spark can also find the missing rows, even though the fields do not contain trailing spaces (the fields have the same length in the source table as in the problematic table). In the source table, Spark can find these missing rows without trim.

Computing the size of a derived table in Spark SQL query

Is it possible to approximate the size of a derived table (in KB/MB/GB etc.) in a Spark SQL query? I don't need the exact size; an approximate value will do. It would allow me to plan my queries better by determining whether a table could be broadcast in a join, or whether using a filtered subquery in a join is better than using the entire table.
For example, in the following query, is it possible to approximate the size (in MB) of the derived table named b? This will help me figure out whether it is better to use the derived table in the join or to use the entire table with the filter outside:
select
a.id, b.name, b.cust
from a
left join (select id, name, cust
from tbl
where size > 100
) b
on a.id = b.id
We use Spark SQL 2.4. Any comments appreciated.
I have had to do something similar before (to work out how many partitions to split into when writing).
What we ended up doing was working out an average row size, running a count on the DataFrame, and multiplying the average row size by the row count.
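As a rough sketch of that arithmetic (not Spark's own statistics): the function names and the sample numbers below are made up, and how you measure the size of a sampled row is up to you. The only Spark-specific fact used is that spark.sql.autoBroadcastJoinThreshold defaults to 10 MB.

```python
# Default of spark.sql.autoBroadcastJoinThreshold: 10 MB, expressed in bytes.
DEFAULT_BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


def estimate_size_bytes(sample_row_sizes, row_count):
    """Approximate total size as (average sampled row size) * (row count)."""
    if not sample_row_sizes:
        raise ValueError("need at least one sampled row size")
    avg_row_bytes = sum(sample_row_sizes) / len(sample_row_sizes)
    return int(avg_row_bytes * row_count)


def fits_broadcast(estimated_bytes, threshold=DEFAULT_BROADCAST_THRESHOLD_BYTES):
    """True when the estimate is at or below the broadcast threshold."""
    return estimated_bytes <= threshold


# Hypothetical numbers: byte sizes of a few sampled rows, and a row count
# obtained from something like df.count() on the filtered subquery.
sample_sizes = [110, 120, 130]
estimated = estimate_size_bytes(sample_sizes, 50_000)
print(f"~{estimated / (1024 * 1024):.2f} MB, broadcast candidate: {fits_broadcast(estimated)}")
```

This only tells you whether the derived table is plausibly small enough; Spark decides based on its own size estimates, so you may still need an explicit broadcast hint.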

How to add new column in to partition by clause in Hive External Table

I have an external Hive table which is filled by a Spark job and partitioned by (event_date date). I have now modified the Spark code and added one extra column, 'country'. In the data written earlier, the country column will have null values, as it is newly added. Now I want to alter the 'partitioned by' clause to partitioned by (event_date date, country string). How can I achieve this? Thank you!
Please try to alter the partition using the command below:
ALTER TABLE table_name PARTITION part_spec SET LOCATION path
part_spec:
: (part_col_name1=val1, part_col_name2=val2, ...)
See the Databricks Spark SQL language manual for the ALTER command.

Fetch distinct field values from frozen set column in Cassandra columnfamily

Hi, please help me write a CQL query for the requirement below.
- The column family contains the columns: deptid (datatype: uuid), emplList (datatype: set<frozen<employee>>)
How would I get all distinct employee names from the employee objects stored in the emplList set column?
Such queries can't be expressed in pure CQL. Cassandra is optimized to read data by primary key, and aggregation operations are very limited. You have 2 choices:
1. Read all the data from the table in your program and extract the distinct values.
2. Use Spark with the Spark Cassandra Connector. It will still read all the data from the table, but you'll have a higher-level abstraction to work with the data, and it can scan your table in a more optimized way.
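For the first choice, the client-side part could look roughly like this. It is only a sketch: the rows are in-memory stand-ins for what a driver would return from a full table read, and the employee fields (name, age) are assumptions, since the question doesn't show the type definition.

```python
from collections import namedtuple

# Hypothetical shape of the `employee` user-defined type.
Employee = namedtuple("Employee", ["name", "age"])


def distinct_employee_names(rows):
    """Collect distinct employee names across the emplList set of every row."""
    names = set()
    for row in rows:
        # An empty or missing set may come back as None, so fall back to ().
        for employee in row["emplList"] or ():
            names.add(employee.name)
    return names


# Stand-in rows; in practice they would come from reading the whole table.
rows = [
    {"deptid": "d1", "emplList": {Employee("alice", 30), Employee("bob", 41)}},
    {"deptid": "d2", "emplList": {Employee("bob", 41), Employee("carol", 27)}},
    {"deptid": "d3", "emplList": None},
]
print(sorted(distinct_employee_names(rows)))
```

Keep in mind this pulls every row to the client, so it is only reasonable for tables of modest size.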

Spark SQL ignoring dynamic partition filter value

Running into an issue on Spark 2.4 on EMR 5.20 in AWS.
I have a string column as a partition, which has date values. My goal is to have the max value of this column be referenced as a filter. The values look like this 2019-01-01 for January 1st, 2019.
In this query, I am trying to filter to a certain date value (which is a string data type), and Spark ends up reading all directories, not just the resulting max(value).
spark.sql("select mypartitioncolumn, column1, column2 from mydatabase.mytable where mypartitioncolumn= (select max(mypartitioncolumn) from myothertable) group by 1,2,3 ").show
However, if I hardcode the value, it only reads the proper directory.
spark.sql("select mypartitioncolumn, column1, column2 from mydatabase.mytable where mypartitioncolumn= '2019-01-01' group by 1,2,3 ").show
Why is Spark not recognizing both methods in the same way? I made sure that if I run the select max(mypartitioncolumn) from myothertable query, it shows the exact same value as my hardcoded method (as well as the same datatype).
I can't find anything in the documentation that differentiates partition querying other than data type differences. I checked that the partition column in the source table and the subquery value are both string types, and I also tried casting the value to a string with cast( (select max(mypartitioncolumn) from myothertable) as string), but it doesn't make any difference.
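Since the hardcoded literal is pruned correctly, one thing that might be worth trying is running the max query first and then building the main query with the result as a literal. This is an untested sketch, not a confirmed fix: the helper name is made up, and the Spark calls are left as comments because they need a live session.

```python
import re


def build_literal_filter_query(partition_value):
    """Build the main query with the partition value inlined as a literal."""
    # Only accept values shaped like 2019-01-01, so nothing unexpected
    # ends up inside the SQL text.
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", partition_value):
        raise ValueError(f"unexpected partition value: {partition_value!r}")
    return (
        "select mypartitioncolumn, column1, column2 "
        "from mydatabase.mytable "
        f"where mypartitioncolumn = '{partition_value}' "
        "group by 1,2,3"
    )


# Step 1 (needs a SparkSession named `spark`, so shown as a comment):
# max_value = spark.sql("select max(mypartitioncolumn) from myothertable").first()[0]
max_value = "2019-01-01"  # stand-in for the value step 1 would return

# Step 2: the filter is now a plain literal, like the hardcoded version.
query = build_literal_filter_query(max_value)
print(query)
# spark.sql(query).show()
```

It costs one extra small query, but the main query then has the same shape as the hardcoded one.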
A workaround is to change this configuration:
sql("set spark.sql.hive.convertMetastoreParquet = false")
From the Spark docs:
"When reading from and writing to Hive metastore Parquet tables, Spark SQL will try to use its own Parquet support instead of Hive SerDe for better performance. This behavior is controlled by the spark.sql.hive.convertMetastoreParquet configuration, and is turned on by default."
