In Azure Databricks I have been using the following syntax to look at the definition of a Spark table. But I have a struct column with more than 50 fields, so the output is truncated, as in the screenshot below. Can you help me with the correct syntax, please? Thanks in advance.
%sql
DESCRIBE table mak.g1
You need to save the output to a DataFrame. The code below returns the actual table definition of the Hive table.
%python
#
# 11 - Show table definition
#
df1 = spark.sql("show create table dim.product")
# Flatten the CREATE TABLE statement to a single line and print it in full
print(df1.first().createtab_stmt.replace("\n", " "))
I am executing the code against the Adventure Works product dimension.
The same technique works with "describe table".
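Here is a minimal sketch of that DESCRIBE variant against the same dim.product table (the EXTENDED keyword is optional); capturing the result as a DataFrame and printing every row avoids the truncated notebook output:
%python
# Capture the DESCRIBE output as a DataFrame and show every row without truncation,
# so wide struct definitions are displayed in full.
df2 = spark.sql("DESCRIBE TABLE EXTENDED dim.product")
df2.show(df2.count(), truncate=False)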
I think I understand the concepts, but something is still unclear to me.
Let's say we have Structured Streaming reading the following data from some source:
id,name,age
1, Joe, 34
2, Frank,69
3,Eva,62
..etc
As far as I understand, Spark reads it from the source, puts it in the unbounded input table, runs some logic on it, and writes to the result table.
My question is: if my logic only needs the name column, for example:
df.select("name")
Will Spark read all columns from the input and then make the selection, or will it drop the non-name columns first and only read the name column from the source?
It depends on the word "some" in your phrase "data from some source".
If it's a columnar storage format, like Parquet, it should read only the needed columns.
If it's raw files/records on HDFS/S3/Kafka/etc., Spark has to read the whole record, split it into columns, and drop the unnecessary columns after that.
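You can check the columnar case yourself by looking at the physical plan. A minimal sketch, assuming a SparkSession named spark and the sample data written as Parquet under an illustrative path:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Schema of the sample data (id, name, age)
schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
    StructField("age", IntegerType()),
])

# Columnar streaming source: the projection is pushed down, so only "name" is read from the files
names = spark.readStream.schema(schema).parquet("/tmp/people_parquet").select("name")

# The equivalent batch read shows ReadSchema: struct<name:string> in its physical plan,
# confirming that the other columns are never scanned
spark.read.schema(schema).parquet("/tmp/people_parquet").select("name").explain()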
We are using Trino (Presto) and Spark SQL to query Hive tables on S3, but they give different results for the same query on the same tables. We tracked down the main problem: there are rows in a problematic Hive table that can be found with a simple WHERE filter on a specific column in Trino, but cannot be found with Spark SQL. The SQL statements are identical in both.
On the other hand, Spark SQL can find these rows in the source table of that problematic table when filtering on the same column.
The CREATE SQL statement:
CREATE TABLE problematic_hive_table AS SELECT c1,c2,c3 FROM source_table
The SELECT SQL that finds the missing rows in Trino but not in Spark SQL:
SELECT * FROM problematic_hive_table WHERE c1='missing_rows_value_in_column'
And this is the SELECT query that can find these missing rows in Spark SQL:
SELECT * FROM source_table WHERE c1='missing_rows_value_in_column'
We execute the CTAS in Trino (Presto). If we use ...WHERE trim(c1) = 'missing_key', then Spark can also find the missing rows, yet the fields do not appear to contain trailing spaces (the lengths of these fields are the same in the source table as in the problematic table). In the source table Spark can find these missing rows without trim.
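One way to confirm whether invisible padding or whitespace is behind the difference is to compare the raw and trimmed lengths of c1 in both engines. A diagnostic sketch on the Spark side (the same SELECT can be pasted into Trino for comparison):
# If raw_len > trimmed_len in one engine but not the other, the engines disagree on
# padding/whitespace handling (e.g. CHAR vs. STRING semantics), which would explain the missing rows.
spark.sql("""
    SELECT c1,
           length(c1)       AS raw_len,
           length(trim(c1)) AS trimmed_len
    FROM problematic_hive_table
    WHERE trim(c1) = 'missing_rows_value_in_column'
""").show(truncate=False)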
I am new to Azure and am trying to see if the result below is achievable with Data Factory / mapping data flows, without Databricks.
I have a CSV file with this sample data:
I have the following data in my table:
My expected data/result:
Which transformations would be helpful to achieve this?
Thanks.
Now that you have the RowNumber column, you can use the pivot activity to do the row-to-column pivoting.
I used your sample data to make a test, as follows:
My Projection tab is like this:
My DataPreview is like this:
In the Pivot1 activity, we select the Table_Name and Row_Number columns to group by. If you don't want the Table_Name column, you can remove it here.
On the Pivot key tab, we select the Col_Name column.
Under Pivoted columns, we must choose an aggregate function to aggregate the Value column; here I use max().
The result shows:
Please correct me if I have misunderstood you in this answer.
Update:
The data source looks like this:
The result shows that, as you said, ADF sorts the columns alphabetically. There seems to be no way to customize the sorting:
But when we configure the sink activity, it will auto-map into your SQL result table.
I have 30 columns in a table, i.e. table_old.
I want to use 29 of the columns in that table and exclude one; that column is dynamic.
I am using string interpolation.
The Spark SQL query I am using is below:
val drop_column = now_current_column
var table_new=spark.sql(s"""alter table table_old drop $drop_column""")
But it's throwing this error:
mismatched input expecting 'partition'
I don't want to drop the column using the DataFrame API; my requirement is to drop the column in the table using Spark SQL only.
As mentioned in the previous answer, DROP COLUMN is not supported by Spark yet.
But there is a workaround to achieve the same result without much overhead. The trick works for both EXTERNAL and InMemory tables. The code snippet below works for an EXTERNAL table; you can easily modify it and use it for InMemory tables as well.
import org.apache.spark.sql.types.StructType

val dropColumnName = "column_name"
val tableIdentifier = "table_name"
val tablePath = "table_path"

// Build a new schema that excludes the column to be dropped
val newSchema = StructType(spark.read.table(tableIdentifier).schema.filter(col => col.name != dropColumnName))

// Drop the catalog entry (the EXTERNAL data files stay in place), then re-register the table with the new schema
spark.sql(s"drop table ${tableIdentifier}")
spark.catalog.createTable(tableIdentifier, "orc", newSchema, Map("path" -> tablePath))
"orc" is the file format; replace it with the format you need. For InMemory tables, remove the tablePath option and you are good to go. Hope this helps.
DROP COLUMN (and, in general, the majority of ALTER TABLE commands) is not supported in Spark SQL.
If you want to drop a column, you should create a new table:
CREATE TABLE tmp_table AS
SELECT ... -- all columns except the one to drop
FROM table_old
and then drop the old table or view, and reclaim the name.
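A concrete sketch of that workaround in Spark SQL, with illustrative table and column names (here c2 is the column being dropped):
# Recreate the table without the unwanted column, then reclaim the original name.
# table_old, tmp_table and the column names c1/c2/c3 are placeholders.
spark.sql("CREATE TABLE tmp_table AS SELECT c1, c3 FROM table_old")
spark.sql("DROP TABLE table_old")
spark.sql("ALTER TABLE tmp_table RENAME TO table_old")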
Dropping columns is now supported by Spark if you're using v2 tables. You can check this link:
https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-alter-table.html
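For reference, with a v2 table (for example one managed through a v2 catalog such as Iceberg; the names below are illustrative), the syntax documented on that page looks like this:
# Only works for v2 tables; on a plain Hive (v1) table it still fails with the same kind of error.
spark.sql("ALTER TABLE my_catalog.db.table_old DROP COLUMN now_current_column")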
I have created a Hive table through PySpark in ORC format, and everything is working as per the requirement.
However, when I looked at the details of the Hive table with
describe formatted <tbl_name>;
I get the output below:
Table Parameters:
COLUMN_STATS_ACCURATE false
EXTERNAL FALSE
numFiles 99
numRows -1
rawDataSize -1
How can I change the value of "COLUMN_STATS_ACCURATE" while writing the code in PySpark? Is there any way to do that? If not, is there a way to change it after the table has been created?
You can call ANALYZE TABLE:
spark.sql("ANALYZE TABLE foo COMPUTE STATISTICS")
but please remember that Spark's output, in general, provides only partial Hive compatibility.
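If you also want column-level statistics, which is roughly what COLUMN_STATS_ACCURATE tracks on the Hive side, there is a FOR COLUMNS variant. A sketch with placeholder table and column names:
# Table-level statistics (numRows, rawDataSize) plus column-level statistics for c1 and c2.
spark.sql("ANALYZE TABLE foo COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS c1, c2")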