Correct syntax for creating a parquet table with CTAS at a specified location - apache-spark

I'm trying to create a table stored as parquet with spark.sql with a pre-specified external location, but I appear to be missing something, or something is omitted from the documentation.
My reading of the documentation suggests the following should work:
create table if not exists schema.test
using PARQUET
partitioned by (year, month)
location 's3a://path/to/location'
as select '1' test1, true test2, '2017' year, '01' month
But this returns the error:
mismatched input 'location' expecting (line 4, pos 0)
The documentation suggests that external is automatically implied by setting location, but in any case, changing the statement to create external table gives the same error.
I was able to successfully create an empty table with similar syntax:
create external table if not exists schema.test
(test1 string, test2 boolean)
partitioned by (year string, month string)
stored as PARQUET
location 's3a://path/to/location'
My alternative is to save the select results to a parquet at /path/to/location directly first, then create a table pointing to this, but this seems roundabout when the original syntax seems valid and designed for this purpose.
What's wrong with my approach?
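For reference, a rough PySpark sketch of that two-step workaround (assuming an existing SparkSession `spark` with Hive support; the bucket path and table name are the placeholders from the question):
# Sketch of the two-step workaround: write the result as parquet first,
# then register an external table over that location.
df = spark.sql("select '1' test1, true test2, '2017' year, '01' month")

(df.write
   .mode("overwrite")
   .partitionBy("year", "month")
   .parquet("s3a://path/to/location"))

spark.sql("""
    create external table if not exists schema.test
    (test1 string, test2 boolean)
    partitioned by (year string, month string)
    stored as parquet
    location 's3a://path/to/location'
""")

# Register the partitions that were written directly to the location:
spark.sql("msck repair table schema.test")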

Related

Time function in Azure Data Factory - Expression Builder

I only need to take the time part from the timestamp-type source attribute and load it into a dedicated SQL pool table (time datatype column), but I can't find a time function within the expression builder in ADF. Is there a way I can do it?
What did I do? I took the time part from the source attribute using substring and then tried to load it into the destination table, but the destination table ended up with null values because the destination column is of the time datatype.
I tried to reproduce this and got the same issue; the following demonstrates it. I have a table called mydemo, created as shown below.
CREATE TABLE [dbo].[mydemo]
(
id int NOT NULL,
my_date date,
my_time time
)
WITH
(
DISTRIBUTION = HASH (id),
CLUSTERED COLUMNSTORE INDEX
)
GO
The following is my source data in my dataflow.
time is not a recognized datatype in Azure Data Flow (date and timestamp are accepted). Therefore, the data flow fails to convert the string produced by substring(<timestamp_col>,12,5) into the time type.
For a better understanding, you can load your sink table as a source in the data flow. The time column will be read as 1900-01-01 12:34:56 when the time value in the table row is 12:34:56.
#my table row
insert into mydemo values(200,'2022-08-18','12:34:56')
So, instead of using substring(<timestamp_col>,12,5) to return 00:01, use concat('1900-01-01 ',substring(<timestamp_col>,12,8)) which returns 1900-01-01 00:01:00.
Configure the sink and mapping, and look at the resulting data in the data preview. Now Azure Data Flow will be able to insert the values successfully and give the desired results.
The following is the output after successful insertion of record into dedicated pool table.
NOTE: You can construct a valid yyyy-MM-dd hh:mm:ss value using concat('yyyy-MM-dd ',substring(<timestamp_col>,12,8)) in place of 1900-01-01 hh:mm:ss in the derived column transformation.
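Purely to illustrate the offsets used above, here is plain Python string slicing (not ADF expression syntax); the data flow substring() in the expressions above is 1-based, while Python slices are 0-based:
# Plain Python illustration of the offsets used above (not ADF expression syntax).
# substring(<timestamp_col>, 12, 8) starts at character 12 (1-based) and takes 8 characters.
ts = "2022-08-18 12:34:56"      # 'yyyy-MM-dd HH:mm:ss'

time_part = ts[11:19]           # 0-based equivalent -> '12:34:56'
sink_value = "1900-01-01 " + time_part

print(sink_value)               # 1900-01-01 12:34:56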

Spark find max of date partitioned column

I have a parquet partitioned in the following way:
data
/batch_date=2020-01-20
/batch_date=2020-01-21
/batch_date=2020-01-22
/batch_date=2020-01-23
/batch_date=2020-01-24
Here batch_date, which is the partition column, is of date type.
I want to read only the data from the latest date partition, but as a consumer I don't know what the latest value is.
I could use a simple group by, something like
df.groupby().agg(max(col('batch_date'))).first()
While this would work, it's a very inefficient approach since it involves a groupby.
I want to know if we can query the latest partition in a more efficient way.
Thanks.
The method suggested by #pasha701 would involve loading the entire Spark data frame with all the batch_date partitions and then finding the max of that. I think the author is asking for a way to directly find the max partition date and load only that partition.
One way is to use hdfs or s3fs to load the contents of the s3 path as a list, find the max partition, and then load only that. That would be more efficient.
Assuming you are using AWS S3, something like this:
import s3fs

datelist = []
inpath = "s3://bucket_path/data/"

fs = s3fs.S3FileSystem(anon=False)
dirs = fs.ls(inpath)                      # list the partition directories under the data path
for path in dirs:
    date = path.split('=')[1]             # take the value after 'batch_date='
    datelist.append(date)

maxpart = max(datelist)
df = spark.read.parquet("s3://bucket_path/data/batch_date=" + maxpart)
This does all the work with plain lists, without loading any data into Spark until it has found the single partition you want to load.
Function "max" can be used without "groupBy":
df.select(max("batch_date"))
Use SHOW PARTITIONS to get all partitions of the table:
show partitions TABLENAME
The output will be like:
pt=2012.07.28.08/is_complete=1
pt=2012.07.28.09/is_complete=1
We can get data from a specific partition using the query below:
select * from TABLENAME where pt='2012.07.28.10' and is_complete='1' limit 1;
Additional filters or a group by can also be applied to it.
This worked for me in PySpark v2.4.3. First extract the partitions (this is for a dataframe with a single partition on a date column; I haven't tried it when a table has more than one partition column):
df_partitions = spark.sql("show partitions database.dataframe")
"show partitions" returns dataframe with single column called 'partition' with values like partitioned_col=2022-10-31. Now we create a 'value' column extracting just the date part as string. This is then converted to date and the max is taken:
date_filter = df_partitions.withColumn('value', to_date(split('partition', '=')[1], 'yyyy-MM-dd')).agg({"value":"max"}).first()[0]
date_filter now contains the maximum partition date and can be used in a where clause when reading from the same table.
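For example, a minimal sketch using the placeholder names from above (database.dataframe and partitioned_col):
# Sketch: read only the newest partition using the max date found above.
from pyspark.sql.functions import col

df_latest = (spark.table("database.dataframe")
                  .where(col("partitioned_col") == date_filter))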

U-Sql Create table statement failing

I'm trying to create a U-SQL table from two tables using CREATE TABLE AS SELECT (CTAS), as below:
DROP TABLE IF EXISTS tpch_query2_result;
CREATE TABLE tpch_query2_result
(
INDEX idx_query2
CLUSTERED(P_PARTKEY ASC)
DISTRIBUTED BY HASH(P_PARTKEY)
) AS
SELECT
a.P_PARTKEY
FROM part AS a INNER JOIN partsupp AS b ON a.P_PARTKEY == b.PS_PARTKEY;
But while running the U-SQL query I'm getting the error below:
E_CSC_USER_QUALIFIEDCOLUMNNOTFOUND: Column 'P_PARTKEY' not found in rowset 'a'.
Line 11
E_CSC_USER_QUALIFIEDCOLUMNNOTFOUND: Column 'PS_PARTKEY' not found in rowset 'b'.
I'm not sure about the error. Can someone provide some insight into it? Thanks.
The error normally indicates that the specified column does not exist in the rowset referenced by a (i.e., part) or b (i.e., partsupp). What is the schema of each of these tables? Do they have columns with the expected names?

Impala table from spark partitioned parquet files

I have generated some partitioned parquet data using Spark, and I'm wondering how to map it to an Impala table... Sadly, I haven't found any solution yet.
The schema of the parquet data is like:
{ key: long,
value: string,
date: long }
and I partitioned it by key and date, which gives me this kind of directory layout on my hdfs:
/data/key=1/date=20170101/files.parquet
/data/key=1/date=20170102/files.parquet
/data/key=2/date=20170101/files.parquet
/data/key=2/date=20170102/files.parquet
...
Do you know how I could tell Impala to create a table from this dataset with the corresponding partitions (and without having to loop over each partition, as I've read elsewhere)? Is it possible?
Thank you in advance
Assuming that by "schema of parquet" you meant the schema of the dataset, and that you then partitioned by the key and date columns, the actual files.parquet files will contain only the value column. Now you can proceed as follows.
The solution is to use an Impala external table.
create external table mytable (value STRING)
partitioned by (key BIGINT, date BIGINT)
stored as parquet location '....../data/'
Note that in the above statement you have to give the path up to the data folder.
alter table mytable recover partitions;
refresh mytable;
The above two commands will automatically detect the partitions based on the table's schema and pick up the parquet files present in the subdirectories.
Now you can start querying the data.
Hope it helps.

Loading PIG output files into Hive table with some blank cells

I have successfully loaded a 250,000-record CSV file into HDFS and performed some ETL functions on it, such as removing any characters in a string other than 0-9, a-z and A-Z, so that it's nice and clean.
I've saved the output of this ETL to the HDFS for loading into Hive. While in Hive I created the schema for the table and set the appropriate data types for each column.
create external table pigOutputHive (
id string,
Score int,
ViewCount int,
OwnerUserId string,
Body string,
Rank int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
location '/user/admin/PigOutputETL';
When I run a simple query on the data such as:
SELECT * FROM pigoutputhive LIMIT 100000;
The data looks as it should, and when I download it to my local machine and view it in Excel as a CSV it also looks good.
When I try to run the following query on the same table, every field is returned as an integer, even for the string columns. See the screenshot below.
Can anyone see where I am going wrong? Of the original 250,000 rows there are some blanks in particular fields, such as OwnerUserId; do I need to tell Pig or Hive how to handle these?
