I'm converting hql scripts to pyspark.
HQL code : show tblproperties tblName ('transient_lastDdlTime')
I want an equivalent of the "transient_lastDdlTime" property for Parquet files. I know there is a way to get this for Delta tables using the Delta Lake APIs, but is there a way for Parquet files?
I don't think Parquet files contain any metadata equivalent to transient_lastDdlTime. You can check this yourself by writing some Parquet files and reading their metadata with the code below:
import pyarrow.parquet as pq
pq.read_metadata('<file_path>')
As you can see, no external metadata is created the way it is for Delta tables, and the Parquet metadata holds only basic row and column information.
You will need to implement your own code to capture the timestamp of changes.
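One way to capture such a timestamp yourself is to fall back on the file system's modification times. The sketch below is only an illustration under assumptions: the helper name `latest_parquet_mtime` is made up, it walks a local directory (HDFS, S3 or ADLS would need that file system's own listing API), and a file's modification time is not the same thing as Hive's last DDL time.

```python
import os
from datetime import datetime, timezone


def latest_parquet_mtime(root):
    """Return the newest modification time (UTC) among *.parquet files under root.

    Hypothetical helper: returns None when no Parquet file is found.
    """
    newest = None
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".parquet"):
                mtime = os.path.getmtime(os.path.join(dirpath, name))
                if newest is None or mtime > newest:
                    newest = mtime
    if newest is None:
        return None
    return datetime.fromtimestamp(newest, tz=timezone.utc)
```

Calling `latest_parquet_mtime('<table_dir>')` gives a timezone-aware datetime; `.timestamp()` on the result gives epoch seconds if you need a number to compare against.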
Related
I have a question: is it possible to run ETL on data using Flume?
To be more specific, I have Flume configured on a spoolDir that contains CSV files, and I want to convert those files into Parquet files before storing them in Hadoop. Is that possible?
If it's not possible, would you recommend transforming them before storing them in Hadoop, or transforming them with Spark on Hadoop?
I'd probably suggest using NiFi to move the files around. Here's a specific tutorial on how to do that with Parquet. As I see it, NiFi is the replacement for Apache Flume.
Partial answers for Flume (not Parquet):
If you are flexible on format, you can use an Avro sink. You can also use a Hive sink, which will create a table in ORC format. (You can check whether it also allows Parquet in the definition, but I have heard that ORC is the only supported format.)
You could then likely use a simple script that has Hive move the data from the ORC table to a Parquet table, which converts the data into the Parquet files you asked for.
I use ADF to ingest data from SQL Server into ADLS Gen2 in Parquet Snappy format, but the size of the file in the sink goes up to 120 GB. That size causes me a lot of problems when I read the file in Spark and join its data with many other Parquet files.
I am thinking of using a Delta Lake unmanaged table with the location pointing to the ADLS location. I am able to create an unmanaged table if I don't specify any partition, using this:
" CONVERT TO DELTA parquet.`PATH TO FOLDER CONTAINING A PARQUET FILE(S)`"
But if I want to partition this file for query optimization, using this:
" CONVERT TO DELTA parquet.`PATH TO FOLDER CONTAINING A PARQUET FILE(S)`, PARTITIONED_COLUMN DATATYPE"
it gives me the following error:
org.apache.spark.sql.AnalysisException: Expecting 1 partition column(s): [<PARTITIONED_COLUMN>], but found 0 partition column(s): [] from parsing the file name: abfss://mydirectory#myADLS.dfs.core.windows.net/level1/Level2/Table1.parquet.snappy;
There is no way for me to create this Parquet file with partition details using ADF (I am open to suggestions).
Is my syntax wrong, or can this not be done at all?
OK, I found the answer to this. When you convert Parquet files to Delta using the above approach, Delta looks for a directory structure that carries the partition information, with directory names matching the column named in the "PARTITIONED BY" clause.
For example, I have a folder called /Parent. Inside it is a directory structure with partition information, and the partitioned Parquet files sit one level further down, inside the partition folders. The paths look like this:
/Parent/Subfolder=0/part-00000-62ef2efd-b88b-4dd1-ba1e-3a146e986212.c000.snappy.parquet
/Parent/Subfolder=1/part-00000-fsgvfabv-b88b-4dd1-ba1e-3a146e986212.c000.snappy.parquet
/Parent/Subfolder=2/part-00000-fbfdfbfe-b88b-4dd1-ba1e-3a146e986212.c000.snappy.parquet
/Parent/Subfolder=3/part-00000-gbgdbdtb-b88b-4dd1-ba1e-3a146e986212.c000.snappy.parquet
In this case, the Subfolder=N directories are the partitions created inside /Parent.
CONVERT TO DELTA parquet.`/Parent/` PARTITIONED BY (Subfolder INT)
will take this directory structure as it is, convert the whole partitioned data set to Delta, and store the partition information in the metastore.
Summary: this command only makes use of Parquet files that are already partitioned. To partition a single Parquet file you would have to take a different route, which I can explain later if you are interested ;)
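To make the layout requirement concrete, here is a small sketch of how partition values can be read from `key=value` directory names. It only illustrates the naming convention and is not Delta's actual parsing code; the function name `partition_values` is made up for this example.

```python
from pathlib import PurePosixPath


def partition_values(file_path, table_root):
    """Return {column: value} parsed from key=value directories below table_root.

    Hypothetical helper that only mirrors the Hive-style naming convention.
    """
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(table_root))
    values = {}
    # relative.parts[:-1] are the directories; the last part is the file name.
    for part in relative.parts[:-1]:
        if "=" in part:
            key, _, value = part.partition("=")
            values[key] = value
    return values


# A file inside a partition folder yields the partition column and its value.
print(partition_values("/Parent/Subfolder=0/part-00000.snappy.parquet", "/Parent"))

# A single file sitting directly in the table folder yields no partition columns,
# which corresponds to the "found 0 partition column(s)" situation in the question.
print(partition_values("/level1/Level2/Table1.parquet.snappy", "/level1/Level2"))
```

Values come back as strings; the data type given in PARTITIONED BY (INT above) is what tells Delta how to interpret them.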
I have started looking into the Delta Lake file format. Is Hive capable of reading data from this newly introduced format? If so, could you please let me know which SerDe you are using?
Hive support is available for the Delta Lake format. The first step is to add the jars from https://github.com/delta-io/connectors to your Hive path. Then create a table using the following format:
CREATE EXTERNAL TABLE test.dl_attempts_stream
(
...
)
STORED BY 'io.delta.hive.DeltaStorageHandler'
LOCATION
The Delta format picks up partitions by default, so there is no need to specify them when creating the table.
NOTE: If data is being inserted via a Spark job, provide hive-site.xml and call enableHiveSupport in the Spark job so that the Delta Lake table is created in Hive.
I have some ORC files produced by a Spark job.
Is there an easy way to create an external table directly from those files?
The way I have done this is to first register a temp table in the Spark job itself and then use the sql method of the HiveContext to create a new table in Hive from the data in the temp table. For example, if I have a DataFrame df and a HiveContext hc, the general process is:
df.registerTempTable("my_temp_table")
hc.sql("CREATE TABLE new_table_name STORED AS ORC AS SELECT * from my_temp_table")
I am trying to generate Parquet files in S3 using Spark, with the goal that Presto can later be used to query the Parquet data. Basically, this is what it looks like:
Kafka-->Spark-->Parquet<--Presto
I am able to generate Parquet in S3 using Spark, and it's working fine. Now I am looking at Presto, and from what I have found it needs the Hive metastore to query Parquet. I could not make Presto read my Parquet files, even though Parquet stores the schema. So does that mean the Spark job also has to store metadata in the Hive metastore when it creates the Parquet files?
If that is the case, can someone help me find an example of how it's done? To add to the problem, my data schema is changing. To handle that, I build a programmatic schema in the Spark job and apply it while creating the Parquet files. If I am creating the schema in the Hive metastore, that has to be taken into account.
Or could you shed some light on whether there is a better alternative?
You keep the Parquet files on S3. Presto's S3 capability is a subcomponent of the Hive connector. As you said, you can let Spark define tables in Spark or you can use Presto for that, e.g.
create table hive.default.xxx (<columns>)
with (format = 'parquet', external_location = 's3://s3-bucket/path/to/table/dir');
(Depending on Hive metastore version and its configuration, you might need to use s3a instead of s3.)
Technically, it should be possible to create a connector that infers tables' schemata from Parquet headers, but I'm not aware of an existing one.
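Since the schema in the question is built programmatically and changes over time, one option is to generate the Presto statement above from the same column list. This is a minimal sketch under assumptions: the helper name `presto_create_table` and the column names are invented for illustration, and the values are interpolated without any quoting or escaping, so it is only suitable for trusted input.

```python
def presto_create_table(table, columns, location, fmt="parquet"):
    """Build a Presto CREATE TABLE statement for an external table.

    `columns` is a list of (name, presto_type) pairs. Hypothetical helper:
    identifiers and literals are inserted as-is, with no escaping.
    """
    column_list = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    return (
        f"create table {table} ({column_list}) "
        f"with (format = '{fmt}', external_location = '{location}')"
    )


# Example columns are placeholders; use the schema your Spark job applies.
ddl = presto_create_table(
    "hive.default.xxx",
    [("id", "bigint"), ("name", "varchar")],
    "s3://s3-bucket/path/to/table/dir",
)
print(ddl)
```

The resulting string can then be submitted through whichever Presto client you already use.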