I am working on a POC of Hive and ADLS Gen2 integration. The requirement is that when we submit a job, the data needs to be stored in an ADLS Gen2 location, so we need to create an external table on top of that ADLS location.
Please share any leads or pseudocode for this. I have explored a lot but have not found a concrete solution yet.
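For illustration, a minimal hedged sketch of what the external-table DDL could look like, using hypothetical container, storage-account, table and column names, and assuming the cluster already has ADLS Gen2 credentials configured for the abfss driver (account key or service principal). The same DDL can be run directly in Hive/beeline; here it is wrapped in PySpark:

from pyspark.sql import SparkSession

# Hypothetical names throughout: replace container, account, table and columns with your own.
spark = SparkSession.builder.appName("adls-external-table").enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS job_output (
        id    STRING,
        value STRING
    )
    STORED AS ORC
    LOCATION 'abfss://mycontainer@mystorageaccount.dfs.core.windows.net/job_output/'
""")

Any files the job writes to that ADLS Gen2 path then become queryable through the external table.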
Trying to move data from Teradata to Snowflake. I have created a process that runs TPT scripts for each table to generate files for each table.
The files are also split, to achieve concurrency while running COPY INTO in Snowflake.
I need to understand the best way to move those files from an on-prem Linux machine to Azure ADLS, considering the files are terabytes in size.
Does Azure provide any mechanism to move these files, or can we create files on ADLS directly from Teradata?
The best approach is to load data into Snowflake via an external table if you have Azure Blob Storage or ADLS Gen2: load the files to the storage account, create an external table over them, and then load the data into Snowflake.
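As a rough, hedged illustration of that flow (using the stage + COPY INTO variant the question mentions, rather than a Snowflake external table), here is a Python sketch with the Snowflake connector. All account, stage, table, container and credential names are placeholders, and it assumes the split TPT files have already been copied to the storage container, for which AzCopy is the usual bulk-transfer tool:

import snowflake.connector

# Placeholder connection details
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="LOAD_WH", database="STAGING", schema="PUBLIC",
)
cur = conn.cursor()

# External stage pointing at the container that holds the TPT output files
cur.execute("""
    CREATE STAGE IF NOT EXISTS teradata_stage
    URL = 'azure://mystorageaccount.blob.core.windows.net/teradata-extracts/'
    CREDENTIALS = (AZURE_SAS_TOKEN = '<sas-token>')
""")

# Snowflake loads all the split files under the prefix in parallel
cur.execute("""
    COPY INTO STAGING.PUBLIC.MY_TABLE
    FROM @teradata_stage/my_table/
    FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = '|' SKIP_HEADER = 1)
""")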
I am trying to use the Copy Data activity to copy data from Databricks DBFS to another place on the DBFS, but I am not sure if this is possible.
When I select Azure Delta Storage as a dataset source or sink, I am able to access the tables in the cluster and preview the data, but when validating it says that the tables are not Delta tables (which they aren't, but I don't seem to be able to access the persistent data on DBFS).
Furthermore, what I want to access is the DBFS, not the cluster tables. Is there an option for this?
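If the goal is just to move files around within DBFS rather than drive it from ADF, one hedged alternative is to do the copy inside Databricks itself with dbutils (the paths below are hypothetical):

# Runs in a Databricks notebook or job, where dbutils is predefined
dbutils.fs.cp("dbfs:/mnt/source/mydata/", "dbfs:/mnt/target/mydata/", recurse=True)

# List the DBFS path to confirm what is actually there
for f in dbutils.fs.ls("dbfs:/mnt/source/mydata/"):
    print(f.path, f.size)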
I've created a DataFrame which I would like to write / export to my Azure Data Lake Gen2 as a table (I need to create a new table for this).
In the future I will also need to update this Azure DL Gen2 table with new DataFrames.
In Azure Databricks I've created a connection Azure Databricks -> Azure Data Lake to see my files.
I'd appreciate help on how to write this in Spark / PySpark.
Thank you!
I would suggest that instead of writing the data in Parquet format, you go for the Delta format, which internally uses Parquet but provides additional features such as ACID transactions. The syntax would be:
df.write.format("delta").save(path)
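A slightly fuller hedged sketch of the same idea, writing the DataFrame as a Delta table directly to ADLS Gen2 and later upserting new DataFrames into it. The storage account, container, secret scope and the 'id' join column are all placeholders, and it assumes the Databricks runtime's built-in Delta Lake support:

from delta.tables import DeltaTable

# Placeholder storage account / secret scope names
spark.conf.set(
    "fs.azure.account.key.mystorageaccount.dfs.core.windows.net",
    dbutils.secrets.get(scope="my-scope", key="storage-account-key"),
)

table_path = "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/tables/my_table"

# Initial write: creates the Delta table at the ADLS Gen2 location
df.write.format("delta").mode("overwrite").save(table_path)

# Later updates: merge a new DataFrame (new_df) into the same table, keyed on a hypothetical 'id' column
(DeltaTable.forPath(spark, table_path).alias("t")
    .merge(new_df.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

If you also want to query it by name, you can register the path in the metastore with CREATE TABLE my_table USING DELTA LOCATION '<path>'.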
In my project, we have been using BLOBs on Azure. We were able to upload ORC files into an existing BLOB container named, say, student_dept in quite a handy manner using:
hdfs dfs -copyFromLocal myfolder/student_remarks/*.orc wasbs://student_dept@universitygroup.blob.core.windows.net/DEPT/STUDENT_REMARKS
And we have a Hive EXTERNAL table: STUDENT_REMARKS created on the student_dept BLOB. This way, we can very easily access our data from cloud using Hive queries.
Now we're trying to shift from BLOB storage to ADLS Gen2 for storing the ORC files, and I'm trying to understand the impact this change would have on our upload/data-retrieval process.
I'm totally new to Azure, and what I want to know is how to upload the ORC files from my HDFS to ADLS Gen2 storage. How different is it?
Does the same command with a different destination (ADLS Gen2 instead of BLOB) work, or is there something extra that needs to be done to upload data to ADLS Gen2?
Can someone please help me with your inputs on this?
I didn't give it a try myself, but per the documentation you can use a command like the one below for ADLS Gen2:
hdfs dfs -copyFromLocal myfolder/student_remarks/*.orc abfs://student_dept@universitygroup.dfs.core.windows.net/DEPT/STUDENT_REMARKS
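If you prefer to push the files to ADLS Gen2 programmatically instead of through the Hadoop client, a hedged sketch with the azure-storage-file-datalake Python SDK (the account key and the specific file name below are placeholders; the container and paths reuse the names from the question):

from azure.storage.filedatalake import DataLakeServiceClient

# Placeholder credential; account/container names mirror the example above
service = DataLakeServiceClient(
    account_url="https://universitygroup.dfs.core.windows.net",
    credential="<storage-account-key>",
)
fs_client = service.get_file_system_client("student_dept")

local_path = "myfolder/student_remarks/part-00000.orc"
remote_path = "DEPT/STUDENT_REMARKS/part-00000.orc"

with open(local_path, "rb") as data:
    fs_client.get_file_client(remote_path).upload_data(data, overwrite=True)

Other than the URL scheme (abfs:// with the dfs endpoint instead of wasbs:// with the blob endpoint), the upload flow and the Hive external table definition stay the same.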
I have a table in an Azure Databricks cluster, and I would like to replicate this data into an Azure SQL Database so that other users can analyze it from Metabase.
Is it possible to access Databricks tables through Azure Data Factory?
No, unfortunately not. Databricks tables are typically temporary and last as long as your job/session is running. See here.
You would need to persist your Databricks table to some storage in order to access it. Change your Databricks job to dump the table to Blob storage as its final action (a rough sketch follows below). In the next step of your Data Factory pipeline, you can then read the dumped data from the storage account and process it further.
Another option may be Databricks Delta, although I have not tried this yet...
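A hedged sketch of that "dump to storage as the job's final action" step, with placeholder table, container and secret names:

# Final step of the Databricks job: persist the table to Blob storage for ADF to pick up
spark.conf.set(
    "fs.azure.account.key.mystorageaccount.blob.core.windows.net",
    dbutils.secrets.get(scope="my-scope", key="blob-key"),
)

(spark.table("my_database.my_table")
    .write.mode("overwrite")
    .parquet("wasbs://exports@mystorageaccount.blob.core.windows.net/my_table/"))

The Data Factory pipeline can then copy the Parquet files from that container into Azure SQL Database.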
If you register the table in the Databricks hive metastore, then ADF can read from it using the ODBC source in ADF, though this would require an IR (integration runtime).
Alternatively, you could write the table to external storage such as Blob or the lake. ADF can then read that file and push it to your SQL database.
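For the metastore route, a hedged one-liner with a hypothetical database/table name (the external-storage route looks like the sketch in the previous answer):

# Registers the data as a table in the Databricks hive metastore,
# which ADF can then read through its ODBC source via an IR
df.write.format("delta").mode("overwrite").saveAsTable("analytics.metabase_export")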