Trying to move data from Teradata to Snowflake. Have created a process to run TPT scripts for each table to generate files for each table.
Files are also split to achieve concurrency while running COPY INTO in snowflake.
Need to understand what is the best way to move those Files from On Prem Linux Machine to Azure ADLS. Considering files in Terabyte size.
Does Azure provide any mechanism to move these files or can we directly create files on ADLS from Teradata?
The best approach to load data to snowflake via external table if you have the Azure Blob Storage or ADLS Gen2. Load data to blob storage and create external table and then load data data to snowflake.
Related
So, I have some parquet files stored in azure devops containers inside a blob storaged. I used data factory pipeline with the "copy data" connector to extract the data from a on premise Oracle database.
I'm using Synapse Analytics to do a sql query that uses some of the parquet files store in the blob container and I want to save the results of the query in another blob. Which Synapse connector can I use the make this happen? To do the query I'm using the "develop" menu inside Synapse Analytics.
To persist the results of a serverless SQL query, use CREATE EXTERNAL TABLE AS SELECT (CETAS). This will create a physical copy of the result data in your storage account as a collection of files in a folder, so you cannot specify how many files nor the naming scheme of the files.
Usecase: I have data files of varying size copied to a specific SFTP folder periodically (Daily/Weekly). All these files needs to be validated and processed. Then write them to related tables in Azure SQL. Files are of CSV format and are actually a flat text file which directly corresponds to a specific Table in Azure SQL.
Implementation:
Planning to use Azure Data Factory. So far, from my reading I could see that I can have a Copy pipeline in-order to copy the data from On-Prem SFTP to Azure Blob storage. As well, we can have SSIS pipeline to copy data from On-Premise SQL Server to Azure SQL.
But I don't see a existing solution to achieve what I am looking for. can someone provide some insight on how can I achieve the same?
I would try to use Data Factory with a Data Flow to validate/process the files (if possible for your case). If the validation is too complex/depends on other components, then I would use functions and put the resulting files to blob. The copy activity is also able to import the resulting CSV files to SQL server.
You can create a pipeline that does the following:
Copy data - Copy Files from SFTP to Blob Storage
Do Data processing/validation via Data Flow
and sink them directly to SQL table (via Data Flow sink)
Of course, you need an integration runtime, that can access the on-prem server - either by using VNet integration or by using the self hosted IR. (If it is not publicly accessible)
I am trying to use the Copy Data Activity to copy data from Databricks DBFS to another place on the DBFS, but I am not sure if this is possible.
When I select Azure Delta Storage as a dataset source or sink, I am able to access the tables in the cluster and preview the data, but when validating it says that the tables are not delta tables (which they aren't, but I don't seem to acsess the persistent data on DBFS)
Furthermore, what I want to access is the DBFS, not the cluster tables. Is there an option for this?
In my project, we have been using BLOBs on Azure. We were able to upload ORC files into an existing BLOB container named, say, student_dept in quite a handy manner using:
hdfs fs -copyFromLocal myfolder/student_remarks/*.orc wasbs://student_dept#universitygroup.blob.core.windows.net/DEPT/STUDENT_REMARKS
And we have a Hive EXTERNAL table: STUDENT_REMARKS created on the student_dept BLOB. This way, we can very easily access our data from cloud using Hive queries.
Now, we're trying to shift from BLOB storage to ADLS Gen2 for storing the ORC files and I'm tring to understand the impact this change would have on our upload/data retrieval process.
I'm totally new to Azure, and what I want to know now is how do I upload the ORC files from my HDFS to ADLS Gen2 stoage? How different is it?
Does the same command with the different destination (ADLS G2 instead of BLOB) work or is there something extra that needs to be done in order to upload data to ADLS G2?
Can someone please help me with your inputs on this?
I didn't give it a try, but as per doc like this and this, you can use command like below for ADLS GEN2:
hdfs dfs -copyFromLocal myfolder/student_remarks/*.orc
abfs://student_dept#universitygroup.dfs.core.windows.net/DEPT/STUDENT_REMARKS
I have a table into an Azure Databricks Cluster, i would like to replicate this data into an Azure SQL Database, to let another users analyze this data from Metabase.
Is it possible to acess databricks tables through Azure Data factory?
No, unfortunately not. Databricks tables are typically temporary and last as long as your job/session is running. See here.
You would need to persist your databricks table to some storage in order to access it. Change your databricks job to dump the table to Blob storage as it's final action. In the next step of your data factory job, you can then read the dumped data from the storage account and process further.
Another option may be databricks delta although I have not tried this yet...
If you register the table in the Databricks hive metastore then ADF could read from it using the ODBC source in ADF. Though this would require an IR.
Alternatively you could write the table to external storage such as blob or lake. ADF can then read that file and push it to your sql database.