Connect Pentaho to Azure Blob Storage or ADLS

Is there any way to connect Pentaho to Azure Blob Storage or ADLS? I am not able to find any option.

Go to the file in Azure Blob Storage, generate a SAS token and URL, and copy the URL only. In PDI, add a Hadoop File Input step. Double-click the step, select Local for the Environment, and insert the Azure URL in the File/Folder field; that's it. You should see the file in PDI.
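If you prefer to generate the SAS URL programmatically rather than through the portal, here is a minimal Python sketch using the azure-storage-blob SDK (my addition, not part of the original answer); the account, container, blob, and key values are placeholders.

    from datetime import datetime, timedelta, timezone
    from azure.storage.blob import BlobSasPermissions, generate_blob_sas

    # Placeholder values -- replace with your own storage account details.
    account_name = "mystorageaccount"
    account_key = "<storage-account-key>"
    container_name = "mycontainer"
    blob_name = "data/input.csv"

    # Create a read-only SAS token that expires in 24 hours.
    sas_token = generate_blob_sas(
        account_name=account_name,
        container_name=container_name,
        blob_name=blob_name,
        account_key=account_key,
        permission=BlobSasPermissions(read=True),
        expiry=datetime.now(timezone.utc) + timedelta(hours=24),
    )

    # This is the URL you would paste into the File/Folder field of the Hadoop File Input step.
    sas_url = f"https://{account_name}.blob.core.windows.net/{container_name}/{blob_name}?{sas_token}"
    print(sas_url)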
https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.5/bk_cloud-data-access/content/authentication-wasb.html
This article explains how to connect the Pentaho Server to a Microsoft Azure HDInsight cluster. Pentaho supports both the HDFS file system and the WASB (Windows Azure Storage BLOB) extension for Azure HDInsight.

Related

How to create an API in Azure that uploads files from SFTP to Azure Data Lake Storage

My requirement is to create an API/Azure Function for uploading files from an SFTP server to Azure Data Lake Storage. Are there any reference projects or guidance? Please guide me.
Yes, you can connect to an SFTP server and transfer files from it to Data Lake Storage.
Please check whether these findings are helpful.
When you upload files to Data Lake Storage through an SFTP connection without Azure AD authentication, set the connection's authentication type to Basic and disable SSH host key validation.
For example, when uploading files to Data Lake Storage through SFTP, you need to configure the SFTP connection in the Azure portal and provide the connection details.
For the complete configuration and more information, see the links below (a rough code sketch follows after them):
Microsoft documentation | Copy and transform data in SFTP server using Azure Data Factory & Use .NET to manage directories and files in Azure Data Lake Storage Gen2
SO thread | Copy files from ADLS Gen2 to SFTP using Data Factory without SSH
GitHub blog | How to ingest data from an SFTP source to your Azure Data Lake
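If you go the Azure Function / custom API route instead of Data Factory, a rough Python sketch of the transfer might look like the following. The paramiko and azure-storage-file-datalake libraries, and all connection details, are my own illustrative choices, not something from the original answer.

    import paramiko
    from azure.storage.filedatalake import DataLakeServiceClient

    # Placeholder connection details -- replace with your own.
    SFTP_HOST = "sftp.example.com"
    SFTP_USER = "user"
    SFTP_PASSWORD = "password"
    REMOTE_PATH = "/outbound/report.csv"

    ADLS_ACCOUNT_URL = "https://mystorageaccount.dfs.core.windows.net"
    ADLS_ACCOUNT_KEY = "<storage-account-key>"
    FILE_SYSTEM = "raw"               # ADLS Gen2 container (file system)
    TARGET_PATH = "incoming/report.csv"

    def copy_sftp_file_to_adls():
        # Download the file from the SFTP server into memory.
        transport = paramiko.Transport((SFTP_HOST, 22))
        transport.connect(username=SFTP_USER, password=SFTP_PASSWORD)
        sftp = paramiko.SFTPClient.from_transport(transport)
        with sftp.open(REMOTE_PATH, "rb") as remote_file:
            data = remote_file.read()
        sftp.close()
        transport.close()

        # Upload the bytes to ADLS Gen2.
        service = DataLakeServiceClient(account_url=ADLS_ACCOUNT_URL, credential=ADLS_ACCOUNT_KEY)
        file_client = service.get_file_system_client(FILE_SYSTEM).get_file_client(TARGET_PATH)
        file_client.upload_data(data, overwrite=True)

    if __name__ == "__main__":
        copy_sftp_file_to_adls()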

Get the full path of a file in Azure Synapse Studio using PySpark

I need to process a PDF file from my storage account. In the local environment, we would get the path of the file, e.g. 'C:\path\file1.pdf'. But how can I access the data in an Azure storage account from Azure Synapse Studio with PySpark (Python)?
Manual method: if you want to get the full path in the storage account manually, use one of the following URI formats (a short PySpark usage example follows):
For ADLS Gen2 accounts: 'abfss://<FileSystemName>@<StorageName>.dfs.core.windows.net/FilePath/FileName'
For Azure Blob accounts: 'wasbs://<ContainerName>@<StorageName>.blob.core.windows.net/FilePath/FileName'
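For example, a minimal PySpark snippet in a Synapse notebook that reads a file by its full ADLS Gen2 path could look like this (the account, container, and file names are placeholders, and it assumes the workspace has access to the storage account):

    # Full path built from the pattern above; replace the placeholders with your values.
    # The `spark` session is predefined in Synapse notebooks.
    path = "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/data/file1.pdf"

    # PDFs are binary, so read them with the binaryFile data source;
    # for CSV or Parquet you would use spark.read.csv(path) or spark.read.parquet(path) instead.
    df = spark.read.format("binaryFile").load(path)
    df.select("path", "length").show(truncate=False)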
Automatic method: here are the steps to get the full path of a file in Azure Synapse Studio using PySpark.
You can create a linked service to connect to the external data (Azure Blob Storage / ADLS Gen1 / ADLS Gen2).
Step 1: You can analyze the data in your workspace's default ADLS Gen2 account, or you can link an ADLS Gen2 or Blob Storage account to your workspace through "Manage" > "Linked services" > "New".
Step 2: Once a connection is created, the underlying data of that connection is available for analysis in the Data hub or for pipeline activities in the Integrate hub.
Step 3: You have now connected to Azure Data Lake Gen2 without specifying any path manually.
Reference: Azure Synapse Analytics - Analyze data in a storage account

How to access the blob storage in Microsoft Azure HDInsight?

I have just created a Spark based HDInsight cluster. I have selected a blob storage that I created before, while creating the cluster. However, I have no idea how to access that blob storage from within the VM created there. I have read many different tutorials, but couldn't get a proper answer.
I can see that the default container's folders/files correspond to the HDFS directories in the VM. Is it possible to add the blob storage to the default container, so that I can also access it just like an HDFS directory?
You can access blobs using Azure PowerShell cmdlets or the Azure CLI.
Refer: Access blobs in Azure HDInsight.
If you want to access blobs through a GUI, use Azure Storage Explorer.
Refer: Azure Storage Explorer.
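As an alternative to the PowerShell/CLI route (not mentioned in the original answer), you could also list the cluster's blobs from Python with the azure-storage-blob SDK; the account name, key, and container below are placeholders for the storage account attached to the HDInsight cluster.

    from azure.storage.blob import BlobServiceClient

    # Placeholder credentials -- use the storage account attached to the HDInsight cluster.
    service = BlobServiceClient(
        account_url="https://mystorageaccount.blob.core.windows.net",
        credential="<storage-account-key>",
    )

    # List everything in the cluster's default container.
    container = service.get_container_client("mycontainer")
    for blob in container.list_blobs():
        print(blob.name)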

Connect Pentaho to Azure Blob storage

I was recently thrown into dealing with databases and Microsoft Azure Blob storage.
As I am completely new to this field, I am having some trouble:
I can't figure out how to connect to the Blob storage from Pentaho and I could not find any good information online on this topic either.
I would be glad for any information on how to set up this connection.
You can update the core-site.xml like so: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/bk_cloud-data-access/content/authentication-wasb.html
This will give you access to the Azure Blob Storage account.
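Concretely, the WASB authentication described in that guide comes down to adding a property along these lines to core-site.xml, which the PDI Hadoop configuration then reads; the account name is a placeholder and the value is your storage account access key (a sketch of the idea, not the full configuration).

    <property>
      <name>fs.azure.account.key.mystorageaccount.blob.core.windows.net</name>
      <value><!-- storage account access key (placeholder) --></value>
    </property>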
I figured it out eventually.
Pentaho provides an HTTP element in which you can, among other things, specify a URL.
In Microsoft Azure Blob Storage you can generate a SAS token. If you use the URL built from the storage resource URI and the SAS token (i.e. https://<account>.blob.core.windows.net/<container>/<path>?<sas-token>) as input for the URL field of the HTTP element, Pentaho can access the respective file in Blob Storage.

How can I upload larger files into an Azure Hadoop cluster?

How can I upload larger files into an Azure Hadoop cluster?
Is there a way I can browse to the /example/apps directory in the Hadoop cluster over a Remote Desktop connection so that I can copy the files?
HDInsight uses Windows Azure Blob storage (WASB). There are several ways you can upload data to WASB (a small Python sketch follows after this list):
AzCopy
Windows Azure PowerShell
3rd-party tools, including Azure Storage Explorer, Cloud Storage Studio 2, CloudXplorer, and so on
The Hadoop command line (you must enable RDP first)
For more information, see http://www.windowsazure.com/en-us/manage/services/hdinsight/howto-upload-data-to-hdinsight/.
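Beyond the tools listed above, a small Python sketch using the azure-storage-blob SDK (my addition, not from the original answer) can also push a large local file into the WASB-backed storage account; the account, key, container, and file names are placeholders.

    from azure.storage.blob import BlobServiceClient

    # Placeholder values -- replace with the storage account backing your HDInsight cluster.
    service = BlobServiceClient(
        account_url="https://mystorageaccount.blob.core.windows.net",
        credential="<storage-account-key>",
    )
    blob_client = service.get_blob_client(container="mycontainer", blob="example/apps/bigfile.dat")

    # Stream the local file so large uploads are chunked rather than loaded into memory at once.
    with open("bigfile.dat", "rb") as data:
        blob_client.upload_blob(data, overwrite=True)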
Copy the files to Azure Blob Storage via HTTP/HTTPS and then MapReduce them from Hadoop on Azure (HDInsight) by pointing at the HTTP URL of the uploaded file(s).
Cheers
