SFTP to Azure Blob Store - azure

I am trying to copy a file from SFTP to Azure Blob Storage using SFTPToWasbOperator, but I am getting an error. It seems like I'm doing something wrong, but I can't figure out what it is. Could someone please check the following code and see if there is anything wrong with it?
Airflow Logs
[2022-07-10, 13:08:48 UTC] {sftp_to_wasb.py:188} INFO - Uploading /SPi_ESG_Live/07-04-2022/DataPoint_2022_07_04.csv to wasb://testcotainer as https://test.blob.core.windows.net/testcotainer/DataPoint_2022_07_04.csv
[2022-07-10, 13:08:48 UTC] {_universal.py:473} INFO - Request URL: 'https://.blob.core.windows.net/***/test/https%3A//test.blob.core.windows.net/testcontainer/DataPoint_2022_07_04.csv'
Error msg
"azure.core.exceptions.ServiceRequestError: URL has an invalid label."
Airflow DAG
import os
from datetime import datetime
from airflow import DAG
from airflow.decorators import task
from airflow.providers.microsoft.azure.operators.wasb_delete_blob import WasbDeleteBlobOperator
from airflow.providers.microsoft.azure.transfers.sftp_to_wasb import SFTPToWasbOperator
from airflow.providers.sftp.hooks.sftp import SFTPHook
from airflow.providers.sftp.operators.sftp import SFTPOperator
AZURE_CONTAINER_NAME = "testcotainer"
BLOB_PREFIX = "https://test.blob.core.windows.net/testcotainer/"
SFTP_SRC_PATH = "/SPi_test_Live/07-04-2022/"
ENV_ID = os.environ.get("SYSTEM_TESTS_ENV_ID")
DAG_ID = "example_sftp_to_wasb"
with DAG(
    DAG_ID,
    schedule_interval=None,
    catchup=False,
    start_date=datetime(2021, 1, 1),  # Override to match your needs
) as dag:
    # [START how_to_sftp_to_wasb]
    transfer_files_to_azure = SFTPToWasbOperator(
        task_id="transfer_files_from_sftp_to_wasb",
        # SFTP args
        sftp_source_path=SFTP_SRC_PATH,
        # AZURE args
        container_name=AZURE_CONTAINER_NAME,
        blob_prefix=BLOB_PREFIX,
    )
    # [END how_to_sftp_to_wasb]

The problem is with BLOB_PREFIX: it is not a URL, it is the prefix that comes after the Azure container URL.
See this source example: https://airflow.apache.org/docs/apache-airflow-providers-microsoft-azure/stable/_modules/tests/system/providers/microsoft/azure/example_sftp_to_wasb.html
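A minimal sketch of the corrected constant, assuming you only want a plain name prefix in front of each uploaded blob (the prefix value here is a hypothetical example, not from the original post):

AZURE_CONTAINER_NAME = "testcotainer"
BLOB_PREFIX = "SPi_ESG_Live/"  # a plain prefix, not a full https:// URL

transfer_files_to_azure = SFTPToWasbOperator(
    task_id="transfer_files_from_sftp_to_wasb",
    sftp_source_path=SFTP_SRC_PATH,
    container_name=AZURE_CONTAINER_NAME,
    # Blobs are created as "<blob_prefix><file_name>" inside the container.
    blob_prefix=BLOB_PREFIX,
)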

Related

Add RUN_ID as part of airflow logs

I have been reading a lot about logging in Airflow and experimenting a lot, but I could not achieve what I am looking for. I want to customize the logging for Airflow. I have a lot of DAGs, and each DAG has multiple tasks corresponding to one alert_name. The DAGs run hourly and push logs to S3. If something goes wrong and debugging is required, it is very tough to look for logs in S3. I want to customize the logs so I can search the log lines by RUN_ID and alert_name.
I have the following piece of code.
from airflow import DAG # noqa
from datetime import datetime
from datetime import timedelta
from airflow.operators.python_operator import PythonOperator
import y
import logging
log = logging.getLogger(__name__)
default_args = {
    'owner': 'SRE',
    'execution_timeout': timedelta(minutes=150)
}

dag = DAG(
    dag_id='new_dag',
    default_args=default_args,
    start_date=datetime(year=2021, month=11, day=22),
    schedule_interval=timedelta(days=1),
    catchup=False,
    max_active_runs=3,
)

def implement_alert_logic(alert_name):
    log.info(f'In the implementation for {alert_name}')

def myfunc(**wargs):
    for alert in ['alert_1', 'alert_2', 'alert_3']:
        log.info(f'Executing logic for {alert}')
        implement_alert_logic(alert)

t1 = PythonOperator(
    task_id='testing_this',
    python_callable=myfunc,
    provide_context=True,
    dag=dag)

t2 = PythonOperator(
    task_id='testing_this2',
    python_callable=myfunc,
    provide_context=True,
    dag=dag)

t1 >> t2
It prints something like
[2022-06-13, 08:16:54 UTC] {myenv.py:32} INFO - Executing logic for alert_1
[2022-06-13, 08:16:54 UTC] {myenv.py:27} INFO - In the implementation for alert_1
[2022-06-13, 08:16:54 UTC] {myenv.py:32} INFO - Executing logic for alert_2
[2022-06-13, 08:16:54 UTC] {myenv.py:27} INFO - In the implementation for alert_2
[2022-06-13, 08:16:54 UTC] {myenv.py:32} INFO - Executing logic for alert_3
[2022-06-13, 08:16:54 UTC] {myenv.py:27} INFO - In the implementation for alert_3
The actual code is much more complex and sophisticated than this; that's why I need faster and more customized debugging logs.
What I am trying to achieve is to customize the log formatter and add RUN_ID and alert_name as part of the log message.
Logs should be something like this:
[2022-06-13, 08:16:54 UTC] [manual__2022-06-13T08:16:54.103265+00:00] {myenv.py:32} INFO - Executing logic for alert_1
[2022-06-13, 08:16:54 UTC] [manual__2022-06-13T08:16:54.103265+00:00] [alert1]{myenv.py:32} INFO - In the implementation for alert_1
You are already sending the context to the callable, so just make use of it:
def myfunc(**wargs):
    for alert in ['alert_1', 'alert_2', 'alert_3']:
        log.info(f"[{wargs['run_id']}] [{alert}] Executing logic for {alert}")
        implement_alert_logic(alert)
Then send the whole context, or just the run_id, to the implement_alert_logic function as well.
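A minimal sketch of that last step, assuming implement_alert_logic only needs the run_id (the extra parameter is an illustration, not part of the original code):

def implement_alert_logic(alert_name, run_id=None):
    # Put both the run_id and the alert name into the log line so it is searchable.
    log.info(f"[{run_id}] [{alert_name}] In the implementation for {alert_name}")

def myfunc(**wargs):
    for alert in ['alert_1', 'alert_2', 'alert_3']:
        log.info(f"[{wargs['run_id']}] [{alert}] Executing logic for {alert}")
        implement_alert_logic(alert, run_id=wargs['run_id'])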

How to pass multiple delimiter in Python for BigQuery storage using Cloud Function

I am trying to load multiple csv files into a BigQuery table. For some csv files the delimiter is a comma and for some it is a semicolon. Is there any way to pass multiple delimiters in the job config?
job_config = bigquery.LoadJobConfig(
    autodetect=True,
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter=",",
    write_disposition="WRITE_APPEND",
    skip_leading_rows=1,
)
Thanks
Ritz
I deployed the following code in Cloud Functions for this purpose. I am using “Cloud Storage” as the trigger and “Finalize/Create” as the event type. The code works successfully for running BigQuery load jobs on comma- and semicolon-delimited files.
main.py
def hello_gcs(event, context):
    from google.cloud import bigquery
    from google.cloud import storage
    import subprocess

    # Construct BigQuery and Cloud Storage client objects.
    client = bigquery.Client()
    client1 = storage.Client()
    bucket = client1.get_bucket('Bucket-Name')
    blob = bucket.get_blob(event['name'])

    # TODO(developer): Set table_id to the ID of the table to create.
    table_id = "ProjectID.DatasetName.TableName"

    with open("/tmp/z", "wb") as file_obj:
        blob.download_to_file(file_obj)

    # Replace every semicolon with a comma so all files end up comma-delimited
    # (note the trailing "g" flag, so every occurrence on a line is replaced).
    subprocess.call(["sed", "-i", "-e", "s/;/,/g", "/tmp/z"])

    job_config = bigquery.LoadJobConfig(
        autodetect=True,
        skip_leading_rows=1,
        field_delimiter=",",
        write_disposition="WRITE_APPEND",
        # The source format defaults to CSV, so the line below is optional.
        source_format=bigquery.SourceFormat.CSV,
    )

    with open("/tmp/z", "rb") as source_file:
        source_file.seek(0)
        # Make an API request.
        job = client.load_table_from_file(source_file, table_id, job_config=job_config)

    job.result()  # Waits for the job to complete.
requirements.txt
# Function dependencies, for example:
# package>=version
google-cloud
google-cloud-bigquery
google-cloud-storage
Here, I am substituting “;” with “,” using the sed command. One point to note: when writing to a file in Cloud Functions, we need to give the path as /tmp/file_name, as /tmp is the only place in Cloud Functions where writing to a file is allowed. It is also assumed that there are no additional commas or semicolons in the files other than the delimiter.
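As a side note (not part of the original answer), the same normalization could be done in plain Python instead of shelling out to sed, for example:

# Rewrite /tmp/z in place, turning every semicolon into a comma
# before the load job reads the file.
with open("/tmp/z", "r") as f:
    normalized = f.read().replace(";", ",")
with open("/tmp/z", "w") as f:
    f.write(normalized)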

How to efficiently read the data lake files' metadata [duplicate]

This question already has answers here:
script to get the file last modified date and file name pyspark
(3 answers)
Closed 1 year ago.
I want to read the last modified datetime of the files in the data lake in a Databricks script. If I could read it efficiently as a column when reading data from the data lake, it would be perfect.
Thank you:)
UPDATE:
If you're working in Databricks: since Databricks Runtime 10.4, released on Mar 18, 2022, the dbutils.fs.ls() command returns the “modificationTime” of folders and files as well.
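A minimal sketch, assuming a Databricks notebook where dbutils is available (the path is a placeholder):

# On Databricks Runtime 10.4+ each FileInfo returned by dbutils.fs.ls
# carries a modificationTime field (epoch milliseconds).
for f in dbutils.fs.ls("abfss://<container-name>@<account-name>.dfs.core.windows.net/<file-path>/"):
    print(f.name, f.modificationTime)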
Regarding the issue, please refer to the following code:
URI = sc._gateway.jvm.java.net.URI
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem

conf = sc._jsc.hadoopConfiguration()
conf.set(
    "fs.azure.account.key.<account-name>.dfs.core.windows.net",
    "<account-access-key>")

fs = Path('abfss://<container-name>@<account-name>.dfs.core.windows.net/<file-path>/').getFileSystem(sc._jsc.hadoopConfiguration())
status = fs.listStatus(Path('abfss://<container-name>@<account-name>.dfs.core.windows.net/<file-path>/'))

for i in status:
    print(i)
    print(i.getModificationTime())
We can get those details using Python code, as there is no direct method to get the modified time and date of the files in the data lake.
Here is the code:
from pyspark.sql.functions import col
from azure.storage.blob import BlockBlobService
from datetime import datetime
import os.path

block_blob_service = BlockBlobService(account_name='account-name', account_key='account-key')
container_name = 'container-firstname'
second_conatainer_name = 'container-Second'
#block_blob_service.create_container(container_name)
generator = block_blob_service.list_blobs(container_name, prefix="Recovery/")
report_time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

for blob in generator:
    # Fetch the blob properties once per blob instead of making separate calls.
    props = block_blob_service.get_blob_properties(container_name, blob.name).properties
    last_modified = props.last_modified
    file_size = props.content_length
    line = container_name + '|' + second_conatainer_name + '|' + blob.name + '|' + str(file_size) + '|' + str(last_modified) + '|' + str(report_time)
    print(line)
For more details, refer to the SO thread which addresses a similar issue.

Airflow Composer custom module not found - PythonVirtualenvOperator

I have a very simple Airflow instance set up in GCP Composer. It has the bucket and everything. I want to set up each DAG to run in its own environment with PythonVirtualenvOperator.
The structure in it is as follows:
dags ->
------> code_snippets/
----------> print_name.py - has a function called print_my_name() which prints a string to the terminal
------> test_dag.py
test_dag.py:
import datetime

from airflow.operators.python_operator import PythonVirtualenvOperator
from airflow import DAG

def main_func():
    import pandas as pd
    import datetime
    from code_snippets.print_name import print_my_name

    print_my_name()

    df = pd.DataFrame(data={
        'date': [str(datetime.datetime.now().date())]
    })
    print(df)

default_args = {
    'owner': 'test_dag',
    'start_date': datetime.datetime(2020, 7, 3, 5, 1, 00),
    'concurrency': 1,
    'retries': 0
}

dag = DAG('test_dag', description='Test DAGS with environment',
          schedule_interval='0 5 * * *',
          default_args=default_args, catchup=False)

test_the_dag = PythonVirtualenvOperator(
    task_id="test_dag",
    python_callable=main_func,
    python_version='3.8',
    requirements=["DateTime==4.3", "numpy==1.20.2", "pandas==1.2.4", "python-dateutil==2.8.1", "pytz==2021.1",
                  "six==1.15.0", "zope.interface==5.4.0"],
    system_site_packages=False,
    dag=dag,
)

test_the_dag
Everything works until I start importing custom modules - having an __init__.py does not help, it still gives out the same error, which in my case is:
from code_snippets.print_name import print_my_name
ModuleNotFoundError: No module named 'code_snippets'
I also have a local instance of Airflow and I experience the same issue. I have tried moving things around, adding the path to the folders to PATH, adding __init__.py files in the directories, and even changing the import statements, but the error persists as long as I am importing custom modules.
Setting system_site_packages to False or True also has no effect.
Is there a fix for this, or a way to work around it, so I can utilize the custom code I have separated outside of the DAGs?
Airflow Version : 1.10.14+composer
Python version for Airflow is set to: 3
The implementation of airflow.operators.python.PythonVirtualenvOperator is such that the python_callable is expected not to reference external names.
Any non-standard-library packages used in the callable must be declared as external dependencies via the operator's requirements.
If you need to use code_snippets, publish it as a package, either to PyPI or to a VCS repository, and add it to the list of packages in the requirements kwarg of the PythonVirtualenvOperator.
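A minimal sketch of what that could look like, assuming code_snippets has been published to a (hypothetical) Git repository; the URL, package name, and version tag below are placeholders:

test_the_dag = PythonVirtualenvOperator(
    task_id="test_dag",
    python_callable=main_func,
    python_version='3.8',
    requirements=[
        "pandas==1.2.4",
        # Hypothetical VCS dependency; point this at your real repository or PyPI package.
        "code-snippets @ git+https://example.com/your-org/code_snippets.git@v0.1.0",
    ],
    system_site_packages=False,
    dag=dag,
)

With the package installed into the virtualenv, the from code_snippets.print_name import print_my_name statement inside main_func can resolve.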

How to read a csv file using pandas and Cloud Functions in GCP?

I am trying to read a csv file which was uploaded to GCS, using Cloud Functions in GCP, and I want to work with the csv data as a DataFrame. But I can't read the csv file using pandas.
This is the code that reads the csv file on GCS using Cloud Functions:
from io import BytesIO

import pandas as pd
from google.cloud import storage as gcs

def read_csvfile(data, context):
    try:
        bucket_name = "my_bucket_name"
        file_name = "my_csvfile_name.csv"
        project_name = "my_project_name"

        # create gcs client
        client = gcs.Client(project_name)
        bucket = client.get_bucket(bucket_name)

        # create blob and download its contents
        blob = gcs.Blob(file_name, bucket)
        content = blob.download_as_string()
        train = pd.read_csv(BytesIO(content))
        print(train.head())
    except Exception as e:
        print("error:{}".format(e))
When I ran my Python code, I got the following error:
No columns to parse from file
Some websites say that this error means I read an empty csv file, but I actually uploaded a non-empty csv file.
So how can I solve this problem?
Please give me your help. Thanks.
----added at 2020/08/08-------
Thank you for giving me your help!
But in the end I could not read the csv file using your code... I still get the error, No columns to parse from file.
So I tried a new way of reading the csv file, as bytes. The new Python code to read the csv file is below.
main.py
from google.cloud import storage
import pandas as pd
import io
import csv
from io import BytesIO

def check_columns(data, context):
    try:
        object_name = data['name']
        bucket_name = data['bucket']

        storage_client = storage.Client()
        bucket = storage_client.bucket(bucket_name)
        blob = bucket.blob(object_name)
        data = blob.download_as_string()

        # read the uploaded csv file as bytes
        f = io.StringIO(str(data))
        df = pd.read_csv(f, encoding="shift-jis")
        print("df:{}".format(df))
        print("df.columns:{}".format(df.columns))
        print("The number of columns:{}".format(len(df.columns)))
    except Exception as e:
        print("error:{}".format(e))
requirements.txt
Click==7.0
Flask==1.0.2
itsdangerous==1.1.0
Jinja2==2.10
MarkupSafe==1.1.0
Pillow==5.4.1
qrcode==6.1
six==1.12.0
Werkzeug==0.14.1
google-cloud-storage==1.30.0
gcsfs==0.6.2
pandas==1.1.0
The output I got is below.
df:Empty DataFrame
Columns: [b'Apple, Lemon, Orange, Grape]
Index: []
df.columns:Index(['b'Apple', 'Lemon', 'Orange', 'Grape'])
The number of columns:4
So I could only read the first record of the csv file, and it ended up as df.columns!? I could not get the other records in the csv file... and the first "column" is not a column at all but a normal record.
So how can I get the records in the csv file as a DataFrame using pandas?
Could you help me again? Thank you.
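As a side note (not part of the original answer), a likely reason for that output is that str(data) turns the downloaded bytes into a literal "b'...'" string in which the newlines are escaped, so pandas only sees a single line. Decoding the bytes instead might look like this (the shift-jis encoding is taken from the code above):

# Decode the raw bytes so newlines stay real newlines and pandas can split the rows.
f = io.StringIO(data.decode("shift-jis"))
df = pd.read_csv(f)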
Pandas, since version 0.24.1, can directly read a Google Cloud Storage URI, for example:
gs://awesomefakebucket/my.csv
The service account attached to your function must have access to read the CSV file.
Please feel free to test and modify this code. I used Python 3.7.
function.py
from google.cloud import storage
import pandas as pd

def hello_world(request):
    # it is mandatory to initialize the storage client
    client = storage.Client()

    # please change the file's URI
    temp = pd.read_csv('gs://awesomefakebucket/my.csv', encoding='utf-8')
    print(temp.head())

    return 'check the results in the logs'
requirements.txt
google-cloud-storage==1.30.0
gcsfs==0.6.2
pandas==1.1.0
