How to directly read excel file from s3 with pandas in airflow dag? - excel

I am trying to read an Excel file from S3 inside an Airflow DAG with Python, but it does not seem to work. It is very weird because it works when I read it from outside Airflow with pd.read_excel(s3_excel_path).
What I did:
Set AWS credentials in Airflow (this works well, as I can list my S3 bucket)
Installed pandas and s3fs in the Docker environment where I run Airflow
Tried to read the file with pd.read_excel(s3_excel_path)
As I said, it works when I try it outside of Airflow. Moreover, I don't get any error: the DAG just keeps running indefinitely (at the step where it is supposed to read the file) and nothing happens, even if I wait 20 minutes.
(I would like to avoid downloading the file from S3, processing it and then uploading it back, which is why I am trying to read it directly from S3.)
Note: it does not work with CSV either.
EDIT: Likewise, I can't save my DataFrame directly to S3 with df.to_csv('s3_path') in an Airflow DAG, while I can do it in plain Python.

To read data files stored in S3 with pandas, you have two options: download them using boto3 (or the AWS CLI) and read the local copies, which is the solution you are not looking for, or use the s3fs API supported by pandas:
import os
import pandas as pd

AWS_S3_BUCKET = os.getenv("AWS_S3_BUCKET")
AWS_ACCESS_KEY_ID = os.getenv("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = os.getenv("AWS_SECRET_ACCESS_KEY")
AWS_SESSION_TOKEN = os.getenv("AWS_SESSION_TOKEN")

key = "path/to/excel/file"

books_df = pd.read_excel(
    f"s3://{AWS_S3_BUCKET}/{key}",
    storage_options={
        "key": AWS_ACCESS_KEY_ID,
        "secret": AWS_SECRET_ACCESS_KEY,
        "token": AWS_SESSION_TOKEN,
    },
)
To use this solution, you need to install s3fs and apache-airflow-providers-amazon:
pip install s3fs
pip install apache-airflow-providers-amazon
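If your AWS credentials are stored in an Airflow connection rather than in environment variables, a minimal sketch (assuming a connection id of aws_default; the function name is just for illustration) could pull them from S3Hook and pass them to storage_options:

import pandas as pd
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

def read_excel_from_s3(bucket: str, key: str) -> pd.DataFrame:
    # Reuse the credentials stored in the Airflow connection (aws_default is an assumption).
    creds = S3Hook(aws_conn_id="aws_default").get_credentials()
    return pd.read_excel(
        f"s3://{bucket}/{key}",
        storage_options={
            "key": creds.access_key,
            "secret": creds.secret_key,
            "token": creds.token,
        },
    )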

Related

Cannot import CSV file into h2o from Databricks cluster DBFS

I have successfully installed h2o on my AWS Databricks cluster, and then successfully started the h2o server with:
h2o.init()
When I attempt to import the iris CSV file that is stored in my Databricks DBFS:
train, valid = h2o.import_file(path="/FileStore/tables/iris.csv").split_frame(ratios=[0.7])
I get an H2OResponseError: Server error water.exceptions.H2ONotFoundArgumentException
The CSV file is absolutely there; in the same Databricks notebook, I am able to read it directly into a DataFrame and view the contents using the exact same fully qualified path:
df_iris = ks.read_csv("/FileStore/tables/iris.csv")
df_iris.head()
I've also tried calling:
h2o.upload_file("/FileStore/tables/iris.csv")
but to no avail; I get H2OValueError: File /FileStore/tables/iris.csv does not exist. I've also tried uploading the file directly from my local computer (C drive), but that doesn't succeed either.
I've tried not using the fully qualified path, and just specifying the file name, but I get the same errors. I've read through the H2O documentation and searched the web, but cannot find anyone who has ever encountered this problem before.
Can someone please help me?
Thanks.
H2O may not understand that this path is on DBFS. You may try to specify the path /dbfs/FileStore/tables/iris.csv - in this case it will be read as a "local file" - or specify the full path with a scheme, like dbfs:/FileStore/tables/iris.csv - but this may require DBFS-specific jars for H2O.
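A minimal sketch of the first suggestion, assuming the cluster exposes DBFS through the standard /dbfs/ FUSE mount:

import h2o

h2o.init()

# /dbfs/... is the local FUSE mount of DBFS, so H2O reads it as a plain local file.
iris = h2o.import_file(path="/dbfs/FileStore/tables/iris.csv")
train, valid = iris.split_frame(ratios=[0.7])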

get list of tables in database using boto3

I'm trying to get a list of the tables from a database in my AWS Data Catalog, using boto3. I'm running the code below on AWS, in a SageMaker notebook. It runs forever (over 30 minutes) and doesn't return any results. The test_db only has 4 tables in it. My goal is to run similar code as part of an AWS Glue ETL job, in an edited AWS ETL job script. Does anyone see what the issue might be, or can suggest how to do this?
Code:
import boto3
from pprint import pprint

glue = boto3.client('glue', region_name='us-east-2')

response = glue.get_tables(
    DatabaseName='test_db'
)
pprint(response['TableList'])

# List DynamoDB tables through a boto3 session
session = boto3.Session(region_name="us-east-2")
db = session.resource('dynamodb', region_name="us-east-2")
tables = list(db.tables.all())
print(tables)
Resource: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/dynamodb.html
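For what it's worth, a minimal sketch of listing the catalog tables with the Glue API alone (the database name and region are the ones from the question; the paginator guards against truncated responses):

import boto3

glue = boto3.client("glue", region_name="us-east-2")

# get_tables is a paginated operation, so walk every page of results.
paginator = glue.get_paginator("get_tables")
for page in paginator.paginate(DatabaseName="test_db"):
    for table in page["TableList"]:
        print(table["Name"])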

Setting environment variable in python for google bigquery in Mac

Before launching the Jupyter Notebook, I run the command below in the Terminal to set the Google Application Credentials:
export GOOGLE_APPLICATION_CREDENTIALS="/Users/mac/Desktop/Bigquery-Key.json"
Then I set the configuration below in the Jupyter Notebook:
%load_ext google.cloud.bigquery
# Imports the Google Cloud Client Library
from google.cloud import bigquery
# Instantiates a Client for Bigquery Service
bigquery_client = bigquery.Client()
Now, I want to write a Python script (.py file) which does both tasks instead of using the Terminal.
How can this be done? Kindly advise.
Thanks
You can change the environment within a Python script. The environment is stored in the dictionary os.environ:
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/Users/mac/Desktop/Bigquery-Key.json"
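Putting the two steps together, a minimal sketch of a standalone .py script (reusing the key path from the question) might look like this:

import os
from google.cloud import bigquery

# Equivalent of `export GOOGLE_APPLICATION_CREDENTIALS=...` in the Terminal.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/Users/mac/Desktop/Bigquery-Key.json"

# Instantiates a client for the BigQuery service; it picks up the variable set above.
bigquery_client = bigquery.Client()
print(bigquery_client.project)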

Read CSV file from AWS S3

I have an EC2 instance running pyspark and I'm able to connect to it (ssh) and run interactive code within a Jupyter Notebook.
I have an S3 bucket with a CSV file that I want to read. When I attempt to read it with:
spark = SparkSession.builder.appName('Basics').getOrCreate()
df = spark.read.csv('https://s3.us-east-2.amazonaws.com/bucketname/filename.csv')
This throws a long Python error message and then something related to:
Py4JJavaError: An error occurred while calling o131.csv.
Specify the S3 path along with the access key and secret key as follows:
's3n://<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>#my.bucket/folder/input_data.csv'
Access key-related information can be introduced in the typical username + password manner for URLs. As a rule, the access protocol should be s3a, the successor to s3n (see Technically what is the difference between s3n, s3a and s3?). Putting this together, you get
spark.read.csv("s3a://<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>#bucketname/filename.csv")
As an aside, some Spark execution environments, e.g., Databricks, allow S3 buckets to be mounted as part of the file system. You can do the same when you build a cluster using something like s3fs.
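Embedding the keys in the URL can also break when the secret contains characters such as /. A sketch of the alternative, passing the credentials through the Hadoop configuration when building the session (the spark.hadoop.fs.s3a.* keys are the standard s3a settings; the values are placeholders):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName('Basics')
    # Standard s3a credential settings, forwarded to the Hadoop configuration.
    .config("spark.hadoop.fs.s3a.access.key", "<AWS_ACCESS_KEY_ID>")
    .config("spark.hadoop.fs.s3a.secret.key", "<AWS_SECRET_ACCESS_KEY>")
    .getOrCreate()
)

df = spark.read.csv("s3a://bucketname/filename.csv")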

Boto/Boto3: bucket.get_key(): 403 Forbidden

I am trying to connect to AWS S3 without using credentials; I attached a role with S3 full access to my instance. I want to check whether a file exists in the S3 bucket; if it doesn't, upload it. If it does exist, I want to compare its md5sum with the local file and upload a new version if they differ.
I try to get the key of the file in S3 via boto by using bucket.get_key('mykey') and get this error:
File "/usr/local/lib/python3.5/dist-packages/boto/s3/bucket.py", line 193, in get_key
    key, resp = self._get_key_internal(key_name, headers, query_args_l)
File "/usr/local/lib/python3.5/dist-packages/boto/s3/bucket.py", line 232, in _get_key_internal
    response.status, response.reason, '')
boto.exception.S3ResponseError: S3ResponseError: 403 Forbidden
I searched and added validate=False when getting the bucket, but this didn't resolve my issue. I'm using Python 3.5 with boto and boto3.
Here is my code:
import boto3
import boto
from boto import ec2
import os
import boto.s3.connection
from boto.s3.key import Key
bucket_name = "abc"
conn = boto.s3.connect_to_region('us-west-1', is_secure = True, calling_format = boto.s3.connection.OrdinaryCallingFormat())
bucket = conn.get_bucket(bucket_name, validate=False)
key = bucket.get_key('xxxx')
print (key)
I don't know why I get that error. Please help me understand this problem. Thanks!
Updated
I've just found the root cause of this problem: "The difference between the request time and the current time is too large", which is why it couldn't get the key of the file from the S3 bucket. I updated the ntp service to synchronize the local time with UTC, and it now runs successfully.
I synchronized the time with:
sudo service ntp stop
sudo ntpdate -s 0.ubuntu.pool.ntp.org
sudo service ntp start
Thanks!
The IAM role is the last in the search order. I bet you have credentials stored earlier in the search order which don't have full S3 access. Check Configuration Settings and Precedence and make sure no other credentials are present, so that the IAM role is used to fetch the credentials. Though it is written for the CLI, it applies to scripts too; a quick way to check which identity your code actually ends up using is shown after the list below.
The AWS CLI looks for credentials and configuration settings in the following order:
Command line options – region, output format and profile can be specified as command options to override default settings.
Environment variables – AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN.
The AWS credentials file – located at ~/.aws/credentials on Linux, macOS, or Unix, or at C:\Users\USERNAME\.aws\credentials on Windows. This file can contain multiple named profiles in addition to a default profile.
The CLI configuration file – typically located at ~/.aws/config on Linux, macOS, or Unix, or at C:\Users\USERNAME\.aws\config on Windows. This file can contain a default profile, named profiles, and CLI specific configuration parameters for each.
Container credentials – provided by Amazon Elastic Container Service on container instances when you assign a role to your task.
Instance profile credentials – these credentials can be used on EC2 instances with an assigned instance role, and are delivered through the Amazon EC2 metadata service.
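As referenced above, a minimal sketch for checking where boto3 is picking up credentials and which identity they map to (STS GetCallerIdentity works with any valid credentials):

import boto3

session = boto3.Session()
creds = session.get_credentials()
# Reports where boto3 found the credentials (env, shared-credentials-file, iam-role, ...).
print("credential source:", creds.method if creds else None)

# Reports which IAM identity those credentials belong to.
print(boto3.client("sts").get_caller_identity()["Arn"])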
