airflow.exceptions.AirflowException: 403:Forbidden - http-status-code-403

from datetime import datetime

from airflow import DAG
from airflow.providers.sqlite.operators.sqlite import SqliteOperator
from airflow.providers.http.sensors.http import HttpSensor

default_args = {
    'start_date': datetime(2022, 1, 1)
}

# DAG skeleton
with DAG(dag_id='nft-pipeline',
         schedule='@daily',
         default_args=default_args,
         tags=['nft'],
         catchup=False) as dag:

    creating_table = SqliteOperator(
        task_id='creating_table',
        sqlite_conn_id='db_sqlite',
        sql='''
            CREATE TABLE IF NOT EXISTS nfts (
                token_id TEXT PRIMARY KEY,
                name TEXT NOT NULL,
                image_url TEXT NOT NULL
            )
        '''
    )

    is_api_available = HttpSensor(
        task_id='is_api_available',
        http_conn_id='opensea_api',
        endpoint='api/v1/assets?collection=doodles-official&limit=1'
    )
When I execute the "$ airflow tasks test nft-pipeline is_api_available 2022-01-01" command, I get an "airflow.exceptions.AirflowException: 403:Forbidden" error message.
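For what it's worth, a 403 at that endpoint often means OpenSea rejected the request itself (for example because it expects an API key or a non-default User-Agent) rather than anything Airflow-specific. HttpSensor does accept a headers argument, so one way to experiment is a variant like the sketch below; the X-API-KEY header name, the opensea_api_key Airflow Variable, and the idea that a key is required are assumptions about the OpenSea API, not something taken from the DAG above.

# Hedged sketch: same sensor as above, but with explicit headers and a response check.
# The X-API-KEY header name and the opensea_api_key Variable are assumptions.
is_api_available = HttpSensor(
    task_id='is_api_available',
    http_conn_id='opensea_api',
    endpoint='api/v1/assets?collection=doodles-official&limit=1',
    headers={
        'User-Agent': 'nft-pipeline-airflow',            # some APIs reject default clients
        'X-API-KEY': '{{ var.value.opensea_api_key }}',  # assumed header/Variable name
    },
    response_check=lambda response: response.status_code == 200,
    poke_interval=30,
    timeout=300,
)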

Related

Cloud Schedule Cloud Function to read and write data to BigQuery fails

I am trying to schedule a Cloud Function in GCP that reads and writes data, but the scheduled execution in Cloud Scheduler keeps failing. My function (which, by the way, is validated and deployed in Cloud Functions) is given by:
def geopos_test(request):
    # Imports for the read/write and the planned transformations
    import requests
    from flatten_json import flatten
    import flatten_json
    import os, json, sys, glob, pathlib
    import math
    import datetime
    from datetime import date, timedelta
    from operator import attrgetter

    import numpy as np
    import pandas as pd
    import pandas_gbq
    from pandas.io.json import json_normalize

    import seaborn as sns
    from scipy import stats
    import matplotlib as mpl
    import matplotlib.pyplot as plt
    import matplotlib.colors as colors
    import matplotlib.ticker as ticker
    from mpl_toolkits.axes_grid1 import make_axes_locatable
    from matplotlib.colors import ListedColormap, LinearSegmentedColormap
    from matplotlib.lines import Line2D

    import collections
    try:
        collectionsAbc = collections.abc
    except AttributeError:
        collectionsAbc = collections

    from google.cloud import bigquery

    client = bigquery.Client()
    project = "<ProjectId>"
    dataset_id = "<DataSet>"
    dataset_ref = bigquery.DatasetReference(project, dataset_id)
    table_ref = dataset_ref.table('sectional_accuracy')
    table = client.get_table(table_ref)

    # Read the source table into a DataFrame
    sectional_accuracy = client.list_rows(table).to_dataframe()
    sectional_accuracy = sectional_accuracy.drop_duplicates()
    sectional_accuracy = sectional_accuracy.sort_values(['Store'])

    job_config = bigquery.LoadJobConfig(
        schema=[
            bigquery.SchemaField("Store", bigquery.enums.SqlTypeNames.STRING),
            bigquery.SchemaField("storeid", bigquery.enums.SqlTypeNames.STRING),
            bigquery.SchemaField("storeIdstr", bigquery.enums.SqlTypeNames.STRING),
            bigquery.SchemaField("Date", bigquery.enums.SqlTypeNames.TIMESTAMP),
            bigquery.SchemaField("Sections", bigquery.enums.SqlTypeNames.STRING),
            bigquery.SchemaField("Percentage", bigquery.enums.SqlTypeNames.FLOAT),
            bigquery.SchemaField("devicePercentage", bigquery.enums.SqlTypeNames.FLOAT),
            bigquery.SchemaField("distance", bigquery.enums.SqlTypeNames.STRING),
        ],
    )

    # Load the (transformed) DataFrame into the destination table
    ntable_id = '<ProjectId>.<DataSet>.test'
    job = client.load_table_from_dataframe(sectional_accuracy, ntable_id, job_config=job_config)
    job.result()  # wait for the load job to finish
    return 'ok'   # HTTP-triggered functions must return a response
This function only reads data from one table and writes it to a new one. The idea is to do a load of transformations between the reading and writing.
The function runs as the App Engine default service account, of which I am the owner, and I have added (probably overkill) the Cloud Run Invoker, Cloud Functions Invoker and Cloud Scheduler Job Runner roles.
Now, for the Cloud Scheduler:
I have defined it as an HTTP job with the POST method and the function's URL, and Auth set to an OIDC token with the same service account as that used by the function. As for the HTTP headers, I have User-Agent with the value Google-Cloud-Scheduler. Note that I have no other headers, as I am uncertain what they should be.
Yet, it fails every single time with a PERMISSION DENIED message in the log.
What I have tried:
Change geopos_test(request) to geopos_test(event, context)
Tried to change the HTTP header to (Content-Type, application/octet-stream) or (Content-Type, application/json)
Change service account
What I haven't tried is to give some value in body, since I do not know what it could be.
I am now out of ideas. Any help would be appreciated.
Update: Error message:
{
  httpRequest: {1}
  insertId: "********"
  jsonPayload: {
    @type: "type.googleapis.com/google.cloud.scheduler.logging.AttemptFinished"
    jobName: "******************"
    status: "PERMISSION_DENIED"
    targetType: "HTTP"
    url: "*************"
  }
  logName: "*************/logs/cloudscheduler.googleapis.com%2Fexecutions"
  receiveTimestamp: "2022-10-24T10:10:52.337822391Z"
  resource: {2}
  severity: "ERROR"
  timestamp: "2022-10-24T10:10:52.337822391Z"
}
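Not an answer, but a way to narrow it down: the PERMISSION_DENIED can come either from the scheduler job's OIDC configuration or from the function's own IAM policy. A minimal sketch that imitates what Cloud Scheduler does (mint an OIDC ID token with the function URL as the audience and POST to it) can help tell the two apart. It assumes google-auth and requests are installed, GOOGLE_APPLICATION_CREDENTIALS points at a key for the same service account the job uses, and FUNCTION_URL stands in for the redacted trigger URL above.

# Sketch: call the function the same way Cloud Scheduler would with an OIDC token.
# FUNCTION_URL is a placeholder for the redacted url above.
import requests
from google.auth.transport.requests import Request
from google.oauth2 import id_token

FUNCTION_URL = "https://REGION-PROJECT.cloudfunctions.net/geopos_test"  # placeholder

token = id_token.fetch_id_token(Request(), FUNCTION_URL)  # audience = the function URL
resp = requests.post(FUNCTION_URL,
                     headers={"Authorization": f"Bearer {token}"},
                     json={})
print(resp.status_code, resp.text)

If this direct call also returns 403, the service account is most likely missing the Cloud Functions Invoker binding on the function itself; if it succeeds, the problem is more likely in the scheduler job's OIDC audience or service account settings.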

Step function to invoke glue job and lambda function with passed parameters

Scenario:
I want to pass the S3 source file location and the S3 output file location as input parameters in my workflow.
Workflow: AWS Step Function -> Lambda function -> Glue job.
I want to pass the parameters from the Step Function to the Lambda function to the Glue job, where the Glue job does some transformation on the S3 input file and writes its output to the S3 output file.
Below are the Step Function, Lambda function and Glue job, respectively, along with the input JSON that is passed to the Step Function.
1:Input (Parameters passed) :
{
  "S3InputFileLocation": "s3://bucket_name/sourcefile.csv",
  "S3OutputFileLocation": "s3://bucket_name/FinalOutputfile.csv"
}
2: Step Function / state machine (which calls the Lambda with the above input parameters):
{
  "StartAt": "AWSStepFunctionInitiator",
  "States": {
    "AWSStepFunctionInitiator": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:xxxxxx:function:AWSLambdaFunction",
      "InputPath": "$",
      "End": true
    }
  }
}
3: Lambda function (i.e. the AWSLambdaFunction invoked above, which in turn calls the AWSGlueJob below):
import json
import boto3

def lambda_handler(event, context):
    client = boto3.client("glue")
    client.start_job_run(
        JobName='AWSGlueJob',
        Arguments={
            'S3InputFileLocation': event["S3InputFileLocation"],
            'S3OutputFileLocation': event["S3OutputFileLocation"]})
    return {
        'statusCode': 200,
        'body': json.dumps('AWS lambda function invoked!')
    }
4: AWS Glue Job Script:
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job

print('AWS Glue Job started')

# Resolve the job arguments passed in by the Lambda
args = getResolvedOptions(sys.argv, ['AWSGlueJob', 'S3InputFileLocation', 'S3OutputFileLocation'])
S3InputFileLocation = args['S3InputFileLocation']
S3OutputFileLocation = args['S3OutputFileLocation']

glueContext = GlueContext(SparkContext.getOrCreate())

# Read the CSV from the input location and write it back out to the output location
dfnew = glueContext.create_dynamic_frame_from_options("s3", {'paths': [S3InputFileLocation]}, format="csv")
datasink = glueContext.write_dynamic_frame.from_options(frame=dfnew, connection_type="s3",
                                                        connection_options={"path": S3OutputFileLocation},
                                                        format="csv", transformation_ctx="datasink")
The above Step Function and the corresponding workflow execute without any compilation or runtime errors, and I can see the parameters successfully passed from the Step Function to the Lambda function, but none of my print statements in the Glue job are getting logged in CloudWatch, which suggests there is some issue when the Lambda function calls the Glue job. Kindly help me figure out whether there is an issue in the way I am invoking Glue from Lambda.
Hi,
maybe it is already solved, but these two links help:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-get-resolved-options.html
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html
And to give a precise answer to your question: add '--' to the argument names here:
import json
import boto3

def lambda_handler(event, context):
    client = boto3.client("glue")
    client.start_job_run(
        JobName='AWSGlueJob',
        Arguments={
            '--S3InputFileLocation': event["S3InputFileLocation"],
            '--S3OutputFileLocation': event["S3OutputFileLocation"]})
    return {
        'statusCode': 200,
        'body': json.dumps('AWS lambda function invoked!')
    }
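A small addition that may help with the "nothing shows up in CloudWatch" part: start_job_run returns a JobRunId, so the Lambda can log it and immediately ask Glue for the run's state via get_job_run. This is only a sketch layered on top of the corrected call above; the job name and argument keys are the ones from the question.

import json
import boto3

def lambda_handler(event, context):
    client = boto3.client("glue")

    # '--' prefixed arguments are what getResolvedOptions sees inside the Glue script
    response = client.start_job_run(
        JobName='AWSGlueJob',
        Arguments={
            '--S3InputFileLocation': event["S3InputFileLocation"],
            '--S3OutputFileLocation': event["S3OutputFileLocation"],
        },
    )

    # Log the run id and its initial state so failures are visible in the Lambda logs
    run_id = response['JobRunId']
    run = client.get_job_run(JobName='AWSGlueJob', RunId=run_id)
    print("Glue JobRunId:", run_id, "state:", run['JobRun']['JobRunState'])

    return {
        'statusCode': 200,
        'body': json.dumps({'JobRunId': run_id}),
    }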

How to Create Partition table at runtime over Apache Beam using Python

I am trying to create a new partitioned BigQuery table at runtime with the following code, but I am not finding an option to pass the column name "_time" on which the new BQ table should be partitioned.
Can anyone please help me with it?
My Code
#------------Import Lib-----------------------#
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
import os, sys
import json
import argparse
import logging
from apache_beam.options.pipeline_options import SetupOptions
from datetime import datetime

#------------Set up BQ parameters-----------------------#
# Replace with Project Id
project = 'xxxx'

#------------Splitting of Records-----------------------#
class Transaction_DB_UC2(beam.DoFn):
    def process(self, element):
        logging.info(element)
        result = json.loads(element)
        data_time = result.get('_time', 'null')
        data_dest = result.get('dest', 'null')
        data_DBID = result.get('DBID', 'null')
        data_SESSIONID = result.get('SESSIONID', 'null')
        data_USERHOST = result.get('USERHOST', 'null')
        data_raw = result.get('_raw', 'null')
        data_ACTION = result.get('ACTION', 'null')
        data_host = result.get('host', 'null')
        data_result = result.get('result', 'null')
        data_DBUSER = result.get('DBUSER', 'null')
        data_OS_USERNAME = result.get('OS_USERNAME', 'null')
        data_ACTION_NAME = result.get('ACTION', 'null').replace('100', 'LOGON').replace('101', 'LOGOFF')
        return [{"_time": data_time[:-8], "dest": data_dest, "DBID": data_DBID,
                 "SESSIONID": data_SESSIONID, "_raw": data_raw, "USERHOST": data_USERHOST,
                 "ACTION": data_ACTION, "host": data_host, "result": data_result,
                 "DBUSER": data_DBUSER, "OS_USERNAME": data_OS_USERNAME,
                 "ACTION_NAME": data_ACTION_NAME}]
def run(argv=None, save_main_session=True):
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--input',
        dest='input',
        help='Input file to process.')
    parser.add_argument(
        '--pro_id',
        dest='pro_id',
        type=str,
        default='ORACLE_SEC_DEFAULT',
        help='project id')
    known_args, pipeline_args = parser.parse_known_args(argv)
    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(SetupOptions).save_main_session = save_main_session
    p1 = beam.Pipeline(options=pipeline_options)

    # data_f = sys.argv[1]
    logging.info('***********')
    logging.info(known_args.input)

    data_loading = (
        p1
        | 'Read from File' >> beam.io.ReadFromText(known_args.input, skip_header_lines=0)
    )

    project_id = "xxxxx"
    dataset_id = 'test123'
    table_schema_DB_UC2 = ('_time:DATETIME, dest:STRING, DBID:STRING, SESSIONID:STRING, _raw:STRING, USERHOST:STRING, ACTION:STRING, host:STRING, result:STRING, DBUSER:STRING, OS_USERNAME:STRING, ACTION_NAME:STRING')

    # Persist to BigQuery
    # WriteToBigQuery accepts the data as a list of JSON objects
    #---------------------Index = DB-UC2---------------------#
    result = (
        data_loading
        | 'Clean-DB-UC2' >> beam.ParDo(Transaction_DB_UC2())
        | 'Write-DB-UC2' >> beam.io.WriteToBigQuery(
            table=known_args.pro_id,
            dataset=dataset_id,
            project=project_id,
            schema=table_schema_DB_UC2,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))

    result = p1.run()
    result.wait_until_finish()

if __name__ == '__main__':
    # logging.getLogger().setLevel(logging.INFO)
    path_service_account = 'ml-fbf8cabcder.json'
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = path_service_account
    run()
I want to create the partition on the field "_time"; please suggest how it can be achieved.
Thanks.
I believe that you can do that with additional_bq_parameters (Note the limitations) with the timePartitioning parameter.
When creating a new BigQuery table, there are a number of extra parameters
that one may need to specify. For example, clustering, partitioning, data
encoding, etc. It is possible to provide these additional parameters by
passing a Python dictionary as additional_bq_parameters (Reference).
In your case, you could add to your WriteToBigQuery transform the timePartitioning parameter with the required type and optional field fields (Note that field must be a top-level TIMESTAMP or DATE field):
additional_bq_parameters={'timePartitioning': {
'type': 'DAY',
'field': '_time'
}}
I didn't have the time to try it out yet. I'll try to reproduce tomorrow.
Let me know if it works for you.
EDIT
Finally got the chance to try the timePartitioning parameter to create a partitioned table and it worked.
Here is a simple pipeline code to test it.
#!/usr/bin/env python
import apache_beam as beam

PROJECT = 'YOUR_PROJECT'
BUCKET = 'YOUR_BUCKET'

def run():
    argv = [
        '--project={0}'.format(PROJECT),
        '--job_name=YOUR_JOB_NAME',
        '--save_main_session',
        '--staging_location=gs://{0}/staging/'.format(BUCKET),
        '--temp_location=gs://{0}/staging/'.format(BUCKET),
        '--region=us-central1',
        '--runner=DataflowRunner'
    ]
    p = beam.Pipeline(argv=argv)

    table_schema = {'fields': [
        {'name': 'country', 'type': 'STRING', 'mode': 'NULLABLE'},
        {'name': '_time', 'type': 'DATETIME', 'mode': 'NULLABLE'},
        {'name': 'query', 'type': 'STRING', 'mode': 'NULLABLE'}]}

    additional_bq_parameters = {
        'timePartitioning': {'type': 'DAY', 'field': '_time'}}

    elements = (p | beam.Create([
        {'country': 'mexico', '_time': '2020-06-10 22:19:26', 'query': 'acapulco'},
        {'country': 'canada', '_time': '2020-12-11 15:42:32', 'query': 'influenza'},
    ]))

    elements | beam.io.WriteToBigQuery(
        table='YOUR_DATASET.YOUR_NEW_TABLE',
        schema=table_schema,
        additional_bq_parameters=additional_bq_parameters,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE
    )

    p.run()

if __name__ == '__main__':
    run()
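As a quick sanity check after the pipeline runs, the partitioning setting can be read back with the BigQuery client library; this assumes google-cloud-bigquery is installed and that the placeholders match the table written above.

# Read back the table and confirm it is day-partitioned on _time.
# YOUR_PROJECT / YOUR_DATASET.YOUR_NEW_TABLE are the same placeholders as above.
from google.cloud import bigquery

client = bigquery.Client(project='YOUR_PROJECT')
table = client.get_table('YOUR_PROJECT.YOUR_DATASET.YOUR_NEW_TABLE')
print(table.time_partitioning)  # expected: TimePartitioning(type_=DAY, field=_time)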

how to translate sqlalchemy query correctly

Hi all!
I have this query:
query = insert(balance)
query = query.on_conflict_do_nothing(
    index_elements=["query_id", "warehouse", "product"]
)
I use SQLAlchemy's compile() to translate it into raw SQL:
sql_str = (
    query.compile(
        dialect=postgresql.dialect(),
    )
)
I get this output:
INSERT INTO balance (id, query_id, warehouse, product) VALUES (%(id)s, %(query_id)s, %(warehouse)s, %(product)s) ON CONFLICT (query_id, warehouse, product) DO NOTHING
How do I get this instead?
INSERT INTO balance (id, query_id, warehouse, product) VALUES ($1, $2, $3, $4) ON CONFLICT (query_id, warehouse, product) DO NOTHING
full code:
import asyncio
import logging
import sys
from contextlib import asynccontextmanager
from typing import AsyncGenerator

import async_timeout
import asyncpgsa
import stackprinter
from faker import Faker
from sqlalchemy.dialects import postgresql

logging.basicConfig(level=logging.DEBUG)

import sqlalchemy
from sqlalchemy import Boolean, Column, Table, MetaData, NUMERIC, func
from sqlalchemy.dialects.postgresql import UUID, TIMESTAMP, insert
from sqlalchemy.types import Integer

metadata = MetaData()

stock_balance: sqlalchemy.Table = Table(
    "stock_balance",
    metadata,
    Column("warehouse", UUID, primary_key=True),
    Column("product", UUID, primary_key=True),
    Column("balance", NUMERIC(15, 3), index=True),
    Column("reserve", NUMERIC(15, 3), index=True),
)

balance: sqlalchemy.Table = Table(
    "balance",
    metadata,
    Column("id", Integer, primary_key=True),
    Column("query_id", UUID(as_uuid=True), nullable=False, index=True),
    Column("warehouse", UUID, index=True, nullable=False),
    Column("product", UUID, index=True, nullable=False),
    Column("balance", Boolean, index=True, server_default="f"),
    Column("reserve", Boolean, index=True, server_default="f"),
    Column("count", NUMERIC(15, 3), nullable=False),
    Column("date_time", TIMESTAMP, server_default=func.now(tz="UTC"), index=True),
    Column("updated", Boolean, server_default="f", index=True),
)


def data_traffic():
    fake = Faker("ru_RU")
    balance = fake.boolean()
    return {
        "id": fake.uuid4(),
        "warehouse": fake.uuid4(),
        "product": fake.uuid4(),
        "balance": balance,
        "count": fake.pyfloat(left_digits=15, right_digits=3, min_value=0),
        "reserve": not balance,
    }


@asynccontextmanager
async def connect_db() -> AsyncGenerator:
    try:
        with async_timeout.timeout(5):
            conn = await asyncpgsa.create_pool(
                f"postgresql://postgres:some_secret@"
                f"localhost:10001/stockbalance_test",
                # echo=True,
                min_size=1,
                max_size=1,
                dialect=postgresql.dialect()
            )
        async with conn.acquire() as c:
            yield c
        await conn.close()
    except Exception as exc:
        logging.error(
            "Server Errors: {}\n{}\n{}\n{}".format(
                exc, sys.exc_info()[0], sys.exc_info()[1], stackprinter.format()
            )
        )
        yield None
    finally:
        await conn.close()


async def update_balance(conn, data: list):
    query = insert(balance)
    query = query.on_conflict_do_nothing(
        index_elements=["query_id", "warehouse", "product"]
    )
    sql_str = (
        query.compile(
            dialect=postgresql.dialect(),
        )
    )
    await conn.executemany(
        str(sql_str),
        [
            (
                item["id"],
                item["warehouse"],
                item["product"],
                item["balance"],
                item["reserve"],
                item["count"],
            )
            for item in data
        ],
    )


async def main():
    async with connect_db() as conn:
        try:
            await update_balance(conn, [data_traffic()])
        except Exception as exc:
            print(exc)


if __name__ == "__main__":
    asyncio.run(main())
This fails with: syntax error at or near "%"
I found a solution; maybe it will help someone. It looks like this:
query = insert(balance).values(
    [
        {
            "query_id": item["id"],
            "warehouse": item["warehouse"],
            "product": item["product"],
        }
        for item in data
    ]
)
query = query.on_conflict_do_nothing(
    index_elements=["query_id", "warehouse", "product"]
)
sql_str, args = compile_query(query)
print("sql_str: ", sql_str)
print("args: ", args)
await conn.executemany(sql_str, [tuple(args)])
sql_str: INSERT INTO balance (query_id, warehouse, product) VALUES ($2, $1, $3) ON CONFLICT (query_id, warehouse, product) DO NOTHING
args: ['d982e84e-0f09-43dd-a296-aaf51dd6c36d', '4373a8df-0fd8-4f2f-855c-4a7b36a7ad9e', '831ce109-ff0f-4d0b-8c53-d148bb699bdf']
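One more thought, with the caveat that I have not verified it: asyncpgsa advertises SQLAlchemy Core support, so its connections may accept the statement object directly and do the $1/$2 compilation internally. Both conn.execute(query) below and the exact home of compile_query are assumptions about asyncpgsa's API, so treat this as a sketch only.

# Assumed variant: let asyncpgsa compile the Core statement itself.
# conn is the asyncpgsa connection from connect_db() above; passing a
# ClauseElement to conn.execute() is an assumption about asyncpgsa's API.
async def update_balance(conn, data: list):
    query = insert(balance).values(
        [
            {
                "query_id": item["id"],
                "warehouse": item["warehouse"],
                "product": item["product"],
            }
            for item in data
        ]
    ).on_conflict_do_nothing(
        index_elements=["query_id", "warehouse", "product"]
    )
    await conn.execute(query)  # placeholders handled internally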

I am using Cloud SQL on Google app engine. Once I add a new user in the console and re-deploy my app the new user can't seem to login?

When I add a new user and password in the GCP console, refresh and wait, then redeploy and run my web app, I can't log in with that user. I can still log in with my original test user (the first and only user thus far besides the 'postgres' admin user).
I've tried deleting and re-adding the same user. I've tried adding yet another user, deploying, and attempting to log in again. I've made sure I've refreshed and waited for the change to take effect before redeploying the web app. I have logged in with my original user, logged out and tried to log in with the new user, and also tried the new user first. I've scoured online for answers, but surprisingly to no avail.
The main, outer app.py file that has the user management/auth code using Flask and flask_login functionality:
import dash
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output
import sys
# sys.path.append('/Users/crowledj/Mindfule/dash-flask-login/views/')
# sys.path.append('/Users/crowledj/Mindfule/dash-flask-login/flask_login/')

external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css']

app = dash.Dash(__name__, external_stylesheets=external_stylesheets)
# server = app.server
app.css.append_css({'external_url': 'https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.css'})

from server import app, server
from flask_login import logout_user, current_user
import success, login, login_fd, logout
# import sqlalchemy

header = html.Div(
    className='header',
    children=html.Div(
        className='container-width',
        style={'height': '100%'},
        children=[
            html.Img(
                src='mindfule_company_logo.jpg',
                className='logo'
            ),
            html.Div(className='links', children=[
                html.Div(id='user-name', className='link'),
                html.Div(id='logout', className='link')
            ])
        ]
    )
)

app.layout = html.Div(
    [
        header,
        html.Div([
            html.Div(
                html.Div(id='page-content', className='content'),
                className='content-container'
            ),
        ], className='container-width'),
        dcc.Location(id='url', refresh=False),
    ]
)
@app.callback(Output('page-content', 'children'),
              [Input('url', 'pathname')])
def display_page(pathname):
    if pathname == '/':
        return login.layout
    elif pathname == '/login':
        return login.layout
    elif pathname == '/success':
        if current_user.is_authenticated:
            print('returning success page from main app ... \n')
            return success.layout
        else:
            return login_fd.layout
    elif pathname == '/logout':
        if current_user.is_authenticated:
            logout_user()
            return logout.layout
        else:
            return logout.layout
    else:
        return '404'


@app.callback(
    Output('user-name', 'children'),
    [Input('page-content', 'children')])
def cur_user(input1):
    if current_user.is_authenticated:
        return html.Div('Current user: ' + current_user.username)
        # 'User authenticated' return username in get_id()
    else:
        return ''


@app.callback(
    Output('logout', 'children'),
    [Input('page-content', 'children')])
def user_logout(input1):
    if current_user.is_authenticated:
        return html.A('Logout', href='/logout')
    else:
        return ''


if __name__ == '__main__':
    app.run_server(debug=True, port=8080, host="foodmoodai.appspot.com")  # "0.0.0.0")
The only Postgres- and SQL-related code, at the start of my 'success' page file:
import dash
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output, State
import plotly.graph_objs as go
from textwrap import dedent as d
from flask import Flask
import pandas as pd
import numpy as np
from NutrientParser import parseNutrientStr_frmUser, parseResearch, parseFoodResearch, find_substring
from userMindfuleClasses import *
import PIL
import urllib3
from PIL import Image
import json, os
import arrow
from server import app
from flask_login import current_user
import psycopg2
from datetime import datetime

timeStamp = datetime.now()

# db_user='test'
# db_pass='test1'
# db_name='foodnmood-db'
# INSTANCE_CONNECTION_NAME='foodmoodai:europe-west2:foodnmood-db'

from sqlalchemy import Table, Column, Integer, String, MetaData, create_engine

meta = MetaData()

# engine = create_engine('postgresql+psycopg2://postgres:Pollgorm1@/cloudsql/foodmoodai:europe-west2:foodnmood-db')
engine = create_engine('postgresql+psycopg2://postgres:Pollgorm1@/?host=/cloudsql/foodmoodai:europe-west2:foodnmood-db')

mealnMoodwithTimesnFoods = Table(
    'mealnMoodwithTimesnFoods', meta,
    Column('time', String, primary_key=True),
    Column('id', String),
    Column('food_1', String),
    Column('food_2', String),
    Column('food_3', String),
    Column('mood', String),
)

meta.create_all(engine)
I expect to be able to at least add a new user (which automatically has login permissions) and get past the login page when I redeploy the app after making this change in the GCP console.
The issue here turned out to be entirely due to a local authentication library I had installed from GitHub, which uses flask_login (flask_login==0.4.1 -> pip install flask-login==0.4.1). All I needed to do was update the new user and password in a local .txt file as well as in gcloud's Cloud SQL console.
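For anyone hitting something similar, it can also help to rule the database itself in or out by connecting with the new credentials directly, reusing the unix-socket style URL from the success page code above. A minimal sketch, where NEW_USER/NEW_PASSWORD are placeholders for the user created in the console and 'postgres' stands in for whatever database the app actually uses:

# Sanity check: can the newly created Cloud SQL user log in at all?
# NEW_USER / NEW_PASSWORD are placeholders; run this where the /cloudsql socket
# exists (e.g. on App Engine or via the Cloud SQL Auth Proxy).
from sqlalchemy import create_engine, text

engine = create_engine(
    'postgresql+psycopg2://NEW_USER:NEW_PASSWORD@/postgres'
    '?host=/cloudsql/foodmoodai:europe-west2:foodnmood-db'
)
with engine.connect() as conn:
    print(conn.execute(text('SELECT current_user')).scalar())

If this connects, the Cloud SQL side is fine and the failure is purely in the app's own user store, which matches the resolution above.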
