Use RabbitMQ as producer and Celery as consumer - python-3.x

I'm trying to use RabbitMQ, Celery, and a Flask app to simply update a database. ProcedureAPI.py is an API that receives the data, inserts records into the database, and pushes the data to the RabbitMQ server. Celery gets the data from the RabbitMQ queue and updates the database.
I'm new to this, so please point out what I'm doing wrong.
consumer.py
from celery import Celery
import sqlite3
import time

# app = Celery('Task_Queue')
# default_config = 'celeryconfig'
# app.config_from_object(default_config)
app = Celery('tasks', backend='rpc://', broker='pyamqp://guest:guest@localhost')

@app.task(serializer='json')
def updateDB(x):
    x = x["item"]
    with sqlite3.connect("test.db") as conn:
        time.sleep(5)
        conn.execute('''updateQuery''', [x])
        # app.log(f"{x['item']} status is updated as completed!")
    return x
ProcedureAPI.py
from flask import Flask, request, jsonify
import pandas as pd
import sqlite3
import json
import pika
import configparser

parser = configparser.RawConfigParser()
configFilePath = 'appconfig.conf'
parser.read(configFilePath)

# RabbitMQ config
rmq_username = parser.get('general', 'rmq_USERNAME')
rmq_password = parser.get('general', 'rmq_PASSWORD')
host = parser.get('general', 'rmq_IP')
port = parser.get('general', 'rmq_PORT')

# Database
DATABASE = parser.get('general', 'DATABASE_FILE')

app = Flask(__name__)

conn_credentials = pika.PlainCredentials(rmq_username, rmq_password)
connection = pika.BlockingConnection(pika.ConnectionParameters(
    host=host,
    port=port,
    credentials=conn_credentials))
channel = connection.channel()

@app.route('/create', methods=['POST'])
def create_main():
    if request.method == "POST":
        print(DATABASE)
        with sqlite3.connect(DATABASE) as conn:
            conn.execute('''CREATE TABLE table1
                (feild1 INTEGER PRIMARY KEY, -- AUTOINCREMENT
                 feild2 varchar(20) NOT NULL,
                 feild3 varchar(20) DEFAULT 'pending');''')
        return "Table created", 202

@app.route('/getData', methods=['GET'])
def display_main():
    if request.method == "GET":
        with sqlite3.connect(DATABASE) as conn:
            df = pd.read_sql_query("SELECT * from table1", conn)
        df_list = df.values.tolist()
        JSONP_data = jsonify(df_list)
        return JSONP_data, 200

@app.route('/', methods=['POST'])
def update_main():
    if request.method == "POST":
        updatedata = request.get_json()
        with sqlite3.connect(DATABASE) as conn:
            conn.execute("INSERT_Query")
        print("Records Inserted successfully")
        channel.queue_declare(queue='celery', durable=True)
        channel.basic_publish(exchange='celery',
                              routing_key='celery',
                              body=json.dumps(updatedata),
                              properties=pika.BasicProperties(delivery_mode=2))
        return updatedata, 202

# main driver function
if __name__ == '__main__':
    app.run()
configfile
[general]
# RabbitMQ server (broker) IP address
rmq_IP=127.0.0.1
# RabbitMQ server (broker) TCP port number (5672 or 5671 for SSL)
rmq_PORT=5672
# queue name (storage node hostname)
rmq_QUEUENAME=Task_Queue
# RabbitMQ authentication
rmq_USERNAME=guest
rmq_PASSWORD=guest
DATABASE_FILE=test.db
# log file
receiver_LOG_FILE=cmdmq_receiver.log
sender_LOG_FILE=cmdmq_sender.log
run celery
celery -A consumer worker --pool=solo -l info
The error I got:
(env1) PS C:\Users\USER\Desktop\Desktop\Jobs Search\nodepython\flaskapp> celery -A consumer worker --pool=solo -l info
-------------- celery@DESKTOP-FRBNH77 v5.2.0 (dawn-chorus)
--- ***** -----
-- ******* ---- Windows-10-10.0.19041-SP0 2021-11-12 17:35:04
- *** --- * ---
- ** ---------- [config]
- ** ---------- .> app: tasks:0x1ec10c9c5c8
- ** ---------- .> transport: amqp://guest:**@localhost:5672//
- ** ---------- .> results: rpc://
- *** --- * --- .> concurrency: 12 (solo)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** -----
-------------- [queues]
.> celery exchange=celery(direct) key=celery
[tasks]
. consumer.updateDB
[2021-11-12 17:35:04,546: INFO/MainProcess] Connected to amqp://guest:**@127.0.0.1:5672//
[2021-11-12 17:35:04,571: INFO/MainProcess] mingle: searching for neighbors
[2021-11-12 17:35:05,594: INFO/MainProcess] mingle: all alone
[2021-11-12 17:35:05,605: INFO/MainProcess] celery@DESKTOP-FRBNH77 ready.
[2021-11-12 17:35:14,952: WARNING/MainProcess] Received and deleted unknown message. Wrong destination?!?
The full contents of the message body was: body: '{"item": "1BOOK"}' (17b)
{content_type:None content_encoding:None
delivery_info:{'consumer_tag': 'None4', 'delivery_tag': 1, 'redelivered': False, 'exchange':
'celery', 'routing_key': 'celery'} headers={}}
Any reference code or suggestions would be a great help.

Looks like you haven't declared the exchange and bound it to the queue you want to route to:
channel.exchange_declare(exchange='exchange_name', exchange_type='type_of_exchange')
channel.queue_bind(exchange='exchange_name', queue='your_queue_name')
Producer : P
Exchange : E
Queue : Q
Bind : B
The producer (your pika script) is not able to send a message directly to a queue; it needs an intermediary, so the message is routed as
P >> E >> B >> Q
The exchange routes the message to one or more queues depending on the exchange type.
A binding (as the name suggests) is what ties an exchange to a queue, again depending on the exchange type.
For more detail, please refer to:
https://hevodata.com/learn/rabbitmq-exchange-type/
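To make that concrete, here is a minimal pika sketch of the declare-and-bind pattern described above; the exchange, queue, and routing-key names are placeholders, not values taken from the question:
import json
import pika

# Connect with the default guest/guest credentials on a local broker.
credentials = pika.PlainCredentials('guest', 'guest')
connection = pika.BlockingConnection(
    pika.ConnectionParameters(host='localhost', port=5672, credentials=credentials))
channel = connection.channel()

# Declare the exchange and the queue, then bind them together so that messages
# published to the exchange with this routing key are delivered to the queue.
channel.exchange_declare(exchange='task_exchange', exchange_type='direct')
channel.queue_declare(queue='Task_Queue', durable=True)
channel.queue_bind(exchange='task_exchange', queue='Task_Queue', routing_key='Task_Queue')

# Publish a persistent JSON message through the exchange.
channel.basic_publish(
    exchange='task_exchange',
    routing_key='Task_Queue',
    body=json.dumps({'item': '1BOOK'}),
    properties=pika.BasicProperties(delivery_mode=2))
connection.close()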

Related

Access to RDS in AWS - HTTPConnectionPool: Max retries exceeded with url

I have a Docker container running on an AWS ECS task. Basically, this container needs to get a dataframe that contains customer information in each row. Each row is then accessed, and data stored in AWS S3 is read depending on the customer information in that row. Multiple metrics and pdf reports are generated for each row. In initial tests, the customer information was provided in a csv within the container. This csv was read and the container ran successfully.
The next step in the testing was to read the customer information from an RDS instance (MySQL) hosted in AWS. So basically, I query the RDS instance and get a dataframe (analogous to the csv described above with customer information). Here is where I am having problems. I am getting the following error after running the ECS task in Fargate (port number not shown):
HTTPConnectionPool(host='localhost', port=[port_number]): Max retries exceeded with url: /session/03b59c11-8e28-4e6a-9b66-77959c774858/window/maximize (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f515cf97580>: Failed to establish a new connection: [Errno 111] Connection refused'))
When I look at the logs in CloudWatch, I can see that the query is executed successfully and the container starts reading each row of the dataframe. However, it breaks after N rows are read (or perhaps there is a time limit?).
These are the functions in the container that I am using to execute a given query:
import os

import pandas as pd
from mysql.connector import connect  # assuming mysql-connector-python

def create_connection_mysql():
    """
    Opens a connection to a MySQL server and returns a MySQLConnection object.

    Returns
    -------
    connection : MySQLConnection
        MySQLConnection object.
    """
    # read credentials from environment file
    host = os.getenv("HOST")
    user = os.getenv("USER_MYSQL")
    passwd = os.getenv("PASSWORD")
    database = os.getenv("DATABASE")
    # create connection
    connection = connect(
        host=host,
        user=user,
        passwd=passwd,
        database=database,
        use_unicode=True,
        charset="utf8",
        port=3306
    )
    print("Connection to MySQL DB successful")
    return connection

def execute_query(connection, query):
    """
    Executes a query on MySQL and returns the result as a dataframe.

    Parameters
    ----------
    connection : MySQLConnection object
        A MySQLConnection object to access the database.
    query : str
        A valid query to perform on the provided connection.

    Returns
    -------
    result : DataFrame
        Returns a DataFrame with the results of the query.
    """
    # execute query
    cursor = connection.cursor(buffered=True)
    cursor.execute(query)
    # output as a pandas DataFrame
    colnames = cursor.column_names
    result = pd.DataFrame(cursor.fetchall(), columns=colnames)
    # close connection and return result
    cursor.close()
    connection.close()
    return result
Because these functions are working correctly in the container, I wonder what could be causing this problem with the RDS instance.
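For reference, a minimal usage sketch of the two helpers above (the query text is only a placeholder):
# connect once, run the query, and get the results back as a DataFrame;
# execute_query closes the connection itself when it is done
connection = create_connection_mysql()
customers = execute_query(connection, "SELECT * FROM customers")  # placeholder query
print(customers.head())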

SFTP to Azure Blob Store

I am trying to copy a file from SFTP to Azure Blob storage using SFTPToWasbOperator, and I am getting an error. It seems like I'm doing something wrong, but I can't figure out what it is. Could someone please check the following code and see if there is anything wrong with it?
Airflow Logs
[2022-07-10, 13:08:48 UTC] {sftp_to_wasb.py:188} INFO - Uploading /SPi_ESG_Live/07-04-2022/DataPoint_2022_07_04.csv to wasb://testcotainer as https://test.blob.core.windows.net/testcotainer/DataPoint_2022_07_04.csv
[2022-07-10, 13:08:48 UTC] {_universal.py:473} INFO - Request URL: 'https://.blob.core.windows.net/***/test/https%3A//test.blob.core.windows.net/testcontainer/DataPoint_2022_07_04.csv'
Error msg
"azure.core.exceptions.ServiceRequestError: URL has an invalid label."
Airflow DAG
import os
from datetime import datetime

from airflow import DAG
from airflow.decorators import task
from airflow.providers.microsoft.azure.operators.wasb_delete_blob import WasbDeleteBlobOperator
from airflow.providers.microsoft.azure.transfers.sftp_to_wasb import SFTPToWasbOperator
from airflow.providers.sftp.hooks.sftp import SFTPHook
from airflow.providers.sftp.operators.sftp import SFTPOperator

AZURE_CONTAINER_NAME = "testcotainer"
BLOB_PREFIX = "https://test.blob.core.windows.net/testcotainer/"
SFTP_SRC_PATH = "/SPi_test_Live/07-04-2022/"
ENV_ID = os.environ.get("SYSTEM_TESTS_ENV_ID")
DAG_ID = "example_sftp_to_wasb"

with DAG(
    DAG_ID,
    schedule_interval=None,
    catchup=False,
    start_date=datetime(2021, 1, 1),  # Override to match your needs
) as dag:
    # [START how_to_sftp_to_wasb]
    transfer_files_to_azure = SFTPToWasbOperator(
        task_id="transfer_files_from_sftp_to_wasb",
        # SFTP args
        sftp_source_path=SFTP_SRC_PATH,
        # AZURE args
        container_name=AZURE_CONTAINER_NAME,
        blob_prefix=BLOB_PREFIX,
    )
    # [END how_to_sftp_to_wasb]
The problem is with BLOB_PREFIX: it's not a URL, it's the prefix that goes after the Azure container URL.
See this source example: https://airflow.apache.org/docs/apache-airflow-providers-microsoft-azure/stable/_modules/tests/system/providers/microsoft/azure/example_sftp_to_wasb.html
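In other words, something along these lines (the prefix value is only an illustrative placeholder, not taken from the original post):
# blob_prefix is prepended to each uploaded blob's name inside the container,
# so it should be a plain name/path fragment rather than a full https:// URL.
AZURE_CONTAINER_NAME = "testcotainer"
BLOB_PREFIX = "DataPoint_"   # uploads land as DataPoint_<source-file-name>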

aiomysql select data problem: not updated

Versions:
Python 3.6.9
aiomysql 0.0.20
aiohttp 3.6.2
Problem:
When MySQL table data is deleted or inserted, the query results are not updated for hours unless the web app is restarted.
Code using an aiomysql pool:
# initial
pool = await aiomysql.create_pool(
    # echo=True,
    db=conf['database'],
    user=conf['user'],
    password=conf['password'],
    host=conf['host'],
    port=conf['port'],
    minsize=conf['minsize'],
    maxsize=conf['maxsize'],
)

# query
async def get_data(request):
    cmd = 'select a,b,c from tbl where d = 0'
    # request.app['db'] == pool
    async with request.app['db'].acquire() as conn:
        async with conn.cursor() as cur:
            await cur.execute(cmd)
            ...
Current solution:
Setting pool_recycle=20 in aiomysql.create_pool seems to solve the problem, but why? Is there a better way?
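A likely explanation (not stated in the thread) is that the pooled connections keep an open REPEATABLE READ transaction, so repeated SELECTs on the same connection see a stale snapshot until the connection is recycled. A minimal sketch of the two usual alternatives, assuming the same conf dict and handler shape as above:
import aiomysql

# Option 1: enable autocommit so no transaction snapshot is held between queries.
pool = await aiomysql.create_pool(
    db=conf['database'],
    user=conf['user'],
    password=conf['password'],
    host=conf['host'],
    port=conf['port'],
    minsize=conf['minsize'],
    maxsize=conf['maxsize'],
    autocommit=True,
)

# Option 2: end the implicit transaction explicitly after each read.
async def get_data(request):
    async with request.app['db'].acquire() as conn:
        async with conn.cursor() as cur:
            await cur.execute('select a, b, c from tbl where d = 0')
            rows = await cur.fetchall()
        await conn.commit()  # release the snapshot so later queries see fresh data
    ...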

Influxdb bulk insert using influxdb-python

I used influxdb-python to insert a large amount of data read from a Redis stream. The Redis stream is set with maxlen=600 and data arrives every 100 ms, and I need to retain all of it, so I read it and transfer it to InfluxDB (I don't know whether there is a better database). But with batch inserts only ⌈count/batch_size⌉ pieces of data survive; within each batch_size everything except the last point appears to be overwritten. The following code:
import redis
from apscheduler.schedulers.blocking import BlockingScheduler
import time
import datetime
import os
import struct
from influxdb import InfluxDBClient

def parse(datas):
    ts, data = datas
    w_json = {
        "measurement": 'sensor1',
        "fields": {
            "Value": data[b'Value'].decode('utf-8'),
            "Count": data[b'Count'].decode('utf-8')
        }
    }
    return w_json

def archived_data(rs, client):
    results = rs.xreadgroup('group1', 'test', {'test1': ">"}, count=600)
    if len(results) != 0:
        print("len(results[0][1]) = ", len(results[0][1]))
        datas = list(map(parse, results[0][1]))
        client.write_points(datas, batch_size=300)
        print('insert success')
    else:
        print("No new data is generated")

if __name__ == "__main__":
    try:
        rs = redis.Redis(host="localhost", port=6379, db=0)
        rs.xgroup_destroy("test1", "group1")
        rs.xgroup_create('test1', 'group1', '0-0')
    except Exception as e:
        print("error = ", e)
    try:
        client = InfluxDBClient(host="localhost", port=8086, database='test')
    except Exception as e:
        print("error = ", e)
    try:
        sched = BlockingScheduler()
        sched.add_job(archived_data, 'interval', seconds=60, args=[rs, client])
        sched.start()
    except Exception as e:
        print(e)
The data in InfluxDB changes as follows:
> select count(*) from sensor1;
name: sensor1
time count_Count count_Value
---- ----------- -----------
0 6 6
> select count(*) from sensor1;
name: sensor1
time count_Count count_Value
---- ----------- -----------
0 8 8
> select Count from sensor1;
name: sensor1
time Count
---- -----
1594099736722564482 00000310
1594099737463373188 00000610
1594099795941527728 00000910
1594099796752396784 00001193
1594099854366369551 00001493
1594099855120826270 00001777
1594099913596094653 00002077
1594099914196135122 00002361
Why does the data appear to be overwritten, and how can I insert all of the data?
I would appreciate it if you could tell me how to solve it.
Can you provide more details on the structure of the data that you wish to store in InfluxDB?
However, I hope the information below helps.
In InfluxDB, timestamp + tags are unique (i.e. two data points with the same tag values and timestamp cannot exist). Unlike SQL, InfluxDB doesn't throw a unique-constraint violation; it overwrites the existing data with the incoming data. It seems your data doesn't have tags, so any incoming points whose timestamps are already present in InfluxDB will override the existing data.
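A minimal sketch of one way to keep every point distinct, assuming the Redis stream entry ID (e.g. b'1594099736722-0', milliseconds plus a sequence number) can be used as the point's timestamp; the ID parsing here is an assumption, not code from the question, and the names follow the question's code:
def parse(datas):
    entry_id, data = datas
    ms, seq = entry_id.decode('utf-8').split('-')
    return {
        "measurement": "sensor1",
        # microsecond timestamp derived from the stream entry ID, unique per entry,
        # so later points can no longer overwrite earlier ones
        "time": int(ms) * 1000 + int(seq),
        "fields": {
            "Value": data[b'Value'].decode('utf-8'),
            "Count": data[b'Count'].decode('utf-8'),
        },
    }

# write with the precision matching the integer timestamps above
client.write_points(datas, time_precision='u', batch_size=300)
Adding a distinguishing tag to each point would achieve the same effect.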

Python3: Multiprocessing closing psycopg2 connections to Postgres at AWS RDS

I'm trying to write chunks of 100000 rows to an AWS RDS PostgreSQL server.
I'm using psycopg2 2.8 and multiprocessing. I create a new connection in each process and prepare the SQL statement as well, but every time a random number of rows gets inserted. I assume the issue is the Python multiprocessing library closing the wrong connections, which is mentioned here: multiprocessing module and distinct psycopg2 connections
and here, in one of the comments: https://github.com/psycopg/psycopg2/issues/829
The RDS server logs says:
LOG: could not receive data from client: Connection reset by peer
LOG: unexpected EOF on client connection with an open transaction
Here is the skeleton of the code:
from multiprocessing import Pool
import csv
from psycopg2 import sql
import psycopg2
from psycopg2.extensions import connection

def gen_chunks(reader, chunksize=10 ** 5):
    """
    Chunk generator. Take a CSV `reader` and yield
    `chunksize` sized slices.
    """
    chunk = []
    for index, line in enumerate(reader):
        if index % chunksize == 0 and index > 0:
            yield chunk
            del chunk[:]
        chunk.append(line)
    yield chunk

def write_process(chunk, postgres_conn_uri):
    conn = psycopg2.connect(dsn=postgres_conn_uri)
    with conn:
        with conn.cursor() as cur:
            cur.execute(
                '''PREPARE scrape_info_query_plan (int, bool, bool) AS
                   INSERT INTO schema_name.table_name (a, b, c)
                   VALUES ($1, $2, $3)
                   ON CONFLICT (a, b) DO UPDATE SET (c) = (EXCLUDED.c)
                '''
            )
            for row in chunk:
                cur.execute(
                    sql.SQL(''' EXECUTE scrape_info_query_plan ({})''').format(
                        sql.SQL(', ').join([sql.Literal(value) for value in [1, True, True]]))
                )

pool = Pool()
reader = csv.DictReader('csv file path', skipinitialspace=True)
for chunk in gen_chunks(reader):
    # chunk is an array of rows (100000) from the csv
    pool.apply_async(write_process, [chunk, postgres_conn_uri])
Commands to create the required DB objects:
1. CREATE DATABASE foo;
2. CREATE SCHEMA schema_name;
3. CREATE TABLE table_name (
       x serial PRIMARY KEY,
       a integer,
       b boolean,
       c boolean);
Any suggestions on this?
Note: I'm using an EC2 instance with 64 vCPUs, and I can see 60 to 64 parallel connections on my RDS instance.
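One gap worth noting in the skeleton: the apply_async results are never collected and the pool is never closed or joined, so worker exceptions pass silently and chunks may still be in flight when the parent exits. A minimal sketch of the driver loop with those pieces added, keeping the posted structure (the CSV path is a placeholder, and gen_chunks, write_process, and postgres_conn_uri are assumed to be defined as in the question):
from multiprocessing import Pool
import csv

if __name__ == '__main__':
    pool = Pool()
    results = []
    with open('data.csv', newline='') as f:   # placeholder path
        reader = csv.DictReader(f, skipinitialspace=True)
        for chunk in gen_chunks(reader):
            # pass a copy: gen_chunks clears the list after yielding, which can
            # race with the pool pickling the arguments in a background thread
            results.append(pool.apply_async(write_process, [list(chunk), postgres_conn_uri]))
    for res in results:
        res.get()        # wait for each chunk and re-raise any worker exception
    pool.close()
    pool.join()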
