Influxdb bulk insert using influxdb-python - python-3.x

I used influxdb-python to insert a large amount of data read from a Redis stream. The stream is capped with maxlen=600 and receives a new entry roughly every 100 ms, but I need to retain all of the data, so I read it and transfer it to InfluxDB (I don't know whether there is a better database for this). However, after a batch insert only ⌈count/batch_size⌉ points remain, one for the end of each batch_size; everything else appears to be overwritten. Here is the code:
import redis
from apscheduler.schedulers.blocking import BlockingScheduler
import time
import datetime
import os
import struct
from influxdb import InfluxDBClient

def parse(datas):
    ts, data = datas
    w_json = {
        "measurement": 'sensor1',
        "fields": {
            "Value": data[b'Value'].decode('utf-8'),
            "Count": data[b'Count'].decode('utf-8')
        }
    }
    return w_json

def archived_data(rs, client):
    results = rs.xreadgroup('group1', 'test', {'test1': ">"}, count=600)
    if len(results) != 0:
        print("len(results[0][1]) = ", len(results[0][1]))
        datas = list(map(parse, results[0][1]))
        client.write_points(datas, batch_size=300)
        print('insert success')
    else:
        print("No new data is generated")

if __name__ == "__main__":
    try:
        rs = redis.Redis(host="localhost", port=6379, db=0)
        rs.xgroup_destroy("test1", "group1")
        rs.xgroup_create('test1', 'group1', '0-0')
    except Exception as e:
        print("error = ", e)
    try:
        client = InfluxDBClient(host="localhost", port=8086, database='test')
    except Exception as e:
        print("error = ", e)
    try:
        sched = BlockingScheduler()
        sched.add_job(archived_data, 'interval', seconds=60, args=[rs, client])
        sched.start()
    except Exception as e:
        print(e)
The data in InfluxDB changes as follows:
> select count(*) from sensor1;
name: sensor1
time count_Count count_Value
---- ----------- -----------
0 6 6
> select count(*) from sensor1;
name: sensor1
time count_Count count_Value
---- ----------- -----------
0 8 8
> select Count from sensor1;
name: sensor1
time Count
---- -----
1594099736722564482 00000310
1594099737463373188 00000610
1594099795941527728 00000910
1594099796752396784 00001193
1594099854366369551 00001493
1594099855120826270 00001777
1594099913596094653 00002077
1594099914196135122 00002361
Why does the data appear to be overwritten, and how can I insert all of the data in one batch? I would appreciate any help.

Can you provide more details on the structure of the data that you wish to store in InfluxDB? In the meantime, I hope the information below helps.
In InfluxDB, the combination of timestamp + tags is unique (i.e. two data points with the same tag values and the same timestamp cannot exist). Unlike SQL databases, InfluxDB doesn't throw a unique-constraint violation; it overwrites the existing data with the incoming data. Your data doesn't have any tags, so any incoming point whose timestamp is already present in InfluxDB will overwrite the existing point.
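As a rough illustration of that point, here is a minimal sketch of parse() and the write call with an explicit per-point timestamp taken from the Redis stream entry ID, plus an optional tag. The millisecond-to-nanosecond conversion and the time_precision='n' setting are assumptions built on the snippet above, not part of the original post:

def parse(datas):
    ts, data = datas
    # Redis stream IDs look like b'1594099736722-0': epoch milliseconds plus a sequence number
    ms, seq = ts.decode('utf-8').split('-')
    return {
        "measurement": 'sensor1',
        # an explicit, distinct timestamp per point stops later points in the
        # batch from overwriting earlier ones
        "time": int(ms) * 1_000_000 + int(seq),
        "tags": {"station": "sensor1"},  # optional metadata; points still need distinct timestamps
        "fields": {
            "Value": data[b'Value'].decode('utf-8'),
            "Count": data[b'Count'].decode('utf-8')
        }
    }

# 'n' tells write_points to interpret the integer timestamps as nanoseconds
client.write_points(datas, time_precision='n', batch_size=300)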

Related

Using "UPDATE" and "SET" in Python to Update Snowflake Table

I have been using Python to read and write data to Snowflake for some time now, to a table I have full update rights to, using a Snowflake helper class my colleague found on the internet. Please see below for the class I have been using (with my personal Snowflake connection information abstracted) and a simple read query that works, given you have a 'TEST' table in your schema.
from snowflake.sqlalchemy import URL
from sqlalchemy import create_engine
import keyring
import pandas as pd
from sqlalchemy import text

# Pull the username and password to be used to connect to snowflake
stored_username = keyring.get_password('my_username', 'username')
stored_password = keyring.get_password('my_password', 'password')

class SNOWDBHelper:
    def __init__(self):
        self.user = stored_username
        self.password = stored_password
        self.account = 'account'
        self.authenticator = 'authenticator'
        self.role = stored_username + '_DEV_ROLE'
        self.warehouse = 'warehouse'
        self.database = 'database'
        self.schema = 'schema'

    def __connect__(self):
        self.url = URL(
            user=stored_username,
            password=stored_password,
            account='account',
            authenticator='authenticator',
            role=stored_username + '_DEV_ROLE',
            warehouse='warehouse',
            database='database',
            schema='schema'
        )
        # =============================================================================
        self.url = URL(
            user=self.user,
            password=self.password,
            account=self.account,
            authenticator=self.authenticator,
            role=self.role,
            warehouse=self.warehouse,
            database=self.database,
            schema=self.schema
        )
        self.engine = create_engine(self.url)
        self.connection = self.engine.connect()

    def __disconnect__(self):
        self.connection.close()

    def read(self, sql):
        self.__connect__()
        result = pd.read_sql_query(sql, self.engine)
        self.__disconnect__()
        return result

    def write(self, wdf, tablename):
        self.__connect__()
        wdf.to_sql(tablename.lower(), con=self.engine, if_exists='append', index=False)
        self.__disconnect__()

# Initiate the SnowDBHelper()
SNOWDB = SNOWDBHelper()
query = """SELECT * FROM """ + 'TEST'
snow_table = SNOWDB.read(query)
I now have the need to update an existing Snowflake table, and my colleague suggested I could use the read function to send a query containing the update SQL to my Snowflake table. So I adapted an update query that I use successfully in the Snowflake UI and used the read function to send it to Snowflake. It actually tells me that the relevant rows in the table have been updated, but they have not. Please see below for the update query I use to attempt to change a field "field" in the "test" table to "X", and the success message I get back. I'm not thrilled with this hacky update attempt overall (where the table update is a side effect of sorts?), but could someone please help with a method to update within this framework?
# Query I actually store in file: '0-Query-Update-Effective-Dating.sql'
UPDATE "Database"."Schema"."Test" AS UP
SET UP.FIELD = 'X'
# Read the query in from file and utilize it
update_test = open('0-Query-Update-Effective-Dating.sql')
update_query = text(update_test.read())
SNOWDB.read(update_query)
# Returns message of updated rows, but no rows updated
   number of rows updated  number of multi-joined rows updated
0                     316                                    0
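A possible way forward, sketched under the assumption that the UPDATE is simply never committed when it is routed through pandas' read path: add a small execute-style method to the helper that runs DML inside an explicit transaction. The method name execute is hypothetical and not part of the original class.

def execute(self, sql):
    """Run a DML statement (UPDATE / INSERT / DELETE) and commit it."""
    self.__connect__()
    # engine.begin() opens a transaction and commits it on successful exit
    with self.engine.begin() as conn:
        result = conn.execute(text(sql))
        rowcount = result.rowcount
    self.__disconnect__()
    return rowcount

# Hypothetical usage, attaching the method to the existing helper:
SNOWDBHelper.execute = execute
with open('0-Query-Update-Effective-Dating.sql') as f:
    print(SNOWDB.execute(f.read()), "rows updated")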

Pandas .to_sql fails silently randomly

I have several large pandas dataframes (about 30k+ rows each) and need to upload a different version of them daily to an MS SQL Server db. I am trying to do so with the to_sql pandas function. On occasion it will work; other times it will fail silently, behaving as if the code had uploaded all of the data even though not a single row was uploaded.
Here is my code:
class SQLServerHandler(DataBaseHandler):
    ...
    def _getSQLAlchemyEngine(self):
        '''
        Get an sqlalchemy engine
        from the connection string

        The fast_executemany fails silently:
        https://stackoverflow.com/questions/48307008/pandas-to-sql-doesnt-insert-any-data-in-my-table/55406717
        '''
        # escape special characters as required by sqlalchemy
        dbParams = urllib.parse.quote_plus(self.connectionString)
        # create engine
        engine = sqlalchemy.create_engine(
            'mssql+pyodbc:///?odbc_connect={}'.format(dbParams))
        return engine

    #logExecutionTime('Time taken to upload dataframe:')
    def uploadData(self, tableName, dataBaseSchema, dataFrame):
        '''
        Upload a pandas dataFrame
        to a database table <tableName>
        '''
        engine = self._getSQLAlchemyEngine()
        dataFrame.to_sql(
            tableName,
            con=engine,
            index=False,
            if_exists='append',
            method='multi',
            chunksize=50,
            schema=dataBaseSchema)
Switching the method to None seems to work properly, but the data takes an insane amount of time to upload (30+ minutes). Having multiple tables (20 or so) of this size each day rules that solution out.
The solution proposed here of adding the schema as a parameter doesn't work. Neither does creating a sqlalchemy session and passing it to the con parameter with session.get_bind().
I am using:
ODBC Driver 17 for SQL Server
pandas 1.2.1
sqlalchemy 1.3.22
pyodbc 4.0.30
Does anyone know how to make it raise an exception if it fails?
Or why it is not uploading any data?
In rebuttal to this answer, if to_sql() were to fall victim to the issue described in
SQL Server does not finish execution of a large batch of SQL statements
then it would have to be constructing large anonymous code blocks of the form
-- Note no SET NOCOUNT ON;
INSERT INTO gh_pyodbc_262 (id, txt) VALUES (0, 'row0');
INSERT INTO gh_pyodbc_262 (id, txt) VALUES (1, 'row1');
INSERT INTO gh_pyodbc_262 (id, txt) VALUES (2, 'row2');
…
and that is not what to_sql() is doing. If it were, then it would start to fail well below 1_000 rows, at least on SQL Server 2017 Express Edition:
import pandas as pd
import pyodbc
import sqlalchemy as sa
print(pyodbc.version) # 4.0.30
table_name = "gh_pyodbc_262"
num_rows = 400
print(f" num_rows: {num_rows}") # 400
cnxn = pyodbc.connect("DSN=mssqlLocal64", autocommit=True)
crsr = cnxn.cursor()
crsr.execute(f"TRUNCATE TABLE {table_name}")
sql = "".join(
[
f"INSERT INTO {table_name} ([id], [txt]) VALUES ({i}, 'row{i}');"
for i in range(num_rows)
]
)
crsr.execute(sql)
row_count = crsr.execute(f"SELECT COUNT(*) FROM {table_name}").fetchval()
print(f"row_count: {row_count}") # 316
Using to_sql() for that same operation works
import pandas as pd
import pyodbc
import sqlalchemy as sa

print(pyodbc.version)  # 4.0.30

table_name = "gh_pyodbc_262"
num_rows = 400
print(f" num_rows: {num_rows}")  # 400

df = pd.DataFrame(
    [(i, f"row{i}") for i in range(num_rows)], columns=["id", "txt"]
)
engine = sa.create_engine(
    "mssql+pyodbc://@mssqlLocal64", fast_executemany=True
)

df.to_sql(
    table_name,
    engine,
    index=False,
    if_exists="replace",
)

with engine.connect() as conn:
    row_count = conn.execute(
        sa.text(f"SELECT COUNT(*) FROM {table_name}")
    ).scalar()
print(f"row_count: {row_count}")  # 400
and indeed will work for thousands and even millions of rows. (I did a successful test with 5_000_000 rows.)
Ok, this seems to be an issue with SQL Server itself.
SQL Server does not finish execution of a large batch of SQL statements
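As a follow-up, one way to make silent failures visible is to verify the row count after the upload, much as the test above does. A rough sketch built on the uploadData method from the question; the helper below is an assumption, not part of pandas or the original class:

import sqlalchemy as sa

def upload_and_verify(engine, table_name, schema, df):
    """Append a dataframe and raise if the table did not grow by len(df) rows."""
    with engine.connect() as conn:
        before = conn.execute(
            sa.text(f"SELECT COUNT(*) FROM {schema}.{table_name}")
        ).scalar()

    df.to_sql(table_name, con=engine, index=False,
              if_exists="append", schema=schema)

    with engine.connect() as conn:
        after = conn.execute(
            sa.text(f"SELECT COUNT(*) FROM {schema}.{table_name}")
        ).scalar()

    if after - before != len(df):
        raise RuntimeError(
            f"expected {len(df)} new rows in {schema}.{table_name}, got {after - before}"
        )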

how to avoid duplication in BigQuery by streaming insert

I made a function that inserts .CSV data into BigQuery every 5~6 seconds. I've been looking for a way to avoid duplicating the data in BigQuery after inserting. I want to remove data that has the same luid, but I have no idea how to remove it, so is it possible to check whether each row of the .CSV already exists in the BigQuery table before inserting?
I put the row_ids parameter to avoid duplicate luid, but it doesn't seem to work well.
Could you give me any ideas? Thanks.
def stream_upload():
    # BigQuery
    client = bigquery.Client()
    project_id = 'test'
    dataset_name = 'test'
    table_name = "test"
    full_table_name = dataset_name + '.' + table_name

    json_rows = []
    with open('./test.csv', 'r') as f:
        for line in csv.DictReader(f):
            del line[None]
            line_json = dict(line)
            json_rows.append(line_json)

    errors = client.insert_rows_json(
        full_table_name, json_rows, row_ids=[row['luid'] for row in json_rows]
    )
    if errors == []:
        print("New rows have been added.")
    else:
        print("Encountered errors while inserting rows: {}".format(errors))
    print("end")

schedule.every(0.5).seconds.do(stream_upload)

while True:
    schedule.run_pending()
    time.sleep(0.1)
BigQuery doesn't have a native way to deal with this. You could either create a view over this table that performs deduplication, or keep an external cache of luids: look up whether each luid has already been written to BigQuery before writing, and update the cache after writing new data. The cache could be as simple as a file, or you could use an additional database.
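A minimal sketch of the first option, creating a deduplicating view with the Python client. The view name test_dedup and the ts ordering column are assumptions; luid and the test.test table come from the question:

from google.cloud import bigquery

client = bigquery.Client()

# One-time setup: a view that keeps a single row per luid.
# Replace "ts" with whatever column defines the "latest" row in your table.
dedup_view_sql = """
CREATE VIEW IF NOT EXISTS `test.test_dedup` AS
SELECT * EXCEPT (rn)
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY luid ORDER BY ts DESC) AS rn
  FROM `test.test`
)
WHERE rn = 1
"""
client.query(dedup_view_sql).result()
# queries against test.test_dedup now see at most one row per luid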

Python3: Multiprocessing closing psycopg2 connections to Postgres at AWS RDS

I'm trying to write chunks of 100000 rows to an AWS RDS PostgreSQL server.
I'm using psycopg2 2.8 and multiprocessing. I create a new connection in each process and prepare the SQL statement as well, but every time a random number of rows gets inserted. I assume the issue is the Python multiprocessing library closing the wrong connections, which is mentioned here: multiprocessing module and distinct psycopg2 connections
and here: https://github.com/psycopg/psycopg2/issues/829, in one of the comments.
The RDS server logs says:
LOG: could not receive data from client: Connection reset by peer
LOG: unexpected EOF on client connection with an open transaction
Here is the skeleton of the code:
from multiprocessing import Pool
import csv
from psycopg2 import sql
import psycopg2
from psycopg2.extensions import connection

def gen_chunks(reader, chunksize=10 ** 5):
    """
    Chunk generator. Take a CSV `reader` and yield
    `chunksize` sized slices.
    """
    chunk = []
    for index, line in enumerate(reader):
        if (index % chunksize == 0 and index > 0):
            yield chunk
            del chunk[:]
        chunk.append(line)
    yield chunk

def write_process(chunk, postgres_conn_uri):
    conn = psycopg2.connect(dsn=postgres_conn_uri)
    with conn:
        with conn.cursor() as cur:
            cur.execute(
                '''PREPARE scrape_info_query_plan (int, bool, bool) AS
                   INSERT INTO schema_name.table_name (a, b, c)
                   VALUES ($1, $2, $3)
                   ON CONFLICT (a, b) DO UPDATE SET (c) = (EXCLUDED.c)
                '''
            )
            for row in chunk:
                cur.execute(
                    sql.SQL(''' EXECUTE scrape_info_query_plan ({})''').format(
                        sql.SQL(', ').join([sql.Literal(value) for value in [1, True, True]])
                    )
                )

pool = Pool()
reader = csv.DictReader('csv file path', skipinitialspace=True)

for chunk in gen_chunks(reader):
    # chunk is an array of rows (100000) from the csv
    pool.apply_async(write_process, [chunk, postgres_conn_uri])
Commands to create the required DB objects:
1. CREATE DATABASE foo;
2. CREATE SCHEMA schema_name;
3. CREATE TABLE table_name (
       x serial PRIMARY KEY,
       a integer,
       b boolean,
       c boolean
   );
Any suggestions on this?
Note: I have an EC2 instance with 64 vCPUs, and I can see 60 to 64 parallel connections on my RDS instance.
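One thing worth checking, sketched below under the assumption that the script exits before the pool has drained: the snippet never calls pool.close() / pool.join(), so pending workers can be killed mid-insert, which would match both the partial inserts and the "connection reset by peer" log lines. The sketch reuses gen_chunks and write_process from the question; the connection string and file name are placeholders:

from multiprocessing import Pool
import csv

postgres_conn_uri = "postgresql://user:password@rds-host:5432/foo"  # placeholder

if __name__ == "__main__":
    results = []
    with open("data.csv", newline="") as f:   # DictReader needs a file object, not a path string
        reader = csv.DictReader(f, skipinitialspace=True)
        pool = Pool()
        for chunk in gen_chunks(reader):
            # copy the chunk so gen_chunks' `del chunk[:]` cannot empty it
            # before the pool has pickled the arguments
            results.append(pool.apply_async(write_process, [list(chunk), postgres_conn_uri]))
        pool.close()
        pool.join()
    # surface any worker exceptions instead of losing them silently
    for r in results:
        r.get()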

How do I insert new data to database every minute from api in python

Hi, I am getting weather data from an API every 10 minutes using Python. I have managed to connect to the API, get the data, and make the code run every 10 minutes using a thread. However, the data that is being recorded every 10 minutes is the same; it never picks up the new weather data. I would like new rows to be inserted, since the station provides new records every 10 minutes. Thanks in advance for your help.
Below is my code.
timestr = datetime.now()

for data in retrieve_data():
    dashboard_data = data['dashboard_data']
    Dew_point = dashboard_data['Temperature'] - (100 - dashboard_data['Humidity']) / 5
    weather_list = [time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(dashboard_data['time_utc'])),
                    dashboard_data['Temperature'],
                    dashboard_data['Humidity'],
                    Dew_point,
                    dashboard_data['Pressure'],
                    dashboard_data['Noise'],
                    dashboard_data['CO2']
                    ]

def insert_weather_data(*arg):
    sql = (
        """INSERT INTO weather_data(time,
                                    temperature,
                                    humidity,
                                    dew_point,
                                    pressure,
                                    noise,
                                    CO2) VALUES(%s,%s,%s,%s,%s,%s,%s);
        """
    )
    conn = None
    weather_id = None
    # read database configuration
    params = config()
    # connect to postgres database
    conn = psycopg2.connect(**params)
    # create a new cursor
    cur = conn.cursor()
    try:
        # execute the insert statement
        cur.execute(sql, arg)
        conn.commit()
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
    finally:
        if conn is not None:
            conn.close()

def repeat_func_call():
    insert_weather_data(weather_list)
    threading.Timer(120, repeat_func_call, weather_list).start()

repeat_func_call()
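A hedged sketch (not from the original post) of one likely fix: the API is read once at start-up, outside the repeating function, so the same weather_list is re-inserted on every run. Fetching inside the scheduled function and passing the row as separate arguments would record fresh data each time. retrieve_data, config and the column order are taken from the snippet above; everything else is an assumption:

import threading
import time

def build_row(dashboard_data):
    """Turn one API payload into a tuple matching the INSERT column order."""
    dew_point = dashboard_data['Temperature'] - (100 - dashboard_data['Humidity']) / 5
    return (time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(dashboard_data['time_utc'])),
            dashboard_data['Temperature'],
            dashboard_data['Humidity'],
            dew_point,
            dashboard_data['Pressure'],
            dashboard_data['Noise'],
            dashboard_data['CO2'])

def record_weather():
    # fetch fresh data on every run instead of reusing a list built at start-up
    for data in retrieve_data():
        insert_weather_data(*build_row(data['dashboard_data']))  # 7 values for 7 placeholders
    # schedule the next run in 10 minutes
    threading.Timer(600, record_weather).start()

record_weather()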

Resources