I can copy data from an S3 bucket into a Redshift table using psycopg2:
import psycopg2
sql = """ copy table1 from 's3://bucket/myfile.csv'
access_key_id 'xxxx'
secret_access_key 'xxx' DELIMITER '\t'
timeformat 'auto'
maxerror as 250 GZIP IGNOREHEADER 1 """
cur.execute(sql)
How do I chain multiple Redshift statements to do these three things:
create another table (table2) from table1 after the data has moved from S3
move the data over from table1 to table2
drop table1
I tried the following:
sql = """ copy table1 from 's3://bucket/myfile.csv'
access_key_id 'xxxx'
secret_access_key 'xxx' DELIMITER '\t'
timeformat 'auto'
maxerror as 250 GZIP IGNOREHEADER 1
create table table2 as table1
drop table table1"""
I don't get any error back, but the table is not created; only the copy from above works. What am I doing wrong in my SQL?
The following code copies table1 into table2 by creating a duplicate copy, then drops table1. Each statement is executed separately, and the changes are committed at the end:
import psycopg2
def redshift():
    conn = psycopg2.connect(dbname='***', host='******.redshift.amazonaws.com', port='5439', user='****', password='*****')
    cur = conn.cursor()
    cur.execute("create table table2 as select * from table1;")
    cur.execute("drop table table1;")
    conn.commit()  # psycopg2 does not autocommit, so commit the changes
    print("Copy executed fine!")

redshift()
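Putting the pieces together, here is a minimal sketch of the full pipeline, reusing the COPY statement and placeholder credentials from the question and running each step as its own execute call before committing:

import psycopg2

conn = psycopg2.connect(dbname='***', host='******.redshift.amazonaws.com',
                        port='5439', user='****', password='*****')
cur = conn.cursor()

copy_sql = """copy table1 from 's3://bucket/myfile.csv'
access_key_id 'xxxx'
secret_access_key 'xxx' DELIMITER '\t'
timeformat 'auto'
maxerror as 250 GZIP IGNOREHEADER 1"""

cur.execute(copy_sql)                                         # 1. load table1 from S3
cur.execute("create table table2 as select * from table1;")   # 2. duplicate into table2
cur.execute("drop table table1;")                             # 3. remove the staging table
conn.commit()                                                 # persist all three steps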
I'm using psycopg2 to connect to a PostgreSQL DB and to export the data into a CSV file.
This is how I export the DB to CSV:
def export_table_to_csv(self, table, csv_path):
    sql = "COPY (SELECT * FROM %s) TO STDOUT WITH CSV DELIMITER ','" % table
    with open(csv_path, "w") as file:
        self.cur.copy_expert(sql, file)
But the data is just the rows - without the column names.
How can I export the data with the column names?
P.S. I am able to print the column names:
sql = '''SELECT * FROM test'''
self.cur.execute(sql)
column_names = [desc[0] for desc in self.cur.description]
for i in column_names:
    print(i)
I want the cleanest way to export the DB with the column names (i.e. I'd prefer to do this in one method, and not rename the columns in retrospect).
As I said in my comment, you can add HEADER to the WITH clause of your SQL:
sql = "COPY (SELECT * FROM export_test) TO STDOUT WITH CSV HEADER"
By default, comma delimiters are used with the CSV option, so you don't need to specify them.
For future questions, you should submit a minimal reproducible example, that is, code we can directly copy, paste, and run. I was curious whether this would work, so I made one and tried it:
import psycopg2
conn = psycopg2.connect('host=<host> dbname=<dbname> user=<user>')
cur = conn.cursor()
# create test table
cur.execute('DROP TABLE IF EXISTS export_test')
sql = '''CREATE TABLE export_test
(
id integer,
uname text,
fruit1 text,
fruit2 text,
fruit3 text
)'''
cur.execute(sql)
# insert data into table
sql = '''BEGIN;
insert into export_test
(id, uname, fruit1, fruit2, fruit3)
values(1, 'tom jones', 'apple', 'banana', 'pear');
insert into export_test
(id, uname, fruit1, fruit2, fruit3)
values(2, 'billy idol', 'orange', 'cherry', 'strawberry');
COMMIT;'''
cur.execute(sql)
# export to csv
fid = open('export_test.csv', 'w')
sql = "COPY (SELECT * FROM export_test) TO STDOUT WITH CSV HEADER"
cur.copy_expert(sql, fid)
fid.close()
And the resultant file is:
id,uname,fruit1,fruit2,fruit3
1,tom jones,apple,banana,pear
2,billy idol,orange,cherry,strawberry
I have created a database and table in MySQL and load a CSV file into the MySQL table using a Python script, and it works fine. But when I add a new row to the existing CSV file and run the script again, it loads all the data from the start and creates duplicates in the MySQL table. I only want new or updated rows to be appended or updated; existing records should not be loaded again. Can anybody please help me with this? Below is the code I'm using.
# create a table in the database and insert the records into the MySQL server
import mysql.connector as msql
from mysql.connector import Error

try:
    conn = msql.connect(host='127.0.0.1', database='AuditLogDB', user='root',
                        password='Pharma123')
    if conn.is_connected():
        cursor = conn.cursor()
        cursor.execute("select database();")
        record = cursor.fetchone()
        print("You're connected to database: ", record)
        cursor.execute('DROP TABLE IF EXISTS AuditLogs;')
        print('Creating table....')
        cursor.execute("CREATE TABLE AuditLogs (Id nvarchar(50), CreationTime nvarchar(50), UserId nvarchar(50), Operation nvarchar(50), Workload nvarchar(50), DatasetName nvarchar(50), ReportName nvarchar(50), ReportId nvarchar(50), Activity nvarchar(50), ActivityId nvarchar(50), DatasetId nvarchar(50), LastRefreshTime nvarchar(50), ObjectId nvarchar(255), SiteUrl nvarchar(50), SourceFileName nvarchar(50), ClientIP nvarchar(50), UserAgent nvarchar(50), Email nvarchar(50))")
        print("AuditLogs table is created....")
        # AuditLogData is the pandas DataFrame read from the CSV earlier in the script
        for i, row in AuditLogData.iterrows():
            sql = "INSERT INTO AuditLogDB.AuditLogs VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"
            cursor.execute(sql, tuple(row))
        # the connection is not autocommitted by default, so we must commit to save our changes
        conn.commit()
except Error as e:
    print("Error while connecting to MySQL", e)
I have been given a .db file, that has already been populated with both Tables and Data. However, no description of the content of the database has been made available.
Is there a way for me to retrieve a list of the different tables, and for each table a list of its columns, using sqlite3 and Python?
This code shows the tables together with their column names; once you have the tables and their columns, you can query the data.
import sqlite3

def readDb():
    connection = sqlite3.connect('data.db')
    connection.row_factory = sqlite3.Row
    cursor = connection.cursor()

    # collect the names of all tables
    cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
    rows = cursor.fetchall()
    tabs = []
    for row in rows:
        for r in row:
            tabs.append(r)

    # for each table, read one row and take its keys as the column names
    # (assumes every table has at least one row)
    d = {}
    for tab in tabs:
        cursor.execute("SELECT * FROM " + tab + ";")
        rows = cursor.fetchone()
        t = []
        for row in rows.keys():
            t.append(row)
        d[tab] = t

    connection.close()
    return d

print(readDb())
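An alternative sketch using PRAGMA table_info, which returns each column's name and declared type without fetching any rows, so it also works for empty tables:

import sqlite3

def describe_db(path='data.db'):
    connection = sqlite3.connect(path)
    cursor = connection.cursor()
    cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
    schema = {}
    for (table,) in cursor.fetchall():
        # each PRAGMA row is (cid, name, type, notnull, dflt_value, pk)
        cursor.execute("PRAGMA table_info({})".format(table))
        schema[table] = [col[1] for col in cursor.fetchall()]
    connection.close()
    return schema

print(describe_db())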
I am using pyodbc to establish a connection with Azure Synapse SQL DW. The connection is successfully established. However, when it comes to inserting a pandas dataframe into the database, I get an error when I try inserting multiple rows as values, although it works if I insert rows one by one. Inserting multiple rows together as values used to work fine with AWS Redshift and MS SQL, but it fails with Azure Synapse SQL DW. I think Azure Synapse SQL is T-SQL and not MS-SQL. Nonetheless, I am unable to find any relevant documentation either.
I have a pandas df named 'df' that looks like this:
student_id admission_date
1 2019-12-12
2 2018-12-08
3 2018-06-30
4 2017-05-30
5 2020-03-11
This code below works fine
import pandas as pd
import pyodbc
#conn object below is the pyodbc 'connect' object
batch_size = 1
i = 0
chunk = df[i:i+batch_size]
conn.autocommit = True
sql = 'insert INTO {} values {}'.format('myTable', ','.join(
str(e) for e in zip(chunk.student_id.values, chunk.admission_date.values.astype(str))))
print(sql)
cursor = conn.cursor()
cursor.execute(sql)
As you can see, it inserts just one row of the df. So, yes, I could loop through and insert the rows one by one, but that takes a lot of time for larger dataframes.
This code below doesn't work when I try to insert all rows together
import pandas as pd
import pyodbc
batch_size = 5
i = 0
chunk = df[i:i+batch_size]
conn.autocommit = True
sql = 'insert INTO {} values {}'.format('myTable', ','.join(
str(e) for e in zip(chunk.student_id.values, chunk.admission_date.values.astype(str))))
print(sql)
cursor = conn.cursor()
cursor.execute(sql)
The error I get is the one below:
ProgrammingError: ('42000', "[42000]
[Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Parse error at
line: 1, column: 74: Incorrect syntax near ','. (103010)
(SQLExecDirectW)")
This is the sample SQL query for 2 rows which fails:
insert INTO myTable values (1, '2009-12-12'),(2, '2018-12-12')
That's because Azure Synapse SQL does not support multi-row inserts via the VALUES constructor.
One workaround is to chain "select (value list) union all" clauses. Your pseudo SQL should look like this:
insert INTO {table}
select {chunk.student_id.values}, {chunk.admission_date.values.astype(str)} union all
...
select {chunk.student_id.values}, {chunk.admission_date.values.astype(str)}
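A minimal sketch of building that statement in Python from the chunk used in the question, interpolating the values as strings in the same way the question does (conn and chunk are assumed to be defined as above):

# build: insert INTO myTable select 1, '2019-12-12' union all select 2, '2018-12-08' ...
rows = zip(chunk.student_id.values, chunk.admission_date.values.astype(str))
selects = " union all ".join(
    "select {}, '{}'".format(student_id, admission_date)
    for student_id, admission_date in rows
)
sql = "insert INTO myTable " + selects
cursor = conn.cursor()
cursor.execute(sql)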
The COPY statement in Azure Synapse Analytics is a better way to load your data into a Synapse SQL pool.
COPY INTO test_parquet
FROM 'https://myaccount.blob.core.windows.net/myblobcontainer/folder1/*.parquet'
WITH (
FILE_FORMAT = myFileFormat,
CREDENTIAL=(IDENTITY= 'Shared Access Signature', SECRET='<Your_SAS_Token>')
)
You can save your pandas dataframe to blob storage, and then trigger the COPY command with the cursor's execute method.
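A hedged sketch of that approach: write the dataframe to CSV, upload it with the azure-storage-blob package, then run COPY INTO over the existing pyodbc connection. The storage connection string, container, blob name, and SAS token below are placeholders, not values from the question:

from azure.storage.blob import BlobClient

# write the dataframe as CSV (no header, matching the two-column table) and upload it
csv_bytes = df.to_csv(index=False, header=False).encode('utf-8')
blob = BlobClient.from_connection_string(
    conn_str='<storage_connection_string>',
    container_name='myblobcontainer',
    blob_name='students.csv')
blob.upload_blob(csv_bytes, overwrite=True)

copy_sql = """
COPY INTO myTable
FROM 'https://myaccount.blob.core.windows.net/myblobcontainer/students.csv'
WITH (
    FILE_TYPE = 'CSV',
    CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = '<Your_SAS_Token>')
)
"""
cursor = conn.cursor()
cursor.execute(copy_sql)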
I'm creating a new table and then inserting values into it, because the TSV file doesn't have headers, so I need to create the table structure first and then insert the values. I'm using the df.to_sql function to insert the TSV values into the database table, but it only creates the table and doesn't insert any values into it, and it doesn't give any kind of error either.
I tried creating a new table through SQLAlchemy and inserting values, and that worked, but it didn't work for the already created table.
import csv
import sys
import pandas as pd
from sqlalchemy import create_engine

conn, cur = create_conn()
engine = create_engine('postgresql://postgres:Shubham#123@localhost:5432/walmart')
create_query = '''create table if not exists new_table(
    "item_id" TEXT, "product_id" TEXT, "abstract_product_id" TEXT,
    "product_name" TEXT, "product_type" TEXT, "ironbank_category" TEXT,
    "primary_shelf" TEXT, "apparel_category" TEXT, "brand" TEXT)'''
cur.execute(create_query)
conn.commit()
file_name = 'new_table'
new_file = "C:\\Users\\shubham.shinde\\Desktop\\wallll\\new_file.txt"
data = pd.read_csv(new_file, delimiter="\t", chunksize=500000, error_bad_lines=False, quoting=csv.QUOTE_NONE, dtype="unicode", iterator=True)
with open(file_name + '_bad_rows.txt', 'w') as f1:
    sys.stderr = f1
    for df in data:
        df.to_sql('new_table', engine, if_exists='append')
    data.close()
I want to insert the values into the database table with df.to_sql().
Not 100% certain whether this argument works with postgresql, but I had a similar issue when doing this on mssql. .to_sql() already creates the table named by its first argument, new_table. The if_exists='append' option also doesn't check for duplicate values: if the data in new_file is overwritten, or run through your function again, it will just be added to the table again. As for why you're seeing the table but not the data in it, that might be due to the size of the df; try setting fast_executemany=True as the second argument of create_engine.
My suggestion: get rid of create_query and handle the data types after to_sql(). Once the SQL table is created, you can use your actual SQL table and join against this staging table to test for duplicates. The non-duplicates can then be written to the actual table, converting data types on UPDATE to match the table's data type structure.
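A hedged sketch of that suggestion: let to_sql() load everything into a staging table, then move only the rows that are not already in the real table. The staging_table name, the item_id join key, the placeholder password, and the assumption that the staging columns line up with new_table are all illustrative; new_file is the path from the question.

import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine('postgresql://postgres:<password>@localhost:5432/walmart')

# load every chunk of the TSV into a staging table first
for df in pd.read_csv(new_file, delimiter="\t", chunksize=500000,
                      dtype="unicode", iterator=True):
    df.to_sql('staging_table', engine, if_exists='append', index=False)

# then insert only rows whose item_id is not yet in new_table,
# casting/cleaning columns here if the target types differ
with engine.begin() as conn:
    conn.execute(text("""
        INSERT INTO new_table
        SELECT s.* FROM staging_table s
        LEFT JOIN new_table t ON t.item_id = s.item_id
        WHERE t.item_id IS NULL
    """))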