How to insert values into an already created database table through pandas `df.to_sql()` - python-3.x

I'm creating a new table and then inserting values into it, because the TSV file doesn't have headers, so I need to create the table structure first and then insert the values. I'm using the df.to_sql function to insert the TSV values into the database table, but while it creates the table, it doesn't insert any values into it, and it doesn't raise any error either.
I have tried creating a new table through SQLAlchemy and inserting values, and that worked, but it didn't work for an already created table.
import csv
import sys
import pandas as pd
from sqlalchemy import create_engine

conn, cur = create_conn()  # helper from the asker's code, assumed to return a DB-API connection and cursor
# note: '#' in the password must be URL-encoded as %23, and the host is separated by '@'
engine = create_engine('postgresql://postgres:Shubham%23123@localhost:5432/walmart')
create_query = '''create table if not exists new_table(
    "item_id" TEXT, "product_id" TEXT, "abstract_product_id" TEXT,
    "product_name" TEXT, "product_type" TEXT, "ironbank_category" TEXT,
    "primary_shelf" TEXT, "apparel_category" TEXT, "brand" TEXT)'''
cur.execute(create_query)
conn.commit()
file_name = 'new_table'
new_file = "C:\\Users\\shubham.shinde\\Desktop\\wallll\\new_file.txt"
data = pd.read_csv(new_file, delimiter="\t", chunksize=500000, error_bad_lines=False, quoting=csv.QUOTE_NONE, dtype="unicode", iterator=True)
with open(file_name + '_bad_rows.txt', 'w') as f1:
    # redirect stderr so rows skipped by error_bad_lines are logged to the file
    sys.stderr = f1
    for df in data:
        df.to_sql('new_table', engine, if_exists='append')
    data.close()
I want to insert values from df.to_sql() into the database table.

Not 100% certain whether this argument works with PostgreSQL, but I had a similar issue when doing it on MSSQL. .to_sql() already creates the table named by its first argument, here new_table, so the separate create_query isn't needed. if_exists='append' also doesn't check for duplicate values: if the data in new_file is overwritten, or run through your function again, it will just keep adding to the table. As to why you're seeing the table name but not the data in it, that might be due to the size of the df. Try passing fast_executemany=True to create_engine.
My suggestion: get rid of create_query and handle the data types after to_sql(). Once the SQL table is created, treat it as a staging table and join your actual SQL table against it for duplicate testing. The non-duplicates can be written to the actual table, converting datatypes on the way in to match the actual table's data type structure.
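A minimal sketch of that staging flow with SQLAlchemy, assuming a hypothetical actual_table with an item_id key (neither name is from the question, and the column list is trimmed for brevity). Note that fast_executemany=True is a create_engine keyword for the mssql+pyodbc dialect only; the PostgreSQL dialect does not accept it.
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine('postgresql://postgres:secret@localhost:5432/walmart')

# land the raw rows in a staging table; to_sql creates or replaces it
df = pd.read_csv('new_file.txt', delimiter='\t', dtype='unicode')
df.to_sql('staging_table', engine, if_exists='replace', index=False)

# move only the non-duplicates into the real table, casting types as needed
with engine.begin() as conn:
    conn.execute(text("""
        INSERT INTO actual_table (item_id, product_name)
        SELECT s.item_id, s.product_name
        FROM staging_table s
        LEFT JOIN actual_table a ON a.item_id = s.item_id
        WHERE a.item_id IS NULL
    """))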

Related

Create table in Athena using all objects from multiple folders in S3 Bucket via Boto3

My S3 bucket has multiple sub-directories that store data for multiple websites, organized by day.
For example, bucket/2020-01-03/website1 is where the CSVs for that site and day are stored.
I am able to create tables based on each of the objects, but I want to create one consolidated table for all sub-directories/objects/data stored within the prefix bucket/2020-01-03 for all websites, as well as for all other dates.
I used the code below to create one table for a single website:
import boto3

# Athena configuration (ACCESS_KEY and SECRET_KEY are defined elsewhere in the asker's code)
athena = boto3.client('athena', aws_access_key_id=ACCESS_KEY,
                      aws_secret_access_key=SECRET_KEY,
                      region_name='us-west-2')
s3_input = 's3://bucket/2020-01-03/website1'
database = 'database1'
table = 'consolidated_table'

# Athena database and table definition
create_table = \
    """CREATE EXTERNAL TABLE IF NOT EXISTS `%s`.`%s` (
      `website_id` string COMMENT 'from deserializer',
      `user` string COMMENT 'from deserializer',
      `action` string COMMENT 'from deserializer',
      `date` string COMMENT 'from deserializer'
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES (
      'escapeChar'='\\"', 'separatorChar'=','
    ) LOCATION '%s'
    TBLPROPERTIES (
      'skip.header.line.count'='1',
      'transient_lastDdlTime'='1576774420');""" % (database, table, s3_input)

athena.start_query_execution(QueryString=create_table,
                             WorkGroup='user_group',
                             QueryExecutionContext={'Database': 'database1'},
                             ResultConfiguration={'OutputLocation': 's3://aws-athena-query-results-5000-us-west-2'})
I also want to overwrite this table with new data from S3 every time I run it.
You can have a consolidated table for the files from different "directories" on S3 only if all of them adhere to the same schema. As I can see from your CREATE EXTERNAL TABLE statement, each file contains the same four columns: website_id, user, action and date. So you can simply change LOCATION to point to the root of your S3 "directory structure":
CREATE EXTERNAL TABLE IF NOT EXISTS `database1`.`consolidated_table` (
`website_id` string COMMENT 'from deserializer',
`user` string COMMENT 'from deserializer',
`action` string COMMENT 'from deserializer',
`date` string COMMENT 'from deserializer'
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'escapeChar'='\\"', 'separatorChar'=','
)
LOCATION 's3://bucket' -- instead of restricting it to s3://bucket/2020-01-03/website1
TBLPROPERTIES (
'skip.header.line.count'='1'
);
In this case, each Athena query would scan all files under the s3://bucket location, and you can use website_id and date in the WHERE clause to filter results. However, if you have a lot of data you should consider partitioning. It will save you not only query execution time but also money (see this post).
I also want to over-write this table with new data from S3 every time I run it.
I assume you mean that every time you run an Athena query, it should scan files on S3 even if they were added after you executed CREATE EXTERNAL TABLE. Note that CREATE EXTERNAL TABLE simply defines meta information about your data, i.e. where it is located on S3, its columns, etc. Thus, a query against a table with LOCATION 's3://bucket' (without partitioning) will always include all your S3 files.
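If you do take the partitioning route, here is a minimal sketch using the boto3 client already shown above. The dt/website partition keys and the reorganized key layout are illustrative assumptions, not the asker's current layout:
import boto3

athena = boto3.client('athena', region_name='us-west-2')

# assumes files are laid out as s3://bucket/dt=2020-01-03/website=website1/*.csv
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS `database1`.`consolidated_partitioned` (
  `user` string,
  `action` string
)
PARTITIONED BY (`dt` string, `website` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('escapeChar'='\\"', 'separatorChar'=',')
LOCATION 's3://bucket/'
TBLPROPERTIES ('skip.header.line.count'='1');
"""

# MSCK REPAIR TABLE discovers the Hive-style partitions under LOCATION
for query in (ddl, "MSCK REPAIR TABLE consolidated_partitioned;"):
    athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={'Database': 'database1'},
        ResultConfiguration={'OutputLocation': 's3://aws-athena-query-results-5000-us-west-2'})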

How to insert an array of strings from JavaScript into PostgreSQL

I am building an API server which accepts file uploads using multer.
For each request, I need to store an array of the paths to all uploaded files in a column of the PostgreSQL database that I have connected to the server.
Say I have a table created with the following query
CREATE TABLE IF NOT EXISTS records
(
id SERIAL PRIMARY KEY,
created_on TIMESTAMPTZ NOT NULL DEFAULT NOW(),
created_by INTEGER,
title VARCHAR NOT NULL,
type VARCHAR NOT NULL
)
How do I define a new column filepaths on the above table in which I can insert a JavaScript string array (e.g. ['path-to-file-1', 'path-to-file-2', 'path-to-file-3'])?
Also, how do I retrieve and update/edit the list in JavaScript using node-postgres?
You have two options (a short sketch of both follows below):
Use the json or jsonb type. In this case the string to insert will look like:
'["path-to-file-1", "path-to-file-2", "path-to-file-3"]'
I would prefer jsonb - it allows good indexes. json is little more than text with some additional built-in functions.
Use an array of text - something like filepaths text[]. To insert, you can use:
ARRAY ['path-to-file-1', 'path-to-file-2', 'path-to-file-3']
or
'{path-to-file-1,path-to-file-2,path-to-file-3,"path to file 4"}'
You only need the " quoting here for elements that contain spaces and the like, but you are free to use it for all elements too.
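The asker is on node-postgres, but the idea is the same parameterized query in any driver; here is a minimal sketch in Python with psycopg2, the language used elsewhere on this page. The filepaths and filepaths_jsonb columns are assumed to have been added to records:
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=mydb user=postgres")  # connection details assumed
cur = conn.cursor()

paths = ['path-to-file-1', 'path-to-file-2', 'path-to-file-3']

# option 1: a jsonb column -- the Json wrapper serializes the Python list
cur.execute("UPDATE records SET filepaths_jsonb = %s WHERE id = %s", (Json(paths), 1))

# option 2: a text[] column -- psycopg2 adapts a Python list to a PostgreSQL array
cur.execute("UPDATE records SET filepaths = %s WHERE id = %s", (paths, 1))

conn.commit()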
You can create a file table that has a path column and a foreign key reference to the record it belongs to. This way you store each path as a plain text value instead of storing an array in a column, which is better practice for relational databases. You'll also be able to store additional information about a file if you need to later. And it is simpler to interact with the file path records: you add a new file path by inserting a row into the file table (with the appropriate foreign key) and remove one by deleting a row from the file table.
For example:
CREATE TABLE IF NOT EXISTS file (
record_id integer NOT NULL REFERENCES records(id) ON DELETE CASCADE,
path text NOT NULL
);
Then to get all the files for a record you can join the two tables together and convert to an array if you want.
For example:
SELECT
records.*,
ARRAY (
SELECT
file.path
FROM
file
WHERE
records.id = file.record_id
) AS file_paths
FROM
records;
Sample input (using only the title field of records):
INSERT INTO records (title) VALUES ('A'), ('B'), ('C');
INSERT INTO file (record_id, path) VALUES (1, 'patha1'), (1, 'patha2'), (1, 'patha3'), (2, 'pathb1');
Sample output:
id | title | file_paths
----+-------+------------------------
1 | A | {patha1,patha2,patha3}
2 | B | {pathb1}
3 | C | {}

Is it possible to insert JSON into Cassandra without creating the table?

How can I insert JSON objects into a Cassandra table without creating the table? Can Cassandra parse JSON into a table which has not been created? Or can I create a table with no columns and insert JSON?
Thanks.
Since Cassandra 2.2 you can insert JSON directly, but the table still has to be created beforehand.
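For example, a minimal sketch with the Python driver (the keyspace, table, and columns are illustrative; the point is the INSERT ... JSON form, which still requires a matching schema to exist):
from cassandra.cluster import Cluster

# assumes a table created beforehand, e.g.:
#   CREATE TABLE demo.users (id int PRIMARY KEY, first_name text, last_name text);
session = Cluster(['127.0.0.1']).connect('demo')

# INSERT ... JSON maps the JSON keys onto the existing columns
session.execute(
    """INSERT INTO users JSON '{"id": 1, "first_name": "Ashraful", "last_name": "Islam"}'"""
)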
You need to create the table first; then you can insert data.
You can create a table like the one below:
CREATE TABLE json_data (
id timeuuid PRIMARY KEY,
data text
);
And you can insert the JSON as a string with the query below:
INSERT INTO json_data (id, data) VALUES (now(), '{"first_name" : "Ashraful", "last_name" : "Islam"}');

COPY FROM CSV with static fields on Postgres

I'd like to switch an existing system that imports data from CSV files into a PostgreSQL 9.5 database over to a more efficient approach.
I'd like to use the COPY statement because of its good performance. The problem is that I need to have one field populated that is not in the CSV file.
Is there a way to have the COPY statement add a static field to all the rows inserted?
The perfect solution would have looked like this:
COPY data(field1, field2, field3='Account-005')
FROM '/tmp/Account-005.csv'
WITH DELIMITER ',' CSV HEADER;
Do you know a way to have that field populated in every row?
My server is running node.js so I'm open to any cost-efficient solution to complete the files using node before COPYing it.
Use a temp table to import into. Before inserting the new records into the actual table, this allows you to:
add/remove/update columns
add extra literal data
delete or ignore records (such as duplicates)
-- target table
CREATE TABLE data
( id SERIAL PRIMARY KEY
, batch_name varchar NOT NULL
, remote_key varchar NOT NULL
, payload varchar
, UNIQUE (batch_name, remote_key)
-- or::
-- , UNIQUE (remote_key)
);
-- temp table
CREATE TEMP TABLE temp_data
( remote_key varchar -- PRIMARY KEY
, payload varchar
);
COPY temp_data(remote_key,payload)
FROM '/tmp/Account-005'
;
-- The actual insert
-- (you could also filter out or handle duplicates here)
INSERT INTO data(batch_name, remote_key, payload)
SELECT 'Account-005', t.remote_key, t.payload
FROM temp_data t
;
BTW, it is possible to automate the above: put it into a function (or maybe a prepared statement), using the filename/literal as the argument, as sketched below.
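A minimal sketch of that automation in Python with psycopg2 (any driver exposing COPY works the same way; the connection string is assumed):
import psycopg2

def load_batch(conn, batch_name, csv_path):
    """COPY a CSV into a temp table, then insert rows with the static batch_name."""
    with conn.cursor() as cur, open(csv_path) as f:
        cur.execute("CREATE TEMP TABLE temp_data (remote_key varchar, payload varchar)")
        # client-side COPY: the file only has to exist where this script runs
        cur.copy_expert("COPY temp_data (remote_key, payload) FROM STDIN WITH (FORMAT csv, HEADER)", f)
        cur.execute("""
            INSERT INTO data (batch_name, remote_key, payload)
            SELECT %s, t.remote_key, t.payload
            FROM temp_data t
        """, (batch_name,))
        cur.execute("DROP TABLE temp_data")
    conn.commit()

conn = psycopg2.connect("dbname=mydb")  # connection details assumed
load_batch(conn, 'Account-005', '/tmp/Account-005.csv')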
Set a default for the column:
alter table data
alter column field3 set default 'Account-005'
Do not mention it in the COPY command:
COPY data(field1, field2) FROM...

Inserting/updating an SQLite table from a Python program

I have a sqlite3 table as shown below:
Record(WordID INTEGER PRIMARY KEY, Word TEXT, Wordcount INTEGER, Docfrequency REAL)
I want to create this table and insert data into it if it does not exist; otherwise I want to update the table in such a way that only the 'Wordcount' column gets updated, keyed on the data in the 'Word' column. I am trying to execute this from a Python program like:
import sqlite3
conn = sqlite3.connect("mydatabase")
c = conn.cursor()
#Create table
c.execute("CREATE TABLE IF NOT EXISTS Record(WordID INTEGER PRIMARY KEY, Words TEXT, Wordcount INTEGER, Docfrequency REAL)")
#Update table
c.execute("UPDATE TABLE IF EXISTS Record")
#Insert a row of data
c.execute("INSERT INTO Record values (1,'wait', 9, 10.0)")
c.execute("INSERT INTO Record values (2,'Hai', 5, 6.0)")
#Updating data
c.execute("UPDATE Record SET Wordcount='%d' WHERE Words='%s'" %(11,'wait') )
But I can't update the table. On running the program I get the error message:
c.execute("UPDATE TABLE IF EXISTS Record")
sqlite3.OperationalError: near "TABLE": syntax error
How should I write the code to update the table ?
Your SQL query for UPDATE is invalid - see the documentation.
Also, I don't understand why you'd want to check for the table's existence when updating, given that just before that you're creating it if it doesn't exist.
If your goal is to update an entry if it exists, or insert it if it doesn't, you might do it either way (a sketch of the first approach follows this list):
First do an UPDATE and check the number of rows updated. If it's 0, you know the record didn't exist and you should INSERT instead.
First do an INSERT - if it fails with a constraint violation, you know the entry already existed and you should do an UPDATE instead.
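A minimal sketch of the first approach using sqlite3's rowcount (column names taken from the question's schema; the Docfrequency default is an arbitrary placeholder):
import sqlite3

def upsert_word(conn, word, wordcount):
    # try the UPDATE first; rowcount tells us whether the word already existed
    cur = conn.execute("UPDATE Record SET Wordcount = ? WHERE Words = ?",
                       (wordcount, word))
    if cur.rowcount == 0:
        conn.execute("INSERT INTO Record (Words, Wordcount, Docfrequency) VALUES (?, ?, ?)",
                     (word, wordcount, 0.0))
    conn.commit()

conn = sqlite3.connect("mydatabase")
upsert_word(conn, 'wait', 11)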
