I have the below situation. I receive data from a datasource in CSV format every two weeks. I upload that to a postgres dB. I need to ensure the following
data in postgres should not be deleted
any updates in CSV need to be carried over without adding new rows
any new data marked by uinque ID Needs to be added
In other words the diff between the data set needs to be appended to postgres
In today's implementation I am using node-postgres to stream the data to postgres
I dont know how to implement the updates
Any ideas ? Ideally if there is a way to create a temp table stream the new data and do a diff between the old and temp table will be good.
If the CSV has a unique ID already, and you're using PostgreSQL 9.5+, then you can use INSERT ... ON CONFLICT DO UPDATE .... Otherwise you could create a plpgsql stored procedure with parameters (either individual values or a single ROW parameter), that does
UPDATE table SET
value = param_value
...
WHERE ID = param_id;
IF NOT found THEN
INSERT INTO table (ID, value, ...)
VALUES (param_id, param_value, ...);
END IF;
And execute that function for each row on the CSV. You can first import the CSV into a temporary table and do
SELECT the_above_function(f.id, f.value, ...)
FROM csv_temp_table f;
Related
I am using Excel power query to import csv files containing transactions from a directory. That way adding a new file to the directory automatically makes it available when refreshing the query/data model. I load the table from the csv files into the data model. I do some cleaning and data transformation in the query.
However, there are some things that I can't do in the query that loads the raw data.
There may be missing data that I need to enter manually (a column missing some values)
I may need to split a transaction/row into multiple transactions/rows to categorize the parts correctly
It seems like there should be a way to do this that allows me to make my changes and not have them overwritten when I refresh the query to import new transactions.
Currently I am experimenting with creating a column with a unique id for the transaction table as part of the query. Then creating an aux table in excel relating to the raw transactions by unique id. I then make my changes in the aux table. And finally, I create a new table that merges the raw transactions with the aux table to create the working transaction table. This does work for missing data, or incorrect values, but it still doesn't allow me to split a row into multiple rows.
I would welcome any suggestions or references.
I want to run a presql script in the Data flow in SINK. I want to delete exiting records for particular year.
This particular year will be coming from source excel file.
Say I have a file for 2021 data and loaded that data into DB. when I rerun the pipeline for the same excel file I want to delete 2021 related records in DB and insert fresh. This table may contain multiple years data. So everytime a new file arrives for a particular year, I want to delete that respective records and reload the new data.
I can read the year value from source file column. And I can keep it as derived column. How Can I write a presql script to delete?
delete from where year = <sourcefile.year>
how can i do this in data flow?
Pls help!
You can use a temporary table in sink1 and delete the records in the sink2.
Please follow the demonstration below:
This is my SQL table with a sample data.
Create two SQL sinks for it from same source using a new branch.
In the first sink, provide a dataset with edit table name option for the temporary table.
In sink check on Recreate table.
In sink 2, give your SQL table with the below SQL script.
DELETE FROM exceltable WHERE year in (select year from dbo.temp1)
drop table dbo.temp1;
In the settings of Data flow give the Write order of sinks.
Temporary table sink should execute first.
This is my result in the SQL table after deleting records.
I am trying to copy a dbf database into a sqlite database
I have the following code
db=dataset.connect('sqlite:///:memory:')
table =db['table1']
for record in DBF(settings['xbasefile']):
db['table1'].insert(record)
this loads the record but fails to insert with a datatype mismatch on the ID column because the row coming in has a format like
ID:text
field1:Text
this function
table=db['table1']
seems to assume an int id for the table. Any way to get this to do an insert with the text id that is in the table?
Ended up using the dbf2sqlite utility which automatically creates the table with correct columns from the dbf
I have a pyspark dataframe currently from which I initially created a delta table using below code -
df.write.format("delta").saveAsTable("events")
Now, since the above dataframe populates the data on daily basis in my requirement, hence for appending new records into delta table, I used below syntax -
df.write.format("delta").mode("append").saveAsTable("events")
Now this whole thing I did in databricks and in my cluster. I want to know how can I write generic pyspark code in python that will create delta table if it does not exists and append records if delta table exists.This thing I want to do because if I give my python package to someone, they will not have the same delta table in their environment so it should get created dynamically from code.
If you don't have Delta table yet, then it will be created when you're using the append mode. So you don't need to write any special code to handle the case when table doesn't exist yet, and when it exits.
P.S. You'll need to have such code only in case if you're performing merge into the table, not append. In this case the code will looks like this:
if table_exists:
do_merge
else:
df.write....
P.S. here is a generic implementation of that pattern
There are eventually two operations available with spark
saveAsTable:- create or replace the table if present or not with the current DataFrame
insertInto:- Successful if the table present and perform operation based on the mode('overwrite' or 'append'). it requires the table to be available in the database.
The .saveAsTable("events") Basically rewrites the table every time you call it. which means that, even if you have a table present earlier or not, it will replace the table with the current DataFrame value. Instead, you can perform the below operation to be in the safer side:
Step 1: Create the table even if it is present or not. If present, remove the data from the table and append the new data frame records, else create the table and append the data.
df.createOrReplaceTempView('df_table')
spark.sql("create table IF NOT EXISTS table_name using delta select * from df_table where 1=2")
df.write.format("delta").mode("append").insertInto("events")
So, every time it will check if the table is available or not, else it will create the table and move to next step. Else, if the table is available, then append the data into the table.
I have a case where I need to read an Excel/csv/text file containing two columns (say colA and colB) of values (around 1000 rows). I need to query the database using values in colA. The query will return an XMLType into which the respective colB value needs to be inserted. I have the XML query and the insert working but I am stuck on what approach I should take to read the data, query and update it on the fly.
I have tried using external tables but realized that I don't have access to the server root to host the data file. I have also considered creating a temporary table to load the data to using SQL Loader or something similar and run the query/update within the tables. But that would need some formal overhead to go through. I would appreciate suggestions on the approach. Examples would be greatly helpful.
e.g.
text or Excel file:
ColA,ColB
abc,123
def,456
ghi,789
XMLTypeVal e.g.
<node1><node2><node3><colA></colA><colB></colB></node3></node2></node1>
UPDATE TableA SET XMLTypeVal
INSERTCHILDXML(XMLTypeVal,
'/node1/node2/node3', 'colBval',
XMLType('<colBval>123</colBval>'))
WHERE EXTRACTVALUE(TableA.XMLTypeVal, node1/node2/node3/ColA') = ('colAval');