Insert list of multiple values in SQL table using pyodbc - python-3.x

I have a list of length greater than 5 million rows. I need to insert this list into a table in my database using pyodbc. I tried 2 methods but both take a lot of time.
My list looks like this:
a=[3000200,3000201,3000202, ...]
Code which I have tried so far (Assume table TEMP created already and mylist.csv has all values of list a):
Create a csv file and insert from that (takes 1.5 hrs to create csv file):
cursor.execute("INSERT INTO TEMP SELECT * FROM EXTERNAL 'mylist.csv' USING (DELIMITER ',' REMOTESOURCE 'ODBC')")
Insert from list by using loops(ETA is over 120 hrs!):
b='INSERT INTO TEMP VALUES (?)'
for i in tqdm(a):
cursor.execute(b,i)
Is there an easier way to perform this insert without waiting so long ?
EDIT
For Code 1, I was using a 'for' loop to insert the values of a string into a csv file and this was taking far too long than necessary.
As mentioned by Gord in comments below, I converted the list into a dataframe and loaded it (Using code 1) with 'to_csv' method and it takes less than a minute now!

Related

Most efficient way to avoid injecting duplicate rows into Postgres db

This is more of a conceptual question. I'm building a relational db, using python and the psycopg2 library, and have a table that has over 44 million rows (and growing) that I want to try and inject sanitized rows from a csv file into the table without injecting duplicate rows; each row has an auto incrementing unique id from it's origin db table.
The current way I'm injecting the data is using the COPY table(columns...) FROM '/path/to/file' command; which is working like a charm. This occurs after we've sanitized all rows in the the csv file to match the datatypes in the rows to the appropriate column's datatypes in the table.
There are a few ideas that I have in mind, and one I've tried, but want to see what the most efficient option is before implementation.
The one I tried ended up being a tremendous burden on the server's cpu and memory; which we have decided not to proceed on. I ended up creating a script that makes a query to the db that searches for the unique id in the table (over 44 million rows).
My other idealistic solutions:
Allow injection of duplicates then create a script to clean up any duplicate rows in the table.
Create a temporary table with the data from the csv. Compare the temp table with the existing table, removing any duplicate values from the temp table, then injecting the temp table into the existing table.
Step 2 might be simplified with this issue. Instead of comparing the two tables we just use the INSERT INTO command along with the ON CONFLICT option.
This one might be more of a stretch of the imagination, and probably pretty unique to our situation. But, since we know that the unique id field will be auto incrementing, we can set a global variable to equal the largest unique id value in the table, then before sanitizing the data we make a query to check if the unique id value is less than the global variable data, and if that is True, we throw out the row from being injected. (No longer an option)

Delimited File with Varying Number of Rows Azure Data Factory

I have a delimited file separated by hashes that looks somewhat like this,
value#value#value#value#value#value##value
value#value#value#value##value#####value#####value
value#value#value#value###value#value####value##value
As you can see, when separated by hashes, there are more columns in the 2nd and 3rd rows than there is in the first. I want to be able to ingest this into a database using a ADF Data Flow after some transformations. However, whenever I try to do any kind of mapping, I always only see 7 columns (the number of columns in the first row).
Is there any way to get all of the values? As many columns as there are in the row with most number of items? I do not mind the nulls.
Note: I do not have a header row for this.
Azure Data Factory directly will not be able to Import schema -row with the maximum number of column. Hence, it is important to make sure you have same number of columns in your file.
You can use Azure functions to validate your file and update it to get equal number of columns in all rows.
You could give it a try to have a local file with row with the maximum number of column and import the schema from the file, else you have to go for Azure Functions where you have to convert the file and then trigger the pipeline.

Excel Power Query - Incremental Load from Query and adding date

I am trying to do something quite simple which I am failing to understand.
Take the output from a query, date time stamp and write it into a Excel table.
Iterate the logic again and you get the same output but the generated date time has progressed in time.
Query 1 -- From SQL which yields 2 columns category, count.
I am taking this and adding a generated date to it using DateTime.LocalNow().
Query 2 -- Target table
How can i construct a query which adds to an existing table and doesnt require me to load the result into a new table.
I have seen this blog.oraylis.de and i cant make it work since the DateTime.LocalNow() call runs for source and target and i end up with the same datetime throughout the query.
I think i am missing something obvious.
EDIT:-
= Table.Combine({SOURCE_DATA, TARGET_DATA})
This loads into a 3rd new table and doesnt take into account that 3rd table when loading - so you just end up with a new version of just the first two tables with new timestamp
These steps should work
create a query Q1 based on the SQL Statement, add your timestamp using DateTime.LocalNow() and load this into an Excel table (execute the query)
create a new query Q2 based on this Excel new table (just like that, no transforms)
Modify the first query Q1 by adding the Table.Combine with Q2 as the last step.
So, in other words, Q2 loads the existing data from the Excel table into which Q1 writes. The Excel table is always written completely but since the existing data is preserved you will get the result of new data being loaded to the table. Hope this helps.
Good luck, Hilmar

Spark Python: Converting multiple lines from inside a loop into a dataframe

I have a loop that is going to create multiple rows of data which I want to convert into a dataframe.
Currently I am creating a CSV format string and inside the loop keep appending to it along separated by a newline. I am creating a CSV file so that I can also save it as a text file for other processing.
File Header:
output_str="Col1,Col2,Col3,Col4\n"
Inside for loop:
output_str += "Val1,Val2,Val3,Val4\n"
I then create an RDD by splitting it with the newline and then convert in into the dataframe as follows.
output_rdd = sc.parallelize(output_str.split("\n"))
output_df = output_rdd.map(lambda x: (x, )).toDF()
It creates a dataframe but only has 1 column. I know that is because of the map function where I am making it into a list with only 1 item in the set. What I need is a list with multiple items. So perhaps I should be calling split() function on every line to get a list. But I am getting a feeling that there should be a much more straight-forward way. Appreciate any help. Thanks.
Edit: To give more information, using Spark SQL I have filtered my dataset to those rows that contain the problem. However the rows contain information in following format (separated by '|'). And I need to extract those values from column 3 which has corresponding flag set to 1 in column 4 (Here it is 0xcd)
Field1|Field2|0xab,0xcd,0xef|0x00,0x01,0x00
So I am collecting the output at the driver and then parsing the last 2 columns after which I am left with regular strings that I want to put back in a dataframe. I am not sure if I can achieve the same using Spark SQL to parse the output in the manner I want.
Yes, indeed your current approach seems a little too complicated... Creating large string in Spark Driver and then parallelizing it with Spark is not really performant.
First of all question from where you are getting your input data? In my opinion you should use one of existing Spark readers to read it. For example you can use:
CSV -> http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.csv
jdbc -> http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.jdbc
json -> http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.json
parquet -> http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.parquet
not structured text file -> http://spark.apache.org/docs/2.1.0/api/python/pyspark.html#pyspark.SparkContext.textFile
In next step you can preprocess it using Spark DataFrame or RDD API depending on your use case.
A bit late, but currently you're applying a map to create a tuple for each row containing the string as the first element. Instead of this, you probably want to split the string, which can easily be done inside the map step. Assuming all of your rows have the same number of elements you can replace:
output_df = output_rdd.map(lambda x: (x, )).toDF()
with
output_df = output_rdd.map(lambda x: x.split()).toDF()

Importing CSV to Cassandra DB

I am trying to import a .csv file to Cassandra by using the COPY command in the CQL. The problem is that, after executing the command, the console displays that '327 rows imported in 0.589 seconds' but only the last row from the csv file gets into the Cassandra table, and the size of the table is only one row.
Here is the command that i am using for copying:
copy test from '/root/Documents/test.csv' with header=true;
And when i do select * from test; it shows only one row.
Any help would be appreciated.
Posting my comment here as an answer for future reference.
One way this could happen (where all rows are imported but the result is only 1 left in table) is if every row had the same primary key. Each insert is really doing a replacement, hence the result of just a single row of data.

Resources