Import huge dataset into Access from Excel via VBA

I have a huge dataset which I need to import from Excel into Access (~800k lines). However, I can ignore lines with a particular column value, which make up like 90% of the actual dataset. So in fact, I only really need like 10% of the lines imported.
In the past I've been importing Excel files line by line in the following manner (pseudo code):
For i = 1 To EOF
    sql = "Insert Into [Table] (Column1, Column2) VALUES ('" & _
          xlSheet.Cells(i, 1).Value & "', '" & _
          xlSheet.Cells(i, 2).Value & "');"
Next i
DoCmd.RunSQL sql
With ~800k lines this takes way too long, as a separate query is created and run for every single line.
Considering the fact that I can also ignore 90% of the lines, what is the fastest approach to import the dataset from Excel to Access?
I was thinking of creating a temporary Excel file with a filter applied and then importing that filtered file.
But is there a better/faster approach than this? Also, what is the fastest way to import an Excel file via Access VBA?
Thanks in advance.

Consider running the import as a single Access query. Add the SQL below to an Access query window, or run it as an SQL string over a DAO/ADO connection. Note that a WHERE clause referencing named column headers requires HDR=Yes; with HDR=No, as below, the columns are exposed as F1, F2, and so on:
INSERT INTO [Table] (Column1, Column2)
SELECT *
FROM [Excel 12.0 Xml;HDR=No;Database=C:\Path\To\Workbook.xlsx].[SHEET1$];
Alternatively, run a make-table query if you need a staging table (to remove the 90% of unneeded lines) before loading the final table, but note that this query replaces the table if it already exists:
SELECT * INTO [NewTable]
FROM [Excel 12.0 Xml;HDR=No;Database=C:\Path\To\Workbook.xlsx].[SHEET1$];
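Since only ~10% of the rows are actually needed, the filter can be pushed into the same query, so the unwanted rows never reach Access at all. With HDR=Yes the spreadsheet's first row supplies the column names; the header names and skip value below are placeholders for your actual ones:

```sql
INSERT INTO [Table] (Column1, Column2)
SELECT [Header1], [Header2]
FROM [Excel 12.0 Xml;HDR=Yes;Database=C:\Path\To\Workbook.xlsx].[SHEET1$]
WHERE [FilterHeader] <> 'ValueToIgnore';
```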

A slight change in your code will do the filtering for you:
Dim strTest As String
For i = 1 To EOF
    strTest = xlSheet.Cells(i, 1).Value
    If Nz(strTest) <> "" Then
        sql = "Insert Into [Table] (Column1, Column2) VALUES ('" & _
              strTest & "', '" & _
              xlSheet.Cells(i, 2).Value & "');"
        DoCmd.RunSQL sql
    End If
Next i
I assume having the RunSQL outside the loop was just a mistake in your pseudocode. This tests whether the cell in the first column is empty, but you can substitute any condition appropriate for your situation. Also consider CurrentDb.Execute sql, dbFailOnError instead of DoCmd.RunSQL, which avoids the confirmation prompts.

I'm a little late to the party but I stumbled on this looking for information on a similar problem. I thought I might share my solution in case it could help others or maybe OP, if he/she is still working on it. Here's my problem and what I did:
I have an established Access database with approximately the same number of rows as the OP's (6 columns, approx 850k rows). About once a week we receive a .xlsx file from a partner company, with one sheet and the data in the same structure as the db.
This file contains the entire db plus updates (new records and changes to old records, no deletions). The first column contains a unique identifier for each row. The Access db is updated when we receive the file, through queries similar to those Parfait suggested, but since it's the entire 850k+ records, the compare-and-update takes 10-15 minutes or longer, depending on what else we have going on.
Since it would be faster to load just the changes into the current Access db, I needed to produce a delta file (preferably a .txt that can be opened with Excel and saved as .xlsx if needed). I assume this is similar to what the OP was looking for. To do this I wrote a small application in C++ to compare the previous week's file to the current week's. The data itself is an amalgam of character and numerical data that I will just call string1 through string6 here for simplicity. It looks like this:
Col1 Col2 Col3 Col4 Col5 Col6
string1 string2 string3 string4 string5 string6
.......
''''Through 850k rows''''
After saving both .xlsx files as .txt tab delimited files, they look like this:
Col1\tCol2\tCol3\tCol4\tCol5\tCol6\n
string1\tstring2\tstring3\tstring4\tstring5\tstring6\n
....
//Through 850k rows//
The fun part! I took the old .txt file and stored it as a hash table (using the C++ unordered_map from the standard library). Then, streaming the new .txt file, I used Col1 of each line as a key into the hash table and wrote the differences to two output files: one with new records you could append to the db with a query, and one with changed records you could use to update it.
I've heard it's possible to create a more efficient hash table than the unordered_map, but at the moment this works well, so I'll stick with it. Here's my code.
#include <iostream>
#include <fstream>
#include <string>
#include <unordered_map>

int main()
{
    using namespace std;
    //variables
    const string myInFile1{"OldFile.txt"};
    const string myInFile2{"NewFile.txt"};
    string mappedData;
    string key;
    //hash table objects
    unordered_map<string, string> hashMap;
    unordered_map<string, string>::iterator cursor;
    //input files
    ifstream fin1;
    ifstream fin2;
    fin1.open(myInFile1);
    fin2.open(myInFile2);
    //output files
    ofstream fout1;
    ofstream fout2;
    fout1.open("For Updated.txt"); //changed records, for updating old records
    fout2.open("For Upload.txt");  //new records, for uploading
    //This loop takes the original input file (i.e., what is in the database already)
    //and hashes the entire file using the Col1 data as a key. On my system this takes
    //approximately 2 seconds for 850k+ rows with 6 columns
    while(fin1)
    {
        getline(fin1, key, '\t');        //get the first column
        getline(fin1, mappedData, '\n'); //get the other 5 columns
        if(key.empty())                  //skip the empty read at eof
            break;
        hashMap[key] = mappedData;       //store the data in the hash table
    }
    fin1.close();
    //output file headings
    fout1 << "Col1\t" << "Col2\t" << "Col3\t" << "Col4\t" << "Col5\t" << "Col6\n";
    fout2 << "Col1\t" << "Col2\t" << "Col3\t" << "Col4\t" << "Col5\t" << "Col6\n";
    //This loop reads each line of the second input file, first up to the first tab
    //delimiter (stored as "key"), then up to the newline character (stored as
    //"mappedData"), and then looks up "key" in the hash table. If the key is not
    //found, the record is new and goes to the upload output file. If it is found
    //but its mappedData differs from the hash table's value, the record has changed
    //and goes to the update output file. I realize that while(fin2) is not the
    //optimal syntax for this loop, so I check whether the key is empty (eof) after
    //retrieving the current line from the input file. YMMV on the time here
    //depending on how many records are added or updated (1000 records takes about
    //another 5 seconds on my system)
    while(fin2)
    {
        getline(fin2, key, '\t');        //get key from Col1 in the input file
        getline(fin2, mappedData, '\n'); //get the mappedData (Col2-Col6)
        if(key.empty())                  //exit the file read if key is empty
            break;
        cursor = hashMap.find(key);      //look up key in the hash table
        if(cursor != hashMap.end())      //key found: existing record
        {
            if(cursor->second != mappedData) //data changed: send to update file
            {
                fout1 << key << "\t" << mappedData << "\n";
            }
        }
        else //key not found: new record, send to upload file
        {
            fout2 << key << "\t" << mappedData << "\n";
        }
    }
    fin2.close();
    fout1.close();
    fout2.close();
    return 0;
}
There are a few things I am working on to make this an easy-to-use executable (for example, reading the XML structure inside the .xlsx zip archive for direct reading, or maybe using an ODBC connection), but for now I'm just testing it to make sure the outputs are correct. Of course, the output files would then have to be loaded into the Access database using queries similar to what Parfait suggested. Also, I'm not sure if Excel or Access VBA have a library for building hash tables, but it might be worth exploring further if it saves time in accessing the Excel data. Any criticisms or suggestions are welcome.
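For reference, the same two-pass compare is compact in any language with a built-in hash table; here is a minimal Python sketch under the same assumptions as the C++ version (tab-delimited files, key in the first column; the file names are placeholders):

```python
import csv

def delta(old_path, new_path, updated_path, upload_path):
    """Hash the old file by its first column, then stream the new file and
    split rows into a 'changed' delta file and a 'new' delta file."""
    old = {}
    with open(old_path, newline='') as f:
        reader = csv.reader(f, delimiter='\t')
        next(reader)  # skip header
        for row in reader:
            old[row[0]] = row[1:]  # key -> remaining columns

    with open(new_path, newline='') as f, \
         open(updated_path, 'w', newline='') as fupd, \
         open(upload_path, 'w', newline='') as fnew:
        reader = csv.reader(f, delimiter='\t')
        wupd = csv.writer(fupd, delimiter='\t')
        wnew = csv.writer(fnew, delimiter='\t')
        header = next(reader)
        wupd.writerow(header)
        wnew.writerow(header)
        for row in reader:
            key, rest = row[0], row[1:]
            if key not in old:
                wnew.writerow(row)      # new record -> upload file
            elif old[key] != rest:
                wupd.writerow(row)      # changed record -> update file
```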

Related

Using sql string as sql statement in odbc power query in excel

I am looking to use a string that takes in some parameters from my cells and use it as my SQL statement, like so:
" select * from table where " & data_from_cells & " group by ...;"
and store it as sqlstring:
let
    mystring = sqlstring,
    myQuery = Odbc.Query("driver={Oracle in etc etc etc", mystring)
in
    myQuery
and I run into this error
Formula.Firewall: Query '...' (step 'myQuery') references other queries or steps, so it may not directly access a data source. Please rebuild this data combination.
Now apparently I can't combine external queries with another query, but I am only using parameters passed from Excel for my SQL string? I was hoping to use the WITH keyword as well, to build nested queries using parameters, but it doesn't even let me combine values from Excel with an SQL statement.
to be clear, the data_from_cells was transformed and formatted as a string.
When queries that do stuff are called by other queries that do other stuff, sometimes you can get firewall issues.
The way to get around that is for everything to be done in a single query.
The way to get around that without ending up with horrible code is to change your called queries from returning the result to functions that return the result.
For sqlstring:
() => " select * from table where " & data_from_cells & " group by ...;" // returns a function that gets the query string when called
Then your "myQuery" query can be
let
    mystring = sqlstring(), //note the parentheses!
    myQuery = Odbc.Query("driver={Oracle in etc etc etc", mystring)
in
    myQuery

In Ruby, how would one create new CSV's conditionally from an original CSV?

I'm going to use this as sample data to simplify the problem:
data_set_1
I want to split the contents of this csv according to Column A - DEPARTMENT and place them on new csv's named after the department.
If it were done in the same workbook (so it can fit in one image) it would look like:
data_set_2
My initial thought was something pretty simple like:
CSV.foreach('test_book.csv', headers: true) do |asset|
  CSV.open("/import_csv/#{asset[1]}", "a") do |row|
    row << asset
  end
end
since that should take care of the logic for me. However, from looking into it, CSV#foreach does not accept file access rights as a second parameter, and I get an error when I run it. Any help would be appreciated, thanks!
I don't see why you would need to pass file access rights to CSV#foreach. This method just reads the CSV. How I would do this is like so:
require 'csv'

# Parse the entire CSV into an array of rows.
orig_rows = CSV.parse(File.read('test_book.csv'), headers: true)

# Group the rows by department.
# This becomes { 'deptA' => [<rows>], 'deptB' => [<rows>], etc }
groups = orig_rows.group_by { |row| row[1] }

# Write each group of rows to its own file
groups.each do |dept, rows|
  CSV.open("/import_csv/#{dept}.csv", "w") do |csv|
    rows.each do |row|
      csv << row.fields
    end
  end
end
A caveat, though. This approach loads the entire CSV into memory, so if your file is very large, it won't work. In that case, the "streaming" (line-by-line) approach that you show in your question would be preferable.
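A sketch of that streaming variant, keeping one open writer per department so the whole file is never held in memory (the department column index row[1] and file layout are carried over from the snippets above; the output directory is a parameter):

```ruby
require 'csv'

# Streaming split: read the source CSV row by row and append each row to a
# per-department file, caching one open CSV writer per department.
def split_by_department(source, out_dir)
  handles = {}
  CSV.foreach(source, headers: true) do |row|
    dept = row[1]
    handles[dept] ||= CSV.open(File.join(out_dir, "#{dept}.csv"), 'w')
    handles[dept] << row.fields
  end
ensure
  handles.each_value(&:close)
end
```

Each department file gets data rows only; write a header row when first opening a handle if you need one.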

How to replace Ctrl+M character from spark dataset using regexp_replace()?

I have a Spark dataset with a Ctrl+M character present in one of its columns; as a result, that record gets split into two records and the data is corrupted.
I have added code to handle the newline regex \r\n, but I am not sure whether this same code will also handle Ctrl+M, i.e. ^M:
filtered = filtered.selectExpr(convertListToSeq(colsList))
.withColumn(newCol, functions.when(filtered.col(column).notEqual("null"), functions.regexp_replace(filtered.col(column), "[\r\n]", " ")));
Will the code functions.regexp_replace(filtered.col(column), "<ascii for Ctrl+M>", " "); work? I don't know the ASCII value of Ctrl+M.
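For what it's worth, Ctrl+M is simply the carriage-return character: ASCII 13, written '\r' in Java source and "\r" in a regex, so the [\r\n] character class already in the snippet covers it. A plain-Java check (no Spark session needed, since regexp_replace follows Java regex syntax):

```java
public class CtrlMDemo {
    public static void main(String[] args) {
        // Ctrl+M is carriage return: ASCII 13, '\r' in Java.
        System.out.println((int) '\r');  // prints 13
        // Same character class as in the Spark snippet above.
        String withCtrlM = "first\rsecond\r\nthird";
        String cleaned = withCtrlM.replaceAll("[\r\n]", " ");
        System.out.println(cleaned);     // prints "first second  third"
    }
}
```

So no separate ASCII code is needed; the existing pattern replaces each ^M (and each newline) with a space.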

Postgresql COPY empty string as NULL not work

I have a CSV file with an integer column whose empty values are saved as "" (empty string).
I want to COPY them into a table as NULL values.
From Java code, I have tried these:
String sql = "COPY " + tableName + " FROM STDIN (FORMAT csv,DELIMITER ',', HEADER true)";
String sql = "COPY " + tableName + " FROM STDIN (FORMAT csv,DELIMITER ',', NULL '' HEADER true)";
I get: PSQLException: ERROR: invalid input syntax for type numeric: ""
String sql = "COPY " + tableName + " FROM STDIN (FORMAT csv,DELIMITER ',', NULL '\"\"' HEADER true)";
I get: PSQLException: ERROR: CSV quote character must not appear in the NULL specification
Any one has done this before ?
I assume you are aware that numeric data types have no concept of an "empty string" (''). It's either a number or NULL (or 'NaN' for numeric, but not for integer et al.).
It looks like you exported from a string data type like text and had some actual empty strings in there, which are now represented as "" (" being the default QUOTE character in CSV format).
NULL would be represented by nothing, not even quotes. The manual:
NULL
Specifies the string that represents a null value. The default is \N
(backslash-N) in text format, and an unquoted empty string in CSV format.
You cannot define "" to generally represent NULL, since that already represents an empty string; it would be ambiguous.
To fix, I see two options:
Edit the CSV file / stream before feeding it to COPY and replace "" with nothing. This might be tricky if you also have actual empty strings in there, or "" escaping a literal " inside strings.
(What I would do.) Import into an auxiliary temporary table with identical structure, except for the integer column converted to text. Then INSERT (or UPSERT?) into the target table from there, converting the integer value properly on the fly:
-- empty temp table with identical structure
CREATE TEMP TABLE tbl_tmp AS TABLE tbl LIMIT 0;
-- ... except for the int / text column
ALTER TABLE tbl_tmp ALTER col_int TYPE text;
COPY tbl_tmp ...;
INSERT INTO tbl -- identical number and names of columns guaranteed
SELECT col1, col2, NULLIF(col_int, '')::int -- list all columns in order here
FROM tbl_tmp;
Temporary tables are dropped at the end of the session automatically. If you run this multiple times in the same session, either just truncate the existing temp table or drop it after each transaction.
Related:
How to update selected rows with values from a CSV file in Postgres?
Rails Migrations: tried to change the type of column from string to integer
postgresql thread safety for temporary tables
Since Postgres 9.4 you have the ability to use FORCE_NULL. This causes a quoted empty string to be converted into NULL. Very handy, especially with CSV files (in fact, this option is only allowed with CSV format).
The syntax is as follow:
COPY table FROM '/path/to/file.csv'
WITH (FORMAT CSV, DELIMITER ';', FORCE_NULL (columnname));
Further details are explained in the documentation: https://www.postgresql.org/docs/current/sql-copy.html
If you want to load all empty and blank values as NULL, just add EMPTYASNULL BLANKSASNULL to the COPY command. Note that these options belong to Amazon Redshift's COPY, not stock PostgreSQL.
Syntax:
COPY Table_name (columns_list)
FROM 's3://{bucket}/{s3_bucket_directory_name + manifest_filename}'
IAM_ROLE '{REDSHIFT_COPY_COMMAND_ROLE}' EMPTYASNULL BLANKSASNULL
MANIFEST DELIMITER ',' IGNOREHEADER 1 COMPUPDATE OFF CSV GZIP;
Note: it will apply to all records which contain empty/blank values.

Importing data from Excel into Access using DAO and WHERE clause

I need to import certain information from an Excel file into an Access DB and in order to do this, I am using DAO.
The user gets the excel source file from a system, he does not need to directly interact with it. This source file has 10 columns and I would need to retrieve only certain records from it.
I am using this to retrieve all the records:
Set destinationFile = CurrentDb
Set dbtmp = OpenDatabase(sourceFile, False, True, "Excel 8.0;")
DoEvents
Set rs = dbtmp.OpenRecordset("SELECT * FROM [EEX_Avail_Cap_ALL_DEU_D1_S_Y1$A1:J65536]")
My problem comes when I want to retrieve only certain records using a WHERE clause. The field I want to filter on is named 'Date (UCT)' (remember that the user gets this source file from another system), and I cannot get the WHERE clause to work on it. If I apply the WHERE clause to another field, whose name has no parentheses or spaces, then it works. Example:
Set rs = dbtmp.OpenRecordset("SELECT * FROM [EEX_Avail_Cap_ALL_DEU_D1_S_Y1$A1:J65536] WHERE Other = 12925")
The previous instruction will retrieve only the number of records where the field Other has the value 12925.
Could anyone please tell me how can I achieve the same result but with a field name that has spaces and parenthesis i.e. 'Date (UCT)' ?
Thank you very much.
Octavio
Try enclosing the field name in square brackets:
SELECT * FROM [EEX_Avail_Cap_ALL_DEU_D1_S_Y1$A1:J65536] WHERE [Date (UCT)] = 12925
or if it's a date we are looking for:
SELECT * FROM [EEX_Avail_Cap_ALL_DEU_D1_S_Y1$A1:J65536] WHERE [Date (UCT)] = #02/14/13#;
To use a date literal you must enclose it in # characters and write the date in MM/DD/YY format, regardless of any regional settings on your machine.
