How to replace Ctrl+M character from spark dataset using regexp_replace()? - apache-spark

Hi, I have a Spark dataset where one of its columns contains a Ctrl+M character in its data; as a result that record gets split into two records and the data is corrupted.
I have already added code to handle the newline regex \r\n, but I am not sure whether this same code will also handle Ctrl+M, i.e. ^M:
filtered = filtered.selectExpr(convertListToSeq(colsList))
.withColumn(newCol, functions.when(filtered.col(column).notEqual("null"), functions.regexp_replace(filtered.col(column), "[\r\n]", " ")));
Will the code functions.regexp_replace(filtered.col(column), "<ascii for Ctrl+M>", " "); work? I don't know the ASCII value of Ctrl+M.
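For reference, Ctrl+M is the carriage return character (ASCII 13, written "\r" in Java), so the existing pattern "[\r\n]" already matches it. A minimal sketch using the Java Spark API (the column names below are placeholders, not from the original code):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;

// Placeholder column names; Ctrl+M is the carriage return (ASCII 13), i.e. "\r",
// so a character class containing \r and \n removes it.
Dataset<Row> cleaned = filtered.withColumn(
        "someCol_clean",
        functions.regexp_replace(functions.col("someCol"), "[\\r\\n]", " "));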

Related

How do I get rid of leading/trailing spaces in SAS search terms?

I have had to look up hundreds (if not thousands) of free-text answers on Google, making notes in Excel along the way and inserting SAS code around the answers as a last step.
The output looks like this:
This output contains an unnecessary number of blank spaces, which seems to confuse SAS's search to the point where the observations can't be properly located.
It works if I manually erase the superfluous spaces, but that would probably take hours. Is there an automated fix for this, either in SAS or in Excel?
I tried using the STRIP-function, to no avail:
else if R_res_ort_txt=strip(" arild ") and R_kom_lan=strip(" skåne ") then R_kommun=strip(" Höganäs " );
If you want to generate a string like:
if R_res_ort_txt="arild" and R_kom_lan="skåne" then R_kommun="Höganäs";
from three variables, let's call them A B C, then just use code like:
string=catx(' ','if R_res_ort_txt=',quote(trim(A))
,'and R_kom_lan=',quote(trim(B))
,'then R_kommun=',quote(trim(C)),';') ;
Or, if you are just writing that string to a file, use this PUT statement syntax:
put 'if R_res_ort_txt=' A :$quote. 'and R_kom_lan=' B :$quote.
'then R_kommun=' C :$quote. ';' ;
A saner solution would be to keep using the free-text answers as data and apply your matching criteria for the transformations with a left join.
proc import out=answers datafile='my-free-text-answers.xlsx' dbms=xlsx replace; run;
data have;
attrib R_res_ort_txt R_kom_lan length=$100;
input R_res_ort_txt ...;
datalines4;
... whatever all those transforms will be performed on...
;;;;
proc sql;
create table want as
select
have.* ,
answers.R_kommun_answer as R_kommun
from
have
left join
answers
on
have.R_res_ort_txt = answers.res_ort_answer
and have.R_kom_lan = answers.kom_lan_answer
;
I solved this by adding quotes in Excel using the Flash Fill function:
https://www.youtube.com/watch?v=nE65QeDoepc

PostgreSQL COPY empty string as NULL does not work

I have a CSV file with an integer column whose missing values are saved as "" (empty string).
I want to COPY them into a table as NULL values.
With Java code, I have tried these:
String sql = "COPY " + tableName + " FROM STDIN (FORMAT csv,DELIMITER ',', HEADER true)";
String sql = "COPY " + tableName + " FROM STDIN (FORMAT csv,DELIMITER ',', NULL '' HEADER true)";
I get: PSQLException: ERROR: invalid input syntax for type numeric: ""
String sql = "COPY " + tableName + " FROM STDIN (FORMAT csv,DELIMITER ',', NULL '\"\"' HEADER true)";
I get: PSQLException: ERROR: CSV quote character must not appear in the NULL specification
Has anyone done this before?
I assume you are aware that numeric data types have no concept of an "empty string" (''). It's either a number or NULL (or 'NaN' for numeric - but not for integer et al.).
Looks like you exported from a string data type like text and had some actual empty strings in there, which are now represented as "" (" being the default QUOTE character in CSV format).
NULL would be represented by nothing, not even quotes. The manual:
NULL
Specifies the string that represents a null value. The default is \N
(backslash-N) in text format, and an unquoted empty string in CSV format.
You cannot define "" to generally represent NULL, since that already represents an empty string; it would be ambiguous.
To fix, I see two options:
Edit the CSV file / stream before feeding it to COPY and replace "" with nothing (a rough JDBC sketch follows below). This might be tricky if you have actual empty strings in there as well, or "" escaping a literal " inside strings.
(What I would do.) Import to an auxiliary temporary table with identical structure except for the integer column converted to text. Then INSERT (or UPSERT?) to the target table from there, converting the integer value properly on the fly:
-- empty temp table with identical structure
CREATE TEMP TABLE tbl_tmp AS TABLE tbl LIMIT 0;
-- ... except for the int / text column
ALTER TABLE tbl_tmp ALTER col_int TYPE text;
COPY tbl_tmp ...;
INSERT INTO tbl -- identical number and names of columns guaranteed
SELECT col1, col2, NULLIF(col_int, '')::int -- list all columns in order here
FROM tbl_tmp;
Temporary tables are dropped at the end of the session automatically. If you run this multiple times in the same session, either just truncate the existing temp table or drop it after each transaction.
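Since the question goes through JDBC, a rough sketch of option 1 (rewriting the stream before handing it to COPY) could look like the following. This is not from the original answer; the connection details, file name, and table name are placeholders, and the naive replacement breaks if "" also appears as an escaped quote inside quoted values:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.StringReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.util.stream.Collectors;
import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public class CsvEmptyToNull {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "secret");
             BufferedReader in = new BufferedReader(new FileReader("data.csv"))) {

            // Naive rewrite: turn "" fields into truly empty fields so COPY
            // reads them as NULL. As noted above, this is unsafe if "" can
            // also occur as an escaped quote inside quoted string values.
            String rewritten = in.lines()
                    .map(line -> line.replace("\"\"", ""))
                    .collect(Collectors.joining("\n"));

            CopyManager cm = conn.unwrap(PGConnection.class).getCopyAPI();
            cm.copyIn("COPY my_table FROM STDIN (FORMAT csv, HEADER true)",
                      new StringReader(rewritten));
        }
    }
}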
Related:
How to update selected rows with values from a CSV file in Postgres?
Rails Migrations: tried to change the type of column from string to integer
postgresql thread safety for temporary tables
Since Postgres 9.4 you have the ability to use FORCE_NULL. This causes a quoted empty string to be converted into NULL. Very handy, especially with CSV files (in fact this is only allowed when using the CSV format).
The syntax is as follows:
COPY table FROM '/path/to/file.csv'
WITH (FORMAT CSV, DELIMITER ';', FORCE_NULL (columnname));
Further details are explained in the documentation: https://www.postgresql.org/docs/current/sql-copy.html
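In the asker's JDBC setup this would just mean changing the COPY string; a hedged fragment, where some_int_col is a placeholder for the integer column that can arrive as "":
// Placeholder column name; FORCE_NULL turns quoted "" into NULL for that column.
String sql = "COPY " + tableName
        + " FROM STDIN (FORMAT csv, DELIMITER ',', HEADER true,"
        + " FORCE_NULL (some_int_col))";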
If you want to replace all blank and empty values with NULL, you just have to add emptyasnull blanksasnull to the (Amazon Redshift) COPY command.
Syntax:
copy Table_name (columns_list)
from 's3://{bucket}/{s3_bucket_directory_name + manifest_filename}'
iam_role '{REDSHIFT_COPY_COMMAND_ROLE}' emptyasnull blanksasnull
manifest DELIMITER ',' IGNOREHEADER 1 compupdate off csv gzip;
Note: it will apply to all the records which contain empty/blank values.

SQLite - Left-pad zeros in returned Text field

I have a text field in my SQLite database that stores a Time value, but for unrelated reasons I can't change the data type to TIME.
The values are stored in HH:MM format, and I'm having trouble trying to sort results by time because the values below '10:00' are missing a leading zero. I would prefer not to store the data with leading zero for the same unrelated reasons.
I'd like to add something to the Query that would pad the missing character if necessary, causing the results to read '08:30' when collected. I've been searching through the command and function lexicon though and I'm not finding what I need.
Is there a simple way to do this inside a query?
Thanks
I think this would work:
select your_col, case when length(your_col) < 5
then '0' || your_col else your_col end from your_table
Demo using Python
>>> conn.execute('''select c, case when length(c) < 5
then '0' || c else c end from t''').fetchall()
[(u'10:00', u'10:00'), (u'8:00', u'08:00')]
SELECT REPLACE(PRINTF('%5s', your_col), ' ', '0') FROM your_table
The PRINTF call pads the value with spaces until it's 5 characters, and the
REPLACE call replaces those spaces with zeros.
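To sort by the padded value, the same expression can go into an ORDER BY. A small sketch over JDBC, assuming the Xerial sqlite-jdbc driver is on the classpath; the database, table, and column names are placeholders:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TimeSortDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder database, table and column names.
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:times.db");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(
                 "SELECT REPLACE(PRINTF('%5s', your_col), ' ', '0') AS padded " +
                 "FROM your_table ORDER BY padded")) {
            while (rs.next()) {
                // '8:30' comes back as '08:30' and sorts before '10:00'
                System.out.println(rs.getString("padded"));
            }
        }
    }
}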

Import huge dataset into Access from Excel via VBA

I have a huge dataset which I need to import from Excel into Access (~800k lines). However, I can ignore lines with a particular column value, which make up like 90% of the actual dataset. So in fact, I only really need like 10% of the lines imported.
In the past I've been importing Excel Files line-by-line in the following manner (pseudo code):
For i = 1 To EOF
sql = "Insert Into [Table] (Column1, Column2) VALUES ('" & _
xlSheet.Cells(i, 1).Value & " ', '" & _
xlSheet.Cells(i, 2).Value & "');"
Next i
DoCmd.RunSQL sql
With ~800k lines this takes way too long, as a query is created and run for every single line.
Considering the fact that I can also ignore 90% of the lines, what is the fastest approach to import the dataset from Excel to Access?
I was thinking of creating a temporary Excel file with a filter activated, and then just importing the filtered Excel file.
But is there a better/faster approach than this? Also, what is the fastest way to import an Excel file via VBA in Access?
Thanks in advance.
Consider running a special Access query for the import. Add the SQL below into an Access query window, or run it as an SQL query over a DAO/ADO connection. Include any WHERE clause you need; note that filtering on named column headers requires HDR=Yes, while it is currently set to HDR=No:
INSERT INTO [Table] (Column1, Column2)
SELECT *
FROM [Excel 12.0 Xml;HDR=No;Database=C:\Path\To\Workbook.xlsx].[SHEET1$];
Alternatively, run a make-table query in case you need a staging temp table (to remove 90% of the lines) prior to the final table, but do note that this query replaces the table if it exists:
SELECT * INTO [NewTable]
FROM [Excel 12.0 Xml;HDR=No;Database=C:\Path\To\Workbook.xlsx].[SHEET1$];
A slight change in your code will do the filtering for you:
Dim strTest As String
For i = 1 To EOF
strTest = xlSheet.Cells(i, 1).Value
If Nz(strTest) <> "" Then
sql = "Insert Into [Table] (Column1, Column2) VALUES ('" & _
strTest & " ', '" & _
xlSheet.Cells(i, 2).Value & "');"
DoCmd.RunSQL sql
End If
Next i
I assume having the RunSQL outside the loop was just a mistake in your pseudocode. This tests whether the cell in the first column is empty, but you can substitute any condition that is appropriate for your situation.
I'm a little late to the party but I stumbled on this looking for information on a similar problem. I thought I might share my solution in case it could help others or maybe OP, if he/she is still working on it. Here's my problem and what I did:
I have an established Access database of approximately the same number of rows as OP's (6 columns, approx 850k rows). We receive a .xlsx file with one sheet, with the data in the same structure as the db, about once a week from a partner company.
This file contains the entire db, plus updates (new records and changes to old records, no deletions). The first column contains a unique identifier for each row. The Access db is updated when we receive the file through similar queries as suggested by Parfait, but since it's the entire 850k+ records, this takes 10-15 minutes or longer to compare and update, depending on what else we have going on.
Since it would be faster to load just the changes into the current Access db, I needed to produce a delta file (preferably a .txt that can be opened with Excel and saved as .xlsx if needed). I assume this is something similar to what OP was looking for. To do this I wrote a small application in C++ to compare the file from the previous week to the one from the current week. The data itself is an amalgam of character and numerical data that I will just call string1 through string6 here for simplicity. It looks like this:
Col1 Col2 Col3 Col4 Col5 Col6
string1 string2 string3 string4 string5 string6
.......
''''Through 850k rows''''
After saving both .xlsx files as .txt tab delimited files, they look like this:
Col1\tCol2\tCol3\tCol4\tCol5\tCol6\n
string1\tstring2\tstring3\tstring4\tstring5\tstring6\n
....
//Through 850k rows//
The fun part! I took the old .txt file and stored it as a hash table (using the C++ unordered_map from the standard library). Then, with an input filestream from the new .txt file, I used Col1 in the new file as a key into the hash table and output any differences to two different files: one you could use in a query to append the db with new data, and the other to update data that has changed.
I've heard it's possible to create a more efficient hash table than the unordered_map but at the moment, this works well so I'll stick with it. Here's my code.
#include <iostream>
#include <fstream>
#include <string>
#include <iterator>
#include <unordered_map>
int main()
{
using namespace std;
//variables
const string myInFile1{"OldFile.txt"};
const string myInFile2{"NewFile.txt"};
string mappedData;
string key;
//hash table objects
unordered_map<string, string> hashMap;
unordered_map<string, string>::iterator cursor;
//input files
ifstream fin1;
ifstream fin2;
fin1.open(myInFile1);
fin2.open(myInFile2);
//output files
ofstream fout1;
ofstream fout2;
fout1.open("For Updated.txt"); //updating old records
fout2.open("For Upload.txt"); //uploading new records
//This loop takes the original input file (i.e.; what is in the database already)
//and hashes the entire file using the Col1 data as a key. On my system this takes
//approximately 2 seconds for 850k+ rows with 6 columns
while(fin1)
{
getline(fin1, key, '\t'); //get the first column
getline(fin1, mappedData, '\n'); //get the other 5 columns
hashMap[key] = mappedData; //store the data in the hash table
}
fin1.close();
//output file headings
fout1 << "COl1\t" << "COl2\t" << "COl3\t" << "COl4\t" << "COl5\t" << "COl6\n";
fout2 << "COl1\t" << "COl2\t" << "COl3\t" << "COl4\t" << "COl5\t" << "COl6\n";
//This loop takes the second input file and reads each line, first up to the
//first tab delimiter and stores it as "key", then up to the new line character
//storing it as "mappedData" and then uses the value of key to search the hash table
//If the key is not found in the hash table, a new record is created in the upload
//output file. If it is found, the mappedData from the file is compared to that of
//the hash table and if different, the updated record is sent to the update output
//file. I realize that while(fin2) is not the optimal syntax for this loop but I
//have included a check to see if the key is empty (eof) after retrieving
//the current line from the input file. YMMV on the time here depending on how many
//records are added or updated (1000 records takes about another 5 seconds on my system)
while(fin2)
{
getline(fin2, key, '\t'); //get key from Col1 in the input file
getline(fin2, mappedData, '\n'); //get the mappeData (Col2-Col6)
if(key.empty()) //exit the file read if key is empty
break;
cursor = hashMap.find(key); //assign the iterator to the hash table at key
if(cursor != hashMap.end()) //check to see if key in hash table
{
if(cursor->second != mappedData) //compare mappedData
{
fout1 << key << "\t" << mappedData << "\n"; //changed record -> update file
}
}
else //key not found in hash table: new record -> upload file
{
fout2 << key << "\t" << mappedData << "\n";
}
}
fin2.close();
fout1.close();
fout2.close();
return 0;
}
There are a few things I am working on to make this an easy-to-use executable (for example reading the XML structure of the .xlsx zip file for direct reading, or maybe using an ODBC connection), but for now I'm just testing it to make sure the outputs are correct. Of course the output files would then have to be loaded into the Access database using queries similar to what Parfait suggested. Also, I'm not sure if Excel or Access VBA has a library to build hash tables, but it might be worth exploring further if it saves time in accessing the Excel data. Any criticisms or suggestions are welcome.

Replace all error values in all columns after importing data (while keeping the rows)

An Excel table used as a data source may contain error values (#N/A, #DIV/0!), which can later disturb some steps of the transformation process in Power Query.
Depending on the following steps, we may get no output but an error. So how do we handle these cases?
I found two standard steps in Power Query to catch them:
Remove errors (UI: Home/Remove Rows/Remove Errors) -> all rows with an error will be removed
Replace error values (UI: Transform/Replace Errors) -> the columns first have to be selected in order to perform this operation.
The first possibility is not a solution for me, since I want to keep the rows and just replace the error values.
In my case, the data table will change over time, meaning the column names may change (e.g. years) or new columns may appear. So the second possibility is too static, since I do not want to change the script each time.
So I've tried to find a dynamic way to clean all columns, independent of the column names (and the number of columns). It replaces the errors with a substitute value (an empty string in the snippet below).
let
Source = Excel.CurrentWorkbook(){[Name="Tabelle1"]}[Content],
//Remove errors of all columns of the data source. ColumnName doesn't play any role
Cols = Table.ColumnNames(Source),
ColumnListWithParameter = Table.FromColumns({Cols, List.Repeat({""}, List.Count(Cols))}, {"ColName" as text, "ErrorHandling" as text}),
ParameterList = Table.ToRows(ColumnListWithParameter ),
ReplaceErrorSource = Table.ReplaceErrorValues(Source, ParameterList)
in
ReplaceErrorSource
Here are the messages of the three different queries, after I've added two new columns (with errors) to the source:
If anybody has another solution for this kind of data cleaning, please post it here.
let
src = Excel.CurrentWorkbook(){[Name="Tabelle1"]}[Content],
cols = Table.ColumnNames(src),
replace = Table.ReplaceErrorValues(src, List.Transform(cols, each {_, "!"}))
in
replace
Just for novices like me in Power Query:
"!" could be any string used as a substitute for error values. I initially thought it was a wildcard.
List.Transform(cols, each {_, "!"}) generates the list of per-column error handling for the main function:
Table.ReplaceErrorValues(table_with_errors, {{col1, error_str1}, {col2, error_str2}, ..., {coln, error_strn}})
Nice elegant solution, Sergei
