I'm unable to import a CSV into a Postgres table using copy_expert. The error is due to null values.
The column type in the DB allows nulls, and inserting rows manually via INSERT INTO works fine.
From what I understand so far, it's because COPY reads the empty values as text, which is why it fails on the timestamp datatype. However, I can't find the right syntax to coerce those empty values to NULL. Code snippet below:
with open(ab, 'r') as f:
    cur.copy_expert("""COPY client_marketing (field1,field2,field3) FROM STDIN DELIMITER ',' CSV HEADER""", f)
Error msg:
DataError: invalid input syntax for type timestamp: ""
I'd appreciate any help with the script, or pointers to the right sources to read up on.
I was able to do this by adding force_null (column_name). E.g., if field3 is your timestamp:
copy client_marketing (field1, field2, field3) from stdin with (
    format csv,
    delimiter ',',
    header,
    force_null (field3)
);
Hope that helps. See https://www.postgresql.org/docs/10/sql-copy.html
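If you want to drive the same thing from Python, here's a minimal sketch of passing those options through copy_expert (assuming psycopg2 and the same cur, ab and table as in the question; the conn object is an assumption):

with open(ab, 'r') as f:
    cur.copy_expert(
        """COPY client_marketing (field1, field2, field3)
           FROM STDIN WITH (FORMAT csv, DELIMITER ',', HEADER, FORCE_NULL (field3))""",
        f,
    )
conn.commit()  # assuming a psycopg2 connection object named conn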
I recently started using Cassandra for a new project and I'm doing some load testing.
I have a scenario where I'm doing a dsbulk load from a CSV, like this:
$ dsbulk load -url <csv path> -k <keyspace> -t <table> -h <host> -u <user> -p <password> -header true -cl LOCAL_QUORUM
My CSV file entries look like this:
userid birth_year created_at freq
1234 1990 2023-01-13T23:27:15.563Z {1234:{"(1, 2)": 1}}
Column types:
userid bigint PRIMARY KEY,
birth_year int,
created_at timestamp,
freq map<bigint, frozen<map<frozen<tuple<tinyint, smallint>>, smallint>>>
The issue is with the freq column: I've tried different ways of setting its value in the CSV, as below, but I'm not able to insert the row using dsbulk.
If I set freq to {1234:{[1, 2]: 1}}, I get:
com.datastax.oss.dsbulk.workflow.commons.schema.InvalidMappingException: Could not map field freq to variable freq; conversion from Java type java.lang.String to CQL type Map(BIGINT => Map(Tuple(TINYINT, SMALLINT) => SMALLINT, not frozen), not frozen) failed for raw value: {1234:{[1,2]: 1}}
Caused by: java.lang.IllegalArgumentException: Could not parse '{1234:{[1, 2]: 1}}' as Json
Caused by: com.fasterxml.jackson.core.JsonParseException: Unexpected character ('[' (code 91)): was expecting either valid name character (for unquoted name) or double-quote (for quoted) to start field name
at [Source: (String)"{1234:{[1, 2]: 1}}"; line: 1, column: 9]
If I set freq to {\"1234\":{\"[1, 2]\":1}}, I get:
java.lang.IllegalArgumentException: Expecting record to contain 4 fields but found 5.
And if I set freq to {1234:{"[1, 2]": 1}} or {1234:{"(1, 2)": 1}}, I get:
Source: 1234,80,2023-01-13T23:27:15.563Z,"{1234:{""[1, 2]"": 1}}" java.lang.IllegalArgumentException: Expecting record to contain 4 fields but found 5.
But with the cqlsh COPY ... FROM command, the value {1234:{[1, 2]:1}} for freq inserts into the DB without any error, and the value in the DB looks like {1234: {(1, 2): 1}}.
I'm guessing the JSON parser doesn't accept an array (tuple) as a key when I load with dsbulk? Can someone advise me on what the issue is and how to fix it? Appreciate your help.
When loading data using the DataStax Bulk Loader (DSBulk), the CSV format for the CQL tuple type is different from the format used by the cqlsh COPY ... FROM command because DSBulk uses a different parser.
Formatting the CSV data is particularly challenging in your case because the column contains multiple nested CQL collections.
InvalidMappingException
The JSON parser used by DSBulk doesn't accept parentheses () for enclosing tuples, and it also expects the tuples to be enclosed in double quotes ("); otherwise you'll get errors like:
com.datastax.oss.dsbulk.workflow.commons.schema.InvalidMappingException: \
Could not map field ... to variable ...; \
conversion from Java type ... to CQL type ... failed for raw value: ...
...
Caused by: java.lang.IllegalArgumentException: Could not parse '...' as Json
...
Caused by: com.fasterxml.jackson.core.JsonParseException: \
Unexpected character ('[' (code 91)): was expecting either valid name character \
(for unquoted name) or double-quote (for quoted) to start field name
...
IllegalArgumentException
Since tuple values contain a comma (,) as a separator, DSBulk parses the rows incorrectly: it thinks each row contains more fields than expected and throws an IllegalArgumentException, for example:
java.lang.IllegalArgumentException: Expecting record to contain 2 fields but found 3.
Solution
Just to make it easier, here is the schema for the table I'm using as an example:
CREATE TABLE inttuples (
    id int PRIMARY KEY,
    inttuple map<frozen<tuple<tinyint, smallint>>, smallint>
)
In this example CSV file, I've used the pipe character (|) as a delimiter:
id|inttuple
1|{"[2,3]":4}
Here's another example that uses tabs as the delimiter:
id inttuple
1 {"[2,3]":4}
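If you're generating the CSV from code rather than by hand, here's a minimal sketch in plain Python that writes the pipe-delimited variant above (the file name is just a placeholder; the point is the layout of the inttuple value):

with open('inttuples.csv', 'w') as f:
    f.write('id|inttuple\n')        # header row
    f.write('1|{"[2,3]":4}\n')      # tuple key written as the JSON string "[2,3]"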
Note that you will need to specify the delimiter with either -delim '|' or -delim '\t' when running DSBulk. Cheers!
I unloaded a Snowflake table and created a data frame.
This table has data of various datatypes.
I tried to save it as a text file but got an error:
Text data source does not support Decimal(10,0).
So to resolve the error, I added casts to my SELECT query to convert all columns to the string datatype.
Then I got the below error:
Text data source supports only single column, and you have 5 columns.
My requirement is to create a text file as follows:
"column1value column2value column3value and so on"
You can use a CSV output with a space delimiter:
import pyspark.sql.functions as F
df.select([F.col(c).cast('string') for c in df.columns]).write.csv('output', sep=' ')
If you want only 1 output file, you can add .coalesce(1) before .write.
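For example (a sketch using the same df as above; 'output' is just a placeholder path):

import pyspark.sql.functions as F

(df.select([F.col(c).cast('string') for c in df.columns])
    .coalesce(1)                    # single partition, so a single CSV part file is written
    .write.csv('output', sep=' '))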
You need to have a single column if you want to write using df.write.text. You can use CSV instead, as suggested in mck's answer, or you can concatenate all columns into one before you write:
import org.apache.spark.sql.functions.{concat_ws, col}

df.select(
  concat_ws(" ", df.columns.map(c => col(c).cast("string")): _*).as("value")
).write
  .text("output")
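If you're working in PySpark rather than Scala, a roughly equivalent sketch (same df, same placeholder output path) would be:

from pyspark.sql.functions import concat_ws, col

df.select(
    concat_ws(" ", *[col(c).cast("string") for c in df.columns]).alias("value")
).write.text("output")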
I created a CSV file using spark as follows:
t1.write.option("sep","\001").mode("overwrite").format("csv").save("s3://test123/testcsv001/")
And then tried a COPY command in Redshift to load the CSV file:
copy schema123.table123
from 's3://test123/testcsv001/'
access_key_id 'removed'
secret_access_key 'removed'
session_token 'removed'
TIMEFORMAT 'auto' DATEFORMAT 'auto' DELIMITER '\001' IGNOREHEADER AS 0 TRUNCATECOLUMNS NULL as 'NULL' TRIMBLANKS ACCEPTANYDATE EMPTYASNULL BLANKSASNULL ESCAPE COMPUPDATE OFF STATUPDATE ON
;
The command fails on records where the first column has a null value.
The first column in Spark is defined as a LONG.
The target column is a BIGINT with no NOT NULL constraint.
I cast the column to INT in Spark, wrote it to CSV again, and it still failed for the same reason.
Per the Redshift documentation, loading NULLs into a BIGINT should work fine.
Any insight into this?
You are setting NULL AS 'NULL'. This means that when the string "NULL" appears in your source file, the value is NULL. So when your input file has "" as the input to a BIGINT, what is Redshift supposed to do? You said you would give it "NULL" when the value is NULL.
I expect you want NULL AS '', and you should also set the file format to CSV so that standard CSV rules apply.
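For illustration, here's a sketch of the adjusted COPY run from Python with psycopg2 (connection details and credentials are placeholders; note that Redshift doesn't allow ESCAPE together with CSV, so it's dropped here):

import psycopg2  # Redshift speaks the PostgreSQL wire protocol

copy_sql = r"""
copy schema123.table123
from 's3://test123/testcsv001/'
access_key_id '...'
secret_access_key '...'
session_token '...'
CSV DELIMITER '\001' NULL AS ''
TIMEFORMAT 'auto' DATEFORMAT 'auto' IGNOREHEADER AS 0
TRUNCATECOLUMNS TRIMBLANKS ACCEPTANYDATE EMPTYASNULL BLANKSASNULL
COMPUPDATE OFF STATUPDATE ON;
"""

# placeholder connection values
with psycopg2.connect(host='my-cluster.redshift.amazonaws.com', port=5439,
                      dbname='db', user='user', password='pw') as conn:
    with conn.cursor() as cur:
        cur.execute(copy_sql)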
I have a table in Postgres with a column named TIME (uppercase).
Inserting the CSV into this table from Postgres itself using a SQL command is easy:
COPY american_district FROM 'O:\Python\PostGREsql\district.csv' WITH CSV HEADER DELIMITER AS ',' NULL AS '\N';
but inserting the same CSV into the table using the Python code below gives me an error:
f = open('O:\Python\PostGREsql\district.csv')
cur_DBKPI.copy_from(f, 'american_district', sep=',', null='')
ERROR:
psycopg2.DataError: invalid input syntax for type date: "TIME"
CONTEXT: COPY american_district, line 1, column time: "TIME"
I found that it's best practice to keep column names in lowercase, but is there any workaround for this?
Got it working by changing the syntax of my query to include the HEADER option, using copy_expert:
f = open('O:\Python\PostGREsql\district.csv')
sql = "COPY american_district FROM STDIN WITH CSV HEADER DELIMITER AS ',' NULL AS '\\N'"
cur_DBKPI.copy_expert(sql=sql,file=f)
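A slightly tidier version of the same call (just a raw string for the Windows path and a with block so the file gets closed; same psycopg2 API):

sql = "COPY american_district FROM STDIN WITH CSV HEADER DELIMITER AS ',' NULL AS '\\N'"
with open(r'O:\Python\PostGREsql\district.csv') as f:
    cur_DBKPI.copy_expert(sql=sql, file=f)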
I read the post:
Turning a Comma Separated string into individual rows
And really like the solution:
SELECT A.OtherID,
       Split.a.value('.', 'VARCHAR(100)') AS Data
FROM
(   SELECT OtherID,
        CAST('<M>' + REPLACE(Data, ',', '</M><M>') + '</M>' AS XML) AS Data
    FROM Table1
) AS A
CROSS APPLY Data.nodes('/M') AS Split(a);
But it did not work when I tried to apply the method in Teradata for a similar question. Here is the summarized error:
SELECT failed 3707: expected something between '.' and the 'value' keyword.
So is this code only valid in SQL Server? Would anyone help me make it work in Teradata or SAS SQL? Your help would be really appreciated!
This is SQL Server syntax.
In Teradata there's a table UDF named STRTOK_SPLIT_TO_TABLE,
e.g.
SELECT *
FROM dbc.DatabasesV AS db
JOIN
 (
   SELECT token AS DatabaseName, tokennum
   FROM TABLE (STRTOK_SPLIT_TO_TABLE(1, 'dbc,systemfe', ',')
        RETURNS (outkey INTEGER,
                 tokennum INTEGER,
                 token VARCHAR(128) CHARACTER SET UNICODE)
        ) AS d
 ) AS dt
ON db.DatabaseName = dt.DatabaseName
ORDER BY tokennum;
Or see my answer to this similar question