What is the correct CSV format for tuples when loading data with DSBulk? - cassandra

I recently started using Cassandra for my new project and I'm doing some load testing.
I have a scenario where I'm doing a dsbulk load using a CSV like this:
$ dsbulk load -url <csv path> -k <keyspace> -t <table> -h <host> -u <user> -p <password> -header true -cl LOCAL_QUORUM
My CSV file entries look like this:
userid birth_year created_at freq
1234 1990 2023-01-13T23:27:15.563Z {1234:{"(1, 2)": 1}}
Column types:
userid bigint PRIMARY KEY,
birth_year int,
created_at timestamp,
freq map<bigint, frozen<map<frozen<tuple<tinyint, smallint>>, smallint>>>
The issue is, for the freq column, I've tried different ways of setting the value in the CSV (shown below), but I'm not able to insert the row using dsbulk.
If I set freq as {1234:{[1, 2]: 1}}, I get:
com.datastax.oss.dsbulk.workflow.commons.schema.InvalidMappingException: Could not map field freq to variable freq; conversion from Java type java.lang.String to CQL type Map(BIGINT => Map(Tuple(TINYINT, SMALLINT) => SMALLINT, not frozen), not frozen) failed for raw value: {1234:{[1,2]: 1}}
Caused by: java.lang.IllegalArgumentException: Could not parse '{1234:{[1, 2]: 1}}' as Json
Caused by: com.fasterxml.jackson.core.JsonParseException: Unexpected character ('[' (code 91)): was expecting either valid name character (for unquoted name) or double-quote (for quoted) to start field name
at [Source: (String)"{1234:{[1, 2]: 1}}"; line: 1, column: 9]
If I set freq as {\"1234\":{\"[1, 2]\":1}}, I get:
java.lang.IllegalArgumentException: Expecting record to contain 4 fields but found 5.
If I set freq as {1234:{"[1, 2]": 1}} or {1234:{"(1, 2)": 1}}, I get:
Source: 1234,80,2023-01-13T23:27:15.563Z,"{1234:{""[1, 2]"": 1}}" java.lang.IllegalArgumentException: Expecting record to contain 4 fields but found 5.
But with the cqlsh COPY ... FROM command, the value {1234:{[1, 2]:1}} for freq inserts into the DB without any error, and the value in the DB looks like this: {1234: {(1, 2): 1}}
I guess the JSON parser is not accepting an array (tuple) as a key when I try this with dsbulk? Can someone advise me on what the issue is and how to fix it? I appreciate your help.

When loading data with the DataStax Bulk Loader (DSBulk), the CSV format for the CQL tuple type is different from the format used by the cqlsh COPY ... FROM command because DSBulk uses a different parser.
Formatting the CSV data is particularly challenging in your case because the column contains multiple nested CQL collections.
InvalidMappingException
The JSON parser used by DSBulk doesn't accept parentheses () when enclosing tuples. It also expects tuples used as map keys to be enclosed in double quotes ("), otherwise you'll get errors like:
com.datastax.oss.dsbulk.workflow.commons.schema.InvalidMappingException: \
Could not map field ... to variable ...; \
conversion from Java type ... to CQL type ... failed for raw value: ...
...
Caused by: java.lang.IllegalArgumentException: Could not parse '...' as Json
...
Caused by: com.fasterxml.jackson.core.JsonParseException: \
Unexpected character ('(' (code 40)): was expecting either valid name character \
(for unquoted name) or double-quote (for quoted) to start field name
...
IllegalArgumentException
Since tuple values contain a comma (,) as the element separator and DSBulk's default CSV delimiter is also a comma, DSBulk splits the row on the extra commas, thinks the record contains more fields than expected, and throws an IllegalArgumentException, for example:
java.lang.IllegalArgumentException: Expecting record to contain 2 fields but found 3.
Solution
Just to make it easier, here is the schema for the table I'm using as an example:
CREATE TABLE inttuples (
    id int PRIMARY KEY,
    inttuple map<frozen<tuple<tinyint, smallint>>, smallint>
)
In this example CSV file, I've used the pipe character (|) as a delimiter:
id|inttuple
1|{"[2,3]":4}
Here's another example that uses tabs as the delimiter:
id inttuple
1 {"[2,3]":4}
Note that you will need to specify the delimiter with either -delim '|' or -delim '\t' when running DSBulk. Cheers!
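Putting it together for the freq column in the original question, a row and load command might look something like this (just a sketch; the tuple key is quoted the same way as in the examples above, and the file path, keyspace, table, host and credentials are the placeholders from the original command):
userid|birth_year|created_at|freq
1234|1990|2023-01-13T23:27:15.563Z|{1234:{"[1,2]":1}}
$ dsbulk load -url <csv path> -k <keyspace> -t <table> -h <host> -u <user> -p <password> -header true -delim '|' -cl LOCAL_QUORUM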

Related

How do I escape the ampersand character (&) in cql?

I am inserting a row into a table with a statement that looks something like this:
insert into db.table (field1, field2) values (1, 'eggs&cheese')
but when I later query this on our servers, my query returns:
eggs\u0026cheese instead.
Not sure whether to use \ or '
If anyone can help, that would be great. Thank you!
This doesn't appear to be a problem with CQL but with the way your app displays the value.
If the CQL column type is text, the value is stored as a UTF-8 encoded string; \u0026 is just the Unicode escape sequence for the ampersand character (&).
Using this example schema:
CREATE TABLE unicodechars (
    id int PRIMARY KEY,
    randomtext text
)
cqlsh displays the ampersand as expected:
cqlsh> SELECT * FROM unicodechars ;
id | randomtext
----+-------------
1 | eggs&cheese
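To illustrate that \u0026 decodes back to a plain ampersand when the string is interpreted as a JSON/Unicode-escaped value, here is a small sketch in Python (just an illustration, not necessarily your app's stack):
import json

# "\u0026" is the Unicode escape sequence for "&";
# decoding the JSON string restores the original text
print(json.loads('"eggs\\u0026cheese"'))  # prints: eggs&cheese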

Redshift COPY Invalid digit, Value '"', Pos 0, Type: Long

I created a CSV file using spark as follows:
t1.write.option("sep","\001").mode("overwrite").format("csv").save("s3://test123/testcsv001/")
And then tried a COPY command in Redshift to load the CSV file:
copy schema123.table123
from 's3://test123/testcsv001/'
access_key_id 'removed'
secret_access_key 'removed'
session_token 'removed'
TIMEFORMAT 'auto' DATEFORMAT 'auto' DELIMITER '\001' IGNOREHEADER AS 0 TRUNCATECOLUMNS NULL as 'NULL' TRIMBLANKS ACCEPTANYDATE EMPTYASNULL BLANKSASNULL ESCAPE COMPUPDATE OFF STATUPDATE ON
;
The command fails on records where the first column has a null value.
The first column in Spark has a column definition of LONG.
The target column is a BIGINT with no NOT NULL constraint.
I cast the column to INT in Spark and wrote it to CSV, and it still failed for the same reason.
Per the Redshift documentation, loading NULLs into a BIGINT should work fine.
Any insight into this?
You are setting NULL AS 'NULL'. This means that when the string "NULL" appears in your source file, the value is NULL. So when your input file has "" as the input to a BIGINT, what is Redshift supposed to do? You said you would give it "NULL" when the value is NULL.
I expect you want NULL AS '' and you should also set the file type to CSV so that standard CSV rules will apply.
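For example, a sketch of the adjusted COPY, keeping the placeholders from the question and trimming the option list to the relevant parts (ESCAPE is dropped here because Redshift does not allow it together with the CSV keyword):
copy schema123.table123
from 's3://test123/testcsv001/'
access_key_id 'removed'
secret_access_key 'removed'
session_token 'removed'
CSV DELIMITER '\001'
TIMEFORMAT 'auto' DATEFORMAT 'auto'
NULL AS ''
COMPUPDATE OFF STATUPDATE ON
;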

Issue with Python copy_expert loading data with nulls

I'm unable to import a CSV into a table in Postgres using copy_expert. The error is due to null values.
My field type in the db allows nulls. Manually inserting via INSERT INTO works fine.
Based on what I've understood so far, it's because copy_expert translates nulls into text, which is why it fails on a timestamp datatype. However, I'm unable to locate the right syntax to coerce the nulls as nulls. Code snippet below:
with open(ab, 'r') as f:
    cur.copy_expert("""COPY client_marketing (field1,field2,field3) FROM STDIN DELIMITER ',' CSV HEADER""", f)
Error msg:
DataError: invalid input syntax for type timestamp: ""
Appreciate any help on the script or pointing me to the right sources to read on.
I was able to do this by adding force_null (column_name). E.g., if field3 is your timestamp:
copy client_marketing (field1, field2, field3) from stdin with (
format csv,
delimiter ',',
header,
force_null (field3)
);
Hope that helps. See https://www.postgresql.org/docs/10/sql-copy.html
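For reference, a sketch of the same COPY passed through copy_expert in Python (the connection parameters are placeholders, and ab is the file path variable from the question):
import psycopg2

# placeholder connection; substitute your own DSN
conn = psycopg2.connect("dbname=mydb user=me")
cur = conn.cursor()

ab = "client_marketing.csv"  # hypothetical path, as in the question

with open(ab, 'r') as f:
    # FORCE_NULL requires CSV format; it turns quoted empty strings into NULLs
    cur.copy_expert(
        """COPY client_marketing (field1, field2, field3)
           FROM STDIN WITH (FORMAT csv, DELIMITER ',', HEADER, FORCE_NULL (field3))""",
        f,
    )
conn.commit()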

Space in column name is throwing exception while parquet is used for compression

I am getting the error below while inserting data into a Parquet-format table with a column name that contains a space.
I'm using the Hive client from a Cloudera distribution.
CREATE TABLE testColumNames( First Name string) stored as parquet;
insert into testColumNames select 'John Smith';
Is there any workaround to solve this issue? We got this error from Spark 2.3 code as well.
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IllegalArgumentException: field ended by ';': expected ';' but got 'name' at line 1: optional binary first name
at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:248)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketForFileIdx(FileSinkOperator.java:583)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketFiles(FileSinkOperator.java:527)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:636)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:98)
at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:157)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:497)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:170)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:459)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.IllegalArgumentException: field ended by ';': expected ';' but got 'name' at line 1: optional binary first name
at parquet.schema.MessageTypeParser.check(MessageTypeParser.java:212)
at parquet.schema.MessageTypeParser.addPrimitiveType(MessageTypeParser.java:185)
at parquet.schema.MessageTypeParser.addType(MessageTypeParser.java:111)
at parquet.schema.MessageTypeParser.addGroupTypeFields(MessageTypeParser.java:99)
at parquet.schema.MessageTypeParser.parse(MessageTypeParser.java:92)
at parquet.schema.MessageTypeParser.parseMessageType(MessageTypeParser.java:82)
at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.getSchema(DataWritableWriteSupport.java:43)
at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.init(DataWritableWriteSupport.java:48)
at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:310)
at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:287)
at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.<init>(ParquetRecordWriterWrapper.java:69)
at org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getParquerRecordWriterWrapper(MapredParquetOutputFormat.java:134)
at org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getHiveRecordWriter(MapredParquetOutputFormat.java:123)
at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:260)
at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:245)
... 18 more
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IllegalArgumentException: field ended by ';': expected ';' but got 'name' at line 1: optional binary first name
at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:248)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketForFileIdx(FileSinkOperator.java:583)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketFiles(FileSinkOperator.java:527)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.closeOp(FileSinkOperator.java:974)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:598)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.close(ExecMapper.java:199)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:459)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.IllegalArgumentException: field ended by ';': expected ';' but got 'name' at line 1: optional binary first name
at parquet.schema.MessageTypeParser.check(MessageTypeParser.java:212)
at parquet.schema.MessageTypeParser.addPrimitiveType(MessageTypeParser.java:185)
at parquet.schema.MessageTypeParser.addType(MessageTypeParser.java:111)
at parquet.schema.MessageTypeParser.addGroupTypeFields(MessageTypeParser.java:99)
at parquet.schema.MessageTypeParser.parse(MessageTypeParser.java:92)
at parquet.schema.MessageTypeParser.parseMessageType(MessageTypeParser.java:82)
at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.getSchema(DataWritableWriteSupport.java:43)
at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.init(DataWritableWriteSupport.java:48)
at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:310)
at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:287)
at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.<init>(ParquetRecordWriterWrapper.java:69)
at org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getParquerRecordWriterWrapper(MapredParquetOutputFormat.java:134)
at org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getHiveRecordWriter(MapredParquetOutputFormat.java:123)
at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:260)
at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:245)
... 16 more
Please refer to the URL below:
https://issues.apache.org/jira/browse/PARQUET-677
It seems this issue is not yet resolved.
From Hive Doc https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
Table names and column names are case insensitive but SerDe and property names are case sensitive.
In Hive 0.12 and earlier, only alphanumeric and underscore characters are allowed in table and column names.
In Hive 0.13 and later, column names can contain any Unicode character (see HIVE-6013), however, dot (.) and colon (:) yield errors on querying, so they are disallowed in Hive 1.2.0 (see HIVE-10120). Any column name that is specified within backticks (`) is treated literally. Within a backtick string, use double backticks (``) to represent a backtick character. Backtick quotation also enables the use of reserved keywords for table and column identifiers.
To revert to pre-0.13.0 behavior and restrict column names to alphanumeric and underscore characters, set the configuration property hive.support.quoted.identifiers to none. In this configuration, backticked names are interpreted as regular expressions. For details, see Supporting Quoted Identifiers in Column Names.
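If you simply need the insert to work, a practical workaround (a sketch, not taken from the references above) is to avoid the space in the Parquet column name altogether, e.g. by using an underscore:
CREATE TABLE testcolumnnames (`first_name` string) STORED AS PARQUET;
INSERT INTO testcolumnnames SELECT 'John Smith';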

schema issue while loading data in hive

This is my JSON file:
{
"reviewerID": "A10000012B7CGYKOMPQ4L", "asin": "000100039X", "reviewerName": "Adam", "helpful": [0, 0], "reviewText": "Spiritually and mentally inspiring! A book that allows you to question your morals and will help you discover who you really are!", "overall": 5.0, "summary": "Wonderful!", "unixReviewTime": 1355616000, "reviewTime": "12 16, 2012"
}
The code I am using to create the table:
scala> hc.sql("create table books (reviewerID string, asin string ,reviewerName string , helpful array<int>, reviewText string, overall int, summary string,unixReviewTime string,reviewTime string)row format delimited fields terminated by ','")
hc.sql("select * from books").show()
Output from select *:
Here the data under the "helpful" column is shifting into "reviewText", disturbing other columns as well. What would be the correct schema for such a JSON file? Also, why is it showing [reviewerID": "A10000012B7CGYKOMPQ4L] in place of only [A10000012B7CGYKOMPQ4L] in the specified column?
The clause row format delimited means that each field in the loaded file is separated by a delimiter, and fields terminated by ',' means that the delimiter in the loaded file is ','.
So the table you created interprets the fields from the file in the following way: from the beginning of the line until the first ',' is the first field, from the end of the first field until the next ',' is the second field, and so on.
1st Field --> {"reviewerID": "A10000012B7CGYKOMPQ4L"
2nd Field --> "asin": "000100039X"
3rd Field --> "reviewerName": "Adam"
4th Field --> "helpful": [0
5th Field --> 0]
6th Field --> "reviewText": "Spiritually and mentally inspiring! A book that allows you to question your morals and will help you discover who you really are!"
If you want to create a Hive table that interprets JSON input, you have to use a JSON SerDe.
eg:
create table <table_name>(col1 data_type1, col2 data_type2, ....)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
you may go through the link below for a detailed example.
loading-json-file-in-hive-table
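Applied to the JSON above, the table definition might look something like this (a sketch; overall is declared as double here since the sample value is 5.0):
CREATE TABLE books (
  reviewerID string,
  asin string,
  reviewerName string,
  helpful array<int>,
  reviewText string,
  overall double,
  summary string,
  unixReviewTime bigint,
  reviewTime string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;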
