How to parse a log with "dynamic" double quotes - logstash

I have logs similar to the following:
4294967295,"adult,low-risk",6564ec78-4995-45b7-b73d-44ee50851dcb,"everything,lost,bla",0
The values inside double quotes should stay in the same field, so I would get something like:
field1 => 4294967295
field2 => "adult,low-risk"
field3 => 6564ec78-4995-45b7-b73d-44ee50851dcb
field4 => "everything,lost,bla"
field5 => 0
But if the field is empty or contains only a single value, the double quotes are not present, like:
4294967295,,6564ec78-4995-45b7-b73d-44ee50851dcb,everything,0
Then if I write my dissect/grok pattern like:
%{field1},%{field2},%{field3},%{field4},%{field5}
it would return:
field1 => 4294967295
field2 => "adult
field3 => low-risk"
field4 => 6564ec78-4995-45b7-b73d-44ee50851dcb
field5 => "everything,lost,bla",0
and if I write my dissect/grok pattern like:
%{field1},"%{field2}",%{field3},"%{field4}",%{field5}
it would work, but when the value is empty or contains only a single value as mentioned above, it returns _grokparsefailure or _dissectfailure.
How do I solve this? Any help would be appreciated, thanks.

Using dissect in preference to grok is often a good idea because it has limited functionality, which means it is cheaper. However, dissect does not know about the quoting conventions for commas in csv files. A csv filter does, so if you use
csv { columns => [ "field1", "field2", "field3", "field4", "field5" ] }
you will get the result you want.
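A minimal filter sketch along those lines (field names taken from the question; separator and quote_char are written out here even though they match the csv filter's defaults):
filter {
  csv {
    separator  => ","
    quote_char => '"'
    columns    => [ "field1", "field2", "field3", "field4", "field5" ]
  }
}
Because the csv filter understands quoting, rows with unquoted or empty fields parse the same way as quoted ones.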

Related

What is the correct CSV format for tuples when loading data with DSBulk?

I recently started using Cassandra for my new project and am doing some load testing.
I have a scenario where I'm loading a CSV with dsbulk like this:
$ dsbulk load -url <csv path> -k <keyspace> -t <table> -h <host> -u <user> -p <password> -header true -cl LOCAL_QUORUM
My CSV file entries look like this:
userid birth_year created_at freq
1234 1990 2023-01-13T23:27:15.563Z {1234:{"(1, 2)": 1}}
Column types:
userid bigint PRIMARY KEY,
birth_year int,
created_at timestamp,
freq map<bigint, frozen<map<frozen<tuple<tinyint, smallint>>, smallint>>>
The issue is with the freq column: I tried different ways of setting the value in the CSV, as shown below, but I am not able to insert the row using dsbulk.
If I set freq as {1234:{[1, 2]: 1}}, I get:
com.datastax.oss.dsbulk.workflow.commons.schema.InvalidMappingException: Could not map field freq to variable freq; conversion from Java type java.lang.String to CQL type Map(BIGINT => Map(Tuple(TINYINT, SMALLINT) => SMALLINT, not frozen), not frozen) failed for raw value: {1234:{[1,2]: 1}}
Caused by: java.lang.IllegalArgumentException: Could not parse '{1234:{[1, 2]: 1}}' as Json
Caused by: com.fasterxml.jackson.core.JsonParseException: Unexpected character ('[' (code 91)): was expecting either valid name character (for unquoted name) or double-quote (for quoted) to start field name
at [Source: (String)"{1234:{[1, 2]: 1}}"; line: 1, column: 9]
If I set freq as {\"1234\":{\"[1, 2]\":1}},
java.lang.IllegalArgumentException: Expecting record to contain 4 fields but found 5.
If I set freq as {1234:{"[1, 2]": 1}} or {1234:{"(1, 2)": 1}}, I get:
Source: 1234,80,2023-01-13T23:27:15.563Z,"{1234:{""[1, 2]"": 1}}" java.lang.IllegalArgumentException: Expecting record to contain 4 fields but found 5.
But with the COPY ... FROM command, the value {1234:{[1, 2]:1}} for freq inserts into the DB without any error, and the value in the DB looks like this: {1234: {(1, 2): 1}}
I guess the JSON parser does not accept an array (tuple) as a key when I go through dsbulk? Can someone advise me on what the issue is and how to fix it? Appreciate your help.
When loading data using the DataStax Bulk Loader (DSBulk), the CSV format for CQL tuple type is different from the format used by the COPY ... FROM command because DSBulk uses a different parser.
Formatting the CSV data is particularly challenging in your case because the column contains multiple nested CQL collections.
InvalidMappingException
The JSON parser used by DSBulk doesn't accept parentheses () when enclosing tuples. It also expects tuples to be enclosed in double quotes ("), otherwise you'll get errors like:
com.datastax.oss.dsbulk.workflow.commons.schema.InvalidMappingException: \
Could not map field ... to variable ...; \
conversion from Java type ... to CQL type ... failed for raw value: ...
...
Caused by: java.lang.IllegalArgumentException: Could not parse '...' as Json
...
Caused by: com.fasterxml.jackson.core.JsonParseException: \
Unexpected character ('(' (code 40)): was expecting either valid name character \
(for unquoted name) or double-quote (for quoted) to start field name
...
IllegalArgumentException
Since tuple values contain a comma (,) as a separator, DSBulk parses the rows incorrectly: it thinks each row contains more fields than expected and throws an IllegalArgumentException, for example:
java.lang.IllegalArgumentException: Expecting record to contain 2 fields but found 3.
Solution
Just to make it easier, here is the schema for the table I'm using as an example:
CREATE TABLE inttuples (
    id int PRIMARY KEY,
    inttuple map<frozen<tuple<tinyint, smallint>>, smallint>
)
In this example CSV file, I've used the pipe character (|) as a delimiter:
id|inttuple
1|{"[2,3]":4}
Here's another example that uses tabs as the delimiter:
id inttuple
1 {"[2,3]":4}
Note that you will need to specify the delimiter with either -delim '|' or -delim '\t' when running DSBulk. Cheers!
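For example, loading the pipe-delimited file above might look like this (connection details are placeholders, mirroring the command in the question):
$ dsbulk load -url <csv path> -k <keyspace> -t inttuples -h <host> -u <user> -p <password> -header true -delim '|'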

Can NiFi string manipulation functions be used in QueryRecord?

Can one use substring or concat functions in a QueryRecord SQL statement? Say I have this flowfile:
{ "field1": "1, Tom Johnson", "field2":"3", "field3":"xyz" }
In the QueryRecord processor the SQL query is:
select substringAfter(/field1, ',') as NAME, substringBefore(/field2, ',') as ID, field3 from flowfile
I got an error about the query when running the processor. I don't know what the problem is. How can this be done?
The upstream produces the flowfile above, and I tried both of the SQL queries below:
select substringAfter(/field1, ',') as NAME, substringBefore(/field2, ',') as ID, field3 from flowfile
and
select substringAfter(field1, ',') as NAME, substringBefore(field2, ',') as ID, field3 from flowfile
The query with the path /field1 is not accepted by the processor. The second one triggers a runtime error while preparing the SQL statement. So can these NiFi functions be used in QueryRecord?
You can use the UpdateRecord processor to do this.
You'll need to clean up the leftover fields afterwards, but that can be done with QueryRecord with:
SELECT NAME, ID, field3 from FLOWFILE
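A hedged sketch of what the UpdateRecord configuration could look like (the reader/writer choices are assumptions, the writer's schema would need to include the new NAME and ID fields, and substringAfter/substringBefore are RecordPath functions):
Record Reader:               JsonTreeReader
Record Writer:               JsonRecordSetWriter
Replacement Value Strategy:  Record Path Value
/NAME:                       substringAfter(/field1, ',')
/ID:                         substringBefore(/field2, ',')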

How can I enter a JSON value into my PostgreSQL database?

So, I can insert an email address, password and date using this code in Node:
client.query("INSERT INTO manlalaro (emailaddr,pwd,premiumexpiry) values ('tomatopie#coldmail.com','123absurdcodes', DATE '2009-09-19') ",(err,res)=>{
  console.log(err, res)
  client.end()
})
But how do I enter a JSON data type successfully without getting errors? I have playersaved, which is a column of type JSON.
The best way is to pass the data to be inserted as a separate parameter, so the library or the driver applies the right treatment to each data type.
In most cases it will be something like this:
client.query("INSERT INTO x (a, b, c) VALUES (?, ?, ?)", [1, "text", { "json": "data" }]);
Or this:
client.query("INSERT INTO x (a, b, c) VALUES ($1, $2, $3)", [1, "text", { "json": "data" }]);
The way to know the right thing to do is to read the library's documentation.
If you are using pg (node-postgres), see https://node-postgres.com/
Note: As #Aedric pointed out, in some cases your object must first be "stringified" (JSON.stringify()). But node-postgres claims it does this automatically (https://node-postgres.com/features/types#uuid%20+%20json%20/%20jsonb).
You can insert JSON data in PostgreSQL by converting it into a string using JSON.stringify(Object):
`insert into tableName (id,json_data) values(1,'{"test":1}')`
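A minimal sketch combining the two answers for the table in the question (the playersaved payload below is a made-up placeholder):
// Hypothetical saved-player object; node-postgres would also accept the
// plain object for a json column, but stringifying explicitly makes the intent clear.
const playerSaved = JSON.stringify({ level: 3, coins: 120 })

client.query(
  "INSERT INTO manlalaro (emailaddr, pwd, premiumexpiry, playersaved) VALUES ($1, $2, $3, $4)",
  ['tomatopie#coldmail.com', '123absurdcodes', '2009-09-19', playerSaved],
  (err, res) => {
    console.log(err, res)
    client.end()
  }
)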

Issue with Python copy_expert loading data with nulls

I'm unable to import a CSV into a table in Postgres using copy_expert. The error is due to null values.
The field type in the DB allows nulls, and manually inserting via INSERT INTO is successful.
Based on what I've understood so far, it is because copy_expert passes the nulls through as text, which is why it fails on a timestamp datatype. However, I'm unable to find the right syntax to coerce the nulls to nulls. Code snippet below:
with open(ab, 'r') as f:
    cur.copy_expert("""COPY client_marketing (field1,field2,field3) FROM STDIN DELIMITER ',' CSV HEADER""", f)
Error msg:
DataError: invalid input syntax for type timestamp: ""
Appreciate any help on the script, or a pointer to the right sources to read up on.
I was able to do this by adding force_null (column_name). E.g., if field3 is your timestamp:
copy client_marketing (field1, field2, field3) from stdin with (
format csv,
delimiter ',',
header,
force_null (field3)
);
Hope that helps. See https://www.postgresql.org/docs/10/sql-copy.html
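Folded back into the original copy_expert call, that might look like this (a sketch, assuming field3 is the timestamp column that receives the empty values):
# FORCE_NULL makes quoted empty values in field3 load as NULL instead of empty text
with open(ab, 'r') as f:
    cur.copy_expert(
        """COPY client_marketing (field1, field2, field3)
           FROM STDIN WITH (FORMAT csv, DELIMITER ',', HEADER, FORCE_NULL (field3))""",
        f,
    )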

Hardcoded date formats in predicate push-down?

If a date literal is used in a pushed-down filter expression, e.g.
val postingDate = java.sql.Date.valueOf("2016-06-03")
val count = jdbcDF.filter($"POSTINGDATE" === postingDate).count
where the POSTINGDATE column is of JDBC type Date, the resulting pushed-down SQL query looks like the following:
SELECT .. <columns> ... FROM <table> WHERE POSTINGDATE = '2016-06-03'
Specifically, the date is compiled into a string literal using the hardcoded yyyy-MM-dd format that java.sql.Date.toString emits. Note the implied string conversion for date (and timestamp) values in JDBCRDD.compileValue
/**
 * Converts value to SQL expression.
 */
private def compileValue(value: Any): Any = value match {
  case stringValue: String => s"'${escapeSql(stringValue)}'"
  case timestampValue: Timestamp => "'" + timestampValue + "'"
  case dateValue: Date => "'" + dateValue + "'"
  case arrayValue: Array[Any] => arrayValue.map(compileValue).mkString(", ")
  case _ => value
}
The resulting query fails if the database is expecting a different format for date string literals. For example, the default format for Oracle is 'dd-MMM-yy', so when the relation query is executed, it fails with a syntax error.
ORA-01861: literal does not match format string
01861. 00000 - "literal does not match format string"
In some situations it may be possible to change the database's expected date format to match the Java format, but in our case we don't have control over that.
It seems like this kind of conversion ought to go through some kind of vendor-specific translation (e.g. through a JDBCDialect). I've filed a JIRA issue to that effect, but, in the meantime, are there suggestions for how to work around this? I've tried a variety of different approaches, both on the Spark side and the JDBC side, but haven't come up with anything yet. This is a critical issue for us, as we're processing very large tables organized by date -- without pushdown, we end up pulling over the entire table for a Spark-side filter.
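For illustration only, the kind of vendor-aware translation being argued for could emit ANSI date/timestamp literals, which Oracle (and most databases) accept, instead of bare quoted strings. This is a sketch of such a change, not Spark's actual behavior:
// Hypothetical variant of compileValue that emits ANSI literals,
// e.g. DATE '2016-06-03' and TIMESTAMP '2016-06-03 10:15:30.0',
// so the pushed-down SQL no longer depends on the session's date format.
private def compileValue(value: Any): Any = value match {
  case stringValue: String => s"'${escapeSql(stringValue)}'"
  case timestampValue: Timestamp => "TIMESTAMP '" + timestampValue + "'"
  case dateValue: Date => "DATE '" + dateValue + "'"
  case arrayValue: Array[Any] => arrayValue.map(compileValue).mkString(", ")
  case _ => value
}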

Resources