Junk data fix in PySpark or Linux command

I have a large data set that comes from NiFi, and I then do an ETL transformation on it with PySpark.
Unfortunately, one column in the middle of a record sometimes gets split by a newline, which creates an extra row with extra columns and leaves the remaining fields of the original row NULL. So I need to fix this either with a Linux command in the NiFi flow or in the PySpark code while doing the ETL transformation.
Ex: source.csv
1,hi,21.0,final,splitexthere,done,v1,v2,done
2,hi,21.0,final,splitext
here,done,v1,v2,done
3,hi,21.0,final,splitexthere,done,v1,v2,done
4,hi,21.0,final,splitexthere,done,v1,v2,failed
expected.csv
1,hi,21.0,final,splitexthere,done,v1,v2,done
2,hi,21.0,final,splitexthere,done,v1,v2,done
3,hi,21.0,final,splitexthere,done,v1,v2,done
4,hi,21.0,final,splitexthere,done,v1,v2,failed
Here are some additional inputs:
we don't know which column will be split (like "splitexthere" above),
the id column will always be numeric,
and one file can contain multiple such newline splits.

As #daggett highlighted, data must conform to the CSV format specifications to be valid across heterogeneous systems.
Add a ValidateRecord or ConvertRecord processor to your NiFi flow to validate CSV against CSV. This will separate invalid records from valid records in the source data, basically two forks out of a flowfile, and you can then apply separate logic to handle/clean the invalid data. The same is doable in Spark as well (a rough sketch of that is shown after the schema example below), but in NiFi it is pretty straightforward!
Note: while configuring the CSVReader schema, make sure that all of the fields are NOT NULL.
e.g. a sample schema for two fields (you have nine):
{
  "type" : "record",
  "namespace" : "com.example.etl",
  "name" : "validate_csv_data",
  "fields" : [
    { "name" : "col_1", "type" : "string" },
    { "name" : "col_2", "type" : "string" }
  ]
}
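If the cleanup has to happen on the Spark side instead (the question allows either), here is a rough PySpark sketch of one way to glue the broken lines back together before parsing. It is only a sketch: it assumes every valid record starts with a numeric id followed by a comma, that the file is small enough to pull into a single partition, and the paths are illustrative.
import re
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repair_split_rows").getOrCreate()

def merge_broken_lines(lines):
    # Any line that does not start with "<digits>," is treated as the
    # continuation of the previous record and glued back onto it.
    buffer = None
    for line in lines:
        if re.match(r"^\d+,", line):
            if buffer is not None:
                yield buffer
            buffer = line
        else:
            buffer = (buffer or "") + line
    if buffer is not None:
        yield buffer

raw = spark.sparkContext.textFile("source.csv")            # illustrative path
# coalesce(1) keeps every broken record inside one partition; fine for
# moderate file sizes, revisit for very large inputs.
fixed = raw.coalesce(1).mapPartitions(merge_broken_lines)
fixed.saveAsTextFile("expected_csv")                        # or parse with spark.read.csv
The same merge logic could also be expressed as a small awk/sed step in the NiFi flow (e.g. in an ExecuteStreamCommand processor) if you prefer to repair the file before Spark ever sees it.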

Related

How to skip first N rows and last N rows while reading csv in spark

I have a REST API response which returns data in CSV. Below is a snapshot of the CSV:
----BEGIN_HEADER
----END_HEADER
----BEGIN_BODY
"ID", "Name", "Age", "Salary"
1,John,26,20k
2,Michel,28,30k
3,Prashad,27,30k
----END_BODY
----BEGIN_FOOTER
"Status Message"
"COMPLETE"
----END_FOOTER
After I read the file in Spark (Scala), I need to keep only the BODY:
"ID", "Name", "Age", "Salary"
1,John,26,20k
2,Michel,28,30k
3,Prashad,27,30k
So I tried reading this file in Spark, but it looks like I need to clean the header and footer lines first and only then do the Spark read. Is there any other alternative?
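For reference, the "clean first, then read" idea from the question can be done without writing an intermediate file. A rough PySpark sketch (the question uses Scala, but the steps translate directly; the path and marker strings just follow the snapshot above):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("strip_header_footer").getOrCreate()

# The API response is small, so collecting the raw lines to the driver is acceptable here.
lines = spark.sparkContext.textFile("response.csv").collect()

# Keep only the lines between the BODY markers.
start = lines.index("----BEGIN_BODY") + 1
end = lines.index("----END_BODY")
body = lines[start:end]

# spark.read.csv also accepts an RDD of CSV strings, so the cleaned body
# can be parsed directly into a DataFrame.
df = spark.read.option("header", True).csv(spark.sparkContext.parallelize(body))
df.show()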

Presto subqueries: Key not present in map

I have been banging my head for a while against the Superset -> Presto (PrestoSQL) -> Prometheus combination (as Superset does not yet support Prometheus) and got stymied by an issue when trying to extract columns from Presto's map-type column containing Prometheus labels.
In order to get the necessary labels mapped as columns from Superset's point of view, I created an extra table (or rather a view in this case) in Superset on top of the existing table, with the following SQL for creating the necessary columns:
SELECT labels['system_name'] AS "system",labels['instance'] AS "instance","timestamp" AS "timestamp","value" AS "value" FROM "up"
This table is then used as a data source in a Superset chart, which treats it as a subquery. The resulting SQL query created by Superset and sent to Presto looks, for example, like this:
SELECT "system" AS "system",
"instance" AS "instance",
"timestamp" AS "timestamp",
"value" AS "value"
FROM
(SELECT labels['system_name'] AS "system",
labels['instance'] AS "instance",
"timestamp" AS "timestamp",
"value" AS "value"
FROM "up") AS "expr_qry"
WHERE "timestamp" >= from_iso8601_timestamp('2020-10-19T12:00:00.000000')
AND "timestamp" < from_iso8601_timestamp('2020-10-19T13:00:00.000000')
ORDER BY "timestamp" ASC
LIMIT 250;
However, what I get from the above is an error:
io.prestosql.spi.PrestoException: Key not present in map: system_name
at io.prestosql.operator.scalar.MapSubscriptOperator$MissingKeyExceptionFactory.create(MapSubscriptOperator.java:173)
at io.prestosql.operator.scalar.MapSubscriptOperator.subscript(MapSubscriptOperator.java:143)
at io.prestosql.$gen.CursorProcessor_20201019_165636_32.filter(Unknown Source)
After reading a bit about queries in Presto's user guide, I tried a modified query from the command line using WITH:
WITH x AS (SELECT labels['system_name'] AS "system",labels['instance'] AS "instance","timestamp" AS "timestamp","value" AS "value" FROM "up")
SELECT system, timestamp, value FROM x
WHERE "timestamp" >= from_iso8601_timestamp('2020-10-19T12:00:00.000000')
AND "timestamp" < from_iso8601_timestamp('2020-10-19T13:00:00.000000')
LIMIT 250;
And that went through without any issues. But it seems that I have no way to control how Superset builds its queries, so I'm stuck with the first option. The question is: is there anything wrong with it that could be fixed?
I guess one option (if everything else fails) would be to define extra tables on the Presto side that do the same trick for mapping the columns, hopefully avoiding the above issue.
The map subscript operator in Presto requires that the key be present in the map. Otherwise, you get the failure you described.
If some keys can be missing, you can use the element_at function instead, which will return a NULL result:
Returns value for given key, or NULL if the key is not contained in the map.
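Applied to the view from the question, that means defining the Superset view with element_at instead of the subscript operator, for example:
SELECT element_at(labels, 'system_name') AS "system",
       element_at(labels, 'instance') AS "instance",
       "timestamp" AS "timestamp",
       "value" AS "value"
FROM "up"
Rows whose labels map lacks one of those keys then simply produce NULL in that column instead of failing the whole query.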

Save array of objects in cassandra

How can I save an array of objects in Cassandra?
I'm using a Node.js application with cassandra-driver to connect to the Cassandra DB. I want to save records like the one below in my DB:
{
  "id" : "5f1811029c82a61da4a44c05",
  "logs" : [
    {
      "conversationId" : "e9b55229-f20c-4453-9c18-a1f4442eb667",
      "source" : "source1",
      "destination" : "destination1",
      "url" : "https://asdasdas.com",
      "data" : "data1"
    },
    {
      "conversationId" : "e9b55229-f20c-4453-9c18-a1f4442eb667",
      "source" : "source2",
      "destination" : "destination2",
      "url" : "https://afdvfbwadvsffd.com",
      "data" : "data2"
    }
  ],
  "conversationId" : "e9b55229-f20c-4453-9c18-a1f4442eb667"
}
In the above record, I can use type "text" to save the values of the columns "id" and "conversationId", but I'm not sure how to define the schema and save data for the field "logs".
With Cassandra, you'll want to store the data in the same way that you want to query it. Since you mentioned querying by conversationid, that's going to influence how the PRIMARY KEY definition should look. Given this, conversationid should make a good partition key. As for the clustering columns, I had to make some guesses about cardinality: source looked like it could be used to help uniquely identify a log entry within a conversation, so I went with that next.
I thought about using id as the final clustering column, but it looks like all entries with the same conversationid would also have the same id. It might be a good idea to give each entry its own unique identifier, to help ensure uniqueness:
{
  "uniqueid": "e53723ca-2ab5-441f-b360-c60eacc2c854",
  "conversationId" : "e9b55229-f20c-4453-9c18-a1f4442eb667",
  "source" : "source1",
  "destination" : "destination1",
  "url" : "https://asdasdas.com",
  "data" : "data1"
},
This makes the final table definition look like this:
CREATE TABLE conversationlogs (
    id TEXT,
    conversationid TEXT,
    uniqueid UUID,
    source TEXT,
    destination TEXT,
    url TEXT,
    data TEXT,
    PRIMARY KEY (conversationid, source, uniqueid));
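On the Node.js side, writing one element of the logs array into that table with cassandra-driver could look roughly like this (contact point, data center and keyspace names are placeholders for your own setup):
const cassandra = require('cassandra-driver');

const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],        // placeholder
  localDataCenter: 'datacenter1',      // placeholder
  keyspace: 'mykeyspace'               // placeholder
});

const insertLog = `INSERT INTO conversationlogs
  (conversationid, source, uniqueid, id, destination, url, data)
  VALUES (?, ?, ?, ?, ?, ?, ?)`;

// doc is the JSON document from the question, entry is one element of doc.logs
async function saveLogEntry(doc, entry) {
  const params = [
    doc.conversationId,
    entry.source,
    cassandra.types.Uuid.random(),     // per-entry unique identifier
    doc.id,
    entry.destination,
    entry.url,
    entry.data
  ];
  await client.execute(insertLog, params, { prepare: true });
}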
You have a few options depending on how you want to query this data.
The first is to stringify the JSON in the logs field, save that to the database, and then convert it back to JSON after querying the data.
The second option is similar to the first, but instead of stringifying the whole array, you store the data as a list in the database (a sketch of the table for these two options appears after the example below).
The third option is to define a new table for the logs, with a primary key of the conversation and clustering keys for each element of the logs. This will allow you to look up either by the full key, or to query by just the partition key and retrieve all the rows that match those criteria.
CREATE TABLE conversationlogs (
    conversationid uuid,
    logid timeuuid,
    ...
    PRIMARY KEY ((conversationid), logid));
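For the first two options the table itself stays much simpler, since the application handles the (de)serialization; a sketch with illustrative names:
-- Option 1: the whole logs array serialized into one JSON string.
-- Option 2: one JSON string per log entry, kept in a list.
CREATE TABLE conversations (
    conversationid text PRIMARY KEY,
    logs_json text,          -- option 1
    logs list<text>          -- option 2 (use one or the other)
);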

How to store Cassandra maps in an array?

I want to store data in the following structure:
"id" : 100, -- primary key
"data" : [
{
"imei" : 862304021502870,
"details" : [
{
"start" : "2018-07-24 12:34:50",
"end" : "2018-07-24 12:44:34"
},
{
"start" : "2018-07-24 12:54:50",
"end" : "2018-07-24 12:56:34"
}
]
}
]
So how do I create the table schema in Cassandra for this?
Thanks in advance.
There are several approaches to this, depending on the requirements for data access/modification: for example, do you need to modify individual fields, or do you always update the whole record at once?
Declare the imei/details structure as a user-defined type (UDT), and then declare the table like this:
create table tbl (
    id int primary key,
    data set<frozen<details_udt>>);
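The UDT referenced above isn't defined in the snippet; it could be declared along these lines (type and field names are illustrative, and nested collections inside UDTs have to be frozen):
CREATE TYPE interval_udt (
    start_time timestamp,
    end_time timestamp
);

CREATE TYPE details_udt (
    imei bigint,
    details frozen<list<frozen<interval_udt>>>
);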
But this is relatively hard to support in the long term, especially if you add more nested objects with different types. Plus, you can't really update individual fields of the frozen records that you must use for nested collections/UDTs: with this table structure you need to replace the complete record inside the set.
Another approach: do explicit serialization/deserialization of the data to/from JSON or another format, and use a table structure like this:
create table tbl(
    id int primary key,
    data text);
The type of the data field depends on which format you'll use; you can also use blob to store binary data. But in this case you'll need to update/fetch the complete field. You can simplify things if you use the Java driver's custom codecs, which take care of the conversion between your data structure in Java and the desired format. See the example in the documentation for conversion to/from JSON.

MongoDB Data Structure

I'm a bit of a noob with MongoDB, so I would appreciate some help figuring out the best solution/format/structure for storing some data.
Basically, the data that will be stored will be updated every second with a name, value and timestamp for a certain meter reading.
For example, one possibility is water level and temperature in a tank. The tank will have a name, and then the level and temperature will be read and stored every second. Overall, there will be hundreds of items (i.e. tanks), each with millions of timestamped values.
From what I've learnt so far (and please correct me if I'm wrong), there are a few options for how to structure the data:
A slightly RDBMS-like approach:
This would consist of two collections, Items and Values
Items : {
  _id : "id",
  name : "name"
}
Values : {
  _id : "id",
  item_id : "item_id",
  name : "name", // temp or level etc
  value : "value",
  timestamp : "timestamp"
}
The more document-DB, denormalized method:
This method involves one collection of items, each with an array of timestamped values:
Items : {
  _id : "id",
  name : "name",
  values : [{
    name : "name", // temp or level etc
    value : "value",
    timestamp : "timestamp"
  }]
}
A collection for each item
Save all the values in a collection named after that item.
ItemName : {
  _id : "id",
  name : "name", // temp or level etc
  value : "value",
  timestamp : "timestamp"
}
The majority of read queries will be to retrieve the timestamped values for a specified time period for an item (i.e. a tank) and display them in a graph. For this, the first option makes more sense to me, as I don't want to retrieve the millions of values when querying for a specific item.
Is it even possible to query for values between specific timestamps for option 2?
I will also need to query for a list of items, so maybe a combination of the first and third options, with a collection for all the items and then a number of collections to store the values for each of those items?
Any feedback on this is greatly appreciated.
Don't store a separate timestamp if you are not modifying the ObjectId, as the ObjectId itself has a timestamp embedded in it, so you will save a lot of memory that way (see the MongoDB Id Documentation).
In case you don't require the previous data, you can use an update query in MongoDB to update the fields every second instead of storing new documents.
If you want to store the updated data each time, then instead of updating, store it in a flat structure:
{ "_id" : ObjectId("XXXXXX"),
"name" : "ItemName",
"value" : "ValueOfItem"
"created_at" : "timestamp"
}
Edit 1: Added timestamp as per the comments
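With that flat structure, the typical read (all values for one item over a time window) stays a simple indexed range query. A sketch in the mongo shell, assuming a collection named values and created_at stored as a real date rather than a string:
// Compound index so that per-item time-range scans stay cheap.
db.values.createIndex({ name: 1, created_at: 1 })

// All readings for one tank within a one-hour window, oldest first.
db.values.find({
  name: "Tank1",
  created_at: { $gte: ISODate("2021-01-01T12:00:00Z"),
                $lt:  ISODate("2021-01-01T13:00:00Z") }
}).sort({ created_at: 1 })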
