How to include SQL select statement in dsbulk unload command - cassandra

I have a huge orderhistory table in Cassandra with data going back to 2013, but I want to unload only the last 12 months of orderhistory data. I use the command below, which unloads all the data starting from 2013 and stores it in the path data/json/customer_data/orderhistory/data. How do I modify it so that each time I run it, it selects only the last 12 months of data?
dsbulk unload -k customer_data -t crawlsiteidentifiedpages -h '172.xx.xx.xxx' \
-c json -url data/json/customer_data/orderhistory/data

You need to remove the options -k and -t, and instead use the -query option as described in the documentation, like:
dsbulk unload -query 'select * from ks.table where <your condition>'
To make sure that the unload is parallelized, make sure that your condition includes a part like and token(pkcol) > :start and token(pkcol) <= :end, where pkcol is the name of the partition key column (if you have multiple partition key columns, specify them comma-separated).
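For example, putting the token-range condition and a date restriction together might look like the following sketch, based on the command from the question; pkcol and datecol are placeholders for the actual partition key and timestamp columns, and ALLOW FILTERING is needed for the non-key date restriction, as the next answer explains:
dsbulk unload -h '172.xx.xx.xxx' -c json \
-url data/json/customer_data/orderhistory/data \
-query "SELECT * FROM customer_data.crawlsiteidentifiedpages WHERE token(pkcol) > :start AND token(pkcol) <= :end AND datecol >= '2021-06-21' ALLOW FILTERING"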

Instead of -t crawlsiteidentifiedpages you should use -query and provide the SELECT query, e.g.:
-query "SELECT * FROM crawlsiteidentifiedpages WHERE token(pk) > :start and token(pk) <= :end and date > maxTimeuuid('2021-06-21+0000') ALLOW FILTERING"
A few remarks:
I assume your table has one partition key column pk and one clustering column date of type timeuuid – please adjust the actual query accordingly.
The WHERE restriction token(pk) > :start and token(pk) <= :end allows DSBulk to parallelize the operation and improves performance.
The WHERE restriction date > maxTimeuuid('2021-06-21+0000') is where the magic happens and allows you to select only the last 12 months of data.
Unfortunately, you also need to add ALLOW FILTERING to this type of query, otherwise Cassandra will reject the query.
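Since the question asks for the last 12 months on every run, the cutoff date can be computed in the shell instead of hardcoded. A minimal sketch, assuming GNU date and the pk/date columns from the query above:
# shell script
cutoff=$(date --date="12 months ago" +%F)  # e.g. 2021-06-21
dsbulk unload -h '172.xx.xx.xxx' -c json \
-url data/json/customer_data/orderhistory/data \
-query "SELECT * FROM customer_data.crawlsiteidentifiedpages WHERE token(pk) > :start AND token(pk) <= :end AND date > maxTimeuuid('${cutoff}+0000') ALLOW FILTERING"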

Related

How to write a shell script that drops Hive partitions older than 10 days for multiple tables (by importing an HQL file)

I want to write a shell script that will drop partitions older than 10 days for multiple tables in Hive. Can someone please help me out?
I am able to hardcode it for only one table, but I do not want to hardcode anything since I need to deploy it with Oozie, so I want to make everything dynamic.
As a starting point, you can compute the date from 10 days ago as a reference point and pass it to your HQL script.
# shell script
dt_delete=$(date --date="10 days ago" +'%F') # 2022-03-23
hive -hivevar dt_delete="$dt_delete" -f "daily_deletion.hql"
In your Hive script, you need a way to determine whether a partition is 10 days old; look at your columns and select the partitions to be deleted.
If your Hive table is partitioned by date, then it's going to be more straightforward.
-- hive script
alter table table_a drop partition(date_partition < "${dt_delete}");
alter table table_b drop partition(date_partition < "${dt_delete}");
alter table table_c drop partition(date_partition < "${dt_delete}");
...
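If the table list itself must not be hardcoded in the HQL file, the same statements can be generated from the shell. A sketch: table_a/table_b/table_c and the date_partition column are assumptions, and the list could equally be read from a config file shipped with the Oozie workflow:
# shell script
dt_delete=$(date --date="10 days ago" +'%F')
for tbl in table_a table_b table_c; do
  # shell expands both the table name and the cutoff date before hive runs
  hive -e "ALTER TABLE ${tbl} DROP PARTITION (date_partition < '${dt_delete}');"
done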

Cassandra Apache query

I have a problem with a table in Cassandra. Below is what I did:
CREATE KEYSPACE tfm WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1 };
I'm working in one machine.
CREATE TABLE tfm.foehis(
hocpny text, hocol text, honumr text, holinh text, hodtto text, hotour text,
hoclic text, hooe text, hotpac text, hodtac text, hohrac text, hodesf text,
hocdan text, hocdrs text, hocdsl text, hoobs text, hotdsc text, honrac text,
holinr text, housca text, hodtea text, hohrea text, housea text, hodtcl text,
hohrcl text, houscl text, hodtrc text, hohrrc text, housrc text, hodtra text,
hohrra text, housra text, hodtcm text, hohrcm text, houscm text, hodtua text,
hohrua text, houser text,
PRIMARY KEY((hooe, hodtac, hohrac), hoclic)
);
Until this point everything is OK. But when I try to do some select queries, I get warnings and errors:
cqlsh> select count(*) from tfm.foehis;
count
-------
56980
(1 rows)
Warnings :
Aggregation query used without partition key
Read 100 live rows and 1055 tombstone cells for query SELECT * FROM tfm.foehis LIMIT 100 (see tombstone_warn_threshold)
Read 100 live rows and 1066 tombstone cells for query SELECT * FROM tfm.foehis WHERE token(hooe, hodtac, hohrac) >= token(1045161613, 20180502, 2304) LIMIT 100 (see tombstone_warn_threshold)
And
cqlsh> select count(*) from tfm.foehis where hoclic=1011;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Invalid INTEGER constant (1011) for "hoclic" of type text"
cqlsh> select count(*) from tfm.foehis where hotpac=ANOE;
SyntaxException: line 1:49 no viable alternative at input ';' (...from tfm.foehis where hotpac=[ANOE];)
I suppose the problem is in the definition of the table, but I don't know where it is.
Actually your issue is in the queries. Since all your columns are text, you need to use single quotes around the values.
Also, according to your table definition, the partition key is formed by the hooe, hodtac, and hohrac columns, which means that all your queries must include these columns with exact values (=). hoclic is the clustering column, and on it you will be able to use other operators and ordering.
Also, keep in mind that running queries without the partition key (like your select) is not recommended in Cassandra, since it triggers a full cluster scan and you can run into all sorts of problems (for instance, garbage collection issues).
I would recommend some basic reading: https://www.datastax.com/dev/blog/the-most-important-thing-to-know-in-cassandra-data-modeling-the-primary-key and https://docs.datastax.com/en/cql/3.3/index.html
When executing the query select count(*) from tfm.foehis;, Cassandra will try to look up all the partitions to calculate the count. Cassandra works best when queries are pointed at specific partitions, hence the warning.
You have defined the column hoclic as text and are trying to search using an integer value.
First of all, avoid SELECT queries that require a full table scan: performance will suffer because Cassandra needs to scan all the partitions.
1) select count(*) from tfm.foehis where hoclic=1011; — here the value provided is wrong, as hoclic is of type text. Below is the correction:
select count(*) from tfm.foehis where hoclic='1011';
2) select count(*) from tfm.foehis where hotpac=ANOE — besides the missing single quotes around ANOE, hotpac is not part of the primary key. Cassandra requires the partition key to be provided when searching by specific values.
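For completeness, a count restricted on the full partition key would be accepted without warnings. A sketch reusing the example values visible in the tombstone warning above (all three columns are text, hence the quotes):
cqlsh -e "select count(*) from tfm.foehis where hooe='1045161613' and hodtac='20180502' and hohrac='2304';"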

How can I query Cassandra without knowing what I'm going to find?

Just typing that title makes me wonder if I'm in a dimension where everything I've ever known about databases is wrong.
I have [several tables but this is an example]:
CREATE TABLE stream (
source_id uuid,
yyyymmdd int,
event_type text,
time timestamp,
data text,
PRIMARY KEY ((source_id, yyyymmdd, event_type), time)
)
I have an idea of what might be in yyyymmdd but for the other partition keys, I do not. Without knowing what the possible values for source_id and event_type are, I can't query it. What I ultimately want to know is:
What is the oldest yyyymmdd and the newest yyyymmdd in the db?
It's almost like I need a database of what is in my database.
In cqlsh go to your keyspace (use <keyspace_name>) and run
copy stream(yyyymmdd) to 'stream-yyyymmdd.csv' with NUMPROCESSES = 1 and MAXREQUESTS = 1;
Or prefix the table name in the copy command with the keyspace name (<keyspace_name>.stream) if you don't want to run use <keyspace_name>.
For NUMPROCESSES and MAXREQUESTS you can use the values that suit you; please refer to the COPY documentation. NUMPROCESSES is the number of worker processes (maximum value 16, default -1). MAXREQUESTS is the maximum number of requests each worker can process in parallel (default 6).
Afterwards, sort the file and extract the first and last lines:
sort -n -o stream-yyyymmdd-sorted.csv stream-yyyymmdd.csv
head -1 stream-yyyymmdd-sorted.csv
tail -1 stream-yyyymmdd-sorted.csv
HTH

Delete points with unwanted field values from InfluxDB measurement

InfluxDB lets you delete points based on WHERE tag='value' conditions, but not by field value.
For example, if you have accidentally stored a measurement with a value of -1 in a series of positive floats (e.g. CPU utilization), DELETE FROM metrics WHERE cpu=-1 will return this error:
fields not supported in WHERE clause during deletion
This is still (2015 - 2020) not possible in InfluxDB - see ticket 3210.
You could overwrite the point with some other values by inserting into the measurement a point with the same timestamp and tag set:
A point is uniquely identified by the measurement name, tag set, and timestamp. If you submit a new point with the same measurement, tag set, and timestamp as an existing point, the field set becomes the union of the old field set and the new field set, where any ties go to the new field set. This is the intended behavior.
Since you're not supposed to insert nulls, you'll probably want to repeat the values from the previous point(s).
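Concretely, that means posting a point with the same measurement, tag set, and timestamp through the write endpoint. A sketch for InfluxDB 1.x, where the host tag, the corrected value, and the timestamp are made up:
# overwrite the bad point: same measurement, tag set and nanosecond
# timestamp as the original; only the corrected field value differs
curl -i -XPOST 'http://localhost:8086/write?db=DATABASE' \
--data-binary 'metrics,host=server01 cpu=42.0 1434055562000000000'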
You might think about inserting a point with the same timestamp, and setting a unique value for one of the tags, then running a delete against that tag:
DELETE FROM measurement WHERE some_existing_tag='deleteme'
This won't work though. When you insert that second deleteme point, it has a different tag set due to the deleteme tag, so InfluxDB will create a new point for it. Then the DELETE command will delete it, but not the original point you wanted to delete.
Expensive approaches
Without timerange
# Copy all valid data to a temporary measurement
SELECT * INTO metrics_clean FROM metrics WHERE cpu!=-1 GROUP BY *
# Drop existing dirty measurement
DROP MEASUREMENT metrics
# Copy temporary measurement to existing measurement
SELECT * INTO metrics FROM metrics_clean GROUP BY *
With timerange
# Copy all valid data to a temporary measurement within timerange
SELECT * INTO metrics_clean FROM metrics WHERE cpu!=-1 and time > '<start_time>' and time < '<end_time>' GROUP BY *;
# Delete existing dirty data within timerange
DELETE FROM metrics WHERE time > '<start_time>' and time < '<end_time>';
# Copy temporary measurement to existing measurement
SELECT * INTO metrics FROM metrics_clean GROUP BY *
Ugly and slow but fairly robust solution: store timestamps, then delete entries by timestamp, optionally filtering DELETE statement with additional tags.
N.B. this only works if the points have unique timestamps! E.g. if multiple points share one timestamp, all of them are deleted by the command below. Using epoch=ns practically mitigates this, unless you have ~a billion data points per second.
curl -G 'http://localhost:8086/query?db=DATABASE&epoch=ns' \
--data-urlencode "q=SELECT * FROM metrics WHERE cpu=-1" |\
jq -r "(.results[0].series[0].values[][0])" > delete_timestamps.txt
for i in $(cat delete_timestamps.txt); do
echo "$i";
# DELETE cannot filter on field values (see the error above),
# so restrict by time, and optionally by tags
curl -G 'http://localhost:8086/query?db=DATABASE&epoch=ns' \
--data-urlencode "q=DELETE FROM metrics WHERE time=$i";
done

Cassandra time range query

Before you downvote I would like to state that I looked at all of the similar questions but I am still getting the dreaded "PRIMARY KEY column cannot be restricted" error.
Here's my table structure:
CREATE TABLE IF NOT EXISTS events (
id text,
name text,
start_time timestamp,
end_time timestamp,
parameters blob,
PRIMARY KEY (id, name, start_time, end_time)
);
And here's the query I am trying to execute:
SELECT * FROM events WHERE name = ? AND start_time >= ? AND end_time <= ?;
I am really stuck at this. Can anyone tell me what I am doing wrong?
Thanks,
Deniz
This is a query you need to remodel your data for, or use a distributed analytics platform (like Spark). id describes how your data is distributed through the database. Since it is not specified in this query, a full table scan will be required to determine the necessary rows. The Cassandra design team has decided that they would rather you not do a query at all than do a query which will not scale.
Basically, whenever you see "COLUMN cannot be restricted", it means that the query you have tried to perform cannot be done efficiently on the table you created.
To run the query, use the ALLOW FILTERING clause,
SELECT * FROM analytics.events WHERE name = ? AND start_time >= ? AND end_time <= ? ALLOW FILTERING;
The "general" rule to make query is you have to pass at least all partition key columns, then you can add each key in the order they're set." So in order for you to make this work you'd need to add where id = x in there.
However, it appears what this error message is implying is that once you select 'start_time > 34' that's as far "down the chain" you're allowed to go otherwise it would require the "potentially too costly" ALLOW FILTERING flag. So it has to be "only equality" down to one < > combo on a single column. All in the name of speed. This works (though doesn't give a range query):
SELECT * FROM events WHERE name = 'a' AND start_time = 33 and end_time <= 34 and id = '35';
If you're looking for events "happening at minute y", maybe a different data model would be possible, like adding an event row for each minute the event is ongoing, or bucketing based on hour. See also https://stackoverflow.com/a/48755855/32453
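A minimal sketch of such a remodel (the table name, the day bucket, the analytics keyspace, and the example values are all hypothetical), partitioning by name plus a day bucket so that a time range within known buckets becomes an ordinary clustering-key query:
# create the bucketed table, then query one day's bucket for a name
cqlsh -k analytics -e "
CREATE TABLE IF NOT EXISTS events_by_day (
name text,
yyyymmdd int,
start_time timestamp,
end_time timestamp,
id text,
parameters blob,
PRIMARY KEY ((name, yyyymmdd), start_time, id)
);"
cqlsh -k analytics -e "SELECT * FROM events_by_day WHERE name = 'a' AND yyyymmdd = 20210621 AND start_time >= '2021-06-21 00:00:00' AND start_time < '2021-06-22 00:00:00';"
Queries spanning several days then issue one such query per bucket.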
