InfluxDB lets you delete points based on WHERE tag='value' conditions, but not by field value.
For example, if you have accidentally stored a point with a value of -1 in a series of positive floats (e.g. CPU utilization), DELETE FROM metrics WHERE cpu=-1 will return this error:
fields not supported in WHERE clause during deletion
This is still not possible in InfluxDB (as of 2015 through 2020) - see issue #3210.
You could overwrite the point with other values by inserting a point with the same measurement, timestamp, and tag set:
A point is uniquely identified by the measurement name, tag set, and timestamp. If you submit a new point with the same measurement, tag set, and timestamp as an existing point, the field set becomes the union of the old field set and the new field set, where any ties go to the new field set. This is the intended behavior.
Since you're not supposed to insert nulls, you'll probably want to repeat the values from the previous point(s).
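For illustration, a minimal sketch of such an overwrite from the influx CLI. The host tag, the nanosecond timestamp, and the replacement value 0.53 are all made up here; use your own series and a value that fits the surrounding data:

# Suppose the bad point was: metrics,host=server01 cpu=-1 1464623548000000000
# Writing the same measurement, tag set and timestamp replaces the field value:
INSERT metrics,host=server01 cpu=0.53 1464623548000000000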
You might think of inserting a point with the same timestamp, setting a unique value for one of the tags, and then running a delete against that tag:
DELETE FROM measurement WHERE some_existing_tag='deleteme'
This won't work, though. Because the deleteme tag value gives the second point a different tag set, InfluxDB creates a new point for it. The DELETE then removes that new point, but not the original point you wanted to delete.
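To see why, compare the two points in line protocol (the host tag and timestamp are hypothetical): the differing tag value means the tag sets differ, so these are two distinct points rather than one overwrite:

metrics,host=server01,some_existing_tag=foo cpu=-1 1464623548000000000
metrics,host=server01,some_existing_tag=deleteme cpu=-1 1464623548000000000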
Expensive approaches
Without a time range
# Copy all valid data to a temporary measurement
SELECT * INTO metrics_clean FROM metrics WHERE cpu!=-1 GROUP BY *
# Drop existing dirty measurement
DROP MEASUREMENT metrics
# Copy temporary measurement to existing measurement
SELECT * INTO metrics FROM metrics_clean GROUP BY *
With a time range
# Copy all valid data to a temporary measurement within timerange
SELECT * INTO metrics_clean FROM metrics WHERE cpu!=-1 AND time > '<start_time>' AND time < '<end_time>' GROUP BY *;
# Delete existing dirty data within timerange
DELETE FROM metrics WHERE time > '<start_time>' AND time < '<end_time>';
# Copy temporary measurement to existing measurement
SELECT * INTO metrics FROM metrics_clean GROUP BY *
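In both variants you can drop the temporary measurement once the copy back has been verified:

# Clean up the temporary measurement
DROP MEASUREMENT metrics_clean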
An ugly and slow but fairly robust solution: collect the timestamps of the offending points, then delete the entries by timestamp, optionally filtering the DELETE statement with additional tags.
N.B. this only works if the points have unique timestamps! If multiple points share a timestamp, they are all deleted by the command below. Using epoch=ns practically mitigates this, unless you write on the order of a billion data points per second.
# Collect the timestamps (in nanoseconds) of the offending points
curl -G 'http://localhost:8086/query?db=DATABASE&epoch=ns' \
  --data-urlencode "q=SELECT * FROM metrics WHERE cpu=-1" |\
  jq -r "(.results[0].series[0].values[][0])" > delete_timestamps.txt

# Delete each point by timestamp (DELETE must be sent as a POST in InfluxDB 1.x)
for i in $(cat delete_timestamps.txt); do
  echo $i;
  curl -XPOST 'http://localhost:8086/query?db=DATABASE&epoch=ns' \
    --data-urlencode "q=DELETE FROM metrics WHERE time=$i AND cpu=-1";
done
Related
I would like to store the data in the DWH in a consistent manner. Every week I need to load data into Azure DW from an on-prem SQL DB.
The thing is that I have a primary key in a table which I get every week. An example of the table:
I want to design it in such a way that all 4 records get stored in the DW.
Shall I use a surrogate key, or is there some better way?
If this is staged source data I wouldn't add a surrogate key; typically you only create surrogate keys in your dimensional model.
If your data volume is growing semi-exponentially every time the process is run (unlikely) I would process it as a CTAS; otherwise I would do a
INSERT INTO dbo.table
SELECT *, SYSUTCDATETIME() AS RECORD_INSERT_DATE FROM dbo.table_external_table
So you would just insert all incoming data and add a timestamp for the insert date. Your natural key (NK) plus the timestamp become your unique key on the table.
If your requirements involve easily returning the current version of a record, you could use a type II SCD pattern: set an end date on the most recent version of the record, and a start date plus an active flag on the new version of the record.
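As a rough sketch only - the NK, col1, RECORD_START_DATE, RECORD_END_DATE and IS_ACTIVE columns are assumptions, and UPDATE ... FROM join support varies across Azure SQL DW/Synapse versions - a type II load could look something like:

-- Close out the versions that are being superseded (hypothetical columns)
UPDATE tgt
SET    RECORD_END_DATE = SYSUTCDATETIME(),
       IS_ACTIVE       = 0
FROM   dbo.table AS tgt
JOIN   dbo.table_external_table AS src
  ON   src.NK = tgt.NK
WHERE  tgt.IS_ACTIVE = 1;

-- Insert the incoming rows as the new active versions
INSERT INTO dbo.table (NK, col1, RECORD_START_DATE, RECORD_END_DATE, IS_ACTIVE)
SELECT src.NK, src.col1, SYSUTCDATETIME(), NULL, 1
FROM   dbo.table_external_table AS src;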
We have a requirement to load the last 30 days of updated data from the table.
The potential solutions below do not allow us to do so.
select * from XYZ_TABLE where WRITETIME(lastupdated_timestamp) > (TOUNIXTIMESTAMP(now())-42,300,000);
select * from XYZ_TABLE where lastupdated_timestamp > (TOUNIXTIMESTAMP(now())-42,300,000);
The table has columns as
lastupdated_timestamp (with an index on this field)
lastupdated_userid (with an index on this field)
Any pointers ...
Unless your table was built with this query in mind, your query will search every partition of the database, which becomes very costly once your dataset grows large and will probably result in a timeout.
To efficiently complete this query, XYZ_TABLE should have a primary key something like this:
PRIMARY KEY ((update_month, update_day), lastupdated_timestamp)
This way Cassandra knows exactly where to go to find the data: it has month and day buckets it can locate quickly, and you can then run queries like this to find updates on a certain day.
SELECT * FROM XYZ_TABLE WHERE update_month = '07-18' AND update_day = '06'
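For illustration, the full table definition might look roughly like this (the column types are assumptions, not taken from the question; adjust them and add your remaining columns):

CREATE TABLE xyz_table_by_day (
    update_month text,                 -- e.g. '07-18'
    update_day text,                   -- e.g. '06'
    lastupdated_timestamp timestamp,
    lastupdated_userid text,
    PRIMARY KEY ((update_month, update_day), lastupdated_timestamp)
);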
I have roughly 1500 records in an Access database. I have a field ID that acts as the primary key, and as such cannot be manually changed. After looking through the original Excel sheet these records were kept in, I noticed that a few records in Excel were missing from the Access database. After going through all of them, I added the three missing records into Access.
This database stores records in date order, grouped by manufacturer. E.g. records from Manufacturer1 collected during week 1 of June '16 are all located together, and records from Manufacturer2 collected during week 2 of June '16 are stored directly afterwards. This is important for us because the data in this database often needs to be looked at visually, so keeping things in date order is essential. There is also a macro that exports the data to an Excel sheet and formats it to be easier to read; it exports the records in the order in which they are stored (by the ID field). This is a problem because the three missing records are from years past - now they sit in the middle of records from 2018, and the IDs they were assigned upon entry keep them in that location.
Is there a way to reliably insert these records into the database at the location where they should be, such as shifting the values of other records' ID fields down by 3 to make room for the missing records? I know I could probably move those three records to the desired location manually in the macro that exports to Excel, but I'd rather have a less hacky solution that could work if a similar problem happens again.
The order of data in a database is of no interest to the database - it's the relation between data that matters.
To always view your data in the order you want, use the ORDER BY clause in an SQL statement. Generally you can add data to the underlying table directly through the query - unless you've got many-to-one type queries, where your update would need to affect more than one record.
SELECT FieldName1, FieldName2, . . . .
FROM MyDataTable
ORDER BY Manufacturer, Date
Edit: Even here you'll be adding new records to the bottom of the dataset, but refreshing the query will move the records to the correct order.
I am struggling with the order of data in Cassandra. I have a table like this:
tbl_data
- yymmddhh (text)
- data (text)
partition key is 'yymmddhh'
I am adding data like this:
'16-11-17-01', 'a'
'16-11-17-01', 'b'
'16-11-17-02', 'c'
'16-11-17-03', 'xyz'
'16-11-17-03', 'e'
'16-11-17-03', 'f'
select * from tbl_data limit 10;
I am expecting data in the order in which I added it, but it returns data like this:
'16-11-17-03', 'f'
'16-11-17-03', 'e'
'16-11-17-01', 'a'
i.e. the latest record first, or some seemingly random order. I need the data in the same order in which I added it. I am not able to figure out the default order of the data in my case. Also, I don't want to pass the partition key in the WHERE condition, because it's an overhead for me to remember that value. Kindly suggest a solution.
I'm afraid you will struggle forever on this.
As per comments, you can't decide the order "outside" a partition, unless you really understand what you're doing by changing the partitioner.
Please have a read at the suggested link, and at this and this SO answers to understand why you are getting your records in this specific order (yes, they ARE ordered...).
A possible solution, however, is to add a timestamp clustering key, and change the partition key to a simpler "yymmdd":
tbl_data
- yymmdd (timestamp)
- hhmmssMMM (timestamp)
- data (text)
Now you'd store data on a day-by-day basis (that is, you need to know the day you are querying data for), and the data inside each partition (that is, each day) is sorted by the timestamp column, so for your requirements you'd store the insertion time of the record there.
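Expressed as CQL, the proposed layout might look roughly like this (an ascending clustering order is assumed here so that rows come back in insertion order):

CREATE TABLE tbl_data (
    yymmdd timestamp,       -- day bucket (time truncated to midnight): the partition key
    hhmmssMMM timestamp,    -- insertion time of the record: the clustering key
    data text,
    PRIMARY KEY ((yymmdd), hhmmssMMM)
) WITH CLUSTERING ORDER BY (hhmmssMMM ASC);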
Now, if you don't insert data every day, you really need to keep track of the insertion dates in another (very simple) table:
CREATE TABLE inserted_days (
yymmdd timestamp PRIMARY KEY
);
Issuing a
SELECT * FROM inserted_days
would scan all the partitions of this table, returning records in random order (from your app's point of view, so you need to sort them), but here we are talking about 365 records per year, something you don't need to worry about. It's easy to do and you won't incur unmanageable overheads.
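For example, reading everything back in insertion order would then be roughly (the day literal below is just a placeholder):

-- 1. Find which days actually contain data (sort these in the application)
SELECT yymmdd FROM inserted_days;
-- 2. For each day, read its partition; rows come back ordered by the clustering column
SELECT data FROM tbl_data WHERE yymmdd = '2016-11-17';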
HTH.
I have a table in Cassandra DB as mentioned below:
CREATE TABLE remaining (owner varchar,buddy varchar,remain counter,primary key(owner,buddy));
Generally I do some inc/dec operations on the REMAIN field, using CQL like below:
update remaining set remain=remain + 1 where owner='userA' and buddy='userB';
update remaining set remain=remain + 1 where owner='userA' and buddy='userC';
....
And now I need to find all buddies for userA whose REMAIN field is greater than 1. When I use:
select buddy,remain from remaining where owner='userA' and remain > 0;
it gives me an error:
No indexed columns present in by-columns clause with Equal operator
How can I do this in a Cassandra way?
The short answer to this is that you cannot do queries with conditionals on counter columns in Cassandra.
The reason behind this is that all Cassandra queries need to be modeled around the primary key of that particular table. Counter columns are not allowed as part of the primary key of a table (their changing values would cause constant reorganization of the data on disk). Counter columns are better used for tracking the state of a known piece of data, for example the number of times a photo has been up-voted. This can be recalled quickly as long as we know which photo we are interested in. To actually sort photos by number of votes you would need to perform an analytics-style query using Spark or Hadoop.
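What you can do cheaply is fetch all counters for the owner's partition in one query and apply the filter in your application code, e.g.:

-- Partition-key lookup is efficient; apply the remain > 1 filter client-side
SELECT buddy, remain FROM remaining WHERE owner = 'userA';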