How to overwrite counter fields when you COPY FROM data? (Cassandra)

Is it possible to somehow overwrite existing counter columns when you COPY FROM data (from CSV),
or to completely delete rows from the database?
When I COPY FROM data into existing rows, the counters are summed rather than replaced.
I can't completely DELETE these rows either:
although the rows appear to be deleted, when I re-COPY FROM the data from CSV,
the counter columns continue to increase.

You can't set counters to a specific value - the only supported operations on them are increment and decrement. To set a counter to a specific value you can either decrement it by its current value and then increment it by the desired value (but this requires reading the current value first), or delete the corresponding cells (or the whole row) and then perform a single increment with the desired number.
The second approach is easier to implement, but it requires that you first generate a file of CQL DELETE statements based on the content of your CSV file, execute those deletes, and then use COPY FROM - if nobody has incremented the values since the deletion, the counters will end up with the correct values.
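For illustration, here is a minimal Python sketch of the second approach using the DataStax cassandra-driver. The keyspace ks, the table page_views with a views counter, the key column page_id, and the CSV layout are all assumptions for the example, not details from the question:

import csv
from cassandra.cluster import Cluster  # pip install cassandra-driver

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("ks")

# Delete every row that appears in the CSV so its counters reset to zero.
with open("page_views.csv", newline="") as f:
    for row in csv.reader(f):
        page_id = row[0]  # assumes the first CSV column is the partition key
        session.execute("DELETE FROM page_views WHERE page_id = %s", (page_id,))

cluster.shutdown()

After the deletes have been applied, reload the data in cqlsh with COPY ks.page_views (page_id, views) FROM 'page_views.csv'; as noted above, this only yields the exact CSV values if nothing else increments the counters in between.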

Related

Most efficient way to avoid injecting duplicate rows into Postgres db

This is more of a conceptual question. I'm building a relational db using Python and the psycopg2 library, and have a table with over 44 million rows (and growing) into which I want to inject sanitized rows from a CSV file without injecting duplicate rows; each row has an auto-incrementing unique id from its origin db table.
The current way I'm injecting the data is with the COPY table(columns...) FROM '/path/to/file' command, which is working like a charm. This occurs after we've sanitized all rows in the CSV file so that their datatypes match the datatypes of the appropriate columns in the table.
There are a few ideas that I have in mind, and one I've tried, but I want to see what the most efficient option is before implementation.
The one I tried ended up being a tremendous burden on the server's CPU and memory, so we decided not to proceed with it. I had created a script that, for each row, queried the db to search for its unique id in the table (over 44 million rows).
My other candidate solutions:
Allow injection of duplicates, then create a script to clean up any duplicate rows in the table.
Create a temporary table with the data from the CSV. Compare the temp table with the existing table, removing any duplicate values from the temp table, then inject the temp table into the existing table.
The temp-table idea might be simplified: instead of comparing the two tables, we just use the INSERT INTO command along with the ON CONFLICT option (see the sketch after this list).
This one might be more of a stretch of the imagination, and is probably pretty unique to our situation. But since we know that the unique id field is auto-incrementing, we can set a global variable to the largest unique id value in the table; then, before sanitizing the data, we query whether each row's unique id is less than the global variable, and if it is, we throw out the row instead of injecting it. (No longer an option)
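Here is a minimal psycopg2 sketch of the staging-table variant with INSERT ... ON CONFLICT DO NOTHING. The table name events, the columns source_id and payload, and the connection string are hypothetical stand-ins, and the approach assumes a unique constraint (or primary key) exists on source_id:

import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")
with conn, conn.cursor() as cur:
    # Stage the sanitized CSV in a temp table shaped like the target table.
    cur.execute("CREATE TEMP TABLE staging (LIKE events INCLUDING DEFAULTS)")
    with open("/path/to/file.csv") as f:
        cur.copy_expert("COPY staging (source_id, payload) FROM STDIN WITH CSV", f)
    # Insert only rows whose source_id is not already in the target table;
    # ON CONFLICT DO NOTHING silently skips the duplicates.
    cur.execute("""
        INSERT INTO events (source_id, payload)
        SELECT source_id, payload FROM staging
        ON CONFLICT (source_id) DO NOTHING
    """)
conn.close()

This keeps the fast COPY path for the bulk load while letting Postgres resolve duplicates with its own index lookup, rather than issuing a client-side query per row.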

Hash of a complete table

The Excel macro I've created checks the values of some tables for data coherence before running the actual code.
At first, the computation time was not perceptible, but my tables are getting bigger and bigger...
What I want to do is check the data coherence only if the table contents were modified since the last check. And I thought of hashing.
But I was wondering: is it possible to quickly create a hash of an entire table? If I start creating a hash of each cell, I'm afraid the computation time will be similar.
Thanks in advance for your help!
What you can do is, after every check of the data coherence, make a copy of that table into a hidden sheet (to freeze that state of the data).
Next time you run your code, you just compare your data against the hidden copy to see which data changed. Then you only need to check the coherence of the changed data.
Comparisons like this can be done quickly by reading both (the data and the hidden copy) into arrays and comparing the arrays.
You can read a full range of data into an array with one single line of code:
Dim DataArray() As Variant
DataArray = ThisWorkbook.Worksheets("Data").Range("A1:C10").Value
DataArray is now an array containing the data of range A1:C10, and you can access it using:
DataArray(row, column)
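To complete the picture, here is a small VBA sketch of the comparison step, continuing the DataArray example above. The hidden sheet name "DataCopy" and the fixed range are assumptions for the example:

Dim CopyArray() As Variant
CopyArray = ThisWorkbook.Worksheets("DataCopy").Range("A1:C10").Value

Dim r As Long, c As Long
For r = LBound(DataArray, 1) To UBound(DataArray, 1)
    For c = LBound(DataArray, 2) To UBound(DataArray, 2)
        If DataArray(r, c) <> CopyArray(r, c) Then
            ' This cell changed since the last check - only it needs re-validation
        End If
    Next c
Next r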

Does Cassandra store only the affected columns when updating a record or does it store all columns every time it is updated?

If the answer is yes:
Does that mean that, unlike in Mongo or an RDBMS, whether we retrieve every column or only some columns will have a big performance impact in Cassandra? (I am not talking about transfer time over the network, as that affects all of the above.)
Does that mean that during compaction it cannot just stop when it finds the latest row for a primary key, but has to go through the full set of SSTables? (I understand there will be optimisations, as a previously compacted SSTable will have at most one occurrence of a row.)
Please ask only one question per question.
That is entirely up to you. If you write one column value, it'll persist just that one. If you write them all, they will all persist, even if they are the same as the current value.
whether we retrieve every column or some column will have big performance impact
This is definitely the case. Queries for column values that are small, or that haven't been overwritten or deleted repeatedly, will be much faster than the opposite.
during compaction, it cannot just stop when it finds the latest row for a primary key, it has to go through the full set in SSTables?
Yes. And not just during compaction: read queries will also have to check multiple SSTable files.
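To make the answer to the first question concrete, here is a minimal sketch with the Python cassandra-driver; the keyspace ks and the users table with its id, name, and email columns are hypothetical:

from cassandra.cluster import Cluster  # pip install cassandra-driver

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("ks")

# Persists only the 'email' cell; the stored 'name' cell is untouched.
session.execute("UPDATE users SET email = %s WHERE id = %s", ("a@example.com", "u1"))

# Persists a cell for every listed column, even if a value is identical
# to what is already stored.
session.execute(
    "INSERT INTO users (id, name, email) VALUES (%s, %s, %s)",
    ("u1", "Alice", "a@example.com"),
)

cluster.shutdown()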

Custom parallel extractor - U-SQL

I'm trying to create a custom parallel extractor, but I have no idea how to do it correctly. I have big files (more than 250 MB) where the data for each row is stored on 4 lines; each line of the file stores the data for one column. Is it possible to create a working parallel extractor for large files? I am afraid that the data for one row will end up in different extents after the file is split.
Example:
...
Data for first row
Data for first row
Data for first row
Data for first row
Data for second row
Data for second row
Data for second row
Data for second row
...
Sorry for my English.
I think you can process this data with U-SQL sequentially rather than in parallel. You have to write a custom applier that takes single/multiple rows and returns single/multiple rows, and then invoke it with CROSS APPLY. You can take help from this applier.
U-SQL Extractors by default are scaled out to work in parallel over smaller parts of the input files, called extents. These extents are about 250MB in size each.
Today, you have to upload your files as row-structured files to make sure that the rows are aligned with the extent boundaries (although we are going to provide support for rows spanning extent boundaries in the near future). Either way though, the extractor UDO model does not know whether your 4 rows are all inside the same extent or split across extents.
So you have two options:
Mark the extractor as operating on the whole file by adding the following line before the extractor class:
[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
Now the extractor will see the full file, but you lose the scale-out of the file processing.
You extract one row per line and use a U-SQL statement (e.g. using window functions or a custom REDUCER) to merge the rows into a single row.
I have discovered that I can't use a static method to get an instance of my IExtractor implementation in the USING statement if I want AtomicFileProcessing set to true.

Azure Table Storage Delete where Row Key is Between two Row Key Values

Is there a good way to delete entities that are in the same partition given a row key range? It looks like the only way to do this would be to do a range lookup and then batch the deletes after looking them up. I'll know my range at the time that entities will be deleted so I'd rather skip the lookup.
I want to be able to delete things to keep my partitions from getting too big. As far as I know, a single partition cannot be scaled across multiple servers. Each partition is going to represent a type of message that a user sends. There will probably be fewer than 50 types. I need a way to show all the messages of each type that were sent (e.g. show recent messages of type 0, regardless of who sent them). This is why I plan to make the type the partition key. Since the types don't scale with the number of users/messages, though, I don't want to let each partition grow indefinitely.
Unfortunately, you need to know the precise PartitionKeys and RowKeys in order to issue deletes. You do not need to retrieve the entities from storage if you know the precise RowKeys, but you do need them in order to issue a batch delete. There is no magic "DELETE FROM table WHERE partitionkey = 10" command like there is in SQL. A sketch of the lookup-then-batch-delete pattern follows.
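For reference, a minimal Python sketch of that pattern with the azure-data-tables package. The table name Messages, the key values, and the connection string are all assumptions for the example:

from azure.data.tables import TableClient  # pip install azure-data-tables

client = TableClient.from_connection_string(
    "UseDevelopmentStorage=true", table_name="Messages"
)

# Look up only the keys in the target range...
flt = "PartitionKey eq '0' and RowKey ge '1000' and RowKey lt '2000'"
batch = []
for entity in client.query_entities(flt, select=["PartitionKey", "RowKey"]):
    batch.append(("delete", entity))
    # ...and delete them in batches (max 100 operations, single partition).
    if len(batch) == 100:
        client.submit_transaction(batch)
        batch = []
if batch:
    client.submit_transaction(batch)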
However, consider breaking your data up into tables that represent archivable time units. For example, in AzureWatch we store all of the metric data in tables that each represent one month of data, i.e. Metrics201401, Metrics201402, etc. Thus, when it comes time to archive, a full table is purged for a particular month.
The obvious downside of this approach is the need to "union" data from multiple tables if your queries span wide time ranges. However, if you keep your time ranges small, the number of unions will not be that big. Basically, this approach allows you to use the table name as another partitioning opportunity.
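A tiny sketch of that naming scheme; the Metrics prefix matches the example above, while the service client is a hypothetical azure-data-tables TableServiceClient:

from datetime import datetime, timezone

def metrics_table_name(dt):
    # e.g. Metrics201401 for January 2014
    return f"Metrics{dt:%Y%m}"

month = datetime(2014, 1, 1, tzinfo=timezone.utc)
# Archiving then becomes a single whole-table drop instead of row-by-row deletes:
# table_service_client.delete_table(metrics_table_name(month))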
