When I insert a value into a cell in Bigtable, it doesn't overwrite the previous value; instead it adds another copy of the value under the same column qualifier in the same row. The only difference is the timestamp.
Is this normal? The GC policy on my table is the default, so I expect it to retain only one version of my data. Am I misunderstanding something?
common:delete_flag # 2018/03/05-18:19:21.638000
"0"
common:delete_flag # 2018/03/05-19:51:52.933000
"0"
common:delete_flag # 2018/03/05-18:34:09.517000
"0"
common:delete_flag # 2018/03/05-18:28:21.614000
"0"
common:delete_flag # 2018/03/05-18:30:41.711000
"0"
Edit: Maybe this is my answer https://stackoverflow.com/a/46861250/3398347?
Your edit has it right. Bigtable garbage collection happens opportunistically in the background, so more than one version could be kept around at any point in time.
Be sure to use filters to restrict the results of a Read operation, to ensure that you don't see more data than you require.
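For example, with the Python client, a CellsColumnLimitFilter of 1 returns only the newest cell per column regardless of whether garbage collection has caught up yet. This is a minimal sketch; the project, instance, table, and row key names are placeholders:

from google.cloud import bigtable
from google.cloud.bigtable import row_filters

client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("my-table")

# Ask the server for at most one (the most recent) cell per column.
latest_only = row_filters.CellsColumnLimitFilter(1)
row = table.read_row(b"my-row-key", filter_=latest_only)
if row is not None:
    cell = row.cells["common"][b"delete_flag"][0]
    print(cell.value, cell.timestamp)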
This is another version of my first question, and I hope I can explain my problem better this time.
From Table 1, I want to auto-populate Table 2 based on the conditions and criteria below.
From the example, I basically have 3 initial criteria: ON CALL, AVAILABLE, and BREAK.
Now for the conditions: I want all agents with status ON CALL, AVAILABLE, or BREAK in Table 1 to be populated in Table 2 (optional: if possible, I want to show only agents that have a duration of 4 minutes and above in each status). My problem is that I keep refreshing Table 1 to get updated data. My goal is to monitor our agents' current status and running duration, so that I only need to check Table 2 to see right away who has the highest running duration in each status and needs to be called out.
I have only tried the MAXIFS function, but my problem with it is that it can only show 1 result per status.
What I want is to fully populate Table 2 from the data in Table 1. If this is possible with the ROW function, that would be great, because what I really want is a clean table that only loads data when the criteria are met.
Thank you
Something you may be interested in doing is utilizing HSTACK. I am not sure how you are currently obtaining the Agent's name in the column adjacent to the results, but this will populate both the Agent and the Duration.
=HSTACK(INDEX(A:C,MATCH(SORT(FILTER(C:C,(C:C>=TIMEVALUE("00:04:00"))*(B:B=H2),""),1,1),C:C,0),1),TEXT(SORT(FILTER(C:C,(C:C>=TIMEVALUE("00:04:00"))*(B:B=H2),""),1,1),"[h]:mm:ss"))
This formula checks Table 1 for any Agent with the status referenced in H2 (Available) that also has a time greater than or equal to 4 minutes. It then sorts the results in ascending order and populates the Agent name associated with each. It is dynamic and will produce a table like the following:
Just update the formula to check for "On Call" and "Break" as desired for the other two.
UPDATE:
As for conditional formatting, this utilizes the custom formula posted in the comments. If the times are formatted as [h]:mm:ss, then you would be looking to do something like this. Notice the 2 cells are highlighted for being between 4 mins and 5 mins.
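The exact rule from the comments isn't reproduced here, but assuming the durations sit in column C starting at row 2, a custom formula along these lines highlights cells between 4 and 5 minutes:
=AND(C2>=TIME(0,4,0), C2<=TIME(0,5,0))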
This is an array solution that spills all the results at once. We use a user-defined LAMBDA function, GET, to avoid repeating the same calculation; it takes the status (s) as its input parameter. The formula works for durations in time format, or in text format with a minor modification. In cell E2, put the following formula for durations in time format:
=LET(GET, LAMBDA(s, FILTER(HSTACK(A:A, C:C), (B:B=s)
* IFERROR(C:C >= TIME(0,4,0), FALSE))),
IFERROR(HSTACK(GET("ON CALL"), GET("Available"), GET("Break")),""))
Here is the output:
For durations as text in hh:mm:ss format, just replace C:C >= TIME(0,4,0) with TIMEVALUE(C:C) >= TIME(0,4,0).
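That is, the full formula for text durations becomes:
=LET(GET, LAMBDA(s, FILTER(HSTACK(A:A, C:C), (B:B=s)
* IFERROR(TIMEVALUE(C:C) >= TIME(0,4,0), FALSE))),
IFERROR(HSTACK(GET("ON CALL"), GET("Available"), GET("Break")),""))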
The GET function is reused to generate the result for each status. The last IFERROR call removes the #N/A values generated by HSTACK when a column has fewer rows than the tallest column of the output.
The first IFERROR treats the case where a value is not numeric, such as the header; this is needed because we use the entire column as the input range. Entire columns produce more concise formulas with less maintenance effort, but they are less efficient, so unless you have a good reason to keep an open range, you can use a specific range for the table data instead, remove that IFERROR, and update the ranges accordingly.
As you can see below, the "IsActive" column is defined to detect deletion.
If I go to the DB and change a record's CreationTime and some data, after running the indexer the changes are applied in the search service.
However, if I go to the DB and change the IsActive column to 0 (false, since it is a bit column) and, of course, the creation time, after running the indexer I expect the record to disappear from the search service, but it is still there.
When updating the IsActive column, you also need to update CreationTime to indicate that the row has changed.
Also, Azure Search sees BIT columns as boolean values instead of 0/1 - so try using "false" as the delete marker value.
Note that SQL integrated change tracking policy would take care of both updates and deletes - consider using it if possible.
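For reference, the relevant piece of the data source definition in the REST API looks roughly like this (the column name comes from this question; the right marker value for BIT columns is debated below):

"dataDeletionDetectionPolicy": {
  "@odata.type": "#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy",
  "softDeleteColumnName": "IsActive",
  "softDeleteMarkerValue": "false"
}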
The value needs to be set as a string in quotes, so in your case the exact value to put in the "Delete Marker Value" field is "false", including the quotes.
IsActive column to 0 (false, since it is a bit column)
Here's the problem. According to the newest documentation:
if you have an integer column where deleted rows are marked with the value 1, use "1". If you have a BIT column where deleted rows are marked with the Boolean true value, use the string literal True or true; the case doesn't matter.
So in my case (a BIT column), I set true (without quotes) as the delete marker value and everything works like a charm.
I have a simple request log where each record has an execution time in seconds, under an exec_time property. It should always be a number (the lower the better). Based on that property I have a dashboard widget that shows its mean value, and it was working just fine until recently. Now it shows NaN.
My guess is that there's one or more records with exec_time that's not numeric. How can I find these records?
Take your request logs and iterate over them with a simple script to check every value individually, printing out the ones that are wrong. With no other information, my guess is you have a mis-tagged non-number field feeding into "exec_time", an empty (null/None) value, something too large, something too small, or a corrupted entry somewhere.
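A minimal sketch, assuming the logs are newline-delimited JSON with one record per line (adjust the loading to your actual log format and file name):

import json

with open("requests.log") as f:
    for line_no, line in enumerate(f, start=1):
        record = json.loads(line)
        value = record.get("exec_time")
        # Flag anything missing or not a plain number (bool is an int subclass in Python).
        if not isinstance(value, (int, float)) or isinstance(value, bool):
            print(f"line {line_no}: exec_time = {value!r}")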
Is there an efficient way of identifying which date is the maximum across 12 columns of data that all hold different dates? Naturally, the easiest approach is MAX(range A : range L), pulled down for the rest of the rows. However, this isn't what I want.
What I want is a function that compares the dates across each row and, if a value is the maximum, highlights it. Each column is responsible for a specific part of a process, and I want to identify where the largest delay is.
My initial thought was to define 2 variables, have each temporarily hold one value, and check whether var 1 > var 2; if it is, move on to the next value (a FOR EACH loop), otherwise the max value has been reached, so highlight that value.
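For illustration, assuming the 12 date columns run A through L starting at row 2, even a conditional formatting custom formula would mark the per-row maximum without a loop:
=A2=MAX($A2:$L2)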
Would anyone be able to assist me?
I am developing an SSIS package, trying to update an existing SQL table from a CSV flat file. All of the columns update successfully except for one. If I set that column to ignore truncation, my package completes successfully, so I know this is a truncation problem rather than some other error.
This column is empty for almost every row. However, there are a few rows where the field is 200-300 characters. My data conversion task identified the field as DT_WSTR, but from what I've read elsewhere, maybe it should be DT_NTEXT. I've tried both, and I even set the DT_WSTR length to 500, but none of this fixed the problem. How can I fix it? What data type should this column be in my SQL table?
Error: 0xC02020A1 at Data Flow Task 1, Source - Berkeley812_csv [1]: Data conversion failed. The data conversion for column "Reason for Delay in Transition" returned status value 4 and status text "Text was truncated or one or more characters had no match in the target code page.".
Error: 0xC020902A at Data Flow Task 1, Source - Berkeley812_csv [1]: The "output column "Reason for Delay in Transition" (110)" failed because truncation occurred, and the truncation row disposition on "output column "Reason for Delay in Transition" (110)" specifies failure on truncation. A truncation error occurred on the specified object of the specified component.
Error: 0xC0202092 at Data Flow Task 1, Source - Berkeley812_csv [1]: An error occurred while processing file "D:\ftproot\LocalUser\RyanDaulton\Documents\Berkeley Demographics\Berkeley812.csv" on data row 758.
One possible reason for this error is that your delimiter character (comma, semi-colon, pipe, whatever) actually appears in the data in one column. This can give very misleading error messages, often with the name of a totally different column.
One way to check this is to redirect the 'bad' rows to a separate file and then inspect them manually. Here's a brief explanation of how to do that:
http://redmondmag.com/articles/2010/04/12/log-error-rows-ssis.aspx
If that is indeed your problem, then the best solution is to fix the files at the source to quote the data values and/or use a different delimiter that isn't in the data.
I've had this issue before; it is likely that the default column size for the file is incorrect. It defaults to 50 characters, but the data you are working with is larger. In the advanced settings for your data file, adjust the column size from 50 to the table's column size.
I suspect the "or one or more characters had no match in the target code page" part of the error.
If you remove the rows with values in that column, does it load?
In other words, can you identify the rows that cause the package to fail?
It could be that the data is too long, or that there's some funky character in there that SQL Server doesn't like.
If this is coming from the SQL Server Import Wizard, try editing the definition of the column on the Data Source; it is 50 characters by default, but it can be longer.
Data Source -> Advanced -> look at the column that errors -> change OutputColumnWidth to 200 and try again.
I've had this problem before. You can go to the "Advanced" tab of the "Choose a Data Source" page, click the "Suggest Types" button, and set the "Number of rows" as high as you want. After that, the column types and text-qualified settings are set to the correct values.
I applied the above solution and was able to convert my data to SQL.
In my case, some of my rows didn't have the same number of columns as the header. For example, the header has 10 columns and one of your rows has 8 or 9. (To check, count the delimiter characters in each line: a 10-column row should contain exactly 9 delimiters.)
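A quick way to find such rows is a short script. This is a minimal sketch assuming a comma-delimited file with a header row; the file name comes from the error message above:

import csv

with open(r"D:\ftproot\LocalUser\RyanDaulton\Documents\Berkeley Demographics\Berkeley812.csv", newline="") as f:
    reader = csv.reader(f)  # handles quoted fields, unlike a plain split(",")
    header = next(reader)
    for line_no, fields in enumerate(reader, start=2):
        if len(fields) != len(header):
            print(f"line {line_no}: expected {len(header)} columns, got {len(fields)}")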
If all other options have failed, try recreating the data import task and/or the connection manager. If you've made any changes since the task was originally created, this can sometimes do the trick. I know it's the equivalent of rebooting, but hey, if it works, it works.
I had the same problem, and it was due to a column with very long data.
When mapping it, I changed the type from DT_STR to Text_Stream, and it worked.
In the destination, under Advanced, check that the length of the column matches the source.
The OutputColumnWidth of the column must be increased.
Path: Source Connection Manager -> Advanced -> OutputColumnWidth