How to update Sphinx main and delta indexes

I've read the Sphinx documentation and various resources, but I am confused about the process of maintaining main and delta indexes. Please let me know if this is correct:
Have a table that partitions the search index by last_update_time (NOT by id, as in the tutorial: http://sphinxsearch.com/docs/1.10/delta-updates.html).
Update the delta index every 15 minutes. The delta index only grabs records that have been updated after last_update_time:
indexer --rotate --config /opt/sphinx/etc/sphinx.conf delta
Update the main index every hour by merging delta using:
indexer --merge main delta --merge-dst-range deleted 0 0 --rotate
The pre-query SQL will update last_update_time to NOW(), which re-partitions the indexes.
Confusion: will the merge run the pre-query SQL?
After the main index is updated, immediately update the delta index to clean it up:
indexer --rotate --config /opt/sphinx/etc/sphinx.conf delta
EDIT: How would deletion of records even work? Since the delta index would still contain deleted records, would they only disappear from search results after the delta index was merged into main?

To deal with deletes you need to take a look at the kill-list; it basically defines removal criteria:
http://sphinxsearch.com/docs/manual-1.10.html#conf-sql-query-killlist
In an example I have, we build our main index daily in the early morning, then simply run a delta update (including the kill-list) every 5 minutes.
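For illustration, a kill-list in sphinx.conf is just a query returning document IDs to suppress from earlier indexes; a rough sketch only, where the documents table and is_deleted column are placeholders, not taken from the question:
source delta : main
{
    # IDs returned here are hidden from the main index
    # when you search main and delta together
    sql_query_killlist = SELECT id FROM documents WHERE is_deleted = 1
}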
On the merge stuff, I'm not sure as I've never used it.

This is only half of the job. Deleted records must be taken care of by the kill-list (now called kbatch), and then the delta will not show the deleted results. But if you merge, they will reappear. To fix this, you have to run:
indexer --merge main delta --merge-dst-range deleted 0 0 --rotate
But in order for this to work, you need an attribute "deleted" added to every document that was deleted. The merge process will then filter out documents that have deleted=1, and the main index will not contain deleted documents.
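For illustration, that just means declaring a 0/1 attribute in the source, roughly like this (table and column names are placeholders):
source main
{
    # expose a 0/1 "deleted" flag so that
    # indexer --merge ... --merge-dst-range deleted 0 0 keeps only rows with deleted = 0
    sql_query     = SELECT id, title, body, deleted FROM documents
    sql_attr_uint = deleted
}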

Related

Difference in counts on a Delta Table immediately after write operation

I have a Databricks job that writes to a certain Delta table. After the write has completed, the job calls another function that reads the same Delta table and calculates some metrics (counts, to be specific) on top of it.
But the counts/metrics being calculated are less than the actual counts for the given partition_dates.
To verify whether the function calculating the metrics works correctly, I called it after the Databricks job had completed and found that the counts were correct in that run.
I feel the Delta table is not completely updated by the time I try to read from it, even though the write operation has completed successfully.
Any help would be appreciated.

What to do to prevent Delta Lake checkpoints to be removed in Azure Databricks?

I noticed that I have only 2 checkpoint files in a Delta Lake folder. Every 10 commits, a new checkpoint is created and the oldest one is removed.
For instance, this morning I had 2 checkpoints: 340 and 350. I was able to time travel from 340 to 359.
Now, after a "write" action, I have 2 checkpoints: 350 and 360. I'm now able to time travel from 350 to 360.
What can remove the old checkpoints? How can I prevent that?
I'm using Azure Databricks 7.3 LTS ML.
The ability to perform time travel isn't directly related to the checkpoint. The checkpoint is just an optimization that allows quick access to the metadata as a Parquet file, without needing to scan the individual transaction log files. This blog post describes the transaction log in more detail.
The commit history is retained for 30 days by default and can be customized as described in the documentation. Please note that VACUUM may remove deleted data files that are still referenced in the commit log, because data files are retained for only 7 days by default, so it's better to check the corresponding settings.
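If you also want to extend those windows, a sketch of setting the table properties delta.logRetentionDuration and delta.deletedFileRetentionDuration (the path and interval values are placeholders, adjust them to your table and requirements):
# placeholder path -- point this at your actual table; intervals are examples
spark.sql("""
  ALTER TABLE delta.`path`
  SET TBLPROPERTIES (
    delta.logRetentionDuration = 'interval 90 days',
    delta.deletedFileRetentionDuration = 'interval 30 days'
  )
""")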
If you perform the following test, you can see that you have history for more than 10 versions:
df = spark.range(10)
for i in range(20):
    df.write.mode("append").format("delta").save("/tmp/dtest")
    # uncomment if you want to see the content of the log after each operation
    # print(dbutils.fs.ls("/tmp/dtest/_delta_log/"))
Then check the files in the log; you should see both checkpoints and files for individual transactions:
%fs ls /tmp/dtest/_delta_log/
Also check the history; you should have at least 20 versions:
%sql
describe history delta.`/tmp/dtest/`
And you should be able to go back to an early version:
%sql
SELECT * FROM delta.`/tmp/dtest/` VERSION AS OF 1
If you want to keep your checkpoints for X days, you can set delta.checkpointRetentionDuration to X days this way:
spark.sql(f"""
  ALTER TABLE delta.`path`
  SET TBLPROPERTIES (
    delta.checkpointRetentionDuration = 'X days'
  )
""")

How to count the number of messages fetched from a Kafka topic in a day?

I am fetching data from Kafka topics and storing it in Delta Lake (Parquet) format. I wish to find the number of messages fetched on a particular day.
My thought process: I thought of reading the directory where the data is stored in Parquet format using Spark and counting the files ending in ".parquet" for a particular day. This returns a count, but I am not really sure whether that's the correct way.
Is this approach correct? Are there any other ways to count the number of messages fetched from a Kafka topic for a particular day (or duration)?
Messages we consume from a topic not only have a key and value, but also other information such as a timestamp, which can be used to track the consumer flow.
Timestamp
The timestamp is set by either the broker or the producer, based on the topic configuration. If the topic's configured timestamp type is CREATE_TIME, the timestamp in the producer record is used by the broker, whereas if the topic is configured with LOG_APPEND_TIME, the timestamp is overwritten by the broker with the broker's local time while appending the record.
So if you keep the timestamp wherever you are storing the messages, you can easily track the per-day or per-hour message rate.
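For example, a rough sketch with the kafka-python client that tallies messages per day from the record timestamps (the topic name, bootstrap servers and group id are placeholders):
from collections import Counter
from datetime import datetime, timezone
from kafka import KafkaConsumer  # pip install kafka-python

# placeholders: adjust topic, servers and group id to your setup
consumer = KafkaConsumer(
    "my-topic",
    bootstrap_servers="localhost:9092",
    group_id="daily-counter",
    auto_offset_reset="earliest",
    consumer_timeout_ms=10000,  # stop iterating when no new messages arrive
)

per_day = Counter()
for record in consumer:
    # record.timestamp is epoch milliseconds; whether it is CREATE_TIME or
    # LOG_APPEND_TIME depends on the topic's message.timestamp.type setting
    day = datetime.fromtimestamp(record.timestamp / 1000, tz=timezone.utc).date()
    per_day[day] += 1

print(per_day)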
Another way is to use a Kafka dashboard such as Confluent Control Center (licensed) or Grafana (free), or any other tool, to track the message flow.
In our case, while consuming and storing or processing messages, we also route the message metadata to Elasticsearch and visualize it through Kibana.
You can make use of the "time travel" capabilities that Delta Lake offers.
In your case you can do
// define location of delta table
val deltaPath = "file:///tmp/delta/table"
// travel back in time to the start and end of the day using the option 'timestampAsOf'
val countStart = spark.read.format("delta").option("timestampAsOf", "2021-04-19 00:00:00").load(deltaPath).count()
val countEnd = spark.read.format("delta").option("timestampAsOf", "2021-04-19 23:59:59").load(deltaPath).count()
// print out the number of messages stored in Delta Table within one day
println(countEnd - countStart)
See documentation on Query an older snapshot of a table (time travel).
Another way to retrieve this information without counting the rows between two versions is to use the Delta table history. There are several advantages to that: you don't read the whole dataset, and you can take updates and deletes into account as well, for example if you're doing a MERGE operation (which is not possible by comparing .count on different versions, because an update replaces the actual value and a delete removes the row).
For example, for just appends, the following code will count all inserted rows written by normal append operations (for other things, like MERGE/UPDATE/DELETE, we may need to look into other metrics):
from delta.tables import *

# sum the numOutputRows metric over every operation in the chosen time window
df = DeltaTable.forName(spark, "ml_versioning.airbnb").history() \
    .filter("timestamp > 'begin_of_day' and timestamp < 'end_of_day'") \
    .selectExpr("cast(nvl(element_at(operationMetrics, 'numOutputRows'), '0') as long) as rows") \
    .groupBy().sum()
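To pull the final number out of that one-row aggregate, something like this should work (assuming the df from above):
# the aggregate is a single-row DataFrame with one sum column
total_rows = df.collect()[0][0]
print(total_rows)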

Cassandra - IN or TOKEN query for querying an entire partition?

I want to query a complete partition of my table.
My compound partition key consists of (id, date, hour_of_timestamp). id and date are strings, hour_of_timestamp is an integer.
I needed to add the hour_of_timestamp field to my partition key because of hotspots while ingesting the data.
Now I'm wondering what's the most efficient way to query a complete partition of my data?
According to this blog, using SELECT * from mytable WHERE id = 'x' AND date = '10-10-2016' AND hour_of_timestamp IN (0,1,...23); is causing a lot of overhead on the coordinator node.
Is it better to use the TOKEN function and query the partition with two tokens? Such as SELECT * from mytable WHERE TOKEN(id,date,hour_of_timestamp) >= TOKEN('x','10-10-2016',0) AND TOKEN(id,date,hour_of_timestamp) <= TOKEN('x','10-10-2016',23);
So my question is:
Should I use the IN or the TOKEN query for querying an entire partition of my data? Or should I use 24 queries (one for each value of hour_of_timestamp) and let the driver do the rest?
I am using Cassandra 3.0.8 and the latest Datastax Java Driver to connect to a 6 node cluster.
You say:
Now I'm wondering what's the most efficient way to query a complete partition of my data? According to this blog, using SELECT * from mytable WHERE id = 'x' AND date = '10-10-2016' AND hour_of_timestamp IN (0,1,...23); is causing a lot of overhead on the coordinator node.
but actually you'd query 24 partitions.
What you probably meant is that you had a design where a single partition was what now consists of 24 partitions, because you added the hour to avoid a hotspot during data ingestion. Noting that in both models (the old one with hotspots and this new one) data is still ordered by timestamp, you have three choices:
Run 1 query at a time.
Run 2 queries the first time, and then one at a time, to "prefetch" results.
Run 24 queries in parallel.
CASE 1
If you process data sequentially, the first choice is to run the query for hour 0, process the data and, when finished, run the query for hour 1, and so on... This is a straightforward implementation, and I don't think it deserves more than this.
CASE 2
If your queries take more time than your data processing, you could "prefetch" some data. So, the first time you could run 2 queries in parallel to get the data for both hour 0 and hour 1, and start processing data for hour 0. In the meantime, data for hour 1 arrives, so when you finish processing data for hour 0 you can prefetch data for hour 2 and start processing data for hour 1. And so on... In this way you can speed up data processing. Of course, depending on your timings (data processing and query times), you should optimize the number of "prefetch" queries.
Also note that the Java Driver does pagination for you automatically, and depending on the size of the retrieved partition, you may want to disable that feature to avoid blocking the data processing, or may want to fetch more data preemptively with something like this:
ResultSet rs = session.execute("your query");
for (Row row : rs) {
    if (rs.getAvailableWithoutFetching() == 100 && !rs.isFullyFetched())
        rs.fetchMoreResults(); // this is asynchronous
    // Process the row ...
}
where you could tune that rs.getAvailableWithoutFetching() == 100 to better suit your prefetch requirements.
You may also want to prefetch more than one partition the first time, so that you ensure your processing won't wait on any data fetching part.
CASE 3
If you need to process data from different partitions together, e.g. you need the data for both hour 3 and hour 6, then you could try to group the queries by "dependency" (e.g. query both hour 3 and hour 6 in parallel).
If you need all of them, then you should run 24 queries in parallel and join the results at the application level (you already know why you should avoid IN across multiple partitions). Remember that your data is already ordered, so the application-level effort would be very small.
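As a sketch of that fan-out, shown here with the DataStax Python driver for brevity (executeAsync in the Java driver works the same way; contact point, keyspace and the id/date values are placeholders):
from cassandra.cluster import Cluster  # pip install cassandra-driver

cluster = Cluster(["127.0.0.1"])       # placeholder contact point
session = cluster.connect("my_ks")     # placeholder keyspace

query = session.prepare(
    "SELECT * FROM mytable WHERE id = ? AND date = ? AND hour_of_timestamp = ?"
)

# fan out one query per hour; each one hits exactly one partition
futures = [session.execute_async(query, ("x", "10-10-2016", hour)) for hour in range(24)]

# gather the results in hour order -- rows within each partition are already
# ordered by the clustering columns, so the application-level "join" is trivial
rows = []
for future in futures:
    rows.extend(future.result())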

Cassandra sstables accumulating

I've been testing out Cassandra to store observations.
All "things" belong to one or more reporting groups:
CREATE TABLE observations (
group_id int,
actual_time timestamp, /* 1 second granularity */
is_something int, /* 0/1 bool */
thing_id int,
data1 text, /* JSON encoded dict/hash */
data2 text, /* JSON encoded dict/hash */
PRIMARY KEY (group_id, actual_time, thing_id)
)
WITH compaction={'class': 'DateTieredCompactionStrategy',
'tombstone_threshold': '.01'}
AND gc_grace_seconds = 3600;
CREATE INDEX something_index ON observations (is_something);
All inserts are done with a TTL, and should expire 36 hours after "actual_time". Something that is beyond our control is that duplicate observations are sent to us. Some observations are sent in near real time, others delayed by hours.
The "something_index" is an experiment to see if we can slice queries on a boolean property without having to create separate tables, and it seems to work.
"data2" is not currently being written; it is meant to be written by a different process than the one that writes "data1", but will be given the same TTL (based on "actual_time").
Particulars:
Three nodes (EC2 m3.xlarge)
Datastax ami-ada2b6c4 (us-east-1) installed 8/26/2015
Cassandra 2.2.0
Inserts from Python program using "cql" module
(had to enable "thrift" RPC)
Running "nodetool repair -pr" on each node every three hours (staggered).
Inserting between 1 and 4 million rows per hour.
I'm seeing large numbers of data files:
$ ls *Data* | wc -l
42150
$ ls | wc -l
337201
Queries don't return expired entries, but files older than 36 hours are not going away!
The large number of SSTables is probably caused by the frequent repairs you are running. Repair would normally only be run once a day or once a week, so I'm not sure why you are running repair every three hours. If you are worried about missing writes during short-term downtime, then you could set the hint window to three hours instead of running repair so frequently.
You might have a look at CASSANDRA-9644. This sounds like it is describing your situation. Also CASSANDRA-10253 might be of interest.
I'm not sure why your TTL isn't working to drop old SSTables. Are you setting the TTL on a whole row insert, or individual column updates? If you run sstable2json on a data file, I think you can see the TTL values.
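For example (the data file path is a placeholder; sstable2json ships with Cassandra 2.2), something like this should show the TTL on expiring cells:
# expiring cells in the dump include their TTL and local expiration time
sstable2json /path/to/<sstable>-Data.db | less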
Full disclosure: I have a love/hate relationship with DTCS. I manage a cluster with hundreds of terabytes of data in DTCS, and one of the things it does absolutely horribly is streaming of any kind. For that reason, I've recommended replacing it ( https://issues.apache.org/jira/browse/CASSANDRA-9666 ).
That said, it should mostly just work. However, there are parameters that come into play, such as timestamp_resolution, that can throw things off if set improperly.
Have you checked the sstable timestamps to ensure they match timestamp_resolution (default: microseconds)?
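A quick way to check (the path is a placeholder; sstablemetadata is typically under tools/bin in 2.2):
# the reported minimum/maximum timestamps should look like microseconds since the epoch
sstablemetadata /path/to/<sstable>-Data.db | grep -i timestamp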
