Hive version: 3.1.0.3.1.4.0-315
Spark version: 2.3.2.3.1.4.0-315
Basically, I am trying to read transactional table data from Spark. As per https://stackoverflow.com/questions/50254590/how-to-read-orc-transaction-hive-table-in-spark, I found that the transactional table has to be compacted first, so I want to try this approach.
I am new to this and was trying compaction on the delta files, but it always shows "initiated" and never completes.
This is happening for both major and minor compaction. Any help will be highly appreciated.
I want to know whether this is a good approach.
Also, how can I monitor the compaction job other than with SHOW COMPACTIONS? I can only see the line "Compaction enqueued with id 1" in hiveserver_stdout.log.
Generally, how long does this compaction take to complete?
Is there any way to stop the compactions?
TIA.
[Edited]
SHOW COMPACTIONS;
+---------------+-----------+----------------+----------------+--------+------------+-----------+-------------+---------------+--------------+
| compactionid | dbname | tabname | partname | type | state | workerid | starttime | duration | hadoopjobid |
+---------------+-----------+----------------+----------------+--------+------------+-----------+-------------+---------------+--------------+
| CompactionId | Database | Table | Partition | Type | State | Worker | Start Time | Duration(ms) | HadoopJobId |
| 1 | tmp | shop_na2 | dt=2014-00-00 | MAJOR | initiated | --- | --- | --- | --- |
| 2 | tmp | na2_check | dt=2014-00-00 | MINOR | initiated | --- | --- | --- | --- |
+---------------+-----------+----------------+----------------+--------+------------+-----------+-------------+---------------+--------------+
3 rows selected (0.408 seconds)
The same compaction result has been showing for the past 36 hours, even though the retention period has been set to 86400 seconds.
It is advised to perform this operation when the load on the cluster is low, e.g. initiate it over a weekend when fewer jobs are running. It is a resource-intensive operation, and the amount of time it takes depends on the data, but a moderate quantity of deltas can span multiple hours. You can use the query SHOW COMPACTIONS; to get an update on the status of a compaction, including the following details (a sketch of triggering and watching a compaction follows the list):
* Database name
* Table name
* Partition name
* Major or minor compaction
* Compaction state:
  * Initiated: waiting in queue
  * Working: currently compacting
  * Ready for cleaning: compaction completed and old files scheduled for removal
* Thread ID
* Start time of compaction
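If a compaction stays in "initiated" indefinitely, a common cause is that no compactor Worker threads are configured on the Hive Metastore. A minimal HiveQL sketch, reusing the table and partition names from the question (the property values are illustrative and belong in hive-site.xml on the Metastore host):

```sql
-- Needed on the Metastore for enqueued compactions to actually run
-- (shown as comments; they cannot be toggled per session):
--   hive.compactor.initiator.on   = true
--   hive.compactor.worker.threads = 2    -- must be > 0

-- Manually request a major compaction of one partition:
ALTER TABLE tmp.shop_na2 PARTITION (dt='2014-00-00') COMPACT 'major';

-- Watch the state move: initiated -> working -> ready for cleaning
SHOW COMPACTIONS;
```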
I have a "master" pipeline in Azure Data factory, which looks like this:
One rectangle is Execute pipeline activity for 1 destination (target) Table, so this "child" pipeline takes some data, transform it and save as a specified table. Essentialy this means that before filling table on the right, we have to fill previous (connected with line) tables.
The problem is that this master pipeline contains more than 100 activities and the limit for data factory pipeline is 40 activities.
I was thinking about dividing pipeline into several smaller pipelines (i.e. first layer (3 rectangles on the left), then second layer etc.), however this could cause pipeline to run a lot longer as there could be some large table in each layer.
How to approach this? What is the best practice here?
I had a similar issue at work, but I didn't use Execute Pipeline because it was a terrible approach in my case. I have more than 800 PLs to run, with multiple parent and child dependencies that can go multiple levels deep depending on the complexity of the data, plus several restrictions (starting with transforming data for 9 regions in the US while reusing PLs). A simplified diagram of one of many cases I have can easily look like this:
The solution:
A master dependency table in which to store all the dependencies:
| Job ID | dependency ID | level | PL_name |
|--------|---------------|-------|--------------|
| Token1 | | 0 | |
| L1Job1 | Token1 | 1 | my_PL_name_1 |
| L1Job2 | Token1 | 1 | my_PL_name_2 |
| L2Job1 | L1Job1,L1Job2 | 2 | my_PL_name_3 |
| ... | ... | ... | ... |
From here it is a tree problem:
There are ways of mapping trees in SQL (a sketch follows the tracker table below). Once you have all the dependencies mapped from the tree, put them in a staging or tracker table:
| Job ID | dependency ID | level | status | start_date | end_date |
|--------|---------------|-------|-----------|------------|----------|
| Token1 | | 0 | | | |
| L1Job1 | Token1 | 1 | Running | | |
| L1Job2 | Token1 | 1 | Succeeded | | |
| L2Job1 | L1Job1,L1Job2 | 2 | | | |
| ... | ... | ... | ... | ... | ... |
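To make the tree mapping concrete, here is a hypothetical T-SQL sketch that derives each job's level with a recursive CTE. It assumes the dependencies are normalized to one row per job/parent pair; all table and column names are illustrative:

```sql
-- Walk the dependency tree down from the root token and compute levels.
WITH job_tree AS (
    SELECT job_id, CAST(0 AS int) AS level
    FROM   dbo.master_dependency
    WHERE  dependency_id IS NULL               -- the level-0 root token
    UNION ALL
    SELECT d.job_id, t.level + 1
    FROM   dbo.master_dependency AS d
    JOIN   job_tree AS t ON d.dependency_id = t.job_id
)
SELECT   job_id, MAX(level) AS level           -- a job runs at its deepest level
FROM     job_tree
GROUP BY job_id;
```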
We can easily query this table using a Lookup activity to get the level-1 PLs to run, and use a ForEach activity to trigger each target PL with a dynamic Web activity. Then update the tracker table's status, start_date, end_date, etc. accordingly per PL.
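A hypothetical sketch of the two queries behind that Lookup and the per-PL update (names follow the tables above and are illustrative):

```sql
-- Lookup activity: level-1 PLs that have not started yet.
SELECT t.job_id, d.PL_name
FROM   dbo.job_tracker AS t
JOIN   dbo.master_dependency AS d ON d.job_id = t.job_id
WHERE  t.level = 1 AND t.status IS NULL;

-- When a PL finishes, record its outcome.
UPDATE dbo.job_tracker
SET    status = 'Succeeded', end_date = GETDATE()
WHERE  job_id = 'L1Job1';
```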
There are only two PLs orchestrating:
1. one for mapping the tree and assigning some type of unique ID to that batch;
2. one for validation (it verifies the status of parent PLs and controls which PL to run next).
Note: both call a stored procedure with some logic that depends on the case.
I have a recursive call to the validation PL each time a target pipeline ends.
Let's assume L1Job1 and L1Job2 are running in parallel:
L1Job1 ends successfully -> it calls the validation PL -> validation triggers L2Job1 only if both L1Job1 and L1Job2 have a Succeeded status.
If L1Job2 hasn't ended yet, the validation PL ends without triggering L2Job1.
Then L1Job2 ends successfully -> it calls the validation PL -> validation triggers L2Job1 only if both L1Job1 and L1Job2 have a Succeeded status.
L2Job1 starts running after passing the validations.
Repeat for each level.
This works because we already mapped all the PL dependencies in the job tracker and we know exactly which PLs should run.
I know this looks complicated and maybe it can't be applied to your case, but I hope it gives you or others a clue on how to solve complex data workflows in Azure Data Factory.
Yes, as per the documentation, the maximum number of activities per pipeline, which includes inner activities for containers, is 40.
So the only option left is splitting your pipeline into multiple smaller pipelines.
Please check the link below for the limitations on ADF:
https://github.com/MicrosoftDocs/azure-docs/blob/master/includes/azure-data-factory-limits.md
In our crawling system we have a 5-node Cassandra cluster. I have a scenario in which I want to delete Cassandra data as soon as it becomes older than x days.
Ex:
| id | name    | created_date |
|----|---------|--------------|
| 1  | Dan     | 2017-08-01   |
| 2  | Monk    | 2017-08-02   |
| 3  | Shibuya | 2017-08-03   |
| 4  | Rewa    | 2017-08-04   |
| 5  | Himan   | 2017-08-05   |
If x = 3, then the following should be the scenario:
| id | name    | created_date | action                       |
|----|---------|--------------|------------------------------|
| 1  | Dan     | 2017-08-01   | DELETE                       |
| 2  | Monk    | 2017-08-02   | DELETE                       |
| 3  | Shibuya | 2017-08-03   | REMAIN (latest 3 days' data) |
| 4  | Rewa    | 2017-08-04   | REMAIN (latest 3 days' data) |
| 5  | Himan   | 2017-08-05   | REMAIN (latest 3 days' data) |
If new data is added, then the row with id=3 should be deleted as well.
Is there any Cassandra configuration or any approach to do this?
Cassandra has a TTL feature that allows you to specify how long each CQL cell is valid. Details are available in the INSERT docs, but it also applies to UPDATE.
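A minimal CQL sketch, assuming a hypothetical crawler.pages table shaped like the question's example (x = 3 days = 259200 seconds):

```sql
-- Expire this row 3 days after it is written:
INSERT INTO crawler.pages (id, name, created_date)
VALUES (1, 'Dan', '2017-08-01')
USING TTL 259200;

-- Or set a table-wide default so every insert expires automatically:
ALTER TABLE crawler.pages WITH default_time_to_live = 259200;
```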
You can use TTL, but be careful with tombstones and the compaction procedure.
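To illustrate that caveat: expired TTL cells become tombstones, which compaction only purges after gc_grace_seconds. A hedged sketch of table options sometimes used for pure time-series TTL data (the table name is the hypothetical one from above; TimeWindowCompactionStrategy requires Cassandra 3.0.8+/3.8+, and the values are illustrative):

```sql
ALTER TABLE crawler.pages
WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                   'compaction_window_unit': 'DAYS',
                   'compaction_window_size': 1}
AND gc_grace_seconds = 10800;
```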
I am trying to evaluate the number of tombstones getting created in one of the tables in our application. For that I am trying to use nodetool cfstats. Here is how I am doing it:
create table demo.test(a int, b int, c int, primary key (a));
insert into demo.test(a, b, c) values(1,2,3);
Now I am making the same insert as above, so I expect 3 tombstones to be created. But on running cfstats for this column family, I still see that there are no tombstones created.
nodetool cfstats demo.test
Average live cells per slice (last five minutes): 0.0
Average tombstones per slice (last five minutes): 0.0
Now I tried deleting the record, but I still don't see any tombstones getting created. Is there anything that I am missing here? Please suggest.
BTW, a few other details:
* We are using version 2.1.1 of the Java driver
* We are running against Cassandra 2.1.0
For tombstone counts on a query, your best bet is to enable tracing. This will give you the in-depth history of a query, including how many tombstones had to be read to complete it. This won't give you the total tombstone count, but it is most likely more relevant for performance tuning.
In cqlsh you can enable this with:
cqlsh> tracing on;
Now tracing requests.
cqlsh> SELECT * FROM ascii_ks.ascii_cs where pkey = 'One';
pkey | ckey1 | data1
------+-------+-------
One | One | One
(1 rows)
Tracing session: 2569d580-719b-11e4-9dd6-557d7f833b69
activity | timestamp | source | source_elapsed
--------------------------------------------------------------------------+--------------+-----------+----------------
execute_cql3_query | 08:26:28,953 | 127.0.0.1 | 0
Parsing SELECT * FROM ascii_ks.ascii_cs where pkey = 'One' LIMIT 10000; | 08:26:28,956 | 127.0.0.1 | 2635
Preparing statement | 08:26:28,960 | 127.0.0.1 | 6951
Executing single-partition query on ascii_cs | 08:26:28,962 | 127.0.0.1 | 9097
Acquiring sstable references | 08:26:28,963 | 127.0.0.1 | 10576
Merging memtable contents | 08:26:28,963 | 127.0.0.1 | 10618
Merging data from sstable 1 | 08:26:28,965 | 127.0.0.1 | 12146
Key cache hit for sstable 1 | 08:26:28,965 | 127.0.0.1 | 12257
Collating all results | 08:26:28,965 | 127.0.0.1 | 12402
Request complete | 08:26:28,965 | 127.0.0.1 | 12638
http://www.datastax.com/dev/blog/tracing-in-cassandra-1-2
I've got two nodes that are fully replicated. When I run a query on a table that contains 30 rows, the cqlsh trace seems to indicate it is fetching some rows from one server and some rows from the other.
So even though all the rows are available on both nodes, the query takes 250 ms+ rather than the ~1 ms of other queries.
I've already got the consistency level set to ONE at the protocol level; what else do you have to do to make it use only one node for the query?
select * from organisation:
activity | timestamp | source | source_elapsed
-------------------------------------------------------------------------------------------------+--------------+--------------+----------------
execute_cql3_query | 04:21:03,641 | 10.1.0.84 | 0
Parsing select * from organisation LIMIT 10000; | 04:21:03,641 | 10.1.0.84 | 68
Preparing statement | 04:21:03,641 | 10.1.0.84 | 174
Determining replicas to query | 04:21:03,642 | 10.1.0.84 | 307
Enqueuing request to /10.1.0.85 | 04:21:03,642 | 10.1.0.84 | 1034
Sending message to /10.1.0.85 | 04:21:03,643 | 10.1.0.84 | 1402
Message received from /10.1.0.84 | 04:21:03,644 | 10.1.0.85 | 47
Executing seq scan across 0 sstables for [min(-9223372036854775808), min(-9223372036854775808)] | 04:21:03,644 | 10.1.0.85 | 461
Read 1 live and 0 tombstoned cells | 04:21:03,644 | 10.1.0.85 | 560
Read 1 live and 0 tombstoned cells | 04:21:03,644 | 10.1.0.85 | 611
...etc...
It turns out that there was a bug in Cassandra versions 2.0.5-2.0.9 that would make Cassandra more likely to request data on two nodes when it only needed to talk to one.
Upgrading to 2.0.10 or greater resolves this problem.
Refer: CASSANDRA-7535
My question is related to the logs of queries run in Cassandra.
I have a Cassandra cluster. Now, if I run a query on it that takes a good amount of time (say 1 hour) to execute completely, is there any way to trace the status of that query without using any Cassandra API?
What I found regarding this is that we can turn tracing on ('tracing on;' in cqlsh), and then if I run any query, I'll get the proper step-by-step status of that query.
For example:
cqlsh> use demo;
cqlsh:demo> CREATE TABLE test ( a int PRIMARY KEY, b text );
cqlsh:demo> tracing on;
Now tracing requests.
cqlsh:demo> INSERT INTO test (a, b) VALUES (1, 'example');
Unable to complete request: one or more nodes were unavailable.
Tracing session: 4dc5f950-6625-11e3-841a-b7e2b08eed3e
activity | timestamp | source | source_elapsed
--------------------------------------------------------+--------------+----------------+----------------
execute_cql3_query | 13:10:15,627 | 192.168.171.87 | 0
Parsing INSERT INTO test (a, b) VALUES (1, 'example'); | 13:10:15,640 | 192.168.171.87 | 13770
Preparing statement | 13:10:15,657 | 192.168.171.87 | 30090
Determining replicas for mutation | 13:10:15,669 | 192.168.171.87 | 42689
Unavailable | 13:10:15,682 | 192.168.171.87 | 55131
Request complete | 13:10:15,682 | 192.168.171.87 | 55303
But it does not satisfy my requirement, as I need to see the status of a previously run query.
Please suggest a solution.
Thanks,
Saurabh
Take a look at the events and sessions tables in the system_traces keyspace.
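For instance, a small CQL sketch (the session_id shown is the illustrative one from the earlier trace output):

```sql
-- Traced queries are persisted here, so past runs can be inspected later:
SELECT session_id, started_at, duration, request
FROM   system_traces.sessions;

-- Step-by-step events for one session:
SELECT activity, source, source_elapsed
FROM   system_traces.events
WHERE  session_id = 2569d580-719b-11e4-9dd6-557d7f833b69;
```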