Fetch record counts from a list of tables in Netezza using Python - python-3.x

I have to fetch record counts from tables starting with WC_* in database "TEST_DB".
Currently I'm using the code below to do that, but it's taking too long, as there are billions of records in many of the tables. Is there any way to improve the performance?
for item in list_tables:
    total_count_query = "select count(*) from TEST_DB.." + item[0]
    cur.execute(total_count_query)
    total_record_count = cur.fetchone()[0]
    print(item[0], " : ", total_record_count)

There's a column called reltuples in the system table "_V_TABLE" which gives the row count of every table, provided that statistics are regularly collected. You don't have to calculate the count each time:
total_count_query = "select reltuples from " + database + ".._v_table"
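For context, a fuller end-to-end sketch might look like the following (the pyodbc driver, the DSN name NZSQL, and the WC_ filter are assumptions; adjust for your connection setup):

import pyodbc

# Assumed ODBC DSN pointing at the Netezza server; adjust to your environment
conn = pyodbc.connect("DSN=NZSQL")
cur = conn.cursor()

# reltuples holds the statistics-based row estimate, so no full table scan runs.
# '_' is a single-character wildcard in LIKE, hence the escape for a literal underscore.
cur.execute(
    "select tablename, reltuples from TEST_DB.._v_table "
    "where tablename like 'WC!_%' escape '!'"
)
for tablename, reltuples in cur.fetchall():
    print(tablename, " : ", int(reltuples))

cur.close()
conn.close()

Note that the figure is an estimate: it is only as fresh as the last GENERATE STATISTICS run on each table.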

Related

Power Query - Alternative for Join to filter Records

I have two tables:
Table 1: one row per order, with the status (Online / Offline)
Table 2: multiple rows per order
Now I would like to reduce the number of records/rows in the second table based on the status (Offline) from Table 1.
Is there any alternative to a right join? The first table is filtered on status 'Offline'.
We are talking about several million rows, which takes some time to join.
Any thoughts on this from your side?
Some thoughts:
1. Create a relationship between the two tables and filter to "Offline".
2. Create a join (Merge Queries) in Power Query and only select the On/Off status column to append. The import then needs more time, but you get a flat dataset in Power BI; see the sketch below.
3. Create a new column in Power BI with DAX and use LOOKUPVALUE.
Without seeing the data, I think I would try the first one. If it's too slow, then I think the only way is the second one, even if it takes some more time to import.
The third one might be the slowest.
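To make the join semantics of option 2 concrete, here is an equivalent sketch in Python with pandas (the table and column names are assumptions; in Power Query itself you would use Merge Queries):

import pandas as pd

# Hypothetical stand-ins for the two tables
orders = pd.DataFrame({"Order": [1, 2, 3], "Status": ["Offline", "Online", "Offline"]})
details = pd.DataFrame({"Order": [1, 1, 2, 3, 3, 3], "Item": list("abcdef")})

# Option 2: join only the Status column onto the detail table, then filter on it
merged = details.merge(orders[["Order", "Status"]], on="Order", how="left")
offline_details = merged[merged["Status"] == "Offline"].drop(columns="Status")
print(offline_details)

Joining only the single status column keeps the merge as cheap as possible before the filter step.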

How to get all the operations done on an Oracle table imported into Hive for processing? (Not just the actual data in the table, but the operations also)

I have a table in an Oracle DB which gets multiple transactions done against it (let's say around 100 million inserts, updates, or deletes in a day). I want to get all the transactions happening in that table brought into Hive for processing through Spark or Hive.
For example:
let's say a record in that Oracle table goes through an initial insert operation, followed by 5 updates to the same/different columns, and finally gets deleted. I want to capture all such operations for all the records in that table and import them into Hive.
We want to find records whose number of operations exceeds a threshold for specific columns and pull a report on them.
Has anyone come across such a use case? I'd appreciate any help in achieving this.

How to invert a merge query in Power Query

I have a single-column table of customer account numbers and a main table containing 400,000 records pulled from an Access database. I want to remove all records from the main table where the customer account number can be found in the single-column table.
The merge query capability in Power Query allows me to return only the records where there is a match on the customer list (in addition to a variety of other variations on this theme), but I would like to know whether there is a way to invert this, so that I return all records where the customer number does not appear in that list.
I have achieved this already by using the List.Contains function, adding a custom column to identify the rows to exclude, and then filtering them out, but I think this is severely impacting the performance of my workbook. Refreshing the table that initially has 400,000 rows prior to this series of transformations takes a very long time, and all queries that depend on this table then also take a long time to refresh.
Thank you
If you do a Left Anti Join of your main table with the single-column table, this will give you your table filtered to only the rows which do not match the single-column table.
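For illustration, here are the same anti-join semantics in Python with pandas (the AccountNumber column name is an assumption):

import pandas as pd

# Hypothetical stand-ins: the main table and the exclusion list
main = pd.DataFrame({"AccountNumber": [101, 102, 103, 104], "Amount": [10, 20, 30, 40]})
exclude = pd.DataFrame({"AccountNumber": [102, 104]})

# Left anti join: keep only the main-table rows whose account number
# does NOT appear in the exclusion list
anti = main[~main["AccountNumber"].isin(exclude["AccountNumber"])]
print(anti)

In Power Query this is the "Left Anti (rows only in first)" join kind in the Merge Queries dialog, which avoids the per-row List.Contains custom column entirely.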

Wide rows vs Collections in Cassandra

I am trying to model many-to-many relationships in Cassandra, something like an Item-User relationship: a user can like many items, and an item can be bought by many users. Let us also assume that the order in which the "like" events occur is not a concern, and that the most used query is simply returning the "likes" by item as well as by user.
There are a couple of posts discussing data modeling:
http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
An alternative would be to store a collection of ItemIDs in the User table to denote the items liked by that user, and do something similar in the Items table, in CQL3.
Questions
Are there any performance hits in using collections? I think they translate to composite columns, so the read pattern, caching, and other factors should be similar?
Are collections less performant for write-heavy applications? Is updating a collection frequently less performant?
There are a couple of advantages to using wide rows over collections that I can think of:
The number of elements allowed in a collection is 65535 (an unsigned short). If it's possible to have more records than that in your collection, using wide rows is probably better, as that limit is much higher (2 billion cells (rows * columns) per partition).
When reading a collection column, the entire collection is read every time. Compare this to a wide row, where you can limit the number of rows read by your query, or constrain the query's criteria on a clustering key (e.g. date > 2015-07-01).
For your particular use case, I think modeling an 'items_by_user' table would be better than a list<item> column on a 'users' table.
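As a minimal sketch of that suggestion, using the DataStax Python driver (the keyspace name and uuid key types are assumptions; a mirrored 'users_by_item' table would serve lookups by item):

import uuid
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])       # assumed local node
session = cluster.connect("store")     # hypothetical keyspace

# One partition per user; each liked item becomes a clustering row (a "wide row"),
# so a user's likes can be paged or LIMITed instead of read whole like a collection.
session.execute("""
    CREATE TABLE IF NOT EXISTS items_by_user (
        user_id uuid,
        item_id uuid,
        PRIMARY KEY (user_id, item_id)
    )
""")

user_id, item_id = uuid.uuid4(), uuid.uuid4()
session.execute(
    "INSERT INTO items_by_user (user_id, item_id) VALUES (%s, %s)",
    (user_id, item_id),
)
rows = session.execute(
    "SELECT item_id FROM items_by_user WHERE user_id = %s LIMIT 100",
    (user_id,),
)
for row in rows:
    print(row.item_id)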

How to get the last n results by updated time in Cassandra?

I want to fetch the last n, say the last 5, updated rows, i.e. ORDER BY updated_time DESC, in Cassandra. Is there any good way of doing this?
The exact use case: I want to update the count of an event in the event table whenever it occurs, and fetch the last five events by updated time along with their counts.
Table structure:
event_name text, updated_time timestamp, count counter
In Cassandra you can retrieve a cell's write time with writetime(cell_name). But as you have multiple columns, and Cassandra is optimized for fast reads, you may consider maintaining another table providing exactly the data needed, in the required order. On that new table you would limit read results and periodically trim it down.
It may be possible to do it with writetime(), but that is not the Cassandra way, as it is too slow in production. Another table with just your data is the denormalized, Cassandra way of solving it.
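A minimal sketch of such a denormalized table, using the DataStax Python driver (the keyspace, the single 'all' bucket, and the event_count column, renamed from count to keep clear of the COUNT function, are assumptions):

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])           # assumed local node
session = cluster.connect("events_ks")     # hypothetical keyspace

# Newest updates first within the partition; counters cannot live in this table,
# so the current count is copied in as a plain bigint on every event update.
# Updating an event means deleting its old clustering row and inserting a new one.
session.execute("""
    CREATE TABLE IF NOT EXISTS latest_events (
        bucket text,
        updated_time timestamp,
        event_name text,
        event_count bigint,
        PRIMARY KEY (bucket, updated_time, event_name)
    ) WITH CLUSTERING ORDER BY (updated_time DESC, event_name ASC)
""")

# "Last five events by updated time" is now a plain LIMIT query
rows = session.execute(
    "SELECT event_name, updated_time, event_count FROM latest_events "
    "WHERE bucket = 'all' LIMIT 5"
)
for row in rows:
    print(row.event_name, row.updated_time, row.event_count)

A single static bucket keeps everything in one partition so the LIMIT query works; for a high-volume event stream you would bucket by day or hour and trim old rows periodically, as suggested above.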
