How to extract basic descriptive stats on tables using schemacrawler - schemacrawler

Is there a way to extract stat information such as
Below based on type of data type like string int etc.
Count
Mean
Min
distinct values
Max
Media
STD
null values
Avg
Top 10 values
Without using query on each table or something more efficient way without overloading the DB.
Also can we fetch few records for sample for each table when crawling.

SchemaCrawler can show you the output of arbitrary database queries. Please download the latest release, and work through the example that shows you how to get output from a database query.
Sualeh Fatehi, SchemaCrawler

Related

Can i do logical Query inside Blob column field in cassandra Query?

Can i do logical Query inside Blob column field in cassandra Query ?
like i have a file inside Blob field called purchase amount : 500$ i want to do search and fetch results purchase amount which is greater than 500$.
is there way i can do this logical search inside my blob.
No, it's not possible out of box. For Cassandra, blob type is just a set of bytes. You can potentially use user-defined functions to extract necessary data, but it could be tricky from performance standpoint.
P.S. I feel that Cassandra may not be correct product for you if you need to search by substring or something like this. In Cassandra you need to model your data based on queries, and then select column types, etc.

DSE (Cassandra) - Range search on int data type

I am a beginner using Cassandra. I created a table with below details and when I try to perform range search using token, I am not getting any results. Am I doing something wrong or is it my understanding of data model?
Query select * from test where token(header)>=2 and token(header)<=4;
the token function calculates the token from the value based on the configured partitioner. The calculated value is the hash that is used to identify the node where the data is located, this is not a data itself.
Cassandra can perform range search on values only on clustering columns (only for some designs) only inside the single partition. If you need to perform range on arbitrary column (also for partition keys), there is a DSE Search that allows you to index the table and perform different types of search, including range... But take into account that it will be much slower than traditional Cassandra queries.
In your situation, you can run 3 queries in parallel (to cover values 2,3,4), like this:
select * from test where header = value;
and then combine results in your code.
I recommend to take DS201 & DS220 courses on DataStax Academy to understand how Cassandra performs queries, and how to model data to make this possible.

Unable to coerce to a formatted date - Cassandra timestamp type

I have the values stored for timestamp type column in cassandra table in format of
2018-10-27 11:36:37.950000+0000 (GMT date).
I get Unable to coerce '2018-10-27 11:36:37.950000+0000' to a formatted date (long) when I run below query to get data.
select create_date from test_table where create_date='2018-10-27 11:36:37.950000+0000' allow filtering;
How to get the query working if the data is already stored in the table (of format, 2018-10-27 11:36:37.950000+0000) and also perform range (>= or <=) operations on create_date column?
I tried with create_date='2018-10-27 11:36:37.95Z',
create_date='2018-10-27 11:36:37.95' create_date='2018-10-27 11:36:37.95'too.
Is it possible to perform filtering on this kind of timestamp type data?
P.S. Using cqlsh to run query on cassandra table.
In first case, the problem is that you specify timestamp with microseconds, while Cassandra operates with milliseconds - try to remove the three last digits - .950 instead of .950000 (see this document for details). The timestamps are stored inside Cassandra as 64-bit number, and then formatted when printing results using the format specified by datetimeformat options of cqlshrc (see doc). Dates without explicit timezone will require that default timezone is specified in cqlshrc.
Regarding your question about filtering the data - this query will work only for small amounts of data, and on bigger data sizes will most probably timeout, as it will need to scan all data in the cluster. Also, the data won't be sorted correctly, because sorting happens only inside single partition.
If you want to perform such queries, then maybe the Spark Cassandra Connector will be the better choice, as it can effectively select required data, and then you can perform sorting, etc. Although this will require much more resources.
I recommend to take DS220 course from DataStax Academy to understand how to model data for Cassandra.
This is works for me
var datetime = DateTime.UtcNow.ToString("yyyy-MM-dd HH:MM:ss");
var query = $"SET updatedat = '{datetime}' WHERE ...

Does spark dataframe.filter(...).select(...) use sequential search or hash algorithms?

Scenario: I have a lookup table created (input is JSON file of around 50 Mb) and cached in memory so that it can be looked up while processing each row of the input file (around 10000 data points in each input file).
Problem: Does dataframe.filter(...).select(...) method in spark perform a sequential search or hash search? How can we retrieve data faster in this case? Also, I was wondering if i need to create a index on it or create a hash table of it (if i need to, i am not sure how its done for dataframes).
As far as I know - neither of them. Select in DataFrames only projects the selected columns, it is not choosing specific records so no searching algorithm is required.
To obtain specific records as you would do with the WHERE clause in standard SQL, you have to select() columns you are interested in and then filter them with filter() method.

Handling the following use case in Cassandra?

I've been given the task of modelling a simple in Cassandra. Coming from an almost solely SQL background, though, I'm having a bit of trouble figuring it out.
Basically, we have a list of feeds that we're listening to that update periodically. This can be in RSS, JSON, ATOM, XML, etc (depending on the feed).
What we want to do is periodically check for new items in each feed, convert the data into a few formats (i.e. JSON and RSS) and store that in a Cassandra store.
So, in an RBDMS, the structure would be something akin to:
Feed:
feedId
name
URL
FeedItem:
feedItemId
feedId
title
json
rss
created_time
I'm confused as to how to model that data in Cassandra to facilitate simple things such as getting x amount of items for a specific feed in descending created order (which is probably the most common query).
I've heard of one strategy that mentions having a composite key storing, in this example, the the created_time as a time-based UUID with the feed item ID but I'm still a little confused.
For example, lets say I have a series of rows whose key is basically the feedId. Inside each row, I store a range of columns as mentioned above. The question is, where does the actual data go (i.e. JSON, RSS, title)? Would I have to store all the data for that 'record' as the column value?
I think I'm confusing wide rows and narrow (short?) rows as I like the idea of the composite key but I also want to store other data with each record and I'm not sure how to meld the two together...
You can store everything in one column family. However If the data for each FeedItem is very large, you can split the data for each FeedItem into another column family.
For example, you can have 1 column familyfor Feed, and the columns of that key are FeedItem ids, something like,
Feeds # column family
FeedId1 #key
time-stamp-1-feed-item-id1 #columns have no value, or values are enough info
time-stamp-2-feed-item-id2 #to show summary info in a results list
The Feeds column allows you to quickly get the last N items from a feed, but querying for the last N items of a Feed doesn't require fetching all the data for each FeedItem, either nothing is fetched, or just a summary.
Then you can use another column family to store the actual FeedItem data,
FeedItems # column family
feed-item-id1 # key
rss # 1 column for each field of a FeedItem
title #
...
Using CQL should be easier to understand to you as per your SQL background.
Cassandra (and NoSQL in general) is very fast and you don't have real benefits from using a related table for feeds, and anyway you will not be capable of doing JOINs. Obviously you can still create two tables if that's comfortable for you, but you will have to manage linking data inside your application code.
You can use something like:
CREATE TABLE FeedItem (
feedItemId ascii PRIMARY KEY,
feedId ascii,
feedName ascii,
feedURL ascii,
title ascii,
json ascii,
rss ascii,
created_time ascii );
Here I used ascii fields for everything. You can choose to use different data types for feedItemId or created_time, and available data types can be found here, and depending on which languages and client you are using it can be transparent or require some more work to make them works.
You may want to add some secondary indexes. For example, if you want to search for feeds items from a specific feedId, something like:
SELECT * FROM FeedItem where feedId = '123';
To create the index:
CREATE INDEX FeedItem_feedId ON FeedItem (feedId);
Sorting / Ordering, alas, it's not something easy in Cassandra. Maybe reading here and here can give you some clues where to start looking for, and also that's really depending on the cassandra version you're going to use.

Resources