Why can't subqueries be used in the SQL override for uncached Lookup transformations in PowerCenter?

As the title asks: why can't subqueries be used in the SQL override for uncached Lookup transformations in PowerCenter?
Is it because subqueries are not allowed in the Lookup transformation at all, or because the lookup is set to "uncached"? And why?

I guess they can, unless there's something I'm missing. Here's a snapshot of a successfully completed session that uses a non-cached lookup with a subquery, as you can see in the log. I also made sure that some data actually went through, as you can see in the Performance section.

Related

Best way of querying table without providing the primary key

I am designing the data model of our Scylla database. For example, I created a table, intraday_history, with fields:
CREATE TABLE intraday_history (
    id bigint,
    timestamp_seconds bigint,
    timestamp timestamp,
    sec_code text,
    open float,
    high float,
    low float,
    close float,
    volume float,
    trade int,
    PRIMARY KEY ((id, sec_code), timestamp_seconds, timestamp)
);
My id is a 64-bit integer generated in the Twitter Snowflake style. My problem is how to use WHERE without always providing the id (most of the time I will filter on the bigint timestamp). I run into the same problem in other tables. Because the id is unique, I cannot query a batch of timestamps.
Is it okay if, say, for a group of tables on my one node, I use an id like cluster1, so that when I query I can just use id=cluster1? That loses the uniqueness, though.
ALLOW FILTERING is an option here, but I keep reading that it is bad practice, especially when dealing with millions of queries.
I'm using ScyllaDB, a C++ database compatible with Apache Cassandra.
In Cassandra, as you've probably already read, the queries drive the table design, not the other way around. So if you want to query by a different filter, the ideal solution is to create another Cassandra table for that query. Partition keys are required in filters unless you add ALLOW FILTERING, but that isn't recommended: it performs a DC-wide (possibly cluster-wide) search, and you are still subject to timeouts.
You could consider indexes or materialized views, which are basically Cassandra-maintained tables populated by changes to the base table. That would save you the trouble of having the application populate multiple tables (Cassandra would do it for you). We've had some luck with materialized views, but either of these can have side effects, like any other Cassandra table (inconsistencies, latencies, additional rules, etc.).
I would do a bit of research to determine the best approach, but ALLOW FILTERING most likely isn't the best choice, especially for high-volume, frequent queries or for tables containing large volumes of data. You could also investigate Solr if that's an option, depending on what you're filtering.
Hope that helps.
-Jim
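As a rough illustration of the "another table for that query" idea, one common pattern is to partition the second table by a coarse time bucket, so a time-range query only has to name the buckets it touches. The sketch below is only an assumption about how that could look: it supposes a hypothetical table partitioned by (sec_code, day), where day is the UTC calendar day derived from timestamp_seconds. Neither that table nor the helper names come from the question.

```python
from datetime import datetime, timedelta, timezone


def day_bucket(timestamp_seconds):
    """Return the UTC calendar day ('YYYY-MM-DD') that a Unix timestamp falls in."""
    moment = datetime.fromtimestamp(timestamp_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d")


def buckets_for_range(start_seconds, end_seconds):
    """List every day bucket touched by the inclusive range [start, end]."""
    day = datetime.fromtimestamp(start_seconds, tz=timezone.utc).date()
    last = datetime.fromtimestamp(end_seconds, tz=timezone.utc).date()
    buckets = []
    while day <= last:
        buckets.append(day.isoformat())
        day += timedelta(days=1)
    return buckets
```

The application would write each row with day_bucket(timestamp_seconds) as part of the partition key, and a range query would issue one partition-restricted query per entry returned by buckets_for_range. Whether a day is the right bucket size depends on how much data lands in each partition.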

Dynamic Cassandra queries

I have a messenger application with a history page, on which you can see your sent and received messages.
Since the growing number of messages has hurt my performance, I have been thinking about using Cassandra.
After researching Cassandra, I found out that you have to build tables to satisfy your queries.
Now the problem: on the history page you can use any number of different filters at the same time, e.g. filter by date, receiver and sender.
If I were to use Cassandra, would I need to create a table for every combination of these filters?
Or is this a bad use case for Cassandra in general?
If so, are there any alternatives?
Why don't you just use a SELECT statement?
You should definitely have a look at CQL (Cassandra Query Language).
While CQL and SQL share a similar syntax, the queries are quite different.
The reason for these differences is that Cassandra deals with distributed data and aims to prevent inefficient queries.
See this link for reference. It shows queries you can or cannot do.
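To get a feel for the size of the problem raised in the question, the sketch below simply enumerates the non-empty combinations of the three filters mentioned (date, receiver, sender). It is only an upper bound on the number of distinct query shapes, not a statement of how many tables Cassandra would actually need, since one table can sometimes serve several shapes. The function name is made up for illustration.

```python
from itertools import combinations

FILTERS = ["date", "receiver", "sender"]


def filter_combinations(filters):
    """Return every non-empty combination of the given filter names."""
    combos = []
    for size in range(1, len(filters) + 1):
        combos.extend(combinations(filters, size))
    return combos


if __name__ == "__main__":
    for combo in filter_combinations(FILTERS):
        print(" + ".join(combo))
```

With n independent filters there are 2^n - 1 such combinations, which is why the number of query shapes grows quickly as filters are added.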

How can I explain() an upsert in MongoDB to see if indexes are used?

When code changes, a quick way that tells me if indexes are still proper for find() statements is (nodejs)
collection.find(query).explain(function(err, explanation) {
  console.log('MongoDebug: ' + explanation.cursor);
});
If the cursor is of type BtreeCursor, indexes are used.
How do I check this when using insert() with upsert: true?
explain() is a function on the cursor and is not available on inserts. There's also a $explain query modifier, but it's still a query modifier.
However, there's a large body of work filed as explain 2.0, and one of its subtasks is to provide explain() for updates: SERVER-14101. That's listed as fixed in version 2.7.7.
As a note, performing explain for every operation might be a bad idea, because it forces MongoDB to reevaluate query plans all the time, thereby increasing the server load on the database.
You can use the integrated profiler and db.currentOp() to analyze performance of non-query operations for now, but the insights are limited. Try a simple find().explain() for manual optimization, the indexes used should be the same.
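The check described in the question (is the cursor a BtreeCursor?) can be sketched as a small helper that inspects an explain document that has already been fetched. The sketch below handles the legacy top-level cursor field mentioned above and, as an assumption about newer servers, also walks a queryPlanner.winningPlan tree looking for an IXSCAN stage. The function name is made up, and the sample documents in the usage note are hand-written illustrations rather than real server output.

```python
def uses_index(explain_doc):
    """Return True if an explain document indicates that an index scan was used."""
    # Legacy format: a top-level "cursor" string such as "BtreeCursor a_1".
    cursor = explain_doc.get("cursor")
    if isinstance(cursor, str):
        return cursor.startswith("BtreeCursor")

    # Newer format (assumed): a tree of stages under queryPlanner.winningPlan.
    plan = explain_doc.get("queryPlanner", {}).get("winningPlan")
    stack = [plan] if plan else []
    while stack:
        stage = stack.pop()
        if stage.get("stage") == "IXSCAN":
            return True
        if "inputStage" in stage:
            stack.append(stage["inputStage"])
        stack.extend(stage.get("inputStages", []))
    return False
```

For example, uses_index({"cursor": "BasicCursor"}) is False, while a plan whose UPDATE stage has an IXSCAN input stage gives True. This only inspects the document it is given; it does not run explain itself.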

Spark Cassandra connector - Range query on partition key

I'm evaluating spark-cassandra-connector and I'm struggling to get a range query on the partition key to work.
According to the connector's documentation it seems that's possible to make server-side filtering on partition key using equality or IN operator, but unfortunately, my partition key is a timestamp, so I can not use it.
So I tried using Spark SQL with the following query ('timestamp' is the partition key):
select * from datastore.data where timestamp >= '2013-01-01T00:00:00.000Z' and timestamp < '2013-12-31T00:00:00.000Z'
Although the job spawns 200 tasks, the query is not returning any data.
Also I can assure that there is data to be returned since running the query on cqlsh (doing the appropriate conversion using 'token' function) DOES return data.
I'm using spark 1.1.0 with standalone mode. Cassandra is 2.1.2 and connector version is 'b1.1' branch. Cassandra driver is DataStax 'master' branch.
Cassandra cluster is overlaid on spark cluster with 3 servers with replication factor of 1.
Here is the job's full log
Any clue anyone?
Update: When trying to do server-side filtering based on the partition key (using CassandraRDD.where method) I get the following exception:
Exception in thread "main" java.lang.UnsupportedOperationException: Range predicates on partition key columns (here: timestamp) are not supported in where. Use filter instead.
But unfortunately I don't know what "filter" is...
I think the CassandraRDD error is telling you that the query you are trying to run is not allowed in Cassandra, so you have to load the whole table into a CassandraRDD and then apply a Spark filter operation to it.
So your code (in Scala) should look something like this, assuming getDate returns a java.util.Date:
import com.datastax.spark.connector._
// Parse the range bounds once on the driver; only the resulting dates are used in the filter.
val fmt = new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSX")
val (from, to) = (fmt.parse("2013-01-01T00:00:00.000Z"), fmt.parse("2013-12-31T00:00:00.000Z"))
val cassRDD = sc.cassandraTable("keyspace name", "table name").filter { row =>
  val ts = row.getDate("timestamp")
  !ts.before(from) && ts.before(to) // from <= ts < to
}
If you are interested in running this type of query, you might want to take a look at other Cassandra connectors, like the one developed by Stratio.
You have several options to get the solution you are looking for.
The most powerful one would be to use the Lucene indexes integrated with Cassandra by Stratio, which allow you to search by any indexed field on the server side. Your write time will increase but, on the other hand, you will be able to query any time range. You can find further information about Lucene indexes in Cassandra here. This extended version of Cassandra is fully integrated into the deep-spark project, so you can take full advantage of the Lucene indexes in Cassandra through it. I would recommend Lucene indexes when you are executing a restricted query that retrieves a small to medium result set. If you are going to retrieve a large part of your data set, you should use the third option below.
Another approach, depending on how your application works, might be to truncate your timestamp field so you can look it up using an IN operator. The problem is that, as far as I know, you can't use the spark-cassandra-connector for that. You would have to use the direct Cassandra driver, which is not integrated with Spark, or have a look at the deep-spark project, where a new feature allowing this is about to be released very soon. Your query would look something like this:
select * from datastore.data where timestamp IN ('2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', ... , '2013-12-31')
However, as I said before, I don't know if this fits your needs, since you might not be able to truncate your data and group it by date/time.
The last option, and the least efficient, is to bring the full data set to your Spark cluster and apply a filter on the RDD.
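For the truncated-timestamp option, the list of values for the IN clause can be generated rather than typed out. The sketch below is a minimal illustration that assumes the partition key has been truncated to a 'YYYY-MM-DD' day string; the helper names are made up, and it only builds the query text, it does not talk to Cassandra.

```python
from datetime import date, timedelta


def days_between(first, last):
    """Return every day from first to last (inclusive) as 'YYYY-MM-DD' strings."""
    span = (last - first).days
    return [(first + timedelta(days=i)).isoformat() for i in range(span + 1)]


def build_in_query(table, column, days):
    """Build the text of a SELECT ... IN (...) query over the given day values."""
    values = ", ".join("'%s'" % day for day in days)
    return "select * from %s where %s IN (%s)" % (table, column, values)


if __name__ == "__main__":
    days = days_between(date(2013, 1, 1), date(2013, 1, 3))
    print(build_in_query("datastore.data", "timestamp", days))
```

In real code the values would normally be passed as bound parameters instead of being interpolated into the string, and a full year of days makes for a very large IN list, so splitting the range into smaller batches may be more practical.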
Disclaimer: I work for Stratio :-) Don't hesitate to contact us if you need any help.
I hope it helps!

Not able to run MKS integrity query

I'm getting the error below while executing an MKS Integrity query.
Cannot show view information: Your query was stopped because it was using too many system resources.
Your query is likely taking longer than the time allotted to queries by the Integrity server. By default this value is 15 seconds. This usually indicates that your query is very broad or that an index needs to be created in the database to improve the performance of the query. The latter requires the assistance of your database administrator.
DISCLAIMER: I am employed by the PTC Integrity Business Unit (formerly MKS).
One thing you can check is whether your query could return a very large list of items. Try adding more restrictive filters first and then relax them step by step. At least, that was my use case :)
Try to use filters as much as you can. Filters limit unnecessary results.
