Sourcing data from DocumentDB in Hadoop - azure-hdinsight

I have a Hadoop application that sources data from two different DocumentDB collections. However, the JSON schemas of the documents in these two collections are different. Both have a field holding a time, but one is called TimeStamp and the other is called UpdatedOn. I'd like to know how I can specify a query that is based on this time field and retrieves only those JSON documents satisfying the condition in my query. I specify my query like this:
String query = "SELECT * FROM c WHERE c.Timestamp > " + timestamp;
conf.set(ConfigurationUtil.QUERY, query);
This query applies to only one of the collections. I need a query like the one below:
"SELECT * FROM collection1 as c1, collection2 as c2 WHERE c1.Timestamp > x1 OR c2.UpdatedOn > x1"
Is this supported in DocumentDB?

This is not supported, and nothing in the documentation describes it: a DocumentDB query runs against a single collection. Your best bet is to execute the two queries separately and then merge the results using LINQ or any other technique to get one result set.
Hope this helps.
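The merge step suggested above can be sketched as follows. This is a minimal illustration in Python rather than the asker's Java/Hadoop setup; the helper name merge_by_time, the added "time" key, and the sample documents are assumptions made for the example, not part of any DocumentDB API. It assumes each collection has already been queried separately with its own time filter (c.Timestamp > x for one, c.UpdatedOn > x for the other).

```python
from typing import Any, Dict, List

Document = Dict[str, Any]


def merge_by_time(
    docs_a: List[Document],
    docs_b: List[Document],
    time_field_a: str = "Timestamp",
    time_field_b: str = "UpdatedOn",
) -> List[Document]:
    """Merge two already-filtered result sets into one list.

    Each document is copied and given a common "time" key (a hypothetical
    name chosen for this sketch) so the combined list can be sorted.
    """
    merged: List[Document] = []
    for doc in docs_a:
        merged.append({**doc, "time": doc[time_field_a]})
    for doc in docs_b:
        merged.append({**doc, "time": doc[time_field_b]})
    merged.sort(key=lambda d: d["time"])
    return merged


# Hypothetical results of the two separate queries.
from_collection1 = [{"id": "a", "Timestamp": 1500}, {"id": "b", "Timestamp": 1200}]
from_collection2 = [{"id": "x", "UpdatedOn": 1300}]

combined = merge_by_time(from_collection1, from_collection2)
```

Normalizing the two differently named fields into one key keeps the rest of the job independent of which collection a document came from.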

Related

[Shopware6]: How can I add SQL Filter to Criteria?

So, the criteria are already quite powerful. Yet I came across a case that I can't seem to replicate on the criteria object.
I needed to filter out all entries that were no longer relevant by date.
In a world where you could mix SQL with the field definition, it would look like this:
...->addFilter(
new RangeFilter('DATEDIFF(NOW(), INTERVAL createdAt DAY)', [RangeFilter::LTE => 1])
)
Unfortunately that doesn't work in our world.
When I pass the criteria to a search function, I only get:
"DATEDIFF(NOW(), INTERVAL createdAt DAY)" is not a field on xyz
I tried to do it with ->addExtensions and several other experiments, but I couldn't get it to work. I resorted to using the queryBuilder from Doctrine with queryParts, but the data I'm getting is not very clean and is not assigned to an ORMEntity as it should be.
Is it possible to write a criteria that incorporates native SQL filtering?
The DAL is designed so that it explicitly does not accept SQL statements, since that is a core concept of the abstraction. Because the DAL offers extensibility for third-party extensions, it should be preferred to raw SQL in most cases. I would suggest writing a lightweight query that only fetches the IDs using your SQL query, and then using these pre-filtered IDs to fetch complete data sets through the DAL.
$ids = (new QueryBuilder($connection))
->select(['LOWER(HEX(id))'])
->from('product')
->where('...')
->execute()
->fetchFirstColumn();
$criteria = new Criteria($ids);
This should offer the best of both worlds: the freedom of raw SQL and the extensibility features of the DAL.
In your specific case you could also just take the current date, subtract the number of days that should have passed, and compare the resulting threshold date to the creation date:
$now = new \DateTimeImmutable();
$dateInterval = new \DateInterval('P1D');
$thresholdDate = $now->sub($dateInterval);
// filter to get all entries with a creation date on or after now minus 1 day
$filter = new RangeFilter(
'createdAt',
[RangeFilter::GTE => $thresholdDate->format(Defaults::STORAGE_DATE_TIME_FORMAT)]
);

How to set QueryExecutionContext in boto3 when the query contains joining of tables from multiple databases?

I am using the Boto3 package in Python 3 to execute an Athena query. From the Boto3 documentation, I understand that I can specify a query execution context, i.e. a database name under which the query has to be executed. With a properly specified query execution context, we can omit the fully qualified table name (db_name.table_name) from the query and use just the table name instead.
So the query SELECT * FROM db1.tab1 can be converted to SELECT * FROM tab1 with QueryExecutionContext={'Database': 'db1'}
The problem: I need to run a query on Athena from Python that looks something like this:
SELECT *
FROM (SELECT *
FROM db1.tab1) AS temp1
INNER JOIN (SELECT *
FROM db2.tab2) AS temp2
ON temp1.id = temp2.id
As we can see, the query joins tables from two different databases. If I want to omit the database names from this query, how do I specify the QueryExecutionContext ?
QueryExecutionContext accepts only one database as an argument. So if you want to run a query across multiple databases, you have to use fully qualified table names (db_name.table_name) in the query.
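A minimal sketch of the fully qualified approach with Boto3. The function name, the injected athena_client parameter, and the output location are assumptions for illustration; the query reuses the db1.tab1 and db2.tab2 names from the question. QueryExecutionContext is simply left out here, since every table already carries its database name.

```python
def run_cross_database_query(athena_client, output_location: str) -> str:
    """Start an Athena query that joins tables from two databases.

    athena_client is expected to be a Boto3 Athena client; it is passed in
    so this sketch does not create AWS connections on its own.
    """
    query = (
        "SELECT * "
        "FROM db1.tab1 AS temp1 "
        "INNER JOIN db2.tab2 AS temp2 "
        "ON temp1.id = temp2.id"
    )
    response = athena_client.start_query_execution(
        QueryString=query,
        # No QueryExecutionContext: both tables are fully qualified.
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


# Usage (assumes AWS credentials and a hypothetical results bucket):
#   import boto3
#   client = boto3.client("athena")
#   query_id = run_cross_database_query(client, "s3://your-bucket/athena-results/")
```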

Can't get identifier and max value in CosmosDB

I would like to do some reporting on my CosmosDB.
My query is:
Select Max(c.results.score) from c
That works, but I also want the id of the document with the highest score, and then I get an exception:
Select c.id, Max(c.results.score) from c
'c.id' is invalid in the select list because it is not contained in an aggregate function
You can execute the following query to achieve what you're asking (though it may not be very efficient in terms of RUs or execution time):
Select TOP 1 c.id, c.results.score from c ORDER BY c.results.score DESC
GROUP BY was not supported natively in Cosmos DB when this was written, so there was no out-of-the-box way to execute this query.
To implement this using the out-of-the-box functionality, you would need to create a new document type that contains the output of your aggregation, e.g.
{
"id" : 1,
"highestScore" : 1000
}
You'd then need a process within your application to keep this up-to-date.
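One way such a keep-up-to-date process might look, sketched in Python on plain dictionaries. The function name and the highestScoreDocId field are hypothetical additions for this example (the aggregate document above only has id and highestScore); reading and writing the documents in Cosmos DB is left out.

```python
from typing import Any, Dict


def update_highest(summary: Dict[str, Any], doc: Dict[str, Any]) -> Dict[str, Any]:
    """Return the aggregate document, updated if doc beats the stored score."""
    score = doc["results"]["score"]
    if score > summary["highestScore"]:
        # "highestScoreDocId" is a hypothetical field that remembers which
        # document currently holds the maximum.
        return {**summary, "highestScore": score, "highestScoreDocId": doc["id"]}
    return summary


summary = {"id": 1, "highestScore": 1000}
# Apply the check each time a scored document is written.
summary = update_highest(summary, {"id": "doc-7", "results": {"score": 1200}})
summary = update_highest(summary, {"id": "doc-8", "results": {"score": 900}})
```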
There is also documentdb-lumenize, which would allow you to do this using stored procedures. I haven't used it myself, but it may be worth looking into as an alternative to the above solution.
Link:
https://github.com/lmaccherone/documentdb-lumenize

NodeJS - azure-storage-node- , how to retrieve addition of two columns, and apply filtering condition

Sorry, I am new to Node.js and table queries. My question is:
How can I create a query using the Node.js package "azure-storage-node" that selects the sum of two columns, 'start' and 'period', and returns the whole row if the sum is greater than a threshold? My attempts, which didn't work, look something like this:
var query = new azure.TableQuery();
total = query.select(['start']) + query.select(['period']);
query.where('total > ?' , 50000);
or maybe something like this:
var query = new azure.TableQuery()
.where('start + period gt 50000');
but it throws an error on the '+'.
Thanks
What you're trying to accomplish is not possible with Azure Tables, at least as of today: Azure Tables has limited querying support, and computed columns (if I may call them that) are not supported.
There are two possible solutions:
1. Have an attribute called total in your entities that contains the value start + period. You calculate this value when you insert or update the entity and store it at that time.
2. Do the filtering on the client side. For this you will need to download all related entities and then apply the filter to the data you fetched.
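Both solutions can be sketched on plain dictionaries. This is an illustration in Python, not azure-storage-node code; the function names and sample entities are assumptions, and fetching or storing the entities is left out.

```python
from typing import Any, Dict, List

Entity = Dict[str, Any]


def with_total(entity: Entity) -> Entity:
    """First solution: compute start + period before inserting or updating."""
    return {**entity, "total": entity["start"] + entity["period"]}


def filter_by_sum(entities: List[Entity], threshold: int) -> List[Entity]:
    """Second solution: keep downloaded entities whose sum exceeds threshold."""
    return [e for e in entities if e["start"] + e["period"] > threshold]


# Hypothetical entities, as they might look after being downloaded.
entities = [
    {"RowKey": "1", "start": 40000, "period": 20000},
    {"RowKey": "2", "start": 10000, "period": 5000},
]

stored = with_total(entities[0])           # entity now carries a "total" attribute
matching = filter_by_sum(entities, 50000)  # client-side filter on the sum
```

With the first solution, the stored total attribute can then be used in an ordinary server-side filter, which avoids downloading every entity.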

Cassandra Searching for a RowKey

I am very new to Cassandra and have not yet read much about its architecture. I have a simple question that I can't find an answer to.
This is sample data from running list abcColumnFamily:
RowKey:Message_1
=> (column=word, value=Message_1, timestamp=1373976339934001)
RowKey:Message_2
=> (column=word, value=Message_2, timestamp=1373976339934001)
How can I search for the row key with the value, say, Message_1?
In the SQL world: Select * from Table where Rowkey = 'Message_1' (= or LIKE). I simply want to search on the full string.
My intention is just to check whether a particular piece of data I'm interested in exists as a row key.
For CQL try:
select * from abcColumnFamily where KEY = 'Message_1'
If you want to query that data using the CLI, try the following:
assume abcColumnFamily keys as utf8;
get abcColumnFamily['Message_1'];
