Cloud Spanner JSON function size limit - google-cloud-spanner

Is it possible to return query results from Spanner as a JSON document? For example, if the query is "read all transactions for this account", can those row/record results be returned as a single JSON string or document? If so, what is the size limitation, i.e. how many transaction rows can it hold? For example, if there were 400 transactions in a table, could those all go into one JSON document or string?
https://cloud.google.com/spanner/docs/json_functions

As Hailong comments, the link you shared is about retrieving data stored as JSON-formatted strings. Cloud Spanner does not provide a native way to perform this task; if you want to do it, you need to implement it in your code. For example, you can take a look at this other question as a reference, but it will depend on the language you want to use.
On the other hand, you can do it from the command line with gcloud spanner databases execute-sql, using the --format flag to select JSON output (--format=json). You can take a look here and here for more information about it.
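If you go the "implement it in your code" route, a minimal sketch with the Python client could look like this; the instance, database, table and column names are placeholders rather than anything from your schema:

```python
# Minimal sketch with the google-cloud-spanner Python client; instance,
# database, table and column names are made up for illustration.
import json
from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-database")

with database.snapshot() as snapshot:
    results = snapshot.execute_sql(
        "SELECT TransactionId, Amount, CreatedAt "
        "FROM Transactions WHERE AccountId = @account",
        params={"account": "account-123"},
        param_types={"account": spanner.param_types.STRING},
    )
    columns = ["TransactionId", "Amount", "CreatedAt"]
    rows = [dict(zip(columns, row)) for row in results]

# All returned transaction rows end up in one JSON string built by your code.
doc = json.dumps(rows, default=str)  # default=str copes with timestamps/numerics
print(doc)
```

Since the JSON string is assembled client-side, how many rows it holds is a question of your process's memory rather than a Spanner JSON function limit.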

Related

Azure CosmosDB SQL Record counts

I have a CosmosDB Collection which I'm querying using the REST API.
I'd like to access the total number of documents which match my query. I know I can do a count, but that means two calls, one for the count and a subsequent one to retrieve the actual records.
I would assume this is not possible in a single call, but the Data Explorer in the Azure Portal seems to manage it, so I'm just wondering if anyone has been able to figure out what calls it makes to get this:
Showing Results 1 - 10
Retrieved document count 342
Retrieved document size 2868425 bytes
Output document count 10
It's the Retrieved Document Count I need - if the portal can do it, there ought to be a way :)
I've tried the Java SDK as well as REST, but I can't see any useful options in there either.
As is so often the case in this game, asking the question triggered the answer, so apologies in advance.
The answer is to send the x-ms-documentdb-populatequerymetrics header in the request.
The response then gives a whole bunch of useful stuff in x-ms-documentdb-query-metrics.
What I would still like to understand is whether this has any performance impact.
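For reference, a rough sketch of what the request looks like over REST with Python's requests library; the account, database and collection names are placeholders, and make_auth_token() / rfc1123_now() stand in for the HMAC authorization and date helpers the Cosmos DB REST API requires:

```python
# Rough sketch only: make_auth_token() and rfc1123_now() are hypothetical helpers,
# and the account/db/collection names are placeholders.
import requests

url = "https://myaccount.documents.azure.com/dbs/mydb/colls/mycoll/docs"
headers = {
    "authorization": make_auth_token("post", "docs", "dbs/mydb/colls/mycoll"),
    "x-ms-date": rfc1123_now(),
    "x-ms-version": "2017-02-22",
    "x-ms-documentdb-isquery": "True",
    "Content-Type": "application/query+json",
    "x-ms-documentdb-populatequerymetrics": "true",  # the header in question
    "x-ms-max-item-count": "10",
}
body = {
    "query": "SELECT * FROM c WHERE c.type = @t",
    "parameters": [{"name": "@t", "value": "order"}],
}

resp = requests.post(url, json=body, headers=headers)
# The metrics come back as a semicolon-separated string that includes
# retrievedDocumentCount and retrievedDocumentSize.
print(resp.headers.get("x-ms-documentdb-query-metrics"))
```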

HBase schema design in storing query log

Recently, I've been working on a solution for storing users' search/query logs in an HBase table.
Here is a simplified view of the raw query log:
query timestamp req_cookie req_ip ...
Data access patterns:
scan through all queries within a time range.
scan through all search history with a specified query
I came up with the following row-key design:
<query>_<timestamp>
But the query may be very long or in a different encoding, so putting the query directly into the row key seems unwise.
I'm looking for help in optimizing this schema; has anybody handled this scenario before?
1- You can do a full table scan with a time range. In case you need real-time responses, you have to maintain a reverse row-key table <timestamp>_<query> (plan your region-splitting policy carefully first).
Be warned that sequential row-key prefixes will make some of your regions very hot if you have a lot of concurrency, so it would be wise to buffer writes to that table. Additionally, if you get more writes than a single region can handle, you're going to need to implement some sort of sharding prefix (e.g. a modulo of the timestamp), although this will make your retrievals a lot more complex (you'll have to merge the results of multiple scans).
2- Hash the query string so that you always have a fixed-length row key without having to care about encoding (MD5, maybe?). A sketch of both key layouts follows below.
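For illustration, a minimal sketch of both key layouts; the salt bucket count and exact byte layout are assumptions to adapt to your own load:

```python
# Illustrative only: the salt bucket count and byte layout are assumptions.
import hashlib
import struct

NUM_SALT_BUCKETS = 16  # pre-split the time-ordered table into this many ranges

def query_history_rowkey(query, timestamp_ms):
    """<md5(query)><timestamp>: fixed-length prefix, no encoding worries."""
    digest = hashlib.md5(query.encode("utf-8")).digest()  # always 16 bytes
    return digest + struct.pack(">q", timestamp_ms)       # big-endian sorts by time

def time_range_rowkey(query, timestamp_ms):
    """<salt><timestamp><md5(query)>: salted so sequential writes spread out."""
    salt = timestamp_ms % NUM_SALT_BUCKETS                 # simple modulo sharding prefix
    return (struct.pack(">B", salt)
            + struct.pack(">q", timestamp_ms)
            + hashlib.md5(query.encode("utf-8")).digest())

# A time-range read then becomes NUM_SALT_BUCKETS scans (one per salt) whose
# results you merge client-side.
```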

Search a large amount of data in DynamoDB

I have a use case in which my data is stored in DynamoDB with UniqueID as the hash key and Date as the range key. The same data is also present in Amazon Simple Storage Service (S3). I want to search all the data based on a time range, and I want this to be fast. I can think of the following possible approaches:
- Scan all of S3 and sort the results by time (this does not satisfy my latency requirements).
- Using DynamoDB scan filters will not help, as they scan the whole table; consider the data to be large.
Requirements:
- fast (can get the result in less than 1 minute)
- do not access a large amount of data
- can't use any other DB source
I think AWS Elasticsearch might be the answer to your problem. DynamoDB is now integrated with Elasticsearch, enabling you to perform full-text queries on your data.
Elasticsearch is a popular open source search and analytics engine designed to simplify real-time search and big data analytics.
Elasticsearch integration is easy with the new Amazon DynamoDB Logstash Plugin.
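Once the items are flowing into an Elasticsearch index (for example via that Logstash plugin), the time-range lookup becomes a single range query. A hedged sketch, where the endpoint, index name and date field are assumptions:

```python
# Sketch only: the endpoint, index name and date field are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://my-es-endpoint:9200")

resp = es.search(
    index="transactions",
    body={
        "size": 1000,
        "query": {"range": {"Date": {"gte": "2015-01-01", "lte": "2015-01-31"}}},
    },
)
hits = [hit["_source"] for hit in resp["hits"]["hits"]]
```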
You should use Query instead of Scan, see http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html.
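For completeness, a minimal boto3 sketch of such a Query; note that Query still needs the hash key (UniqueID), so it only helps if you can enumerate the IDs you care about. The table name and key values are placeholders:

```python
# Minimal sketch: Query needs the hash key, so this covers one UniqueID at a time.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("MyTable")  # placeholder table name

resp = table.query(
    KeyConditionExpression=(
        Key("UniqueID").eq("user-42")
        & Key("Date").between("2015-01-01", "2015-01-31")
    )
)
items = resp["Items"]
```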

The fastest way to save Lucene-Solr search results?

Currently, a SQL '%like%' search is used to get all the rows that contain certain keywords. We're trying to replace the MySQL LIKE search with Lucene-Solr.
1. We constructed indexes,
2. queried Solr with a keyword,
3. retrieved the primary keys of all corresponding records,
4. queried MySQL with the PKs,
5. and fetched the results.
And it got slower. Damn!
I suppose that the bandwidth used in steps 1, 2 and 3 is the cause (since the result is really huge, 1 million+ rows), but I cannot figure out any better way.
Is there any other way to get the Solr search result except CSV over HTTP (like a file dump in MySQL)?
We used the same procedure to combine Solr and MySQL, and it was 100-1000x faster than a plain MySQL fulltext search.
So your workflow/procedure is not a problem in general.
The question is: where is your bottleneck?
To investigate that, take a look at catalina.out to see the query time of each Solr request. Do the same on MySQL: look at the query times and long-running queries.
We had a performance problem because the number of returned PKs was very large, so the MySQL query became huge because of a very long WHERE IN () clause.
That very large MySQL statement then returned lots of rows, 200 to 1,000,000+.
But the point is that the application/user does not need such a big dataset at once.
So we decided to work with pagination and offsets (on the Solr side). Solr now returns only 30-50 results (depending on the pagination settings of the user's application environment).
This works very fast.
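A sketch of that paginated flow, assuming a plain HTTP call to Solr and pymysql on the MySQL side; the core name, field names and connection details are made up:

```python
# Sketch of the paginated flow: one page of PKs from Solr, then only those rows from MySQL.
import pymysql
import requests

def fetch_page(keyword, page, page_size=30):
    # Ask Solr for just the primary-key field, one page at a time (wt=json for JSON output).
    solr = requests.get(
        "http://localhost:8983/solr/mycore/select",
        params={"q": keyword, "fl": "id", "wt": "json",
                "start": page * page_size, "rows": page_size},
    ).json()
    ids = [doc["id"] for doc in solr["response"]["docs"]]
    if not ids:
        return []

    # The WHERE IN () clause now only ever holds one page worth of PKs.
    conn = pymysql.connect(host="localhost", user="app", password="secret", db="mydb")
    try:
        with conn.cursor() as cur:
            placeholders = ",".join(["%s"] * len(ids))
            cur.execute("SELECT * FROM documents WHERE id IN (%s)" % placeholders, ids)
            return cur.fetchall()
    finally:
        conn.close()
```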
//Edit: Is there any other way to get the Solr search result except CSV over HTTP?
There are different formats, like XML, PHP, CSV, Python, Ruby and JSON. To change this, you can use the wt parameter, like ...&wt=json
http://wiki.apache.org/solr/CoreQueryParameters#wt
http://wiki.apache.org/solr/QueryResponseWriter
//Edit #2
An additional option could be to not only index the data in Solr but also store it there, so you can fetch the data from Solr and live without the MySQL data.
It depends on your data whether that is an option for you...
Solr provides a way to export the results as CSV and JSON.
1 million+ is still a very large set. You can always do it in batches.
Can't you load your whole MySQL database into Solr?
You can use the DIH (Data Import Handler) to pull all the data from MySQL and add it to Solr pretty easily.
Then you will have all the information you need in just one place, and I think you will get better performance.

Limited results when using Excel BigQuery connector

I've pulled a data extract from BigQuery using the Excel connector but my results have been limited to 230,000 records.
Is this a limitation of the connector or something I have not done properly?
BigQuery does have a maximum response size of 64MB (compressed). So, depending on the size of your rows, it's quite possible that 230,000 rows is the largest response BigQuery can return for your query.
See more info on quotas here:
https://developers.google.com/bigquery/docs/quota-policy
What's the use case, and how many rows are you expecting to be returned? Generally, BigQuery is used for large aggregate analysis rather than for queries that return tons of unaggregated rows. You can also dump the entire table as CSV into Google Cloud Storage if you're looking for the raw dataset.
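If you do go the raw-dataset route, a small sketch of the CSV export with the google-cloud-bigquery Python client; the project, dataset, table and bucket names are placeholders:

```python
# Sketch only: table ID and destination bucket are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

# Big tables are exported as several sharded files, hence the wildcard in the URI.
job = client.extract_table(
    "my-project.my_dataset.my_table",
    "gs://my-bucket/export/my_table-*.csv",
)
job.result()  # block until the export job finishes
```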
Also, you may want to try running the query in the UI at:
https://bigquery.cloud.google.com/
