Search a large amount of data in DynamoDB

I have a use case in which my data is stored in DynamoDB with UniqueID as the hash key and Date as the range key. The same data is also present in Amazon's Simple Storage Service (S3). I want to search all the data based on a time range, and I want this to be fast. I can think of the following possible approaches:
- Scan the full S3 data and sort it by time (this does not satisfy my latency requirements).
- Using DynamoDB scan filters will not help, as they still scan the whole table. Consider the data to be large.
Requirements:
- fast (can get the result in less than 1 minute),
- do not access a large amount of data,
- can't use any other DB source.

I think AWS Elasticsearch might be the answer to your problem. DynamoDB is now integrated with Elasticsearch, enabling you to perform full-text queries on your data.
Elasticsearch is a popular open source search and analytics engine designed to simplify real-time search and big data analytics.
Elasticsearch integration is easy with the new Amazon DynamoDB Logstash Plugin.
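For instance, once the items are streamed into Elasticsearch (e.g. via that plugin), the time-range search becomes a single query. A minimal sketch with the Python Elasticsearch client, assuming a hypothetical index named mytable whose date field mirrors the DynamoDB range key:

# Sketch: range query against an Elasticsearch index that mirrors the DynamoDB table.
# The endpoint, index name "mytable" and the "date" field are assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://my-es-domain.example.com:443")

resp = es.search(
    index="mytable",
    body={
        "query": {
            "range": {
                "date": {
                    "gte": "2015-01-01T00:00:00",
                    "lte": "2015-01-31T23:59:59",
                }
            }
        },
        "size": 1000,
    },
)

for hit in resp["hits"]["hits"]:
    print(hit["_source"])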

You should use Query instead of Scan, see http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html.
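Note that with UniqueID as the hash key, a Query on the base table can only fetch one item's history at a time; to search purely by date range you would typically add a Global Secondary Index with a coarse partition key (for example a date bucket) and Date as the sort key. A rough boto3 sketch, assuming a hypothetical GSI named date-index with partition key bucket:

# Sketch: query a date range via a hypothetical GSI ("date-index") whose
# partition key "bucket" groups items (e.g. by day) and whose sort key is "Date".
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("MyTable")

resp = table.query(
    IndexName="date-index",
    KeyConditionExpression=(
        Key("bucket").eq("2015-01-15")
        & Key("Date").between("2015-01-15T00:00:00", "2015-01-15T23:59:59")
    ),
)
items = resp["Items"]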

Related

Columnar storage: Cassandra vs Redshift

How is columnar storage in the context of a NoSQL database like Cassandra different from that in Redshift? If Cassandra is also a columnar store, why isn't it used for OLAP applications the way Redshift is?
The storage engines of Cassandra and Redshift are very different and were created for different use cases.
Cassandra's storage is not really "columnar" in the widely known sense of that type of database (Redshift, Vertica, etc.); it is much closer to the key-value family in the NoSQL world. The SQL dialect used in Cassandra is not ANSI SQL and supports only a very limited set of queries. Cassandra's engine is built for fast writing and reading of records by key, while Redshift's engine is built for fast aggregations (MPP), has wide support for analytical queries, and stores, encodes and compresses data at the column level.
It can be easily understood with the following example:
Suppose we have a table with a user id and many metrics (for example weight, height, blood pressure, etc.).
If we run an aggregate query in Redshift, like the average weight, it will do the following (in the best scenario):
1. The master sends the query to the nodes.
2. Only the data for this specific column is fetched from storage.
3. The query is executed in parallel on all nodes.
4. The final result is fetched back to the master.
Running the same query in Cassandra will result in a scan of all "rows", and each "row" can have several versions, of which only the latest should be used in the aggregation. If you are familiar with any key-value store (Redis, Riak, DynamoDB, etc.), it is even less efficient than scanning all keys there.
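To make the difference concrete, here is a rough Python sketch of both sides; the cluster endpoints, keyspace, and the metrics table with its weight column are made up for illustration:

# Sketch only; hosts, credentials, "metrics" table and "weight" column are hypothetical.

# Redshift: the aggregation runs inside the cluster, only the result comes back.
import psycopg2

rs = psycopg2.connect(host="my-cluster.example.com", port=5439,
                      dbname="analytics", user="admin", password="...")
with rs.cursor() as cur:
    cur.execute("SELECT AVG(weight) FROM metrics;")
    print(cur.fetchone()[0])

# Cassandra: effectively a full scan; every row's weight travels to the client
# (newer Cassandra versions do support AVG(), but it still touches all partitions).
from cassandra.cluster import Cluster

session = Cluster(["cassandra-host"]).connect("health")
rows = session.execute("SELECT weight FROM metrics")
weights = [r.weight for r in rows]
print(sum(weights) / len(weights))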
Cassandra is often used for analytical workloads together with Spark, acting as the storage layer while Spark acts as the actual query engine; it basically shouldn't be used for analytical queries on its own. With each release more and more aggregation capabilities are added, but it is still very far from being a real analytical database.
I encountered the same question today and found this resource on AWS: https://aws.amazon.com/nosql/columnar/

Synchronize a data lake with deleted records

I am building a data lake to integrate multiple data sources for advanced analytics.
In the beginning, I selected HDFS as the data lake storage. But I have a requirement for updates and deletes in the data sources, which I have to synchronize with the data lake.
To respect the immutable nature of the data lake, I will use the LastModifiedDate from the data source to detect that a record has been updated, and insert that record into the data lake with the current date. The idea is then to select the record with max(date).
However, I am not able to understand how I will detect deleted records in the sources, and what I should do about them in the data lake.
Should I use another data store like Cassandra and execute a delete command? I am afraid that would lose the immutability property.
Can you please suggest a good practice for this situation?
1. Question - Detecting deleted records from data sources
Detecting deleted records from data sources requires that your data sources support this. It is best if deletion is only done logically, e.g. with a change flag. Some databases can also track deleted rows (see, for example, Change Data Capture in SQL Server). Some ETL solutions like Informatica also offer CDC (Change Data Capture) capabilities.
2. Question - Changed data handling in a big data solution
There are different approaches. Of course, you can use a key-value store, which adds some complexity to the overall solution. First you have to clarify whether it is also of interest to track changes and deletes. You could consider loading all data (new/changed/deleted) into daily partitions and finally building a current image (the data as it is in your data source). Also consider solutions like Databricks Delta, which address these topics without the need for an additional store. For example, with Delta you can do an upsert on Parquet files as follows:
MERGE INTO events
USING updates
ON events.eventId = updates.eventId
WHEN MATCHED THEN
  UPDATE SET events.data = updates.data
WHEN NOT MATCHED THEN
  INSERT (date, eventId, data) VALUES (updates.date, updates.eventId, updates.data)
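The same upsert can also be expressed through the Delta Lake Python API; a rough equivalent sketch, assuming an active SparkSession named spark, an updates DataFrame, and a hypothetical table path /data/events:

# Sketch: the MERGE above via the Delta Lake Python API.
# "spark", "updates" and the path "/data/events" are assumptions.
from delta.tables import DeltaTable

events = DeltaTable.forPath(spark, "/data/events")

(events.alias("events")
    .merge(updates.alias("updates"), "events.eventId = updates.eventId")
    .whenMatchedUpdate(set={"data": "updates.data"})
    .whenNotMatchedInsert(values={
        "date": "updates.date",
        "eventId": "updates.eventId",
        "data": "updates.data",
    })
    .execute())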
If your solution also requires low-latency access via a key (e.g. to support an API), then a key-value store like HBase, Cassandra, etc. would be helpful.
This is a common constraint when creating a data lake in Hadoop: one can't just update or delete records in it. One approach that you can try is this:
When you are adding lastModifiedDate, you can also add one more column named status. If a record is deleted, mark its status as Deleted. The next time you want to query the latest active records, you will be able to filter the deleted ones out.
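A small PySpark sketch of that read path ("latest version wins, skip soft-deleted rows"); the path and the id/lastModifiedDate/status column names are assumptions for illustration:

# Sketch: pick the latest version of each record and drop soft-deleted ones.
# The path and column names (id, lastModifiedDate, status) are hypothetical.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
raw = spark.read.parquet("/datalake/customers")

w = Window.partitionBy("id").orderBy(F.col("lastModifiedDate").desc())

current = (raw
    .withColumn("rn", F.row_number().over(w))
    .filter("rn = 1")
    .filter(F.col("status") != "Deleted")
    .drop("rn"))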
You can also use Cassandra or HBase (any NoSQL database) if you are performing ACID operations on a daily basis. If not, the first approach would be the ideal choice for creating a data lake in Hadoop.

HBase schema design for storing query logs

Recently, I have been working on a solution for storing users' search/query logs in an HBase table.
Let's simplify the raw query log:
query timestamp req_cookie req_ip ...
Data access patterns:
- scan through all queries within a time range,
- scan through all search history for a specified query.
I came up with the following row-key design:
<query>_<timestamp>
But the query may be very long or in a different encoding, so putting the query directly into the row key seems unwise.
I'm looking for help in optimizing this schema. Has anybody handled this scenario before?
1- You can do a full table scan with a time range. If you need real-time responses, you have to maintain a reverse row-key table <timestamp>_<query> (plan your region splitting policy carefully first).
Be warned that sequential row-key prefixes will make some of your regions very hot if you have a lot of concurrency, so it would be wise to buffer writes to that table. Additionally, if you get more writes than a single region can handle, you're going to have to implement some sort of sharding prefix (e.g. a modulo of the timestamp), although this will make your retrievals a lot more complex (you'll have to merge the results of multiple scans).
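A rough Python sketch of that salted, buffered reverse-table write with the happybase client (the table name, column family and 16-bucket salt are assumptions for illustration):

# Sketch: write query-log rows into a reverse <salt>_<timestamp>_<hash> table so
# sequential timestamps spread across regions. Names and salt size are hypothetical.
import hashlib
import happybase

NUM_BUCKETS = 16  # pre-split the table into this many regions

conn = happybase.Connection("hbase-host")
table = conn.table("querylog_by_time")

def reverse_key(query, timestamp_ms):
    salt = timestamp_ms % NUM_BUCKETS                       # sharding prefix
    qhash = hashlib.md5(query.encode("utf-8")).hexdigest()  # fixed-length query part
    return ("%02d_%013d_%s" % (salt, timestamp_ms, qhash)).encode("ascii")

# Buffer writes with a batch instead of one RPC per put.
with table.batch(batch_size=1000) as batch:
    batch.put(reverse_key("hbase schema design", 1404000000000), {
        b"d:query": b"hbase schema design",
        b"d:ip": b"10.0.0.1",
    })

Reading a time range then means issuing one scan per salt prefix and merging the results, as noted above.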
2- Hash the query string so that you always have a fixed-length row key without having to care about encoding (MD5, maybe?).
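For point 2, a tiny sketch: hashing gives a fixed-length, encoding-agnostic prefix for the <query>_<timestamp> key, so all history for one query can be read with a single prefix scan over md5(query):

# Sketch: md5(query) yields a constant 32-character prefix regardless of the
# query's length or encoding, so <md5(query)>_<timestamp> keys stay uniform.
import hashlib

def query_row_key(query, timestamp_ms):
    prefix = hashlib.md5(query.encode("utf-8")).hexdigest()  # always 32 hex chars
    return "%s_%013d" % (prefix, timestamp_ms)

# All history for one query shares the md5 prefix, so it can be fetched with a
# prefix scan (e.g. happybase: table.scan(row_prefix=prefix.encode())).
print(query_row_key("how to design hbase row keys", 1404000000000))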

The fastest way to save Lucene-Solr search results?

Currently a SQL '%like%' search is used to get all the rows that contain certain keywords. We're trying to replace the MySQL LIKE search with Lucene-Solr.
We:
1. constructed indexes,
2. queried Solr with a keyword,
3. retrieved the primary keys of all corresponding records,
4. queried MySQL with the PKs and fetched the result.
And it got slower. Damn!
I suppose that the bandwidth used in steps 1, 2 and 3 is the cause (since the result is really huge, 1 million+ rows), but I cannot figure out any better way.
Is there any other way to get Solr search results except CSV over HTTP? (like a file dump in MySQL)
We did the same procedure to combine Solr and MySQL, and it was 100-1000x faster than a single MySQL full-text search.
So your workflow/procedure is not a problem in general.
The question is: where is your bottleneck?
To investigate that, you should take a look at catalina.out to see the query time of each Solr request. Do the same on MySQL: look at query times and long-running queries.
We had a performance problem because the number of returned PKs was very large, so the MySQL query became huge because of a very long WHERE ... IN (...) clause, and that statement in turn returned lots of rows (200-1,000,000+).
But the point is that the application/user does not need such a big data set at once.
So we decided to work with pagination and offset (on the Solr side). Solr now returns only 30-50 results (depending on the pagination setting of the user's application environment).
This works very fast.
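A minimal sketch of such a paginated Solr request from Python (rows/start/fl are standard Solr parameters; the core name "products" and the field names are assumptions):

# Sketch: page through Solr results 30 at a time instead of pulling 1M+ IDs at once.
# The core name "products" and the "text"/"id" fields are assumptions.
import requests

def search_page(keyword, page, page_size=30):
    params = {
        "q": "text:%s" % keyword,
        "fl": "id",                  # only return the primary keys
        "start": page * page_size,   # offset
        "rows": page_size,           # page size
        "wt": "json",
    }
    resp = requests.get("http://localhost:8983/solr/products/select", params=params)
    return [doc["id"] for doc in resp.json()["response"]["docs"]]

ids = search_page("laptop", page=0)
# ...then: SELECT ... FROM products WHERE id IN (<30 ids>) on the MySQL side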
//Edit: Is there any other way to get Solr search results except CSV over HTTP?
There are different formats, like XML, PHP, CSV, Python, Ruby and JSON. To change this, you can use the wt parameter, like ...&wt=json
http://wiki.apache.org/solr/CoreQueryParameters#wt
http://wiki.apache.org/solr/QueryResponseWriter
//Edit #2
An additional option is to not only index the data in Solr but also store it there, in order to fetch the data from Solr directly and live without the MySQL data.
Whether that is an option for you depends on your data...
Solr provides a way to export the results as CSV and JSON
1 million+ is still a very large set. You can always do it in batches.
Can't you load your whole MySQL database into Solr?
You can use the DIH (Data Import Handler) to pull all the data from MySQL and add it to Solr quite easily.
Then you will have all the information you need in just one place, and I think you will get better performance.

Limited results when using Excel BigQuery connector

I've pulled a data extract from BigQuery using the Excel connector, but my results have been limited to 230,000 records.
Is this a limitation of the connector, or something I have not done properly?
BigQuery does have a maximum response size of 64MB (compressed). So, depending on the size of your rows, it's quite possible that 230,000 is the maximum size response BigQuery can return.
See more info on quotas here:
https://developers.google.com/bigquery/docs/quota-policy
What's the use case, and how many rows are you expecting to be returned? Generally BigQuery is used for large aggregate analysis rather than for queries that return tons of unaggregated results. If you're looking for your raw dataset, you can also dump the entire table as CSV into Google Cloud Storage.
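If you really do need the full raw table rather than an aggregate, here is a rough sketch of such a table export with the Python BigQuery client (project, dataset, table and bucket names are placeholders):

# Sketch: export a whole BigQuery table to Google Cloud Storage as CSV,
# bypassing the query-response size limit. All names below are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

extract_job = client.extract_table(
    "my-project.my_dataset.my_table",
    "gs://my-bucket/export/my_table-*.csv",  # wildcard lets BigQuery shard the output
)
extract_job.result()  # wait for the export to complete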
Also, you may want to try running the query in the UI at:
https://bigquery.cloud.google.com/
