So I have a large collection storing messages, and I would like to produce time series data from this collection.
I have had issues with time series data before, when I had 10 million records to group by time interval and count or average the values.
Timestamp => values
I sort of fixed it by putting all my data into one collection by day, so now I have fewer but larger documents. This helped reduce the seek and search time the DB needs to find the relevant document. However, I am not sure how I could speed up my queries on documents that are not time series. Also, I want to search text in this large document, so I have to scan all documents, no exception.
As I said, I am storing messages in a single document, and the schema looks something like this:
Id: string
Author: string
MessageType: string
Group: string
Message: string
Votes: number
Date: date
I would like to count all the records that contain a word in the message, or all the records that have the author Joe, or sum the votes, and so on.
So I would end up with time series data that I can put on a chart.
Now if I have to go through one year of data, that is about 50 million records. The query is going to take forever, since it has to fetch so many records and filter for the ones I am interested in.
How could i achieve better performance?
I have indexing set up on the date and author fields only. Yet my queries are slow and the database is super busy processing one query.
Should I pre-aggregate my data somehow? What would be a good way?
Or generate the time series data in a background worker?
Can someone direct me to the right way so I can implement a proper solution that can either reduce the load on the database or increase query performance?
What are the best practices for handling such a large collection that contains messages?
How could i segment this kind of data?
Would it be a good idea to set up a replica set and shard the database between multiple machines already?
Any help and input would be appreciated.
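A minimal sketch of the pre-aggregation idea raised above, in plain Python rather than against a real database. It assumes the schema fields Author, Votes and Date from the question; the function names `preaggregate` and `series_for_author` and the (day, author) bucket key are hypothetical choices for illustration. A background worker could run something like this over new messages and store the small summary rows, so that chart queries read the summaries instead of the 50 million raw records.

```python
from collections import defaultdict
from datetime import datetime, date

def preaggregate(messages):
    """Roll raw messages up into one summary row per (day, author)."""
    buckets = defaultdict(lambda: {"count": 0, "votes": 0})
    for msg in messages:
        key = (msg["Date"].date(), msg["Author"])
        buckets[key]["count"] += 1
        buckets[key]["votes"] += msg["Votes"]
    return dict(buckets)

def series_for_author(buckets, author):
    """Chart-ready (day, count, votes) points for one author, sorted by day."""
    return sorted(
        (day, b["count"], b["votes"])
        for (day, a), b in buckets.items()
        if a == author
    )

# Hypothetical sample data using the field names from the schema above.
messages = [
    {"Author": "Joe", "Votes": 3, "Date": datetime(2020, 1, 1, 9, 0)},
    {"Author": "Joe", "Votes": 1, "Date": datetime(2020, 1, 1, 17, 30)},
    {"Author": "Ann", "Votes": 5, "Date": datetime(2020, 1, 1, 12, 0)},
    {"Author": "Joe", "Votes": 2, "Date": datetime(2020, 1, 2, 8, 15)},
]
buckets = preaggregate(messages)
joe_series = series_for_author(buckets, "Joe")
```

This only covers counts and sums keyed by fields known in advance. Free-text search over Message does not reduce to a fixed set of buckets and would still need a text index or a separate search system.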
I have time series data, for example myData, and display it in my UI; it could be by Day, Week, Month, or Year.
What is the better way to store it in MongoDB? Should I create separate collections for this, like:
myDataDay
myDataWeek
...
Or is it better to store it in one collection with Day, Week, Month, Year keys?
How could it impact the performance?
You will need to answer the following questions:
What is the number and type of parallel queries you send to the database?
Are there other fields that the data will be searched on?
Are 90% of all queries in the range of the last year/month/day/hour, or something else?
If you split the data between many collections, the logic on the app side will become more complex. On the other hand, if you keep everything in the same collection, at some point your database will become bigger and more difficult to maintain.
You may take a look at the special collection types dedicated to time series data, but in general it very much depends on the amount of data and the distribution you expect.
The title pretty much sums up my question, but for more detail, say I have a collection of houses, and each house can have multiple owners/tenants. These are just an ID reference to my owners collection.
So I want to query ALL my houses, but I want to populate all of their owners/tenants. If my query, something like housesModel.find({}), returns N results, but I ask it to .populate('owner'), will it perform an additional query to fetch those owners? Keep in mind that a house can have many owners (for the sake of this question).
So if I query all my houses and get back 300 results, but I ask it to populate each doc's owners field, and each house has on average 1.5 owners, is that 450 additional individual queries to the database? It feels like this is not very optimal, but that's what I'm trying to understand.
If it is actually doing N number of db queries for a parent query of N results, how does that affect the performance? Even more so when we get to the 1000s of results?
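As far as I know, Mongoose's populate does not send one query per parent document: it collects the referenced ids from all results and fetches them with a single `$in` query per populated path. The sketch below illustrates that batching pattern in plain Python. `FakeCollection`, `find_by_ids` and `populate_owners` are hypothetical names standing in for the database and for the populate step; this is not Mongoose's code.

```python
class FakeCollection:
    """Stand-in for a database collection that counts the queries it receives."""

    def __init__(self, docs):
        self._docs = {d["_id"]: d for d in docs}
        self.query_count = 0

    def find_by_ids(self, ids):
        # One round trip, similar in spirit to a {_id: {$in: ids}} filter.
        self.query_count += 1
        return [self._docs[i] for i in ids if i in self._docs]

def populate_owners(houses, owners_collection):
    # Collect every distinct owner id across all houses first.
    wanted = {oid for h in houses for oid in h["owners"]}
    fetched = owners_collection.find_by_ids(sorted(wanted))
    by_id = {o["_id"]: o for o in fetched}
    # Then stitch the fetched owners back onto each house in memory.
    return [
        {**h, "owners": [by_id[oid] for oid in h["owners"] if oid in by_id]}
        for h in houses
    ]

owners = FakeCollection([{"_id": 1, "name": "Ann"}, {"_id": 2, "name": "Bob"}])
houses = [
    {"_id": "h1", "owners": [1]},
    {"_id": "h2", "owners": [1, 2]},
    {"_id": "h3", "owners": []},
]
populated = populate_owners(houses, owners)
```

With this pattern the number of extra queries depends on the number of populated paths, not on the number of houses. The cost that grows with the result size is the length of the id list and the in-memory stitching.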
I'm processing a Twitter feed by storing the tweets in a table in MemSQL. The table has fields like tweet_id, posted_time, body, etc.
The table receives around 5 million tweets per day, for a total of about a billion tweets over the whole period stored so far.
The table is stored as a columnstore, with the tweet_id as a sharding key, and the posted_time as the columnstore clustering column.
It is working fine for all real-time analytics so far, and returns answers in under a second if you query one day. The wider your date filters, the slower the query.
The requirement is to generate a word cloud from the body field of the tweet. My question is: what is the best way to do it? I need the query to be efficient (taking seconds, not minutes).
Keep in mind the following:
Joins are not efficient for this big table.
Taking the body field for a few million tweets, breaking it down into words, and then aggregating the words to come up with the top ones is not efficient.
I believe a separate table will be needed. What could be the design for this table? Suggestions, please.
Finally, my MemSQL cluster has 5 nodes, total of 1 TB of RAM, and 192 cores
I don't think MemSQL is the best way to do this. Your best bet is to index it with a search server/library like Apache Solr, or just use Apache Lucene as your backend. That way, the queries needed for a word cloud, like "Give me all the counts of the top ranked n-words sorted by count" would return in seconds.
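Whichever backend is used, the underlying idea is to count words once at ingest time instead of at query time. A minimal sketch of that in plain Python, assuming a per-day rollup keyed by (day, word); the names `posted_date`, `build_daily_counts` and `top_words`, and the simple regex tokenizer, are hypothetical and not part of MemSQL, Solr or Lucene.

```python
import re
from collections import Counter, defaultdict

# Very simple tokenizer: lowercase runs of letters, digits and apostrophes.
WORD_RE = re.compile(r"[a-z0-9']+")

def tokenize(body):
    return WORD_RE.findall(body.lower())

def build_daily_counts(tweets):
    """Roll tweets up into one word counter per day (done once, at ingest)."""
    daily = defaultdict(Counter)
    for t in tweets:
        daily[t["posted_date"]].update(tokenize(t["body"]))
    return daily

def top_words(daily, start, end, n):
    """Merge the per-day counters in [start, end] and return the top n words."""
    total = Counter()
    for day, counts in daily.items():
        # ISO date strings (YYYY-MM-DD) compare correctly as plain strings.
        if start <= day <= end:
            total.update(counts)
    return total.most_common(n)

tweets = [
    {"posted_date": "2015-07-01", "body": "MemSQL is fast"},
    {"posted_date": "2015-07-01", "body": "fast queries, fast answers"},
    {"posted_date": "2015-07-02", "body": "Word clouds need fast counts"},
]
daily = build_daily_counts(tweets)
top = top_words(daily, "2015-07-01", "2015-07-02", 1)
```

A word cloud query over a date range then only touches the rollup rows for those days rather than the tweet bodies. Stop-word removal and stemming are left out of this sketch.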
I am trying to model many-to-many relationships in Cassandra, something like an Item-User relationship. A user can like many items and an item can be bought by many users. Let us also assume that the order in which the "like" events occur is not a concern, and that the most used query is simply returning the "likes" based on the item as well as on the user.
There are a couple of posts discussing data modeling:
http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
An alternative would be to store a collection of ItemID in the User table to denote the items liked by that user and do something similar in the Items table in CQL3.
Questions
Are there any hits in performance using the collection? I think they translate to composite columns? So the read pattern, caching and other factors should be similar?
Are collections less performant for write heavy applications? Is updating the collection frequently less performant?
There are a couple of advantages of using wide rows over collections that I can think of:
The number of elements allowed in a collection is 65535 (an unsigned short). If it's possible to have more than that many records in your collection, using wide rows is probably better as that limitation is much higher (2 billion cells (rows * columns) per partition).
When reading a collection column, the entire collection is read every time. Compare this to a wide row, where you can limit the number of rows being read in your query, or restrict your query based on the clustering key (e.g. date > 2015-07-01).
For your particular use case I think modeling an 'items_by_user' table would be more ideal than a list<item> column on a 'users' table.
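A rough sketch of what that table-per-query modeling means, using plain Python dictionaries as stand-ins for Cassandra partitions. `items_by_user` is the table named above; `users_by_item`, the `LikeStore` class and its method names are hypothetical additions to cover the lookup by item that the question also asks for.

```python
from collections import defaultdict

class LikeStore:
    """In-memory stand-in for two denormalized tables, one per query direction."""

    def __init__(self):
        self.items_by_user = defaultdict(set)  # partition key: user_id
        self.users_by_item = defaultdict(set)  # partition key: item_id

    def like(self, user_id, item_id):
        # Every write goes to both tables so each read hits a single partition.
        self.items_by_user[user_id].add(item_id)
        self.users_by_item[item_id].add(user_id)

    def unlike(self, user_id, item_id):
        self.items_by_user[user_id].discard(item_id)
        self.users_by_item[item_id].discard(user_id)

    def items_for(self, user_id):
        return sorted(self.items_by_user.get(user_id, ()))

    def users_for(self, item_id):
        return sorted(self.users_by_item.get(item_id, ()))

store = LikeStore()
store.like("alice", "book-1")
store.like("alice", "book-2")
store.like("bob", "book-1")
store.unlike("alice", "book-2")
```

The trade-off is two writes per "like" in exchange for each of the two read patterns being served from one partition.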
I am storing account information in Cassandra. Each account has lists of data associated with it. For example, an account may have a list of friends and a list of liked books. Queries on accounts will always want all friends or all liked books or all of both. No filtering or searching is needed on either. The list of friends and books can grow and shrink.
Is it better to use a set column type or composite columns for this scenario?
I would suggest that you not use sets if:
You are concerned about disk space (each value is allocated a cell on disk, plus space for the metadata of each cell, which is 15 bytes if I am not wrong. That consumes a lot if your data keeps growing).
Not going to grow a lot of data in that particular row, as each time the cells have to be fetched from different SSTables.
In these kinds of cases, the preferred option would be a JSON array. You can store it as text and read the data back from that.
Sets (and the other collection types) were introduced for a completely different use case. If you need a particular value inside the list, or a value has to be updated frequently inside the same collection, you should make use of collections.
My take on your query would be this:
Store all account-specific info in a JSON object of friends that has a list of books as its value.
Sets are good for smaller collections of data. If you expect your friends / liked books lists to grow constantly and get large (there isn't a golden number here), it would be better to go with composite columns, as that model scales out better than collections and allows for straightforward querying, compared to requiring secondary indexes on collections.