I have a table with about 3 million rows in one partition key. I need to load all of that data and save it as a file.
It is one day of recorded sensor input. What I want to build is a playback service that can replay the sensor events from at least the past 3 months, so I am expecting a really large volume of data.
I am new to NoSQL databases; is there any approach to achieve this goal?
EDIT1: download file with 2 days of real data
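For reference, a minimal sketch of one way to export a single partition to a file, assuming the table lives in Cassandra and using the DataStax Python driver with paging; the keyspace, table, and column names below are hypothetical:

```python
import csv

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Hypothetical keyspace/table/column names -- adjust to the real schema.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("sensors")

# fetch_size makes the driver page through the partition instead of
# pulling all ~3 million rows into memory at once.
query = SimpleStatement(
    "SELECT event_time, sensor_id, payload FROM sensor_events WHERE day = %s",
    fetch_size=5000,
)

with open("sensor_events_2020-02-19.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["event_time", "sensor_id", "payload"])
    # Iterating the result set transparently fetches the next page.
    for row in session.execute(query, ["2020-02-19"]):
        writer.writerow([row.event_time, row.sensor_id, row.payload])
```

For a 3-month playback window, a script like this could run once per day-partition and produce one file per day.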
My home automation controller collects data from several 4-in-1 motion sensors in different rooms of my house. The sensors prioritize motion, sending motion reports every few seconds, but they also independently report temperature, humidity, and illuminance. I am trying to determine whether the temperature and humidity reports are sent frequently enough to automate control of heaters and exhaust fans.
The sensors independently report each category to the controller, which sends the data to Excel. Sample data is below, but without the motion reports that clutter up the real data.
A pivot table generated from the raw data:
Answering the question of frequency takes me several manual steps. I sort/filter the dataset for temperature/humidity by room, then manually add a time-diff column,
where time diff = (<current Date-Time cell> - <prev Date-Time cell>) * 24 * 60, converting days to minutes. I then calculate the average and stdev of minutes between reports by manually selecting, in turn, each room/category subset in the time-diff column; once for the average and once for the stdev.
After a few more manual steps, I end up with this desired result:
BUT I have to redo all of this every time new data is added to the table. I'm certain Excel can do this automatically, but I didn't find a solution through pivots, Power Pivot, slicers, or queries. I'm hoping one of you Excel gurus can help. Thanks!
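For reference, the underlying calculation is just a per-group time diff. A minimal sketch of the same computation in pandas (outside Excel), assuming the raw log has Room, Category, and DateTime columns; the file and column names are hypothetical:

```python
import pandas as pd

# Assumed export of the raw sensor log with hypothetical column names.
df = pd.read_excel("sensor_log.xlsx", parse_dates=["DateTime"])

# Keep only the temperature/humidity reports, ordered in time.
reports = df[df["Category"].isin(["Temperature", "Humidity"])].sort_values("DateTime")

# Minutes between consecutive reports per room/category,
# i.e. the manual (current - previous) * 24 * 60 column.
reports["MinutesBetween"] = (
    reports.groupby(["Room", "Category"])["DateTime"]
    .diff()
    .dt.total_seconds()
    .div(60)
)

# Average and stdev of minutes between reports, per room and category.
summary = reports.groupby(["Room", "Category"])["MinutesBetween"].agg(["mean", "std"]).round(1)
print(summary)
```

The mean/std columns correspond to the manual average and stdev steps described above, and the script can simply be re-run whenever new data arrives.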
I have one table in Excel 365 that I transform using a few queries and output as 2 pivot tables and 1 table. One of these queries is also loaded to the data model, as I need to create measures.
The operation is taking an extremely and erratically long time: approximately 25 to 70 minutes just to load a query involving 14 columns and 100 rows. The same operation in Power BI consistently takes 2.5 minutes. After spending 2 weeks on diagnostics, I eventually isolated the problem to:
Outputting the query to a pivot table and loading it to the data model.
Row-wise operations in the query (if loaded to the data model).
I have spent 2 weeks refactoring my code. This includes buffering what can be buffered and optimising every operation as best I can. I also tried "Fast Data Load", but it doesn't help.
Not loading the query to the data model means everything takes only 2-3 minutes, but I cannot skip loading the query to the data model because I need measures.
Is there anything else I can do to fix this extremely long load-to-data-model operation?
I'd like to store daily prices in Apache Druid in a way that makes it possible to group them by day, week, and month. I was thinking of defining open and close prices as metrics, not dimensions, but for that I would have to use first and last aggregators to aggregate open and close prices, respectively. The documentation says:
(Double/Float/Long) First and Last aggregator cannot be used in ingestion spec, and should only be specified as part of queries.
So it seems I cannot ingest the data this way. I would like to know the best way to load daily price data so that I can use all of Druid's capabilities when grouping by day, week, or month.
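One hedged sketch of the query-time route the documentation points to: keep open and close as plain metric columns at ingestion, then apply the first/last aggregators in the query itself with day, week, or month granularity. The broker URL, datasource, and column names below are assumptions:

```python
import json
import requests

# Hypothetical broker endpoint and datasource/column names.
BROKER = "http://localhost:8082/druid/v2/"

query = {
    "queryType": "timeseries",
    "dataSource": "daily_prices",
    "granularity": "week",  # or "day" / "month"
    "intervals": ["2020-01-01/2020-04-01"],
    "aggregations": [
        # First/last are applied at query time, as the docs require.
        {"type": "doubleFirst", "name": "open", "fieldName": "open_price"},
        {"type": "doubleLast", "name": "close", "fieldName": "close_price"},
        {"type": "doubleMax", "name": "high", "fieldName": "high_price"},
        {"type": "doubleMin", "name": "low", "fieldName": "low_price"},
    ],
}

resp = requests.post(BROKER, json=query)
print(json.dumps(resp.json(), indent=2))
```

Whether this is the best layout depends on the Druid version and rollup settings, but it avoids needing first/last in the ingestion spec.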
How do I choose a partition strategy in Spark for dates? I have a column in a DataFrame with the date in 2020-02-19 format. Should I specify the date as the partition column while writing, or create multiple columns (dd, mm, yyyy) from the date in the table and specify the columns yyyy, mm, dd when partitioning?
What kind of issues will come up with each partition strategy?
There is no actual gain in breaking the data into one partition (date=yyyy-mm-dd) versus multiple partitions (year=yyyy/month=mm/day=dd); if you have to process the last 10 days, either layout gives you the same amount of data in the same time. The biggest difference is the way you query and the way you maintain your data.
With one single partition column, writing queries for a specific day is easy: "I need to run something for 3 days ago", or "I need to query a date range from the 1st of Jan to the 1st of May". Having one partition column with the full date makes your life much easier for that.
Having multiple partition columns makes monthly analysis easy; you can query a whole month or a whole year in a simple way. But you lose the ability to query the data over an arbitrary date range as easily.
Beyond those differences, from a performance perspective neither layout creates any overhead; both solutions read the data at the same speed because you are not breaking the data into smaller files. I prefer a single partition column with the full date because, from my point of view, it is easier to maintain.
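For illustration, a minimal PySpark sketch of the two layouts discussed above; the paths and input schema are placeholders:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("partition-demo").getOrCreate()
df = spark.read.parquet("/data/events")  # assumed to contain a 'date' column

# Option 1: single partition column with the full date -> .../date=2020-02-19/
df.write.mode("overwrite").partitionBy("date").parquet("/warehouse/events_by_date")

# Option 2: derived year/month/day columns -> .../year=2020/month=2/day=19/
(
    df.withColumn("year", F.year("date"))
      .withColumn("month", F.month("date"))
      .withColumn("day", F.dayofmonth("date"))
      .write.mode("overwrite")
      .partitionBy("year", "month", "day")
      .parquet("/warehouse/events_by_ymd")
)

# A date-range filter is simpler to express against the single-column layout:
recent = spark.read.parquet("/warehouse/events_by_date").filter(
    "date >= '2020-01-01' AND date <= '2020-05-01'"
)
```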
We run a SaaS product. One of our customers is an enterprise client: while others generate 8-10,000 rows of data for a 6-month bill report, they generate about 300,000 rows, each with 30 columns. So when they go to our reporting module and export their 6-month report, our servers become unresponsive as the CPU jumps to 80%, and the entire process fails. Fetching the 300K rows from the SQL database after joins etc. takes 30 minutes, followed by 3 hours to write the 300K rows with 30 columns to Excel. My worry is that just 1 client is bringing the entire site down; we have 40 reports, and if they start going back 6 months and exporting data, other users will have a terrible experience.
What is the best way to handle such large exports of data in SaaS products?
Tx
Replicate your production database into an "end-user sandbox database" and serve your end-users' reporting from there, where they can slow the DB to a grinding halt if they need to.
Such large exports shouldn't be done from a production system.
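Alongside a replica, the export itself can usually be made much cheaper by streaming rows instead of building the whole report in memory. A minimal sketch, assuming PostgreSQL with a server-side cursor and openpyxl's write-only mode; the connection string, view, and report names are hypothetical:

```python
import psycopg2  # assuming PostgreSQL; any driver with server-side cursors works
from openpyxl import Workbook

conn = psycopg2.connect("host=report-replica dbname=billing user=report")

# A named (server-side) cursor streams rows instead of materialising all 300K at once.
with conn.cursor(name="bill_report_export") as cur:
    cur.itersize = 5000
    cur.execute("SELECT * FROM bill_report_view WHERE period_start >= %s", ("2020-01-01",))

    # write_only mode appends rows without keeping the whole sheet in memory.
    wb = Workbook(write_only=True)
    ws = wb.create_sheet("Bill report")

    header_written = False
    for row in cur:
        if not header_written:
            # description is available once the first batch has been fetched
            ws.append([col[0] for col in cur.description])
            header_written = True
        ws.append(row)

wb.save("bill_report.xlsx")
conn.close()
```

Running this as a background job that links or emails the finished file, rather than inside the web request, also keeps one heavy export from affecting other users.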