I have a VHD (virtual disk) of size 500 MB, of which only 10 MB of data is written, followed by empty chunks and finally one more block of 10 MB towards the end.
So the total data present is just 20 MB out of 500 MB.
I am trying to find a utility in Node.js to find out the number of data bytes, but haven't succeeded so far.
There is a function fs.fstatSync(file).size, which gives the total size.
Is there any utility/function to calculate the amount of data actually written?
You will probably have to use require("child_process") to call some system utility from the CLI.
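If the empty regions of the VHD are actually stored as holes in the filesystem (a sparse or dynamically expanding image), the allocated size can be read from the file's stat block count rather than its apparent size; in Node.js those fields show up on fs.statSync(path) as .blocks (512-byte blocks, POSIX only) next to .size. Here is a minimal sketch of that idea in Python, with a hypothetical path, assuming a sparse file; a fixed VHD that stores literal zeros will report the full 500 MB either way:

import os
import subprocess

path = "disk.vhd"                      # hypothetical path to the VHD file

st = os.stat(path)
apparent = st.st_size                  # what fs.fstatSync(...).size reports: the full 500 MB
allocated = st.st_blocks * 512         # 512-byte blocks actually allocated on disk

print(f"apparent: {apparent} bytes, allocated: {allocated} bytes")

# The system-utility route hinted at above (wrap with child_process in Node.js):
# du without --apparent-size reports allocated space only.
du = subprocess.run(["du", "--block-size=1", path], capture_output=True, text=True, check=True)
print(du.stdout.split()[0], "bytes allocated according to du")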
I'm exploring Feast and I'm trying to load data into my offline feature store locally from a .parquet file stored on my computer. There is only one FileSource – my single .parquet file.
I’m using this code to load data:
offline_feature_data = store.get_historical_features( entity_df=entity_df, features=features_bucket ).to_df()
However, I got: MemoryError: Unable to allocate 22.4 GiB for an array with shape (30, 100000000) and data type int64
My dataset: 66 columns x 10 000 rows
First of all, why do I get such a weird array shape, (30, 100000000), in my error message?
After I reduced the data to 66 columns x 1000 rows, loading the offline feature store works fine.
In pandas, I’m able to work with much larger datasets without any memory problem on the same machine.
Is Feast not able to deal with larger datasets? What is the limit?
Anyway, 66 columns x 10 000 rows is still not big data…
Machine: Ubuntu 20.04 LTS, python3.8, feast0.28, 16 GB RAM
It gives that memory error, but memory capacity is never reached. I have 60 GB of RAM on the SSH server and the full dataset process consumes 30.
I am trying to train an autoencoder with k-fold cross-validation. Without k-fold the training works fine. The raw dataset contains 250,000 samples in HDF5.
With k-fold it works if I use less than 100,000 of the total data.
I have converted it to float32 but it still does not work.
I have tried echo 1 as well, but that gets the Python program killed automatically.
Taking into account the dimensions of the dataset you provided (725000 x 277 x 76) and its data type (float64 - 8 bytes), it seems that you need (at minimum) around 114 GB to have the dataset loaded/stored in RAM.
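For reference, that estimate comes straight from the shape and dtype: 725000 × 277 × 76 elements × 8 bytes ≈ 1.22 × 10^11 bytes ≈ 114 GiB.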
A solution to overcome this limitation is to: 1) read a certain amount of the dataset (e.g. a chunk of 1 GB at a time) through a hyperslab selection and load/store it in memory, 2) process it, and 3) repeat the process (i.e. go to step 1) until the dataset has been completely processed. This way, you will not run out of RAM.
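A minimal sketch of that loop with h5py, under the assumption that the data lives in a file called data.h5 in a dataset named "X" (both names are hypothetical); the slice along the first axis is the hyperslab selection:

import h5py
import numpy as np

CHUNK_ROWS = 2000   # ~2000 * 277 * 76 * 8 bytes ≈ 0.34 GB read per iteration

with h5py.File("data.h5", "r") as f:        # hypothetical file name
    dset = f["X"]                           # hypothetical dataset name, shape (725000, 277, 76)
    n = dset.shape[0]
    for start in range(0, n, CHUNK_ROWS):
        stop = min(start + CHUNK_ROWS, n)
        chunk = dset[start:stop]            # hyperslab selection: only this slice is read into RAM
        chunk = chunk.astype(np.float32)    # optional down-cast, as mentioned in the question
        # ... process / train on this chunk here ...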
I am currently ingesting multiple TB of data into the DB of an Azure Data Explorer cluster (ADX, aka Kusto DB). In total, I iterate over about 30k files. Some of them are a few kB, but some are as big as several GB.
With some big files I am running into errors due to their file sizes:
FailureMessage(
{
...
"Details":"Blob size in bytes: '4460639075' has exceeded the size limit allowed for ingestion ('4294967296' B)",
"ErrorCode":"BadRequest_FileTooLarge",
"FailureStatus":"Permanent",
"OriginatesFromUpdatePolicy":false,
"ShouldRetry":false
})
Is there anything I can do to increase the allowed ingestion size?
There's a non-configurable 4GB limit.
You should split your source file(s), ideally so that each file holds between 100 MB and 1 GB of uncompressed data.
See: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/api/netfx/kusto-ingest-best-practices#optimizing-for-throughput
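A minimal sketch of such a split for line-oriented sources (CSV / JSON lines), assuming a hypothetical input.csv and a budget of roughly 1 GB of uncompressed data per output part; binary formats would need a format-aware splitter instead:

SRC = "input.csv"              # hypothetical source file
MAX_BYTES = 1_000_000_000      # ~1 GB of uncompressed data per output part

part, written = 0, 0
out = open(f"part_{part:04d}.csv", "w", encoding="utf-8")
with open(SRC, "r", encoding="utf-8") as src:
    header = src.readline()    # assumes a CSV header; repeat it in every part
    out.write(header)
    for line in src:
        if written + len(line) > MAX_BYTES:   # len() counts characters: a close-enough size proxy
            out.close()
            part += 1
            written = 0
            out = open(f"part_{part:04d}.csv", "w", encoding="utf-8")
            out.write(header)
        out.write(line)
        written += len(line)
out.close()

Splitting on line boundaries keeps every record intact, so each part can be ingested independently.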
I am using Cassandra in my app and it started eating up disk space much faster than I expected, and much faster than defined in the manual. Consider this simplest example:
CREATE TABLE sizer (
id ascii,
time timestamp,
value float,
PRIMARY KEY (id,time)
) WITH compression = {'sstable_compression': ''};
I am turning off compression on purpose to see how many bytes each record will take.
Then I insert a few values, run nodetool flush, and check the size of the data file on disk to see how much space it took.
The results show a huge waste of space: each record takes 67 bytes, and I am not sure how that is possible.
My id is 13 bytes long and it is saved only once in the data file, since it is always the same for testing purposes.
According to: http://www.datastax.com/documentation/cassandra/2.0/webhelp/index.html#cassandra/architecture/architecturePlanningUserData_t.html
Size should be:
timestamp should be 8 bytes
value as column name takes 6 bytes
column value float takes 4 bytes
column overhead 15 bytes
TOTAL: 33 bytes
For testing's sake, my id is always the same, so I actually have only 1 row, if I understood correctly.
So my question is: how do I end up using 67 bytes instead of 33?
The data file size is consistent: I tried inserting 100, 1000 and 10000 records, and it always comes out to 67 bytes per record.
There are 3 overheads discussed in the documentation. One is the column overhead, which you have accounted for. The second is the row overhead. And if you have a replication_factor greater than 1, there is an overhead for that as well.
I'm implementing a medium-scale marketing e-commerce affiliate site, which has the following estimates:
Total Size of Data: 5 - 10 GB
Indexes on Data: 1 GB approx (which I want to be in memory)
Disk Size (fast I/O): 20-25 GB
Memory: 2 GB
App development: node.js
Working set estimation per query: average 1-2 KB, maximum 20-30 KB of text-based article
I'm trying to understand whether MongoDB would be the right choice for the database or not. The indexes are considerably smaller than memory, but I have noticed that after querying, MongoDB keeps the result set in memory to cache the query. Within 8 hours I expect the queries to touch almost 95% of the data; in that scenario, how will MongoDB manage with such limited memory, given that the node.js app instance also runs on the same server?
Would MongoDB be the right choice for this scenario, or should I go for another JSON-based NoSQL database?