Problem:
I have huge .csv files with 16,000+ lines of data that I intend to upload to a Google Sheet via AWS Lambda (Node.js). The only problem is that the Google Sheets API has a read/write limit of 300 actions per minute, so it would take 53 minutes until all 16,000 lines are added, and splitting the dataset into pieces of 300, adding them, and waiting would take me too long. I have tried uploading everything in one go (which is what I'm currently doing), but then the data ends up in a single cell, which is not my goal. Is there a way (I'd be grateful for any documentation or article) for me to upload the data in one go and split it later? Or maybe tell Google Sheets to put the data into individual cells itself instead of me doing that?
What have I tried?
I have tried several things already. I have uploaded the entire file in one go, but that resulted in just one cell being written with roughly 16,000 lines of data.
I have also tried waiting a minute after my read/write limit of 300 is reached and then writing again, but this resulted in huge wait times of 50+ minutes (that would be too expensive since it runs on AWS Lambda).
I'm at my wits' end and can't seem to find a solution to my problem. I'd be really grateful for a solution, a piece of documentation, or even an article. I tried finding resources on my own but to no avail. Thank you in advance; if questions arise, feel free to ask and I'll provide more information.
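For what it's worth, the Sheets API values.append call accepts a whole 2D array, so a single request can write every row into its own cells; the per-minute quota counts requests, not cells. Below is a minimal sketch in Python rather than Node.js, but the Node googleapis client exposes the same spreadsheets.values.append method; the spreadsheet ID, range, and credential file are placeholders.

```python
# Minimal sketch: push an entire CSV into a sheet with one values.append call.
# Assumes google-api-python-client and a service-account key file; the
# spreadsheet ID, sheet name, and file paths are placeholders.
import csv

from google.oauth2 import service_account
from googleapiclient.discovery import build

SPREADSHEET_ID = "your-spreadsheet-id"   # placeholder
CREDS_FILE = "service-account.json"      # placeholder
SCOPES = ["https://www.googleapis.com/auth/spreadsheets"]

creds = service_account.Credentials.from_service_account_file(CREDS_FILE, scopes=SCOPES)
sheets = build("sheets", "v4", credentials=creds)

# Read the whole CSV into a 2D list: one inner list per row, one string per cell.
with open("data.csv", newline="") as f:
    rows = list(csv.reader(f))

# One append request writes every row into its own row of cells; the quota is
# counted per request, so all 16,000 rows cost a single write action.
sheets.spreadsheets().values().append(
    spreadsheetId=SPREADSHEET_ID,
    range="Sheet1!A1",
    valueInputOption="RAW",
    insertDataOption="INSERT_ROWS",
    body={"values": rows},
).execute()
```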
Related
I currently have approximately 10M rows and ~50 columns in a table that I wrap up and share as a pivot. However, this also means that it takes approximately 30 minutes to 1 hour to download the CSV, or much longer to do a Power Query ODBC connection directly to Redshift.
So far the best solution I've found is to use Python with redshift_connector to run update queries and UNLOAD a zipped result set to an S3 bucket, then use boto3/gzip to download and unzip the file, and finally perform a refresh from the CSV. This resulted in a 600MB Excel file compiled in ~15-20 minutes.
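For reference, a rough sketch of that unload-and-download step, assuming redshift_connector and boto3; the cluster endpoint, credentials, IAM role, and bucket/prefix names are placeholders:

```python
# Rough sketch of the UNLOAD -> S3 -> local gzip step described above.
# Assumes redshift_connector and boto3; the cluster endpoint, credentials,
# IAM role, and bucket/prefix names are all placeholders.
import gzip
import shutil

import boto3
import redshift_connector

conn = redshift_connector.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
    database="analytics",
    user="report_user",
    password="...",
)
cur = conn.cursor()

# UNLOAD writes the query result to S3 as a gzipped CSV with a header row.
cur.execute("""
    UNLOAD ('SELECT * FROM reporting.pivot_source')
    TO 's3://my-report-bucket/exports/pivot_source_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
    CSV HEADER GZIP PARALLEL OFF ALLOWOVERWRITE
""")
conn.commit()

# The unloaded object name depends on UNLOAD's part numbering, so list the
# prefix instead of hard-coding the key, then download and decompress it.
s3 = boto3.client("s3")
listing = s3.list_objects_v2(Bucket="my-report-bucket", Prefix="exports/pivot_source_")
key = listing["Contents"][0]["Key"]
s3.download_file("my-report-bucket", key, "pivot_source.csv.gz")
with gzip.open("pivot_source.csv.gz", "rb") as src, open("pivot_source.csv", "wb") as dst:
    shutil.copyfileobj(src, dst)
```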
However, this process still feels clunky, and sharing a 600MB Excel file among teams isn't the best either. I've searched for several days but I'm no closer to finding an alternative: what would you use if you had to share a drillable table/pivot among a team with a 10GB datastore?
As a last note: I thought about programming a couple of PHP scripts, but my office doesn't have the infrastructure to support that.
Any help or ideas would be most appreciated!
Call a meeting with the team and let them know about the constraints; you will get some suggestions and you can offer some of your own.
Suggestions from my side:
For the file part
reduce the data; for example, if it is time-dependent, increase the interval: hourly data can be aggregated to daily data (see the sketch after this list)
if the data is related to groups, you can split the file into parts, one file per group
or send them only the final reports and numbers they require instead of the full data.
For a fully functional app:
you can buy a desktop PC (if budget is a constraint, buy a used one or repurpose a laptop from old inventory) and create a PHP/Python web application that does all the steps automatically
create a local database and link it with the application
create the charting, pivoting, etc. modules in that application, and remove Excel from your process altogether
you can even use pre-built applications for the charting and pivoting part; Oracle APEX is one example that can be used.
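As a sketch of the first file-part suggestion (reducing time-dependent data), here is how hourly data might be aggregated to daily with pandas; the file and column names are hypothetical:

```python
# Sketch of the "reduce the interval" idea: collapse hourly rows to daily ones.
# Assumes pandas; the file name and column names are hypothetical.
import pandas as pd

df = pd.read_csv("hourly_data.csv", parse_dates=["timestamp"])

# One row per day per group instead of 24: sums for additive metrics,
# means for rates. This cuts the row count by roughly 24x.
daily = (
    df.set_index("timestamp")
      .groupby("region")
      .resample("D")
      .agg({"sales": "sum", "conversion_rate": "mean"})
      .reset_index()
)
daily.to_csv("daily_data.csv", index=False)
```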
Looking for a solid way to limit the size of Kismet's database files (*.kismet) through the conf files located in /etc/kismet/. The version of Kismet I'm currently using is 2021-08-R1.
The end state would be to limit the file size (10MB, for example), or to have the database written to and closed after X minutes of logging. Then a new database is created, connected, and starts getting written to. This process would continue until Kismet is killed. This way, rather than having one large database, there will be multiple smaller ones.
In the kismet_logging.conf file there are some timeout options, but that's for expunging old entries in the logs. I want to preserve everything that's being captured, but break the logs into segments as the capture process is being performed.
I'd appreciate anyone's input on how to do this either through configuration settings (some that perhaps don't exist natively in the conf files by default?) or through plugins, or anything else. Thanks in advance!
Two interesting ways:
One could let the old entries be expunged, but reach in with SQL first and extract what you want as a time-bound query (the .kismet file is a SQLite database).
A second way would be to automate the restarting of Kismet, which is a little less elegant but seems to work.
https://magazine.odroid.com/article/home-assistant-tracking-people-with-wi-fi-using-kismet/
If you read that article carefully, there are lots of bits of interesting information in it.
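For the first approach: since a .kismet log is a SQLite file, a time-bound pull can be done with plain sqlite3. A rough sketch follows; the packets table and its column names match recent Kismet schemas but should be verified against your own file with sqlite3's .schema command first, and the file names are placeholders.

```python
# Rough sketch of the time-bound SQL extraction idea. A .kismet log is a
# SQLite database; the packets table and its ts_sec/ts_usec, phyname,
# sourcemac, destmac columns match recent Kismet schemas, but check your own
# file's schema first. File names are placeholders.
import sqlite3
import time

src = sqlite3.connect("Kismet-20230101-00-00-00-1.kismet")  # placeholder
dst = sqlite3.connect("kismet-last-hour.sqlite")

cutoff = int(time.time()) - 3600  # keep only the last hour of packets

rows = src.execute(
    "SELECT ts_sec, ts_usec, phyname, sourcemac, destmac "
    "FROM packets WHERE ts_sec >= ?",
    (cutoff,),
).fetchall()

dst.execute(
    "CREATE TABLE IF NOT EXISTS packets "
    "(ts_sec INTEGER, ts_usec INTEGER, phyname TEXT, sourcemac TEXT, destmac TEXT)"
)
dst.executemany("INSERT INTO packets VALUES (?, ?, ?, ?, ?)", rows)
dst.commit()
```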
Disclaimer - I am not a software guy so please bear with me while I learn.
I am looking to use Node-RED as a parser/translator, taking data from a CSV file and sending out the rows of data at 1 Hz, say 5-10 rows of data being read and published per second.
Eventually, I will publish that data to some Modbus registers but I'm not there yet.
I have scoured the web and tried several examples; however, as soon as I trigger the flow, Node-RED stops responding and I have to delete the source CSV (so it can't run any more) and restart Node-RED in order to get it back up and running.
I have many of the Big Nodes from this guy installed and have tried a variety of different methods but I just can't seem to get it.
If I can get a single column of data from a CSV file being sent out one row at a time, I think that would keep me busy for a bit.
There is a file node that will read a file one line at a time; you can then feed this through the csv node to parse the fields in the CSV into an object so you can work with it.
The delay node has a rate-limiting function that can be used to limit the flow to processing 1 message per second, which achieves the rate you want.
All the nodes I've mentioned should be in the core set that ships with Node-RED.
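For orientation, the flow described above (read a line, parse it, emit one message per second) is roughly equivalent to this small standalone Python sketch; the file name and the print() call stand in for whatever the flow would actually publish to:

```python
# Standalone sketch of what the file -> csv -> delay (rate limit) flow does:
# read a CSV one line at a time, parse each line into a dict, and emit
# one parsed row per second. The file name and print() stand in for the
# eventual Modbus publish step.
import csv
import time

with open("data.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(row)        # in Node-RED this would be the next node in the flow
        time.sleep(1.0)   # the delay node's rate limit of 1 msg/s does this pacing
```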
I currently have an Excel-based data extraction method using Power Query and VBA (for docs with passwords). Ideally this would be scheduled to run once or twice a day.
My current solution involves setting up a spare laptop on the network that will run the extraction twice a day on its own. This works but I am keen to understand the other options. The task itself seems to be quite a struggle for our standard hardware. It is 6 network locations across 2 servers with around 30,000 rows and increasing.
Any suggestions would be greatly appreciated
Thanks
If you are going to work with increasing data and you are going to dedicate an exclusive laptop to the process, I would think about installing a database on that laptop (MySQL, for example). You could use Access too, but Access file corruption is a risk.
Download into this database all the data you need for your report, based on incremental loads (only new, modified and deleted info).
Then run the Excel report, extracting from this database on the same computer.
This should improve your solution's performance.
Your bigger problem is probably that you query ALL the data on each report generation.
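As a sketch of the incremental-load idea, here is a watermark-based load into the local MySQL database using pymysql; the table, columns, credentials, and the source-reading helper are all hypothetical, and deleted records would still need their own handling:

```python
# Sketch of a watermark-based incremental load into a local MySQL database.
# Assumes pymysql; the table, columns, credentials, and the source-reading
# helper below are hypothetical.
import pymysql


def new_or_changed_rows(since):
    """Placeholder: read the source locations and yield only rows whose
    modified_at is newer than `since` (this replaces the full Power Query pull)."""
    return []  # hypothetical source read goes here


local = pymysql.connect(host="localhost", user="report", password="...", database="reporting")

with local.cursor() as cur:
    # Highest modification timestamp already loaded.
    cur.execute("SELECT COALESCE(MAX(modified_at), '1970-01-01') FROM extract_rows")
    watermark = cur.fetchone()[0]

    # Upsert only the new or changed rows instead of re-querying everything.
    for row in new_or_changed_rows(since=watermark):
        cur.execute(
            "INSERT INTO extract_rows (id, payload, modified_at) "
            "VALUES (%s, %s, %s) "
            "ON DUPLICATE KEY UPDATE payload = VALUES(payload), modified_at = VALUES(modified_at)",
            (row["id"], row["payload"], row["modified_at"]),
        )

local.commit()
```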
Sorry if I get any of the terminology wrong here, but hopefully you will get what I mean.
I am using Windows Azure Cloud Storage to store a vast quantity of small files (images, 20Kb each).
At the minute, these files are all stored in the root directory. I understand it's not a normal file system, so maybe root isn't the correct term.
I've tried to find information on the long-term effects of this plan but with no luck, so if anyone can give me some information I'd be grateful.
Basically, am I going to run into problems if the numbers of files stored in this root end up in the hundreds of thousands/millions?
Thanks,
Steven
I've been in a similar situation where we were storing ~10M small files in one blob container. Accessing individual files through code was fine and there weren't any performance problems.
Where we did have problems was with managing that many files outside of code. If you're using a storage explorer (either the one that comes with VS2010 or any one of the others), the ones I've encountered don't support the return-files-by-prefix API; you can only list the first 5K, then the next 5K, and so on. You can see how this might be a problem when you want to look at the 125,000th file in the container.
The other problem is that there is no easy way of finding out how many files are in your container (which can be important for knowing exactly how much all of that blob storage is costing you) without writing something that simply iterates over all the blobs and counts them.
This was an easy problem to solve for us, as our blobs had sequential numeric names, so we simply partitioned them into folders of 1K items each. Depending on how many items you've got, you can group 1K of these folders into subfolders.
http://social.msdn.microsoft.com/Forums/en-US/windowsazure/thread/d569a5bb-c4d4-4495-9e77-00bd100beaef
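For anyone doing this today, the same prefix-partitioning scheme can be listed folder by folder with the current azure-storage-blob Python SDK (the 2010-era tooling in the answer above predates it); the connection string, container name, and naming scheme are placeholders:

```python
# Sketch of the prefix-partitioning idea with the current azure-storage-blob
# Python SDK. Blobs keep sequential numeric names but are grouped into virtual
# "folders" of 1,000 items each; the connection string and container name are
# placeholders.
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    conn_str="DefaultEndpointsProtocol=...;AccountName=...;AccountKey=...",  # placeholder
    container_name="images",
)

def blob_name(n: int) -> str:
    # 1,000 items per virtual folder: item 125000 becomes "125/125000.jpg".
    return f"{n // 1000:03d}/{n}.jpg"

# e.g. container.upload_blob(name=blob_name(125000), data=image_bytes)

# Listing one "folder" at a time keeps each listing small; the SDK pages
# through the service's 5,000-results-per-call limit transparently.
for blob in container.list_blobs(name_starts_with="125/"):
    print(blob.name)
```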
Short Answer: No
Medium Answer: Kind of?
Long Answer: No, but if you query for a file list it will only return 5,000. You'll need to re-query every 5K to get a full listing, according to that MSDN page.
Edit: Root works fine for describing it. 99.99% of people will grok what you're trying to say.