I have a server which generates some output, like this:
http://192.168.0.1/getJPG=[ID]
I have to go through IDs 1 to 20M.
I see that most of the delay is in storing the files; currently I store every request's result as a separate file in a folder, in the form [ID].jpg.
The server responds quickly, and the generator server is really fast, but I can't handle the received data rapidly enough.
What is the best way of storing the data for later processing?
I can do any type of storing: in a DB, in a SINGLE file with later parsing of the big file, etc.
I can code in .NET, PHP, C++, etc. There are no restrictions on programming language. Please advise.
Thanks
So you're downloading 20 million files from a server, and the speed at which you can save them to disk is a bottleneck? If you're accessing the server over the Internet, that's very strange. Perhaps you're downloading over a local network, or maybe the "server" is even running locally.
With 20 million files to save, I'm sure they won't all fit in RAM, so buffering the data in memory won't help. And if the maximum speed at which data can be written to your disk is really a bottleneck, using MS SQL or any other DB will not change anything. There's nothing "magic" about a DB -- it is limited by the performance of your disk, just like any other program.
It sounds like your best bet would be to use multiple disks. Download multiple files in parallel, and as each is received, write it out to a different disk, in a round-robin fashion. The more disks you have, the better. Use multiple threads OR non-blocking I/O so downloads and disk writes all happen concurrently.
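A minimal sketch of that round-robin idea, assuming Node.js 18+ (global fetch) and that each target folder below lives on a different physical disk; the mount paths and concurrency value are illustrative, not part of the answer:

```ts
import { writeFile } from "node:fs/promises";

const DISKS = ["/mnt/disk1", "/mnt/disk2", "/mnt/disk3"]; // one folder per physical disk
const CONCURRENCY = 16;                                   // simultaneous downloads in flight

async function downloadRange(firstId: number, lastId: number): Promise<void> {
  let nextId = firstId;

  // Each worker repeatedly takes the next ID, downloads it, and writes it to a disk
  // chosen round-robin by ID, so writes are spread across spindles.
  async function worker(): Promise<void> {
    while (true) {
      const id = nextId++;
      if (id > lastId) return;
      const res = await fetch(`http://192.168.0.1/getJPG=${id}`);
      const data = Buffer.from(await res.arrayBuffer());
      const disk = DISKS[id % DISKS.length];
      await writeFile(`${disk}/${id}.jpg`, data);
    }
  }

  await Promise.all(Array.from({ length: CONCURRENCY }, worker));
}

await downloadRange(1, 20_000_000);
```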
To do this efficiently, I would multi-thread your application (C++).
The main thread of your application will make these web requests and push the responses onto the back of a std::list. This is all your main application thread will do.
Spawn (and keep it running, do not spawn repeatedly) a pthread (my preferred threading method, even on Windows...) and set it up to check the same std::list in a while loop. In the loop, check the size of the list and, if there are items to be processed, pop the front item off the list and write it to disk. Note that a std::list is not safe to access from two threads at once, so guard the push/pop with a mutex (or use a proper concurrent queue).
This will allow you to queue up the responses in memory and at the same time asynchronously save the files to disk. If your server really is as quick as you say it is, you might run out of memory. Then I would implement some 'waiting' if the number of items to be processed is over a certain threshold, but this will only run a little better than doing it serially.
The real way to 'improve' the speed of this is to have many worker threads (each with its own std::list and 'smart' pushing onto the list with the fewest items, or one std::list shared with a mutex) processing the files. If you have a multi-core machine with multiple hard drives, this will greatly increase the speed of saving these files to disk.
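This answer describes C++ with pthreads; the same shape (a producer filling an in-memory queue, a consumer draining it to disk, with a 'waiting' threshold when memory fills up) can be sketched in TypeScript for illustration. The threshold, sleep intervals, and file naming are illustrative assumptions:

```ts
import { writeFile } from "node:fs/promises";

type Item = { id: number; data: Buffer };

const queue: Item[] = [];
const MAX_QUEUED = 10_000; // back the producer off above this threshold

// Producer: fetch a response and push it onto the back of the in-memory queue.
async function produce(id: number): Promise<void> {
  while (queue.length > MAX_QUEUED) {
    await new Promise((r) => setTimeout(r, 50)); // simple 'waiting' when memory fills up
  }
  const res = await fetch(`http://192.168.0.1/getJPG=${id}`);
  queue.push({ id, data: Buffer.from(await res.arrayBuffer()) });
}

// Consumer: pop items off the front of the queue and write them to disk.
async function consume(): Promise<void> {
  while (true) {
    const item = queue.shift();
    if (!item) {
      await new Promise((r) => setTimeout(r, 10)); // queue empty, wait briefly
      continue;
    }
    await writeFile(`${item.id}.jpg`, item.data);
  }
}
```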
The other solution is to off-load the saving of the files to many different computers as well (if the number of disks on your current computer is limiting the writes). By using a message-passing system such as ZMQ/0MQ, you'd be able to very easily push the saving of files off to different systems (which are set up in a PULL fashion) with more hard drives accessible than just what is currently on one machine. Using ZMQ makes the round-robin style of message passing trivial, as a fan-out architecture is built in and takes literally minutes to implement.
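A minimal sketch of that PUSH/PULL fan-out, assuming the `zeromq` npm package (v6 promise API); the answer itself does not prescribe a language, and the host address, port, and storage path here are illustrative:

```ts
import { Push, Pull } from "zeromq";
import { writeFile } from "node:fs/promises";

// On the download machine: PUSH round-robins each message across all connected PULL workers.
export async function runProducer(): Promise<void> {
  const sock = new Push();
  await sock.bind("tcp://*:5557");
  // ...for each downloaded image, send a two-part message: file name + bytes.
  await sock.send(["1234.jpg", Buffer.from("...image bytes...")]);
}

// On each storage machine: PULL receives its share of the stream and writes to local disk.
export async function runStorageWorker(): Promise<void> {
  const sock = new Pull();
  sock.connect("tcp://192.168.0.10:5557");
  for await (const [name, data] of sock) {
    await writeFile(`/data/images/${name.toString()}`, data);
  }
}
```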
Yet another solution is to create a ramdisk (easily done natively on Linux; on Windows I've used a third-party tool). This will allow you to parallelize the writing of the files with as many writers as you want without issue. Then you'd need to make sure to copy those files to a real storage location before you restart, or you'd lose the files. But while it is running, you'd be able to store the files in real time without issue.
It probably helps to access the disk sequentially. Here is a simple trick to do this: stream all incoming files into an uncompressed ZIP file (there are libraries for that). This makes all I/O sequential and there is only one file. You can also split off a new ZIP file after 10,000 images or so to keep the individual ZIPs small.
You can later read all the files back by streaming out of the ZIP file. There is little overhead there, as it is uncompressed.
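A minimal sketch of this, assuming the `archiver` npm package with compression disabled (`store: true`); the batch file name and the ~10,000-image split are illustrative:

```ts
import archiver from "archiver";
import { createWriteStream } from "node:fs";

const output = createWriteStream("images-batch-0001.zip");
const archive = archiver("zip", { store: true }); // store = no compression, purely sequential writes
archive.pipe(output);

// Call this for each incoming image instead of creating a separate file per ID.
function addImage(id: number, data: Buffer): void {
  archive.append(data, { name: `${id}.jpg` });
}

// After ~10,000 images, finalize this ZIP and start a new one.
async function closeBatch(): Promise<void> {
  await archive.finalize();
}
```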
It sounds like you are trying to write an application which downloads as much content as it can as quickly as possible. You should be aware that when you do this, chances are people will notice, as this will suck up a good amount of bandwidth and other resources.
Since this is Windows/NTFS, there are some things you need to keep in mind:
- Do not have more than 2k files in one folder.
- Use async/buffered writes as much as possible.
- Spread over as many disks as you have available for best I/O performance.
One thing that wasn't mentioned that is somewhat important is file size. Since it looks like you are fetching JPEGs, I'm going to assume an average file size of ~50k.
I've recently done something like this with an endless stream of ~1KB text files using .NET 4.0 and was able to saturate a 100 Mbit network controller on the local net. I used the TaskFactory to generate HttpWebRequest threads to download the data to memory streams. I buffered them in memory, so I did not have to write them to disk. The basic approach I would recommend is similar: spin off threads that each make the request, grab the response stream, and write it to disk. The hardest part will be generating the sequential folders and file names. You want to do this as quickly as possible, make it thread safe, and do your bookkeeping in memory to avoid hitting the disk with unnecessary calls for directory contents.
I would not worry about trying to sequence your writes. There are enough layers of the OS/NTFS that will try and do this for you. You should be saturating some piece of your pipe in no time.
We have a backend expressjs server that will read off of the disk for many files whenever a front-end client connects.
At the OS level, are these reads blocking?
That is, if two people connect at the same time, will whoever gets scheduled second have to wait to read the file until the first person who is currently reading it finishes?
We are just using fs.readFile to read files.
EDIT: I'm implementing caching anyway (it's a legacy codebase, don't hate me), I'm just curious if these reads are blocking and this might improve response time from not having to wait until the file is free to read.
fs.readFile() is not blocking for nodejs. It's a non-blocking, asynchronous operation. While one fs.readFile() operation is in progress, other nodejs code can run.
If two fs.readFile() calls are in operation at the same time, they will both proceed in parallel.
Nodejs itself uses a native OS thread pool with a default size of 4 for file operations so it will support up to 4 file operations in parallel. Beyond 4, it queues the next operation so when one of the 4 finishes, then the next one in line will start to execute.
Within the OS, it will time slice these different threads to achieve parallel operation. But, at the disk controller itself for a spinning drive, only one particular read operation can be occurring at once because the disk head can only be on one track at a given time. So, the underlying read operations reading from different parts of a spinning disk will eventually be serialized at the disk controller as it moves the disk head to read from a given track.
But, if two separate reads are trying to read from the same file, the OS will typically cache that info, so the 2nd read won't have to read from the disk again; it will just get the data from an OS cache.
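A small demonstration of the behaviour described above (file names are illustrative). The default pool of 4 can be raised via the UV_THREADPOOL_SIZE environment variable, e.g. `UV_THREADPOOL_SIZE=8 node demo.js`:

```ts
import { readFile } from "node:fs";

const files = ["a.bin", "b.bin", "c.bin", "d.bin", "e.bin"];

console.time("reads");
let pending = files.length;

for (const name of files) {
  // All five reads are started immediately; none of them blocks the event loop.
  readFile(name, (err, data) => {
    if (err) throw err;
    console.log(`${name}: ${data.length} bytes`);
    if (--pending === 0) console.timeEnd("reads"); // up to 4 run in parallel, the 5th queues
  });
}
```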
I inherited this codebase and am going to implement some caching anyway, but was just curious if caching would also improve response time since we would be reading from non-blocking process memory instead of (potentially) blocking filesystem memory.
OS file caching is heavily, heavily optimized (it's a problem operating systems have spent decades working on). Implementing my own level of caching on top of the OS isn't where I would think you'd find the highest bang for the buck for improving performance. While there may be a temporary lock used in the OS file cache, that lock would only exist for the duration of a memory copy from cache to target read location which is really, really short. Probably not something anything would notice. And, that temporary lock is not blocking nodejs at all.
I have a multi-threaded perl script which does the following:
1) One boss thread searches through a folder structure on an external server. For each file it finds, it adds its path/name to a thread queue. If the path/file is already in the queue, or being processed by the worker threads, the enqueuing is skipped.
2) A dozen worker threads dequeue from the above queue, process the files, and remove them from the hard disk.
It runs on a single physical server, and everything works fine.
Now I want to add a second server, which will work concurrently with the first one, searching through the same folder structure, looking for files to enqueue/process. I need a means to make both servers aware of what each one is doing, so that they don't process the same files. The queue is minimal, ranging from 20 to 100 items. The list is very dynamic and changes many times per second.
Do I simply write to/read from a regular file to keep them sync'ed about the current items list? Any ideas?
I would be very wary of using a regular file - it'll be difficult to manage locking and caching semantics.
IPC is a big and difficult topic, and when you're doing server to server - it can get very messy indeed. You'll need to think about much more complicated scenarios, like 'what if host A crashes with partial processing'.
So first off I would suggest you need to (if at all possible) make your process idempotent. Specifically - set it up so IF both servers do end up processing the same things, then no harm is done - it's 'just' inefficient.
I can't tell you how to do this, but the general approach is to permit (and discard) duplicated effort.
In terms of synchronising your two processes on different servers - I don't think a file will do the trick - shared filesystem IPC is not really suitable for a near real time sort of operation, because of caching. Default cache lag on NFS is somewhere in the order of 60s.
I would suggest that you think in terms of sockets - they're a fairly standard way of doing server-to-server IPC. Since you already check for 'pending' items in the queue, you could expand this to query the other host (note: consider what you'll do if it's offline or otherwise unreachable) before enqueueing.
The caveat here is parallelism works better the less IPC is going on. Talking across a network is generally a bit faster than talking to a disk, but it's considerably slower than the speed at which a processor runs. So if you can work out some sort of caching/locking mechanism, where you don't need to update for each and every file - then it'll run much better.
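The question is about Perl, but the "ask the peer before enqueueing" handshake is language-agnostic; here is a minimal sketch using Node's net module in TypeScript, with the port, wire format, and fallback behaviour as illustrative assumptions:

```ts
import { createServer, createConnection } from "node:net";

const inFlight = new Set<string>(); // files this server is currently queuing/processing

// Each server answers queries about its own in-flight set: peer sends a path, we reply YES/NO.
createServer((socket) => {
  socket.on("data", (buf) => {
    const path = buf.toString().trim();
    socket.end(inFlight.has(path) ? "YES\n" : "NO\n");
  });
}).listen(7070);

// Before enqueueing, ask the peer whether it already owns the file.
function peerIsProcessing(peerHost: string, path: string): Promise<boolean> {
  return new Promise((resolve) => {
    const conn = createConnection({ host: peerHost, port: 7070 }, () => conn.write(path + "\n"));
    conn.on("data", (buf) => { resolve(buf.toString().startsWith("YES")); conn.end(); });
    conn.on("error", () => resolve(false)); // peer offline: fall back to processing it ourselves
  });
}
```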
What type of usage is IPC intended for, and is it OK to send larger chunks of JSON (hundreds of characters) between processes using IPC? Should I be trying to send as small a message as possible using IPC, or would the performance gains from reducing message size not be worth the effort?
What type of usage is IPC intended for, and is it OK to send larger chunks of JSON (hundreds of characters) between processes using IPC?
At its core, IPC is what it says on the tin. It's a tool to use when you need to communicate information between processes, whatever that may be. The topic is very broad, and technically includes allocating shared memory and doing the communication manually, but given the tone of the question, and the tags, I'm assuming you're talking about the OS-provided facilities.
Wikipedia does a pretty good job discussing how IPC is used, and I don't think I can do much better, so I'll concentrate on the second question.
Should I be trying to send as small a message as possible using IPC, or would the performance gains from reducing message size not be worth the effort?
This smells a bit like a micro-optimization. I can't say definitively, because I'm not privy to the source code at Microsoft and Apple, and I really don't want to dig through the Linux kernel's implementation of IPC, but, here's a couple points:
IPC is a common operation, so OS designers are likely to optimize it for efficiency. There are teams of engineers that have considered the problem and figured out how to make this fast.
The bottleneck in communication across processes/threads is almost always synchronization. Delays are bad, but race conditions and deadlocks are worse. There are, however, lots of creative ways that OS designers can speed up the procedure, since the system controls the process scheduler and memory manager.
There are lots of ways to make the data transfer itself fast. For the OS, if the data needs to cross process boundaries, then there is some copying that may need to take place, but the OS copies memory all over the place all the time. Think about a command line utility, like netstat. When that executable is run, memory needs to be allocated, the process needs to be loaded from disk, and any address fixing that the OS needs to do is done, before the process can even start. This is done so quickly that you hardly even notice. On Windows, netstat is about 40k, and it loads into memory almost instantly. (Notepad, another fast loader, is 10 times that size, but it still launches in a tiny amount of time.)
The big exception to #2 above is if you're talking about IPC between processes that aren't on the same computer. (Think Windows RPC) Then you're really bound by the speed of the networking/communication stack, but at that point a few kb here or there isn't going to make a whole lot of difference. (You could consider AJAX to be a form of IPC where the 'processes' are the server and your browser. Now consider how fast Google Docs operates.)
If the IPC is between processes on the same system, I don't think that it's worth a ton of effort shaving bytes from your message. Make your message easy to debug.
If the communication is happening between processes on different machines, then you may have something to think about. Having spent a lot of time debugging issues that would have been simple with a better data format, I'd say a few dozen extra milliseconds of transit time isn't worth making the data harder to parse/debug. Remember the three rules of optimization¹:
Don't.
Don't... yet. (For experts)
Profile before you do.
¹ The first two rules are usually attributed to Michael A. Jackson (the computer scientist, not the pop singer).
Tasks:
Scrape html from a webpage
Parse the html
Clean the data (remove white space, perform basic regex)
Persist the data to a SQL database.
The goal is to complete these 4 tasks as quickly as possible; here are some possible example approaches.
Possible Sample Approaches
Multi-Step 1: Scrape all pages and store html as .txt files. After all html is stored as text, run a separate module that parses/cleans/persists the data.
Multi-step 2: Scrape/Parse/Clean data and store in .txt files. Run a separate module to insert the data into a database.
Single-Step: Scrape/Parse/Clean/Persist data all in one step.
Assumptions:
1 dedicated server being used for scraping
disk space is unlimited
internet connection is your average home connection
memory (8GB)
No rate limiting on any web pages
User wants to scrape 1 million pages
I haven't done enough testing with node.js to establish a best practice but any insight on optimizing these tasks would be greatly appreciated.
Obviously, there are some unanswered questions (how much html is on a typical page, how much are your parsing, request/response latency, what frameworks are being used to parse data...etc), but a high level best practice/key considerations would be beneficial. Thanks.
With a problem like this, you can foresee only certain aspects of what will really control where your bottlenecks will be. So, you start with a smart, but not complicated, implementation and you spend a fair amount of time figuring out how you can measure your performance and where the bottlenecks are.
Then, based on the knowledge of where the bottlenecks are, you come up with a proposed design change, implement that change and see how much of a difference you made in your overall throughput. You then instrument again, measure again and see where your new bottleneck is, come up with a new theory on how to beat that bottleneck, implement, measure, theorize, iterate, etc...
You really don't want to overdesign or overcomplicate the first implementation because it's very easy to be wrong about where you think the real bottleneck will be.
So, I'd probably start out with a design like this:
Create one node.js process that does absolutely nothing but download pages and write them to disk. Use nothing but async I/O everywhere and make it configurable for how many simultaneous page downloads it has in flight at once. Do no parsing, just write the raw data to disk. You will want to find some very fast way of storing which URL is which file. That could be something as simple as appending info to a text file or it could be a database write, but the idea is you just want it to be fast.
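A sketch of that first process under the stated assumptions: async I/O only, a configurable number of in-flight downloads, raw pages written straight to disk, and a fast append-only index recording which URL ended up in which file. The paths, index format, and concurrency value here are illustrative:

```ts
import { writeFile, appendFile } from "node:fs/promises";

const IN_FLIGHT = 20;                       // tune this, then re-measure throughput
const urls: string[] = [/* the list of pages to fetch */];
let next = 0;

async function fetchWorker(): Promise<void> {
  while (true) {
    const i = next++;
    if (i >= urls.length) return;
    const file = `pages/${i}.html`;
    const res = await fetch(urls[i]);                            // no parsing in this process
    await writeFile(file, Buffer.from(await res.arrayBuffer())); // raw data straight to disk
    await appendFile("index.txt", `${file}\t${urls[i]}\n`);      // record which URL is which file
  }
}

await Promise.all(Array.from({ length: IN_FLIGHT }, fetchWorker));
```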
Then, create another node.js process that repeatedly grabs files from disk, parses them, cleans the data and persists the data to your SQL database.
Run the first node.js process by itself and let it run until it collects either 1,000 web pages or for 15 minutes (whichever comes first) to measure how much throughput you're initially capable of. While it's running, note the CPU utilization and the network utilization on your computer. If you're already in the ballpark of what you might need for this first node.js process, then you're done with it. If you want it to go much faster, then you need to figure out where your bottleneck is.

If you're CPU-bound (unlikely for this I/O task), then you can cluster and run multiple of these node.js processes, giving each one a set of URLs to fetch and a separate place to write their collected data. More than likely you're I/O-bound. That may be either because you aren't fully saturating your existing network connection (the node.js process spends too much time waiting for I/O) or because you have already saturated your network connection and it is now the bottleneck. You will have to figure out which of these it is: if you add more simultaneous web page fetches and the performance does not increase or even goes down, then you've probably already saturated your web connection. You will also have to watch out for saturating the file I/O sub-system in node.js, which uses a limited thread pool to implement async I/O.
For the second node.js process, you follow a similar process. Give it 1,000 web pages and see how fast it can process them all. Since you do have I/O to read the files from disk and to write to the database, you will want to have more than one page being parsed at a time so you can maximize usage of the CPU while one page is being read in or written out. You can either write one node.js process to handle multiple parse jobs at once or you can cluster a single node.js process. If you have multiple CPUs in your server, then you will probably want at least as many processes as you have CPUs. Unlike the URL-fetcher process, the parsing code is likely something that could be seriously optimized to be faster. But, as with other performance issues, don't try to over-optimize that code until you know you are CPU-bound and it is holding you up.
Then, if your SQL database can be on another box or at least using another disk, that's probably a good thing because it separates out the disk writes there from your other disk writes.
Where you go after the first couple steps will depend entirely upon what you learn from the first few steps. Your ability to measure where the bottlenecks are and design quick experiments to test bottleneck theories will be hugely important for making rapid progress and not wasting development time on the wrong optimizations.
FYI, some home internet connection ISPs may set off some alarms with the amount and rate of your data requests. What they will do with that info likely varies a lot from one ISP to the next. I would think that most ultimately have some ability to rate limit your connection to protect the quality of service for others sharing your same pipe, but I don't know when/if they would do that.
This sounds like a really fun project to try to optimize and get the most out of. It would make a great final project for a medium to advanced software class.
I'm learning how to use the TPL for parallelizing an application I have. The application processes ZIP files, extracting all of the files held within them and importing the contents into a database. There may be several thousand zip files waiting to be processed at a given time.
Am I right in kicking off a separate task for each of these ZIP files or is this an inefficient way to use the TPL?
Thanks.
This seems like a problem better suited for worker threads (separate thread for each file) managed with the ThreadPool rather than the TPL. TPL is great when you can divide and conquer on a single item of data but your zip files are treated individually.
Disk I/O is going to be your bottleneck, so I think that you will need to throttle the number of jobs running simultaneously. It's simple to manage this with worker threads, but I'm not sure how much control you have (if any) over Parallel.For/ForEach as far as how much parallelism goes on at once, which could choke your process and actually slow it down.
Anytime that you have a long running process, you can typically gain additional performance on multi-processor systems by making different threads for each input task. So I would say that you are most likely going down the right path.
I would have thought that this would depend on whether the process is limited by CPU or by disk. If the process is limited by disk, I'd have thought that it might be a bad idea to kick off too many threads, since the various extractions might just compete with each other.
This feels like something you might need to measure to get the correct answer for what's best.
I have to disagree with certain statements here guys.
First of all, I do not see any difference between ThreadPool and Tasks in coordination or control. Especially since tasks run on the ThreadPool and you have easy control over tasks; exceptions are nicely propagated to the caller during await or when awaiting Task.WhenAll(tasks), etc.
Second, I/O won't have to be the only bottleneck here; depending on the data and the level of compression, the zipping is most likely going to take more time than reading the file from the disk.
It can be thought of in many ways, but I would go for something like the number of CPU cores, or a little less.
Load the file paths into a ConcurrentQueue and then allow the running tasks to dequeue file paths, load the files, zip them, and save them.
From there you can tweak the number of cores and play with load balancing.
I do not know if ZIP supports file partitioning during compression, but in some advanced/complex cases it could be a good idea, especially on large files...
Wow, it is a 6-year-old question, bummer! I had not noticed... :)