Splitting an Office document with a large number of pages into multiple files - Linux

We use LibreOffice in headless mode inside Docker to convert Office documents (.docx, .pptx, etc.) into PDF before passing them to the next step in the pipeline. We found that, depending on the size and complexity of the document, and especially when it has many pages (~100), memory consumption climbs so high that the instance crashes. Is there a tool we can use to split Office documents into multiple chunks so that LibreOffice only has to deal with files with a small footprint?
Of course, the tool has to be able to run in headless mode inside Docker.
Thanks
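One possible approach, sketched below with several assumptions: a .docx does not store page boundaries, so the split here is by top-level body elements (paragraphs and tables) using python-docx, each chunk is converted by its own headless soffice run, and the partial PDFs are merged with pypdf. This is only a rough sketch, not a drop-in tool; chunking this way can disturb headers/footers, page numbering and cross-references, and the chunk size would need tuning for your documents.

    # Sketch: split a .docx into chunks, convert each chunk with a separate
    # headless LibreOffice run, then merge the partial PDFs.
    # Assumes python-docx, pypdf and an soffice binary inside the container.
    import copy, subprocess, tempfile
    from pathlib import Path
    from docx import Document          # pip install python-docx
    from pypdf import PdfWriter        # pip install pypdf

    def split_docx(src: Path, out_dir: Path, elements_per_chunk: int = 200) -> list[Path]:
        """Copy top-level body elements into smaller documents."""
        body = list(Document(str(src)).element.body)
        chunks = []
        for i in range(0, len(body), elements_per_chunk):
            part = Document()                          # new, empty document
            for el in body[i:i + elements_per_chunk]:
                part.element.body.append(copy.deepcopy(el))
            path = out_dir / f"{src.stem}_part{i // elements_per_chunk:03d}.docx"
            part.save(str(path))
            chunks.append(path)
        return chunks

    def convert_and_merge(src: Path, merged_pdf: Path) -> None:
        with tempfile.TemporaryDirectory() as tmp:
            tmp_dir = Path(tmp)
            writer = PdfWriter()
            for chunk in split_docx(src, tmp_dir):
                # One soffice process per chunk keeps the peak memory footprint small.
                subprocess.run(
                    ["soffice", "--headless", "--convert-to", "pdf",
                     "--outdir", str(tmp_dir), str(chunk)],
                    check=True,
                )
                writer.append(str(chunk.with_suffix(".pdf")))
            with open(merged_pdf, "wb") as fh:
                writer.write(fh)

    if __name__ == "__main__":
        convert_and_merge(Path("report.docx"), Path("report.pdf"))

For .pptx the same per-chunk conversion idea applies, but the splitting step would need a different library (e.g. python-pptx).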

Related

Is there a way to dynamically determine the vhdSize flag?

I am using the MSIX manager tool to convert a *.msix (an application installer) to a *.vhdx so that it can be mounted in an Azure virtual machine. One of the flags that the tool requires is -vhdSize, which is in megabytes. This has proven to be problematic because I have to guess what the size should be based on the MSIX. I have run into numerous creation errors due to too small a vhdSize.
I could set it to an arbitrarily high value in order to get around these failures, but that is not ideal. Alternatively, guessing the correct size is an imprecise science and a chore to do repeatedly.
Is there a way to have the tool dynamically set the vhdSize, or am I stuck guessing a value that is both large enough to accommodate the file and not so large as to waste disk space? Or is there a better way to create a *.vhdx file?
https://techcommunity.microsoft.com/t5/windows-virtual-desktop/simplify-msix-image-creation-with-the-msixmgr-tool/m-p/2118585
There is an MSIX Hero app that can select a size for you: it automatically checks how big the uncompressed files are, adds an extra safety buffer (currently double the original size), and rounds the result up to the next 10 MB. Reference: https://msixhero.net/documentation/creating-vhd-for-msix-app-attach/
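As a rough illustration of that heuristic (not of msixmgr itself): an .msix package is a ZIP archive, so the uncompressed sizes can be summed with Python's zipfile module, doubled as the safety buffer, and rounded up to the next 10 MB. The multiplier and rounding below simply mirror the MSIX Hero description and may need adjusting for your packages.

    # Estimate -vhdSize by summing uncompressed sizes inside the .msix (a ZIP),
    # doubling for safety, and rounding up to the next 10 MB.
    import math
    import zipfile

    def estimate_vhd_size_mb(msix_path: str, buffer_factor: float = 2.0, round_to_mb: int = 10) -> int:
        with zipfile.ZipFile(msix_path) as pkg:
            uncompressed = sum(info.file_size for info in pkg.infolist())
        size_mb = uncompressed * buffer_factor / (1024 * 1024)
        return math.ceil(size_mb / round_to_mb) * round_to_mb

    if __name__ == "__main__":
        # Pass the result to msixmgr's -vhdSize flag.
        print(estimate_vhd_size_mb("MyApp.msix"))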

How to make Excel run in multi-CPU mode

I have big Excel files, thousands of rows and columns, gigabytes in size.
When I work inside these files in Excel, it throttles and lags, and sometimes it just gets stuck and freezes.
When I open Task Manager, I see that Excel isn't even saturating one CPU core.
RAM usage is not maxed out either.
How do I make Excel use all my cores?
Excel 2019.
I see you are familiar with Python. Why don't you move your data to Python? Actually, I don't know its capabilities; I work in R. Some of my data in Excel and even in *.csv takes 100-200 MB, while in *.Rdata format the same information is less than 5 MB, and R functions work faster than heavy Excel. (A rough Python sketch of that idea is shown after this answer.)
And yes, how to make Excel use all processor cores is interesting to me too.
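This doesn't make Excel itself use more cores, but as a sketch of the "move the data out of Excel" suggestion above: with pandas (plus the openpyxl and pyarrow packages installed) a large workbook can be read once and stored in a binary format such as Parquet, which is usually far smaller and much faster to reload, similar to the .Rdata comparison. The file and sheet names here are just examples.

    # Read the workbook once, then keep a compact binary copy for later sessions.
    import pandas as pd

    df = pd.read_excel("big_workbook.xlsx", sheet_name="Data")   # slow, done once
    df.to_parquet("big_workbook.parquet")                        # compact binary copy

    # Later sessions load the Parquet file instead of the workbook.
    df = pd.read_parquet("big_workbook.parquet")
    print(df.shape)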

Avoid database opening when Excel is running

I developed a VB.NET program that uses Excel files to generate some reports.
Since the program takes a long time to generate a report, I usually do other things while it is running. The problem is that sometimes I need to open other Excel files, and then the Excel files used by the program are shown to me as well. I want to keep the files being processed hidden even when I open other Excel files. Is this possible? Thanks
The FileSystem.Lock Method controls access by other processes to all or part of a file opened by using the Open function.
The My feature gives you better productivity and performance in file I/O operations than Lock and Unlock. For more information, see FileSystem.
More information here.

VBA: Coordinate batch jobs between several computers

I have a VBA script that extracts information from huge text files and does a lot of data manipulation and calculations on the extracted info. I have about 1000 files and each takes an hour to finish.
I would like to run the script on as many computers (among others, EC2 instances) as possible to reduce the time needed to finish the job. But how do I coordinate the work?
I have tried two approaches. First, I set up a Dropbox folder as a network drive with one txt file holding the number of the last job started; each machine reads it from VBA, starts the next job and updates the number, but there is apparently too much lag before a change to the file on one computer propagates to the rest, so it isn't practical. The second was to find a simple "private" counter service online that increments on each visit, so a machine would load the page, read the number, and the page would update the number for the next visit from another computer. But I have found no such service.
Any suggestions on how to coordinate such tasks between different computers in VBA?
First of all, if you can, use a proper programming language, for example C#, which makes parallel processing easy.
If you must use VBA, then optimize your code first. Can you show us the code?
Edit:
If you must stick with VBA, you could do the following. First, you need some sort of file server to store all the text files in one folder.
Then in the macro, for each .txt file in the folder, try to open it in exclusive mode. If the file can be opened, run your code and, once it has finished, move the file elsewhere; otherwise skip to the next .txt file.
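A small sketch of that claim-and-process loop, written in Python rather than VBA for brevity, and using an atomic rename into an "inprogress" folder as the claim instead of an exclusive open: on a shared drive only one worker's rename of a given file succeeds. The folder paths and the process_file step are placeholders for your own pipeline.

    # Each worker claims a job by renaming it into "inprogress"; the rename
    # fails for everyone but the first worker, so no central counter is needed.
    import os
    from pathlib import Path

    TODO = Path(r"\\fileserver\jobs\todo")
    INPROGRESS = Path(r"\\fileserver\jobs\inprogress")
    DONE = Path(r"\\fileserver\jobs\done")

    def process_file(path: Path) -> None:
        ...  # the extraction and calculations your VBA script performs

    def worker() -> None:
        for job in sorted(TODO.glob("*.txt")):
            claimed = INPROGRESS / job.name
            try:
                os.rename(job, claimed)   # only one worker wins this rename
            except OSError:
                continue                  # another machine already took the file
            process_file(claimed)
            os.rename(claimed, DONE / job.name)

    if __name__ == "__main__":
        worker()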

Resize image when uploading to server or when serving from server to client?

My website uses many images. On a slow day users will upload hundreds of new images.
I'm trying to figure out the best practice for manipulating image sizes.
This project uses Node.js with gm module for manipulating images, but I don't think this question is node or gm specific.
I came up with several strategies, but I can't make a decision as to which is the best, and I am not sure if I am missing an obvious best-practice strategy.
Please enlighten me with your thoughts and experience.
Option 1: Resize the file with gm on every client request.
Option 1 pros:
If I run gm function every time I serve a file, I can control the size, quality, compression, filters and so on whenever I need it.
On the server I only save 1, full quality - full size version of the file and save storage space.
Option 1 cons:
gm is very resource intensive, and that means I will be abusing my RAM for every single image served to every single client.
It also means I will always be working from a big file, which makes things even worse.
I will always have to fetch the file from my storage (in my case S3) to the server, then manipulate it, then serve it. It seems like it would create redundant bandwidth issues.
Option 2: resize the file on first upload and keep multiple sizes of the file on the server.
Option 2 pros:
I will only have to use gm on uploads.
Serving the files will require almost no resources.
Option 2 cons:
I will use more storage because I will be saving multiple versions of the same file (i.e. full, large, medium, small, x-small) instead of only one version.
I will be limited to using only the sizes that were created when the user uploaded their image.
Not flexible - If in the future I decide I need an additional size version (x-x-small for instance) I will have to run a script that processes every image in my storage to create the new version of the image.
Option 3:
Use option 2 to only process files on upload, but retain a resize module when serving file sizes that don't have a stored version in my storage.
Option 3 pros:
I will be able to reduce resource usage significantly when serving files in a selection of set sizes.
Option 3 cons:
I would still use more storage, as in option 2 vs option 1.
I will still have to process files when I serve them in cases where I don't have the stored size I want.
Option 4: I do not create multiple versions of files on upload. I do resize the images when I serve them, BUT whenever a given image size is requested, that version of the file is saved in my storage, so future requests will not have to process the image again.
Option 4 pros:
I will only use storage for the versions I use.
I can add a new file size whenever I need one; it will be created automatically on demand if it doesn't already exist.
It will use a lot of resources only once per file.
Option 4 cons:
Files that are only accessed once will be both resource intensive AND storage intensive: I will access the file, see that the size version I need doesn't exist, create the new version, use the resources needed, and save it to my storage, wasting storage space on a file that will only be used once (note: I can't know in advance how many times a file will be used).
I will have to check if the file already exists for every request.
So,
Which would you choose? Why?
Is there a better way than the ways I suggested?
The solution depends heavily on how your resources are used. If utilisation is intensive, then option 2 is by far the better choice. If not, option 1 could also work nicely.
From a qualitative point of view, option 4 is of course the best. But for simplicity and automation, option 2 is much better.
Because simplicity matters, I suggest mixing options 2 and 4: keep a list of sizes (e.g. large, medium, small), but don't process them on upload; generate them when first requested, as in option 4.
That way, in the worst case, you end up with the option 2 solution.
My final word would be that you should also use the <img> and/or <canvas> elements in your website to perform the final sizing, so that the small computation overhead is not done on the server side.
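The question uses Node.js with gm, but as a language-neutral sketch of that mixed 2 + 4 strategy, here is the lazy-generation pattern with Pillow and a local directory standing in for gm and S3; the size whitelist, paths and function names are only illustrative.

    # Fixed whitelist of sizes (option 2), each variant generated on first
    # request and then reused from storage (option 4).
    from pathlib import Path
    from PIL import Image

    ORIGINALS = Path("storage/originals")
    VARIANTS = Path("storage/variants")
    ALLOWED_WIDTHS = {1200, 600, 300}          # the agreed list of sizes

    def get_variant(image_name: str, width: int) -> Path:
        if width not in ALLOWED_WIDTHS:
            raise ValueError("unsupported size")      # keeps the cache bounded
        variant = VARIANTS / f"{width}_{image_name}"
        if variant.exists():                          # generated on an earlier request
            return variant
        VARIANTS.mkdir(parents=True, exist_ok=True)
        with Image.open(ORIGINALS / image_name) as img:
            height = round(img.height * width / img.width)
            img.resize((width, height)).save(variant)
        return variant

    # e.g. serve get_variant("cat.jpg", 600) from the request handler

In the worst case every whitelisted size eventually gets generated and you land exactly on the option 2 layout, as the answer notes.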
