How do I extract text and formatting from a ppt file in python? - python-3.x

I'm trying to write a context-sensitive text parser in Python. To do that, I need to be able to open ppt files and extract both the text and information about how it was formatted. I need to be able to tell whether a sentence was in a header or whether it's bold, for instance.
It's supposed to run on large batches of files, so manually converting all the ppt files to pptx isn't practical.
I tried Tika, but it doesn't give formatting information.
I tried python-pptx, but it doesn't seem to be able to open ppt files.
I'm also hoping to make the parser OS-agnostic, so the command-line converters I've seen proposed on other variants of this question won't work for me unless they somehow work on Linux, Mac, and Windows.

Related

Update linked excel path in PowerPoint via Python

I want to automate the creation of a PowerPoint presentation by linking template charts to some Excel files, so that updating the Excel file values changes the PowerPoint slides automatically. I have created my PowerPoint template and linked its charts to sample Excel data files.
I want to send the folder with the PowerPoint and Excel files to someone else, but this will break the links to the Excel files because the paths change (the paths are not relative). I can edit the paths manually with the "Edit Links to Files" option under the File menu, but this is tedious because there are numerous charts linked to multiple files.
I want to update the paths with Python code using the python-pptx package.
Please help!
There's no API support for this in the current version of python-pptx.
You would need to modify the underlying XML directly, perhaps using python-pptx internals as a starting point and using lxml calls on the appropriate element objects. If you search on "python-pptx workaround function" you will find some examples.
Another thing to consider is modifying the XML by cruder but still possibly effective means by accessing the XML files in the .pptx package directly (the .pptx file is a Zip archive of largely XML files) and using regular expressions or perhaps a command line tool like sed or awk to do simple text substitution.
Either way you're going to need to want it pretty badly, depending on your Python skill level. You'll also of course need to discover just which strings in which parts of the XML are the ones that need changing. opc-diag can be helpful for that, but it's a bit of detective work even with the best tools.
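As a rough illustration of the Zip-level text substitution described above, here is a minimal Python sketch using only the standard library. It assumes the linked Excel paths appear as plain text in the package's relationship (`.rels`) parts; that is an assumption to verify for your own files (for example with opc-diag), since the paths may be URL-encoded or XML-escaped. The function name `rewrite_link_targets` and its parameters are made up for this example and are not part of python-pptx.

```python
import zipfile


def rewrite_link_targets(src_pptx, dst_pptx, old_prefix, new_prefix):
    """Copy src_pptx to dst_pptx, replacing old_prefix with new_prefix
    in every relationship (.rels) part. Returns the number of parts changed.

    Assumption: the link paths are stored as literal text in .rels parts.
    """
    changed = 0
    with zipfile.ZipFile(src_pptx) as zin:
        with zipfile.ZipFile(dst_pptx, "w", zipfile.ZIP_DEFLATED) as zout:
            for item in zin.infolist():
                data = zin.read(item.filename)
                if item.filename.endswith(".rels"):
                    text = data.decode("utf-8")
                    new_text = text.replace(old_prefix, new_prefix)
                    if new_text != text:
                        changed += 1
                        data = new_text.encode("utf-8")
                # Every other part is copied through unchanged.
                zout.writestr(item, data)
    return changed
```

Writing to a new file rather than editing in place keeps the original intact in case the substitution turns out to hit the wrong strings.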

CSV to Excel Doc?

I was curious what the community thinks is the easiest way to take a CSV file and 'save as' an Excel document with only a couple of formulas pasted in.
I am trying to do this behind the scenes, without physically navigating (opening the file, selecting Save As, etc.). Even though that is already VERY simple, I need to do this in code (think automation).
Background: I have a C++ command-line program generating the .csv, and a C# GUI starting this process. Either program could hold the code, but I figure this is easiest in C# (Interop?). The reasons I don't write the formulas directly into the CSV are that the many comma characters in them would mess up the CSV, and that other Excel documents need to reference the sheets, so they need to be in .xls format.
=AVERAGE(C2:C999)
=COUNTIFS(C:C,">0",C:C,"<31")
=COUNTIFS(C:C,">31",C:C,"<55")
=COUNTIF(C:C,">55")
Have a look and see whether command-line scripting of OpenOffice will do the job. It can do quite a lot of conversions very easily. Otherwise there are a lot of Excel-producing libraries, for example PHPExcel, but you'd need to wrap some programming around them.
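On the question's worry that commas inside the formulas would break the CSV: standard CSV quoting wraps such fields in double quotes and doubles any embedded quotes, so the commas survive a round trip. The sketch below shows this in Python purely for illustration (the same quoting rule applies from C++ or C#). The helper name `rows_to_csv` and the row labels are made up for the example. Whether the spreadsheet application then evaluates a cell beginning with `=` as a formula on import is an assumption to check, and this does not by itself produce an .xls file.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize rows to CSV text; fields containing commas or quotes
    are quoted automatically by the csv module."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


# Formulas taken from the question; the labels in the first column are illustrative.
rows = [
    ["metric", "formula"],
    ["average", "=AVERAGE(C2:C999)"],
    ["1 to 30", '=COUNTIFS(C:C,">0",C:C,"<31")'],
]
text = rows_to_csv(rows)

# Reading the text back yields the original fields, commas included.
assert list(csv.reader(io.StringIO(text))) == rows
```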

make swf from fla without ever opening it

Is it possible to change text and images in a fla file without ever opening it up, and then make the swf via the command line? I want to make a Flash template and save the fla, then be able to update my text and image names and convert it to swf. I have one template but tons of different text options and background images. It would be nice to be able to copy the master.fla twenty times, just change the source (I will do this from the command line), and then convert to swf (also via the command line).
Any help would be appreciated.
With CS5, you can do half of what you're asking today, by using the XFL file format instead of FLA. Instead of a binary blob, you get an editable XML file and a tree of separate asset files: PNGs, AS3 files, etc. You can then modify the XML or AS3 files programmatically to get your variants.
(A CS5 FLA file is really just a zipped up version of the XFL, but there's no advantage to using that instead of an XFL. In CS4 and previous, FLA was a proprietary binary format.)
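A minimal sketch of generating variants from an uncompressed XFL template by modifying its XML programmatically, in Python with the standard library only. It assumes you have put placeholder strings such as `{{TITLE}}` into the template's XML yourself; the placeholder syntax and the function name `make_variant` are hypothetical, and nothing here is specific to Adobe's file layout beyond "a folder containing XML files".

```python
import shutil
from pathlib import Path
from xml.sax.saxutils import escape


def make_variant(template_dir, out_dir, replacements):
    """Copy an XFL template folder to out_dir and substitute placeholder
    strings in every .xml file. Replacement values are XML-escaped."""
    out = Path(out_dir)
    shutil.copytree(template_dir, out)  # out_dir must not exist yet
    for xml_path in out.rglob("*.xml"):
        text = xml_path.read_text(encoding="utf-8")
        for placeholder, value in replacements.items():
            text = text.replace(placeholder, escape(value))
        xml_path.write_text(text, encoding="utf-8")
    return out
```

Calling it once per text/image combination, for example `make_variant("master_xfl", "variant_01", {"{{TITLE}}": "Summer Sale"})`, gives one folder per variant. Compiling each folder to SWF is still a separate step.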
The missing piece is an XFL compiler. Adobe currently provides no such thing, and the third party market hasn't yet produced one.
You could use a systems automation tool to drive the Flash Professional environment through the compilation steps. On OS X, for example, either Automator or AppleScript should be able to do what you want. It'll just have more overhead than the command line compiler you were hoping for.
I agree with Jason, there are a lot of alternatives to what you suggest. Keeping content out of the SWF is good practice actually. This is a good way to avoid large files!
Depending on what you're looking to achieve, there are a lot of solutions available. XML is an option, JSON another.
If you're looking to build a template, any of the above would seem appropriate.
It sounds like you're working from the Flash IDE. As Jason suggests, you may want to have a look at another IDE, such as FlashDevelop, FDT or FlashBuilder, as they make coding with AS3 a lot easier.

Batch convert xls-Files to csv

I need to convert over 100 Excel files to CSV. Worse, these files consist of multiple sheets and I only need one of them.
At first I stumbled upon the Perl program xls2csv. Luckily, at the bottom of the page on XLS file conversion, I even found a convenient script that converts all sheets into separate CSV files. Unfortunately, this converter is broken and skips lines.
I also tried pyodconverter, but that only converts the first sheet.
Any suggestions? It would be OK if the conversion had to be done on Windows, though I would really prefer Linux. And if it has to be Windows, it would be nice if it didn't need an Excel installation.
There's a very useful Java library called Apache POI at http://poi.apache.org/
The following link provides an example application that converts xls to csv.
http://svn.apache.org/repos/asf/poi/trunk/src/examples/src/org/apache/poi/hssf/eventusermodel/examples/XLS2CSVmra.java
If you know Java you can adjust it to your needs. Since it's Java, it also runs on Linux.
You could also have a look at StatTransfer... (Win only, I'm afraid)
I know this is late but there is actually an HTA (HTML Application) which can do this. The details and download link can be found here.

Converting troublesome delimited PDF to Excel

I'm trying to convert this delimited PDF to Excel (or some other delimited format). Using Adobe Acrobat 9, I attempt to save it (and copy it) as Excel, but it gives the error message "BAD PDF; error in processing fonts. [348]".
I'm open to any solution that will create a delimited file, ranging from using Adobe Acrobat, to programming to using other apps. The only limitation is that I don't have a budget to buy anything (such as Able2Extract).
This is how I was able to export my images and fonts without buying any extra software to do the conversion: go to Advanced, PDF Optimizer, select all of the options you want in the left column, and where it says "Make compatible with" select "Acrobat 8.0 and later", then click OK. You are on your way to success.
Note: not really an answer, but some suggestions.
It sounds to me like Crystal Reports is not following the PDF spec closely enough.
I'd make sure CR is fully updated/patched and try generating another file, making sure that "tagging" is enabled (tagging defines the layout structure). I don't have a copy of CR handy, but you may have to define a distiller template to use, so that when you print to PDF you can select that job option.
You can also tell it's a bad PDF by using Preflight in Acrobat: it says there is no tag structure and that you can add it manually (draw boxes around each item...). It also says that there is no language set, and that the file is somehow compatible with Acrobat 1.3? That isn't supported anymore, and it should be 4 at the lowest?
Once you have a "good" PDF, you can export to XML/Word and import that into Excel. Also, with Acrobat 8+ you can highlight using the select tool, right-click and choose Open As SpreadSheet. You might be able to get away with just highlighting the whole document -- though I'd hope the XML approach would be best.
Able2Extract does some OCRing and tricky fuzzy logic not only to define tags/layout so it is exportable, but also avoids any font, encoding, etc issue - at least to my knowledge.
In the rare case that you can't get a new file, exporting to plain text/accessible text seems to generate a nice flat text file. You could write a VBScript to parse it (adding your delimiter) and import that into Excel.
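The parse-and-add-a-delimiter step could look something like the following, sketched here in Python rather than VBScript. It assumes the exported text separates columns with runs of two or more spaces and that single spaces occur only inside a field; that is a guess about the export's layout and may need adjusting for the real file. The function names are made up for the example.

```python
import csv
import io
import re


def text_to_rows(text, gap=r"\s{2,}"):
    """Split each non-blank line into fields on runs of 2+ whitespace
    characters (assumed column separator)."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            rows.append(re.split(gap, line))
    return rows


def rows_to_csv(rows):
    """Write the rows as comma-delimited text that Excel can import."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()
```

For example, `rows_to_csv(text_to_rows(open("export.txt").read()))` would turn a hypothetical `export.txt` into CSV text ready to save and open in Excel.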
