Extract HTML From PowerPoint (So How Did I Do It?)
Earlier today I blogged a quick demo showing an original PowerPoint file and a dynamic Breeze/Connect (I forget the latest term) version of that same presentation. Granted, the post was a bit cryptic, but Ray dared me to do it and hey, it was fun :)
In case you missed that post, here are the before/after demos again:
So the question is, how did I do it? Before I answer, let me state that yes, that was/is a real on-the-fly conversion. It was not a copy/paste or some sort of spoof.
The answer: I did the conversion using the POI-HSLF library, with help from Mark Mandel's Java Loader. I've wrapped the functionality in a quick CFC (attached below).
The cool things about this package in my opinion:
- Can extract raw text from PowerPoint (Think indexing...)
- Can extract HTML formatted text (via some creativity)
- That HTML can then be used (as shown in the demo earlier) to create a dynamic SWF presentation
My package has one function exposed - convertPowerPoint(). You pass it the path to your PowerPoint (optional 2nd argument for 'textOnly') and it returns an Array of Structures with each element representing a slide in your PPT.
Here's a little taste of how easy it is. Want to convert to HTML?
<cfset pptFile = expandPath("slide_show.ppt") />
<!--- create the pptutils object --->
<cfset pptutils = createObject("component", "pptutils") />
<!--- get the ppt in html format --->
<cfset pptHTML = pptutils.convertPowerPoint(pathToPPT=pptFile) />
Want to use that HTML for a cfpresentation?
<cfoutput>
<cfpresentation title="demo showing a powerpoint presented as a cfpresentation">
<cfloop array="#pptHTML#" index="slide">
<cfpresentationslide title="#slide.slideTitle#" >
#slide.slideHTML#
</cfpresentationslide>
</cfloop>
</cfpresentation>
</cfoutput>
Just want the text? Pass a boolean true as the 2nd argument (textOnly), e.g. pptutils.convertPowerPoint(pptFile, true).
So that's all for now. I'd like to experiment more - especially with more complex presentations. As I said above, HSLF is not the friendliest when it comes to the style elements. I did my best to determine some basic stuff - the more advanced stuff (like images, shapes, etc) will probably prove to be very difficult (from what I've read about HSLF).



Does this require PowerPoint to be installed to work?
Peter: Nope.
Frank: Right - as I said there are some limitations. There is a *way* to extract imgs/shapes - I just haven't figured out the implementation yet. Stay tuned.
The company I work for just made a flash/powerpoint-ish type app so anytime I see those two words together it catches my eye. :) Anyway - we had some success with actually taking the powerpoint, converting it to the .pptx format and extracting out the xml data from there - at this point we were able to grab text, images, and I even wrote a geometry class to draw some basic shapes - if anybody is looking to get data from office docs this is certainly the way to go! At least it worked for us...
Although I guess you trade the horrors of HSLF for the horrors of OOXML!
What did you use to convert to pptx and extract the XML?
Also, not that it's a showstopper, but Office docs didn't use XML format until what, Office 2000?
It's not public at this point but will be in the near future. :(
MS makes an "Office 2007 migration kit" which includes a little app called ofc.exe that you can use to convert most any Office file to 2007 format. Since the pptx is actually just a zip file, we dig right in and start piecing the thing together from there. At first it's a little overwhelming, but you start seeing chunks of your ppt and you can start seeing how it all relates.
Only issue - has to run on a windows box so the little ofc.exe can be called...
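For anyone wondering what "digging right in" to a pptx might look like, here is a minimal sketch in Python. It is not the code described above (that isn't public), just an illustration under a few assumptions: slides live at ppt/slides/slideN.xml inside the zip, and visible text sits in DrawingML <a:t> elements. The helper name extract_slide_text is made up for this example. It sorts by the number in the file name, which only approximates the real slide order (that is defined in ppt/presentation.xml).

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace; text runs in slide XML are <a:t> elements.
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
# Matches slide parts only, not their _rels companions.
SLIDE_PATH = re.compile(r"ppt/slides/slide(\d+)\.xml$")


def extract_slide_text(pptx_file):
    """Return a list of (slide_number, [text runs]) from a .pptx archive.

    Hypothetical helper for illustration; pptx_file may be a path
    or a file-like object.
    """
    slides = []
    with zipfile.ZipFile(pptx_file) as archive:
        for name in archive.namelist():
            match = SLIDE_PATH.match(name)
            if not match:
                continue
            root = ET.fromstring(archive.read(name))
            runs = [node.text or "" for node in root.iter(A_NS + "t")]
            slides.append((int(match.group(1)), runs))
    # Sort numerically so slide10 comes after slide9.
    slides.sort(key=lambda item: item[0])
    return slides
```

Calling extract_slide_text("slide_show.pptx") on a converted file would give one entry per slide. Images and shapes are stored in other parts of the archive and are not touched here.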
Links:
http://msdn.microsoft.com/msdnmag/issues/06/11/Bas...
http://msdn2.microsoft.com/en-us/library/aa982683....
http://cfsilence.com/blog/client/page.cfm/pptutils...