Extract HTML From PowerPoint (So How Did I Do It?)

Posted By : todd sharp Posted At : December 12, 2007 12:20 AM Posted In: ColdFusion


Earlier today I blogged a quick demo showing an original PowerPoint file and a dynamic Breeze/Connect (I forget the latest term) version of that same presentation. Granted, the post was a bit cryptic - but Ray dared me to do it and hey, it was fun :)

In case you missed that post, here are the before/after demos again:

Before

After

So the question is, how did I do it? Before I answer, let me state that yes, that was/is a real on-the-fly conversion. It was not a copy/paste or some sort of spoof. The answer: I did the conversion using the POI-HSLF library (the PowerPoint part of Apache POI) with the help of Mark Mandel's JavaLoader. I've wrapped the functionality in a quick CFC (attached below). The cool things about this package, in my opinion:
  • Can extract raw text from PowerPoint (Think indexing...)
  • Can extract HTML formatted text (via some creativity)
  • That HTML can then be used (as shown in the demo earlier) to create a dynamic SWF presentation
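For anyone curious how the two pieces fit together, here is a minimal sketch of loading HSLF through JavaLoader and pulling raw text out of a file. It is not the code from the attached CFC. The jar locations are hypothetical, and it assumes JavaLoader is installed under a "javaloader" folder and that the POI jars are a 3.x-era release that includes HSLF's PowerPointExtractor class.

```cfml
<!--- hypothetical jar locations: point these at wherever your POI jars live --->
<cfset jarPaths = arrayNew(1) />
<cfset arrayAppend(jarPaths, expandPath("lib/poi.jar")) />
<cfset arrayAppend(jarPaths, expandPath("lib/poi-scratchpad.jar")) />
<!--- JavaLoader loads the jars at runtime, so the server classpath is untouched --->
<cfset loader = createObject("component", "javaloader.JavaLoader").init(jarPaths) />
<!--- PowerPointExtractor is HSLF's plain-text extractor; here it is given a file path --->
<cfset extractor = loader.create("org.apache.poi.hslf.extractor.PowerPointExtractor").init(expandPath("slide_show.ppt")) />
<!--- getText() returns the text of all slides as one string --->
<cfset rawText = extractor.getText() />
<cfoutput><pre>#htmlEditFormat(rawText)#</pre></cfoutput>
```

This only covers plain text. Getting HTML formatted output means walking the slides and their text runs, which is where the "creativity" comes in.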
My package has one function exposed - convertPowerPoint(). You pass it the path to your PowerPoint (optional 2nd argument for 'textOnly') and it returns an Array of Structures with each element representing a slide in your PPT. Here's a little taste of how easy it is. Want to convert to HTML?
<!--- get the path to the ppt --->
<cfset pptFile = expandPath("slide_show.ppt") />
<!--- create the pptutils object --->
<cfset pptutils = createObject("component", "pptutils") />
<!--- get the ppt in html format --->
<cfset pptHTML = pptutils.convertPowerPoint(pathToPPT=pptFile) />
Want to use that HTML for a cfpresentation?
<cfoutput>
    <cfpresentation title="demo showing a powerpoint presented as a cfpresentation">
        <cfloop array="#pptHTML#" index="slide">
            <cfpresentationslide title="#slide.slideTitle#" >
                #slide.slideHTML#
            </cfpresentationslide>
        </cfloop>
    </cfpresentation>
</cfoutput>
Just want the text? Pass a boolean 'true' in the 2nd position.

So that's all for now. I'd like to experiment more - especially with more complex presentations. As I said above, HSLF is not the friendliest when it comes to the style elements. I did my best to determine some basic stuff - the more advanced stuff (like images, shapes, etc.) will probably prove to be very difficult (from what I've read about HSLF).
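For the text-only option mentioned above, a call might look like the following. This is a minimal sketch: it assumes the second argument is passed positionally, as described, and it simply dumps the result because the keys of the text-only structures aren't listed in this post.

```cfml
<!--- get the path to the ppt --->
<cfset pptFile = expandPath("slide_show.ppt") />
<!--- create the pptutils object --->
<cfset pptutils = createObject("component", "pptutils") />
<!--- pass true in the 2nd position to ask for plain text instead of HTML --->
<cfset pptText = pptutils.convertPowerPoint(pptFile, true) />
<!--- inspect the returned array of structures (one element per slide) --->
<cfdump var="#pptText#" />
```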


Comments (12)

Brian Meloche:
Very cool, Todd. I call first dibs on you doing a presentation of your use of cfpresentation!

Raul Riera:
Can this be used to extract the content of a .doc file (fully formatted)?

Peter Boughton:
Cool.

Does this require PowerPoint installed to work?

Frank Gerritse:
Nice ... but when I tried this code it did not show any pictures I used in the ppt ... ?

todd sharp:
Raul: There is a library for working with Word stuff. I hope to eventually check it out.

Peter: Nope.

Frank: Right - as I said there are some limitations. There is a *way* to extract imgs/shapes - I just haven't figured out the implementation yet. Stay tuned.

paul mignard:
Hi Todd,

The company I work for just made a Flash/PowerPoint-ish type app, so anytime I see those two words together it catches my eye. :) Anyway - we had some success with actually taking the PowerPoint, converting it to the .pptx format and extracting the XML data from there. At this point we were able to grab text and images, and I even wrote a geometry class to draw some basic shapes. If anybody is looking to get data from Office docs, this is certainly the way to go! At least it worked for us...

Although I guess you trade the horrors of HSLF for the horrors of OOXML!

todd sharp:
Paul: Is that app public? I'd like to see what you guys did.

What did you use to convert to pptx and extract the XML?

Also, not that it's a showstopper, but Office docs didn't use XML format until what, Office 2000?

bash:
Subscribed

paul mignard:
Hey Todd,

It's not public at this point but will be in the near future. :(

MS makes an "Office 2007 migration kit" which includes a little app called ofc.exe that you can use to convert most any Office file to 2007 format. Since the pptx is actually just a zip file, we dig right in and start piecing the thing together from there. At first it's a little overwhelming, but you start seeing chunks of your ppt and how it all relates.

Only issue - it has to run on a Windows box so the little ofc.exe can be called...

Scott P:
Todd - expanding on the previous comment - all of the new Office 2007 files are actually zip files. You can either rename them to .zip or open them with your zip manager program.

Links:
http://msdn.microsoft.com/msdnmag/issues/06/11/Bas...

http://msdn2.microsoft.com/en-us/library/aa982683....

Sean:
Very nice. I'm hoping to find a way to display inside cfpresentation tag.