Extract HTML From PowerPoint (So How Did I Do It?)

Earlier today I blogged a quick demo showing an original PowerPoint file and a dynamic Breeze/Connect (I forget the latest term) version of that same presentation. Granted, the post was a bit cryptic - but Ray dared me to do it and hey, it was fun :)

In case you missed that post, here are the before/after demos again:

Before

After

So the question is, how did I do it? Before I answer, let me state that yes, that was/is a real on-the-fly conversion. It was not a copy/paste or some sort of spoof.

The answer: I did the conversion with the POI-HSLF library, loaded with the help of Mark Mandel's JavaLoader. I've wrapped the functionality in a quick CFC (attached below).
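For the curious, here is a rough sketch of what the POI-HSLF plus JavaLoader combination can look like. It is not the code from the attached CFC. It assumes the POI 3.x-era HSLF API (org.apache.poi.hslf.usermodel.SlideShow with getSlides(), getTextRuns() and getText()), a JavaLoader install reachable as javaloader.JavaLoader, and hypothetical jar locations under lib/. It only pulls plain text, not the HTML formatting.

```cfml
<!--- hypothetical jar locations; point these at your own copies of the POI jars --->
<cfset jarPaths = [ expandPath("lib/poi.jar"), expandPath("lib/poi-scratchpad.jar") ] />
<!--- JavaLoader loads the jars without touching the server classpath --->
<cfset loader = createObject("component", "javaloader.JavaLoader").init(jarPaths) />
<!--- open the presentation (assumes the POI 3.x-era SlideShow(InputStream) constructor) --->
<cfset fis = createObject("java", "java.io.FileInputStream").init(expandPath("slide_show.ppt")) />
<cfset slideShow = loader.create("org.apache.poi.hslf.usermodel.SlideShow").init(fis) />
<cfset fis.close() />
<!--- getSlides() returns a Java array, so walk it with a 1-based index loop --->
<cfset slides = slideShow.getSlides() />
<cfoutput>
   <cfloop from="1" to="#arrayLen(slides)#" index="i">
      <h3>Slide #i#</h3>
      <!--- each text run is one block of text on the slide --->
      <cfset runs = slides[i].getTextRuns() />
      <cfloop from="1" to="#arrayLen(runs)#" index="j">
         <p>#htmlEditFormat(runs[j].getText())#</p>
      </cfloop>
   </cfloop>
</cfoutput>
```

Class and method names changed in later POI releases, so treat the names above as tied to the older API.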

The cool things about this package in my opinion:

  • Can extract raw text from PowerPoint (Think indexing...)
  • Can extract HTML formatted text (via some creativity)
  • That HTML can then be used (as shown in the demo earlier) to create a dynamic SWF presentation

My package has one function exposed - convertPowerPoint(). You pass it the path to your PowerPoint (optional 2nd argument for 'textOnly') and it returns an Array of Structures with each element representing a slide in your PPT.

Here's a little taste of how easy it is. Want to convert to HTML?

<!--- get the path to the ppt --->
<cfset pptFile = expandPath("slide_show.ppt") />
<!--- create the pptutils object --->
<cfset pptutils = createObject("component", "pptutils") />
<!--- get the ppt in html format --->
<cfset pptHTML = pptutils.convertPowerPoint(pathToPPT=pptFile) />

Want to use that HTML for a cfpresentation?

<cfoutput>
   <cfpresentation title="demo showing a powerpoint presented as a cfpresentation">
      <cfloop array="#pptHTML#" index="slide">
         <cfpresentationslide title="#slide.slideTitle#" >
            #slide.slideHTML#
         </cfpresentationslide>
      </cfloop>
   </cfpresentation>
</cfoutput>

Just want the text? Pass a boolean 'true' in the 2nd position.
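A minimal sketch of the text-only call, reusing the slide_show.ppt example from above. The keys of each returned structure in text-only mode are not spelled out in this post, so the cfdump is there to show whatever comes back rather than assume key names.

```cfml
<!--- get the path to the ppt --->
<cfset pptFile = expandPath("slide_show.ppt") />
<!--- create the pptutils object --->
<cfset pptutils = createObject("component", "pptutils") />
<!--- pass true in the 2nd position to get plain text instead of HTML --->
<cfset pptText = pptutils.convertPowerPoint(pptFile, true) />
<!--- inspect the array of slide structures before relying on any key names --->
<cfdump var="#pptText#" />
```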

So that's all for now. I'd like to experiment more - especially with more complex presentations. HSLF is not the friendliest when it comes to style elements. I did my best to work out the basics - the more advanced stuff (images, shapes, etc.) will probably prove very difficult, from what I've read about HSLF.


Comments
Very cool, Todd. I call first dibs on you doing a presentation of your use of cfpresentation!
# Posted By Brian Meloche | 12/12/07 2:44 AM
Can this be used to extract the content of a .doc file (fully formatted?)
# Posted By Raul Riera | 12/12/07 3:04 AM
Cool.

Does this require Powerpoint installed to work?
# Posted By Peter Boughton | 12/12/07 7:31 AM
Nice ... but when I tried this code it did not show any pictures I used in the ppt ... ?
# Posted By Frank Gerritse | 12/12/07 8:38 AM
Raul: There is a library for working with Word stuff. I hope to eventually check it out.

Peter: Nope.

Frank: Right - as I said there are some limitations. There is a *way* to extract imgs/shapes - I just haven't figured out the implementation yet. Stay tuned.
# Posted By todd sharp | 12/12/07 9:15 AM
Hi Todd,

The company I work for just made a flash/powerpoint-ish type app so anytime I see those two words together it catches my eye. :) Anyway - we had some success with actually taking the powerpoint, converting it to the .pptx format and extracting out the xml data from there - at this point we were able to grab text, images, and I even wrote a geometry class to draw some basic shapes - if anybody is looking to get data from office docs this is certainly the way to go! At least it worked for us...

Although I guess you trade the horrors of HSLF for the horrors of OOXML!
# Posted By paul mignard | 12/12/07 11:01 AM
Paul: Is that app public? I'd like to see what you guys did.

What did you use to convert to pptx and extract the XML?

Also, not that it's a showstopper, but Office docs didn't use XML format until what, Office 2000?
# Posted By todd sharp | 12/12/07 11:51 AM
Subscribed
# Posted By bash | 12/12/07 1:31 PM
Hey Todd,

It's not public at this point but will be in the near future. :(

MS makes an "Office 2007 migration kit" which includes a little app called ofc.exe that you can use to convert most any Office file to 2007 format. Since the pptx is actually just a zip file, we dig right in and start piecing the thing together from there. At first it's a little overwhelming, but you start seeing chunks of your ppt and how it all relates.

Only issue - has to run on a windows box so the little ofc.exe can be called...
# Posted By paul mignard | 12/12/07 5:33 PM
Todd - expanding on the previous comment - all of the new Office 2007 files are actually zip files. You can either rename them to .zip or open them with your zip manager program.

Links:
http://msdn.microsoft.com/msdnmag/issues/06/11/Bas...

http://msdn2.microsoft.com/en-us/library/aa982683....
# Posted By Scott P | 12/12/07 7:57 PM
Very nice. I'm hoping to find a way to display inside cfpresentation tag.
# Posted By Sean | 2/20/08 9:45 PM
