A real-world example
 
   You should see TWO images here. Both are linked to the same image, but using two URL-resolution methods: purely relative (src='images/img.png') and root-relative (src='/images/img.png').
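The difference between the two forms can be sketched in Python with the standard urllib.parse.urljoin. The example.com addresses and the resolve_both helper are made up for illustration. The two src values only land on the same file when the page itself sits at the site root:

```python
from urllib.parse import urljoin

def resolve_both(page_url):
    """Resolve the relative and root-relative src values against page_url."""
    relative = urljoin(page_url, "images/img.png")
    root_relative = urljoin(page_url, "/images/img.png")
    return relative, root_relative

# Page at the site root: both forms resolve to the same address.
print(resolve_both("http://example.com/page.html"))

# Page in a subdirectory: the two forms now resolve to different addresses.
print(resolve_both("http://example.com/docs/page.html"))
```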
   
  
All the CODE in this page is taken verbatim from a live Drupal4 site.
Only the content has been scrambled to protect the innocent.
The structure was left mostly intact to throw a few different layout challenges at the import process.
Normal HTML layout stuff includes:
- Lists and things
- Embedded images
- Subsections and subheadings
And often navigation and cross-references
This stand-alone example will not include the referenced files, of course, BUT:
- When an import process is run
- It will rewrite links so that the related pages can still be found
- Images come along too, although links to them may optionally be rewritten differently.
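As a rough idea of what such rewriting could look like, here is a minimal Python sketch. It is not the importer's actual code: rewrite_links and both prefix arguments are hypothetical names, and a regular expression stands in for real HTML parsing. Local href values get one prefix and local src values get another, while external links and in-page anchors are left alone:

```python
import re
from urllib.parse import urlparse

# Matches src="..." or href='...' with either quote style.
ATTR = re.compile(r'''\b(src|href)=(["'])(.*?)\2''', re.IGNORECASE)

def rewrite_links(html, link_prefix, image_prefix):
    """Prefix local links; hypothetical helper, assumptions as described above."""
    def fix(match):
        attr, quote, url = match.group(1), match.group(2), match.group(3)
        parsed = urlparse(url)
        # Leave absolute URLs (http:, mailto:, //host/...) and #anchors untouched.
        if parsed.scheme or parsed.netloc or url.startswith("#"):
            return match.group(0)
        # Images (src) may be rewritten differently from page links (href).
        prefix = image_prefix if attr.lower() == "src" else link_prefix
        return f"{attr}={quote}{prefix}{url.lstrip('/')}{quote}"
    return ATTR.sub(fix, html)

page = '<a href="about.html">About</a> <img src="/images/img.png">'
print(rewrite_links(page, "/imported/", "/files/"))
```

Using separate prefixes mirrors the point above that image links may be treated differently from page links.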
The template to use on this input is the generic catch-all html2simplehtml.xsl file supplied in the distribution. This template is fairly complex, with a few alternative switches built in to make the best of whatever is thrown at it. For this reason it's not the best one to learn from at first, although it does illustrate a few ways of solving problems encountered in page parsing.
THIS content was also basically valid,
BUT most input from unknown sources needs to be run through tidy before we can trust the XSL process on it.
To be honest - there was one set of invalid tags :(
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" >
(and the other meta and rel link tags in the header) had to be repaired into true XHTML with a self-closing singleton tag.
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
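For illustration only, the same repair can be sketched in Python with a regular expression. This is an assumption about how one might do it by hand, not what tidy does internally, and close_singletons is a hypothetical name. It only handles meta and link tags and will trip over attribute values that contain a ">" character:

```python
import re

# A meta or link tag, with or without a trailing slash before the ">".
VOID_TAG = re.compile(r'<(meta|link)\b([^<>]*?)\s*/?>', re.IGNORECASE)

def close_singletons(html):
    """Rewrite <meta ...> and <link ...> as self-closing XHTML tags."""
    return VOID_TAG.sub(lambda m: f"<{m.group(1)}{m.group(2)} />", html)

broken = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" >'
print(close_singletons(broken))
```

Running it on a tag that is already closed leaves that tag unchanged, so the function is safe to apply twice.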
Damn. Validation is hard.
Alternative tags
As this file ALSO includes a comment saying <!-- end contentbody --> it would have been possible to use a regexp or text markers to find the content. But that's old-school.
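For the curious, the old-school approach might look something like this in Python. Only the end comment is known to exist in this file. The matching <!-- begin contentbody --> marker is an assumption made for the sketch, and extract_content is a hypothetical name:

```python
import re

# The "begin" marker is assumed; only the "end" marker is confirmed in the page.
CONTENT = re.compile(
    r'<!--\s*begin contentbody\s*-->(.*?)<!--\s*end contentbody\s*-->',
    re.DOTALL,
)

def extract_content(html):
    """Return the text between the two comment markers, or None if absent."""
    match = CONTENT.search(html)
    return match.group(1).strip() if match else None

page = (
    '<div>nav</div><!-- begin contentbody --><p>Hello</p>\n'
    '<!-- end contentbody --><div>footer</div>'
)
print(extract_content(page))
```

This works without valid markup, which is its appeal, but it breaks as soon as the site theme changes or drops its comments.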
OK, that's enough random waffle. The page content is now representatively replaced.

