Norm Walsh wrote an article on why escaping markup, particularly in RSS feeds (mostly because this format suffers from that abuse more than any other), is a Bad Thing.
Norm, I agree that there must be a stop to escaped markup. You’ve made some points, however, that I have to take issue with. In point 1b, you write:
There are better ways of escaping content. First of all, if the content you encounter is well formed XML, no escaping is necessary. If it isn’t well formed XML, then it must be HTML. No application is allowed to accept a document that purports to be XML but is not well formed. There are well understood ways to turn HTML into XHTML (or well formed XML). I’d even prefer stripping all the markup entirely to this escaped markup “solution”.
Norm, have you given no thought to legacy HTML? XHTML is a relative newcomer, and many people aren’t that familiar with what needs to be done to make their markup compliant with the standard. Also, updating legacy markup so that it’s compliant is a lot of work for no visible gain (the RSS feeds excepted).
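To make the legacy HTML problem concrete, here is a minimal sketch using Python's standard library. The `is_well_formed` helper and the sample fragments are my own illustrations, not anything from Norm's article. It wraps a fragment in a `description` element and asks an XML parser whether the result is well formed.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as XML once wrapped in a root element.

    Hypothetical helper for illustration; a real feed tool would also have to
    deal with HTML entities such as &nbsp;, which plain XML does not define.
    """
    try:
        ET.fromstring(f"<description>{fragment}</description>")
        return True
    except ET.ParseError:
        return False


# Typical legacy HTML: unclosed <p> and a bare <br>.
legacy_html = "<p>First paragraph<br>Second line<p>Next paragraph"
# The same content after an XHTML cleanup.
xhtml = "<p>First paragraph<br/>Second line</p><p>Next paragraph</p>"

print(is_well_formed(legacy_html))  # False
print(is_well_formed(xhtml))        # True
```

Any browser will happily render the first fragment, which is exactly why so few people have a reason to fix it.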
Stripping out HTML markup is likewise not a great answer. If someone designs a feed so that multiple paragraphs of information are sent my way, that is their prerogative, and it may very well be done that way because their visitors demand it. If you strip out the markup, the content will be too difficult to read, and feed viewers will blame the feed provider, not the software.
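A rough sketch of what naive stripping does to a multi-paragraph description, again with Python's standard library. The `TextOnly` class and `strip_markup` function are hypothetical names for illustration. A smarter stripper could insert line breaks at block elements, but at that point it is interpreting the markup, not stripping it.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects character data and silently drops every tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_markup(html: str) -> str:
    """Return the text content of an HTML fragment with all tags removed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


description = "<p>First paragraph.</p><p>Second paragraph.</p>"
print(strip_markup(description))  # First paragraph.Second paragraph.
```

The paragraph boundary is gone, and with it the readability the feed author intended.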
The idea of base64 encoding content is very interesting in theory. In reality, the people who construct RSS feeds by hand (and I’m sure there are more around than either of us suspect) would find that requirement a barrier too high to surmount.
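For a sense of what that requirement looks like, here is a small sketch, assuming the description simply carries the base64 text of the HTML payload. The element layout and the function names are my assumptions for illustration, not part of any RSS specification.

```python
import base64
import xml.etree.ElementTree as ET


def encode_item(html: str) -> str:
    """Build an <item> whose <description> holds the base64 form of the HTML."""
    item = ET.Element("item")
    description = ET.SubElement(item, "description")
    description.text = base64.b64encode(html.encode("utf-8")).decode("ascii")
    return ET.tostring(item, encoding="unicode")


def decode_item(xml_text: str) -> str:
    """Recover the original HTML from an item produced by encode_item."""
    encoded = ET.fromstring(xml_text).findtext("description")
    return base64.b64decode(encoded).decode("utf-8")


html = "<p>Unclosed paragraph<br>with legacy markup"
xml_text = encode_item(html)

print(xml_text)                       # well-formed XML, opaque payload
print(decode_item(xml_text) == html)  # True
```

The round trip is lossless and the feed stays well formed, but nobody can type or proofread that payload in a text editor.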
My personal feeling is that people who need to mark up the copy in the description tag, but can’t guarantee well-formedness, shouldn’t use an XML-based RSS format for transport. Rather, I’d suggest a YAML-formatted RSS feed. When viewing source, HTML and XML are too much alike for comfort. Mixing the two is too dangerous.
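As a sketch of what such a feed might look like, the snippet below writes one item as YAML by hand, putting the HTML in a literal block scalar (`|`). No YAML library is involved, and the field names (`items`, `title`, `description`) and the `yaml_item` helper are hypothetical; no YAML flavor of RSS is defined anywhere that I know of.

```python
import json
import textwrap


def yaml_item(title: str, description_html: str) -> str:
    """Render one feed item as YAML text.

    The title is written as a double-quoted scalar (JSON string syntax is
    valid YAML). The description goes into a literal block scalar, so the
    HTML needs no escaping and does not have to be well formed.
    """
    # Strip outer whitespace so the first line sets the block's indentation.
    body = textwrap.indent(description_html.strip(), " " * 6)
    return f"  - title: {json.dumps(title)}\n    description: |\n{body}\n"


legacy_html = "<p>First paragraph\n<br>Second line, never closed"
feed = "items:\n" + yaml_item("Legacy post", legacy_html)
print(feed)
```

The angle brackets mean nothing to a YAML parser, so the payload can be as sloppy as the author likes and still be readable when viewing source.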
I understand, Norm, the value of making an extreme stand to make a point. It’s like being a magnet to pull a current in a different direction. I’m trying to be pragmatic here. In writing this, I’m trying to see what is a good balance between keeping barriers down for content creators, and giving consumers exactly what they want. I know I want lots of info in my feed reader. And I think it’s critical to deliver it in a way formatted for easy consumption.
YAML is a simpler, easier-to-read markup language that people sometimes compare to XML. Syck is an allegedly fast YAML parsing library with Ruby bindings (bindings also exist for other scripting languages).
I wonder if YAML might not be a bad way to mark up any data that may have malformed HTML in the payload - like RSS. I suspect, however, that advocating this idea, particularly in Sam Ruby’s RSS workalike initiative, would be largely ineffective. Clearly, I’m not the only one with that notion...
Go YAML! You have a place in this world.