So I was checking out Slashdot the other day, and came across this article about how MSIE 7 might be shipped before Longhorn, and I must admit to being totally confused.
See, I thought the party line from most web developers was that — let me put this diplomatically — MSIE shouldn’t be recommended or endorsed. Ever. There are a lot of good reasons, like the security issues, the lack of proper and complete PNG (and CSS) support, and I’m sure there are others I’m not recalling.
So, when Microsoft decided that the resulting massive migration away from their browser (and as a web developer, I must say I missed that memo) was bad enough that some action needed to be taken Right Now, they decided to ask us what we wanted fixed in MSIE. That’s a reasonable course of action to take.
So I’d like to put a question to those of you who were toeing the aforementioned party line: Why did you answer them?!? Maybe my background as an artist has enabled me to learn a thing or two the rest of the world didn’t. Here’s a clue: the best way to ruin someone’s chances of success isn’t to hate them. No, the best way to screw someone is to ignore them.
But then again, maybe that was the smartest thing ever. Now that I’m in business for myself, I can see that this kind of situation only means that the real losers aren’t the web developers (the good ones anyway) but our clients, because they need to pay us to develop compatible solutions.
UPDATE: Geez, I didn’t even read the comments in that Slashdot article (I did skim the article it linked to, though!). Seems like the ‘mass migration’ claim may have been slightly overstated.
A recent post on the Blosxom mailing list suggests that people who build RSS aggregators should send an Accept-Language HTTP header, to help those of us with multiple-language blogs deliver the correct RSS feed. Get in touch with your favourite RSS software developer and ask them to tuck it in.
See this page on Content Negotiation for context.
This post contains some tips on how to set up your web development process to use UTF-8 end to end. What happened was, I saw a pair of posts by Sam Ruby (Unicode and weblogs, Aggregator i18n tests). I can be a bit of a careful (read: slow) thinker at times, so I had to let this percolate through my brain for a while. But these posts were published during two projects that I was working on.
The first project was a site containing English and French, with a ColdFusion-based content management system, and we were struggling with accented characters. My co-workers and I figured out how to reliably cough out entities in the right places.
Another project was an XML/XSLT-based one that was getting fancy characters like bullets and emdashes from some form submissions, and dealing with them badly. I think we did the entity-replacement trick there, too.
What we found remarkable was that we’ve been building web sites for years, and it seems like all of a sudden this has become a problem for us.
Well, I finally grokked what Sam was talking about, and I went ahead and modified my web development toolchain to work in UTF-8. Let me tell you: it was far easier than I had originally thought it would be. This blog hasn’t been updated yet, but my other website is running as UTF-8.
See, Unicode is a lot more than just accented characters, or Asian characters. It also has all the finer typographic controls: you want curly quotes? Emdashes? It’s all there. With a little wiki magic, you could even set up your content management system to automatically convert standard ASCII quotes to curly quotes, and dashes to emdashes, so you’d never have to learn how to input them!
Seriously, you don’t need to use HTML entities anymore, barring angle brackets and the ampersand. That’s a big win — it makes your source code readable if the HTML is stripped out. Everything still looks good in plain text.
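That “wiki magic” could be sketched roughly like this. This is a minimal illustration of the idea, not tied to any particular CMS; the function name and the exact quote-detection rule are my own assumptions:

```python
import re

# A sketch of the "wiki magic" described above: convert plain ASCII
# punctuation to real Unicode typography on the way into storage.
def smarten(text):
    # Double hyphens become em dashes.
    text = text.replace("--", "\u2014")
    # A straight double quote at the start of the string or after
    # whitespace opens a quotation; any other straight quote closes one.
    text = re.sub(r'(^|(?<=\s))"', "\u201c", text)
    text = text.replace('"', "\u201d")
    return text

print(smarten('He said "hello" -- loudly.'))
```

A real implementation would want smarter rules (apostrophes, quotes next to punctuation), but this shows how little machinery the trick needs once your pipeline is UTF-8 throughout.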
In the HTML itself:

<meta http-equiv="content-type" content="text/html; charset=utf-8">

In the HTTP response headers:

Content-type: text/html; charset=utf-8

And in the Apache configuration:

AddCharset utf-8 .html
Even with all this, I still don’t feel like an expert. For example, the accept-charset attribute isn’t supported in version 4 browsers. What’s the encoding of text submitted by forms in those browsers? If it’s not UTF-8, then you’d need to patch your CGI scripts to check for these browsers and convert the contents to UTF-8. As I learn more, I’ll try to keep you updated.
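One plausible shape for that patch: try UTF-8 first, and fall back to Latin-1, which was a common default encoding for version-4-era browsers on Western systems. This is a hedged sketch, not a complete solution (real code might also sniff the User-Agent, and non-Western users would need different fallbacks):

```python
# Decode a raw form value: assume UTF-8, fall back to Latin-1.
def decode_form_value(raw):
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so it never fails;
        # it's a safe last resort, though it can only guess the intent.
        return raw.decode("iso-8859-1")

# An "e-acute" sent as Latin-1 is the lone byte 0xE9, invalid as UTF-8.
print(decode_form_value(b"caf\xe9"))
print(decode_form_value("caf\u00e9".encode("utf-8")))
```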
My wife and I do not get the paper, ergo we do not get flyers for our local grocery stores. As we’re wanting to shop more price-consciously, the flyers would be nice pieces of information to have.
“Why don’t you check online?” I asked. “Too much bother,” she said, so we now have a routine where, once a week, we pick up flyers from the stores we want to shop at, and plan our trips accordingly.
So I got it into my head to go check these sites myself. I was thinking that maybe I could write a script that would grab the info and package it together in a nice, but privately accessible, page; then my wife would have exactly what she wanted in one convenient package. Better still, once I had that data, I could slice and dice it with our shopping list to help her pick products that were on sale that week.
The two stores we tend to shop at are Zehrs and Food Basics, mostly because they’re the two closest to where we live. I am not linking them here because they do not deserve it. Google “Zehrs markets” and “Food Basics” if you’re interested.
Let’s talk about Food Basics first. What they did was... annoying, but I can understand where they’re coming from. Their online version of the weekly flyer is basically 7 JPEGs on 7 pages. Not exactly scrapeable information, but it would be possible to at least bookmark the first page, and the images themselves seem to have predictable URIs.
Zehrs, now, is another thing altogether. First of all, the site’s in a frameset (which isn’t a cardinal sin in my book when it’s used properly, though it almost never is), so the URI is masked from view. Selecting their ‘online flyer’ link took me to a city/store selector, which in turn brought up the flyer. Great. Let’s view this frame. Uh-oh. The URI is completely opaque. Stripped of the domain name, here’s what it looks like:
Cute, isn’t it? Basically, I can’t bookmark a single URI that would always take me to the first page of their flyer. I can infer that I’m looking at page 1 (the P001 part of the file name) and I can figure out that I’m on week 8 of the year, and I doubt that 2004 would represent anything BUT the year. I could look at it for a few weeks to infer the rest of the pattern, but I’m not done talking about why the Zehrs experience bugs me.
Their flyer, like the Food Basics one, is also a set of images... coincidentally stored in the same directory structure, with the same name except that it starts with IMAGE instead of PAGE, and ends in .jpg instead of .asp. On its own that would have annoyed me about as much as Food Basics did, but combined with the opaque URI, Zehrs comes off worse.
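If the pattern inferred above holds (year, week of the year, zero-padded page number), a bookmarkable URL could be reconstructed each week with a few lines of code. To be clear: the host and path layout below are placeholders pieced together from my guesses, not the store’s real addresses:

```python
from datetime import date

# Construct a guessed flyer-image URL from the observed naming pattern:
# year / week number / IMAGEP<page>.jpg. Entirely hypothetical layout.
def flyer_image_url(page, when=None):
    when = when or date.today()
    week = when.isocalendar()[1]  # ISO week of the year
    return (f"http://flyers.example.com/{when.year}"
            f"/WK{week:02d}/IMAGEP{page:03d}.jpg")

print(flyer_image_url(1, date(2004, 2, 18)))
```

A weekly cron job could fetch pages 1 through 7 this way and stitch them into that private page for my wife.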
But get this: there’s a feature where, if you mouse over certain products on each page, you get a layer containing the flyer text for that item. That’s good, right? That’s scrapeable, right? Well, probably, but not easily. See, I view-sourced the file to see what they had, and instead of finding nice <div />’s with the copy, I found something that looks like this:
That’s right, dear readers, they hex-encoded all the characters that would make up their specials. More, they wrote this fairly impressive decoder right in the file. Heaven’s pity, but why? Why bother?
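I can’t reproduce Zehrs’s exact scheme here, but a typical JavaScript hex-obfuscation stores each character as a two-digit hex code, and undoing it in a scraper is trivial, which is exactly why bothering with it is baffling. A sketch, assuming that simple two-digit-hex encoding:

```python
# Decode a string of two-digit hex codes back into text.
# (Assumes a plain byte-per-character encoding like the one described.)
def unhex(encoded):
    return bytes.fromhex(encoded).decode("latin-1")

print(unhex("48656c6c6f"))
```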
Both these stores had (in my mind) a fantastic means to create brand loyalty: offering their data transparently enough that anyone could conceivably shuffle it in with their own personal data (like, in this case, a shopping list). Both stores could have created an API (like Amazon and Google have) for their specials. If the idea took off, they could then reduce the number of flyers they’d need to print for their offline audience.
What can I say? Guess I’ll continue to pick up flyers from the stores. I don’t have that much free time...
I’m really happy that Bob DuCharme wrote his article on creating backlinks, because I have been struggling to write an article using the same technique to achieve a different end. What Bob wanted was a way to work around the web’s fundamental restriction on linking: that links are one-way. He proposed an easy hack: when he writes an article, he publishes as part of it a link to a Google search that uses the article’s URI as the search term. If someone else writes an article contributing to the discussion he started, they too add the exact same Google link. With both sites so connected, a visitor to one site who sees the Google link can hit it, read the other article, hit that article’s Google link, and thereby have a way to ‘go back’ to the first article, albeit in a two-step process.
I was surprised that Bob didn’t pick up on the other possibility inherent in this kind of linking. That is what I want to discuss here.
I would like to propose that if any collection of weblog posts related in content were to publish these kinds of Google links, then the search results begin to take on a new kind of significance. The results would illustrate a discussion thread. Let’s take an example, using Alice, Bob, and Carly.
Alice decides to write a post about a recipe, and publishes it to alice.ca/myrecipe.html. Alice understands the idea of using Google to store discussion threads, so she publishes a link to Google using link:alice.ca/myrecipe.html as the search term.
Bob, also a culinary artist, sees Alice’s article and clicks on the Google link. No surprise, the search results contain a link to Alice’s article. Bob then decides to write a followup, possibly to suggest an alternate method of preparation for Alice’s recipe. He links to Alice’s post, and he also creates his own link to Google, using link:bob.net/alicesrecipe.html as the search term.
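Mechanically, each of these “Google thread” links is just an ordinary Google search URL with the link: query URL-encoded. A small helper makes this concrete (the function name is mine; the URL shape is just Google’s standard search endpoint):

```python
from urllib.parse import quote_plus

# Build the "Google thread" link for an article: a Google search
# whose query is link:<article-URI>.
def google_thread_link(article_uri):
    return ("http://www.google.com/search?q="
            + quote_plus("link:" + article_uri))

print(google_thread_link("bob.net/alicesrecipe.html"))
```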
Alice decides to check on her article on her web site, and clicks on that Google link she made. She’s delighted that there are now two results: one for her own site, and a new link to Bob’s. She visits Bob’s page, reads his article, and clicks on his Google link. Alas, the discussion seems to have ended there for the time being.
Carly, who’s a fan of anything Bob writes, then goes to Bob’s site and discovers the post talking about Alice’s new recipe. Carly spots some technical errors in Bob’s post, and decides to write a post of her own suggesting corrections. She includes a link to Bob’s page, and likewise adds her own Google link.
If Alice were then to check her Google thread link, nothing would have changed; but if she were to follow Bob’s Google link, she would discover that the conversation had in fact continued.
This example illustrates several aspects of using Google to show how a discussion evolves.
The first aspect to note is that anyone using the links will not be able to get a bird’s-eye view of the entire thread. This is only a real drawback if you’re trying to determine which post started it all. However, I think there’s a way to make this a bit easier. After checking a set of Google search results, I was surprised to see that some of the results had a date tucked in beside the page size. Curious to see how this was done, I view-sourced the page, and found out they were using a meta tag. This is what it looked like:
<meta name="Date" content="2004-01-15" />
My theory is that if weblog software made a habit of a) making permalinks to pages with one article on them, and b) adding this date tag, then it would be possible to trace a series of comments back to the oldest one.
If you’re going to explore a conversational thread, then you must follow a page-to-Google-to-page-to-Google pattern of browsing. Not exactly ideal from a usability perspective, but it may well be worth the tradeoff if you’re weighing this approach to creating discussion threads against allowing comments (and therefore comment spam).
There ends up being very little overhead. You don’t have to manage or maintain the thread — Google takes care of that for you. As long as people are creating their links to Google using their article URI as the search term, and also linking to the source article that they’re commenting on, then it’s possible to follow a thread to each and every article.
Things get even more interesting when you play around with some what-if scenarios. What if, for example, Bob not only created the Google link for his own URI, but also added another Google link containing Alice’s article URI? Now you’ve made it convenient for anyone visiting Bob’s page to see both comments to his page and comments to Alice’s page without directly going to Alice’s page first.
Another what-if: What if Alice and Bob were savvy web developers who got their Google API keys, and decided to create an application that displayed the search results of their URI? Now it’s possible to see on their own page the top 10 comments to their own article!
So: I’m going to put my money where my mouth is, and I’ve already adjusted my templates to include a Google Thread link at the top of every page. I’ve also written and deployed a Blosxom plug-in to add the <meta /> date tag for individual entries. In time, I’ll also write a new plug-in that will augment all links I’m citing with an extra link going to Google, so that you can see other posts talking about the article I’m commenting on. The trick, of course, is in the implementation, not the functionality.
Unfortunately, I’m not (yet?) on the who’s who list of online personalities to watch, so almost all my Google Threads will come up zilch, but I hope you’ll realize that it’s not a failure of the idea. If you want to see it in action, then please comment on something I wrote! :)
The inspiration for this plug-in came from a Google search result that turned up an HTML archive of a w3c mailing list. The particular search result was interesting because the date of the file was displayed right beside the size of the file. I thought that if anyone wanted to check how ‘new’ a particular page was, they would merely need to see that information.
This particular plug-in will work fine for all static and dynamic Blosxom blogs, provided that permalinks refer to one specific post, instead of a page containing a series of posts, like the date archive.
While the plug-in works, the benefits of adding a date are still a bit nebulous in my mind. Questions that come up for me include:
Mind you, when I say that the benefits of this plug-in are nebulous, it’s not to say that I don’t know what it would be good for, but rather that I’m not sure what good it will do. You may wish to read my post on Using Google to Create Comment Threads to see what use I’d put the date tag to.
Download the plug-in here. It requires Blosxom 2.0.
In my world, WSDF used to stand for Web Site Description File. Now it’ll be called Web Site MetaData. The take-away from this is: before going live with the new thing you’ve got, research the name for collisions first.
I apologize for the inconvenience this has caused, but I think this is the right thing to do.
The problem is that all the Web knows about is URIs, and the Web can’t tell whether a URI points to a home page, a picture of a cute cat, or to one of a dozen daily entries on some blog... And I bet, down the road, once we really have the notion of a site, we’ll be able to think of all sorts of other useful things to do with it.
This series of posts builds on the thinking Tim puts out, so I recommend you look at his article before continuing.
What I have done in the following series of posts is try to map out a model of what a web site is, break it down into its atomic pieces, and determine whether there’s a way to represent the data in a machine- and human-readable format. Some of these articles, like the “WSMD as RSS”/“WSMD as Atom” pair, can safely be skipped if you’re pressed for time.
If you read through most or all of this, let me use this post to thank you (I wasn’t sure where else to put it). If you decide there’s merit enough in this proposal to implement a WSMD file for your site, please drop me a line. If you have constructive criticism related to anything here, please let me know. I’m convinced that the ideas outlined here will work, and I’d like to see where it goes.
Update: added a warning to the WSMD as XHTML page not to download the WSMD file directly.
Update: a bit of Googling revealed that the term “WSDF” which I had been using until Jan 15, 2004, was being used in a Web Services context. In order to avoid confusion, I decided to call this format Web Site MetaData.
I have code here to show you how you could leverage the value inherent in a WSMD file. Unfortunately, since this domain is not my own, I am not free to set up the scripts as I see fit. So I encourage you to download it, unzip it, and have a look. If you’re not a programmer, you may be more interested in the possible use cases for WSMD files, should someone create scripts to realize them.
The code is provided as proof-of-concept only. I’m sure that many of you reading this are probably far better programmers than I, and I hope that you’ll take what inspiration you can from them and run with it.
For those of you not interested in downloading it (yet?), this post will provide the briefest of overviews of how I wrote this search engine.
This search page was generated by a script called wsmdfindmaker.pl, and its sole purpose was to parse the WSMD and create the list of possible sections and filetypes to search within. The thinking behind doing it this way is that, as sole author of this blog, I will be updating my WSMD file, and hence my search page, only as often as I’m posting. All the other times when requests are made to the page, the representation will be exactly the same. Why bother, then, wasting CPU cycles generating the exact same output between page refreshes?
The search itself is handled by another script, called wsmdfind.cgi, which dynamically constructs the XPath for the search, pulls matching nodes, attempts to match the search text (if present) against the result set, and prints anything that appears to be a candidate result.
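To make the idea concrete, here’s a rough illustration of what wsmdfind.cgi does. The real script is Perl with XML::XPath; this Python sketch uses the standard library’s more limited path support, and the sample document and element names assume the RSS-style WSMD layout described elsewhere in this series:

```python
import xml.etree.ElementTree as ET

# A toy WSMD document in the RSS-style layout (contents are made up).
WSMD = """<rss version="0.91"><channel>
  <item><title>weblog post</title>
        <link>http://example.org/blog/utf8.html</link>
        <description>Going UTF-8 end to end</description></item>
  <item><title>image</title>
        <link>http://example.org/pics/cat.jpg</link>
        <description>A cute cat</description></item>
</channel></rss>"""

def search(doc, kind, text):
    """Find links of assets of a given kind whose description matches."""
    root = ET.fromstring(doc)
    hits = []
    # Build the path from the chosen asset kind (a real CGI would need
    # to sanitize this user input before splicing it into the path).
    for item in root.findall(f".//item[title='{kind}']"):
        desc = item.findtext("description", "")
        if text.lower() in desc.lower():
            hits.append(item.findtext("link"))
    return hits

print(search(WSMD, "weblog post", "utf-8"))
```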
I encourage you to download the code and take it out for a spin. The requirements are that you have XML::XPath (and its prerequisites) installed. I also make use of the Perl CGI module (which seems to be part of the standard Perl distribution these days). I’m running the page successfully on a Mac OS X machine, and despite the tedium of installing all the prerequisites, I have not encountered any actual problems with the install.
This isn’t the only thing you could do with a WSMD. I’ve discussed some other applications in this post that may inspire you.
When I started writing this series of posts, I already had the properties of a WSMD file in my head. Initial research suggested that RSS would be the only candidate, because its structure was simple enough and I already understood it. So it came as a surprise, once I worked this out, to realize that Atom and XHTML are also good candidates. Nevertheless, I’d like to demonstrate how these principles could be implemented in RSS.
The RSS implementation is dirt easy. I chose to work with the 0.91 spec so as to use a ‘lowest common denominator’. I’m going to assume that you already know how to read RSS files.
Let’s copy a list from this post. The properties of a WSMD entry must be:
Well, we’ve got two of the three RSS tags used in each <item> nailed already: <link /> and <description />. That leaves the <title /> tag to hold the indication of what the asset is. That feels wrong to me, somehow, because it’s a <title /> tag, not an <an-indication-of-what-it-is /> tag. Still, I guess it could be argued that a title is supposed to be an indication of what something is.
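Since the original files have gone missing (see below), here is a sketch of what a minimal WSMD-as-RSS file might look like under these constraints. The channel details, titles, and URLs are placeholders of my own, not recovered content:

```xml
<rss version="0.91">
  <channel>
    <title>example.org WSMD</title>
    <link>http://example.org/</link>
    <description>Web Site MetaData for example.org</description>
    <language>en</language>
    <item>
      <title>weblog post</title>
      <link>http://example.org/blog/2004/01/some-post.html</link>
      <description>A single weblog entry, one per permalink</description>
    </item>
    <item>
      <title>image</title>
      <link>http://example.org/pics/cat.jpg</link>
      <description>A picture of a cute cat</description>
    </item>
  </channel>
</rss>
```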
I’ve created the RSS implementation of the WSMD file. You can find the one for the home page here. I seem to have lost these files. I’ll look into recreating them at some future date.
What’s really cool about this minimal implementation is that if you want to add more metadata than the minimum I defined, you need merely switch to a more modern revision of RSS — one that supports, say, the Dublin Core metadata set, and run with that.
Before I started this project, I knew about Atom, and ‘I was there’ when Sam Ruby first opened up the issue that eventually led to the fine work that has been done, and continues to be done, today. However, like many others, I was quickly overwhelmed by the frenetic pace of development, and decided to sit out until they had something to show for their efforts.
I sat down one weekend to look at the Atom Syndication Feed specification for the first time, and realized that this format was very interesting because it provided a richer markup framework for describing what a site was. A lot of this information wasn’t required, based on the model, but to provide it arguably leads to a higher-quality (and potentially more useful) description file.
Again, assuming you know how to mark up an Atom feed, consider the following:
Each entry in an Atom feed describes one asset of a web site. As in RSS, the <link /> and <id /> tags point to the representation itself, the <summary /> tag contains a human-readable description of the asset, and the <title /> contains the indication of what the resource is.
But the Atom syntax spec requires additional information in the feed. For example, each entry needs an <issued /> tag to indicate when the entry was, well, issued. Feeds also require, at the top of the document, such things as a title, a link connecting the feed to something (in this case, the home page of the site), a <modified /> tag to indicate the last-modified date... you get the idea. There’s more information required to make up an Atom feed, much more than I think is needed to describe a web site, but nevertheless, value is added by filling in that data.
I have not at this time provided an Atom version for your consumption, because in terms of structure it is so similar to RSS that I believe nothing new could be learned by studying it in Atom.
Now let’s examine the XHTML version of a WSMD.