So I was checking out Slashdot the other day, and came across this article about how MSIE 7 might be shipped before Longhorn, and I must admit to being totally confused.
See, I thought the party line from most web developers was that — let me put this diplomatically — MSIE shouldn’t be recommended or endorsed. Ever. There are a lot of good reasons, like the security issues, the lack of proper and complete PNG (and CSS) support, and I’m sure there are others I’m not recalling.
So, when Microsoft concluded that the resulting massive migration away from their browser (and as a web developer, I must say I missed that memo) was bad enough that some action needed to be taken Right Now, they decided to ask us what we wanted fixed in MSIE. That’s a reasonable course of action to take.
So I’d like to put a question to those of you who were toeing the aforementioned party line: Why did you answer them?!? Maybe my background as an artist has enabled me to learn a thing or two the rest of the world didn’t. Here’s a clue: the best way to ruin someone’s chances of success isn’t to hate them. No, the best way to screw someone is to ignore them.
But then again, maybe that was the smartest thing ever. Now that I’m in business for myself, I can see that this kind of situation only means that the real losers aren’t the web developers (the good ones anyway) but our clients, because they need to pay us to develop compatible solutions.
UPDATE: Geez, I didn’t even read the comments in that Slashdot article (I did skim the article it linked to, though!). Seems like the ‘mass migration’ claim may have been slightly overstated.
This post will contain some tips on how to set up your web development process to use UTF-8 end to end. What happened was, I saw a pair of posts by Sam Ruby (Unicode and weblogs, Aggregator i18n tests). I can be a bit of a careful (read: slow) thinker at times, so I had to let this percolate through my brain for a while. As it happened, these posts were published while I was working on two projects.
The first project was a site containing English and French, with a ColdFusion-based content management system, and we were struggling with accented characters. My co-workers and I figured out how to reliably cough out entities in the right places.
Another project was an XML/XSLT-based one that was receiving fancy characters like bullets and em dashes from form submissions, and dealing badly with them. I think we used the entity replacement trick here, too.
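We didn’t keep the exact code around, but the trick boils down to something like this: a minimal Perl sketch using the HTML::Entities module from CPAN. The sample string is made up, and it assumes Latin-1 input:

#!/usr/bin/perl
# Sketch of the "entity replacement trick": turn high-bit characters
# into HTML entities so the page encoding no longer matters.
use strict;
use warnings;
use HTML::Entities;

my $copy = "Caf\xe9 fran\xe7ais";   # hypothetical Latin-1 input
# Encode everything outside printable ASCII, leaving the markup alone:
my $safe = encode_entities($copy, '^\n\x20-\x7e');
print $safe, "\n";                  # prints: Caf&eacute; fran&ccedil;ais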
What we found remarkable was that we’d been building web sites for years, and all of a sudden this seemed to have become a problem for us.
Well, finally I grokked what Sam was talking about, and I went ahead and modified my web development toolchain to work in UTF-8. Let me tell you: it was far easier than I had originally thought it would be. This blog hasn’t been updated yet, but my other website is running as UTF-8.
See, Unicode is a lot more than just accented characters, or Asian characters. It also has all the finer typography controls — you want curly quotes? Em dashes? It’s all there. With a little wiki magic, you could even set up your content management system to automatically convert standard ASCII quotes to curly quotes, and dashes to em dashes, so you’d never have to learn how to input them!
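If you’re curious what that wiki magic might look like, here’s a deliberately naive Perl sketch; a real ‘educator’ like SmartyPants handles many more edge cases than these regexes do:

#!/usr/bin/perl
# Naive sketch: convert ASCII quotes and double hyphens into their
# typographic Unicode equivalents. For illustration only.
use strict;
use warnings;
use utf8;                  # this source file contains UTF-8 literals
binmode STDOUT, ':utf8';   # emit UTF-8 on output

sub educate {
    my ($text) = @_;
    $text =~ s/---?/—/g;          # -- or --- becomes an em dash
    $text =~ s/"([^"]*)"/“$1”/g;  # paired double quotes become curly
    $text =~ s/'/’/g;             # apostrophes become curly
    return $text;
}

print educate(q{She said, "It's easy -- really."}), "\n";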
Seriously, you don’t need to use HTML entities anymore, barring angle brackets and the ampersand. That’s a big win — it makes your source code readable if the HTML is stripped out. Everything still looks good in plain text.
There are three places where you should declare the encoding. First, in the document itself, with a meta tag:

<meta http-equiv="content-type" content="text/html; charset=utf-8">

Second, in the HTTP headers, if you’re generating pages from a script:

Content-type: text/html; charset=utf-8

And third, if you’re serving static files from Apache, in your httpd.conf or .htaccess:

AddCharset utf-8 .html
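There’s one more place worth covering: you can ask browsers to submit form data as UTF-8 by adding an accept-charset attribute to your forms. A quick example (the action URL is hypothetical):

<form action="/cgi-bin/comment.cgi" method="post" accept-charset="utf-8">
...
</form>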
Even with all this, I still don’t feel like I’m an expert. For example, the accept-charset attribute mentioned above isn’t supported in version 4 browsers. What’s the encoding of text submitted by forms on those browsers? If it’s not UTF-8, then you’d need to patch your CGIs to check for these browsers, and convert the contents to UTF-8. As I learn more, I’ll try to keep you updated.
My wife and I don’t get the paper, ergo we don’t get flyers for our local grocery stores. Since we want to shop more price-consciously, the flyers would be nice pieces of information to have.
“Why don’t you check online?” I asked. “Too much bother,” she said, so we now have a routine where, once a week, we pick up flyers from the stores we want to shop at, and plan our trips accordingly.
So I got it into my head to go check these sites myself. I was thinking that maybe I could write a script that would grab the info and package it together in a nice, privately accessible page, so my wife would have exactly what she wanted in one convenient place. More, once I had that data, I could cross-reference it with our shopping list to help her pick products that were on sale that week.
The two stores we tend to shop at are Zehrs and Food Basics, mostly because they’re the two closest to where we live. I am not linking them here because they do not deserve it. Google “Zehrs markets” and “Food Basics” if you’re interested.
Let’s talk about Food Basics first. What they did was... annoying, but I can understand where they’re coming from. Their online version of the weekly flyer is basically seven JPEGs across seven pages. Not exactly scrapeable information, but it would be possible to at least bookmark the first page, and the images themselves seem to have predictable URIs.
Zehrs, now, is another thing altogether. First of all, the site’s in a frameset (which, by the way, isn’t a cardinal sin in my book if it’s used properly, and it almost always isn’t), so the URI is masked from view. Selecting their ‘online flyer’ link took me to a city/store selector, which in turn brought up the flyer. Great. Let’s view this frame. Uh-oh. The URI is completely opaque. After stripping off the domain name, here’s what it looks like:
Cute, isn’t it? Basically, I can’t bookmark a single URI that would always take me to the first page of their flyer. I can infer that I’m looking at page 1 (the P001 part of the file name) and I can figure out that I’m on week 8 of the year, and I doubt that 2004 would represent anything BUT the year. I could look at it for a few weeks to infer the rest of the pattern, but I’m not done talking about why the Zehrs experience bugs me.
Their flyer, like the Food Basics one, is also a set of images... coincidentally, each image is stored in the same directory structure, with the same name except that it starts with IMAGE instead of PAGE and ends in .jpg instead of .asp. I would have been merely as annoyed at Zehrs as at Food Basics, but combined with the opaque URI, Zehrs comes off worse.
But get this: there’s a feature where, if you mouse over certain products on each page, you get a layer containing the flyer text for that item. That’s good, right? That’s scrapeable, right? Well, probably, but not easily. See, I view-sourced the file to see what they’d done, and instead of nice <div />s containing the copy, I found something that looks like this:
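(What follows is a hypothetical reconstruction of the pattern, not their actual code.)

<script type="text/javascript">
// Hypothetical reconstruction: the flyer copy is stored as a hex
// string, then decoded and written into the page at runtime.
var s = "53617665202432206f6e207061737461207361756365"; // "Save $2 on pasta sauce"
var out = "";
for (var i = 0; i < s.length; i += 2) {
    out += String.fromCharCode(parseInt(s.substring(i, i + 2), 16));
}
document.write(out);
</script>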
That’s right, dear readers, they hex-encoded all the characters that make up their specials. More, they wrote a fairly impressive decoder right into the file. For pity’s sake, why? Why bother?
Both these stores had what was (in my mind) a fantastic means of creating brand loyalty: offering their data transparently enough that anyone could conceivably shuffle it in with their own personal data (like, in this case, a shopping list). Both these stores could have created an API (as Amazon and Google have) for their specials. If the idea took off, they could then reduce the number of flyers they’d need to print for their offline audience.
What can I say? Guess I’ll continue to pick up flyers from the stores. I don’t have that much free time...
In my world, WSDF used to stand for Web Site Description File. Now the format will be called Web Site MetaData (WSMD). The take-away from this is: before going live with the new thing you’ve got, research the name for collisions first.
I apologize for the inconvenience this has caused, but I think this is the right thing to do.
The problem is that all the Web knows about is URIs, and the Web can’t tell whether a URI points to a home page, a picture of a cute cat, or to one of a dozen daily entries on some blog... And I bet, down the road, once we really have the notion of a site, we’ll be able to think of all sorts of other useful things to do with it.
This series of posts builds on the thinking Tim lays out, so I recommend you read his article before continuing.
What I have done in the following series of posts is to try to map out a model of what a web site is, break it down into its atomic pieces, and determine if there’s a way to represent the data in a machine- and human-readable format. Some of these articles, like the “WSMD as RSS” and “WSMD as Atom” articles, can safely be skipped if you’re pressed for time.
If you read through most or all of this, let me use this post to thank you (I wasn’t sure where else to put it). If you decide there’s merit enough in this proposal to implement a WSMD file for your site, please drop me a line. If you have constructive criticism related to anything here, please let me know. I’m convinced that the ideas outlined here will work, and I’d like to see where it goes.
Update: added a warning to the WSMD as XHTML page not to download the WSMD file directly.
Update: a bit of Googling revealed that the term “WSDF” which I had been using until Jan 15, 2004, was being used in a Web Services context. In order to avoid confusion, I decided to call this format Web Site MetaData.
I have code here to show you how you could leverage the value inherent in a WSMD file. Unfortunately, since this domain is not my own, I am not free to set up the scripts as I see fit. So I encourage you to download the code, unzip it, and have a look. If you’re not a programmer, you may be more interested in the possible use cases for WSMD files, should someone create the scripts to realize them.
The code is provided as proof-of-concept only. I’m sure that many of you reading this are probably far better programmers than I, and I hope that you’ll take what inspiration you can from them and run with it.
For those of you not interested in downloading it (yet?), this post will provide the briefest of overviews of how I wrote this search engine.
This search page was generated by a script called wsmdfindmaker.pl, whose sole purpose was to parse the WSMD and create the list of possible sections and filetypes to search within. The thinking behind doing it this way is that, as sole author of this blog, I will be updating my WSMD file, and hence my search page, only as often as I post. At all other times, requests to the page will get exactly the same representation. Why waste CPU cycles, then, regenerating identical output between updates?
The search itself is covered by another script, called wsmdfind.cgi, which dynamically constructs the XPath for the search, pulls matching nodes, and attempts to match the search text (if present) to the result set, printing anything that appears to be a candidate result.
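To give you the flavour of it, here’s a rough sketch of the approach. This is not the real wsmdfind.cgi: it assumes the XHTML flavour of WSMD (<div title="..."> sections wrapping <object> assets), and it does no sanitizing of its inputs.

#!/usr/bin/perl
# Rough sketch of a WSMD search CGI -- not the actual wsmdfind.cgi.
use strict;
use warnings;
use CGI;
use XML::XPath;

my $q       = CGI->new;
my $section = $q->param('section') || '';   # e.g. "Weblog"
my $type    = $q->param('type')    || '';   # e.g. "text/html"
my $text    = $q->param('q')       || '';

# Build the XPath one predicate at a time. (A real script would
# escape these parameters before interpolating them!)
my $xpath = $section ? qq{//div[\@title="$section"]//object} : '//object';
$xpath .= qq{[\@type="$type"]} if $type;

my $xp = XML::XPath->new(filename => 'wsmd.html');

print $q->header(-type => 'text/html', -charset => 'utf-8');
for my $node ($xp->findnodes($xpath)) {
    my $desc = $node->string_value;   # the human-readable description
    next if $text && index(lc $desc, lc $text) < 0;
    printf qq{<p><a href="%s">%s</a></p>\n},
        $node->getAttribute('data'), $desc;
}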
I encourage you to download the code and take it out for a spin. The requirements are that you have XML::XPath (and its prerequisites) installed. I also make use of Perl’s CGI module (which seems to be part of the standard Perl distribution these days). I’m running the page successfully on a Mac OS X machine, and despite the tedium of installing all the prerequisites, I have not encountered any actual problems with the install.
This isn’t the only thing you could do with a WSMD. I’ve discussed some other applications in this post that may inspire you.
When I started writing this series of posts, I already had the properties of a WSMD file in my head. Initial research suggested to me that RSS would be the only candidate, because the structure was simple enough, and I already understood it. So it came as a surprise, once I’d worked this out, to realize that Atom and XHTML are also good candidates. Nevertheless, I’d like to demonstrate how these principles could be implemented in RSS.
The RSS implementation is dirt simple. I chose to work with the 0.91 spec so as to use a ‘lowest common denominator’. I’m going to assume that you already know how to read RSS files.
Let’s copy a list from this post. The properties of a WSMD entry must be:

- a URI pointing to the asset
- an indication of what kind of asset it is
- a human-readable description of its contents
Well, we’ve got two of the three RSS tags used in each <item> nailed already — <link /> and <description />. So that leaves the <title /> tag to hold the indication of what the asset is. That feels wrong to me, somehow, because it’s a <title /> tag, not an <an-indication-of-what-it-is /> tag. Still, I guess it could be argued that a title is supposed to be an indication of what something is.
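To make that concrete, here’s what a minimal WSMD-as-RSS file might look like (the site and URIs are hypothetical):

<rss version="0.91">
  <channel>
    <title>example.org WSMD</title>
    <link>http://www.example.org/</link>
    <description>Web Site MetaData for example.org</description>
    <language>en-us</language>
    <item>
      <title>home page</title>
      <link>http://www.example.org/</link>
      <description>The front door of the site.</description>
    </item>
    <item>
      <title>weblog</title>
      <link>http://www.example.org/blog/</link>
      <description>My weblog, updated most weekdays.</description>
    </item>
  </channel>
</rss>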
I’ve created the RSS implementation of the WSMD file. You can find the one for the home page here. I seem to have lost these files. I’ll look into recreating them at some future date.
What’s really cool about this minimal implementation is that if you want to add more metadata than the minimum I defined, you need merely switch to a more modern revision of RSS — one that supports, say, the Dublin Core metadata set, and run with that.
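For example, an item could then carry Dublin Core fields like these (a hypothetical fragment, assuming the dc: namespace, http://purl.org/dc/elements/1.1/, is declared on the root element):

<item rdf:about="http://www.example.org/blog/">
  <title>weblog</title>
  <link>http://www.example.org/blog/</link>
  <description>My weblog, updated most weekdays.</description>
  <dc:creator>Jane Author</dc:creator>
  <dc:date>2004-02-20</dc:date>
</item>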
Before I started this project, I knew about Atom, and ‘I was there’ when Sam Ruby first opened up the issue that eventually led to the fine work that has been done, and continues to be done, today. However, like many others, I was quickly overwhelmed by the frenetic pace of development, and decided to sit out until they had something to show for their efforts.
I sat down one weekend to look at the Atom syndication format specification for the first time, and realized that the format was very interesting for this purpose because it provides a richer markup framework for describing what a site is. A lot of this information isn’t required by the model, but providing it arguably leads to a higher-quality (and potentially more useful) description file.
Again, assuming you know how to mark up an Atom feed, consider the following:
Each entry in an Atom feed describes one asset of a web site. As in RSS, the <link /> and <id /> tags point to the representation itself, the <summary /> tag contains a human-readable description of the asset, and the <title /> holds the indication of what the resource is.
But the Atom syntax spec requires additional information in the feed. For example, each entry needs an <issued /> tag to indicate when the entry was, well, issued. Feeds also require, at the top of the document, such things as a title, a link to connect the feed to (in this case, the home page of the site), a <modified /> tag to indicate the last-modified date... you get the idea. There’s more information required to make up an Atom feed, much more than I think is required to describe a web site, but nevertheless, value is added by filling in that data.
I have not at this time provided an Atom version for your consumption, because in terms of structure it is so similar to RSS that I believe nothing new could be learned by studying it in Atom.
Now let’s examine the XHTML version of a WSMD.
I don’t have all the answers here, but I’m convinced that this proposed solution is bound to be useful in all sorts of situations. I humbly submit the following inspirations, and I hope you’ll share with me ideas of your own.
If you are contributing to a vast international website spanning multiple domains on multiple machines in multiple locations, you’ve probably had to give some thought to creating some kind of single sign-on mechanism that would empower any registered user to access any part of the site, no matter where she registered in the first place.
If such a site also required that some users have different privileges than others, then a WSMD file could be useful for establishing which domains people may have access to. What’s cool about this is that you can alter the files whenever you want, and transparently alter which domains are accessible to a registered user — in real time.
This idea depends on the willingness of search engines, such as Google, to cache WSMD files (though since they’re XHTML anyway, that’s already being done, I suppose), and then to modify the search interface a bit so that the contents of such files can be searched intelligently.
How will this contribute to the robustness of your site? Suppose your site spanned multiple domains, and one of your machines got slashdotted. You decide to move some of the content to another machine, link it off the homepage on that machine, and rest easy. Why? Because anyone knowledgeable enough to hit Google after discovering your slashdotted site could easily find out from the cached WSMD file for your site that you’re hosting on more than one machine, and go see if the information they seek is on the other one.
I think the WSMD file also de-emphasizes the importance of domain names. I’ve had a friend make the comment that anyone who had to split their site up among multiple machines is being sloppy. Yes, that’s arguable; there are many sites built over multiple machines that are purposefully designed that way, and to consolidate them under a single domain could well be impossible. But with a WSMD file and the right interface, where a particular piece of information comes from may well be irrelevant, as long as you can get it.
Have you ever been in the situation where you decided to comment on another story or post on someone else’s site, only to find that the thinking was so good, you wish you could move it to your own? With a WSMD file, you could simply ‘claim’ your comments as being part of your site, and people searching for something you said wouldn’t have to worry if you said it on your own hosted pages or as part of a submitted comment.
None of these ideas is in itself the ‘killer app’ for WSMD, but that’s OK. We’ve had the web since, what, 1991? I got started in this in ’95, and in all that time there has been no machine-parseable definition of what a web site is. If this notion were to take off and become popular, I’d rather it stay loose enough to ensure we’re marking up the right content, and tighten up as time goes by and our collective experience increases. And it’s going to take a certain amount of experience before we can figure out how to make the best use of this information.
As I mentioned before, I had always assumed that I would present the solution in RSS format. I knew that one of the drawbacks to using RSS was that you couldn’t describe the entire website in one file — not if you wanted to preserve the notion of sections to search in. But I figured it didn’t matter; I could simply link to other wsmd.rss files, where each file described a section.
When I actually implemented the format, I discovered that it was a real pain to edit and ensure that everything was set up and linked properly. I also found that the file sizes were absolutely tiny — and while I’m not an expert in the HTTP protocol, I was wondering if perhaps they were so small that the HTTP overhead, combined with I/O bottlenecks for fetching the files, made the scheme a little inefficient.
With that experience, I tweaked the requirements a bit, as you saw here, and came up with a way to describe a site in one file, and it wasn’t that bad in terms of size. For instance, in my WSMD file, I describe 93 assets in a 14K file (roughly 150 bytes per asset). Not bad, really. If a large site contained, say, 5000 assets worth putting into a WSMD file, it would occupy a file roughly 750K in size. I suppose that in the web world, that’s huge, but I don’t really see it as being a problem. For one thing, there’s nothing keeping you from breaking the WSMD file into smaller files, with the top-level file linking to other WSMD files. But that might not be necessary if the ones most likely to use the file are the ones hosting the site (who could then provide you, the user, with a richer means of utilizing it).
Between RSS, Atom, and XHTML, I don’t think anyone could intelligently argue that XHTML isn’t the most widely known of the three. This means that the learning curve for implementing WSMD files is nearly flat — you need only learn the model, and familiarize yourself with the two tags absolutely required to make this work.
As you’ve no doubt already discovered, viewing a WSMD file in a browser is an interesting experience. Instead of markup, you actually get to see the entire website in one page. For those of you on slower connections, that was probably painful, and I’m sorry about that. But I don’t really see it as a liability: for testing purposes, what better way to check that you’ve got all your links working right? Besides, the WSMD file obviously isn’t meant to be viewed in a browser. It’s meant to be mined for useful data.
Let’s look again at the anatomy of a web site. Here’s the breakdown:

- home pages
- sections
- browser-level assets
- plug-in and downloadable assets
- applications
- discussion threads, blogs, and wikis
Discussion threads, blogs, and wikis can all easily map to either a section (a <div /> tag) containing more granular assets, or simply be noted as an asset (an <object /> tag) which would have as its URI the starting point for that service. Browser-level, plug-in, and downloadable assets are definitely objects. Home pages and applications are objects too. The only things left are sections, and as those are structural hints, not dereferenceable by URI, they map naturally to <div /> tags.
If a property of a web site maps to a <div /> tag, then the only information we really need is a human-readable title. On the other hand, if it’s an object, then all we really need are the location of the object, the type of object it is, and a description of the contents.
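As an illustration, a minimal WSMD-as-XHTML fragment might look like this. The URIs are hypothetical, and putting the section’s human-readable title in the <div />’s title attribute is one convention for doing it, not a requirement of XHTML:

<div title="Weblog">
  <object data="http://www.example.org/blog/" type="text/html">
    My weblog, updated most weekdays.
  </object>
</div>
<div title="Papers">
  <object data="http://www.example.org/papers/wsmd.pdf" type="application/pdf">
    A paper describing the WSMD format.
  </object>
</div>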
I don’t think I’ve abused the ontology of XHTML to describe a site, whereas with RSS and Atom, I had to shoehorn a required value into at least one of the fields in an unintended manner.
As described, none of the possible formats, including this one, forces the notion of a home page; it is up to the author of the file to mark which pages are home pages. If the author of a WSMD file wants to ensure the file is as useful as possible, however, he is encouraged to adopt the generally accepted terminology wherever it is relevant.
If it’s not obvious at this point, the value of a WSMD file is directly proportional to what you put in it. You don’t have to put every single resource in it if you don’t want to — at the risk of diminishing the value of the WSMD. But there are cases where leaving things out is exactly the right thing to do. For example, it might not make sense to include any URIs that come from the middle or end of an application flow, because you want to ensure that people always start at the beginning of a task. You might not want to include some of the graphics, because they describe the look and feel of the site and would serve no value to anyone else.
Another notion peculiar to the XHTML implementation of WSMD is that while you must describe your site using <div /> and <object /> tags, you are by no means limited in the kind of markup you can put within an object tag. You can make citations, link to other resources within or outside your site, or otherwise add structure to your data.
All this sounds like great theory, but what’s the utility? Is there a killer app for this? I believe so, and I’ll start the discussion here.
With the general types of sites now described, what does a site consist of, semantically speaking? Here’s my list:

- 1-n home pages
- 1-n sections
- 0-n browser-level assets
- 0-n plug-in and downloadable assets
- 0-n applications
- 0-n discussion threads, blogs, or wikis
It may be that some of you think movies or Flash files should be plug-in/downloadable assets, while others think they’re browser-level. Honestly, it doesn’t matter that much to me. You’re both right. But very likely, everyone would agree that some things lean more plug-in/downloadable and others more browser-level.
So, what that list boils down to is this: if you want a web site, you must begin with at least a home page. Having one implies that there’s a section. Everything else is optional.
The term ‘section’ is interesting enough to be fleshed out a bit. A section, in turn, may contain:

- 0-n home pages
- 0-n sections
- 0-n browser-level assets
- 0-n plug-in and downloadable assets
- 0-n applications
- 0-n discussion threads, blogs, or wikis
That list looks familiar, doesn’t it? A web site, then, is essentially a section that may contain a number of assets, including other sections.
I wonder, though, if some of the numbers could be tweaked. For example, can a web site contain a section made up of nothing but, say, browser-level assets? I suppose so, in which case you’d have 0-n home pages. But for now, I’ll leave the numbers the way they are, because they are typical for almost any site.
Tim Bray, in this article, already talked about this, really. A web site can span domains, or it may live within a directory or two of a domain rather than at the root. It stands to reason that the only safe assumption to make is that a web site must have at least one URI, and that every asset, whether it’s a page, an image, or a binary document, is going to have a URI.
Also worth mentioning are such things as the author and last-modified date of a site. Such attributes are typically not referenced by URI, but are represented by markup within a representation. So, for the purposes of this model, they are not considered, because I don’t believe they will have an impact on the solution.
I’m not sure whether there’s any more to put in the model at this point, so let’s have a look at what falls out of this.