In my world, WSDF used to stand for Web Site Description File. Now it’ll be called Web Site MetaData. The take-away from this is: before going live with the new thing you’ve got, research the name for collisions first.
I apologize for the inconvenience this has caused, but I think this is the right thing to do.
The problem is that all the Web knows about is URIs, and the Web can’t tell whether a URI points to a home page, a picture of a cute cat, or to one of a dozen daily entries on some blog... And I bet, down the road, once we really have the notion of a site, we’ll be able to think of all sorts of other useful things to do with it.
This series of posts builds on the thinking Tim puts out, so I recommend you look at his article before continuing.
In the following series of posts, I have tried to map out a model of what a web site is, break it down into its atomic pieces, and determine whether there’s a way to represent the data in a format that is both machine- and human-readable. Some of these articles, like the “WSMD as RSS/WSMD as Atom” article, could safely be skipped if you’re pressed for time.
If you read through most or all of this, let me use this post to thank you (I wasn’t sure where else to put it). If you decide there’s merit enough in this proposal to implement a WSMD file for your site, please drop me a line. If you have constructive criticism related to anything here, please let me know. I’m convinced that the ideas outlined here will work, and I’d like to see where it goes.
Update: added a warning to the WSMD as XHTML page not to download the WSMD file directly.
Update: a bit of Googling revealed that the term “WSDF” which I had been using until Jan 15, 2004, was being used in a Web Services context. In order to avoid confusion, I decided to call this format Web Site MetaData.
I have code here to show you how you could leverage the value inherent in a WSMD file. Unfortunately, since this domain is not my own, I am not free to simply set up the scripts needed as I see fit. So I encourage you to download it, unzip it, and have a look. If you’re not a programmer, you may be more interested in possible use cases for WSMD files if someone should but create scripts to realize them.
The code is provided as proof-of-concept only. I’m sure that many of you reading this are far better programmers than I, and I hope that you’ll take what inspiration you can from it and run with it.
For those of you not interested in downloading it (yet?), this post will provide the briefest of overviews of how I wrote this search engine.
This search page was generated by a script called wsmdfindmaker.pl, and its sole purpose was to parse the WSMD and create the list of possible sections and filetypes to search within. The thinking behind doing it this way is that, as sole author of this blog, I will be updating my WSMD file, and hence my search page, only as often as I’m posting. All the other times when requests are made to the page, the representation will be exactly the same. Why bother, then, wasting CPU cycles generating the exact same output between page refreshes?
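To give a feel for what a script like wsmdfindmaker.pl has to do, here is a minimal sketch in Python (it is not the Perl from the download). It assumes a WSMD file written as XHTML, with sections as <div /> tags carrying a title attribute and assets as <object /> tags carrying a type attribute. The sample document, the function names, and the form field names are all invented for illustration, and the XHTML namespace is left out to keep things short.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical WSMD content, invented for this sketch.
SAMPLE_WSMD = """<html><head><title>Example site</title></head><body>
<div title="Blog">
  <object data="http://example.org/blog/" type="text/html">Blog home page</object>
  <object data="http://example.org/blog/cat.jpg" type="image/jpeg">A cute cat</object>
  <div title="Archive">
    <object data="http://example.org/blog/2004/01/" type="text/html">January 2004</object>
  </div>
</div>
</body></html>"""


def search_options(wsmd_text):
    """Return (section titles in document order, sorted distinct asset types)."""
    root = ET.fromstring(wsmd_text)
    sections = [div.get("title") for div in root.iter("div") if div.get("title")]
    types = sorted({obj.get("type") for obj in root.iter("object") if obj.get("type")})
    return sections, types


def render_search_form(sections, types):
    """Build the static search page once, rather than on every request."""
    def options(values):
        return "".join(
            '<option value="{0}">{0}</option>'.format(html.escape(v)) for v in values
        )
    return (
        '<form action="wsmdfind.cgi" method="get">'
        '<input type="text" name="q" />'
        '<select name="section"><option value="">Any section</option>{}</select>'
        '<select name="type"><option value="">Any type</option>{}</select>'
        '<input type="submit" value="Search" />'
        "</form>"
    ).format(options(sections), options(types))


sections, types = search_options(SAMPLE_WSMD)
page = render_search_form(sections, types)
```

The point of the design is that the output only changes when the WSMD file does, so the form can be written to disk after each post and served as a plain file.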
The search itself is covered by another script, called wsmdfind.cgi, which dynamically constructs the XPath for the search, pulls matching nodes, and attempts to match the search text (if present) to the result set, printing anything that appears to be a candidate result.
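The same idea can be sketched in a few lines of Python, using the limited XPath support in the standard library instead of XML::XPath. This is an illustration under assumptions, not a port of wsmdfind.cgi: the sample document, the function names, and the parameter names are made up, the XHTML namespace is omitted, and section titles containing an apostrophe would need escaping before being placed in the path.

```python
import xml.etree.ElementTree as ET

# Hypothetical WSMD content, invented for this sketch.
SAMPLE_WSMD = """<html><body>
<div title="Photos">
  <object data="http://example.org/photos/" type="text/html">Photo gallery home page</object>
  <object data="http://example.org/photos/cat.jpg" type="image/jpeg">My cat asleep on the keyboard</object>
</div>
<div title="Essays">
  <object data="http://example.org/essays/cats.html" type="text/html">Why cats rule the web</object>
</div>
</body></html>"""


def build_path(section=None, filetype=None):
    """Assemble a path expression from whichever filters were supplied."""
    path = "."
    if section:
        path += "//div[@title='{}']".format(section)
    path += "//object"
    if filetype:
        path += "[@type='{}']".format(filetype)
    return path


def find_assets(wsmd_text, section=None, filetype=None, text=None):
    """Return (uri, description) pairs for assets matching the filters."""
    root = ET.fromstring(wsmd_text)
    results = []
    for obj in root.findall(build_path(section, filetype)):
        # Collapse whitespace in the human-readable description.
        description = " ".join("".join(obj.itertext()).split())
        if text and text.lower() not in description.lower():
            continue
        results.append((obj.get("data"), description))
    return results
```

Calling find_assets(SAMPLE_WSMD, section="Essays", filetype="text/html") searches one section for one file type, while find_assets(SAMPLE_WSMD, text="cat") searches every description in the file.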
I encourage you to download the code and take it out for a spin. The requirements are that you have XML::XPath (and its prerequisites) installed. I also make use of the Perl CGI module (which seems to be part of the standard Perl distribution these days). I’m running the page successfully on a Mac OS X machine, and despite the tedium of installing all the prerequisites, I have not encountered any actual problems with the install.
This isn’t the only thing you could do with a WSMD. I’ve discussed some other applications in this post that may inspire you.
When I started writing this series of posts, I already had the properties of a WSMD file in my head. Initial research suggested to me that RSS would be the only candidate, because the structure was simple enough, and I already understood it. So it came as a surprise to me to work this out and then realize that Atom and XHTML are also good candidates. Nevertheless, I’d like to demonstrate how these principles could be implemented in RSS.
The RSS implementation is dirt easy. I chose to work with the 0.91 spec so as to use a ‘lowest common denominator’. I’m going to assume that you already know how to read RSS files.
Let’s copy a list from this post. The properties of a WSMD entry must be:
Well, we’ve already nailed two of the three RSS tags used in each <item>: <link /> and <description />. That leaves the <title /> tag to hold the indication of what it is. That feels wrong to me, somehow, because it’s a <title /> tag, not an <an-indication-of-what-it-is /> tag. Still, I guess it could be argued that a title is supposed to be an indication of what something is.
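As a rough illustration of that mapping, here is a short Python sketch that builds a single <item> for one asset. The function name, the example URI, and the wording used in the <title /> are assumptions made for the example; only the three tag names come from the discussion above.

```python
import xml.etree.ElementTree as ET


def wsmd_item(uri, kind, description):
    """Build one RSS <item> describing a single asset of a site."""
    item = ET.Element("item")
    # <title /> is pressed into service as the indication of what the asset is.
    ET.SubElement(item, "title").text = kind
    ET.SubElement(item, "link").text = uri
    ET.SubElement(item, "description").text = description
    return item


item = wsmd_item("http://example.org/", "Home page", "Front page of the example site")
xml_text = ET.tostring(item, encoding="unicode")
print(xml_text)
# <item><title>Home page</title><link>http://example.org/</link><description>Front page of the example site</description></item>
```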
I’ve created the RSS implementation of the WSMD file. You can find the one for the home page here. I seem to have lost these files. I’ll look into recreating them at some future date.
What’s really cool about this minimal implementation is that if you want to add more metadata than the minimum I defined, you need merely switch to a more modern revision of RSS — one that supports, say, the Dublin Core metadata set, and run with that.
Before I started this project, I knew about Atom, and ‘I was there’ when Sam Ruby first opened up the issue that eventually led to the fine work that has been done, and continues to be done, today. However, like many others, I was quickly overwhelmed by the frenetic pace of development, and decided to sit out until they had something to show for their efforts.
I sat down one weekend to look at the Atom Syndication Feed specification for the first time, and realized that this format was very interesting because it provided a richer markup framework for describing what a site was. A lot of this information wasn’t required, based on the model, but to provide it arguably leads to a higher-quality (and potentially more useful) description file.
Again, assuming you know how to mark up an Atom feed, consider the following:
Each entry in an Atom feed describes one asset of a web site. As in RSS, the <link />/<id /> tags point to the representation itself, the <summary /> tag will contain a human-readable description of the asset, and the title would contain the indication of what the resource is.
But the Atom syntax spec requires additional information in the feed. For example, each entry needs an <issued /> tag to indicate when the entry was, well, issued. Feeds also require, at the top of the document, such things as a title, a link to connect the feed to (in this case, it would be the home page of a site), a <modified /> tag to indicate the last modified date... you get the idea. There’s more information required to make up an Atom feed, much more than I think is required to describe a web site, but nevertheless, value is being added by filling in that data.
I have not at this time provided an Atom version for your consumption, because in terms of structure, it is so similar to RSS that I believe nothing new could be learned by studying it in Atom.
Now let’s examine the XHTML version of a WSMD.
I don’t have all the answers here, but I’m convinced that this proposed solution is bound to be useful in all sorts of situations. I humbly submit the following inspirations, and I hope you’ll share with me ideas of your own.
If you are contributing to a vast international website spanning multiple domains on multiple machines in multiple locations, you’ve probably had to give some thought to creating some kind of single sign-on mechanism that would empower any registered user to access any part of the site, no matter where she registered in the first place.
If such a site also required that some users have different privileges than others, then a WSMD file could be useful to establish domains where people may have access. What’s cool about this is that you can alter the files whenever you want, and transparently alter what domains are accessible to a registered user, in real time.
This idea depends on the willingness of search engines, such as Google, to cache WSMD files (but since they’re HTML anyway, that’s already been done, I suppose), and then to modify the search interface a bit to be able to search the contents of such files intelligently.
How will this contribute to the robustness of your site? Suppose your site spanned multiple domains, and one of your machines got slashdotted. You decide to move some of the content to another machine, link it off the homepage on that machine, and rest easy. Why? Because anyone knowledgeable enough to hit Google after discovering your slashdotted site could easily find out from the cached WSMD file for your site that you’re hosting on more than one machine, and go see if the information they seek is on the other one.
I think the WSMD file also de-emphasizes the importance of domain names. I’ve had a friend make the comment that anyone who had to split their site up among multiple machines is being sloppy. Yes, that’s arguable; there are many sites built over multiple machines that are purposefully designed that way, and to consolidate them under a single domain could well be impossible. But with a WSMD file and the right interface, where a particular piece of information comes from may well be irrelevant, as long as you can get it.
Have you ever been in the situation where you decided to comment on another story or post on someone else’s site, only to find that the thinking was so good, you wish you could move it to your own? With a WSMD file, you could simply ‘claim’ your comments as being part of your site, and people searching for something you said wouldn’t have to worry if you said it on your own hosted pages or as part of a submitted comment.
None of these ideas are in themselves the ‘killer app’ for WSMD, but that’s ok. We’ve had the web since, what? 1991? I got started in this in ’95, and I know that there existed no machine-parseable definition of what a website is at that time. If this notion were to take off and become popular, I’d rather it stay loose enough to ensure we’re marking up the right content, and tighten it up as time goes by and our collective experience increases. And it’s going to take a certain amount of experience before we can figure out how to make the best use of this information.
As I mentioned before, I had always assumed that I was going to present the solution in RSS format. I knew that one of the drawbacks to using RSS was that you couldn’t describe the entire website in one file — not if you wanted to preserve the notion of sections to search in. But I figured it didn’t matter. I can simply link to other wsmd.rss files, where each file described a section.
When I actually implemented the format, I discovered that it was a real pain to edit and ensure that everything was set up and linked properly. I also found that the file sizes were absolutely tiny — and while I’m not an expert in the HTTP protocol, I was wondering if perhaps they were so small that the HTTP overhead, combined with I/O bottlenecks for fetching the files, made the scheme a little inefficient.
With that experience, I tweaked the requirements a bit, as you saw here, and came up with a way to describe a site in one file, and it wasn’t that bad in terms of size. For instance, in my WSMD file, I describe 93 assets in a 14K file. Not bad, really. If a large site contained, say, 5000 assets worth putting into a WSMD file, it would occupy a file roughly 750K in size. I suppose that in the web world, that’s huge, but I don’t really see it as being a problem. For one thing, there’s nothing keeping you from breaking the WSMD file into smaller files, with the top-level file linking to other WSMD files. But that might not be necessary if the ones most likely to use the file are the ones hosting the site (and provide you as the user with a richer means of utilizing the site).
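The 750K figure is a straight-line extrapolation from the numbers above, and can be checked with a quick calculation. This assumes that the size of the file grows in proportion to the number of assets described, which ignores the fixed overhead of the <html /> and <head /> tags and any variation in the length of descriptions.

```python
ASSETS_DESCRIBED = 93   # assets in the existing WSMD file
FILE_SIZE_K = 14        # size of that file, in K


def estimated_size_k(asset_count):
    """Scale the observed size-per-asset up to a larger site."""
    per_asset_k = FILE_SIZE_K / ASSETS_DESCRIBED   # about 0.15K, or roughly 154 bytes
    return asset_count * per_asset_k


large_site_k = estimated_size_k(5000)
print(round(large_site_k))  # 753
```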
I don’t think anyone could argue intelligently against the claim that, of RSS, Atom, and XHTML, more people know XHTML than either of the others. This means that the learning curve for implementing WSMD files is nearly flat: you need only learn the model, and familiarize yourself with the two tags absolutely required to make this work.
As you no doubt have already tried to do, trying to view a WSMD file is an interesting experience. Instead of getting markup, you actually get a chance to see the entire website in one page. For those of you on slower connections, that was probably a painful experience, and I’m sorry about that. But I don’t really see it as a liability, because for testing purposes, what better way to check to see if you got all your links working right? Besides, the WSMD file obviously isn’t meant to be viewed in a browser. It’s meant to be mined for useful data.
Let’s look again at the anatomy of a web site. Here’s the breakdown:
Discussion threads, blogs, and wikis can all easily map to either a section (a <div /> tag) containing more granular assets, or simply be noted as an asset (an <object /> tag) which would have as its URI the starting point for that service. Browser-level, plug-in, and downloadable assets are definitely objects. Home pages and applications are objects too. The only things left are sections, and as those are structural hints that cannot be dereferenced by URI, they map naturally to <div /> tags.
If a property of a web site maps to a <div /> tag, then the only information we really need is a human-readable title. On the other hand, if it’s an object, then all we really need are the location of the object, the type of object it is, and a description of the contents.
I don’t think I’ve abused the ontology of XHTML to describe a site, whereas with RSS and Atom, I had to shoehorn a required value into at least one of the fields in an unintended manner.
As described, none of the possible formats, including this one, force the notion of a home page; it would be up to the author of the file to mark which pages are home pages. If the author of a WSMD file wanted to ensure, however, that the file would be as useful as possible, then he is encouraged to adopt the generally accepted terminology wherever it is relevant.
If it’s not obvious at this point, the value of a WSMD file is directly proportional to what you put in it. You don’t have to put every single resource in it if you don’t want to, at the risk of diminishing the value of the WSMD. But there are cases where this is exactly the thing to do. For example, it might not make sense to put in any URIs that come from the middle or end of an application flow, because you want to ensure that people always start at the beginning of a task. You might not want to put some of the graphics in the WSMD, because they describe the look and feel of the site, and would serve no value to anyone else.
Another notion peculiar to the XHTML implementation of WSMD is that while you must describe your site using <div /> and <object /> tags, by no means are you limited to the kind of markup you could put within an object tag. You can make citations, link to other resources, both within or without your site, or otherwise add structure to your data.
All this sounds like great theory, but what’s the utility? Is there a killer app for this? I believe so, and I’ll start the discussion here.
With the general types of sites now described, what does a site consist of, semantically speaking?
It may be that some of you would think that movies or Flash files should be Plugin/Downloadable assets, and others think that they’re Browser-level. Honestly, it doesn’t matter that much to me. You’re both right. But very likely, everyone would agree that there are some things that are more Plugin/downloadable and others are more Browser-level.
So, what that list boils down to is this: If you want a web site, you must begin with at least a homepage. Having one implies that there’s a section. Everything else is optional.
The term ‘section’ is interesting enough to be fleshed out a bit:
That list looks familiar, doesn’t it? A web site, then, is essentially a section that may contain a number of assets, including other sections.
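That recursive definition translates almost directly into a data structure. The following Python sketch is one way to write it down; the class names, field names, and the strings used for the kind of asset are assumptions chosen for the example, not part of the model itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Asset:
    uri: str
    kind: str          # e.g. "home page", "browser-level", "plugin/downloadable"
    description: str = ""


@dataclass
class Section:
    title: str
    assets: List[Asset] = field(default_factory=list)
    sections: List["Section"] = field(default_factory=list)

    def all_assets(self):
        """Yield every asset in this section and in any nested section."""
        yield from self.assets
        for sub in self.sections:
            yield from sub.all_assets()

    def has_home_page(self):
        """A web site must begin with at least a home page."""
        return any(a.kind == "home page" for a in self.all_assets())


# A web site is just the outermost section.
site = Section(
    "Example site",
    assets=[Asset("http://example.org/", "home page", "Front page")],
    sections=[
        Section(
            "Photos",
            assets=[Asset("http://example.org/photos/cat.jpg", "browser-level", "A cute cat")],
        )
    ],
)
```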
I wonder, though, if some of the numbers could be tweaked. For example, can a web site contain a section made up of nothing but, say, browser-level assets? I suppose so, in which case you’d have 0-n home pages. But for now, I’ll leave the numbers the way they are, because they are typical for almost any site.
Tim Bray, in this article, already talked about it, really. A web site can span domains, or it may be within a directory or two of a domain, but not at the root of that domain. It stands to reason that the only safe assumption to make is that a web site must have at least one URI, and that every asset, whether it’s a page or an image or a binary document, is going to have a URI.
Also worth mentioning are such things as the author and last modified dates of a site. Such attributes are typically not referenced by URI, but are represented by markup within a representation. So, for the purposes of this model, they are not considered, because I don’t believe they will have an impact on the solution.
I’m not sure whether there’s any more to put in the model at this point, so let’s have a look at what falls out of this.
The requirement this request matches is quoted here:
… Let’s imagine someone’s “Turned On Search” for a website; they want a way to feed queries in and get results out. Most people build web sites with one kind of templating facility or another; well-known examples are PHP, JSP, then there’s all the blogging tools and any number of different portal-ware offerings.
Before I invested too much time on a solution, I wanted to build a model of what a web site is. That way I can check my solution against the model to see if it works. As you read through the model, please understand that while I have made a bit of effort to be as complete as possible, I’m sure I missed points that are sure to be valuable. Please feel free to contribute to this discussion, because even if my proposed solution is unsatisfactory, I believe that having a model to test potential solutions against will continue to have value.
By saying Library-like, I’m imagining a web site that has as its primary characteristic a top-down organizational model strongly supported by the navigation system. If you want information, you would typically ‘drill down’ from the home page, to a relevant section of the site, to (eventually) a page containing that information. This behaviour is very much like a library (hence the term) — you start at the front door, walk to the section likely to contain the book you want, then you search the shelves, then the books.
This type of web site is designed to function, of course, like an application. It would consist of at least one page, but usually more. Its primary organizational characteristic is, I think, a left to right topology. I’m not sure if I’m using that term correctly, and I’m not implying that this type of site was built left-to-right, but what I am saying is that once the thing is done, the way it would be used is to start at the beginning and work your way through to the end.
If we ignore for the moment the single-page applications, then another of the characteristics typical of Application-like sites is that it won’t make much sense to bookmark a page in the middle of a task flow, because a bookmark can’t capture the context of the work you’re doing in that application.
A blog-like site’s main characteristic is that posts are typically sorted in reverse chronological order. Reinforcing that is, usually, a weak development of a more traditional top-down navigation system such as you’d find in the Library-like sites.
A wiki-like site’s main characteristic is a high density of links in the body content, coupled with a weak top-down navigation system more typically found in Library-like sites. A formal organization may be nonexistent, but if a wiki-like site is organized, then that organization tends to be bottom-up in nature.
Any site that is primarily a place for on-line discussion where you post to a web form and see your comments in-line with other contributions by other people is a community-like site. Posts are generally sorted in chronological order, or by a ranking system, or by a threading mechanism, and many posts typically share the same page, although there are some systems that will allow you to page through the comments if having them all on one page is unwieldy.
While I called all of these a type of site, that won’t be true of all web sites: any given web site may have a dominant type, with other types in various sections (e.g., a corporate site like FedEx would be Library-like, with many Application-like areas within).
Any site designed to aggregate news covering one topic, or a variety of topics, superficially resembles a blog in that the most current news is posted to the main page, but it also possesses characteristics of a strong top-down navigation system designed to help filter the kinds of news a consumer is interested in. Such sites may be designed to not host the news item itself, but rather simply link to the news articles on other sites.
With these classifications in mind, let’s break them down into their basic components.
Once I started looking at the possibilities of using XHTML, I realized that I could change some of the properties. Let’s review the list:
I’ll show you the syntax in a bit, but what I realized was that the first three requirements could be rewritten as follows:
As an XHTML file, there’s obviously going to be the usual required overhead: <html /> tags, <head /> tags, <title /> tags, and <body /> tags. Inside the <body /> tags, we can describe what a website looks like using only two tags: the <div /> tag and the <object /> tag. An example is provided here. Note: since this is an HTML file that links in almost everything on the site, you actually might want to “Save to disk” instead of clicking it directly.
A section can be indicated with <div /> tags, and the title of that section can be reflected in the title attribute. Sections may be nested within each other to create sub-sections or sub-categories. This nesting can be as deep as required to describe your site.
The <object /> tags are wonderfully rich, and so are perfect for describing each asset that makes up a web site. The data attribute contains the URI for the asset in question, and the type attribute provides an intelligent way of discerning what kind of asset it is. Between the open and close tags can lie any content you care to use to provide a human-readable description of what that object describes. Even more interesting is that there’s nothing keeping you from putting richer markup in this area. For example, a link can be provided to a site as a citation for the source of the information represented in that asset.
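To make the structure concrete, here is a small Python sketch that assembles such a file with nested sections, two assets, and a citation link inside one description. It is only an illustration: the helper names, URIs, and descriptions are invented, and the DOCTYPE and XHTML namespace that a real file would carry are left out for brevity.

```python
import xml.etree.ElementTree as ET


def section(parent, title):
    """Add a <div /> whose title attribute names the section."""
    return ET.SubElement(parent, "div", {"title": title})


def asset(parent, uri, mime_type, description):
    """Add an <object /> with its URI, its type, and a readable description."""
    obj = ET.SubElement(parent, "object", {"data": uri, "type": mime_type})
    obj.text = description
    return obj


root = ET.Element("html")
head = ET.SubElement(root, "head")
ET.SubElement(head, "title").text = "WSMD for example.org"
body = ET.SubElement(root, "body")

blog = section(body, "Blog")
asset(blog, "http://example.org/blog/", "text/html", "Blog home page")

# Sections nest as deeply as the site requires.
archive = section(blog, "Archive")
entry = asset(archive, "http://example.org/blog/2004/01/15.html", "text/html",
              "A post that cites ")

# Richer markup inside the description: a link used as a citation.
cite = ET.SubElement(entry, "a", {"href": "http://example.com/source"})
cite.text = "its source"

wsmd_text = ET.tostring(root, encoding="unicode")
print(wsmd_text)
```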
After looking at implementations of the concepts in RSS, Atom, and XHTML, I decided that the best fit was XHTML. Let me outline the thinking in this section.
Use the <link /> tag in the HTML documents on your site. You only need to use one on the home page, as that one file must link to all the other WSMD files on the site. However, if you’ve implemented a WSMD search engine in your site, then it may be very useful to have the ‘home pages’ of sections link to their section’s WSMD file for scoped searching.
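A client could discover the file along these lines. This Python sketch scans a page for <link /> tags and returns the ones that point at a WSMD file. Note that the rel value used here, "wsmd", is purely a placeholder assumed for the example, as are the class name, the function name, and the sample page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class WsmdLinkFinder(HTMLParser):
    REL = "wsmd"  # hypothetical rel value, assumed for this sketch

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.found = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rels = (attrs.get("rel") or "").lower().split()
        if tag == "link" and self.REL in rels and attrs.get("href"):
            # Resolve relative references against the page's own URI.
            self.found.append(urljoin(self.base_url, attrs["href"]))


def discover_wsmd(page_html, base_url):
    """Return the absolute URIs of every WSMD file the page links to."""
    finder = WsmdLinkFinder(base_url)
    finder.feed(page_html)
    return finder.found


# Hypothetical home page, invented for this sketch.
HOME_PAGE = """<html><head>
<title>Example</title>
<link rel="stylesheet" href="/style.css" />
<link rel="wsmd" type="text/html" href="/wsmd.html" />
</head><body></body></html>"""

print(discover_wsmd(HOME_PAGE, "http://example.org/"))
# ['http://example.org/wsmd.html']
```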
Ok. Let’s explore possible implementations.
I believe that the best way to define what’s in a web site is to create a machine-readable file listing all the assets you think are important enough to be documented. And a machine-readable file, in this case, means that we’re going to talk XML.
Should we use RDF? That, for everyone who uses RDF, would be a no-brainer. If someone were to take the model I set up and create an ontology optimized for describing web sites, then great! Please share it with me when you get it finished!
In fact, I wanted to use RDF to describe this. I have no problems figuring out the model RDF uses. However, I have yet to find a good tutorial that makes learning the RDF syntax as easy as understanding the model. The code examples I’ve seen are inscrutable.
So I don’t want to use RDF. I want the solution to be as simple as marking up HTML. Tim Bray has argued on many occasions that a successful markup language (or programming language) is one you can view-source, and hack around in with a high degree of confidence that what you’ll do will probably work. That should be the sweet spot to aim for in the implementation of the syntax.
Another approach is to build yet another markup language that captures the ontology precisely, and is easy to pick up. I didn’t want to do that either. In a world where everyone and their brother has their own XML-based tag set, yet another one isn’t going to do much good. I’d much rather try to leverage something already popular and easy to use.
So, the solution I want is going to be an ontology-free (or as free as possible), simple to use, already deployed XML markup language.
I can only think of three candidates: RSS, Atom, and XHTML. Before I discuss the implementation, I want to sketch out what I think should be represented and how.
A WSMD file should look like this:
In choosing those constraints, I feel that the resulting files will tend to be quite small and easy to read through, and quick to parse, even for extremely large sites.
One more thing to talk about before going into implementations: we need to talk about how WSMD files are discovered.