Robert Hahn

noted on Thu, 11 Dec 2003

What is a Web Site?

Furthering the discussion involving Tim Bray (where I picked it up), Dave Winer, Sam Ruby, Jeremy Zawodny, and Joe Gregorio, I started picking at the issue Tim brought up:

The problem is that all the Web knows about is URIs, and the Web can’t tell whether a URI points to a home page, a picture of a cute cat, or to one of a dozen daily entries on some blog… And I bet, down the road, once we really have the notion of a site, we’ll be able to think of all sorts of other useful things to do with it.

This series of posts builds on the thinking Tim puts out, so I recommend you look at his article before continuing.

What I have done in the following series of posts is to map out a model of what a web site is, break it down into its atomic pieces, and determine whether there’s a way to represent the data in a format that is both machine- and human-readable. Some of these articles, like the “WSDF as RSS/WSDF as Atom” article, can safely be skipped if you’re pressed for time.

If you read through most or all of this, let me use this post to thank you (I wasn’t sure where else to put it). If you decide there’s merit enough in this proposal to implement a WSDF file for your site, please drop me a line. If you have constructive criticism related to anything here, please let me know. I’m convinced that the ideas outlined here will work, and I’d like to see where it goes.

Justification for using XHTML

File sizes

As I mentioned before, I had always assumed that I was going to present the solution in RSS format. I knew that one of the drawbacks to using RSS was that you couldn’t describe the entire website in one file — not if you wanted to preserve the notion of sections to search in. But I figured it didn’t matter: I could simply link to other wsdf.rss files, with each file describing a section.

When I actually implemented the format, I discovered that it was a real pain to edit and ensure that everything was set up and linked properly. I also found that the file sizes were absolutely tiny — and while I’m not an expert in the HTTP protocol, I was wondering if perhaps they were so small that the HTTP overhead, combined with I/O bottlenecks for fetching the files, made the scheme a little inefficient.

With that experience, I tweaked the requirements a bit, as you saw here, and came up with a way to describe a site in one file, and it wasn’t that bad in terms of size. For instance, in my WSDF, I describe 93 assets in a 14K file (roughly 150 bytes per asset). Not bad, really. At that rate, if a large site contained, say, 5000 assets worth putting into a WSDF, it would occupy a file roughly 750K in size. I suppose that in the web world, that’s huge, but I don’t really see it as being a problem. For one thing, there’s nothing keeping you from breaking the WSDF into smaller files, with the top-level file linking to other WSDFs. But that might not be necessary if the parties most likely to use the file are the ones hosting the site (and they could then provide you, the user, with a richer means of utilizing the site).

Ease of markup

I don’t think anyone could seriously dispute that, of RSS, Atom, and XHTML, more people know XHTML than either of the others. This means that the learning curve for implementing WSDFs is nearly flat — you need only learn the model and familiarize yourself with the two tags absolutely required to make this work.

Viewing a WSDF File.

If you have already tried to view a WSDF in a browser, you’ll know it’s an interesting experience: instead of raw markup, you get to see the entire website described on one page. For those of you on slower connections, that was probably a painful experience, and I’m sorry about that. But I don’t really see it as a liability, because for testing purposes, what better way is there to check that you got all your links working right? Besides, the WSDF obviously isn’t meant to be viewed in a browser. It’s meant to be mined for useful data.

XHTML offers exactly the right semantics for the job.

Let’s look again at the anatomy of a web site. Here’s the breakdown:

Discussion threads, blogs, and wikis can all easily map to either a section (a <div /> tag) containing more granular assets, or simply be noted as an asset (an <object /> tag) whose URI is the starting point for that service. Browser-level, plug-in, and downloadable assets are definitely objects. Home pages and applications are objects too. The only things left are sections, and as those are structural hints, not URI-dereferenceable resources, they map naturally to <div /> tags.

If a property of a web site maps to a <div /> tag, then the only information we really need is a human-readable title. On the other hand, if it’s an object, then all we really need are the location of the object, the type of object it is, and a description of the contents.
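
In concrete terms, the mapping might look like this (the title, URI, and type below are invented for illustration):

    <div title="Weblog">
      <object data="http://www.example.com/blog/" type="text/html">
        The weblog, with new entries posted most weekdays.
      </object>
    </div>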

I don’t think I’ve abused the ontology of XHTML to describe a site, whereas with RSS and Atom, I had to shoehorn a required value into at least one of the fields in an unintended manner.

Other Notes on creating WSDFs

As described, none of the possible formats, including this one, forces the notion of a home page; it would be up to the author of the file to mark which pages are home pages. If the author of a WSDF wanted to ensure that the file would be as useful as possible, however, then he is encouraged to adopt the generally accepted terminology wherever it is relevant.

If it’s not obvious at this point, the value of a WSDF is directly proportional to what you put in it. You don’t have to put every single resource in it if you don’t want to — at the risk of diminishing the value of the WSDF. But there are cases where leaving things out is exactly the right thing to do. For example, it might not make sense to include any URIs that come from the middle or end of an application flow, because you want to ensure that people always start at the beginning of a task. You might not want to put some of the graphics in the WSDF, because they describe the look and feel of the site and would serve no value to anyone else.

Another notion peculiar to the XHTML implementation of a WSDF is that while you must describe your site using <div /> and <object /> tags, by no means are you limited in the kind of markup you can put within an object tag. You can make citations, link to other resources either within or outside your site, or otherwise add structure to your data.

All this sounds like great theory, but what’s the utility? Is there a killer app for this? I believe so, and I’ll start the discussion here.

Getting the value out of WSDFs

I don’t have all the answers here, but I’m convinced that this proposed solution is bound to be useful in all sorts of situations. I humbly submit the following inspirations, and I hope you’ll share with me ideas of your own.

Single Sign On

If you are contributing to a vast international website spanning multiple domains on multiple machines in multiple locations, you’ve probably had to give some thought to creating some kind of single sign-on mechanism that would empower any registered user to access any part of the site, no matter where she registered in the first place.

If such a site also required that some users have different privileges than others, then a WSDF file could be useful for establishing which areas of the site a given user may access. What’s cool about this is that you can alter the file whenever you want, and transparently change which areas are accessible to a registered user — in real time.

Site Robustness

This idea depends on the willingness of search engines, such as Google, to cache WSDFs (but since they’re HTML anyway, that’s arguably already being done), and then to modify the search interface a bit so that the contents of such files can be searched intelligently.

How will this contribute to the robustness of your site? Suppose your site spanned multiple domains, and one of your machines got slashdotted. You decide to move some of the content to another machine, link it off the homepage on that machine, and rest easy. Why? Because anyone knowledgeable enough to hit Google after discovering your slashdotted site could easily find out from the cached WSDF for your site that you’re hosting on more than one machine, and go see if the information they seek is on the other one.

I think the WSDF also de-emphasizes the importance of domain names. I’ve had a friend make the comment that anyone who had to split their site up among multiple machines is being sloppy. Yes, that’s arguable; there are many sites built over multiple machines that are purposefully designed that way, and to consolidate them under a single domain could well be impossible. But with a WSDF and the right interface, where a particular piece of information comes from may well be irrelevant, as long as you can get it.

Ownership of Content

Have you ever been in the situation where you decided to comment on another story or post on someone else’s site, only to find that the thinking was so good, you wish you could move it to your own? With a WSDF file, you could simply ‘claim’ your comments as being part of your site, and people searching for something you said wouldn’t have to worry if you said it on your own hosted pages or as part of a submitted comment.

None of these ideas is in itself the ‘killer app’ for WSDF, but that’s ok. We’ve had the web since, what, 1991? I got started in this in ’95, and I know that in all that time there has been no machine-parseable definition of what a website is. If this notion were to take off and become popular, I’d rather it stay loose enough to ensure we’re marking up the right content, and tighten it up as time goes by and our collective experience increases. And it’s going to take a certain amount of experience before we can figure out how to make the best use of this information.

A Search Engine for WSDF

I have code here to show you how you could leverage the value inherent in a WSDF. Unfortunately, since this domain is not my own, I am not free to simply set up the scripts as I see fit. So I encourage you to download the code, unzip it, and have a look. If you’re not a programmer, you may be more interested in the possible use cases for WSDFs that scripts like these could realize.

The code is provided as proof-of-concept only. I’m sure that many of you reading this are probably far better programmers than I, and I hope that you’ll take what inspiration you can from them and run with it.

For those of you not interested in downloading it (yet?), this post will provide the briefest of overviews of how I wrote this search engine.

What I decided to do was design the search page in such a way as to provide as much context as possible for the search. In this case, I provided two HTML select lists. One of them contained all the section titles (so you can set a scope for the search), and the other contained a list of possible file types to search within (choices included HTML, RSS, images, CSS, and JavaScript). Finally, you can enter search terms to further refine your search.

This search page was generated by a script called wsdffindmaker.pl, and its sole purpose was to parse the WSDF and create the list of possible sections and filetypes to search within. The thinking behind doing it this way is that, as sole author of this blog, I will be updating my WSDF file, and hence my search page, only as often as I’m posting. All the other times when requests are made to the page, the representation will be exactly the same. Why bother, then, wasting CPU cycles generating the exact same output between page refreshes?
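
A minimal sketch of that extraction step might look like the following; the file name, the use of local-name() to sidestep namespace handling, and the bare <select> output are my assumptions, not a copy of the shipped script:

    #!/usr/bin/perl
    # Sketch: pull section titles and asset types out of a WSDF so they can
    # populate the two select lists on a statically generated search page.
    use strict;
    use warnings;
    use XML::XPath;

    # Hypothetical location of the site's WSDF file.
    my $xp = XML::XPath->new(filename => 'wsdf.html');

    my (%sections, %types);
    # local-name() keeps the queries working whether or not the WSDF
    # declares the XHTML namespace.
    $sections{ $_->getAttribute('title') } = 1
        for $xp->findnodes('//*[local-name()="div"][@title]');
    $types{ $_->getAttribute('type') } = 1
        for $xp->findnodes('//*[local-name()="object"][@type]');

    print qq{<select name="section">\n};
    print qq{  <option>$_</option>\n} for sort keys %sections;
    print qq{</select>\n<select name="type">\n};
    print qq{  <option>$_</option>\n} for sort keys %types;
    print qq{</select>\n};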

The search itself is covered by another script, called wsdffind.cgi, which dynamically constructs the XPath for the search, pulls matching nodes, and attempts to match the search text (if present) to the result set, printing anything that appears to be a candidate result.
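
Here is a rough sketch of how that dynamic half could work; the parameter names, the WSDF’s path, and the literal substring match are invented for illustration, and the real script may differ:

    #!/usr/bin/perl
    # Sketch: build an XPath from the submitted section and type, then filter
    # the matching objects' descriptions against the search terms.
    use strict;
    use warnings;
    use CGI qw(:standard escapeHTML);
    use XML::XPath;

    my $section = param('section') || '';
    my $type    = param('type')    || '';
    my $terms   = param('q')       || '';

    my $xp = XML::XPath->new(filename => 'wsdf.html');  # hypothetical path

    # Scope the query to the chosen section, then to objects of the chosen
    # type. (A production script would sanitize these values first.)
    my $path = $section ? qq{//*[local-name()="div"][\@title="$section"]} : '';
    $path .= qq{//*[local-name()="object"]};
    $path .= qq{[starts-with(\@type, "$type")]} if $type;

    print header(-type => 'text/html');
    for my $node ($xp->findnodes($path)) {
        my $text = $node->string_value;
        next if $terms && $text !~ /\Q$terms\E/i;
        printf qq{<p><a href="%s">%s</a></p>\n},
            $node->getAttribute('data'), escapeHTML($text);
    }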

I encourage you to download the code and take it out for a spin. The requirements are that you have XML::XPath (and its prerequisites) installed. I also make use of the Perl CGI module (which seems to be part of the standard Perl distribution these days). I’m running the page successfully on a Mac OS X machine, and despite the tedium of installing all the prerequisites, I have not encountered any actual problems with the install.

This isn’t the only thing you could do with a WSDF. I’ve discussed some other applications in this post that may inspire you.

WSDF as XHTML

Once I started looking at the possibilities of using XHTML, I realized that I could change some of the properties. Let’s review the list:

  1. Each file describes the current ‘section’.
  2. If other sections exist, a link is provided to their WSDF.
  3. If it’s not a section, an entry is added to the current WSDF. This entry has the following properties:
    1. a URI to the asset
    2. an indication of what it is (possibly by mime-type, or a consistent use of terms)
    3. a description of what it is

I’ll show you the syntax in a bit, but what I realized was that those three requirements could be rewritten as follows:

  1. One file can be used to describe an entire site
  2. Each section can be represented with a tag
  3. If it’s not a section, use another tag to describe what kind of resource it is.

As an XHTML file, there’s obviously going to be the usual required overhead: <html /> tags, <head /> tags, <title /> tags, and <body /> tags. Inside the <body /> tags, we can describe what a website looks like using only two tags: the <div /> tag and the <object /> tag. An example is provided here.

A section can be indicated with <div /> tags, and the title of that section can be reflected in the title attribute. Sections may be nested within each other to create sub-sections or sub-categories. This nesting can be as deep as required to describe your site.

The <object /> tags are wonderfully rich, and so are perfect for describing each asset that makes up a web site. The data attribute contains the URI for the asset in question, and the type attribute provides an intelligent way of discerning what kind of asset it is. Between the open and close tags can lie any content you care to use as a human-readable description of that asset. Even more interesting is that there’s nothing keeping you from putting richer markup in this area. For example, a link can be provided to a site as a citation for the source of the information represented in that asset.
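
Putting those pieces together, a small WSDF might look something like this (the titles, URIs, and types are invented for illustration):

    <html xmlns="http://www.w3.org/1999/xhtml">
      <head>
        <title>WSDF for example.com</title>
      </head>
      <body>
        <div title="example.com">
          <object data="http://www.example.com/" type="text/html">
            The home page for the site.
          </object>
          <div title="Weblog">
            <object data="http://www.example.com/blog/" type="text/html">
              The weblog, with new entries posted most weekdays.
            </object>
            <object data="http://www.example.com/blog/index.rss"
                    type="application/rss+xml">
              The RSS feed for the weblog, suitable for
              <a href="http://www.example.com/aggregators.html">aggregators</a>.
            </object>
          </div>
        </div>
      </body>
    </html>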

After looking at implementations of the concepts in RSS, Atom, and XHTML, I decided that the best fit was XHTML. Let me outline the thinking in this section.

WSDF as RSS/WSDF as Atom

The RSS implementation of WSDF

When I started writing this series of posts, I already had the properties of a WSDF file in my head. Initial research suggested to me that RSS would be the only candidate, because the structure was simple enough, and I already understood it. So it came as a surprise, after working this out, to realize that Atom and XHTML are also good candidates. Nevertheless, I’d like to demonstrate how these principles could be implemented in RSS.

The RSS implementation is dirt easy. I chose to work with the 0.91 spec so as to use a ‘lowest common denominator’. I’m going to assume that you already know how to read RSS files.

Let’s copy a list from this post. The properties of a WSDF entry must be:

  1. a URI to the asset
  2. an indication of what it is (possibly by mime-type, or a consistent use of terms)
  3. a description of what it is

Well, we’ve got two of the three RSS tags used in each <item> nailed already — <link /> and <description />. So that leaves the <title /> tag to hold the indication of what the asset is. That feels wrong to me, somehow, because it’s a <title /> tag, not an <an-indication-of-what-it-is /> tag. Still, I guess it could be argued that a title is supposed to be an indication of what something is.
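
To make that concrete, here is a hedged sketch of what one section’s wsdf.rss might look like under the 0.91 spec, with the <title /> tag pressed into service as the ‘what it is’ field (the URIs and types are invented):

    <rss version="0.91">
      <channel>
        <title>example.com: Home</title>
        <link>http://www.example.com/</link>
        <description>WSDF for the home section of example.com</description>
        <language>en-us</language>
        <item>
          <title>text/html</title>
          <link>http://www.example.com/</link>
          <description>The home page for the site.</description>
        </item>
        <item>
          <title>application/rss+xml</title>
          <link>http://www.example.com/blog/index.rss</link>
          <description>The RSS feed for the weblog.</description>
        </item>
      </channel>
    </rss>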

I’ve created the RSS implementation of the WSDF file. You can find the one for the home page here.

What’s really cool about this minimal implementation is that if you want to add more metadata than the minimum I defined, you need merely switch to a more modern revision of RSS — one that supports, say, the Dublin Core metadata set, and run with that.

The Atom implementation of WSDF

Before I started this project, I knew about Atom, and ‘I was there’ when Sam Ruby first opened up the issue that eventually led to the fine work that has been done, and continues to be done, today. However, like many others, I was quickly overwhelmed by the frenetic pace of development, and decided to sit out until they had something to show for their efforts.

I sat down one weekend to look at the Atom Syndication Feed specification for the first time, and realized that this format was very interesting because it provided a richer markup framework for describing what a site was. A lot of this information wasn’t required, based on the model, but to provide it arguably leads to a higher-quality (and potentially more useful) description file.

Again, assuming you know how to mark up an Atom feed, consider the following:

Each entry in an Atom feed describes one asset of a web site. As in RSS, the <link /> and <id /> tags point to the representation itself, the <summary /> tag contains a human-readable description of the asset, and the <title /> tag holds the indication of what the resource is.

But the Atom syntax spec requires additional information in the feed. For example, each entry needs an <issued /> tag to indicate when the entry was, well, issued. Feeds also require, at the top of the document, such things as a title, a link to connect the feed to (in this case, it would be the home page of a site), a <modified /> tag to indicate the last modified date… you get the idea. There’s more information required to make up an Atom feed, much more than I think is required to describe a web site, but nevertheless, value is being added by filling in that data.
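
To illustrate the shape, a single entry under the Atom 0.3 draft might look roughly like this (element names per that draft; the URI, dates, and description are invented):

    <entry xmlns="http://purl.org/atom/ns#">
      <title>text/html</title>
      <link rel="alternate" type="text/html" href="http://www.example.com/" />
      <id>http://www.example.com/</id>
      <issued>2003-12-11T12:00:00-05:00</issued>
      <modified>2003-12-11T12:00:00-05:00</modified>
      <summary>The home page for the site.</summary>
    </entry>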

I have not at this time provided an Atom version for your consumption, because in terms of structure it is so similar to RSS that I believe nothing new could be learned by studying it in Atom.

Now let’s examine the XHTML version of a WSDF.

How the WSDF is discovered.

Use the <link /> tag in the HTML documents on your site. You only need to use one on the home page, as that one file must link to all the other WSDFs on the site. However, if you’ve implemented a WSDF search engine on your site, then it may be very useful to have the ‘home pages’ of sections link to their section’s WSDF file for scoped searching.
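
For example, a home page might advertise its WSDF with something like this (the rel value is my assumption; nothing here mandates a particular one):

    <link rel="wsdf" type="text/html" title="Web Site Description File"
          href="http://www.example.com/wsdf.html" />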

Ok. Let’s explore possible implementations.

Properties of the WSDF

A WSDF should look like this:

  1. Each file describes the current ‘section’.
  2. If other sections exist, a link is provided to their WSDF.
  3. If it’s not a section, an entry is added to the current WSDF. This entry has the following properties:
    1. a URI to the asset
    2. an indication of what it is (possibly by mime-type, or a consistent use of terms)
    3. a description of what it is

In choosing those constraints, I feel that the resulting files will tend to be quite small and easy to read through, and quick to parse, even for extremely large sites.

One more thing to talk about before going into implementations: we need to talk about how WSDF files are discovered.

The Web Site Description File (WSDF)

I believe that the best way to define what’s in a web site is to create a machine-readable file listing all the assets you think are important enough to be documented. And a machine-readable file, in this case, means that we’re going to talk XML.

Should we use RDF? For everyone who uses RDF, that would be a no-brainer. If someone were to take the model I set up and create an ontology optimized for describing web sites, then great! Please share it with me when you get it finished!

In fact, I wanted to use RDF to describe this. I have no problem figuring out the model RDF uses. However, I have yet to find a tutorial that makes learning the RDF syntax anywhere near as easy. The code examples I’ve seen are inscrutable.

So I don’t want to use RDF. I want the solution to be as simple as marking up HTML. Tim Bray has argued on many occasions that a successful markup language (or programming language) is one you can view-source, and hack around in with a high degree of confidence that what you’ll do will probably work. That should be the sweet spot to aim for in the implementation of the syntax.

Another approach is to build yet another markup language that captures the ontology precisely, and is easy to pick up. I didn’t want to do that either. In a world where everyone and their brother has their own XML-based tag set, yet another one isn’t going to do much good. I’d much rather try to leverage something already popular and easy to use.

So, the solution I want is going to be an ontology-free (or as close to ontology-free as possible), simple-to-use, already-deployed XML markup language.

I can only think of three candidates: RSS, Atom, and XHTML. Before I discuss the implementation, I want to sketch out what I think should be represented and how.

Anatomy of a Web Site

With the general types of sites now described, what does a site consist of, semantically speaking?

It may be that some of you would think that movies or Flash files should be Plugin/Downloadable assets, while others think that they’re Browser-level. Honestly, it doesn’t matter that much to me. You’re both right. But very likely, everyone would agree that some things are more Plugin/Downloadable and others are more Browser-level.

So, what that list boils down to is this: if you want a web site, you must begin with at least a homepage. Having one implies that there’s a section. Everything else is optional.

The term ‘section’ is interesting enough to be fleshed out a bit:

Anatomy of a section

That list looks familiar, doesn’t it? A web site, then, is essentially a section that may contain a number of assets, including other sections.

I wonder, though, if some of the numbers could be tweaked. For example, can a web site contain a section made up of nothing but, say, browser-level assets? I suppose so, in which case you’d have 0-n home pages. But for now, I’ll leave the numbers the way they are, because they are typical for almost any site.

URIs

Tim Bray, in this article, already talked about it, really. A web site can span domains, or it may live within a directory or two of a domain rather than at the root of that domain. It stands to reason that the only safe assumption to make is that a web site must have at least one URI, and that every asset, whether it’s a page or an image or a binary document, is going to have a URI.

Non-physical attributes of a web site

Also worth mentioning are such things as the author and the last-modified date of a site. Such attributes are typically not referenced by URI, but are represented by markup within a representation. So, for the purposes of this model, they are not considered, because I don’t believe they will have an impact on the solution.

I’m not sure whether there’s any more to put in the model at this point, so let’s have a look at what falls out of this.

Model of a Web Site

Before I invested too much time on a solution, I wanted to build a model of what a web site is. That way I can check my solution against the model to see if it works. As you read through the model, please understand that while I have made a bit of effort to be as complete as possible, I’m sure I missed points that are sure to be valuable. Please feel free to contribute to this discussion, because even if my proposed solution is unsatisfactory, I believe that having a model to test potential solutions against will continue to have value.

Kinds of web sites

Library-like

By saying Library-like, I’m imagining a web site that has as its primary characteristic a top-down organizational model strongly supported by the navigation system. If you want information, you would typically ‘drill down’ from the home page, to a relevant section of the site, to (eventually) a page containing that information. This behaviour is very much like a library (hence the term) — you start at the front door, walk to the section likely to contain the book you want, then you search the shelves, then the books.

Application-like

This type of web site is designed to function, of course, like an application. It would consist of at least one page, but usually more. Its primary organizational characteristic is, I think, a left to right topology. I’m not sure if I’m using that term correctly, and I’m not implying that this type of site was built left-to-right, but what I am saying is that once the thing is done, the way it would be used is to start at the beginning and work your way through to the end.

If we ignore for the moment the single-page applications, then another of the characteristics typical of Application-like sites is that it won’t make much sense to bookmark a page in the middle of a task flow, because a bookmark can’t capture the context of the work you’re doing in that application.

Blog-like

A blog-like site’s main characteristic is that posts are typically sorted in reverse chronological order. Reinforcing that is, usually, a weak development of the more traditional top-down navigation system such as you’d find in Library-like sites.

Wiki-like

A wiki-like site’s main characteristic is a high density of links in the body content, coupled with a weak version of the top-down navigation system more typically found in Library-like sites. A formal organization may be nonexistent, but if a wiki-like site is organized, then that organization tends to be bottom-up in nature.

Community-like

Any site that is primarily a place for on-line discussion, where you post to a web form and see your comments in-line with contributions from other people, is a community-like site. Posts are generally sorted in chronological order, or by a ranking system, or by a threading mechanism, and many posts typically share the same page, although some systems will let you page through the comments if having them all on one page is unwieldy.

While I called all of these types of site, that won’t be true of all web sites — any given web site may have a dominant type, with other types in various sections (e.g., a corporate site like FedEx would be Library-like, with many Application-like areas within).

News Portal

Any site designed to aggregate news covering one topic or a variety of topics superficially resembles a blog, in that the most current news items are posted to the main page, but it also possesses a strong top-down navigation system designed to help filter the kinds of news a consumer is interested in. Such sites may be designed not to host the news items themselves, but rather to simply link to the news articles on other sites.

With these classifications in mind, let’s break them down into their basic components.
