(You might want to read Part 1 if you haven’t read it already.)
Finding categories on a Blogware blog if you’re a human
Navigating through categories
If you’re a human, you would typically find the list of categories for a Blogware blog by looking at the page and keeping an eye out for a list called “Categories”, “Topics”, or whatever the blog’s author has decided to call them. The default templates for Blogware blog designs tend to put the list of categories in a sidebar, usually on the left side.
Let’s use Boss Ross’ blog, Random Bytes, as an example. His category list, under the design that’s in use as I write this article, appears in the left sidebar near the top of the page. It looks like this:
You select a category by clicking on its name, which takes you to the page for that category. You will then see articles that were classified as being under that category. For example, if you were to visit Ross’ blog and then select the Random Bytes category from his category list by clicking on it, you would see a page filled with articles that fall under that category.
The Random Bytes category contains a set of categories, and these appear in the sidebar:
Ross enabled category “bubbling up” (see the last entry on categories for a full explanation), which means that any given category will display articles that have been classified under that category and any of its subcategories. In the case of Ross’ blog, what that means is that if you’re looking at the Random Bytes category, you’re looking at articles that have been filed under the category Random Bytes as well as under its subcategories (Stream of thoughtlessness, Code, Pop Trivia, Blathering, The Changeblog, CycleLog).
Where am I?
As a navigational aid, Blogware blogs default to showing where you are by providing a handy set of category links near the top of each page. In Ross’ blog, these links look like this:
In the example above, the links indicate that you are in the Code category, which is contained within the Random Bytes category, which in turn is contained within the Main Page, the master category that contains all categories.
/You/are/here
Programmers will feel very comfortable with the concept of hierarchical categories, as they’ve been taught to organize things in terms of hierarchies. If you’ve spent any time navigating through the folders or directories on your hard drive or the directories in a web site, you’ll find the hierarchical system of categories familiar. Since categories in a blog follow an organizational scheme similar to files and folders on a hard drive or pages on a web site, you might find that some people tend to use the same notation for describing them. For example:
The Code category, which is contained within the Random Bytes category, which is contained within the Main Page category
could be notated as
/Random Bytes/Code
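The notation above can be sketched in a few lines of Python. This is only an illustration: the function names `category_path` and `bubbles_up_to` are made up for this example, and the second function assumes that "bubbling up" behaves as described earlier (a category shows its own articles plus those of its subcategories).

```python
def category_path(chain):
    """Build a slash-separated path from a chain of category names.

    `chain` runs from the master category down to the category of interest.
    The master category ("Main Page") acts as the root, so it is written
    as the leading slash rather than by name.
    """
    return "/" + "/".join(chain[1:])


def bubbles_up_to(article_path, viewed_path):
    """Return True if an article filed under article_path would appear
    when viewing viewed_path, assuming "bubbling up" is enabled."""
    if viewed_path == "/":
        return True  # the master category contains everything
    return article_path == viewed_path or article_path.startswith(viewed_path + "/")


print(category_path(["Main Page", "Random Bytes", "Code"]))
print(bubbles_up_to("/Random Bytes/Code", "/Random Bytes"))
```

The check appends a slash before comparing prefixes so that a category whose name merely starts with the same letters is not mistaken for a subcategory.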
Finding categories on a Blogware blog if you’re a computer
Screen scraping
We humans are very good at abstracting information from the medium used to deliver it. For example, if I were to change the “look and feel” of my blog (perhaps move my blogroll over to the left side, change the logo, put the byline and posting time at the end of each article rather than at the beginning, or switch from the current three-column design to two columns), you’d still be able to read it because you can abstract that information from the layout. We humans are pretty clever that way.
On the other hand, computers are complete morons. Their strength is that although they can perform only simple tasks, they’re capable of doing them much faster than we can.
Let’s first look at “screen scraping”, a term which refers to having a program “look” at the contents of a screen or web page and extract information from it. One of my first attempts at writing programs that made use of information off the Web was something that went to the local weather page, grabbed its contents, extracted the temperature data and then displayed it in a little window. The weather web page that it consulted displayed temperature like this:
Current temperature: 15 C
I wrote my program so that it looked for a block of text that began with “Current temperature:” and ended with “C”. It would grab the text between those two phrases, shave off any leading and trailing spaces, and use whatever was left as the temperature. For the first few weeks, it worked well, displaying a window that looked something like this:
(Ignore for the moment that this window looks like one from Windows XP. It was 1997, so imagine that this is a Windows 95 window.)
This worked well, for as long as they didn’t change the phrases that I used as guides to find the temperature data. Of course, this was during that period of time when “web designer” was a hot career and you often saw them drinking Chivas out of glass slippers and lighting cigars with hundred-dollar bills. That meant that the site was redesigned often. One day, a web designer gave the weather page a brand new layout and put the temperature in a table. As far as the program was concerned, the temperature was displayed this way:
<td>Current temperature:</td><td>15 C / 59 F</td>
My program ended up displaying this:
The problem was that the information on the page was attached to the way it was presented. When the web designer changed the layout of the page, s/he changed the cues that my program used to find the information I wanted: the current temperature. If s/he’d gone even farther and changed the wording of “Current temperature:” to “The temperature right now is:”, my program would likely show either a blank or perhaps crash, depending on how I wrote it.
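Here is a minimal sketch of that kind of cue-based scraper in Python. It is a reconstruction for illustration, not the original 1997 program; the function name `scrape_temperature` and the choice to return `None` when a cue is missing are assumptions.

```python
def scrape_temperature(page_text, start="Current temperature:", end="C"):
    """Return the text between the two cue phrases, or None if a cue is missing."""
    start_at = page_text.find(start)
    if start_at == -1:
        return None
    value_from = start_at + len(start)
    # Look for the closing cue only after the opening one.
    end_at = page_text.find(end, value_from)
    if end_at == -1:
        return None
    return page_text[value_from:end_at].strip()


old_page = "Current temperature: 15 C"
new_page = "<td>Current temperature:</td><td>15 C / 59 F</td>"
reworded = "The temperature right now is: 15 C"

print(scrape_temperature(old_page))   # 15
print(scrape_temperature(new_page))   # </td><td>15
print(scrape_temperature(reworded))   # None
```

The second result shows the failure described above: the scraper still finds both cues, but the table markup between them gets treated as part of the temperature.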
If the people who ran the weather page could somehow provide the temperature in a standardized form that was somehow removed from all those presentation niceties like “look and feel” and layout, and if they promised to stick to that form, I could then write a program to access that information secure in the knowledge that changes in the layout of the weather page will not “break” my program.
Enter XML
XML is short for eXtensible Markup Language, and its purpose is to mark information so that programs like my weather program can grab that information without having to worry about looking for “cues” or changes in “look and feel”. The weather web page could provide a URL that would hold some XML data about the current weather, perhaps something that looked like this:
<temperature>15</temperature>
(This, of course, is an incredibly simplified version of what the weather web page would probably provide.)
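To give a feel for how a program would read that markup, here is a short sketch using Python's standard `xml.etree.ElementTree` module. The function name `read_temperature` and the richer `<weather>` document with a `<city>` element are invented for the example; only the `<temperature>` element comes from the text above.

```python
import xml.etree.ElementTree as ET


def read_temperature(xml_text):
    """Find the <temperature> element wherever it sits and return its value."""
    root = ET.fromstring(xml_text)
    node = root if root.tag == "temperature" else root.find(".//temperature")
    if node is None or node.text is None:
        return None
    return int(node.text.strip())


# The one-line document from the text.
print(read_temperature("<temperature>15</temperature>"))

# A hypothetical richer document: the lookup still works because it
# depends on the element's name, not on where it appears in the layout.
print(read_temperature(
    "<weather><city>Toronto</city><temperature>15</temperature></weather>"
))
```

Unlike the screen scraper, this code keeps working when other elements are added around the temperature, as long as the element keeps its name.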
If XML looks sort of like HTML, it’s because both have the same parent, SGML (Standard Generalized Markup Language), a system for marking up information. (There are a number of “introduction to XML” articles on the web; this one is a pretty gentle introduction.)
RSS
Most blogs and all Blogware blogs have what is called an RSS feed. RSS, depending on whom you ask, is short for either Really Simple Syndication or RDF Site Summary, but what both really mean is “a standard way of marking up the content of an oft-updated web page like a blog or news site, independently of the presentation”. RSS comes in a few flavours, but it boils down to a way to present the content of a blog in XML form.
Consider the example article below (it’ll make more sense if you’ve seen any one of the Matrix movies):
For each entry, Blogware creates matching RSS data, which a program can read without having to worry about possible changes in the format of the blog. The RSS for the article shown above would look like this:
<item>
<title>We need a new captain</title>
<link>http://example.blogware.com/blog/_archives/2003/11/2/5518.html</link>
<guid>http://example.blogware.com/blog/_archives/2003/11/2/5518.html</guid>
<pubDate>Sun, 02 Nov 2003 13:40:52 -0500</pubDate>
<description>Morpheus is really getting on my nerves. &lt;br&gt; &lt;br&gt;
He has this annoying habit, where we'd be in a meeting going over
some very crucial detail of the plans, and he'll suddenly break all of
Robert's Rules of Order and launch into some prophetic monologue
about Neo (again!) and how he's "The One" (as if the doofus didn't
already have some kind of messiah complex) and that tonight could
be "the battle that wins the war against the Matrix".
That guff got old a long time ago, fatboy. &lt;br&gt; &lt;br&gt;
And speaking of fat, how'd he get so tubby eating that pink goo
that they feed us, anyway? It tastes like wallpaper paste.
I eat only enough to keep the hunger pangs at bay.
I'll bet the twerp has a secret donut making machine
stashed away somewhere. I hate him.</description>
<category domain="http://example.blogware.com/blog">Main Page</category>
</item>
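As a rough sketch of what "a program can read" means here, the following Python uses the standard `xml.etree.ElementTree` module to pull a few fields out of an item like the one above. The description is shortened, and its line breaks are written as escaped entities so the sample is well-formed XML; the function name `read_item` is made up for this example.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the item shown above.
ITEM_XML = """<item>
<title>We need a new captain</title>
<link>http://example.blogware.com/blog/_archives/2003/11/2/5518.html</link>
<guid>http://example.blogware.com/blog/_archives/2003/11/2/5518.html</guid>
<pubDate>Sun, 02 Nov 2003 13:40:52 -0500</pubDate>
<description>Morpheus is really getting on my nerves. &lt;br&gt; &lt;br&gt;</description>
<category domain="http://example.blogware.com/blog">Main Page</category>
</item>"""


def read_item(item_xml):
    """Return the fields of one RSS <item> as a plain dictionary."""
    item = ET.fromstring(item_xml)
    return {
        "title": item.findtext("title"),
        "link": item.findtext("link"),
        "published": item.findtext("pubDate"),
        # An item may carry more than one <category> element.
        "categories": [c.text for c in item.findall("category")],
    }


entry = read_item(ITEM_XML)
print(entry["title"])
print(entry["categories"])
```

Nothing in this code depends on how the blog's pages are laid out, which is the whole point of the feed.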
As I mentioned earlier, all Blogware blogs have RSS feeds. You’ll find the RSS feed for any given Blogware blog at the URL:
http://{blog url}/blog/index.xml
For instance, you’ll find the RSS feed for this blog at http://accordionguy.blogware.com/blog/index.xml. Click on the link; it’ll look much prettier if you’re viewing it in Internet Explorer or any browser that knows how to render XML.
Blogware’s RSS feeds go even farther: there’s a separate RSS feed for each category in a Blogware blog. Just add index.xml to the end of any category URL to get it. For instance, if you wanted to read the page for my Yeah…Girls…Geez category, you’d either click on the link for it or point your browser to http://accordionguy.blogware.com/blog/Life/Girls/. If you wanted the RSS feed for that category, you’d go to http://accordionguy.blogware.com/blog/Life/Girls/index.xml
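The "add index.xml" rule is simple enough to express as a one-line helper. This is a sketch based only on the URL pattern described above; the function name `category_feed_url` is hypothetical.

```python
def category_feed_url(category_url):
    """Turn a blog or category URL into the URL of its RSS feed."""
    # Strip any trailing slash first so the result never contains "//index.xml".
    return category_url.rstrip("/") + "/index.xml"


print(category_feed_url("http://accordionguy.blogware.com/blog/Life/Girls/"))
print(category_feed_url("http://accordionguy.blogware.com/blog"))
```

The same helper covers the main feed, since the blog's main page is just the topmost category.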
With RSS, Blogware provides a way for software to find articles. This is useful for all sorts of applications, such as:
- Aggregators. Aggregators are essentially programs that gather up blog content from a set of blogs that you specify into one place. If you’re a heavy blog reader or doing research, an aggregator is a time-saver.
- Services like Technorati. Technorati is one of a new set of blog-related services appearing on the web. It reads the RSS feeds of over a million blogs and produces reports such as who links to whom, what the most popular stories in the “blogosphere” are, which blogs are the most popular, and so on.
- Custom applications. With blogs exposing their content in a format that’s relatively easy for computers to understand, there are all kinds of applications that can be built that collect this data and crunch it for all kinds of purposes.
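To make the aggregator idea concrete, here is a toy version in Python. It works on feed text that has already been downloaded, so the two tiny feeds below are invented sample data, and the names `read_items` and `aggregate` are hypothetical. A real aggregator would also fetch the feeds, remember what you have read, and cope with the other RSS flavours.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Invented sample feeds standing in for two downloaded index.xml files.
FEED_A = """<rss version="2.0"><channel><title>Blog A</title>
<item><title>Older post</title><pubDate>Sat, 01 Nov 2003 09:00:00 -0500</pubDate></item>
<item><title>Newest post</title><pubDate>Mon, 03 Nov 2003 08:30:00 -0500</pubDate></item>
</channel></rss>"""

FEED_B = """<rss version="2.0"><channel><title>Blog B</title>
<item><title>Middle post</title><pubDate>Sun, 02 Nov 2003 13:40:52 -0500</pubDate></item>
</channel></rss>"""


def read_items(feed_xml):
    """Yield one dictionary per <item>, tagged with the feed's title."""
    channel = ET.fromstring(feed_xml).find("channel")
    source = channel.findtext("title")
    for item in channel.findall("item"):
        yield {
            "source": source,
            "title": item.findtext("title"),
            # pubDate uses the e-mail date format, so the e-mail parser reads it.
            "published": parsedate_to_datetime(item.findtext("pubDate")),
        }


def aggregate(feeds):
    """Merge the items of several feeds into one list, newest first."""
    items = [item for feed_xml in feeds for item in read_items(feed_xml)]
    return sorted(items, key=lambda item: item["published"], reverse=True)


for entry in aggregate([FEED_A, FEED_B]):
    print(entry["published"].date(), entry["source"], entry["title"])
```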
OCS
With a main RSS feed, a program can find the latest articles posted to a Blogware blog. With per-category RSS, a program can drill down further and narrow its information gathering to specific categories within a blog.
The natural question is: how does a program know what categories are in a given Blogware blog?
Unfortunately, the RSS specification does not provide a way to do this. However, since XML is a language for marking up any kind of information, it’s possible to use it to create a list of categories. We’re also lucky that someone’s already created a specific XML language for marking up content, called OCS (Open Content Syndication). I won’t go into detail about it right now, but you can find more information about it here. We use it to list Blogware categories.
Consider the categories in this blog, shown in the diagram below:
As with RSS, each Blogware blog comes with a feed listing all categories. For any Blogware blog, you’ll find it at the URL:
http://{blog url}/blog/ocs.xml
For this blog, you’ll find the category listing at http://accordionguy.blogware.com/blog/ocs.xml.
Here’s the part of this blog’s ocs.xml that lists all the categories. I’ve formatted it a little bit to make it easier to read:
<directory rdf:about="http://accordionguy.blogware.com/blog">
<dc:title>The Adventures of Accordion Guy in the 21st Century - RSS Feeds</dc:title>
<dc:description />
<channels>
<rdf:Bag>
<rdf:li
rdf:resource="http://accordionguy.blogware.com/blog" />
<rdf:li
rdf:resource="http://accordionguy.blogware.com/blog/Geek" />
<rdf:li
rdf:resource="http://accordionguy.blogware.com/blog/Geek/Python" />
<rdf:li
rdf:resource="http://accordionguy.blogware.com/blog/Geek/Gadgets" />
<rdf:li
rdf:resource="http://accordionguy.blogware.com/blog/Geek/Blogosphere" />
<rdf:li
rdf:resource="http://accordionguy.blogware.com/blog/Geek/Internet" />
<rdf:li
rdf:resource="http://accordionguy.blogware.com/blog/Geek/Interface" />
<rdf:li
rdf:resource="http://accordionguy.blogware.com/blog/Geek/apple" />
<rdf:li
rdf:resource="http://accordionguy.blogware.com/blog/Geek/Wireless" />
<rdf:li
rdf:resource="http://accordionguy.blogware.com/blog/Geek/Ruby" />
<rdf:li
rdf:resource="http://accordionguy.blogware.com/blog/Geek/OfficeSpace" />
<rdf:li
rdf:resource="http://accordionguy.blogware.com/blog/Geek/TheOffice" />
<rdf:li
rdf:resource="http://accordionguy.blogware.com/blog/Geek/HOTELMIT" />
<rdf:li
rdf:resource="http://accordionguy.blogware.com/blog/Geek/Blogware" />
<rdf:li
rdf:resource="http://accordionguy.blogware.com/blog/Life" />
<rdf:li
rdf:resource="http://accordionguy.blogware.com/blog/Life/Accordion" />
<rdf:li
rdf:resource="http://accordionguy.blogware.com/blog/Life/Accordion/KickassKaraoke" />
<rdf:li
rdf:resource="http://accordionguy.blogware.com/blog/Life/Accordion/KickassKaraoke/20031012" />
<rdf:li
rdf:resource="http://accordionguy.blogware.com/blog/Life/Announcements" />
<rdf:li
rdf:resource="http://accordionguy.blogware.com/blog/Life/Girls" />
<rdf:li
rdf:resource="http://accordionguy.blogware.com/blog/Life/Happened" />
<rdf:li
rdf:resource="http://accordionguy.blogware.com/blog/Life/Happened/TheBig35" />
<rdf:li
rdf:resource="http://accordionguy.blogware.com/blog/Life/Happened/Thirtysexy" />
<rdf:li
rdf:resource="http://accordionguy.blogware.com/blog/Life/Happened/Thirtysexy2" />
<rdf:li
rdf:resource="http://accordionguy.blogware.com/blog/Life/HighHorse" />
<rdf:li
rdf:resource="http://accordionguy.blogware.com/blog/Life/Toronto" />
</rdf:Bag>
</channels>
</directory>
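A program could answer the "what categories are there?" question by reading that list. The sketch below uses Python's standard `xml.etree.ElementTree` on a trimmed copy of the listing. The excerpt above does not show the file's root element or namespace declarations, so the `rdf:RDF` wrapper in the sample is an assumption (the RDF namespace URI itself is the standard one), and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Trimmed sample; the rdf:RDF wrapper is assumed, not taken from the excerpt.
OCS_SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<directory rdf:about="http://accordionguy.blogware.com/blog">
<channels>
<rdf:Bag>
<rdf:li rdf:resource="http://accordionguy.blogware.com/blog" />
<rdf:li rdf:resource="http://accordionguy.blogware.com/blog/Geek" />
<rdf:li rdf:resource="http://accordionguy.blogware.com/blog/Geek/Python" />
<rdf:li rdf:resource="http://accordionguy.blogware.com/blog/Life" />
</rdf:Bag>
</channels>
</directory>
</rdf:RDF>"""


def list_category_urls(ocs_xml):
    """Return the category URLs listed in every rdf:Bag, in document order."""
    root = ET.fromstring(ocs_xml)
    urls = []
    for bag in root.iter("{%s}Bag" % RDF):
        for li in bag.findall("{%s}li" % RDF):
            url = li.get("{%s}resource" % RDF)
            if url:
                urls.append(url)
    return urls


def to_category_path(category_url, blog_url):
    """Express a category URL relative to the blog URL, e.g. /Geek/Python.

    Assumes category_url starts with blog_url, as it does in the listing.
    """
    return category_url[len(blog_url):] or "/"


blog = "http://accordionguy.blogware.com/blog"
for url in list_category_urls(OCS_SAMPLE):
    print(to_category_path(url, blog))
```

Because the URLs mirror the hierarchy, stripping the blog URL from the front yields the same slash notation used for categories earlier in this article.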
Immediately following the category list in ocs.xml is a section in which each channel is listed in a little more detail. Here’s the information for the main page:
<channel rdf:about="http://accordionguy.blogware.com/blog">
<dc:title>The Adventures of Accordion Guy in the 21st Century - Main Page</dc:title>
<dc:description />
<image />
<formats>
<rdf:Alt>
<rdf:li>
<rdf:Description rdf:about="http://accordionguy.blogware.com/blog/index.xml">
<dc:language>en</dc:language>
<format rdf:resource="http://purl.org/ocs/formats/#rss20" />
<schedule rdf:resource="http://purl.org/ocs/schedules/#daily" />
</rdf:Description>
</rdf:li>
</rdf:Alt>
</formats>
</channel>
Here’s the information for the Geek category:
<channel rdf:about="http://accordionguy.blogware.com/blog/Geek">
<dc:title>The Adventures of Accordion Guy in the 21st Century - Geek</dc:title>
<dc:description />
<image />
<formats>
<rdf:Alt>
<rdf:li>
<rdf:Description rdf:about="http://accordionguy.blogware.com/blog/Geek/index.xml">
<dc:language>en</dc:language>
<format rdf:resource="http://purl.org/ocs/formats/#rss20" />
<schedule rdf:resource="http://purl.org/ocs/schedules/#daily" />
</rdf:Description>
</rdf:li>
</rdf:Alt>
</formats>
</channel>
And here’s the information for the Geek/Python category:
<channel rdf:about="http://accordionguy.blogware.com/blog/Geek/Python">
<dc:title>The Adventures of Accordion Guy in the 21st Century - Python</dc:title>
<dc:description />
<image />
<formats>
<rdf:Alt>
<rdf:li>
<rdf:Description rdf:about="http://accordionguy.blogware.com/blog/Geek/Python/index.xml">
<dc:language>en</dc:language>
<format rdf:resource="http://purl.org/ocs/formats/#rss20" />
<schedule rdf:resource="http://purl.org/ocs/schedules/#daily" />
</rdf:Description>
</rdf:li>
</rdf:Alt>
</formats>
</channel>
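Putting the channel entries to work, a program could build a lookup table from each category to its RSS feed. The following Python sketch does that for a trimmed sample. As before, the `rdf:RDF` wrapper and its namespace declarations are assumptions, since the excerpts do not show them, and `local_name` and `feeds_by_category` are hypothetical names. Element names are matched with any namespace stripped, so the code does not depend on which default namespace the real file declares.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Trimmed sample; the rdf:RDF wrapper and xmlns declarations are assumed.
OCS_CHANNELS = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel rdf:about="http://accordionguy.blogware.com/blog/Geek">
<dc:title>The Adventures of Accordion Guy in the 21st Century - Geek</dc:title>
<formats>
<rdf:Alt>
<rdf:li>
<rdf:Description rdf:about="http://accordionguy.blogware.com/blog/Geek/index.xml">
<dc:language>en</dc:language>
<format rdf:resource="http://purl.org/ocs/formats/#rss20" />
</rdf:Description>
</rdf:li>
</rdf:Alt>
</formats>
</channel>
</rdf:RDF>"""


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def feeds_by_category(ocs_xml):
    """Map each channel's category URL to the URL of its first listed feed."""
    root = ET.fromstring(ocs_xml)
    feeds = {}
    for element in root.iter():
        if local_name(element.tag) != "channel":
            continue
        category_url = element.get("{%s}about" % RDF)
        description = element.find(".//{%s}Description" % RDF)
        if category_url and description is not None:
            feeds[category_url] = description.get("{%s}about" % RDF)
    return feeds


for category, feed in feeds_by_category(OCS_CHANNELS).items():
    print(category, "->", feed)
```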
Next: Some questions, including issues of compatibility.