When Unessa.net launched back in 2000, the interweb looked quite a bit different. During this time Unessa.net has changed a lot, too. I wrote earlier about taking the jump to a Django-powered site. Here are some experiences from the journey so far.
At first there were only a bunch of static HTML pages. Then came the server-side includes, and soon after that came PHP. At first it was .php3, then .phtml, eventually just .php. In 2006 I moved the site to dedicated server and finally got some long lusted Django-love going on. So now everything is finally good, right? Well, not quite. Unlike most of other sites seem to do, I don’t want to lose the old stuff. I hate linkrot and even more I hate sites that delete (good or bad) content just because it’s convenient to do so. This means that I’m now stuck with this great server full of ancient sh*t that needs to be taken care of. (No, not that way.)
After learning Django and Python for about a year now, I’m beginning to understand how great these tools really are. Firstly, Djangos URL dispatcher is fan-freakin’-tastic. It’s very easy to set redirects to old URLs and make custom (and smart) 404 handlers to different parts of the site. It’s also very easy to get realtime information about possible broken links, which is important. Writing custom views to handle legacy URLs semi-smartly is an easy way to get rid of old crufty URLs.
For example, I had a Movable Type installation for my mobile photoblog from 2003 that had URLs like
/photoapp/archives/2003/06/28/foo.php. Being perfectionist about URLs, I wanted to evolve these pages to something like
/photoapp/2003/06/foo/, which is more logical, much shorter and cruft-free. I exported the MT-data to a new Django app, wrote short URLconf for the old URLs and a view that looks something like this:
def oldphoto_redirect(request, year, month, day, slug): """ Redirects old MT-URLs to new format "smartly". If a correct match is not found, raise a 404. """ try: photo = NewPhoto.public_objects.get(date_taken__year=year, date_taken__month=month, date_taken__day=day) return HttpResponsePermanentRedirect('/photoapp/%s/%s/%s/' % (year, month, photo.photo_id)) except NewPhoto.DoesNotExist: # If no entries match, raise 404 raise Http404 except AssertionError: # If many entries match raise Http404
Now 95% of the old URLs are redirected automagically to new URL, with a correct HTTP status code (301). The rest five percent of the cases are URLs that have more than one post in a single day. They will get a custom 404 page that explains why the pages are moved, where they are, and that I have been informed about this 404. When I get a 404 email from Django, I’ll add these few URLs manually to URLconf. This system healed itself in less than a week. Oh, joy!
(Unfortunately this site is almost entirely in Finnish, but it shouldn’t stop you from browsing trough the new photo site that has been knit together from two different photoblogs, added to Flickr, fully tagged, and enhanced in many ways. Among other things, the new app includes full Flickr catalog duplicated locally on the server, automatic synchronization with Flickr, automatic resizing of images in various sizes (see homepage), favourites and ratings for logged-in users, livesearch feature and much more.)
Harmonizing static content
Second great thing about Django and Python is, well, Python 🙂 There are tons of great libraries for Python. One that I’ve totally fallen in love with is Beautiful Soup. In addition of various PHP and Perl powered dynamic parts of old Unessa.net, there are also hundreds of statical pages that I don’t want to keep on the main site anymore. I’ve started to archive these pages to a dedicated archive, and at the same time I’m officially washing my hands about keeping those pages up to date. As a perfectionist, I want to tell this to my visitors too. But I’m just not going to edit hundreds of these files manually.
I’ve been playing with an idea that I’d process all these static HTML-files with a python script that would do something like:
- Add a note about archive status (something like “This page has been archived for historical reasons and is no longer maintained. Current content can be found from the front page.“) in a DIV right after
- Parse SSI-includes into the page
- Check for broken links and fix all trivial internal links
- Convert old image links from
- Validate the final output
This all would actually be fairly easy to do with little help from Beautiful Soup and some mind expanding regexps. And how cool would it be to have over seven years worth of archived static content, all valid and with no broken internal links 🙂
To be continued…
Unessa.net is a personal site and a hobby. These kind of things are fun to do. The best part is that sometimes it’s even more fun to do it for paying customers, in more complex projects and under a strict schedule.
I’ll keep on reporting on my progress with Djangofying Unessa.net. My goal is to have the whole site on Django (meaning that all dynamic data is served by Django and all the other data is archived in some way) by the end of this year. At the moment I’m about 40% there so there’s definitely a lot more to do. If you have any comments or ideas, please share them!