beautiful-soup Archives - Same Con to Hoyci

When Unessa.net launched back in 2000, the interweb looked quite a bit different. Since then Unessa.net has changed a lot, too. I wrote earlier about making the jump to a Django-powered site. Here are some experiences from the journey so far.

At first there were only a bunch of static HTML pages. Then came server-side includes, and soon after that came PHP. At first it was .php3, then .phtml, eventually just .php. In 2006 I moved the site to a dedicated server and finally got some long-lusted Django-love going on. So now everything is finally good, right? Well, not quite. Unlike most other sites, I don’t want to lose the old stuff. I hate linkrot, and even more I hate sites that delete (good or bad) content just because it’s convenient to do so. This means that I’m now stuck with this great server full of ancient sh*t that needs to be taken care of. (No, not that way.)

Evolving URLs

After learning Django and Python for about a year now, I’m beginning to understand how great these tools really are. Firstly, Django’s URL dispatcher is fan-freakin’-tastic. It’s very easy to set up redirects for old URLs and to make custom (and smart) 404 handlers for different parts of the site. It’s also very easy to get real-time information about possible broken links, which is important. Writing custom views to handle legacy URLs semi-smartly is an easy way to get rid of old crufty URLs.

For example, I had a Movable Type installation for my mobile photoblog from 2003 that had URLs like /photoapp/archives/2003/06/28/foo.php. Being a perfectionist about URLs, I wanted to evolve these pages into something like /photoapp/2003/06/foo/, which is more logical, much shorter and cruft-free. I exported the MT data to a new Django app, wrote a short URLconf for the old URLs and a view that looks something like this:

    from django.http import Http404, HttpResponsePermanentRedirect

    # NewPhoto is the photo model of the new Django app
    def oldphoto_redirect(request, year, month, day, slug):
        """
        Redirects old MT-URLs to new format "smartly".
        If a correct match is not found, raise a 404.
        """
        try:
            photo = NewPhoto.public_objects.get(date_taken__year=year, date_taken__month=month, date_taken__day=day)
            return HttpResponsePermanentRedirect('/photoapp/%s/%s/%s/' % (year, month, photo.photo_id))
        except NewPhoto.DoesNotExist:
            # If no entries match, raise 404
            raise Http404
        except AssertionError:
            # If many entries match (this Django version signals that with an
            # AssertionError from get(); later releases raise MultipleObjectsReturned)
            raise Http404
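
The URLconf that feeds this view is not shown here. As a rough, framework-free sketch, the kind of regular expression such a URLconf entry might use can be tried out with plain `re`. The pattern and the `parse_old_url` helper are assumptions made for illustration, not the site's actual configuration; in a real Django URLconf the leading slash would be dropped.

```python
import re

# Hypothetical pattern for the old Movable Type URLs, e.g.
# /photoapp/archives/2003/06/28/foo.php
OLD_PHOTO_URL = re.compile(
    r"^/photoapp/archives/(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})/(?P<slug>[\w-]+)\.php$"
)


def parse_old_url(path):
    """Return the captured URL parts as a dict, or None if the path does not match."""
    match = OLD_PHOTO_URL.match(path)
    if match is None:
        return None
    return match.groupdict()


parts = parse_old_url("/photoapp/archives/2003/06/28/foo.php")
print(parts)
```

The named groups line up with the keyword arguments of the view (year, month, day, slug), which is how the dispatcher would hand the captured values over.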

Now 95% of the old URLs are redirected automagically to the new URL with the correct HTTP status code (301). The remaining five percent are URLs for days that have more than one post. They get a custom 404 page that explains why the pages have moved, where they are now, and that I have been informed about this 404. When I get a 404 email from Django, I add these few URLs manually to the URLconf. This system healed itself in less than a week. Oh, joy!

(Unfortunately this site is almost entirely in Finnish, but that shouldn’t stop you from browsing through the new photo site, which has been knit together from two different photoblogs, added to Flickr, fully tagged, and enhanced in many ways. Among other things, the new app includes a full Flickr catalog duplicated locally on the server, automatic synchronization with Flickr, automatic resizing of images to various sizes (see homepage), favourites and ratings for logged-in users, a livesearch feature and much more.)

Harmonizing static content

The second great thing about Django and Python is, well, Python 🙂 There are tons of great libraries for Python. One that I’ve totally fallen in love with is Beautiful Soup. In addition to the various PHP- and Perl-powered dynamic parts of the old Unessa.net, there are also hundreds of static pages that I don’t want to keep on the main site anymore. I’ve started to move these pages to a dedicated archive, and at the same time I’m officially washing my hands of keeping them up to date. As a perfectionist, I want to tell this to my visitors too. But I’m just not going to edit hundreds of these files manually.

I’ve been playing with the idea of processing all these static HTML files with a Python script that would do something like:

Add a note about archive status (something like “This page has been archived for historical reasons and is no longer maintained. Current content can be found on the front page.”) in a DIV right after the <body> tag
Parse SSI-includes into the page
Check for broken links and fix all trivial internal links
Convert old image links from /images/* to http://images.unessa.net/*
Validate the final output
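
Two of the steps above, the archive note and the image link rewrite, can be sketched with nothing but regular expressions. This is only a minimal illustration under stated assumptions: the `archive-note` class name and both helper names are made up here, and the note text is shortened.

```python
import re

# Hypothetical markup for the note; the class name is an assumption.
ARCHIVE_NOTE = (
    '<div class="archive-note">This page has been archived for historical '
    "reasons and is no longer maintained.</div>"
)


def add_archive_note(html):
    """Insert the archive note right after the first opening <body> tag."""
    return re.sub(
        r"(<body\b[^>]*>)",
        lambda m: m.group(1) + "\n" + ARCHIVE_NOTE,
        html,
        count=1,
        flags=re.IGNORECASE,
    )


def rewrite_image_links(html):
    """Point src="/images/..." attributes at the dedicated image host."""
    return re.sub(
        r'(\bsrc=["\'])/images/',
        r"\1http://images.unessa.net/",
        html,
        flags=re.IGNORECASE,
    )


page = '<html><body bgcolor="white"><img src="/images/cat.jpg"></body></html>'
print(rewrite_image_links(add_archive_note(page)))
```

Regexps like these are fine for trivial, regular markup; for the messier pages a real parser such as Beautiful Soup is probably the safer tool.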

All of this would actually be fairly easy to do with a little help from Beautiful Soup and some mind-expanding regexps. And how cool would it be to have over seven years’ worth of archived static content, all valid and with no broken internal links 🙂
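
As one example of such a regexp, inlining SSI includes could look roughly like the sketch below. It assumes the includes use the usual `<!--#include virtual="..." -->` form and that the fragments have already been loaded into a dict; the `expand_includes` helper and the sample fragment are hypothetical.

```python
import re

# Matches directives such as <!--#include virtual="/inc/footer.html" -->
SSI_INCLUDE = re.compile(r'<!--#include\s+(?:virtual|file)="([^"]+)"\s*-->')


def expand_includes(html, fragments):
    """Replace each SSI include with its fragment; leave unknown ones untouched."""
    def replace(match):
        return fragments.get(match.group(1), match.group(0))
    return SSI_INCLUDE.sub(replace, html)


# Hypothetical fragment; a real script would read the included files from disk.
fragments = {"/inc/footer.html": "<p>Footer</p>"}
page = '<body>Hello<!--#include virtual="/inc/footer.html" --></body>'
print(expand_includes(page, fragments))
# -> <body>Hello<p>Footer</p></body>
```

Leaving unknown includes in place, instead of silently dropping them, makes it easy to spot the pages that still need manual attention.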

To be continued…

Unessa.net is a personal site and a hobby. These kinds of things are fun to do. The best part is that sometimes it’s even more fun to do them for paying customers, in more complex projects and under a strict schedule.

I’ll keep on reporting on my progress with Djangofying Unessa.net. My goal is to have the whole site on Django (meaning that all dynamic data is served by Django and all the other data is archived in some way) by the end of this year. At the moment I’m about 40% there so there’s definitely a lot more to do. If you have any comments or ideas, please share them!

Syntax highlighting in blog posts is something that has always bugged me. I don’t like JavaScript-based solutions, so I wrote a quick-and-dirty function that highlights Python code in my blog posts on the server side. The following examples are written for Django, but they should work in any Python software.

The problem

I want to use Markdown and still be able to have automatic syntax highlighting for Python code that’s inline in my blog posts. Markdown alone tends to break HTML-formatted source code (because of indentation, etc.), so a fully working solution needs a bit of tweaking.

The Solution

We’ll need:

Pygments for syntax highlighting, and
Beautiful Soup for HTML-parsing

With these tools we’re able to build a helper function that looks for source code in a given text, highlights its syntax and applies Markdown filtering to the rest of the text without messing up the syntax-highlighted code.

The Code

My (simplified) BlogEntry model looks like this:

    import datetime

    from django.db import models

    class BlogEntry(models.Model):
        # maxlength and the inner Admin class are the Django API of the time;
        # later Django releases use max_length and a separate ModelAdmin class
        title = models.CharField(maxlength=500)
        body = models.TextField(
            help_text='Use <a href="http://daringfireball.net/projects/markdown/syntax">Markdown-syntax</a>')
        body_html = models.TextField(blank=True, null=True)
        pub_date = models.DateTimeField(default=datetime.datetime.now)
        use_markdown = models.BooleanField(default=True)

        class Admin:
            fields = (
                (None, {
                    'fields': ('title', 'body', 'pub_date', 'use_markdown')
                }),
            )

The redundant body_html field is there for performance: instead of calculating the Markdown and syntax highlighting for the body on every request, we calculate them only on every save. (Yes, it could also be done on the body field itself, but I prefer that the content I’m editing does not change every time I save it.)

Next the highlighting function:

        def _highlight_python_code(self):
            from pygments import highlight
            from pygments.lexers import PythonLexer
            from pygments.formatters import HtmlFormatter
            from unessanet.misc.BeautifulSoup import BeautifulSoup

            soup = BeautifulSoup(self.body)
            python_code = soup.findAll("code", "python")

            if self.use_markdown:
                import markdown

                # Swap each code block for a numbered placeholder so Markdown never sees it
                index = 0
                for code in python_code:
                    code.replaceWith('<p class="python_mark">mark %i</p>' % index)
                    index = index+1

                markdowned = markdown.markdown(str(soup))
                soup = BeautifulSoup(markdowned)
                markdowned_code = soup.findAll("p", "python_mark")

                # Put the highlighted code back in place of the placeholders
                index = 0
                for code in markdowned_code:
                    code.replaceWith(highlight(python_code[index].renderContents(), PythonLexer(), HtmlFormatter()))
                    index = index+1
            else:
                for code in python_code:
                    code.replaceWith(highlight(code.string, PythonLexer(), HtmlFormatter()))

            return str(soup)

This function searches for <code>-blocks that have a class="python" attribute. It first replaces them with placeholder text, then applies Markdown if necessary, and finally replaces the placeholders with syntax-highlighted code. It may not be the most beautiful code, but it works 🙂
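
The placeholder trick itself does not depend on Beautiful Soup, Markdown or Pygments. Here is a stripped-down sketch of the same idea using only `re`; the `@@CODE0@@` marker format and the `transform_around_code` helper are assumptions for illustration, and the two callables are simple stand-ins for `markdown.markdown()` and `pygments.highlight()`.

```python
import re

CODE_BLOCK = re.compile(r'<code class="python">(.*?)</code>', re.DOTALL)


def transform_around_code(text, transform, highlight):
    """Run `transform` on the text while keeping code blocks out of its way."""
    blocks = []

    def stash(match):
        # Remember the code and leave a numbered marker in its place.
        blocks.append(match.group(1))
        return "@@CODE%d@@" % (len(blocks) - 1)

    transformed = transform(CODE_BLOCK.sub(stash, text))

    def restore(match):
        # Swap each marker for the highlighted version of the stashed code.
        return highlight(blocks[int(match.group(1))])

    return re.sub(r"@@CODE(\d+)@@", restore, transformed)


# Stand-ins for the real Markdown and highlighting steps:
result = transform_around_code(
    'intro <code class="python">x = 1</code> outro',
    transform=str.upper,
    highlight=lambda code: "<pre>%s</pre>" % code,
)
print(result)  # -> INTRO <pre>x = 1</pre> OUTRO
```

The only requirement is that the marker survives the transform unchanged, which is why the marker should be something the text filter has no reason to touch.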

And finally the save method:

    def save(self):
        self.body_html = self._highlight_python_code()
        super(BlogEntry,self).save()

The body_html-field is updated on every save. On the template side you can use simply {{ entry.body_html }} without applying any additional filters.
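
To make the "render on save, not on request" trade-off concrete, here is a tiny plain-Python stand-in for the model. The `Entry` class and its call counter are hypothetical and only illustrate the pattern; they are not the Django code.

```python
class Entry:
    """Plain-Python stand-in for a model that caches its rendered body."""

    render_calls = 0  # counts how often the expensive step runs

    def __init__(self, body):
        self.body = body
        self.body_html = ""

    def _render(self):
        # Placeholder for the Markdown + highlighting work.
        Entry.render_calls += 1
        return "<p>%s</p>" % self.body

    def save(self):
        # Rendering happens here, once per save...
        self.body_html = self._render()


entry = Entry("Hello")
entry.save()
# ...so reading the cached field any number of times costs nothing extra.
pages = [entry.body_html for _ in range(3)]
print(pages[0], Entry.render_calls)  # -> <p>Hello</p> 1
```

The price is a second copy of the text in the database, which for a blog is a cheap trade.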

The CSS needed for syntax coloring can be printed out with Pygments, for example like this: css = HtmlFormatter().get_style_defs('.highlight'). It may be wise to save the output and put it in a static CSS file.

Known Limitations

Not a bug but a feature: every code tag that has class="python" will be replaced. This was a bit annoying when trying to document this particular function…
Unicode strings break the highlighter. Any help on this is appreciated!

This code is published under Creative Commons License. Please share any comments! 🙂


Healing Growing Pains With Django