Howdy!

Welcome to my blog, where I write about software development, cycling, and other random nonsense. This is not the only place I write; you can find more words I've typed on the Buoyant Data blog, the Scribd tech blog, and GitHub.

Lazily loading attributes.

I found myself talking to Jason today about the virtues of getattr(), setattr(), and hasattr() in Python and "abusing" the dynamic nature of the language, which reminded me of some lazy-loading code I wrote a while back. In February I found the need to have portions of the logic behind one of our web applications fetch data at most once per request. The nature of the web applications we're building on top of the MySpace, Hi5 and Facebook platforms requires some level of network data access (traditionally via REST-like APIs). This breaks our data-access model into the following tiers:
[Diagram: our data-access tiers. Dia FTW]

Working with network-centric data resources is difficult in any scenario (desktop, mobile, web) but the particularly difficult thing about network data access in the mod_python-driven request model is that it will be synchronous (mod_python doesn't support "asynchronous pages" like ASP.NET does). This means every REST call to Facebook, for example, is going to block execution of the request handler until the REST request to Facebook's API tier completes.
def request_handler(self, *args, **kwargs):
    fb_uid = kwargs.get('fb_sig_user')
    print "Fetching the name for %s" % fb_uid
    print time.time()
    name = facebook.users.getInfo(uid=fb_uid)
    ### WAIT-WAIT-WAIT-WAIT-WAIT
    print time.time()
    ### Continue generating the page...
There is also a network hit (albeit minor) for accessing cached data or data stored in databases. The general idea is that we'll need some level of data resident in memory throughout a request, data which can differ widely from request to request.

Lazy loading in Python

To help avoid unnecessary database or network access I wrote a bit of class-sugar to make this easier and more fail-proof:
class LazyProgrammer(object):
    '''
    LazyProgrammer allows for lazily-loaded attributes on the subclasses
    of this object. In order to enable lazily-loaded attributes define
    "_X_attr_init()" for the attribute "obj.X"
    '''
    def __getattr__(self, name):
        rc = object.__getattribute__(self, '_%s_attr_init' % name)()
        setattr(self, name, rc)
        return rc
This makes developing network-centric web applications a bit easier. For example, if I have a "friends" lazily-loaded attribute off the base "FacebookRequest" class, all developers writing code subclassing FacebookRequest can simply refer to self.friends and feel confident they aren't incurring unnecessary bandwidth hits, and the friends-list fetching code is located in one spot. If once-per-request starts to become too resource-intensive as well, it'd be trivial to override the _friends_attr_init() method to hit a caching server instead of the REST servers first, without needing to change any code "downstream."
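A minimal, self-contained sketch of how a subclass puts this to use; the FacebookRequest subclass and its canned friend list below are hypothetical stand-ins for the real REST-backed code:

```python
class LazyProgrammer(object):
    '''Lazily initialize "obj.X" from "_X_attr_init()" on first access.'''
    def __getattr__(self, name):
        # Only invoked when "name" isn't found through normal lookup
        rc = object.__getattribute__(self, '_%s_attr_init' % name)()
        setattr(self, name, rc)  # cache it; __getattr__ won't fire again
        return rc

class FacebookRequest(LazyProgrammer):
    '''Hypothetical request class; _friends_attr_init stands in for a REST call.'''
    def __init__(self):
        self.fetch_count = 0

    def _friends_attr_init(self):
        self.fetch_count += 1  # pretend this is the expensive network hit
        return ['alice', 'bob']

request = FacebookRequest()
assert request.friends == ['alice', 'bob']  # first access triggers the "fetch"
assert request.friends == ['alice', 'bob']  # second access hits the cache
assert request.fetch_count == 1             # the network was only hit once
```

The second access finds "friends" as a plain instance attribute, so __getattr__ never runs again, which is exactly the once-per-request behavior described above.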


Lazy loading in C#

Since C# is not a dynamically-typed language like Python or JavaScript, you can't implement lazily-loaded attributes in the same fashion (calling something like setattr()) but you can "abuse" properties in a manner similar to the C# singleton pattern, to get the desired effect:
using System;
using System.Collections.Generic;

public class LazySharp
{
    #region "Lazy Members"
    private Dictionary<string, string> _names = null;
    #endregion

    #region "Lazy Properties"
    public Dictionary<string, string> Names
    {
        get {
            if (this._names == null)
                this._names = this.SomeExpensiveCall();
            return this._names;
        }
    }
    #endregion
}
Admittedly I don't find myself writing Facebook/MySpace/Hi5 applications these days on top of ASP.NET so I cannot say I actually use the class above in production, but conceptually it makes sense.

Lazily-loaded attributes I find most useful in the more hodge-podge situations, where code and feature-sets have both grown organically over time. They're not for everybody, but I figured I'd share anyway.

Resurgence of the shell.

Two things happened in such short proximity, time-wise, that I can't help but think they're somehow related to a larger shift toward interpreters. Earlier this week Miguel introduced the csharp shell, which forced me to dust off my shoddy Mono 1.9 build and rebuild Mono from Subversion, because this is too interesting to pass up.

One of my favorite aspects of using IronPython, or Python for that matter, is the interpreter, which allows for prototyping that doesn't involve creating little test apps that I have to build just to prove a point. For example, I can work through fetching a web page in the csharp shell really easily, instead of creating a silly little application, compiling, fixing errors, and recompiling:

tyler@pineapple:~/source/mono-project/mono> csharp
Mono C# Shell, type "help;" for help

Enter statements below.
csharp> using System;
csharp> Console.WriteLine("This changes everything.");
This changes everything.
csharp> String url = "http://tycho.usno.navy.mil/cgi-bin/timer.pl";
csharp> using System.Web;
csharp> using System.Net;
csharp> using System.IO;
csharp> using System.Text;
csharp> HttpWebRequest req = HttpWebRequest.Create(url);
(1,17): error CS0266: Cannot implicitly convert type `System.Net.WebRequest' to `System.Net.HttpWebRequest'. An explicit conversion exists (are you missing a cast?)
csharp> HttpWebRequest req = HttpWebRequest.Create(url) as HttpWebRequest;
csharp> HttpWebResponse response = req.GetResponse() as HttpWebResponse;
csharp> StreamReader reader = new StreamReader(req.GetResponseStream() as Stream, Encoding.UTF8);
(1,45): error CS1061: Type `System.Net.HttpWebRequest' does not contain a definition for `GetResponseStream' and no extension method `GetResponseStream' of type `System.Net.HttpWebRequest' could be found (are you missing a using directive or an assembly reference?)
csharp> StreamReader reader = new StreamReader(response.GetResponseStream() as Stream, Encoding.UTF8);
csharp> String result = reader.ReadToEnd();
csharp> Console.WriteLine(result);



What time is it?

US Naval Observatory Master Clock Time



Sep. 11, 07:29:02 UTC

Sep. 11, 03:29:02 AM EDT

Sep. 11, 02:29:02 AM CDT

Sep. 11, 01:29:02 AM MDT

Sep. 11, 12:29:02 AM PDT

Sep. 10, 11:29:02 PM AKDT

Sep. 10, 09:29:02 PM HAST

Time Service Department, US Naval Observatory


csharp> reader.Close();
csharp> response.Close();
csharp>


I really think Miguel and Co. have added something infinitely more useful in this Hackweek project than anything I've seen come out of recent hackweeks at Novell. The only feature request I'd add for the csharp shell would be "recording", i.e.:

tyler@pineapple:~/source/mono-project/mono> csharp
Mono C# Shell, type "help;" for help

Enter statements below.
csharp> Shell.record("public void Main(string[] args)");
recording...
csharp> using System;
csharp> Console.WriteLien("I prototyped this in csharp shell!");
(1,10): error CS0117: `System.Console' does not contain a definition for `WriteLien'
/home/tyler/basket/lib/mono/2.0/mscorlib.dll (Location of the symbol related to previous error)
csharp> Console.WriteLine("I prototyped this in csharp shell!");
csharp> Shell.save_record("Hello.cs");
recording saved to "Hello.cs"
Which could conceptually generate the following file:
using System;

public class Hello
{
    public void Main(string[] args)
    {
        Console.WriteLine("I prototyped this in csharp shell!");
    }
}



JavaScript Shell

In addition to the C# shell, I've been playing with V8, the JavaScript engine that powers Google Chrome. The V8 engine can be embedded easily or run standalone, and one of the examples it ships with is a JavaScript shell. I've created a little wrapper script that lets me load jQuery into the V8 shell to prototype jQuery code without requiring a browser to be up and running:

tyler@pineapple:~/source/v8> ./shell
V8 version 0.3.0
> load("window-compat.js");
> load("jquery.js");
> $ = window.$
function (selector,context){return new jQuery.fn.init(selector,context);}
> x = [1, 5, 6, 12, 42];
1,5,6,12,42
> $.each(x, function(index) { print("x[" + index + "] = " + this); });
x[0] = 1
x[1] = 5
x[2] = 6
x[3] = 12
x[4] = 42
1,5,6,12,42
>
The contents of "window-compat.js" being:

/*
 * Providing stub "window" objects for jQuery
 */

if (typeof(window) == 'undefined') {
    window = new Object();
    document = window;
    self = window;

    navigator = new Object();
    navigator.userAgent = navigator.userAgent || 'Chrome v8 Shell';

    location = new Object();
    location.href = 'file:///dev/null';
    location.protocol = 'file:';
    location.host = '';
}


In general I don't really have anything insightful or especially interesting to add, but I wanted to put out my "+1" in support of both of these projects. Making any language or API more easily accessible through these shells/interpreters can really help developers double-check syntax, expected API behavior, etc. Thanks Novell/Google, interpreters rock!

Don Quixote's new side-kick, Hudson

I recently wrote about "one-line automated testing" by way of Hudson, a Java-based tool that helps automate build and test processes (akin to CruiseControl and Buildbot). If you read this blog regularly, you'd be well aware that I work primarily with Python these days, at a web company no less! What does a web company need with a continuous integration tool? Especially if it's not using a compiled language like Java or C# (heresy!).

As any engineering organization grows, it's bound to happen that you reach a critical mass of developers and either need to hire a comparable critical mass of QA engineers, or start to approach quality assurance from all sides. That is to say, automated unit testing and automated integration testing become a requirement for growing both as an engineering organization and as a web application provider (users don't like broken web applications). With web products like Top Friends, SuperPoke! and Slide FunSpace we have a large amount of ever-changing code that has been in a constant state of flux for the past 16-18 months. We've been able to accommodate ever-changing code on the backend for the past year and a half with PyUnit and development discipline.

How do you deal with months of ever-changing code in the aforementioned products' front-ends? Your options are pretty slim: you can hire a legion of black-box QA engineers to manually go through regression tests and ensure your products are in tip-top shape, or you can hire a few talented black-box QA engineers to conscript a legion of robots to do it for them. Enter Windmill. Windmill is a web browser testing framework not entirely unlike Selenium or Watir, with two major exceptions: Windmill is written in Python, and Windmill has a great recorder (and lots of other features). One of my colleagues at Slide, Adam Christian, has been working tirelessly to push Windmill further and prepare it for enterprise adoption; the first enterprise to use it being Slide.

Adam and I have been working on bringing the two ends of the testing world together with Hudson. About half of the jobs currently running inside of our Hudson installation are running PyUnit tests on various Subversion and Git branches. The other half of the jobs are running Windmill tests, reporting back into Hudson by way of Adam's JUnit-compatible reporting code. Thanks to the innate flexibility of PyUnit and Windmill's reporting infrastructure, we were able to tie all these loose ends together with a tool like Hudson that will handle Jabber or email notifications when test runs fail and include details in its reports.
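As a rough illustration of the PyUnit half of that equation, here's the shape of a purely hypothetical test case and the kind of programmatic run a CI wrapper performs (the real reporting code emits JUnit-compatible XML instead of plain text):

```python
import unittest

class TopFriendsRegressionTest(unittest.TestCase):
    '''Hypothetical stand-in for the PyUnit regression tests Hudson runs.'''

    def test_friend_ordering(self):
        friends = ['carol', 'alice', 'bob']
        self.assertEqual(sorted(friends), ['alice', 'bob', 'carol'])

# A CI wrapper runs the suite and inspects the result object, rather than
# relying on unittest.main()'s exit code
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TopFriendsRegressionTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The result object is what a custom runner walks when translating failures into a CI-friendly report format.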

We're still working out the kinks in the system, but to date this setup has helped us fix at least one critical issue a week (along with numerous other minor issues) since we launched the Hudson system, more often than not before said issues reach the live site and real users. If you've got questions about Windmill or Hudson you can stop by the #windmill or #hudson channels on Freenode.

Automated testing is like a really good blend of coffee: until you have it, you think "bah! I don't need that!" but after you start with it you can't help but wonder how you ever tolerated the swill you used to drink.



Did you know! Slide is hiring! Looking for talented engineers to write some good Python and/or JavaScript, feel free to contact me at tyler[at]slide

One-line Automated Testing

For about as long as my development team has numbered larger than one, I've been on a relatively steady "unit test" kick. With the product I've worked on for over a year gaining more than one cook in the kitchen, it became time both to start writing tests to prevent basic regressions (and save our QA team tedious hours of black-box testing) and to automate those tests in order to quickly spot issues.

While I've been on this pretty steadily lately, I'm proud to say that automated testing was one of my first pet projects at Slide. If you ever crack into the Slide corporate network you can find my workstation under the name "ccnet", which is short for CruiseControl.NET, my first failed attempt at getting automated testing going on our now-defunct Windows desktop client. As our development focus shifted away from desktop applications to social applications, the ability to reliably test those systems plummeted; accordingly, our test suite for these applications became paltry at best. As the organization started to scale, this simply could not stand much longer, else we might not be able to efficiently push stable releases on a near-nightly schedule. As we've started to back-fill tests (test-after development?) the need to automate those tests has arisen, so I started digging around for something less painful to deal with than CruiseControl; enter Hudson.

Holy Hudson Batman!

I was absolutely astounded that neither I nor anybody I knew was aware of the Hudson project. Hudson is absolutely amazing as far as continuous integration systems go. The only major caveat is that the entire system is written in Java, meaning I had to beg one of our sysadmins to install Java 1.5 on the unit test machine. Once that was sorted out, starting the Hudson instance up was incredibly simple:
java -jar hudson.war
In our case we use the following to keep the JVM within manageable virtual memory limits:
java -Xmx128m -jar hudson.war --httpPort=8888

Once the Hudson instance was up and running, I simply had to browse to http://unittestbox:8888/ and the entire rest of the configuration was set up from the web UI. Muy easy. Muy bueno.

Plug-it-in, plug-it-in!

One of the most wonderful aspects of Hudson is its extensible plugin architecture. Adding plugins like "Git", "Trac" and "Jabber" means that our Hudson instance is now properly linking to Trac revisions, sending out Jabber notifications on "build" (read: test run) failures and monitoring both Subversion and Git branches for changes. From what I've seen of their plugin architecture, it would be absolutely trivial to extend Hudson with Slide-specific plugins as the needs arise.

With the integration of the PyUnit XMLTestRunner (found here) and the XML output plugin we worked into Windmill, we can easily automate testing of both our back-end code and our front-end.


Hudson in action

And all with one simple java command :)



Did you know! Slide is hiring! Looking for talented engineers to write some good Python and/or JavaScript, feel free to contact me at tyler[at]slide

Let's Swap iPods.

Since I've started to spend such an enormous amount of my time on work and settling into a new apartment, I've had literally no time to discover new music. Because of this utter lack of time on my part, I've been pondering an idea on a daily basis for the past month or two: I want to participate in an iPod Foreign Exchange Program.

I currently own a 30GB Video iPod (black) that has about 28GB of music on it with a few assorted podcasts here and there.

Here's what I'm thinking would constitute a good set of rules for swapping an iPod to "walk a mile in somebody's shoes" (musically).
  • We can be acquaintances, but not friends. I know what my friends listen to and can steal their iPods myself :)
  • The period to swap iPods would last one week
  • Both parties would make sure to un-sync their address book and calendars from the iPod, but not change any of the music (no trying to impress people)
  • The iPod swap is accompanied with a business card or means to coordinate a swap-back
  • Both parties must be respectful of the others' tastes, even if it's really weird (you know who you are)


I went ahead and removed my calendars and contacts from my iPod just in case I run into somebody on the train that has read this post and wants to swap right away, but failing that, if you're around San Francisco, let's swap iPods :)

Experimenting with Git at Slide (Part 1/3)

For the past two months I've been experimenting, with varying levels of success, with Git inside of Slide, Inc. Currently Slide makes use of Subversion and relies heavily on branches for everything from project-specific branches to release branches (branches that can live anywhere from under 12 hours to three weeks). There are plenty of other blog posts about the pitfalls of branching in Subversion that I won't go into here; suffice to say, it is...sub-par. Below is a rough diagram of our general current workflow with Subversion. (I've had some other developers ask me "why don't you just work in trunk?", to which I usually wax poetic about the chaos of trunk once any project gets over 5 active developers; Slide engineering is somewhere between 30-50 engineers.)



There's always a catch

There are three major problems we've run up against with utilizing Subversion as our version control system at Slide:
  • Subversion's "branches" make context switching difficult
  • Depending on the age of a branch cut from trunk/, merges and maintenance range from difficult to impossible
  • Merging Subversion branches into each other causes a near total loss of revision history
Given that branches are a critical part of Slide's development process, we've historically looked at branch-strong version control systems as alternatives, such as
Perforce. Before I joined Slide in April of 2007, I was a heavy user of Perforce for my own consulting projects as well as for some of my work with the FreeBSD project as part of the Summer of Code program. In fact, my boss sent out a "Perforce Petition" to our engineering list on my third day at Slide...we still haven't switched.



Up until earlier this year I hadn't given it a second thought, until the team I was working with grew such that, between me and four other engineers, we were pushing a release anywhere from one to three times a week. That meant we were creating a Subversion "branch" multiple times a week, and a significant part of my daily routine became merging to our release branch and refreshing project branches from trunk/. All of a sudden Git was looking prettier and prettier, despite some of its warts. At this point I was already using Git for some of my personal projects that I never have time for, so I knew at the bare minimum that it was functional. What I didn't know was how to deploy and use it with a large engineering team that works in very high-churn, short iterations, like Slide's.

Subversion at Slide

Moving our source tree from Subversion into any other system was destined to be painful. The tree at Slide is deceptively large: we have a substantial amount of
Python running around (as Slide is built, top-to-bottom, in Python), an incredible number of Adobe Flash assets (.swf files), Adobe Illustrator assets (.ai files) and plenty of binary files, like images (.png/.gif/.jpeg). Currently a full checkout of trunk/ is roughly 2.5GB including artwork, Flash, server and web application code. We also have roughly 88k revisions in Subversion, the summation of three years of the company's existence. Fortunately somebody along the line wrote a script (in Perl, however) called "git-svn(1)" that is designed to do exactly what I needed: move a giant tree from Subversion to Git, from start to finish (similar to svn2p4 in Perforce parlance).


Toying with Git

When I first ran the command `git-svn init $SVN` I let the command run for somewhere close to 6-7 hours before it completed, and I was shocked at the size of the generated repository. If our Git repository were left unpacked, .git/ alone would be close to 9GB; adding the actual code on top of it, ~11GB. I decided that packing this repository would be a good idea, so I ran `git gc` and went to grab a Coke from the fridge ... and the machine ran out of memory. One of our quad-core, 8GB RAM, shared development machines ran out of memory?!


After lurking in #git on Freenode for a while I determined two things:
  1. Apparently nobody uses Git for projects this large
  2. Git was retaining too much memory, like a memory leak, but just don't call it a memory leak.
To compound this, the rules for memory usage with Git are vastly different between a 32-bit machine and a 64-bit machine, and because we're just that cool, we're using 64-bit machines across the board. The amount of memory Git decides to keep resident while doing repository-intensive operations like `git gc` is 256MB on 32-bit machines and 8GB on 64-bit machines. As these machines are shared between developers we make use of ulimit(1); when you limit memory usage with ulimit(1) it restricts address space, meaning both virtual and resident memory. When Git tried to mmap(2) gigabytes of address space to do its operations, the kernel stepped in to intervene and started returning ENOMEM to Git, which promptly exited.

After raising this enough times, I finally caught
spearce, who was able to identify the problem and supply a patch that fixed the memory allocation issues with Git and a repository of Slide's size. First obstacle overcome; now I could actually test a Git workflow inside of Slide.


Git at Slide

Now that I could pack the repository on our development machines, I could get the repository down to a reasonable 3.0GB, i.e. .git/ weighed in at 3GB, making an entire tree ~5.5GB (far more manageable than 11GB). Despite Git being a decentralized version control system, we still needed some semblance of centralization to ensure a couple of basic rules for a sane workflow:
  • A centralized place to synchronize distributed versions of the repository
  • Changesets cannot be lost, ever.
  • QA must not be over-burdened when testing releases
This meant we needed a centralized, secure repository, which left us two options: Git over WebDAV (https/http) or
Gitosis. After discovering that git-http-push(1), the executable responsible for doing Git pushes over WebDAV, has tremendous memory issues, I abandoned that as an option (a `git push` of the repository resulted in memory usage peaking at 11GB virtual, 3.5GB resident).

If you are looking to deploy Git for a larger audience in a corporate environment, I highly recommend Gitosis. Gitosis allows SSH to be used as the transport protocol for Git, and provides authentication by means of limited-shell user accounts and SSH keys; it's not perfect, but it's the closest thing to maintainable for larger installations of Git (in my opinion).
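For reference, Gitosis drives its access control from a single gitosis.conf file kept in its own admin repository; a sketch for a hypothetical setup (group, member, and repository names here are invented) looks something like:

```
[gitosis]

[group slide-dev]
members = tyler@pineapple adam@workstation
writable = slide
```

Pushing a change to that file through the admin repository is how new keys and repositories get provisioned, which is a big part of what makes it manageable for a larger installation.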

So far the experimenting with Git at Slide is pretty localized to just my team, but with a combination of Gitosis, git-svn(1) and some "best practices" defined for handling the new system, we've successfully continued development for over a month without any major issues.

As this post is already quite lengthy, I'll be discussing the following two parts of our experimenting in subsequent posts:



Did you know! Slide is hiring! Looking for talented engineers to write some good Python and/or JavaScript, feel free to contact me at tyler[at]slide


NAnt and ASP.NET on Mono

Most of my personal projects are built on top of ASP.NET, Mono and Lighttpd. One of the benefits of keeping them all running on the same stack (as opposed to mixing Python, Mono and PHP together) is that I don't need to maintain different infrastructure bits to keep them all up and running. Two key pieces that keep it easy to dive back into a side-project whenever I have some (spurious) free time are my NAnt scripts and my push scripts.

NAnt
I use my NAnt script for a bit more than just building my web projects; more often than not I use it to build, deploy and test everything related to the site. My projects are typically laid out like:
  • bin/ Built DLLs, not in Subversion
  • configs/ Web.config files per-development machine
  • libraries/ External libraries, such as Memcached.Client.dll, etc.
  • schemas/ Files containing the SQL for rebuilding my database
  • site/ Fully built web project, including Web.config and .aspx files
  • sources/ Actual code, .aspx.cs and web folder (htdocs/ containing styles, javascript, etc)


Executing "nant run" will build the entire project, construct the full version of the web application in site/, and finally fire up xsp2 on localhost for testing. The following NAnt file is what I've been carrying from project to project.
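In lieu of the full script, a stripped-down sketch of its general shape (the target names, paths, and properties here are illustrative stand-ins, not the original file):

```xml
<?xml version="1.0"?>
<project name="myproject" default="build">
    <!-- Illustrative paths; adjust per the layout described above -->
    <property name="build.dir" value="bin"/>
    <property name="site.dir" value="site"/>

    <target name="build">
        <csc target="library" output="${build.dir}/MyProject.dll">
            <sources>
                <include name="sources/**/*.cs"/>
            </sources>
        </csc>
    </target>

    <target name="site" depends="build">
        <!-- Assemble the deployable site/ tree from built DLLs and web assets -->
        <copy todir="${site.dir}/bin">
            <fileset basedir="${build.dir}">
                <include name="*.dll"/>
            </fileset>
        </copy>
        <copy todir="${site.dir}">
            <fileset basedir="sources/htdocs">
                <include name="**/*"/>
            </fileset>
        </copy>
    </target>

    <target name="run" depends="site">
        <!-- Fire up xsp2 against the assembled site for local testing -->
        <exec program="xsp2" workingdir="${site.dir}">
            <arg value="--port=8080"/>
        </exec>
    </target>
</project>
```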

The Push Script
Since I usually build and deploy on the same machine, I use a simple script called "push.sh" to handle rsyncing data from the development part of my machine into the live directories.

#!/bin/bash
###############################
## Push script variables
export NANT='/usr/bin/nant'
export STAGE=`hostname`
export SOURCE='site/'
export LIVE_TARGET='/serv/www/domains/myproject.com/htdocs/'
export BETA_TARGET='/serv/www/domains/beta.myproject.com/htdocs/'
export TARGET=$BETA_TARGET
###############################

###############################
## Internal functions
function output {
    echo "===> $1"
}
function build {
    ${NANT} && ${NANT} site
}
###############################

###############################
## Build the site first
output "Building the site..."
build
if [ $? -ne 0 ]; then
    output "Looks like there was an error building! abort!"
    exit 1
fi

###############################
## Start actual pushing
if [ "${1}" = 'live' ]; then
    output " ** PUSHING THE LIVE SITE ***"
    export TARGET=$LIVE_TARGET
else
    output "Pushing the beta site"
fi

output "Using Web.config-${STAGE}"
output "Pushing to: ${TARGET}"

cp configs/Web.config-${STAGE} site/Web.config
rsync --exclude '*.swp' --exclude '.svn/' -av ${SOURCE} ${TARGET}


Depending on the complexity of the web application I might change the scripts on a case-by-case basis, but for the most part I have about 5-6 projects out "in the ether" that are built and deployed with derivatives of the NAnt script and push.sh listed above. In general, they provide a good starting point for the tedious bits of non-Visual Studio web development (especially if you're in an entirely Linux-based environment).

Hope you find them helpful :)

Parsing HTML with Python

A while ago I jotted down about seven or so ideas for stuff that I thought would make good blog posts; somehow "markup parsers in Python" is next on the list, so I might as well spill the beans on how incredibly easy it is to process (X)HTML with Python and a little built-in class called HTMLParser.

There have been a few occasions when I needed a quick (and dirty) way to perform transforms on some chunk of HTML, or merely "search and replace" parts of it. While it might be cleaner to do something with XSLT or the like, using them doesn't even begin to match the speed of development of an HTMLParser-based class in Python.

Getting Started
One major thing to keep in mind when working with HTMLParser, especially if you're newer to Python, is that it is what's referred to as an "old-style" class, meaning subclassing it is a bit different than with "new-style" classes. Since HTMLParser is an old-style class, any time you want to call a method defined on the super-class you need to write HTMLParser.HTMLParser.superMethod(self, arg) instead of super(SubHTMLParser, self).superMethod(arg)


Creating the HTML parser
For the purposes of this example I want something simple, so we're just going to take a block of markup and "tweak" all the <a> tags within it to be "sad" (whereas "sad" means they'll be bold, blue, and blinky). The actual code to do so is only 50 lines long and is as follows:

import HTMLParser

class SadHTML(HTMLParser.HTMLParser):
    '''A simple HTML transform-class based upon HTMLParser. All links shall be bold, blue and blinky :('''

    def __init__(self, *args, **kwargs):
        HTMLParser.HTMLParser.__init__(self)
        self.stack = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag.lower() == 'a':
            self.stack.append(self.__html_start_tag('blink', None))
            attrs['style'] = '%s%s' % (attrs.get('style', ''), 'color: blue; font-weight: bold;')
        self.stack.append(self.__html_start_tag(tag, attrs))

    def handle_endtag(self, tag):
        self.stack.append(self.__html_end_tag(tag))
        if tag.lower() == 'a':
            self.stack.append(self.__html_end_tag('blink'))

    def handle_startendtag(self, tag, attrs):
        self.stack.append(self.__html_startend_tag(tag, dict(attrs)))

    def handle_data(self, data):
        self.stack.append(data)

    def __html_start_tag(self, tag, attrs):
        return '<%s%s>' % (tag, self.__html_attrs(attrs))

    def __html_startend_tag(self, tag, attrs):
        return '<%s%s/>' % (tag, self.__html_attrs(attrs))

    def __html_end_tag(self, tag):
        return '</%s>' % (tag)

    def __html_attrs(self, attrs):
        _attrs = ''
        if attrs:
            _attrs = ' %s' % (' '.join([('%s="%s"' % (k,v)) for k,v in attrs.iteritems()]))
        return _attrs

    @classmethod
    def depreshun(cls, markup):
        _p = cls()
        _p.feed(markup)
        _p.close()
        return ''.join(_p.stack)


The actual ins-and-outs of the parser are very simple; markup like "<a href="#">Hello</a><br/>" would execute accordingly:
  • handle_starttag('a', [('href', '#')])
  • handle_data('Hello')
  • handle_endtag('a')
  • handle_startendtag('br', [])
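That callback order is easy to verify with a tiny self-contained parser that does nothing but record each event it receives (the try/except import accounts for the module living in a different place across Python versions):

```python
try:
    from HTMLParser import HTMLParser
except ImportError:
    from html.parser import HTMLParser  # module moved in later Python versions

class EventLogger(HTMLParser):
    '''Records every parser callback so we can inspect the order.'''
    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('starttag', tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(('endtag', tag))

    def handle_startendtag(self, tag, attrs):
        self.events.append(('startendtag', tag, attrs))

    def handle_data(self, data):
        self.events.append(('data', data))

parser = EventLogger()
parser.feed('<a href="#">Hello</a><br/>')
parser.close()
```

After close(), parser.events holds the four callbacks in exactly the order listed above, with the attributes delivered as a list of (name, value) tuples.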


Since HTMLParser just gives you element tag names and their attributes, SadHTML simply builds a list of strings out of the data passed to it via the super-class, and then, when everything is finished, ties the list back together with: ''.join(list_of_tags).
Executing the SadHTML.depreshun method on the contents of my last blog post is a good example; part of the post was:
An informal poll at the Slide offices this past week yielded these interesting results: at Slide.com, nearly 100% of white people seem to like "Stuff White People Like".


After running it through "SadHTML", the following markup is generated instead:
An informal poll at the Slide offices this past week yielded these interesting results: at Slide.com, nearly 100% of white people seem to like "Stuff White People Like".


If you're curious as to how much more you can do with HTMLParser, do check out the documentation. It's far more lenient than using Expat for parsing HTML, and it's still fast enough to be used on longer documents (there's also htmllib available in Python, but I've not used it yet).

Hi5 goes 100%

Lou Moore and Hi5 haz a platfurm
It's so easy to get caught up in the flurry of things going on here in Silicon Valley (not to mention just at Slide), but I figured that Hi5 deserved a mention. I'd like to congratulate Lou, Anil, Paul, Zack and the rest of the Hi5 Platform team on being (from what I can tell) the first social network to turn their OpenSocial-based platform on for 100% of users. As of last Friday they finally ramped up to 100%, meaning every user on Hi5 can add OpenSocial applications that have been approved and added to the Hi5 applications gallery.

The past couple of weeks I've been lurking in the #Hi5dev channel on Freenode, where most of the Hi5 team has been as well, dutifully answering questions and getting general developer feedback. I highly recommend following their developer blog, where Lou (pictured here) has been posting regular updates and all the important things you need to do in order to get your application viral, approved, and reaching Hi5's users.

Some of the applications we've launched include: Top Friends, Slide TV and SuperPoke. Of course, if all you want to do on Hi5 is be friends with me, you can find me here :).

Overall the OpenSocial/Hi5 platform has been an interesting experience; moving more of the application into the realm of JavaScript, as opposed to what I've become used to on the Facebook platform, has made me think harder about the separation of front-end code from back-end code, and where you actually draw the line when both are written in the same language. One down, only two to go!
