tag:blogger.com,1999:blog-72259653013981088752009-06-27T14:07:37.183-07:00organized chaosdaily doses of hacking, technology & the funky freshBlake Mathenyhttp://www.blogger.com/profile/15984995884622791435noreply@blogger.comBlogger44125tag:blogger.com,1999:blog-7225965301398108875.post-15418206020989017452008-09-22T14:43:00.000-07:002008-09-22T19:48:01.103-07:00Oracle does NOT enter the AWS Cloud<p>Okay, seriously? Did the <a href="http://aws.typepad.com/aws/2008/09/hello-oracle.html">announcement that oracle was entering the AWS cloud</a> really get sent out today? Don't get me wrong. I think Jeff Barr is great and I love AWS, but let's be clear about what this announcement really means.</p>
<p>When I saw this headline I thought, "Awesome, Oracle is lowering the barrier to entry for SMB customers." This isn't the case. It's true that Oracle is making it easier for people to boot up a presumably properly tuned Oracle instance in the cloud. It's also true that Oracle will support certain EC2 hardware configurations in the cloud (up to 8 virtual cores, although you can license it for 16). However, when Jeff Barr makes a statement like the following, I get confused.</p>
<blockquote>The variability and flexibility of cloud-based licensing has perplexed users and vendors for some time now. Now that a large software vendor has made a clear statement of direction here, we should see more and more cloud-compatible licenses before too long.</blockquote>
<p>Let's dig into the <a href="http://www.oracle.com/corporate/pricing/cloud-licensing.pdf">Oracle licensing terms</a> for a minute. If you read the licensing terms it becomes clear that two things have happened:</p>
<ol>
<li>Oracle will support certain versions of their flagship product running on certain hardware configurations in the cloud.</li>
<li>Oracle will license certain versions of their flagship product running on certain hardware configurations in the cloud.</li>
</ol>
<p>I'm sorry, but does Oracle just not get it? This type of license is the exact opposite of the utility model. While the hardware (and software, in the case of Red Hat, S3, SQS, etc.) is "pay for what you use", Oracle has decided that you will pay whether you are using the software or not. On <a href="http://www.redhat.com/solutions/cloud/">Red Hat's Cloud Computing page</a> we get the following quote:</p>
<blockquote>Cloud computing changes the economics of IT by enabling you to pay only for the capacity that you actually use.
</blockquote>
<p>This is what I always believed was the power of cloud computing. Pay for what you use. I hope Oracle does not become a model for other software vendors as Jeff states. I hope companies take a page from Red Hat's book in this case, especially if they are looking to enter and stay competitive in the SMB market.</p><div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7225965301398108875-1541820602098901745?l=blog.mobocracy.net'/></div>Blake Mathenyhttp://www.blogger.com/profile/15984995884622791435noreply@blogger.com0tag:blogger.com,1999:blog-7225965301398108875.post-18529473228975439472008-09-19T18:06:00.000-07:002008-09-19T19:38:24.312-07:00Digg: Still not interesting<p>As some of you know, during my job search last year one of the places I interviewed that ended up making me an offer was Digg. <a href="http://www.joestump.net/">Joe Stump</a>, who is now the lead architect at Digg was someone I knew from the Seattle PHP community and one of the folks that interviewed me. Being interviewed by someone you know is nice because you have a shared set of experiences that allow you to ask good questions that can help you make a better decision about whether or not to accept an offer.</p>
<p>One of the questions I asked Joe was, "Are you learning?" Joe essentially responded (and I'm paraphrasing) that he wasn't learning much, but that Digg operated at the largest scale he had ever been involved with, which kept things interesting for him. Knowing Joe's background, and that he's a bright guy, I inferred a few technology points from that:</p>
<ul>
<li>Typical LAMP stack</li>
<li>No web services</li>
<li>Database sharding</li>
<li>'Legacy', organically grown code base</li>
</ul>
<p>This was my opinion after talking with several engineers. Having worked in that environment for several years, it wasn't exactly what I was looking for and I ultimately ended up declining the offer.</p>
<p>In a series of recent blog posts, Digg engineers including Joe have begun describing the system and software architecture.</p>
<ul>
<li><a href="http://blog.digg.com/?p=168">How Digg works</a></li>
<li><a href="http://blog.digg.com/?p=213">Digg Database Architecture</a></li>
<li><a href="http://highscalability.com/digg-architecture">High Scalability - Digg Architecture</a></li>
</ul>
<p>One of the things that strikes me is that from a technology perspective, not a lot has changed in a year. The typical high traffic LAMP system still consists of:</p>
<ul>
<li>Caching (Memcache)</li>
<li>Distributed file system (MogileFS)</li>
<li>Monitoring (Nagios)</li>
<li>Asynchronous Processing (Gearman)</li>
</ul>
<p>It's about as vanilla as it gets from an architecture perspective. But what's wrong with that?</p>
<p>Clearly Digg has been successful, and as such their approach to technology has obviously worked. Anyone who has been tasked with scaling a web application is going to recognize the building blocks that Digg is using. However, by not building a distributed system (which is the path Digg has chosen), you will run into some of the following issues: increased coupling of software components, longer ramp-up for new developers, inability to update individual system components, difficulty in parallelizing development tasks and additional risk in new releases.
</p>
<p>Let's use a Unix pipes analogy for a minute. Assume that each component in a software system is a Unix tool: ls, grep, tail, etc. Imagine the command you are running is:</p>
<pre>
ls /bin/ | grep cat | tail
</pre>
<p>Each of these applications handles a very specific piece of functionality. You can use each application in isolation. You can upgrade any of these applications without affecting another. Different developers can work on each application in isolation. There are some obvious advantages to the Unix approach. This is one way you can think of a distributed system, except that instead of pipes you're (probably) using an IP-based transport, and instead of command line options you're using a well-defined API.
</p>
<p>Now imagine an application called lsgreptail. It's a single application that handles all of the above functionality. You lose the ability to use each part of the application in isolation (no mixability). The code base is larger so it's more difficult for developers to get up to speed on it or become an expert with it. Making a change to functionality in directory listing (ls) requires reinstalling the entire application. Tracking down a performance bug becomes more difficult due to the lack of component isolation. There are some obvious drawbacks to this approach in software development. This is how you can think of how Digg (and many LAMP based sites) has built their system.</p>
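<p>To make the analogy a little more concrete, here is a rough Python sketch of the composable style. The <code>grep</code>, <code>tail</code> and <code>pipeline</code> helpers and the sample listing are made up for illustration; nothing here reflects Digg's code. Each stage only knows about the stream of lines it receives, so any one of them can be swapped or upgraded without touching the others:</p>

```python
from collections import deque
from typing import Callable, Iterable, Iterator

# A stage takes a stream of lines and returns a new stream of lines.
Stage = Callable[[Iterable[str]], Iterator[str]]


def grep(pattern: str) -> Stage:
    """Build a stage that keeps only lines containing `pattern`."""
    def stage(lines: Iterable[str]) -> Iterator[str]:
        return (line for line in lines if pattern in line)
    return stage


def tail(count: int) -> Stage:
    """Build a stage that keeps only the last `count` lines."""
    def stage(lines: Iterable[str]) -> Iterator[str]:
        return iter(deque(lines, maxlen=count))
    return stage


def pipeline(source: Iterable[str], *stages: Stage) -> list:
    """Feed `source` through each stage in order, like a shell pipe."""
    stream = source
    for stage in stages:
        stream = stage(stream)
    return list(stream)


# Stand-in for the output of `ls /bin/` (hypothetical file names).
listing = ["bash", "cat", "chmod", "zcat", "bzcat", "ls"]
result = pipeline(listing, grep("cat"), tail(2))
print(result)
```

<p>An lsgreptail equivalent would be one function doing all three jobs, which works fine until someone needs only the grep part.</p>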
<p>The point is this: there is more to scalability than the number of simultaneous users you can support. As your business grows and becomes successful, scaling your development team is just as important, and it becomes increasingly difficult on a monolithic, lsgreptail-style application. Digg is digging their own hole (no pun intended) by continuing to build their system in this fashion.</p><div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7225965301398108875-1852947322897543947?l=blog.mobocracy.net'/></div>Blake Mathenyhttp://www.blogger.com/profile/15984995884622791435noreply@blogger.com5tag:blogger.com,1999:blog-7225965301398108875.post-89064288596828833602008-09-18T08:31:00.000-07:002008-09-18T09:25:53.436-07:00AWS CDN - Super Sweet<p>
One of the very first things I did with AWS (Amazon Web Services) was to use S3 (Simple Storage Service) and EC2 (Elastic Compute Cloud) to build a CDN on top of them. A CDN is essentially a way of distributing static content to your users rapidly in a scalable fashion. I built mine by publishing data to S3, using UltraDNS to direct users' requests to an appropriate availability zone in EC2 (east coast, west coast) based on their geographic location, and serving each request from EC2 with S3 as the backing store. Many people choose not to go this route for reasons of simplicity and just serve content directly out of S3. Well, now Amazon is going to do all the hard work for you.
</p>
<p>If you are on the early warning radar of Amazon Web Services, I'm sure that you received the following email this morning just like I did:</p>
<blockquote>
<p>
...we are excited to share some early details with you about a new offering we have under development here at AWS -- a content delivery service.
</p>
<p>
This new service will provide you a high performance method of distributing content to end users, giving your customers low latency and high data transfer rates when they access your objects. The initial release will help developers and businesses who need to deliver popular, publicly readable content over HTTP connections. Our goal is to create a content delivery service that:
<ul>
<li>Lets developers and businesses get started easily - there are no minimum fees and no commitments. You will only pay for what you actually use.</li>
<li>Is simple and easy to use - a single, simple API call is all that is needed to get started delivering your content.</li>
<li>Works seamlessly with Amazon S3 - this gives you durable storage for the original, definitive versions of your files while making the content delivery service easier to use.</li>
<li>Has a global presence - we use a global network of edge locations on three continents to deliver your content from the most appropriate location.</li>
</ul>
You'll start by storing the original version of your objects in Amazon S3, making sure they are publicly readable. Then, you'll make a simple API call to register your bucket with the new content delivery service. This API call will return a new domain name for you to include in your web pages or application. When clients request an object using this domain name, they will be automatically routed to the nearest edge location for high performance delivery of your content.
</p>
</blockquote>
<p>
Why is this significant?
<ul>
<li>Lowers the barrier to entry for small businesses wanting to use a CDN</li>
<li>Reduces the need to do DNS based geo-distribution on your own</li>
<li>Allows you to take advantage of something you are already using (AWS S3)</li>
<li>Allows you to simply 'enable' the service for existing items stored in S3</li>
</ul>
</p>
<p>
Given the expense of CDN services from companies like Akamai, Limelight and Level3, as well as the term commitments (you typically negotiate a rate in a fashion similar to bandwidth), many smaller companies have avoided using a CDN. This is despite the fact that using a CDN is one of the easiest ways to significantly improve perceived page load times for end users. By letting users pay for a CDN via the utility model that has become so popular with AWS, Amazon opens the door for Joe Developer to simply start out using a CDN and not have to make those kinds of trade-offs.
</p>
<p>
Very cool stuff. If anyone from Amazon happens to read this, please add me to your beta group :)
</p><div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7225965301398108875-8906428859682883360?l=blog.mobocracy.net'/></div>Blake Mathenyhttp://www.blogger.com/profile/15984995884622791435noreply@blogger.com1tag:blogger.com,1999:blog-7225965301398108875.post-8291421287204047982008-09-16T20:01:00.000-07:002008-09-16T20:01:01.092-07:00Practical TDD<p>
It took me several years to drink the <a href="http://en.wikipedia.org/wiki/Test-driven_development">TDD</a> kool-aid but now that I have I'm addicted. It's not that I didn't want to automate my testing, it's just that it wasn't particularly practical for me to do so. Having worked at startups over the past several years, I have never been able to find that balance between producing new code and appropriate test coverage for that code.
</p>
<p>
The problem has typically been that the testable API changed frequently enough that I spent as much time or more updating tests as I did writing new code. This was the problem with unit tests, and as such I never got around to writing them or using continuous integration tools like CruiseControl. However, at my current job we have managed to create a TDD methodology that works particularly well for us. It essentially works like this:
<ol>
<li>Agree upon web service API</li>
<li>Write Unit Tests that cover the new service API</li>
<li>Iterate on code until all service tests pass</li>
</ol>
</p>
<p>
The primary difference between this and any other testing methodology is that we focus on testing our web services as opposed to the underlying APIs. This gives us a few very concrete benefits:
<ul>
<li>The front-end team can begin coding against the service API immediately.</li>
<li>Increased test coverage with fewer tests due to service dependencies.</li>
<li>Immediate feedback on work in progress.</li>
<li>Breaking API changes caught immediately, reducing impact on customers.</li>
</ul>
The introduction of continuous integration via CruiseControl, along with 100% service coverage, has allowed us to see the benefits immediately. The number of bugs introduced into our production environment has dropped by a measurable amount since we created the test framework.
</p>
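<p>As a rough illustration of the three steps above, here is a minimal Python sketch. Everything in it is hypothetical: the <code>POST /users</code> contract, the status codes, and the in-process <code>UserService</code> class standing in for a real HTTP endpoint. The tests encode the agreed contract and are written first; the handler is then iterated on until they pass.</p>

```python
import json
import unittest


class UserService:
    """Hypothetical in-process stand-in for a web service endpoint."""

    def __init__(self):
        self._users = {}  # user_name -> user_id

    def handle(self, method, path, body=None):
        """Return (status_code, json_body) for a request."""
        if method == "POST" and path == "/users":
            name = json.loads(body)["user_name"]
            if name in self._users:
                return 409, json.dumps({"error": "user exists"})
            self._users[name] = len(self._users) + 1
            return 201, json.dumps({"user_id": self._users[name]})
        return 404, json.dumps({"error": "not found"})


class UserServiceContractTest(unittest.TestCase):
    """Tests written against the agreed service API, not its internals."""

    def setUp(self):
        self.service = UserService()

    def create(self, name):
        return self.service.handle(
            "POST", "/users", json.dumps({"user_name": name}))

    def test_create_returns_201_and_id(self):
        status, body = self.create("user1")
        self.assertEqual(status, 201)
        self.assertEqual(json.loads(body), {"user_id": 1})

    def test_duplicate_create_returns_409(self):
        self.create("user1")
        status, _ = self.create("user1")
        self.assertEqual(status, 409)

    def test_unknown_path_returns_404(self):
        status, _ = self.service.handle("GET", "/nope")
        self.assertEqual(status, 404)


def run_contract_tests():
    """Run the contract suite and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(
        UserServiceContractTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

<p>Because the tests only touch the service boundary, the code underneath can be refactored freely without rewriting them, which is the maintenance problem described earlier.</p>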
<h2>Resources</h2>
<p>
<ul>
<li><a href="http://en.wikipedia.org/wiki/Test-driven_development">TDD</a></li>
<li><a href="http://cruisecontrol.sf.net/">CruiseControl</a></li>
</ul>
</p><div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7225965301398108875-829142128720404798?l=blog.mobocracy.net'/></div>Blake Mathenyhttp://www.blogger.com/profile/15984995884622791435noreply@blogger.com0tag:blogger.com,1999:blog-7225965301398108875.post-88101726286860013492008-09-15T18:11:00.000-07:002008-09-15T19:38:45.930-07:00MySQL Multimaster Replication in an Asynchronous Environment<p>
By design, MySQL replication occurs asynchronously. That is to say, a change applied on the master is not necessarily applied on a slave at the same time. In a multi-master replicated environment (assuming two masters), each master is a slave to the other master. There are a few gotchas to consider when creating or editing data asynchronously in a multi-master environment. You can get bitten by these issues if you use AJAX, threads, or even images that are database protected or database backed. It's even possible to run across these in an entirely synchronous environment if the replication lag time is high enough.
</p>
<p>
Let's assume for this discussion we have the following table:
<pre>
CREATE TABLE users (
user_id INTEGER UNSIGNED AUTO_INCREMENT PRIMARY KEY,
user_name CHAR(64) DEFAULT '',
user_password CHAR(64) DEFAULT '',
user_last_login DATETIME DEFAULT '0000-00-00 00:00:00',
UNIQUE KEY (user_name)
);
</pre>
Let's also assume that we have two servers, master1 and master2.
</p>
<h2>Auto Increment Conflict</h2>
<p>
Assume that the following is submitted to master1:
<pre>
INSERT INTO users VALUES (0, 'user1', 'a609316768619f154ef58db4d847b75e', '1979-09-23');
</pre>
and let's label this as e1 (event1) and assume that master1 assigns user_id 1 to 'user1'.<br/>
The following is then submitted to master2:
<pre>
INSERT INTO users VALUES (0, 'user2', 'f522d1d715970073a6413474ca0e0f63', '1984-01-02');
</pre>
and let's label this as e2 (event2) and assume that master2 assigns user_id 1 to 'user2'.<br/>
Oops. Both masters have now handed out user_id 1 to different users, so the same user_id refers to a different user on each server, and each replicated insert collides with the row the other master already created.<br/>
This topic has been covered in depth elsewhere so I won't go into details. For fixes and more information see <a href="http://www.onlamp.com/pub/a/onlamp/2006/04/20/advanced-mysql-replication.html">Advanced MySQL Replication Techniques</a>. Note that I recommend using a UUID/GUID instead of AUTO_INCREMENT to avoid this type of problem; however, the MySQL function UUID() doesn't work with statement-based replication.
</p>
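<p>Here is a small Python simulation of the collision and of two common ways around it. It is only a model of the counters, not MySQL itself. The interleaving mirrors what the MySQL server variables <code>auto_increment_increment</code> and <code>auto_increment_offset</code> do; the UUID variant assumes the application generates the id, since UUID() on the server is the part that is unsafe with statement-based replication.</p>

```python
import itertools
import uuid


def auto_increment(offset=1, increment=1):
    """Model an AUTO_INCREMENT counter: offset, offset + increment, ..."""
    return itertools.count(start=offset, step=increment)


# Default settings: both masters count 1, 2, 3, ... independently.
master1_ids = auto_increment()
master2_ids = auto_increment()
collision = next(master1_ids) == next(master2_ids)  # both hand out 1

# Interleaved settings: master1 uses odd ids, master2 uses even ids.
master1_ids = auto_increment(offset=1, increment=2)
master2_ids = auto_increment(offset=2, increment=2)
m1 = [next(master1_ids) for _ in range(3)]
m2 = [next(master2_ids) for _ in range(3)]

# Alternative: the application generates the key and sends it as a literal,
# so the statement that gets replicated is deterministic.
user_id = str(uuid.uuid4())

print(collision, m1, m2, user_id)
```

<p>With two masters an increment of 2 is enough; the general rule is an increment at least as large as the number of masters, with a distinct offset on each.</p>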
<h2>Uniqueness Conflict</h2>
<p>
Let's assume that you have some AJAX code which attempts to create a user, 'user1' which results in the following SQL statement:
<pre>
INSERT INTO users VALUES (0, 'user1', 'a609316768619f154ef58db4d847b75e', '1979-09-23');
</pre>
being submitted to master1. Let's also assume that, for whatever reason, you attempt to create the user a second time against master2 (a timeout occurred, a side effect triggered another create, etc.).<br/>
You would think that the UNIQUE constraint would prevent the second creation. However, whether the statement is executed on both servers, and in what order, depends on a variety of factors, including server lag (the time between a statement being executed on one server and being replicated and executed on the other). As a result, you can end up with user1 being created on both master1 and master2 but with two different user_id values.<br/>
How can you avoid this pitfall? Avoid asynchronous, identical create statements; make the call synchronous. Additionally, you can configure your application to write to only one database for certain statements, which at the application level amounts to reverting to a typical master-slave replicated environment.<br/>
Yes, I have run into this in a production environment.
</p>
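<p>One way to sketch the "write to only one database for certain statements" idea is to choose the master deterministically from the value that must be unique. The function below is an illustration only; <code>master_for_user</code> and the <code>MASTERS</code> list are invented names, and a real deployment would hang this off its connection handling.</p>

```python
import zlib

# The two servers from this discussion; the names are just labels here.
MASTERS = ["master1", "master2"]


def master_for_user(user_name: str) -> str:
    """Always pick the same master for a given user_name.

    Two identical create requests for 'user1' then land on the same server,
    where the UNIQUE KEY on user_name can reject the second one locally.
    """
    digest = zlib.crc32(user_name.encode("utf-8"))
    return MASTERS[digest % len(MASTERS)]


first_attempt = master_for_user("user1")
retry_attempt = master_for_user("user1")
print(first_attempt, retry_attempt)
```

<p>The trade-off is that adding or removing a master changes the mapping, so this is a sketch of the routing idea rather than a complete scheme.</p>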
<h2>Update Conflict</h2>
<p>
Let's assume you submit the following update request to master1:
<pre>
UPDATE users SET user_last_login = '2008-09-15 17:58:30' WHERE user_id=1;
</pre>
which is executed on master1 at the time set in user_last_login plus 1 second (at 2008-09-15 17:58:31).<br/>
Now the following is submitted to master2:
<pre>
UPDATE users SET user_last_login = '2008-09-15 17:58:32' WHERE user_id=1;
</pre>
which is executed on master2 at the same time set in user_last_login plus 1 second (at 2008-09-15 17:58:33).<br/>
Lastly, the update on master2 is replicated to master1, and the update on master1 is replicated to master2.<br/>
Here is what was executed on master1:
<pre>
UPDATE users SET user_last_login = '2008-09-15 17:58:30' WHERE user_id=1;
UPDATE users SET user_last_login = '2008-09-15 17:58:32' WHERE user_id=1;
</pre>
and on master2:
<pre>
UPDATE users SET user_last_login = '2008-09-15 17:58:32' WHERE user_id=1;
UPDATE users SET user_last_login = '2008-09-15 17:58:30' WHERE user_id=1;
</pre>
Now we have different values on each server. Uh oh.
</p>
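<p>The divergence is easy to reproduce in a few lines of Python by replaying the same two statements in each server's order. The second function shows one possible mitigation, a guard that only moves user_last_login forward. That guard is my own illustration for this particular column and is not a general fix.</p>

```python
def replay(login_times):
    """Apply each UPDATE in order and return the final user_last_login."""
    user_last_login = None
    for value in login_times:
        user_last_login = value  # plain UPDATE: the last statement wins
    return user_last_login


def replay_if_newer(login_times):
    """Same replay, but each UPDATE only applies if it moves time forward.

    Mirrors a hypothetical extra condition such as
    "... WHERE user_id=1 AND user_last_login < '<new value>'".
    """
    user_last_login = None
    for value in login_times:
        if user_last_login is None or value > user_last_login:
            user_last_login = value
    return user_last_login


first = "2008-09-15 17:58:30"   # submitted to master1
second = "2008-09-15 17:58:32"  # submitted to master2

master1_order = [first, second]  # local update, then replicated update
master2_order = [second, first]  # local update, then replicated update

print(replay(master1_order), replay(master2_order))
print(replay_if_newer(master1_order), replay_if_newer(master2_order))
```

<p>The guard works here because a later timestamp is always the right answer for a last-login column; most updates have no such natural ordering.</p>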
<h2>Delete Conflict</h2>
<p>
This can occur when a delete runs on master1 and, before that delete is replicated to master2, an update that changes the key used in the delete's WHERE clause is executed on master2. Imagine the following is executed on master1:
<pre>
DELETE FROM users WHERE user_name='user1';
</pre>
and the following is executed on master2:
<pre>
UPDATE users SET user_name='user3' WHERE user_name='user1';
</pre>
When each statement is replicated, it no longer matches anything on the other server: master1 ends up with no row, while master2 keeps the row under the name 'user3'. Now we have inconsistent data on each server. Uh oh.
</p>
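<p>The same kind of replay shows the delete conflict. This sketch assumes statement-based replication, where each server re-runs the other's statement against its own copy of the table; the helper functions are stand-ins for the two SQL statements above.</p>

```python
def delete_user(table, name):
    """DELETE FROM users WHERE user_name = name."""
    return [row for row in table if row["user_name"] != name]


def rename_user(table, old, new):
    """UPDATE users SET user_name = new WHERE user_name = old."""
    return [dict(row, user_name=new) if row["user_name"] == old else row
            for row in table]


start = [{"user_id": 1, "user_name": "user1"}]

# master1 runs its own DELETE first, then the replicated UPDATE.
master1 = rename_user(delete_user(start, "user1"), "user1", "user3")

# master2 runs its own UPDATE first, then the replicated DELETE.
master2 = delete_user(rename_user(start, "user1", "user3"), "user1")

print(master1)
print(master2)
```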
<h2>Summary</h2>
In short, there are a variety of challenges to overcome in a multi-master replication setup. These problems are exacerbated by asynchronous operations on your data set. A few bullets of advice:
<ul>
<li>Route complex transactions so that they are written to a single server.</li>
<li>Monitor server lag.</li>
<li>Be prepared for failure. The more you distribute your data set and scale your service, the more you will need to deal with failures.</li>
</ul>
<h2>References</h2>
<ul>
<li><a href="http://www.onlamp.com/pub/a/onlamp/2006/04/20/advanced-mysql-replication.html">Advanced MySQL Replication</a></li>
<li><a href="http://www.dbspecialists.com/files/presentations/mm_replication.html">Multi-Master Replication</a></li>
</ul><div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7225965301398108875-8810172628686001349?l=blog.mobocracy.net'/></div>Blake Mathenyhttp://www.blogger.com/profile/15984995884622791435noreply@blogger.com0tag:blogger.com,1999:blog-7225965301398108875.post-30947065017283192162008-09-07T19:43:00.000-07:002008-09-07T20:25:29.001-07:00Google Code as Personal Wiki/VC ToolI figured I would put this up as some folks might find the idea useful. For a long time I've wanted an externally available, free, reliable, hosted environment that had a personal wiki and version control. I would have loved to find a Trac+Subversion environment but I didn't trust any of the free ones out there. I've got tons of documentation that I create that seems to get lost in a slew of .txt files in my home directory. Likewise I've got lots of sample code that gets created as Foo.java, foo.js, foo.php etc and ends up disappearing.
Google Code is a hosting service that has version control via subversion, issue tracking, wiki pages and a bunch of other features that I didn't really need. Today I noticed the link on Google Code saying, "Create a new project", so I thought, "What the heck?" Looking at the TOS and FAQ, there is nothing that prevents me from using this as a personal wiki and version control system. I already keep all of my source under a friendly LICENSE and have no problem doing the same for scripts and documentation as well.
I put everything I would need to check out my home directory (bash scripts, vim files, etc.) into Subversion. Perfect for when I hop on a new machine. Also good for synchronizing changes between environments. I also threw up a few small Java projects that have been sitting out of source control for too long; I'll keep adding more as they come up. I also got started putting a bunch of the files in my local doc/ directory into the wiki. Fortunately I've been using the Trac/MoinMoin syntax for a long time, so it shouldn't be too difficult to make the transition from local storage to remote.
One of my favorite features so far is that I can check out my wiki onto my local workstation and make changes there with my editor of choice. Very cool.
If you're interested, here are some links:
<ul>
<li><a href="http://code.google.com/p/fizz-buzz/">Fizz-Buzz Google Code Project</a></li>
<li><a href="http://code.google.com/p/fizz-buzz/wiki/WikiStart">Fizz-Buzz Wiki</a></li>
<li><a href="http://code.google.com/p/fizz-buzz/source/browse/#svn/trunk">Fizz-Buzz Source</a></li>
</ul>
So far I only see two downsides. First, I have to be very careful not to commit anything sensitive to the repository as it's public. This generally isn't a problem but I do double-check my commits. Second, unless I switch from Ant to Maven in the near future, jar files are going to send me over the 100MB limit sooner rather than later. I wonder why I don't have storage similar to Gmail or Picasa. Oh well.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7225965301398108875-3094706501728319216?l=blog.mobocracy.net'/></div>Blake Mathenyhttp://www.blogger.com/profile/15984995884622791435noreply@blogger.com1tag:blogger.com,1999:blog-7225965301398108875.post-61463766124231048132008-02-17T07:51:00.000-08:002008-02-17T08:10:44.543-08:00Beliefs and ProgrammingI don't post too often on here these days. I've moved to blogging with my current employer, <a href="http://www.compendiumblogware.com/">Compendium Blogware</a>. You can find new posts <a href="http://blogging.compendiumblog.com/blog/the-science-of-blogging">here</a>.
Occasionally though, there is a post that doesn't quite belong or isn't quite appropriate for the corporate world. This is one of those posts.
<a href="http://www.michaelkimsal.com/blog/">Michael Kimsal</a> put together a survey called <a href="http://www.kimsal.com/reldevsurvey/results.php">Religious affiliation and software development languages</a>, which you can also discuss on his blog <a href="http://michaelkimsal.com/blog/?p=458">here</a>. I downloaded the data set and made the following further analysis:
<ul>
<li>Found the top 25 languages by the number of people who filled out the survey</li>
<li>Found the top 5 religious affiliations for each language</li>
<li>I normalized all Christian religions into the affiliation 'Christianity'</li>
<li>I grouped agnostic and atheist declarations into 'AA'</li>
</ul>
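For anyone who wants to repeat the exercise, the steps above look roughly like this in Python. This is a sketch, not the script I ran: the column layout, the list of labels folded into 'Christianity', and the sample rows are all invented for illustration and are not the survey data.

```python
from collections import Counter, defaultdict

# Assumed labels to fold into 'Christianity'; the real survey used free text.
CHRISTIAN = {"catholic", "protestant", "baptist", "lutheran", "methodist"}


def normalize(affiliation):
    """Group atheist/agnostic as 'AA' and Christian labels together."""
    key = affiliation.strip().lower()
    if key in ("atheist", "agnostic"):
        return "AA"
    if key in CHRISTIAN:
        return "Christianity"
    return affiliation.strip()


def summarize(rows, top_languages=25, top_affiliations=5):
    """Map each of the top languages to its most common affiliations."""
    language_counts = Counter(language for language, _ in rows)
    per_language = defaultdict(Counter)
    for language, affiliation in rows:
        per_language[language][normalize(affiliation)] += 1
    return {
        language: per_language[language].most_common(top_affiliations)
        for language, _ in language_counts.most_common(top_languages)
    }


# Made-up rows of (language, affiliation), NOT taken from the survey.
sample = [
    ("Python", "Atheist"),
    ("Python", "Agnostic"),
    ("Python", "Catholic"),
    ("C", "Baptist"),
    ("C", "Lutheran"),
]
summary = summarize(sample)
print(summary)
```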
Given that only ~3815 people took the survey, not a whole lot can be drawn from the numbers. However, here is what I found; draw your own conclusions.
<ul>
<li>The top 10 languages in order were: Python, C, C++, Java, Javascript, Ruby, PHP, Lisp, Perl, Haskell</li>
<li>The top two affiliation declarations were: AA (Atheist, Agnostic), Christian. After that, Buddhist was most common.</li>
<li>Without normalization, the top declarations were Atheism, followed by Agnostic, followed by some variety of Christianity.</li>
</ul>
I'm not sure what the target group was. I didn't even know about the survey until after it was closed. However, I would say the results are in line with what I have observed in my own geeky social circles. Analytical people tend to question doctrine.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7225965301398108875-6146376612423104813?l=blog.mobocracy.net'/></div>Blake Mathenyhttp://www.blogger.com/profile/15984995884622791435noreply@blogger.com0tag:blogger.com,1999:blog-7225965301398108875.post-64017476821689664832007-10-16T04:56:00.000-07:002007-10-16T05:12:27.138-07:00Amazon Opens Up EC2Amazon today opened up its <a href="http://aws.amazon.com/ec2">EC2</a> (Elastic Compute Cloud) web service in an "unlimited beta". What does that mean? It means if you have an <a href="http://aws.amazon.com/">AWS</a> account, you can now sign up for EC2 without a long wait to become part of the beta group.
I haven't been blogging much lately as I joined a startup in Indianapolis about a month ago. Between the move from Seattle and the settling in I've been pretty busy. Now that things are starting to slow down a bit, one of the things I'm investigating is leveraging EC2/<a href="http://aws.amazon.com/s3">S3</a> for my company.
If you don't know those acronyms, S3 (Simple Storage Service) is essentially storage on demand and EC2 is compute on demand. The pricing uses a utility model, i.e. you pay for what you use and that's all, and the pricing is competitive. I did a cost analysis of hosting with EC2 vs a standard leased colo situation and EC2 was significantly less expensive: about 75% less. I still don't like the bandwidth pricing but I hope that will change over time as more users get on EC2.
I've played with EC2 for the past few months but hadn't found the perfect app for it. One of the challenges I'm currently facing is that at my new company, users upload lots of content. That content can vary in size (up to 10MB), is immutable, and is displayed/downloaded potentially many times. I don't want to have to worry about storage and backups of this content. Once the amount of data starts to move into the TB range, it becomes costly to store and back up that much content effectively.
Enter EC2/S3. You can imagine several EC2 instances that handle storage and retrieval of content from S3. A simple web service on top of this system allows for users to upload content essentially right to S3, and anyone can view content. Since no ACL is needed for the content, and I don't have to worry about SSL, this is for me a perfect application.
I am also currently aware of another EC2 customer using the platform for logging web requests. When users load a web page, they download a web bug from EC2. Since web logging and analytics can be potentially a costly and compute intensive application, this is another great way to utilize EC2. We (developers, technologists, hackers) are only beginning to see the possibilities of having a platform like EC2/S3 at our fingertips.
On another note, I am again hiring developers in Indianapolis. If you are, or know, bright hackers who are in the area or would be willing to move to the area, please contact me.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7225965301398108875-6401747682168966483?l=blog.mobocracy.net'/></div>Blake Mathenyhttp://www.blogger.com/profile/15984995884622791435noreply@blogger.com0tag:blogger.com,1999:blog-7225965301398108875.post-33450940086216755772007-08-08T23:47:00.000-07:002007-08-08T23:58:55.213-07:00One Time Passwords for Web AppsI recently did some traveling through Europe, and as I did I encountered my fair share of Internet cafes and sketchy net connections. In the Internet cafes I worried about keyloggers, screen capture utilities and rootkits. On the sketchy net connections in hotels I was primarily concerned about sniffers on the wire. In all, I got to thinking about one-time passwords for web applications, and why they seemingly don't exist.
One of the things I started thinking was: many people have a cell phone, so why not replace your SecurID card with a cell phone? When you go to log onto a site from an untrusted location, have an option where users can check a box and enter a PIN instead of their password. Once the PIN is successfully entered, the user receives a text message with a one-time password that is valid for a short period of time. The user then uses their PIN, along with the one-time password, to gain access to the site.
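A rough Python sketch of that flow follows. It is only an illustration of the idea: the class name, the five-minute lifetime, the six-digit code and the send_sms callback are all assumptions, and a real service would store hashed PINs and rate-limit attempts.

```python
import hmac
import secrets
import time

OTP_TTL_SECONDS = 300  # assumed lifetime for a one-time password


class OneTimePasswords:
    """Hypothetical PIN + SMS one-time-password flow."""

    def __init__(self, pins, send_sms, clock=time.time):
        self._pins = pins          # user -> PIN (plaintext only for the sketch)
        self._send_sms = send_sms  # callback(user, code), e.g. an SMS gateway
        self._clock = clock
        self._pending = {}         # user -> (code, expires_at)

    def _pin_ok(self, user, pin):
        expected = self._pins.get(user)
        return expected is not None and hmac.compare_digest(expected, pin)

    def request(self, user, pin):
        """Step 1: check the PIN, then text a fresh six-digit code."""
        if not self._pin_ok(user, pin):
            return False
        code = f"{secrets.randbelow(10**6):06d}"
        self._pending[user] = (code, self._clock() + OTP_TTL_SECONDS)
        self._send_sms(user, code)
        return True

    def login(self, user, pin, code):
        """Step 2: PIN plus code; the code is consumed on any attempt."""
        stored = self._pending.pop(user, None)
        if stored is None:
            return False
        expected, expires_at = stored
        return (self._pin_ok(user, pin)
                and hmac.compare_digest(expected, code)
                and self._clock() <= expires_at)


# Usage with a fake SMS gateway that just records what it would send.
outbox = []
otp = OneTimePasswords({"blake": "4821"},
                       lambda user, code: outbox.append((user, code)))
requested = otp.request("blake", "4821")
_, code = outbox[-1]
first_login = otp.login("blake", "4821", code)
second_login = otp.login("blake", "4821", code)  # code already used
print(requested, first_login, second_login)
```

A keylogger in an Internet cafe would capture the PIN and the code, but the code is dead after one use, and the PIN alone only lets an attacker trigger a text message to the real owner's phone.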
This would be easy and inexpensive to implement as a web service that you could offer to third parties, so why has no one tackled this problem? If you know, let me know.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7225965301398108875-3345094008621675577?l=blog.mobocracy.net'/></div>Blake Mathenyhttp://www.blogger.com/profile/15984995884622791435noreply@blogger.com3tag:blogger.com,1999:blog-7225965301398108875.post-9797775223869586652007-08-07T19:35:00.001-07:002007-08-07T19:53:53.419-07:00Back from DefconI spent this past weekend at Defcon 15. This is my 8th year going to Defcon and the conference keeps getting better. I got to meet up with some people I haven't seen in a few years so that was excellent. This year I mostly went to non-technical tracks, I went to what I would consider 'geek' tracks. My favorites were:
<ul><li>Creating Unreliable Systems, Attacking the Systems that Attack You</li><li>GeoLocation of Wireless Access Points and "Wireless GeoCaching"</li><li>Being in the know... Listening to and understanding modern radio systems</li><ul><li><a href="http://www.radioreference.com/">http://www.radioreference.com/</a>
</li></ul><li> Hardware Hacking for Software Geeks</li><ul><li><a href="http://www.sparkfun.com/">http://www.sparkfun.com/</a>
</li></ul><li>Satellite Imagery Analysis</li><ul><li><a href="http://rst.gsfc.nasa.gov/Front/tofc.html">http://rst.gsfc.nasa.gov/Front/tofc.html</a></li><li><a href="http://eyeball-series.org/">http://eyeball-series.org/</a></li><li><a href="http://digitalgeography.co.uk/">http://digitalgeography.co.uk/</a></li><li><a href="http://wikimapia.org/">http://wikimapia.org/</a>
</li></ul></ul> I got some great info from the "Hardware Hacking for Software Geeks" talk. I'm planning on building a micro-controller driven camera that acquires location via GPS and submits photos via Bluetooth. This probably already exists, but it should be fun to build. Wikimapia is also a really cool site.
Highlights of the weekend included the first Defcon wedding and an undercover reporter being outed.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7225965301398108875-979777522386958665?l=blog.mobocracy.net'/></div>Blake Mathenyhttp://www.blogger.com/profile/15984995884622791435noreply@blogger.com0tag:blogger.com,1999:blog-7225965301398108875.post-87751759015347454912007-06-26T22:09:00.000-07:002007-06-29T21:35:25.393-07:00On the Ethics of ContractingFor the past couple of months I have been doing contract work for a variety of local companies. When someone takes you on as a contractor, they have certain expectations about what you bring to the table. In particular, clients expect that you bring specific expertise to the company and can help them solve a particular problem more quickly than they could on their own with their given resources. This means you are, as a contractor, particularly well suited for <span class="blsp-spelling-error" id="SPELLING_ERROR_0">startups</span>, short-term projects, acquisitions and mergers.
In many circumstances, you are brought in as a domain expert and simply asked to do the "best" thing for that company while still solving their problem. There are no or very few technology requirements. Herein lies a problem of ethics. If you determine that the best solution is a technology that you have very little or no experience with, do you have a responsibility to inform the company of that fact and should you charge them for the time it takes you to get up to speed?
First, I'm not sure how you can recommend a technology if you have zero experience with it. Yet, I've seen it happen. If you have no experience, how do you know that the solution will meet their expectations? I don't care how much you've read about something, experience matters. Assuming that you take some time to work with the technology and then make a recommendation, you have a responsibility to let the company know what your experience level with the technology is. This mitigates risk, and allows the company to make an educated decision on how to move forward. Also ask yourself the question, "How much is this recommendation based on my own personal desire to become an expert on the technology?" If it plays a large part, do the right thing and at least reconsider the recommendation. If you genuinely selected the technology because it is the best fit, read on.
If you make a recommendation to use a technology that you have very little experience with, should you charge the company for your learning time? Assuming you let them know that you aren't an expert, and they still want you to handle the development, I still don't believe you should charge for learning time. The only time I believe that is appropriate is when the technology is specified and you aren't an expert. It shouldn't happen, but it does.
So, what do you think? Do contractors have an ethical (and perhaps legal) responsibility to disclose their level of expertise with a given technology and should they charge for the time they spend learning? Until now, I have felt alone in the thought camp of responsible disclosure and appropriate billing. What do other contractors out there do?<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7225965301398108875-8775175901534745491?l=blog.mobocracy.net'/></div>Blake Mathenyhttp://www.blogger.com/profile/15984995884622791435noreply@blogger.com0tag:blogger.com,1999:blog-7225965301398108875.post-81289878237795344292007-06-22T19:59:00.000-07:002007-06-22T20:34:50.331-07:00Distributed Computing FailuresI went to a talk several months ago given by Alan Robins, a Principal Engineer in the distributed systems engineering group at Amazon. The title of the talk was something like, "Performance and Availability" but the focus of the talk was much more on the how and why of what distributed computing technologies had failed at Amazon. It was really interesting. I took about three pages of notes, and they're more or less verbatim below. I wish he had released the slides, there was a lot of really good information that I was unable to get down on paper.
<ul><li>Technologies</li><ul><li>XA Distributed transactions (two phase commit)</li><ul><li>TP monitors such as Tuxedo</li></ul></ul><ul><li>RPC</li><ul><li>DCOM, DCE, CORBA, RMI, EJB</li></ul><li>Stateful Remote Objects</li><ul><li>RMI</li></ul></ul><li>Many dimensions to consider beyond performance and availability
</li><ul><li>Performance (TPS/Host Latency, etc)
</li><li>Availability: How many nines (time up/total time)?
</li><li>Scalability: How much effort to scale?
</li><li>Distributability: How much effort for multiple data centers?
</li><li>Evolvability: Effort to extend and mutate
</li><li>TCO: Hardware, licensing, dev or integration, operations and maintenance</li><ul><li>Reconsider performance and availability relative to TCO!</li></ul></ul><li>Distributed Transactions: Atomic transactions across multiple transactional resources
</li><ul><li>Example: Customer changes primary address and hits customer and address db</li><li>Dark Side</li><ul><li>Expensive. Reduces scalability of db server</li><li>Latency of commit is 5x more over normal transactions</li><li>Reduce throughput of application</li><li>If any resources are down, nothing can happen. Reduces availability</li></ul><li>Alternate to XA</li><ul><li>Be optimistic, commit what work you can</li><li>Do no harm: order commits such that if failure occurs you can live with inconsistent state</li><li>Compensate: undo previous commit or queue up rest of work for later</li><li>Design for failure: Minimize cross db foreign key refs, even denormalize</li><li>Tolerate dangling references and inconsistencies</li></ul></ul><li>Remote Procedure Calls: make a function call like it's local, but it's not</li><ul><li>Example: calculate shipping charges on a customer order</li><li>Dark Side</li><ul><li>Binary formats create dependencies</li><li>Evolving API forces client side rebuilds. Expect to evolve.</li><li>Service owners must run multiple versions of their software.</li><li>RPC tightly couples availability requirements.</li><li>Many fine grained requests have high latency over global distances</li></ul><li>Alternative to RPC</li><ul><li>Document passing paradigm</li><ul><li>Self describing wire format (XML)</li><li>Evolution without affecting old clients possible</li><li>Good for asynchronous message passing</li></ul><ul><li>RPC model still possible</li></ul><li>SOAP Problems</li><ul><li>Large messages</li><li>Expensive to parse and build DOMs</li></ul></ul></ul><li>Stateful Remote Objects (CORBA, EJB)</li><ul><li>Problem being solved: supports ? 
for clients, clients can make many fine grained calls, keeps data model on server, complex data model not transferred</li><li>Dark Side</li><ul><li>Mapping client session to stateful server is complex.</li><li>Servers must keep state for each client (reduces scalability)</li><li>Server failure fails a lot of clients (reduced availability)</li></ul><li>Alternative: Stateless servers with persistence store</li><ul><li>Servers handle each request independently</li><li>Use data in request to establish context</li><li>Return results to caller</li><li>Advantage: High performance, high availability, scales great</li><li>Disadvantage: Pushes state onto data store</li></ul></ul><li>Asynchronous messaging, Once-only delivery</li><ul><li>Problem being solved: service developers don't worry about dupes. They can just do what the request wants. Reduces application logic and complexity of handling dupes.</li><li>Example: Customer 1-clicks on an item
</li><li>Dark Side</li><ul><li>Almost impossible to guarantee. In order to ? everything must be transactional</li><ul><li>Double clicks happen all the time</li></ul></ul><li>Alternative to Only-Once</li><ul><li>Idempotence (quality of something that has the same effect if used multiple times at once): dupes handled correctly with respect to application.</li><li>Advantages: simple, enables more scalability and availability. Simplifies clients.</li><li>Disadvantages: Requires services to check their db. Sometimes service has to build look aside cache.</li></ul></ul><li>In order delivery: service doesn't worry about temporal discontinuities</li><ul><li>Example: Order adds A, adds B, adds C, deletes B, submits order</li><li>Dark Side</li><ul><li>Very difficult for infrastructure to manage total ordering.</li><li>Tight coupling.</li><li>Can't deliver message until current message delivered.</li><li>Eliminates availability and scalability</li></ul><li>Alternative: best effort delivery</li><ul><li>Developers deal with out of order messaging. 
Requires event to have a time stamp or sequence id.</li><li>Advantages: high throughput, optimistic delivery policy (deliver events when you can), very high availability</li><li>Disadvantage: Application developers must deal with out of order messages</li></ul></ul><li>Stored Procedures</li><ul><li>Easy for developers to write RPC type applications</li><li>DBAs can ensure db resources are used efficiently</li><li>Complex logic performed without moving data across the wire</li><li>Dark Side</li><ul><li>Database resources are the most expensive</li><li>Creates scaling limitations</li><li>Low performance</li><li>Application now split between application server and database</li></ul><li>Alternative</li><ul><li>Use a database for what it's good for: relational queries and updates</li><li>Keep business logic on server</li></ul></ul><li>Centralized Database</li><ul><li>ACID model is easy to program against, ensures consistency</li><li>Reads after write guaranteed to reflect write</li><li>Provides single synchronization point for all applications</li><li>Provides richest set of capabilities</li><li>Example: Customer information database</li><li>Dark Side</li><ul><li>Doesn't scale</li><li>Doesn't lend to global distribution</li><li>Most labor intensive</li><li>Least available</li></ul><li>Alternative: Lightweight operation datastores/caches (e.g. 
bdb)</li><ul><li>Datastore distributed geographically</li><li>Updates propagate via asynchronous messaging</li><li>Read operations are done locally</li><li>Updates done locally then write back to central or peer</li><li>Disadvantages</li><ul><li>Inconsistencies: Read after write not absolutely guaranteed</li><li>Partitions can cause multiple versions to exist on different peers</li><li>Requires distributed group management (DHT)</li></ul></ul></ul><li>The nature of distributed systems</li><ul><li>Nodes fail</li><li>Networks partition</li><li>Data centers go down</li></ul><li>There is a tradeoff between availability and consistency</li><ul><li>Via distribution and redundancy you gain availability, scalability and performance but lose consistency</li><li>Strive for eventual consistency</li></ul><li>Embrace failure: build in availability</li><li>Accept inconsistency.</li><ul><li>Apology oriented development</li></ul><li>If you are a developer, deal with these things: Potential inconsistencies considering race conditions</li><li>Infrastructure cant hide</li><li>Model applications as event driven process. Include all info needed in each message. Prop, repl, cache. This provides high performance and high availability</li></ul>There are talks of this nature fairly regularly at UW and Seattle University, I encourage you to go when you can. 
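A minimal sketch of the idempotence alternative from the notes above, written for illustration and not taken from the talk. Idempotence here means that applying the same request several times has the same effect as applying it once. The sketch assumes each request carries a client-generated request id, and it uses an in-memory dict as the "look aside cache"; the names OrderService and place_order are hypothetical.

```python
class OrderService:
    """Toy service whose place_order call is safe to repeat."""

    def __init__(self):
        # Look-aside cache: request id -> result already returned.
        self._seen = {}
        self.orders = []

    def place_order(self, request_id, customer, item):
        # A duplicate (for example a double click) returns the stored
        # result instead of creating a second order.
        if request_id in self._seen:
            return self._seen[request_id]
        order_number = len(self.orders) + 1
        self.orders.append((order_number, customer, item))
        self._seen[request_id] = order_number
        return order_number


service = OrderService()
first = service.place_order("req-1", "alice", "book")
again = service.place_order("req-1", "alice", "book")  # duplicate delivery
other = service.place_order("req-2", "alice", "lamp")
print(first, again, other, len(service.orders))
```

The duplicate call creates no second order, which is the point: the messaging layer is allowed to deliver more than once, and the service absorbs it. In a real service the cache would have to live in the service's own datastore, which is the disadvantage the notes mention.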
This was one of the most informational talks I have ever been to, and it was free.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7225965301398108875-8128987823779534429?l=blog.mobocracy.net'/></div>Blake Mathenyhttp://www.blogger.com/profile/15984995884622791435noreply@blogger.com0tag:blogger.com,1999:blog-7225965301398108875.post-45363690775165949852007-06-22T04:03:00.001-07:002007-06-22T04:08:17.353-07:00Trac reminds me of OracleI just finished getting a <a href="http://trac.edgewall.org/"><span class="blsp-spelling-error" id="SPELLING_ERROR_0">trac</span></a> installation up on my site, you can find projects <a href="http://mobocracy.net/code/">here</a>. And, while I love <span class="blsp-spelling-error" id="SPELLING_ERROR_1">Trac</span>, the install process just reminded me of Oracle. It took several hours, had a bunch of dependencies, and to get it to work the way I wanted it required several more hours of customization.
Granted, if you need an "enhanced wiki and issue tracking system" that has integration with subversion and works pretty well, <span class="blsp-spelling-error" id="SPELLING_ERROR_2">Trac</span> is tough to beat. But getting it running on a slightly older system was no easy task. If they want that tool to become more widely used the developers are going to have to fix the installation issues and create some type of an automated installer.
Of course, I really have no right to complain considering I haven't written a single line of code for the project.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7225965301398108875-4536369077516594985?l=blog.mobocracy.net'/></div>Blake Mathenyhttp://www.blogger.com/profile/15984995884622791435noreply@blogger.com0tag:blogger.com,1999:blog-7225965301398108875.post-78156241335330812822007-06-20T10:27:00.000-07:002007-06-20T11:21:21.732-07:00Martin Roesch on Snort 3.0 and SourcefireYesterday <span class="blsp-spelling-error" id="SPELLING_ERROR_0">Sourcefire</span> put on a two hour presentation at the <span class="blsp-spelling-error" id="SPELLING_ERROR_1">EMP</span> here in Seattle. With admission you got some <span class="blsp-spelling-corrected" id="SPELLING_ERROR_2">swag</span> including a calendar and a snort toy, admission to the Sci-<span class="blsp-spelling-error" id="SPELLING_ERROR_3">Fi</span> museum for the afternoon and an "ice cream social". Below are <span style="font-weight: bold;">notes</span> from the presentation, these are not my <span class="blsp-spelling-corrected" id="SPELLING_ERROR_4">opinion</span>. Overall I found the presentations pretty interesting with them covering the following topics:
<ul><li><span class="blsp-spelling-error" id="SPELLING_ERROR_5">Sourcefire</span> & Snort; past, present & future</li><li>Demo of their RNA/<span class="blsp-spelling-error" id="SPELLING_ERROR_6">ETM</span> tools</li><li>Snort 3.0</li><li><span class="blsp-spelling-error" id="SPELLING_ERROR_7">Sourcefire</span> 4.7</li></ul>In particular, I wanted to hear Marty's thoughts on Snort 3.0 and where he is heading. Martin said that the 3.0 release would focus on the following areas:
<ul><li>Reduce Manual Tuning & Automate Configuration</li><ul><li>"Tuning today is a failure"</li><ul><li>We need dynamic defense for dynamic networks
</li></ul></ul><li>Solve layer 3/4 evasion due to the IDS not being <span class="blsp-spelling-error" id="SPELLING_ERROR_8">IP</span> stack aware</li><ul><li>Model the way an endpoint sees, model the <span class="blsp-spelling-error" id="SPELLING_ERROR_9">IP</span> stack
</li></ul><li>Normalize rules and configuration languages</li><ul><li>Pro</li><ul><li>Rules work well</li><li>Trivial to use for simple stuff</li></ul><li>Con</li><ul><li>Ugly</li><li>Hard to do hard things</li><li>A bad rule can significantly impact performance</li></ul><li>Snort is not a language project</li><ul><li><span class="blsp-spelling-error" id="SPELLING_ERROR_10">LUA</span> will be snort 3.0's next generation language processor</li><li>Snort 3.0 will include a command shell that will allow <span class="blsp-spelling-error" id="SPELLING_ERROR_11">LUA</span> commands to be executed
</li></ul></ul><li>Take better advantage of hardware</li><ul><li>We are getting more cores, not speed. Snort is single threaded, this is a problem.</li><ul><li>Must multi-thread snort
</li></ul><li>Vendors are accelerating the wrong parts of Snort and have been for years</li><ul><li>Need explicit locations for optimization.
</li></ul></ul></ul>Martin asserts that tuning, prioritization and evasion are the same problem. The root of this problem is a lack of knowledge of what is being defended. The solution is to impart knowledge about the operating environment directly into the engine. This allows for the engine to tune itself, automate anti-evasion and automate prioritization.
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://mobocracy.net/images/snort_3.0.png"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px;" src="http://mobocracy.net/images/snort_3.0.png" alt="" border="0" /></a>Above is the snort 3.0 architecture as described/shown by Martin. I think of primary interest is the <span class="blsp-spelling-error" id="SPELLING_ERROR_12">rearchitecture</span> and threading. I will be surprised if Martin is able to release RNA as open source and integrate that into Snort. If that doesn't happen, it either means that the automation features won't make it into Snort or they won't work nearly as well as RNA.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7225965301398108875-7815624133533081282?l=blog.mobocracy.net'/></div>Blake Mathenyhttp://www.blogger.com/profile/15984995884622791435noreply@blogger.com1tag:blogger.com,1999:blog-7225965301398108875.post-13585288571499658432007-06-19T11:55:00.000-07:002007-06-22T04:01:57.240-07:00Work in Progress: CopyBlogI used to have a <span class="blsp-spelling-error" id="SPELLING_ERROR_0">Wordpress</span> blog, and although there weren't many posts I wanted to import them into Blogger. I found a bunch of tools for importing from Blogger to <span class="blsp-spelling-error" id="SPELLING_ERROR_1">Wordpress</span>, but none that did the opposite. I found one tool that did what I was wanting (or says so), <a href="http://code.google.com/p/blogsync-java/"><span class="blsp-spelling-error" id="SPELLING_ERROR_2">blogsync</span>-java</a>, but looking at the code it isn't very modular. Given that my technology tastes seem to change every other month, I really wanted a tool that would allow me to copy posts and comments between any two blog systems. Hence, <span class="blsp-spelling-error" id="SPELLING_ERROR_3">CopyBlog</span>.
<span class="blsp-spelling-error" id="SPELLING_ERROR_4">CopyBlog</span> is a command line tool that allows you to copy posts and comments between any two blog systems, at least in theory.
I spent some time yesterday and this morning writing some code to take <span class="blsp-spelling-error" id="SPELLING_ERROR_5">Wordpress</span> posts and comments and import them to Blogger. Although this currently just replicates the functionality of <span class="blsp-spelling-error" id="SPELLING_ERROR_6">blogsync</span>-java, the <span class="blsp-spelling-error" id="SPELLING_ERROR_7">API</span> is much more modular so you should be able to drop in a single class and you can immediately copy to and from that blog type. I should have a 0.1 version out the door this week if I can find some free time, and that would include full support for Blogger, <span class="blsp-spelling-error" id="SPELLING_ERROR_8">Wordpress</span> and <span class="blsp-spelling-error" id="SPELLING_ERROR_9">LiveJournal</span>.
You can find source code up at <a href="http://mobocracy.net/code/CopyBlog">http://mobocracy.net/code/CopyBlog</a><div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7225965301398108875-1358528857149965843?l=blog.mobocracy.net'/></div>Blake Mathenyhttp://www.blogger.com/profile/15984995884622791435noreply@blogger.com0tag:blogger.com,1999:blog-7225965301398108875.post-75368002631931353542007-06-17T14:10:00.001-07:002007-06-17T15:49:35.605-07:00The Challenges of EC2I've recently been working on a project building out the development process, environment and tools for a startup client. This includes things like configuration management, release engineering, automated testing, version control, etc. In doing so, I've been wanting to create as part of this several images including:
<ul><li>A bootable and optionally installable build system (see <a href="http://blog.mobocracy.net/2007/06/bootable-development-environments.html">this</a> post)
</li><li>A bootable development environment for each developer (preconfigured application server, version control data, preconfigured database, etc)</li><li>A bootable QA and Beta environment</li></ul>I was hoping to use <a href="http://aws-portal.amazon.com/">AWS</a> for this, particularly <a href="http://aws.amazon.com/ec2">EC2</a> for the development, QA and Beta environments. If you're not familiar with EC2, it is Amazon's Elastic Compute Cloud, and it allows you to essentially boot and run OS images in their cloud. You pay for hourly CPU usage and data transfer from the cloud and to the cloud. Let's compare a fully hosted, dedicated server solution from <a href="http://www.theplanet.com/">the planet</a> with an EC2 image.
<div><table border="0" cellpadding="3" cellspacing="0"><tbody valign="top"><tr><td width="33%"> </td><td width="33%">The Planet</td><td width="33%">Amazon EC2</td></tr><tr><td width="33%">OS</td><td width="33%">CentOS</td><td width="33%">Linux w/2.6 Kernel</td></tr><tr><td width="33%">Data Transfer</td><td width="33%">1500GB</td><td width="33%">Unlimited</td></tr><tr><td width="33%">Disk Space</td><td width="33%">250GB</td><td width="33%">160GB</td></tr><tr><td width="33%">RAM</td><td width="33%">2048MB</td><td width="33%">1792MB</td></tr><tr><td width="33%">Cost</td><td width="33%">$147/month</td><td width="33%">$312/month</td></tr></tbody></table></div>
The bandwidth figures above assume that you use your full 1500GB a month (and the same amount in EC2) and that inbound traffic is only 25% of the total traffic. That number comes from my own experience and is probably overly generous. If you decrease the percentage of inbound traffic, the price increases. The comparison also assumes 24/7 operation of a machine over a 30 day period. We also amortized a $225 setup fee at the planet over a 12 month period. So, EC2 is more expensive for a front-end web server than a hosted environment, however your downtime due to hardware related failures decreases to almost zero. You also have no setup fee and you can literally bring images up in minutes (so I'm told, more on that later). However, for an N-tier system, EC2 is a very inexpensive solution for your middleware application servers and backend servers since traffic between EC2 systems costs you nothing. Your cost for operating an EC2 image on a 24/7 basis and only doing inter-image traffic? $72.00/month.
So, purely on a cost basis, EC2 seems like a good platform for at least middleware and backend systems. Also, from a downtime perspective, not having hardware to deal with should increase availability.
In doing some research, I found the following troubling points. First, the internet is apparently full. This limited beta is currently full, a helpful error message tells me, and I will be notified when there is availability. Okay, that sucks. Second, EC2 doesn't natively have persistent storage for images. That is, if an image fails, aborts or shuts down, any data stored on the local disk is lost to you. Apparently you can mount an S3 partition in an EC2 image, but S3 isn't really meant for random IO like you might have in a database, for instance.
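The EC2 numbers can be reproduced with a little arithmetic. The sketch below is a hedged reconstruction: the per-hour and per-GB rates are assumptions, chosen because they reproduce the $72 and $312 figures quoted in this post. They are not taken from a price sheet, and the function name ec2_monthly_cost is made up for illustration.

```python
HOURS_PER_MONTH = 24 * 30          # 24/7 operation over a 30 day period
INSTANCE_RATE = 0.10               # assumed $/instance-hour
INBOUND_RATE = 0.10                # assumed $/GB transferred in
OUTBOUND_RATE = 0.18               # assumed $/GB transferred out


def ec2_monthly_cost(total_gb, inbound_fraction):
    """Return (compute, transfer, total) in dollars for one instance."""
    compute = HOURS_PER_MONTH * INSTANCE_RATE
    inbound_gb = total_gb * inbound_fraction
    outbound_gb = total_gb - inbound_gb
    transfer = inbound_gb * INBOUND_RATE + outbound_gb * OUTBOUND_RATE
    return compute, transfer, compute + transfer


compute, transfer, total = ec2_monthly_cost(1500, 0.25)
print(f"compute ${compute:.2f}, transfer ${transfer:.2f}, total ${total:.2f}")

# Less inbound traffic means more (pricier) outbound traffic.
_, _, total_10_percent_in = ec2_monthly_cost(1500, 0.10)
print(f"with 10% inbound: ${total_10_percent_in:.2f}")
```

With these assumed rates the 25% inbound case comes to $72 of compute plus $240 of transfer, which matches the $312 in the table, and dropping the transfer term leaves the $72 quoted for inter-image traffic only.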
I hope that EC2 opens up to new users in the near future, because it does solve many common problems (rapid horizontal scaling for the web tier, ephemeral environments for dev or qa). It is not however a silver bullet. Bandwidth to and from the cloud is expensive, and the lack of persistent storage makes many jobs impractical for the platform. If I ever get a chance to test, I'll put more information up.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7225965301398108875-7536800263193135354?l=blog.mobocracy.net'/></div>Blake Mathenyhttp://www.blogger.com/profile/15984995884622791435noreply@blogger.com0tag:blogger.com,1999:blog-7225965301398108875.post-24849081648939401562007-06-15T18:24:00.000-07:002007-06-15T18:53:04.271-07:00SASAG Meeting for JuneLast night was the monthly <a href="http://www.seattle-sage.org/">SASAG</a> (Seattle Area System Administrators Guild) meeting, and it was the first I had been to. As I'm not a system administrator but a software engineer, I wasn't sure what I would gain from going, but it turns out a lot. Last night's topic of discussion was "Project Success: Science, Magic or Luck?" and was presented by Leeland Artra, who is currently a PM at Qpass. Essentially, Leeland was suggesting applying software development techniques to system administration. For example, he suggested using test first development methodologies (I always knew them as TDD, Test Driven Design) to work towards a functional system. He also recommended using scrum to manage projects.
My experience with systems engineers is that their programming experience is limited to systems programming, and things like functional and unit tests are foreign to them. If this isn't true, by all means let me know. So the suggestion for non-programmers to write programs to validate systems seemed counterintuitive to me. However, I liked the idea of using TDD to move forward on a project.
I'm not sure why Leeland didn't suggest simply using Nagios or another monitoring system as your test platform. You can imagine writing all your monitors for Nagios up front. They all start off red (non-functional), and as parts of the system come online and become functional, the monitors go to green. This seems like a much more straightforward and intuitive way to use TDD for non-development projects.
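The red-to-green idea can be sketched without any particular monitoring product. The following is a hypothetical harness, not Nagios configuration or its plugin API: each check is a plain function returning True or False, and run_checks, the check names and the system dict are all invented for illustration.

```python
def run_checks(checks):
    """Run each named check and report 'green' or 'red'.

    A check that raises is treated as red, the same as one returning False.
    """
    report = {}
    for name, check in checks.items():
        try:
            report[name] = "green" if check() else "red"
        except Exception:
            report[name] = "red"
    return report


# Stand-in for the state of the system being built.
system = {"dns": False, "web": False}

checks = {
    "dns resolves": lambda: system["dns"],
    "web server answers": lambda: system["web"],
}

before = run_checks(checks)      # written first: everything is red
system["dns"] = True             # one part of the system comes online
after = run_checks(checks)
print(before)
print(after)
```

In practice each lambda would be replaced by a real probe (a DNS lookup, an HTTP request), and the project is done when the whole report is green.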
Regardless, the idea of applying agile methodologies to non-development projects is an interesting one and one I hadn't seriously considered before. I'm not sure how well it would apply to projects with serious capital expenditures such as hardware acquisitions, but the ideas should apply pretty well for any project. Leeland also showed off an interface he had developed for testing complex systems, which just seems unfair since I know it's not open source.
On another note, I had my all time longest interview today. 7 hours and 8 people including the CTO, the hiring manager, two developers, one system programmer, one system engineer, the ops manager and the HR person. I really enjoy interviews that are challenging because you know you're going to be working with other good people. I mentally collapsed in my last technical interview though, so who knows what they thought. Updates ahead.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7225965301398108875-2484908164893940156?l=blog.mobocracy.net'/></div>Blake Mathenyhttp://www.blogger.com/profile/15984995884622791435noreply@blogger.com0tag:blogger.com,1999:blog-7225965301398108875.post-39751012037216010002007-06-14T22:17:00.000-07:002007-06-14T22:46:42.148-07:00What do you want to be when you grow up?I've been doing software development, system engineering and architecture for almost 10 years. I've worked at large companies, small companies and tiny companies. And after leaving Mixxer in February with the intention of going back to school, for the first time in many years I felt a bit lost in terms of "What do I want to do now?". So I spent about 8 weeks traveling, saw family, did some vacationing, and got back feeling refreshed with what I had hoped would be a new perspective on things. I didn't have that though.
What I had instead was the desire to go back to work, but no idea about what I wanted to be doing. So I started interviewing with everyone, 19 companies to be exact, doing everything from embedded C & C++ development to Ruby on Rails at companies ranging from fortune 100 to pre-funding startups. During the interview process I have kept busy consulting for small startups; helping with software development, architecture and direction. In doing some consulting it helped me figure out not what I want to do, but the characteristics by which I will be able to identify what I want to do. Characteristics of the right job include:
<ul><li>A company that believes they are improving the quality of life for its users</li><li>Coworkers who are really smart and passionate about the company mission</li><li>A startup</li><li>People who get the philosophy behind the technology they are using</li><li>The hacker ethos is prevalent
</li><li>Decisions are made based on merit, not ego</li><li>A technical community based on meritocracy, not seniority</li></ul>After determining that the above list would help me classify the right company for me, my list of companies dropped from 19 to 4. I'm wrapping up interviews with those 4 now, although the way I crumble in my in person interviews it could drop to zero pretty quickly :)
In any case, being able to identify what it is exactly about working that you love is crucial to being able to find the right job. Sometimes you don't have a choice, you have responsibilities that drive you towards finding the first available and well paying position. I choose to wait.
If you think you fit the environment I described above, and based on my <a href="http://mobocracy.net/resume.pdf">resume</a> I look like a good fit, send me an email.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7225965301398108875-3975101203721601000?l=blog.mobocracy.net'/></div>Blake Mathenyhttp://www.blogger.com/profile/15984995884622791435noreply@blogger.com0tag:blogger.com,1999:blog-7225965301398108875.post-58283056822021842932007-06-14T22:05:00.000-07:002007-06-14T22:17:22.226-07:00Botnets and the Convex HullOver the past few months I have worked on some computational geometry problems which required computing the Convex Hull for some set of points. I have been using it for some pattern recognition work and in doing so thought to myself, how could you map an IP address to a real vector space? And, if you could, is it possible to track an attacker or adversary? More importantly, can you estimate size or infer the location of a master in a botnet?
Now, I realize that the location of a compromised host has no bearing on the location of the attacker. However, the latency between the compromised host and the attacker (or botnet master) does have a bearing on location. Likewise, there are a number of other useful metrics such as how recently the machine was compromised, the difference in times for two zombies to receive the same command, etc.
Take one of these metrics, and assign it to each node you are aware of. Now use that metric as the distance to an arbitrary point P. Now compute the convex hull. Perform this same series of steps for each of the metrics you have chosen and overlay the convex hull for each metric. My assumption would be that your arbitrary point P could be identified in each one, and that may help identify a master. Also, it may help estimate the size of the botnet.
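For reference, the convex hull step on its own can be sketched as below. This uses Andrew's monotone chain algorithm, which is an assumption (the post does not say which algorithm was used), and the sample points are made up. How to turn a per-node metric into 2D coordinates is left open here, as it is in the text.

```python
def cross(o, a, b):
    """Z component of (a - o) x (b - o); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Drop points that would make a right turn or lie on a straight line.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


sample = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2), (1, 3)]
hull = convex_hull(sample)
print(hull)
```

Overlaying hulls for several metrics would then amount to calling convex_hull once per metric on the corresponding point set and comparing the results.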
The above writing is very hand wavy, I realize. However I'm curious if any work has been done until now to determine botnet topology via a similar mechanism. If anyone is aware, please let me know.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7225965301398108875-5828305682202184293?l=blog.mobocracy.net'/></div>Blake Mathenyhttp://www.blogger.com/profile/15984995884622791435noreply@blogger.com0tag:blogger.com,1999:blog-7225965301398108875.post-32807075974988892982007-06-14T08:56:00.000-07:002007-06-14T22:04:57.998-07:00Reviving WADEA couple of years ago I started a small project called WADE. WADE stands for "Wireless ADvertising Engine", and its goal was to enable coffee shops and other sites providing free wi-fi to earn money from advertising to help offset the cost of the wi-fi. The technology would essentially allow sites to replace the existing ads on a page with ads that they get paid for. More specifically, it would use something like the Adblock filter list, Apache, mod_rewrite/mod_proxy and some DNS magic to replace ads, instead of removing them, with perhaps more relevant ads from local businesses.
When I went to a friend at the EFF, he informed me that there may be an issue with Copyright law. As in, the content and layout of a page are protected under copyright law. I'm not sure if I'm liable for providing the software, the coffee shop is liable for using the software, or if it is a non-issue.
In any case, I have received a few emails about WADE over the past couple of months and have thought I would revive the software and send it to a few friends who can use it. The question is, does something already exist and I should just point people there?<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7225965301398108875-3280707597498889298?l=blog.mobocracy.net'/></div>Blake Mathenyhttp://www.blogger.com/profile/15984995884622791435noreply@blogger.com0tag:blogger.com,1999:blog-7225965301398108875.post-88472826689825537412007-06-13T19:01:00.000-07:002007-06-22T04:02:43.534-07:00REALM Part I: Tomcat & ServletsThis is the first in a five part series on the REALM stack. The previous introduction can be found <a href="http://blog.mobocracy.net/2007/06/realm-stack.html">here</a>. In this posting I will introduce Tomcat & Servlets as well as review the basic tomcat installation, organization of a deployment, organization of your source and the actual development process. A link to source code will be provided at the end of the tutorial, and assuming you have tomcat installed you should be able to modify the included build.properties file and type "ant install". I will not cover ant or tomcat installation, as there are a number of good tutorials on the web. These applications are available for most Linux distributions.
Some of this information has been taken from, "Developing Applications with Tomcat" which is available <a href="http://tomcat.apache.org/tomcat-3.3-doc/appdev/contents.html">here</a>.
<span style="font-weight: bold;">The Project</span>
CalculatorInc.com wants to provide a website where people can do basic arithmetic operations on arbitrarily large numbers. They are sure it will be the next big thing in this "web 2.0" space; however, they haven't yet discovered web services or Rails, so they have implemented the entire thing using JSP & Servlets.
<a href="http://mobocracy.net/code/calculator">Here</a> is a link to the project source. It is a reasonable starting place for any Servlet/JSP project and contains the following: an Ant build file, a JavaCC grammar, unit tests, a servlet and a JSP page. It's pretty basic, but we're going to use it as a starting point: we will enhance it with Spring, add remoting to provide a web service and finally hook it up to Rails. Since I recently <a href="http://blog.mobocracy.net/2007/06/stop-giving-out-your-passwords.html">made fun of</a> a calculator web service, we'll create a calculator web service. It only has the basic operations (+,-,/,*,%,^), but it uses a JavaCC grammar, so if you haven't used JavaCC before it's a good simple intro. The requirements are in the docs/README.txt file, but in short: Java >= 1.5, Tomcat 6 (5 probably works, 4 perhaps as well), JavaCC and Ant. That should be about it. Basic install instructions (edit build.properties, ant install test) are in that same document.
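The project parses its expressions with a JavaCC grammar, which isn't reproduced here. As a rough sketch of what "arbitrarily large numbers" can look like in Java, the six operators could be applied with java.math.BigInteger. The class and method names below (BigCalc, apply) are made up for illustration and are not taken from the project source, and the sketch assumes integer-only arithmetic.

```java
import java.math.BigInteger;

// Hypothetical helper, not part of the calculator project source.
class BigCalc {
    // Applies one of the six basic operators to two arbitrarily large integers.
    static BigInteger apply(BigInteger left, char op, BigInteger right) {
        switch (op) {
            case '+': return left.add(right);
            case '-': return left.subtract(right);
            case '*': return left.multiply(right);
            case '/': return left.divide(right);     // integer division; ArithmeticException on zero
            case '%': return left.remainder(right);  // ArithmeticException on zero
            case '^': return left.pow(right.intValue()); // assumes a non-negative exponent that fits in an int
            default:  throw new IllegalArgumentException("Unknown operator: " + op);
        }
    }

    public static void main(String[] args) {
        BigInteger big = new BigInteger("123456789012345678901234567890");
        // Far beyond the range of a long, but no overflow occurs.
        System.out.println(apply(big, '*', big));
        System.out.println(apply(BigInteger.valueOf(2), '^', BigInteger.valueOf(100)));
    }
}
```

A real grammar would also have to deal with precedence, parentheses and whitespace, which is exactly the part JavaCC takes care of.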
<span style="font-weight: bold;">What is Tomcat?</span>
First and foremost, Tomcat is a web container. Sun defines a web container as follows:
<blockquote>"A container that implements the Web component contract of the J2EE architecture. This contract specifies a runtime environment for Web components that includes security, concurrency, life-cycle management, transaction, deployment, and other services. A Web container provides the same services as a JSP container as well as a federated view of the J2EE platform APIs. A Web container is provided by a Web or J2EE server." </blockquote>That's a wordy definition. I would say, "Tomcat allows Java code to run in a web environment". It also does all of the things described above, but for the purposes of this discussion the shorter definition is fine. Secondly, Tomcat implements the JSP and Servlet API specifications. For Tomcat 6.0, which is the most recent release of Tomcat, that means the Servlet 2.5 and JavaServer Pages 2.1 specifications. These can be found <a href="http://jcp.org/aboutJava/communityprocess/mrel/jsr154/index.html">here</a> and <a href="http://jcp.org/aboutJava/communityprocess/final/jsr245/index.html">here</a>, respectively.
In general you configure Tomcat via XML files that can be found in the conf directory. Tomcat also has its own web server. This is fine for development purposes; however, you almost always want to use something like mod_jk in production settings, assuming you are using Tomcat in the web tier as well as the middleware tier. There is excellent documentation for Tomcat available <a href="http://tomcat.apache.org/tomcat-6.0-doc/index.html">here</a>. Competitors include Jetty, Geronimo, Resin, BEA WebLogic and JBoss.
<span style="font-weight: bold;">What is a Servlet?</span>
A servlet is a Java server application that answers and fulfills requests from clients. It's that simple. Tomcat interacts with and manages your servlets; that is one of its jobs. In terms of implementation, a servlet is a component that extends or implements classes or interfaces from the javax.servlet or javax.servlet.http packages. A servlet allows you to create dynamic content, and is most commonly interacted with via HTTP. Some typical uses of a servlet include:
<ul><li>Processing data submitted by an HTML form (and optionally storing it)</li><li>Providing dynamic content (e.g. information stored in a DB)</li><li>Managing state information (e.g. sessions)</li></ul>Servlets have the following advantages over a typical CGI program: a servlet doesn't run in its own process, it stays in memory between requests, and a single instance handles all requests concurrently. Servlets are also typically packaged in a <span style="font-style: italic;">WAR</span> file, that is, a Web ARchive. This is the web analogue of a JAR file.
<span style="font-weight: bold;">Your Tomcat Installation</span>
Once you have Tomcat installed (I'll assume it's in /usr/local/java/tomcat), it should have at least the following directories.
<ul><li>bin - contains startup, shutdown and other scripts
</li><li>conf - server configuration
</li><li>lib - The jar files used by Tomcat
</li><li>logs - Application and server log files
</li><li>webapps - Location of servlets/web applications
</li><li>work - Automatically generated by Tomcat, these files are often intermediary (such as compiled JSP)</li></ul> Optionally, you can create the following directories.
<ul><li>classes - Classes you want available to a servlet. You may have to configure this.
</li><li>doc - Documentation for Tomcat, copied from webapps/docs
</li><li>src - Servlet API source files.
</li></ul> When you deploy a webapp as source or as a WAR file, it will end up under the webapps directory. When you go to your servlet on the web, files and directories will be created under the work directory. The above should be enough hierarchy information for you to figure out where things are or should be.
<span style="font-weight: bold;">Your Deployment</span>
Servlet containers conforming to the 2.2 or later Servlet specification are required to accept a WAR file in a specified format. A WAR file has a specific directory and file hierarchy that must be followed, so it often makes sense for your development environment to reflect this layout (more on that in the next section). Unpacked, the WAR layout is useful for development; packed, it is useful for deployment.
The top-level directory of your web app hierarchy is also the <i>document root</i> of your app. You should put your HTML/JSP/UI files there. When you deploy your application to a server, your application is assigned a <span style="font-style: italic;">context path</span>. If your context path is <i>/catalog</i>, then a request URI to /catalog/index.html will fetch the index.html file from your document root.
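To make that mapping concrete, here is a minimal sketch of how a request URI splits into a context path and a path within the document root. The container does this for you; the class and method names below are made up purely for illustration.

```java
// Illustrative only: the servlet container performs this mapping internally.
class ContextPathDemo {
    // Returns the path relative to the document root, or null if the
    // request URI does not belong to the given context path.
    static String pathWithinApp(String contextPath, String requestUri) {
        if (requestUri.equals(contextPath)) {
            return "/";
        }
        if (requestUri.startsWith(contextPath + "/")) {
            return requestUri.substring(contextPath.length());
        }
        return null;
    }

    public static void main(String[] args) {
        // Belongs to the /catalog app, resolves to /index.html in its document root.
        System.out.println(pathWithinApp("/catalog", "/catalog/index.html"));
        // A different prefix, so it is not part of the /catalog app.
        System.out.println(pathWithinApp("/catalog", "/catalogue/index.html"));
    }
}
```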
From the document root, the directory and file hierarchy will look something like this:
<ul><li>*.html, *.jsp, images, etc - Files that must be visible to the client. You can break this up in a hierarchy if your application is large.
</li><li>WEB-INF/web.xml - The Web Application Deployment Descriptor. This XML file describes servlets, initialization parameters, container security, etc.
</li><li>WEB-INF/classes/ - Java class files that are required for your application that are not in JAR files. Added to your classpath.
</li><li>WEB-INF/lib/ - Contains the JAR files required for your application. Added to your classpath.
</li></ul> The WEB-INF/web.xml contains the Web Application Deployment Descriptor for your app. This is an XML document that defines everything about your app that the server needs to know (except the context path). The complete syntax and semantics for the descriptor are defined in Chapter 13 of the Servlet API specification, version 2.4. Also see doc/appdev/web.xml. The 2.5 API specification seems to not be available yet in PDF form.
A web application must be installed in a server container, even during the development phase. A web application can be installed in several ways; however, when you run "ant install" for this application, Ant will submit the built WAR file to the Tomcat installation and Tomcat will automatically unpack it for you.
<span style="font-weight: bold;">Your Source Code</span>
This section primarily focuses on the directory structure and build targets of your build.xml ant file. You want to separate your source code (tests and application) from your deployable application as much as possible. This makes both deployment and revision control easier. Below is the recommended hierarchy for the top level <span style="font-style: italic;">project source directory</span>.
<ul><li>build.xml - Your ant build file</li><li>build.properties - Ant build properties</li><li>build/ - temporary home for javadocs and built war files</li><li>dist/ - temporary home for build classes</li><li>docs/ - Documents generated by javadoc or install notes, etc</li><li>lib/ - Jar files needed for builds and distributions
</li><li>src/</li><ul><li>src/tests/ - Your unit tests, load tests, etc
</li><li>src/main/ - The primary java code for your servlets and application
</li></ul><li>web/ - User facing components (images, html, etc)
</li><ul><li>web/WEB-INF/ - The application
</li><ul><li>web/WEB-INF/web.xml - Your web application descriptor
</li><li>web/WEB-INF/classes/ - The compiled classes from src/main/. Built for you.</li><li>web/WEB-INF/lib/ - JAR files from lib/. Built for you.
</li></ul><li>web/images/ - Images for web facing components
</li><li>web/jsp/ - JavaServer Pages
</li></ul></ul> Again, this is simply the <span style="font-style: italic;">recommended</span> hierarchy, and the one that will be used for all the sample code. The top-level build.xml file will include a properties file called build.properties. The properties file will contain build-related properties such as the Tomcat manager username and password and the base of your Tomcat installation. The included build file has the following targets available:
<ul><li>all - Run clean target followed by compile target, to force a complete recompile.</li><li>clean - Delete any previous build and dist directory so that you can be sure the application can be built from scratch.</li><li>compile - Transforms source files (from src/main/ directory) into object files, generally unpacked in build/WEB-INF/classes.</li><li>dist - Creates binary distribution of your application in a directory structure ready to be archived. Runs compile and javadoc.</li><li>install - Tells Tomcat to dynamically install the web app and make it available for execution (deploy). Does not cause app to be remembered across restarts. If you just want Tomcat to recognize that you have updated classes (or web.xml) use the reload target instead.</li><li>javadoc - Creates Javadoc API documentation for the Java classes included in the application. Normally only done for dist.</li><li>list - List currently running web applications. Useful to check if app has been installed.</li><li>prepare - Create the build destination directory, copy static content to it. Normally executed indirectly.</li><li>reload - Signals Tomcat to shut down and reload the web application. Useful when web application context isn't reloadable and you have updated classes or properties or added new jars. In order to reload web.xml you must stop and then start the web application.</li><li>remove - Remove the web app from service (undeploy).</li><li>start - Start this web application.
</li><li>stop - Stop this web application.
</li><li>test - Run all unit tests.
</li><li>usage - Display a short form of the above, the default target.</li></ul> In general, you will run "ant install" to build and install your application.
<span style="font-weight: bold;">The Development Process</span>
The Servlet/JSP development cycle mantra is "edit, test, deploy". Say it with me now, "edit, test, deploy". After you have created your base directory structure and installed your application at least once (i.e. Tomcat recognizes it) you will employ this mantra frequently.
<span style="font-weight: bold;">How Does it Work?</span>
The servlet mappings in web/WEB-INF/web.xml act as our "routes". By default they load the controller servlet, which lives in the net.mobocracy.web.ControllerServlet class. When that servlet receives control and the request is an HTTP GET, it forwards to the JSP file located at /jsp/index.jsp (a forward is essentially handing off control). If the request is an HTTP POST, it does basic validation, instantiates the Arithmetic parser and evaluates the arithmetic expression. On success it sets ArithmeticSuccess for the JSP page, and on failure it sets ArithmeticError. The doPost method also forwards to the /jsp/index.jsp page, but only after setting one of those attributes.
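The GET/POST flow can be modelled in plain Java without the servlet API, which keeps the sketch runnable on its own. This is not the project's ControllerServlet: a real servlet would extend javax.servlet.http.HttpServlet and use request attributes and a RequestDispatcher. Here a Map stands in for the request attributes, the return value stands in for the forward, and the evaluate method is a stand-in that only accepts a single integer literal rather than a full expression. The names ControllerFlow, handle and evaluate are assumptions for illustration; only the attribute names and the JSP path come from the description above.

```java
import java.math.BigInteger;
import java.util.HashMap;
import java.util.Map;

// A plain-Java model of the controller flow, not the actual servlet.
class ControllerFlow {
    static final String VIEW = "/jsp/index.jsp";

    // Stand-in for the JavaCC parser: accepts a single integer literal only.
    static BigInteger evaluate(String expression) {
        return new BigInteger(expression.trim());
    }

    // Returns the view to forward to, filling in attributes for the JSP.
    static String handle(String method, String expression, Map<String, Object> attributes) {
        if ("POST".equals(method)) {
            if (expression == null || expression.trim().length() == 0) {
                // Basic validation failed.
                attributes.put("ArithmeticError", "No expression given");
            } else {
                try {
                    attributes.put("ArithmeticSuccess", evaluate(expression));
                } catch (NumberFormatException e) {
                    attributes.put("ArithmeticError", "Could not parse: " + expression);
                }
            }
        }
        // A GET sets nothing; both paths end at the same JSP.
        return VIEW;
    }

    public static void main(String[] args) {
        Map<String, Object> attributes = new HashMap<String, Object>();
        String view = handle("POST", "42", attributes);
        System.out.println(view + " " + attributes);
    }
}
```

The point of the shape is that the JSP never evaluates anything itself: it only looks at which attribute, if any, was set before control was handed to it.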
Once the JSP page is handed control, it sees what attributes (if any) have been set by the Servlet and acts appropriately. The JSP page only has a few lines of code so it probably isn't worth discussing too much.
The examples are meant to be explored in the source code itself, both by reading the comments and by interacting with the running application. The code comes in at only 338 lines, not including the build file and README, so I will not spend a lot of time explaining the source here, but if you have questions please leave them in a comment or send me an email. You should find the source fairly well documented.
<span style="font-weight: bold;">Conclusion</span>
We will build on this project to go from a JSP/Servlet architecture to a REALM project, where we can utilize the wide variety of toolkits available to J2EE and the flexible front-end development environment of Ruby on Rails. After reading this you should have an understanding of one model of Servlet/JSP development using Tomcat. Below is a list of resources.
<span style="font-weight: bold;">Resources</span>
Tomcat
http://tomcat.apache.org/tomcat-6.0-doc/index.html
http://en.wikipedia.org/wiki/Apache_Tomcat
http://www.coreservlets.com/Apache-Tomcat-Tutorial/
Java 1.5
http://java.sun.com/j2se/1.5.0/docs/api/index.html
JavaCC
https://javacc.dev.java.net/doc/docindex.html
http://www.javaworld.com/javaworld/jw-12-2000/jw-1229-cooltools.html
http://www.idevelopment.info/data/Programming/java/JavaCC/The_JavaCC_FAQ.htm
Ant
http://ant.apache.org/manual/index.html
Servlets
http://www.servlets.com/
http://java.sun.com/products/servlet/
http://www.apl.jhu.edu/~hall/java/Servlet-Tutorial/
JSP
http://java.sun.com/products/jsp/
http://www.jsptut.com/
Source Code
http://mobocracy.net/code/calculator
Stop Giving out Your Passwords
Posted 2007-06-13 by Blake Matheny
Over the past 2-3 years, as this web 2.0 thing has come to be official jargon, another term that has become popular is "SOA". For those of you who are new to that term, SOA stands for Service Oriented Architecture and is more commonly referred to simply as a "web service". This effectively means that web sites expose functionality via a service using a protocol like REST, XMLRPC or the dreaded SOAP. The typical example is that of a calculator. You have a calculator web service that people hook their applications into, and it provides for all their calculating needs without the developers doing the integration having to know anything about subjects such as addition and subtraction.
Some web sites have taken SOA to mean, "Anything you expose via a web page that can be scraped I can use". This means that more and more frequently, users are being asked to provide their username and password for sites such as GMail, Yahoo Mail and MySpace. Once you have given this third-party site your credentials, it logs in and scrapes information like contacts and friends. While this isn't a new practice, it only seems to have become widely accepted over the past few years.
If you are a user and are asked for your credentials, should you provide them? I would say as a general rule, no. However, in the real world it all depends on a variety of factors, such as what kind of data you are exposing, how much you trust the third party and the level of utility being provided by the service. My assumption is that most users implicitly trust many of these third parties and simply assume that they would not be asked for this information unless it was needed. The additional use of GMail/MySpace/etc corporate logos makes the request seem even more legit.
As a third-party site, what are your ethical and legal responsibilities to your users? I would argue that if a service such as GMail provides an authentication mechanism (they do) which doesn't require you to actually process the login or store any user data, you have a responsibility to use it, even if it doesn't mesh with corporate branding. Additionally, you should give users the choice of whether or not their credentials are stored, rather than assuming it automatically. Use of a logo without permission I believe to be another no-no, as it implies endorsement.
As a popular web site, realize that third parties will want to integrate with you. For the sake of your users, provide at a minimum a token-based authentication system that third parties can use. You could also just get on the wagon and embrace web services like everyone else (I'm looking at you, MySpace).
In short, stop giving out your passwords, and providers: STOP ASKING FOR THEM. The security community has enough problems without companies instilling in end users the idea that giving out their password is okay.
Mirror your Bookmarks
Posted 2007-06-10 by Blake Matheny
I've been building and migrating my bookmarks.html file for nearly 10 years. It started out as a Netscape bookmark file, then became a Mozilla bookmark file, then it was transferred to Phoenix and finally to Firefox where it has happily stayed for several years now. I recently wrote some Perl to automatically check my bookmarks, and found that a large number (like, 10%) of the links were invalid (404) or off the net (no server response).
It occurs to me that when you bookmark a page, you are often less interested in the URL than in being able to find that content again. Many of the resources I bookmark these days are publications, how-tos, FAQs or other informational pieces of content. It is much rarer that I bookmark a site because it has so many resources that I just want to be able to get back to it.
It seems like a useful Firefox extension would be one where when you bookmark a site, you are asked if you want to mirror it as well (just fetch that page and images/etc, no real mirroring). Then when you try to go to a site from your bookmarks, if the site can't be reached or is a 404, the bookmark pulls up the local copy for you.
Is anyone aware of such an extension? If one doesn't exist I'll write one, but I'd rather download it. The closest thing I have found so far is the <a href="https://addons.mozilla.org/en-US/firefox/addon/2570">Resurrect Pages</a> extension, but that seems to be more useful for recent or highly trafficked sites as opposed to obscure ones. Although it does use the Internet Archive, IA is hardly reliable. On a side note, does anyone know what happened to them? It almost seems that they've stopped archiving most sites.
Bloom Filters for Everyone
Posted 2007-06-10 by Blake Matheny
A <a href="http://en.wikipedia.org/wiki/Bloom_filter">Bloom filter</a>, according to Wikipedia, "is a space-efficient <a href="http://en.wikipedia.org/wiki/Probabilistic" title="Probabilistic">probabilistic</a> <a href="http://en.wikipedia.org/wiki/Data_structure" title="Data structure">data structure</a> that is used to test whether an <a href="http://en.wikipedia.org/wiki/Element_%28mathematics%29" title="Element (mathematics)">element</a> is a member of a <a href="http://en.wikipedia.org/wiki/Set_%28computer_science%29" title="Set (computer science)">set</a>." Named for the computer scientist Burton Bloom, the Bloom filter has the following informal attributes:
<ul><li>The filter can yield false positives (told element is in the set but it isn't) but not false negatives (told element is not in set but it is)
</li><li>Typically implemented using a bit vector</li><li>The more elements added to the filter, the higher the chance for false positives (assuming no resize)
</li><li>Is space compact with respect to the set it represents (can represent all possibilities with little space)
</li></ul>So, what might you want to use a Bloom filter for? There are plenty of examples in search problems which illustrate their usefulness, so we'll look at two of them. First (taken from Wikipedia), consider a spell checker. You can imagine a language with a large dictionary where it is expensive to do a spell check (the dictionary is too large to hold in memory). In this case, you map the words in the dictionary to a large bit vector using a Bloom filter. When you spell check the document and your filter indicates that a word is <span style="font-style: italic;">correct</span>, you can check the original source (the dictionary file) to ensure you haven't received a false positive.
The above usage isn't ideal since realistically in most cases (except mine), the majority of words will be spelled <span style="font-style: italic;">correctly</span>. What you really want for a use case with a Bloom filter is one where the majority of content is not found. A better example comes from PlanetPeer, a P2P network. When a peer searches for content, it looks for peers that have Bloom filters with the correct bits set to indicate that the content is available. The peer then checks all found peers to see if the content actually exists on those nodes or not. This is the ideal case for Bloom filter usage, where it reduces the search space and reduces the number of expensive operations required (in this case, network connections).
The mathematics for a Bloom filter are fairly trivial, so I leave the majority of it to the Wikipedia reference. I also don't know how to embed LaTeX into Blogger yet. Three important choices for the Bloom filter are the size of <span style="font-style: italic;">m</span> (total number of bits in the vector), <span style="font-style: italic;">k</span> (there are <span style="font-style: italic;">k</span> hash functions, each of which sets one bit in the vector per element) and the hash function itself (a good one decreases false positives). These values will often be functions of the physical limitations of the hardware (disk space, etc) and the desire to reduce false positives.
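To show how m, k and the hash function fit together, here is a minimal Bloom filter sketch built on java.util.BitSet. It is an illustration under stated assumptions rather than a reference implementation: the class name and method names are made up, and instead of k independent hash functions it derives the k indexes from two base hashes (String.hashCode and an FNV-1a-style hash), a common shortcut known as double hashing.

```java
import java.util.BitSet;

// A minimal Bloom filter sketch; names and hashing scheme are illustrative.
class BloomFilter {
    private final BitSet bits;
    private final int m; // number of bits in the vector
    private final int k; // number of hash functions

    BloomFilter(int m, int k) {
        this.m = m;
        this.k = k;
        this.bits = new BitSet(m);
    }

    // FNV-1a-style hash over the string's chars, used as the second base hash.
    private static int fnv1a(String s) {
        int h = 0x811c9dc5;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x01000193;
        }
        return h;
    }

    // Derives the i-th bit index from the two base hashes (double hashing).
    private int index(String item, int i) {
        int combined = item.hashCode() + i * fnv1a(item);
        return (combined & 0x7fffffff) % m; // clear the sign bit, then wrap to [0, m)
    }

    void add(String item) {
        for (int i = 0; i < k; i++) {
            bits.set(index(item, i));
        }
    }

    // false means definitely absent; true only means "probably present".
    boolean mightContain(String item) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(item, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BloomFilter filter = new BloomFilter(1024, 3);
        filter.add("apple");
        filter.add("banana");
        System.out.println(filter.mightContain("apple"));
        System.out.println(filter.mightContain("cherry")); // usually false, but a false positive is possible
    }
}
```

Note that there is no remove operation: clearing a bit could also erase evidence of other elements that happen to share it, which would introduce false negatives.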
One should note that the probability of a false positive is approximately (1 - e^(-kn/m))^k, where k is the number of hash functions used (and therefore bits set per element), m is the size of the vector and n is the number of elements inserted into the vector. When k is chosen optimally, k = (m/n) ln 2, this becomes (1/2)^k, or approximately 0.6185^(m/n). This makes <span style="font-style: italic;">k</span> with respect to <span style="font-style: italic;">m</span> and the potential size of <span style="font-style: italic;">n</span> the crucial elements to reducing false positives.
There are a variety of references to implementations in the Wikipedia article.
The REALM Stack
Posted 2007-06-10 by Blake Matheny
Recently I started building with what I call the REALM stack. That's Ruby on Rails, EJB, Apache, Linux and MySQL. Although to be fair it's never EJB, it's usually Spring or something similar. This has changed my attitude completely from what it was while using the traditional LAMP stack. There is a more appropriate separation of responsibilities for each tier. A lot more design and forethought goes into the middleware tier because it can be expensive to make changes there (rebuild, redeploy, restart, etc). The front end developers no longer wait on backend functionality to be exposed, which means they can work with stakeholders earlier in the process to provide a functional prototype. The toolkits are fantastic. And, it's fun again. New challenges, new things to learn, new problems to solve and new communities to interact with.
In the middleware I have been using Hessian as the service protocol between Rails and Spring, and it's fast, although I don't yet have any benchmarks. I'm using Maven2 or Ant plus a host of applications for continuous integration, bug tracking, deployment, etc. I've been using <a href="http://appfuse.org/">AppFuse</a> too, which does just what it says: it simplifies web development with Java. Granted, I'm not using any of the web stack, but regardless <a href="http://raibledesigns.com/rd/">Matt Raible</a> has done a great job with the project. Hibernate is what bears do, but despite the horror that you can be faced with, it is a fairly well engineered tool with a lot of nice features. I like the inversion of control/dependency injection model, but it would be nice if there were a bit more convention, as there is with Rails.
On the front end (RoR) it's easy to hook into web services provided by Spring remoting and you can still benefit from the slick scriptaculous integration, the rapid prototyping (scaffolding, etc), and the instant gratification of reloading the page to see a change (no need to redeploy a WAR file). You can even generate the models/controllers/views from the XML provided by your servlet definition, so the UI folks can get started as soon as you've enabled remoting. And in general, UI folks seem to like RHTML/erb a lot more than JSP/Velocity/etc.
When I was getting started with the stack, a lot of the examples were complex and reading-intensive for grasping relatively simple concepts. Over the next week I'm going to do a five-part series on getting your REALM stack up and going, using as few lines of code and as few lines of text as possible while still providing all the appropriate references. The REALM series will look something like:
<ul><li>Tomcat & Servlets</li><li>Spring</li><li>Spring Remoting</li><li>RoR</li><li>RoR Web Services</li></ul>The end goal is that you have a general understanding of all the technologies involved, the underlying design principles, how the architectural components interact and where to go for help and documentation. Although the services protocol is Hessian, after going through this series you should be able to swap it out for REST, XMLRPC or SOAP (or any other services protocol). The first installment will be on Monday, so if you have any suggestions between now and then please let me know. The platform used will obviously be Linux, although you should be able to adapt the series to another platform.
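For anyone who wants to plug in numbers, the usual approximation for the false positive rate is (1 - e^(-kn/m))^k, and the k that minimizes it is (m/n) ln 2, at which point the rate collapses to (1/2)^k. The helper below is a small sketch of those two formulas; the class and method names are made up, and k is taken as a double only so the optimal (non-integer) value can be fed straight back in. In practice you would round k to a whole number of hash functions.

```java
// Hypothetical helper for estimating Bloom filter parameters.
class BloomMath {
    // Approximate false positive rate after n insertions into m bits with k hashes.
    static double falsePositiveRate(int m, int n, double k) {
        return Math.pow(1.0 - Math.exp(-k * n / m), k);
    }

    // The k that minimizes the rate above, before rounding: (m / n) * ln 2.
    static double optimalK(int m, int n) {
        return ((double) m / n) * Math.log(2.0);
    }

    public static void main(String[] args) {
        int m = 1000; // bits in the vector
        int n = 100;  // elements inserted
        double k = optimalK(m, n);
        System.out.println("optimal k  = " + k);
        System.out.println("rate at k  = " + falsePositiveRate(m, n, k));
        System.out.println("rate, k=7  = " + falsePositiveRate(m, n, 7));
        System.out.println("0.6185^10  = " + Math.pow(0.6185, (double) m / n));
    }
}
```

With ten bits per element the optimum is a little under seven hash functions, and the false positive rate lands just under one percent.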