Your Metadata Sucks

Signs your product is doomed #42314

2013-08-18T20:02:00.001-04:00

It enables toasters or refrigerators to communicate with anything.

Stop. They shouldn't do that.

Toasters toast. Refrigerators refrigerate.

New Internet Draft: Semantic Content Packages

2012-04-13T10:33:00.002-04:00

"Blobs, Triples, and a URI. Bring Your Own Vocabulary."

Sound interesting? Then have a look at the draft at http://www.ietf.org/id/draft-wilper-semantic-content-pkgs-00.txt and decide for yourself: is this the next logical step in semi-structured data management, or is Chris just trying to stir up a hornet's nest?

scpproject.org

Seriously though, I've been thinking in this area for a while and I know others have, too. Particularly in the preservation & archiving community with things like BagIt and ORE and various combinations. I think something a bit more generic that has RDF at its core and, critically, acknowledges that copies of the same content can be made available from multiple locations...could have quite a lot of potential.

Anyway, I thought it would be interesting to get down to business and actually specify something for people to bang on. So if this is an area you've got an interest in, and you've got a few minutes, I'd appreciate your giving it a read.

Public comments or email to me are fine for now. I can also set up a group if there's sufficient interest.

A simple file-level dedupe utility in Python

2010-11-09T01:32:00.004-05:00

At home, I've been working on organizing my photo library and found FastDup to be a great little utility. You point it at a directory and it finds duplicate files with surprising speed. It works well because it's smart about not doing more work than it needs to. A naive dedupe utility (which, ahem, I may have written in Java a couple years ago to do similar work with my audio library) works like this:

Compute the checksum of all files
List files with matching checksums

A smarter approach is to:

Group all files by size
Do a partial comparison of all files of a given size, quickly excluding obvious non-matches
Complete the comparison for files that look equivalent so far, listing matches
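
The three steps above can be sketched in Python roughly as follows. This is only an illustration, not the code of FastDup or qdupe: the function names, the 4 KB probe size, and the use of a SHA-256 digest as a stand-in for a full byte-by-byte comparison are all assumptions made for the example.

```python
import hashlib
import os
from collections import defaultdict


def _sha256(path, chunk_size=1 << 20):
    """Digest a whole file, reading it in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root, probe_bytes=4096):
    """Return lists of paths under root whose contents look identical."""
    # Step 1: group files by size; a file with a unique size has no duplicate.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        # Step 2: partial comparison, using only the first few KB of each file.
        by_head = defaultdict(list)
        for path in paths:
            with open(path, "rb") as f:
                by_head[f.read(probe_bytes)].append(path)
        # Step 3: full comparison, only for files that still look alike.
        for candidates in by_head.values():
            if len(candidates) < 2:
                continue
            by_digest = defaultdict(list)
            for path in candidates:
                by_digest[_sha256(path)].append(path)
            duplicates.extend(sorted(g) for g in by_digest.values() if len(g) > 1)
    return duplicates
```

Most files drop out at step 1 without ever being opened, which is where the speed comes from.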

FastDup, which is written in C++, takes this approach. I compiled and ran it fine on my file server (an Ubuntu machine) and tried to compile it on my Mac, too...no luck. The author states in the README that it works in Linux and nowhere else, and the last release was a couple years ago, so it seemed I was out of luck.

Well, not really. I've been wanting to get re-acquainted with Python for a while now (for various reasons), and I figured this was a good excuse. How hard could it be? As it turns out, not very.

qdupe - A command-line utility to quickly find duplicate files, written in Python and inspired by FastDup.

So how does it compare to FastDup?

Out of curiosity, I ran both over my DVD library, which is currently at about half a terabyte. I ran each twice, back to back, in order to see the effects of the OS's buffer cache. They both found 911 dupes, adding up to about 500MB. The first time I ran them, they each took about a minute. The second time, FastDup took 3.0 seconds and qdupe took 3.6 seconds.

Dot Plan from 1995

2009-12-03T01:55:00.002-05:00

Before the inter-twitter-facebook-blogweb, or whatever you kids call it, there was Finger. Finger was cool because only geeks knew about it. You'd post your status to your .plan file and people anywhere in the world could type "finger some-obscure-userid@some-obscure-host.edu" to see it.

It was like blogging, but with an even slimmer chance of having an audience. Great stuff.

Anyway, I was rooting around my old account at csh tonight and found this in my .plan:

class CS2

creation
    brain_washing

feature -- Global variables

    student: STUDENT
    clean: INTEGER is unique
    warped: INTEGER is unique

feature -- Main program

    brain_washing is
    do
        from
            !!student.make
            student.mind := clean
        until
            student.mind = warped or world.end_of
        loop
            student.io.putstring( "EIFFEL is Good%N" )
            student.io.putstring( "Don't worry that your executables " )
            student.io.putstring( "are usually over 20,000 times larger " )
            student.io.putstring( "than the source code.%N" )
            if student.resists then
                student.attend_lecture
                student.attend_lecture
                student.attend_lecture
                student.attend_lab
            end
        end -- loop
    end -- brain_washing

end -- CS2

Clearly, this is an important digital artifact to preserve.

By posting it here, I feel I have played an important role in format migration for future generations. Thank you.

An extra cent?

2009-09-04T14:39:00.004-04:00

It often happens that my flight price goes up while I'm in the process of booking. I thought it was pretty shady the first few times it happened. Now I just accept it and move on. But I thought this one was a little bizarre today:

I can't help but wonder if Peter Gibbons is behind this in some way.

Discovery of content metadata on the web

2009-08-31T02:12:00.012-04:00

A thought experiment...

I recently read an entertaining old article on various things people have been shoving into HTTP response headers. Some for utility (X-XRDS-Location), and some for fun (Slashdot's random X-Fry and X-Bender quotes). One site actually put a bunch of DC.title, DC.etc headers in their responses. Not that anyone's looking for them there, but *just* in case...

This got me thinking (again) about ways to provide richer metadata, especially RDF, about resources on the web. We have RDFa now, which is a big step forward, but there are a couple key problems we still don't have worked out:

ISSUE 1: How do we discover publisher-sanctioned resource descriptions for arbitrary resources on the web? (e.g., non-XHTML)

I think the HTTP Link response header is the right way forward on this: an isDescribedBy link pointing to a resource whose representation encodes an RDF graph describing this resource.
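
To make that concrete, here is a rough sketch of how a client might pull the description URI out of a Link header value. The relation name isDescribedBy is the one proposed above; the function name and the example URIs in the usage note are made up for illustration, and the parsing only handles simple header values (no commas or semicolons inside quoted parameters).

```python
import re


def described_by(link_header, rel="isDescribedBy"):
    """Return the target URIs of links in a Link header value with the given rel."""
    targets = []
    # Each link looks like: <target-uri>; param="value"; param=value
    for match in re.finditer(r'<([^>]*)>((?:\s*;\s*[^;,]+)*)', link_header):
        uri, params = match.group(1), match.group(2)
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            # A rel parameter may hold several space-separated relation types.
            if name.lower() == "rel" and rel in value.strip('"').split():
                targets.append(uri)
    return targets
```

Given a header value such as `<http://example.org/Picture1.rdf>; rel="isDescribedBy"`, the function returns a list containing just that URI.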

ISSUE 2: Given that a resource and the content of a representation of that resource are distinct things, how do we make statements about the latter on the web?

This one deserves more explanation.

If I access http://example.org/Picture1, and my browser uses content negotiation to request the image/jpeg representation, and gets it, I want to be able to discover this kind of info:

@prefix    : <http://dear.lazyweb/please/write/this/ontology/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix  dc: <http://purl.org/dc/elements/1.1/> .

# The file is a JPEG and here's some basic info about it

_:myFile  a            :OctetStream;
        :name        "Picture1.jpg";
        :mediaType   "image/jpeg";
        :format      <info:pronom/fmt/42>;
        :length      105124;
        :md5sum      "7846df5ced300e9543a267a856c4ab6e";
        :sha1sum     "e3b5112b24e793f41fc5b843a505a83a80aaf776";
        :created     "2009-08-31T10:12.342Z"^^xsd:dateTime;
        :modified    "2009-08-31T16:28.921Z"^^xsd:dateTime;
        :renditionOf <http://example.org/Picture1>.

# The file is one of any number of renditions of a picture

<http://example.org/Picture1>
        dc:title       "Best Picture Ever";
        dc:description "This is a picture of my cat, Lucky";
        dc:creator     "Bob Dobbs".

What would be cool is if my browser knew about the HTTP Link response header, and the metadata was just a click away, in an RDFa document.

The trick would be for user-agents to be able to associate the particular rendition I got by GETting the resource with the appropriate resource in this graph. Notice it's a bNode in the example above. It might have a URI, it might not; but the URI of the rendition isn't known by the user-agent when it retrieves this graph, and the relation expressed by the HTTP Link header is to be interpreted as "(the resource identified by this URI) isDescribedBy (the graph resource over there)".

So, absent some additional information, in the general case, the user-agent is going to have to do the association via some distinctive property matching: Did the response of the original GET request on the picture include a Content-MD5 header? If so, that's a good clue. Hmmm.
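
Here's a minimal sketch of that kind of property matching. It assumes the user-agent has already parsed the graph into plain dictionaries, one per described file; that data shape and the function names are hypothetical. The one thing to watch is that Content-MD5 carries the base64 encoding of the raw digest, while the md5sum in the example graph is hex, so one has to be converted before comparing.

```python
import base64


def content_md5_to_hex(header_value):
    """Convert a Content-MD5 header value (base64 of the raw digest) to hex."""
    return base64.b64decode(header_value).hex()


def match_rendition(response_headers, descriptions):
    """Return the descriptions whose md5sum matches the response's Content-MD5."""
    header = response_headers.get("Content-MD5")
    if header is None:
        return []  # no clue available; the caller needs some other property
    wanted = content_md5_to_hex(header)
    return [d for d in descriptions if d.get("md5sum", "").lower() == wanted]
```

If exactly one description comes back, the user-agent has found the node for the rendition it just fetched. Zero or several matches means it has to fall back on other properties, such as length or media type.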

That's Classy

2009-05-04T21:43:00.002-04:00

Here's a simple program to report on Java .class versions. I'm sure some variant of this has been written a thousand times, but Google wouldn't give me what I wanted right away, so here it is again :)

The program takes one argument: a path to a .class file, .jar file, or directory containing a mixture of both, and produces a report of each class file's major .class format version (50 for Java 6, 49 for Java 5, and so on). Handy if you want to track down those newfangled classes and avoid the dreaded java.lang.UnsupportedClassVersionError.


import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.jar.JarEntry;
import java.util.jar.JarInputStream;

public abstract class ThatsClassy {

 static void classyFile(File file) throws Exception {
   if (file.isDirectory())
     for (File child: file.listFiles())
       classyFile(child);
   else if (file.getName().endsWith(".jar"))
     classyJar(file);
   else if (file.getName().endsWith(".class"))
     classyClass(file.getPath(), new FileInputStream(file), true);
 }

 static void classyJar(File jarFile) throws Exception {
   JarInputStream jarStream = new JarInputStream(new FileInputStream(jarFile));
   JarEntry entry = jarStream.getNextJarEntry();
   while (entry != null) {
     if (entry.getName().endsWith(".class"))
       classyClass(jarFile.getName() + "#" + entry.getName(), jarStream, false);
     entry = jarStream.getNextJarEntry();
   }
   jarStream.close();
 }

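 static void classyClass(String id, InputStream in, boolean close) throws Exception {
   // A class file starts with a 4-byte magic number and a 2-byte minor
   // version, followed by the 2-byte big-endian major version.
   // Read byte by byte: skip() may skip fewer bytes than requested.
   for (int i = 0; i < 6; i++) in.read();
   int majorClassVersion = (in.read() << 8) | in.read();
   if (close) in.close();
   System.out.println(id + " " + majorClassVersion);
 }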

 public static void main(String[] args) throws Exception {
   classyFile(new File(args[0]));
 }
}
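
For anyone who wants to see the header layout spelled out, here is a small Python sketch of the same check on the first eight bytes of a class file. It's an aside rather than a port of the program above; the names are made up, and the lookup table only lists the two versions mentioned in this post.

```python
import struct

# Only the two major versions mentioned in the post; extend as needed.
JAVA_RELEASE = {49: "Java 5", 50: "Java 6"}


def class_version(header):
    """Parse the first 8 bytes of a .class file into (minor, major)."""
    if len(header) < 8:
        raise ValueError("need at least 8 bytes")
    # Big-endian: 4-byte magic, 2-byte minor version, 2-byte major version.
    magic, minor, major = struct.unpack(">IHH", header[:8])
    if magic != 0xCAFEBABE:
        raise ValueError("not a Java class file")
    return minor, major
```

Feeding it the bytes CA FE BA BE 00 00 00 32 gives a minor version of 0 and a major version of 50.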

dev8D Tweet Cloud

2009-02-16T10:46:00.004-05:00

Here's my abbreviated trip report: dev8D was a big success -- any conference that gets developers together and avoids long monologues is a winner in my book. As a side note, I think twitter works pretty well as a backchannel. I noticed at least one person created a separate account to avoid spamming their regular followers. Not a bad idea.

Multi-project Subversion Commit Notification

2008-10-04T12:30:00.004-04:00

I recently had to set up commit notification for a repository hosting multiple projects and thought I'd write up my experience here.

There are several ways to set up commit notification in Subversion. Each involves the use of the post-commit hook. Here's how it works: After the Subversion repository successfully commits a change, if it finds an executable file, /path/to/svn/hooks/post-commit, it will be invoked with two arguments. The first is the path to the repository, and the second is the revision number of the commit.

The content of post-commit can be whatever you want. In practice, most people make it a shell script that just invokes a utility like svnnotify to get things done.

Since the repository I was working on hosts multiple projects (à la Apache), each top-level project has its own codewatch mailing list. I don't want to spam each project with every change to unrelated projects in the repository. So, based on arguments passed to post-commit, I had to start by determining which project the change was relevant to. I used the svnlook utility for this, like so:

# Get the first top-level directory changed by the commit
# Note: svnlook's dirs-changed output is multi-line, and
#       each line looks like "projname/trunk/etc"
PROJ=$(/usr/bin/svnlook dirs-changed -r "$2" "$1" | head -n 1 | sed -e 's/\/.*//')
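
If your hook is written in Python rather than shell, the same extraction might look something like this sketch. The function names are made up for illustration; the svnlook invocation is the same one used in the one-liner above.

```python
import subprocess


def first_project(dirs_changed_output):
    """Return the top-level directory named on the first line of the output."""
    lines = dirs_changed_output.splitlines()
    if not lines:
        return None
    # "projname/trunk/etc" -> "projname"
    return lines[0].split("/", 1)[0]


def changed_project(repo_path, revision, svnlook="/usr/bin/svnlook"):
    """Run svnlook dirs-changed for one revision and pick out the project."""
    output = subprocess.run(
        [svnlook, "dirs-changed", "-r", str(revision), repo_path],
        check=True, capture_output=True, text=True,
    ).stdout
    return first_project(output)
```

Like the shell version, this attributes a commit that touches several projects to the first one listed only.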

Once I had that information, the rest was straightforward. Here's the whole script.

Fedora Commons Repository - Lines of Code

2008-07-12T08:59:00.005-04:00

We're wrapping up our last branches before the 3.0 final code freeze. I got curious last night about how the maintenance branch (2.2.x line) and the trunk (3.0 line) compared in terms of lines of code.

So I decided to pull up the archives of past releases and do a per-release comparison of everything under src/java/fedora. Here's what LocMetrics and Gnuplot told me:

It's hard to draw any definitive conclusions about the SLOC metric, but it's safe to say it's directly related to maintenance cost. And it's interesting to see how certain features / architectural changes affect it.

Installing Fedora in Two Minutes

2008-06-07T13:57:00.004-04:00

Want to get a Fedora repository up and running as quickly as possible?

This screencast uses the installer's "quick" option to skip all the hard questions.

The "quick" option is useful if you've never installed Fedora before and just want to get acquainted. For more serious use, you'll want the "custom" option. And the installation guide :)

Current CMA Documentation Available

2007-12-21T07:46:00.000-05:00

Coinciding with the availability of Fedora 3.0 Beta 1 this week, the first round of semi-official CMA (formerly called CMDA) documentation is now available: The Fedora Content Model Architecture. As Dan points out, we'll be doing some name changes before it's all said and done, but so far this is the most up-to-date diagram of the supporting object-object relationships:

As implemented, the BDef and BMech objects are basically unchanged. Here's what the new CModel control object looks like:

The DS-COMPOSITE-MODEL datastream specifies the structural requirements of member objects. The dsCompositeModel.xsd schema describes the expected format. For example, here's the DS-COMPOSITE-MODEL of info:fedora/fedora-system:ContentModel:


<dsCompositeModel xmlns="...">
  <dsTypeModel ID="RELS-EXT">
    <form MIME="text/xml"/>
  </dsTypeModel>
  <dsTypeModel ID="DC">
    <form MIME="text/xml"/>
  </dsTypeModel>
  <dsTypeModel ID="DS-COMPOSITE-MODEL">
    <form MIME="text/xml"/>
  </dsTypeModel>
</dsCompositeModel>

Pretty simple. It says that member objects must have at least these datastreams, each in one of the forms specified. If multiple forms are listed in a single dsTypeModel, the datastream may be in any of those forms.
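
To illustrate the rule, here is a small sketch of the check it implies. This is not Fedora code: the function name and the dictionary shapes are assumptions for the example, with the composite model reduced to a mapping from datastream ID to the set of allowed MIME types.

```python
# The example model above, reduced to {datastream ID: allowed MIME types}.
CONTENT_MODEL = {
    "RELS-EXT": {"text/xml"},
    "DC": {"text/xml"},
    "DS-COMPOSITE-MODEL": {"text/xml"},
}


def conformance_problems(composite_model, datastreams):
    """Return a list of problems; an empty list means the object conforms.

    datastreams maps each datastream ID in the object to its MIME type.
    """
    problems = []
    for ds_id, allowed_forms in composite_model.items():
        if ds_id not in datastreams:
            problems.append(f"missing datastream {ds_id}")
        elif datastreams[ds_id] not in allowed_forms:
            problems.append(f"{ds_id} is {datastreams[ds_id]}, not one of {sorted(allowed_forms)}")
    # Datastreams the model doesn't mention are fine: the model is a minimum.
    return problems
```

An object carrying all three datastreams as text/xml, plus any number of extra ones, yields an empty list.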

Fedora 3.0 - Where's the Binding Map?

2007-12-16T16:04:00.000-05:00

Okay, I'm excited.

After several months of effort, Fedora Commons 3.0 Beta 1 should go live sometime this week. For most Fedora users, this Beta will be their first real exposure to the Content Model Dissemination Architecture, or CMDA. (This name is subject to change before 3.0-final.)

Among other things, the CMDA allows people to attach runtime behaviors to digital objects at a class level. This architectural change has been a long time coming for Fedora, and we've worked hard to get the design right. Dan is working on the official design doc for publication with the software, but here's a simple overview of how it works:

The Fedora-defined CMDA relationships are expressed in RDF in the RELS-EXT datastream of each referring object. As long as all the necessary relationships exist, Fedora will use them to provide the desired behaviors for each data object. By design, the Resource Index does not need to be enabled for this to work.

One question that will inevitably arise for those familiar with Fedora's traditional disseminators is, "Where's the Binding Map?" The short answer is that it no longer exists. For the long answer, continue reading.

Background
To support extensible "views" or "behaviors" on digital objects, prior versions of Fedora required each object to include a special piece of metadata called a disseminator. The disseminator included a reference to a "Behavior Definition" (an object that defines the behaviors), a "Behavior Mechanism" (an object that grounds the behaviors to a specific implementation), and lastly, a "Datastream Binding Map". The binding map's purpose was to map the datastream IDs in the object to specific input requirements of the BMech.

CMDA Implementation of Behaviors
With the CMDA, behavior subscription is now done at the content model level. Among other useful properties, this design allows people to significantly change behaviors for whole classes of objects without making changes to (or visiting) every single one.

Since the content model object would now appear to occupy the role of the old per-object disseminator, if a datastream-to-BMech-input mapping existed, it would go in the content model, right?

Actually, I don't think so. In general, a content model is intended to be a sharable object that survives through time. It a) describes a class of objects by their structure, and b) indicates which operations/behaviors they should have within a repository. In order for it to be as sharable and survivable as possible, the content model must not dictate *how* the operations are to be executed. That's the job of the BMech.

Part of the "how" is deciding which (if any) of the datastreams defined by the content model actually need to be given as input to the code that executes the behavior. At a high level, BMechs are bound to content models, and not vice-versa. The direction of the relation is important. It's the BMech's job to pick apart the content model it works with and decide how it's going to fulfill the contract with the given pattern of data.

Therefore the mapping, if necessary, is really a BMech implementation detail. But if a BMech only isContractor for one content model, then there's really no point to having the extra indirection...just make the part names in the BMech match the datastream IDs and be done with it. That's the simplest approach, and the one that I think will get people "up and running" with the CMDA the quickest.

But, you ask, what if you want to use the same BMech for content models that differ only in their datastream IDs? First, if possible, consider merging those content models. It'll make life easier for you in the long run. If that's impractical or doesn't make sense for your use case, then just create a BMech for each -- one that only differs in the part names used.

For Fedora 3.0b1, what this means in a practical sense is that people who have lots of variance in their datastream IDs will either need to "bring them in line" (which is a very practical thing to do in its own right, for ease of management), or will need to define different content models for them, which use different BMechs, even if they formerly used the same BMechs.

The migration tools (which I'm writing the docs for now) will do the latter automatically, creating Content Models and BMech copies with appropriate IDs. If people want a "cleaner" upgrade, they need to invest some sweat in getting their datastream IDs consistent prior to running the analysis (the first of three phases of migration) so they don't end up with an unmanageable set of BMech copies.

3.0-final and Beyond
Two things absent from the Beta 1 release, which should be present in 3.0-final, are 1) the ability to assert object-object relationship constraints as part of the formal definition of a content model, and 2) a basic validator that can take a content model and an object that claims to adhere to it, and tell whether it actually complies or not.

For 3.0b1, we've kept the "Fedora Object Type" idea around. Viewed through this old lens, there are only four basic kinds of Fedora digital objects. We know that there is some overlap with the "typing" introduced by the CMDA. As the CMDA takes hold, I think the idea of "Fedora Object Type" can be gracefully subsumed by content model.

In future releases, the BMech will also evolve to something more flexible. We know people have got a lot of mileage out of simple web service HTTP GET bindings, but other methods, protocols, and even in-VM code bindings are definitely called for. With the CMDA, we are now in a much better position to do these things.

Another idea that keeps popping up in CMDA discussions is, can an object be its own content model? Or from a slightly different angle: Can a content model play the role of a Data Object, and thus act as a template? Also, what about multiple content models per object? Inheritance?

These questions hit on design, implementation, and best practices issues, all of which we are now in a much better position to discuss with the release of 3.0b1. I'm looking forward to it.

Fedora Commons Launched

2007-08-28T09:54:00.000-04:00

For those who haven't heard yet.... this is great news.

Carol also has some pictures from the launch celebration over at NSDL Road Reports.

Here's the text of the official announcement:

FEDORA COMMONS AWARDED $4.9M GRANT TO DEVELOP OPEN-SOURCE SOFTWARE FOR BUILDING COLLABORATIVE INFORMATION COMMUNITIES

(Ithaca, New York, August, 2007) - Fedora Commons announced the award of a four year, $4.9M grant from the Gordon and Betty Moore Foundation to develop the organizational and technical frameworks necessary to effect revolutionary change in how scientists, scholars, museums, libraries, and educators collaborate to produce, share, and preserve their digital intellectual creations. Fedora Commons is a new non-profit organization that will continue the mission of the Fedora Project, the successful open-source software collaboration between Cornell University and the University of Virginia. The Fedora Project evolved from the Flexible Extensible Digital Object Repository Architecture (Fedora) developed by researchers at Cornell Computing and Information Science.

With this funding, Fedora Commons will foster an open community to support the development and deployment of open source software, which facilitates open collaboration and open access to scholarly, scientific, cultural, and educational materials in digital form. The software platform developed by Fedora Commons with Gordon and Betty Moore Foundation funding will support a networked model of intellectual activity, whereby scientists, scholars, teachers, and students will use the Internet to collaboratively create new ideas, and build on, annotate, and refine the ideas of their colleagues worldwide.

With its roots in the Fedora open-source repository system, developed since 2001 with support from the Andrew W. Mellon Foundation, the new software will continue to focus on the integrity and longevity of the intellectual products that underlie this new form of knowledge work. The result will be an open source software platform that both enables collaborative models of information creation and sharing, and provides sustainable repositories to secure the digital materials that constitute our intellectual, scientific, and cultural history.

Recognizing the importance of multiple participants in the development of new technologies to support this vision, the Moore Foundation funding will also support the growth and diversification of the Fedora Community, a global set of partners who will cooperate in software development, application deployment, and community outreach for Fedora Commons. This network of partners will be instrumental for making Fedora Commons a self-sustainable non-profit organization that will support and incubate open-source software projects that focus on new mechanisms for information formation, access, collaboration, and preservation.

According to Sandy Payette, Executive Director of Fedora Commons, "the new Fedora Commons can foster technologies and partnerships that make it possible for academic and scientific communities to publish, share, and archive the results of their own work in a free, open fashion, and make it possible to analyze and use content in novel ways."

"Establishing a sustainable open-source software system that provides the basic infrastructure for on-line communities of scholars will have enduring impact. The unanticipated cross-disciplinary uses of this open platform are the hallmark of this revolutionary infrastructure," said Jim Omura, technology strategist with the Gordon and Betty Moore Foundation.

Payette also noted, "The open-source software that is developed and distributed by Fedora Commons can impact the entire lifecycle of what is often referred to as "e-Research" and "e-Science," including storage of experimental data, analysis of experimental results, peer review, publication of findings, and the reuse of published material for the next generation of scholarly works. We will also continue our work with libraries and museums to facilitate the sharing of digitized collections, making previously locked away material available to wide audiences. Also, building on our attention to digital preservation in the Fedora open-source repository system, Fedora Commons will continue to stress the importance of the sustainability of digital information in applications of our work."

About Fedora Commons
Fedora Commons is a non-profit organization whose purpose is to provide sustainable open-source technologies to help individuals and organizations create, manage, publish, share, and preserve digital content upon which we form our intellectual, scientific, and cultural heritage. Since 2001, with support from the Andrew W. Mellon Foundation, Cornell University and the University of Virginia have collaborated on the Fedora Project which has developed, distributed, and supported innovative open-source repository software that combines content management, web services, and semantic technologies. The Fedora software has been adopted worldwide to support an array of applications including open-access publishing, scholarly communication, digital libraries, e-science, archives, and education.

Fedora Commons will initially be located in the Information Science Building at Cornell University, Ithaca, New York. The Executive Director of Fedora Commons is Sandy Payette, who co-invented the Fedora architecture and led the Cornell arm of the open-source Fedora Project. The Board of Directors of Fedora Commons provides leadership from multiple communities, including open-access publishing, digital libraries, sciences, and humanities. For more information, visit http://www.fedora-commons.org.

About the Gordon and Betty Moore Foundation
The Gordon and Betty Moore Foundation, established in 2000, seeks to advance environmental conservation and cutting-edge scientific research around the world and improve the quality of life in the San Francisco Bay Area. The Foundation's Science Program seeks to make a significant impact on the development of provocative, transformative scientific research, and increase knowledge in emerging fields. For more information, visit http://www.moore.org.

CONTACT:
Fedora Commons: Sandy Payette
(607) 255-9222, payette@cs.cornell.edu
http://www.fedora-commons.org
Gordon and Betty Moore Foundation: Greg Nelson
(415) 561-7427, greg.nelson@moore.org

FTP ASCII unmangler

2007-03-29T12:46:00.000-04:00

FTP text mode is evil.

I made the mistake of transferring several important binary files from OS X to Windows last night, using FTP. Actually, a few mistakes were made along the way. 1) I didn't check that I was in BIN mode first, 2) I didn't verify the integrity of the files after the transfer, and 3) I deleted the sources.

Luckily, it was Unix-to-Windows, which means all #10 octets were replaced with #13#10. First I tried dos2unix with no luck. Then I wrote a program to replace all #13#10 sequences with #10 and crossed my fingers.

It worked. Here it is in all its inefficient glory. Maybe this will help someone else someday. No guarantees, but it's worth a shot if you're desperate.


import java.io.*;

public class Unmangle
{
    public static void main(String[] args) throws Exception
    {
        InputStream in = new FileInputStream(args[0]);
        OutputStream out = new FileOutputStream(args[1]);
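        int prev = 0;
        int b = in.read();
        while (b != -1)
        {
            // A CR that was held back is real data unless an LF follows it
            if (prev == 13 && b != 10)
                out.write(13);
            // Hold back every CR until we see what comes next
            if (b != 13)
                out.write(b);
            prev = b;
            b = in.read();
        }
        // Flush a CR left pending at end of file
        if (prev == 13)
            out.write(13);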
        out.close();
        in.close();
    }
}
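
For comparison, the same repair can be sketched in a couple of lines of Python. The function names are made up, and this version reads the whole file into memory, so it's an illustration rather than a replacement. It also shows why the damage is reversible at all: if every LF really was replaced with CR LF, then every LF in the mangled file has an inserted CR right in front of it, and removing exactly those gives back the original bytes.

```python
def mangle(data):
    """What the text-mode transfer did, per the description above."""
    return data.replace(b"\n", b"\r\n")


def unmangle(data):
    """Undo it: turn every CR LF pair back into a bare LF."""
    return data.replace(b"\r\n", b"\n")
```

Round-tripping arbitrary bytes through mangle and then unmangle should return the input unchanged, including any CR LF pairs and lone CRs that were in the original file.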

Social Bookmarking == Free Metadata

2007-02-01T06:48:00.000-05:00

Metadata is expensive. Librarians aren't the only ones privy to this fact.

I remember the pain we went through in bringing HP's FTP site to the web. The first step was converting the old README files to HTML. Perl made this a snap, but somehow it didn't address the now-more-apparent quality problem. So we slurped it all into a Paradox database and had a big metadata entry party.

Ok, the word "party" might be a stretch. There was technically pizza involved, but it was more of a bribe. It lasted days, and nobody really celebrated 'till it was over.

I distinctly remember the phrase "metadata monkey" entering my vernacular at that point.

I don't have anything against monkeys. Monkeys are cute. Monkeys at keyboards are even cuter. But metadata entry has long been viewed as a thankless job.

Now, sites like del.icio.us have figured out a way to get metadata monkeys to work for free. The incentive? Not pizza, not even bananas: just the ability to store our own descriptions, share those descriptions with others, and access it all from anywhere.

It makes me wonder about the role of the library in creating authoritative, versus curating social, metadata.

Resources, Representations, Repositories, and RDF

2007-01-28T05:19:00.000-05:00

Last week, Carl Lagoze gave an update on the OAI-ORE work at Open Repositories '07. ORE is a new project that intends to specify how heterogeneous repositories can exchange information about the digital objects they hold. Although they're not necessarily going after a new protocol, I still think of it as taking OAI-PMH to the next level. It's not just about metadata anymore.

For me, the most interesting parts of the talk were webarch-related. It all started with the statement (to paraphrase) "we must build on the web architecture". Carl then pointed out how representations are essentially second-class citizens on the web.

That got me thinking. At the most basic level, repositories are all about managing bitstreams (whether they're considered data or metadata). In webarch, bitstreams seem to equate to what they call "representation data". And a representation is defined by how it relates to a resource:

"A representation is data that encodes information about resource state."

So, in w3c-speak, a repository manages representation data. Okay, that's just a terminology change. But what about this statement:

"For robustness, Web architecture promotes independence between an identifier and the state of the identified resource."

That makes a whole lot of sense for the web when you consider how often web pages change. But what does it mean for repositories? How do we manage bitstreams if we can't identify them? The answer must be one of the following:

Indirect identification. Identify the associated "resource" in order to work with the bitstream(s).
Reification. Elevate the bitstream to a "resource" so we can talk about it.

How about if we want to model the repository as an RDF graph? Well, we know that representations can have metadata in addition to the payload. So in order to do this modeling, we need to reify. Internal to a repository, representation triples might look something like:

representationA represents urn:example:someTextFile
representationA contentType "text/plain"
representationA payloadLocation "/path/to/someTextFile.txt"
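
As a toy illustration of what reifying buys us, here are the same three triples held as plain Python tuples, with two lookups over them. The helper names are made up, and a real repository would of course use a proper triplestore rather than a list.

```python
TRIPLES = [
    ("representationA", "represents", "urn:example:someTextFile"),
    ("representationA", "contentType", "text/plain"),
    ("representationA", "payloadLocation", "/path/to/someTextFile.txt"),
]


def objects(triples, subject, predicate):
    """Return every object asserted for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def representations_of(triples, resource):
    """Return the representations that claim to represent the resource."""
    return [s for s, p, o in triples if p == "represents" and o == resource]
```

Because the representation has a name of its own, we can go from the resource to its representations and then ask each one for its content type or payload location.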

I think the OAI-ORE work is going to attempt something like the above: a model (and maybe a format?) for expressing resource-representation information in a repository-neutral way. It will be interesting to see what pops out.

A good source tree

2006-10-07T12:57:00.000-04:00

A good source tree:

Has a root-level README file.
Contains all compile-time dependencies.
    With versions noted.
    With originations and licenses noted.
    With transitive dependencies noted.
Can be compiled and linked easily.
    With one command.
    Without a network connection.
Can be unit + integration tested easily.
    With one command.
    Without a network connection.

Hardly coding

2006-10-03T01:57:00.000-04:00

URGENT ASSISTANCE!!!

HELP I CAN"T COMPILE>>> ERROR!!

[checkstyle] C:\work\mptstore\trunk\src\java\org\nsdl\mptstore\query\
provider\TriplePatternSQLProvider.java:35:45: '3' is a magic number.

I EVEN TRIED INSTRUCTIONS ON HOW TO PROGRAMMER??

http://secretgeek.net/howtobeaprogrammer.asp

SO I SERCH GOOGLE LIKE A GOOD PROGRAMMER AND FOUND THE ANSER:

http://www.youtube.com/watch?v=yPzAjiLr5Zw

BUT HOW TO I PASTE VIDEO I"M USING ECLIPSE??

THANK YOU!!!

Reducing Firefox Memory Consumption

2006-08-02T07:07:00.000-04:00

Firefox uses incredible amounts of memory. Some of this is caused by various undiagnosed memory leaks. Looking around on the net, I found a few things that can help, short of restarting all the time:

Make sure you have the latest version (currently 1.5.0.5)

Enter about:config in the address bar.

Right-click, add Boolean: config.trim_on_minimize (true)

Right-click, add Integer: browser.cache.memory.capacity (32768)

Restart Firefox.

If things get sluggish, minimize the window and restore it.

Using a tool like CachemanXP can quickly show how well the "trim_on_minimize" setting is working.

DTP Model

2006-06-27T02:29:00.000-04:00

The X/Open DTP Model is a widely-supported standard that defines components and interfaces useful for dealing with distributed transactions.

In this model, the application uses a transaction manager to help carry out a global transaction. The app communicates with the transaction manager via the TX interface, which provides the transaction semantics (begin/commit/rollback/etc).

Each data store involved in a global transaction is exposed to the transaction manager as a resource manager. The communication takes place over the XA interface, which defines functions for lower-level transaction semantics. The transaction manager uses the XA interface to carry out two-phase commits, and the resource manager uses it to dynamically enlist specific resources in the transaction.
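
As a rough illustration of how the pieces fit together, here is a small Python sketch of a transaction manager driving a two-phase commit across enlisted resource managers. It is a toy under stated assumptions: the class and method names (ResourceManager, TransactionManager, enlist, prepare, and so on) are hypothetical and are not the actual TX or XA function names, and it ignores recovery, timeouts, and logging entirely.

```python
class ResourceManager:
    """Hypothetical stand-in for a data store taking part in a global transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # simulates whether the local work can succeed
        self.state = "idle"

    def prepare(self):
        # Phase 1: vote on whether the local work can be committed.
        self.state = "prepared" if self.can_commit else "failed"
        return self.can_commit

    def commit(self):
        # Phase 2, success path: make the local work permanent.
        self.state = "committed"

    def rollback(self):
        # Phase 2, failure path: undo the local work.
        self.state = "rolled_back"


class TransactionManager:
    """Hypothetical coordinator; the application calls begin/enlist/commit."""

    def __init__(self):
        self.enlisted = []

    def begin(self):
        # Start a new global transaction with no resources enlisted yet.
        self.enlisted = []

    def enlist(self, rm):
        # Add a resource manager to the current global transaction.
        if rm not in self.enlisted:
            self.enlisted.append(rm)

    def commit(self):
        # Phase 1: collect votes, stopping at the first "no".
        if all(rm.prepare() for rm in self.enlisted):
            # Phase 2: everyone voted yes, so commit everywhere.
            for rm in self.enlisted:
                rm.commit()
            return True
        # Phase 2: at least one "no" vote, so roll back everywhere.
        for rm in self.enlisted:
            rm.rollback()
        return False
```

In the sketch the application only ever talks to the transaction manager, and the transaction manager only ever talks to resource managers, which mirrors the split between the TX and XA interfaces described above.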

References:
Nuts and Bolts of Transaction Processing, by Subbu Allamaraju, subbu.org
Global Transactions - X/Open XA - Resource Managers, by Donald A. Marsh, Jr., Aurora Information Systems, Inc.

Java coding style

2006-05-06T01:26:00.000-04:00

Here's the style I prefer these days.

package org.example.myproject;

import java.io.IOException;
import java.util.Map;

import org.apache.log4j.Logger;

import org.example.myproject.util.MyUtil;

/**
 * This class does several things.
 *
 * @author name@domain.com
 */
public class MyClass {

    public static final String PUB = "a";

    private static final String PRIV = "b";

    private String _name;

    private String _description;

    /**
     * Construct a new MyClass.
     */
    public MyClass(String name, String description) {
        _name = name;
        _description = description;
    }

}

Writing a search engine

2006-05-04T03:04:00.000-04:00

I just remembered this story...

Back in '96, I was part of a small team working on the HP support website. We had just started to look at incorporating a search engine and had hired Verity to integrate their search software with our site.

One afternoon, Jim, the project manager for search, invited a few of us to attend a Verity training session. I was really curious about how these search engine things worked, so I was happy to attend.

Derrick, the consultant, had given us a nice introduction to inverted indexes and stopwords, and I was intrigued. Ten minutes in, I stopped listening and started typing. Shortly afterward, Jim noticed I wasn't being as attentive as the others. He came over and asked me what I was up to.

"Oh, I'm writing a search engine in Perl"

Poor Jim.