Tincat Mewsings

New Year, New Blog

2009-01-09T09:34:00.005-06:00

As you might have noticed, I did not continue my 2006 blog much into 2007. I have a draft of a new blog entry from back then, entitled "Property Values Flourish", about modeling data with multi-valued attributes. Alas, I did not complete it; I attended to other work and started organizing a new company later that year.

I have started to blog about my current project as it unfolds this year. The first entry of the Persistence Matters blog is here. Some of you might find it worthwhile and fun to follow this project.

Thank you for your readership of this Tincat Mewsings blog. I was amazed how many people from how many countries visited. I just checked, and there have been 17,504 "absolutely unique visitors" according to Google Analytics. My original goal was that 100 people read my Is Codd Dead? posting. Of course, my other goal was that I help to change the industry. I doubt whether my one small voice made any difference at all, but I do feel confident that enough of the industry is invested in working with persisting logical lists, for example, that my voice is not needed. I don't think the momentum has carried into college database classrooms extensively to date, however.

If you are a college professor teaching database courses, it might be fun ("fun" being a relative term) to add entries from the Is Codd Dead? blog series to the reading that students will do. Even if you disagree with any of my points, it will provide balance to the typical diet of only relations, no lists, for breakfast, lunch, and dinner.

Have a fantastic 2009!

OTLT: Metadata Piece Not Apartheid

2007-01-02T12:05:00.000-06:00

Yes, I am going to reopen a can of worms. I first saw the acronym OTLT, for One True Lookup Table, when I read this Celko article. The article pretty much made sense to me then, as it does now. I would not recommend the use of OTLT to anyone wishing to use an SQL-DBMS properly as a relational database. As a few of you might guess, I will suggest that some might be well-served using something other than a proper SQL-DBMS or using an SQL-DBMS other than properly.

Celko writes "I am going to venture a guess that this idea came from OO programmers" in the above mentioned article. However, OTLT (or OTLF) is a design pattern I encountered before I ever heard of OOP, or design patterns for that matter. It even predates our use of the term table in database design, and might be called a code file or validation file, for example. It is at the point that this design pattern was brought into the world of relational tables where it switched from being a pattern to an anti-pattern.

Many moons ago, this code file pattern was employed with implementations using indexed sequential files, e.g. MIDAS files on Primos or VSAM files on MVS on an IBM 3081, and I'm pretty sure it was around well before my encounters with it.

In case you are not familiar with this pattern, you can see typical values in the picture below. It represents a simple function--put a multipart key of code_type and code_value into the vending machine to retrieve the code_description. These attribute names, as provided by Celko in this article, are pretty much what I recall, with the exception of all caps and dashes in place of underscores.
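The vending machine idea can be sketched in a few lines of Python. This is only an illustration: the sample code types and values are made up, and a dictionary stands in for whatever keyed file or table actually holds the codes.

```python
# Hypothetical sample data: a dictionary standing in for the code file.
# The key is the multipart key (code_type, code_value); the value is
# the code_description.
CODE_FILE = {
    ("marital_status", "M"): "Married",
    ("marital_status", "S"): "Single",
    ("state", "IA"): "Iowa",
    ("state", "IL"): "Illinois",
}


def code_description(code_type, code_value):
    """Return the description for the key, or None if there is no such code."""
    return CODE_FILE.get((code_type, code_value))


print(code_description("state", "IL"))
print(code_description("state", "ZZ"))
```

The second call returns None because no such key exists, which is the lookup doubling as a validity check.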

Once upon a time, data were not democratized, and we could treat some differently than others within the data model. Now we talk about attributes having types, such as varchar or int, while tables are all seen as somewhat of the same type, i.e. Relation. [Yes, these Relations are of type Customer, Order, Person, or whatever, but I only state that so you do not wander off-course on that yourself.] Our profession once spoke of file types of a different ilk. Back when the database was perceived as a set of files, rather than tables, these files were identified with various type designations. There might be files of type:

master (named with nouns corresponding to strong entities)
transaction (events)
log
history
parm
control
code (we will focus on this one)
enzovoort (et cetera).

We will zero in on the code files. Historically, some systems used separateness, or apartheid in Dutch, for each type of code. Other systems split out the really large or more complex code files as separate files but kept the rest of the codes in a single piece, pouring the data from all other potential code files into one big validation file. One rule of thumb that I recall was that if there were more than a thousand entries, we were more likely to put such codes in their own file (so as not to adversely affect the performance of all applications doing lookups).

Let's look at some of the objections to using this pattern with an SQL-DBMS to see how these play out in the case where we are using another DBMS solution that is not strictly a SQL-DBMS, such as Pick (see Is Codd Dead? for a list of Pick databases). Near the end of Celko's article, he writes that it is a "data modeling principle that a well-designed table is a set of things of the same kind instead of a pile of unrelated items." When pouring all of our code types, values, and descriptions together into one component, are we really talking about unrelated items? They certainly exhibit a similar pattern.

Relational databases, or approximations thereof, split data into two distinct groupings often referred to as data and metadata. So what are code files--data or metadata? These code files are not quite the same as typical user data. A table such as MaritalCodes doesn't relate to a traditional business proposition or entity quite like the Customer table does. With modeled propositions for a business, it is common for purposes of data quality, ease of data entry, compact representation for reports, and storage efficiency to encode values with abbreviations, aka codes, that stand for the original value within a proposition. With the value encoded in a code file, we can then look up the longer description when needed.

I'm guessing you understand why and how a code file in the old model translates to a lookup table in an SQL-DBMS. Such lookup data, equating an encoded term with a description, is also validation data. So, we might also call this a validation table. It is clearer now with GUIs that lookups for valid codes and data validation are two sides of the same coin. The list of valid codes is often used to populate a GUI widget so there is no chance of any other values being entered by the user. The values in these old code files might also compose a list for a check constraint. Code files host data that is both data and metadata, or what we might term tween data.

Unlike hard-coded check constraint values, our code files often have data entry forms that permit some of the code types to have the valid list of entries maintained by the end user. They are part of the data integrity function of constraint handling. They are metadata that users see, some of which serves as data too. They are used for validation as well as representation, turning codes into descriptions when needed. Such codes serve a purpose of standardizing attribute values, and indicating valid entries throughout a software system.

Systems that store metadata as data, which includes most software written to be implemented using multiple DBMS solutions, might indicate that the marital_status attribute of the Person relation is validated using the Marital_Status code type in the Validation_Codes or One_True_Lookup table.
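A minimal sketch of that arrangement, assuming everything lives in plain Python dictionaries. The names ATTRIBUTE_VALIDATIONS and invalid_attributes, and the sample codes, are hypothetical and only show how an attribute can point at a code type.

```python
# Hypothetical tween data: which code type validates which attribute.
ATTRIBUTE_VALIDATIONS = {
    ("Person", "marital_status"): "Marital_Status",
    ("Person", "state"): "State",
}

# Hypothetical contents of the Validation_Codes (One_True_Lookup) table.
VALIDATION_CODES = {
    ("Marital_Status", "M"): "Married",
    ("Marital_Status", "S"): "Single",
    ("State", "IA"): "Iowa",
    ("State", "IL"): "Illinois",
}


def invalid_attributes(relation, record):
    """Return the names of attributes whose values are not valid codes."""
    bad = []
    for attribute, value in record.items():
        code_type = ATTRIBUTE_VALIDATIONS.get((relation, attribute))
        # Attributes with no registered code type are not checked here.
        if code_type is not None and (code_type, value) not in VALIDATION_CODES:
            bad.append(attribute)
    return bad


person = {"first": "Jan", "marital_status": "M", "state": "IL"}
print(invalid_attributes("Person", person))
```

Because the mapping from attribute to code type is itself data, adding validation to another attribute is a data change rather than a schema change.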

Other metadata that might be required to be housed as data includes attribute names and file names, for example, for building business rules. Tween data is both data and metadata and it is everywhere. So, how should we model such tween data?

We might need to model enough tween data to specify rules such as

if (order_amount > 1500) requires_authorization = "Yes"

These files are part of the application system architecture. Just as we do when designing user database entities, when looking at the entities involved in developing and maintaining the application, some standard entities such as Attributes, Files (or maybe Tables or Relations), Rules, and Validations might make their way into every application. These entities are not specific to the domain for this application and are required because this is a software application, not because this business is in the xyz vertical market.
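To make the rule above concrete, here is a rough sketch of a rule held as data rather than as code. The layout of the rule record (the keys attribute, op, value, set, and to) is an assumption made for illustration, not a description of any particular product.

```python
import operator

# Comparison operators that a stored rule may name.
OPERATORS = {">": operator.gt, "<": operator.lt, "==": operator.eq}

# Hypothetical rule record; attribute names are stored as data.
RULES = [
    {
        "attribute": "order_amount",
        "op": ">",
        "value": 1500,
        "set": "requires_authorization",
        "to": "Yes",
    },
]


def apply_rules(record, rules=RULES):
    """Return a copy of the record with every matching rule applied."""
    result = dict(record)
    for rule in rules:
        compare = OPERATORS[rule["op"]]
        if compare(result[rule["attribute"]], rule["value"]):
            result[rule["set"]] = rule["to"]
    return result


print(apply_rules({"order_amount": 2000}))
print(apply_rules({"order_amount": 900}))
```

Changing the threshold is then a matter of maintaining a record, something an authorized end user could do through a data entry form.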

How might one make a design decision about whether to pour all such codes together in OTLT, rather than separating them out? Let's see, we want to have tight cohesion and loose coupling, right? We have the option of modeling a single proposition with a type of code, a value for that code, and a description. The implementation of this proposition can then be used for dropdowns and other GUI features as well as validation and representation. Otherwise, we can model a set of propositions that all sound remarkably similar, each with codes and their descriptions, elevating the type of code to the name of a table rather than having it as an attribute in our validation table. But after designing the n-th such table, my patience starts to DRY up (see the DRY Wikipedia entry).

Admittedly, while we then have our codes together, these codes relate to different things. That will be a problem if we want to add an attribute for one of these types and not for another. Splitting them out, on the other hand, is more problematic if we opt to do something consistently across the board with all of our codes and must repeat the procedure for each such table. The former is typically mitigated by a) using tools where refactoring the database when requirements change is standard fare (that is, using an agile database, such as IBM U2, Revelation, InterSystems Caché, or OpenQM) or b) adding an attribute for further classification of a code, such as code_classifier. Different values for the code_classifier can be used to control the logic performed.

Let's look at more of the objections to the use of OTLT with an assumption that we are using a MultiValue, aka Pick, DBMS. One objection concerns the data type and length needed for a column that holds every kind of code. Not coming from the world of 80-column card input (see picture in To Whom Should Size Matter?), Pick handles variable length data really well. In fact, neither types nor lengths are enforced by the DBMS, prompting me to initially conclude that it could not be considered a DBMS, an opinion I have reconsidered. In any case, this objection, while relevant to SQL-DBMS's, is irrelevant in other DBMS's, including Pick.

Another objection is that software is capable of putting the wrong data in for the code_type, so that a state abbreviation might become a marital status, for example. Well, as it turns out software can put a multitude of incorrect values into a database. It has a job not to do that, and while DBMS-specified constraints might contribute to one kind of data quality, they certainly do not ensure data quality. As an aside, I would argue that they have even contributed to poor quality software, forcing developers and end-users alike to play games to try to match values to type, even where ill-advised. The bottom line on this one is that if your DBMS does not enforce types, there is no related problem with this pattern. One of the code_type values could be Valid_Code_Types where the application that permits the addition of code_values for all code_types also permits new values here, with appropriate security, of course.

I don't like discussing performance, but I definitely recognize that many a software project has failed due to inadequate performance. Celko brings up the fact that having to search through a larger table is less efficient than through a smaller one when finding a particular value or a set thereof. As long as we do not make the file or table (whichever is being used) way too large, this file of validation codes is an obvious choice to have in cache throughout the run-time of a software application.

There is no doubt that there are tasks one can perform that are slower and some that are faster using one pattern or another, so obviously there is room for taking into consideration the type of activity that will be performed with codes in any given system. Using Pick, there are even some who design the big code validation file to have one item (aka record) per code type, with associated multivalues for the valid values and descriptions. One can then have one disk read to suck in the entire validation list for a particular attribute, such as the entire set of state codes and their descriptions.
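The one-item-per-code-type layout can be approximated in Python with parallel lists standing in for associated multivalues. This is a sketch only: the item layout, the function names, and the sample codes are assumptions, and a dictionary lookup stands in for the single disk read.

```python
# Hypothetical validation file: one item (record) per code type, with
# two associated multivalued attributes held as parallel lists.
VALIDATION_FILE = {
    "state": {
        "values": ["IA", "IL", "MI"],
        "descriptions": ["Iowa", "Illinois", "Michigan"],
    },
    "marital_status": {
        "values": ["M", "S"],
        "descriptions": ["Married", "Single"],
    },
}


def read_item(code_type):
    """Stand-in for one disk read that fetches the whole item."""
    return VALIDATION_FILE[code_type]


def describe(item, code_value):
    """Find the code's position, then take the description at that position."""
    try:
        position = item["values"].index(code_value)
    except ValueError:
        return None
    return item["descriptions"][position]


states = read_item("state")  # the entire validation list in one read
print(describe(states, "MI"))
print(list(zip(states["values"], states["descriptions"])))
```

The zipped pairs are exactly what a dropdown for the attribute would need.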

One of the features I like best about using one big code validation file is that you can write one maintenance form for this file and it handily covers all of these, even future validation code types. Sure, you will also need a code file for what code files are permitted and you might want to apply some different validations for different types of codes (more common in theory than in practice, I suspect).

Some might be convinced that when using something other than a traditional SQL-DBMS, it might sometimes make sense to model the tween data in a way that acknowledges that it also serves as metadata. But this could also be relevant to SQL-DBMS users. I will end with a heresy (as if I haven't been spouting such all along, eh?). Companies and developers with a need for a lot of tween data might be well-served using SQL-DBMS tools as if they were file systems, ignoring some of the relational dogma. If you are minimizing the amount of metadata that you duplicate by putting it both in the SQL-DBMS metadata catalog and in the database itself by minimizing your use of SQL constraints and the like, you really can use the OTLT pattern. Of course, you should be aware of the tradeoffs in doing so.

While I recognize this is not the weighty decision that Carter lays out in his controversial book Palestine: Peace Not Apartheid, for any given software metadata design for code files, you simply need to decide whether to go with a single cohesive piece of tween data or whether you favor relational apartheid.

Cowboys with Promiscuous Databases

2006-11-06T17:26:00.000-06:00

In Northwest Iowa we have lots of cows but no cowboys. We have cattle farms that, as best I can tell, are much like hog farms.

Somewhere between Iowa and Wyoming the backdrop changes from cattle farmers to cattle ranchers. We move from confinement lots to more open ranges. Where the land is fertile, it is farmed. The cows remain rounded up and typically crowded together. Farm land is planned out, designed, and structured. Farmers designate areas for corn, soy, or cows. They work the land, toiling over its makeup throughout the seasons.

I'm no expert, but it seems that ranches are left with a more natural order. As you head towards the mountain regions, the rockier unfarmed land is available for cattle. With less control on their movements, these cattle spread out a bit more. Cowboys round them up and prompt them to move as a group to another location when needed.

Iowa cattle farm above, Wyoming cattle range below

Promiscuous means consisting of elements brought together without order

We might say that the confinement lot is organized, constrained, and controlled in accordance with economic and other mathematical principles, while the ranch provides a more promiscuous landscape. The ranch is also organized, mind you, but a natural order arises from the land. Cowboys interact with the landscape to herd the cattle, performing tasks that might have been designed into a confinement lot.

Using the second definition from dictionary.com, promiscuous means consisting of parts, elements, or individuals of different kinds brought together without order.

Lest there be confusion, let me state that as we turn our heads from cows to data, I will interpret the word order in this definition to refer broadly to the organization and structuredness of a database, not to the ordering of attributes or rows.

Real cowboys, Wyoming Nov 2006

Choose from among the most common dictionary definitions of the term database, such as a collection of related facts or perhaps one that requires the use of a computer, if you prefer. Most readers will likely be familiar with the process of designing databases by employing relational modeling, given that this is taught in college courses as well as on the job. The design is organized, constrained, and controlled according to mathematical principles from set theory and first order predicate logic. Like the cattle farmer's land, a model could be drafted showing how the database designer will structure the database. Structure of this sort attracts certain personalities (farmers?) and not others (ranchers?). You might guess that I, in particular, feel more at home on the range.

Any model other than a relational data model might seem to the software development profession as promiscuous. I chose this derogatory, yet enticing, term in part because of the seeming unorderedness, comparatively, of legacy data models. This is not unlike the seeming unorderedness of a cattle ranch compared to the more obvious structure of the cattle farm. I also like using the term promiscuous here because our profession currently sees these alternative database tools as improper, even if increasingly enticing. I predict that our industry will be seduced by something resembling these legacy databases enough to switch to considering not-really-relational databases as mainstream again in the future, especially as SQL becomes less attractive as our interface language to data.

Note that although I do need a term for the databases about which I have been writing, often referred to as embedded in marketing literature, I will likely not latch onto this term as that would put me in the uncomfortable position of endorsing promiscuity. I can live with that discomfort for this one blog entry.

Ask a rancher or cowboy how they divide up the land, and they might suggest that the land divides itself. The water is here, the grass there, and the rocks up this way. They might draw or paint you a picture. Ask a database cowboy, one working with a more promiscuous database than those based on the relational model, how they model the data and you might hear an analogous response. The data orders itself.

Ask what steps a database cowboy takes to design a database, and you will likely hear that the first step is to have a good understanding of the landscape, the business. Then you define the scope of your project, putting a fence around it, and then you record what you see inside that fence. By looking at the landscape, you can make a computerized model of this reality for your database implementation. The implementation is a model of the business, not unlike a painting of the range.

I am well aware this scenario generates laughter from some, ridicule from others, as it sounds so unscientific. But as my colleague Anthony Youngman (aka Wol) would suggest, relational modeling is mathematical but not very scientific. The RM imposes an order using mathematical terms such as predicate and relation, typically avoiding terms matching the problem domain such as thing, entity, property, empty, and list, terms used by cowboys working with promiscuous databases. Relational modeling includes putting data into Nth normal form, while the database cowboy knows the land and paints what he sees.

For anyone confused by the imprecision of this description, perhaps the Jayne VanDoe example in the Is Codd Dead? mewsing provides more hints. By the way, regarding science and databases, have the terms relational model and experiment ever made it into the same sentence? We need to return to the science and art, the craft, of databases, modeling by painting what we see and testing our models over time. I'll grant that there is a need for more empirical data related to the effectiveness and resources required over time for all varieties of databases.

At the risk of repeating what I have said in earlier blog entries, but for the sake of any new readers, I will briefly suggest three features that might distinguish a seemingly promiscuous database from one that more closely implements the relational model.

2VL
Most, if not all, languages that work with the data employ two-valued logic.
NF2
The data need not be in what has traditionally been called first normal form. Attribute values may be arrays or multivalues.
Constraints as data
This one needs a sweet acronym and a better description, but the idea is that constraints related to attribute types are typically specified with data, rather than with metadata, and are enforced outside of a DBMS, rather than by one. Rather promiscuous, wouldn't you say?

Let's take a look at legacy databases. As it turns out, the data handled with databases termed legacy is current, not primarily legacy, data. While the conventional wisdom, which some would say has been proven, has been to migrate from legacy databases to SQL-DBMS tools, signs point to a return of such proven approaches as the use of two-valued, rather than three-valued, logic. Additionally, more and more work with databases is done by developers without the tired, old 1NF requirement, often by way of object-relational or XML-RM mappings. There is reason to suggest that the future of data modeling resembles the past, the data modeling done by our current database cowboys.

In case you are asking Where's the beef? (perhaps you cannot see the pictures herein), the next blog entry will start looking at specific design patterns used when designing for one model of promiscuous databases, the Pick/MultiValue databases. While I am not as familiar with other not-really-relational models, it is very likely that these best practices will translate to best practices in many other environments as well. And, as always, I fear they are apt to irritate or even infuriate some RM enthusiasts. Heigh ho.

Although cowboys typically prefer using an apprentice approach with new recruits, a cowboy handbook might be in order so that the next generation of cowboys can learn from the best practices of those who have gone before. While it once looked like these cowboys were a dying breed, with the new wild, wild west of the internet, database cowboys look like they will be around for as long as the farmers. In the next blog entry, I will have something for you to sink your teeth into. The least we can do is pass along some tips from seasoned cowboys on how they have been saddling promiscuous databases for the past half-century.

To Whom Should Size Matter?

2006-07-27T11:52:00.000-05:00

The Grand Tetons are incredibly impressive. They are big, really big. Combining business and pleasure, with one trip this year I went from Iowa to Washington D.C. and with another from Iowa to Washington State, both by car. I was able to see people and places, taking pictures from coast to coast. I love mountains and water, the bigger the better. In my home state of Michigan, the Great Lakes are great because of their size. Of course there are very impressive small points of interest too. When it comes to physical things, size matters.

Physical objects differ from words in that respect. It makes sense to care about the height of a mountain, the size of a portion of food, the length of a skirt, or the size of a vehicle. It rarely makes sense to care about the size of a word, with word games being an exception.

You might recall that feeling when you want to play all seven letters in Scrabble, maybe even on a triple word score, but it doesn't fit. The board is not long enough, or there are not enough open squares to play the word. Rats!

A similar game is often played by users of database management systems and related applications. The word you want stored in the database is just a little too long for the data entry screen to handle or the DBMS to accept. Some DBMS tools require or at least strongly encourage size constraints on data attribute values. Others do not. To an SQL-DBMS, size matters.

Focusing on this one specific type of constraint, let's take a look at reasons a software development team might design using a maximum length constraint (maxlen) rather than permitting variable lengths for values of a specific attribute.

Input

Perhaps the input device and related software provides for limited space for input data. The punch card is a good example. While variable length attribute values could be encoded on punch cards using delimiters between values such as deBoer,Reta,piano, to ensure that a fixed list of attribute values for each person fits on a single card, we might design the musical instrument value to be in card columns 33 through 46. The maxlen for all values of that attribute would then be 14.
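The two punch card encodings can be compared with a small sketch. Only the instrument columns (33 through 46) come from the example above; the columns assumed for the last and first names, and the function names, are made up for illustration.

```python
CARD_WIDTH = 80

# Hypothetical card layout using 1-based, inclusive column numbers.
# Only the instrument columns are taken from the text.
LAYOUT = {"last": (1, 16), "first": (17, 32), "instrument": (33, 46)}


def maxlen(field):
    """The maxlen of a field is simply the number of columns it occupies."""
    first_col, last_col = LAYOUT[field]
    return last_col - first_col + 1


def punch(record):
    """Lay a record out on one card, truncating values that do not fit."""
    card = [" "] * CARD_WIDTH
    for field, (first_col, _last_col) in LAYOUT.items():
        value = record[field][:maxlen(field)]
        card[first_col - 1:first_col - 1 + len(value)] = value
    return "".join(card)


def read(card, field):
    """Read a field back from its columns, dropping the padding."""
    first_col, last_col = LAYOUT[field]
    return card[first_col - 1:last_col].rstrip()


card = punch({"last": "deBoer", "first": "Reta", "instrument": "piano"})
print(maxlen("instrument"))
print(read(card, "instrument"))

# The delimited alternative needs no maxlen at all.
print("deBoer,Reta,piano".split(","))
```

With the fixed layout, any instrument name longer than the field is silently cut off by punch, which is the kind of input-driven constraint being questioned here.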

Developers once had to design for input from 23 x 80 character screens too—remember those? Even with a fixed screen size, however, users could be given scrolling entry fields so entered data could be larger than what is shown on the screen at any given time.

We should consider whether we would want the restrictions of an input device to dictate database-level constraints. We might also consider that today's typical screen entry or data exchange has no such issues. While screen size is still a factor for designers, this technology requirement need not whittle down our original business requirements. I feel comfortable relegating this reason for a maxlen to the past.

Output

I still have my nifty 132-column ruler for designing greenbar reports. When we could fit only 132 characters in a fixed-length font on a single line of a report printed on greenbar, we naturally cared about attribute sizes. When designing paper forms such as payroll advices, it makes sense to care about the maxlen for names compared to the number of available characters on the form. The representation of the value of any particular attribute might be constrained by the design of a report. Should we also constrain the actual database values and not just the representation of those values? Perhaps there is a reason to do so in some cases, such as the codes discussed next, but, in general, no.

Lookup Table

When modeling data, we often swap out some terms within a proposition for succinct, consistent codes. For example, start with a proposition such as:

Jan (first) VanDoe (last) lives in Illinois (state).

It is typical to model the predicate for this proposition with two base relations, Person(first,last,stateCode) and State(stateCode,stateName). The original predicate is then modeled with a view that joins these two giving PersonView(first,last,stateName). In this example, a code of IL would be in the Person relation, mapping to the name Illinois in the State relation.
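As a rough illustration, the two base relations and the joining view can be built in an in-memory SQLite database using Python's standard sqlite3 module. The column types and the primary key are assumptions; the relation and attribute names follow the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE State (
        stateCode TEXT PRIMARY KEY,
        stateName TEXT NOT NULL
    );
    CREATE TABLE Person (
        "first"   TEXT,
        "last"    TEXT,
        stateCode TEXT REFERENCES State (stateCode)
    );
    -- The view recovers the original predicate by joining on the code.
    CREATE VIEW PersonView AS
        SELECT p."first", p."last", s.stateName
        FROM Person AS p
        JOIN State AS s ON s.stateCode = p.stateCode;

    INSERT INTO State VALUES ('IL', 'Illinois');
    INSERT INTO Person VALUES ('Jan', 'VanDoe', 'IL');
    """
)

rows = conn.execute("SELECT * FROM PersonView").fetchall()
print(rows)  # [('Jan', 'VanDoe', 'Illinois')]
```

The column names first and last are quoted only as a precaution, since they double as SQL keywords.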

This is a common pattern in data modeling. In their Refactoring Databases book, Ambler and Sadalage label a change of this nature to an existing schema as Add Lookup Table. So instead of calling this the code file pattern, a term used back when we also talked about master files and transaction files, I'm OK with calling this the Lookup Table pattern, while still referring to the attribute (such as stateCode above) as a code.

Is there a good reason to constrain the length of codes? Yes. Lookup tables are all about constraining propositions to aid in conformity. This helps with accuracy of captured data and ease in performing analysis, for example. Placing a maxlen on the code does not further constrain nor alter the original proposition derived from business requirements. There are advantages to developers in fixing a maxlen, making such a design attractive in cases where it cannot adversely affect data integrity (by truncating values, for example) or unnecessarily restrict flexibility.

Now comes the question of who should care about the maxlen on codes for lookup tables. Have you ever seen a case where there is a requirement to change the length of a code? If not, I've seen enough such cases for both of us. In every case I can recall, the change was to a larger maxlen, although I can imagine the other scenario. In the rare case of changing to a smaller maxlen on a code, there must be a mapping from the longer attribute values to the new shorter maxlen, whether truncation or a fancier algorithm, and the existing database would likely need to be modified so it fits within the new constraints.

In the much more common case of needing to upsize the maxlen, there is no similar requirement to change the existing database. The new requirement is less restrictive than the former requirement, so the database is logically already in compliance with the new maxlen. It might even be feasible to roll such a change in constraints out to users along with other maintainable business rules. Unfortunately, there are DBMS products that tightly couple the logical constraints with the physical implementation of the database. Such products actually make computer resource allocations based on the logical maxlen constraints!

In fact, every SQL-DBMS product I have seen requires some maintenance activity on the physical database simply because of an increase to a maxlen, a relaxing of the constraint. While there are products that have no such issue, I'm guessing that most of those pre-date the introduction of the relational model. Coincidence? [I have been corrected on this. Apparently this is not the case for Oracle, so I would guess others as well. I'm curious now what an average cost estimate might be for a requirement to increase a maxlen in a schema since I'm guessing estimates might have influenced my assumption. Regression testing might be the biggest part of the estimate.]

Back to our codes where we want to design in a maxlen, where should this maxlen be encoded? While some might be inclined to specify this constraint to the DBMS, you can see the danger in cases where such a maxlen specification is used for the dual purpose of physical implementation design. I recently saw an SQL-DBMS schema example where every attribute was designated with a maxlen of 50, with 50 being larger than the largest logical maxlen for all attributes. Such an approach mitigates the issues with the tight coupling of logical and physical.

Where should the logical maxlen be placed in a logical data model? Given that all UI and web service validation routines need access to this constraint information, it can be made available by being captured as other business rules might be, in the database itself. I suggest that the database proper (rather than the schema) is a good place for specifying all business constraints. There must then be enforcement of the use of standard CRUD services that employ such constraint logic for all software maintaining the database. This provides more flexibility for the business and I suggest the cost of ownership is lower than when any business rules are specified to the DBMS.
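One way to picture this is a maxlen kept as ordinary data and enforced by a shared create service. The sketch below is a toy under stated assumptions: the in-memory dictionary stands in for the database, and the names database, create, and ConstraintError are hypothetical.

```python
class ConstraintError(ValueError):
    """Raised when a value violates a constraint stored in the database."""


# Hypothetical database: the constraints are data alongside the user data.
database = {
    "constraints": {("Person", "stateCode"): {"maxlen": 2}},
    "Person": [],
}


def create(relation, record):
    """Standard create service: check stored constraints, then store."""
    for attribute, value in record.items():
        rule = database["constraints"].get((relation, attribute))
        if rule is not None and len(str(value)) > rule["maxlen"]:
            raise ConstraintError(
                f"{relation}.{attribute} exceeds maxlen {rule['maxlen']}"
            )
    database[relation].append(dict(record))


create("Person", {"first": "Jan", "last": "VanDoe", "stateCode": "IL"})

# Relaxing the constraint is a data change. Rows already stored comply
# with the larger maxlen, so nothing else has to be touched.
database["constraints"][("Person", "stateCode")]["maxlen"] = 3
create("Person", {"first": "Reta", "last": "deBoer", "stateCode": "ILL"})
print(len(database["Person"]))
```

The sketch only works if every program maintaining the data goes through create, which is the enforcement requirement noted above.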

Processing

I'll toss in a short note that if software is being developed with one of the older languages for processing data that require maxlen on variables (e.g. COBOL), then you are stuck specifying the maxlen to your software components for the purpose of reading or writing data. As with punch cards, size matters to COBOL. As sweet as I suspect it can still be to write business applications in COBOL, I'm comfortable making this an historical footnote.

Performance

Some might say that decoupling logical and physical maxlens could adversely affect performance. If that is the case, I would suggest that the industry move to a data model that does not have this defect. Yes, of course I know that the computer is a finite resource and there is a need for some information for performance tuning and optimization. But(3) it(2) is(2) not(3) essential(9) that(4) each(4) individual(10) attribute(9)...you get the point, right?
Space
Similarly, if the amount of disk or other resources is adversely affected by decoupling logical and physical constraints, I would suggest the industry move to a data model that does not have this defect.

While the models I know that fit the bill do have other defects, including logical/physical coupling in other respects, they provide greater flexibility for those changes that companies are likely to make to their data over time. That is because they take physical implementation information that is irrelevant to the user and the business requirements and require it of the logical model rather than the other way around. For example, because such things as the order of stored attributes is irrelevant to the business user, there need not ever be a requirement to change that order in either the logical data model or the physical implementation.
Real Input or Output Business Requirement
Changing the name, my cousin told me how she has let the relevant parties know that she absolutely does not want her hyphenated last name of Oostendorp-Holland to be truncated on documents, credit cards, etc as Oostendorp-Ho again, pa-lease. Given that she is an influential person, I'm guessing these parties are trying to comply, but it might be too costly for them to do so.

What should a company do when there is a business case for having a maximum length on a data value? They should typically not ditch the unrestricted real value for an attribute, but add in another attribute for the representation of the actual value at a shorter length. In this case, they could then keep the actual name value and also have a constrained value. They could put Oo-Holland or Oostendorp or any number of other values into the constrained version. This would only be used in an override situation (ignoring SQL issues with NULL handling here) where the original value as written or as truncated would not suffice. Additionally, if the constraint is on a derived value, such as the first name plus the last name (which happens to be the case in the example cited), the revised full name, with associated maxlen, would be a separate attribute.
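
A small sketch of the override approach, assuming a hypothetical `PersonName` record with an optional `short_last_name` attribute; the class and method names are illustrative only. The real value is never altered; the constrained value is consulted only when the real one does not fit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PersonName:
    last_name: str                          # the real, unrestricted value
    short_last_name: Optional[str] = None   # hypothetical override for constrained outputs

    def for_output(self, maxlen: int) -> str:
        """Pick the value to print where an output really is limited to maxlen characters."""
        if len(self.last_name) <= maxlen:
            return self.last_name
        if self.short_last_name is not None and len(self.short_last_name) <= maxlen:
            return self.short_last_name
        # No usable override: fall back to truncation.
        return self.last_name[:maxlen]


cousin = PersonName("Oostendorp-Holland", short_last_name="Oo-Holland")
card_name = cousin.for_output(13)
```

Without the override, `PersonName("Oostendorp-Holland").for_output(13)` falls back to the truncated form the cousin objects to.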

We have seen some of the reasons why a designer might design in a maxlen, with related commentary on these. Why focus so keenly on this one type of constraint? I would venture that the application of the maxlen in places where the proposition should have no such constraint, in addition to the tight coupling between the logical and physical maxlen, has cost the industry and individual organizations a bundle. It is but one area related to constraint handling where we have been bleeding, but one about which I hear little mention—and no noise, no fix, I'm guessing.

Relational theory is predicate logic and set theory. Relational theory does not speak to maxlen, but every implementation of the relational model leads developers to make extensive use of such. It has been suggested that if maxlen were removed from RM implementations, the solution would be less optimal in some way. While this has nothing to do with relational theory, it does seem to have something to do with existing relational models and implementations of such, particularly the SQL-DBMS. It might be the case that in order to resolve this, DBMS implementations would be well-served to permit attribute order to be specified, or reject some other tenet of the relational model. To whom should size matter? Not the DBMS.

Constraints Factored In and Out

2006-05-26T21:58:00.000-05:00

It's time to factor constraints into the equation. A constraint is any component of software, including code, metadata, or data, designed to limit possibilities. Constraints implement "business" rules such as only men may be elders in churches in Sioux Center, IA.

Every aspect of a software product is constraining. Remove all constraints from software and you no longer have software. While I will use the term constraint with a broad definition, some use it only related to constraint specifications formed as propositions (logical predicates, to be precise). Some prefer or even insist upon using a declarative, rather than OO or procedural, language for constraints. While I see the charm in that approach, I will be almost programming-language-agnostic in this mewsing.

From a theory perspective, we can model both data and constraints as propositions, then use predicate logic to AND these propositions. For example, if we are defining a person P to the computer, and we are collecting a birthdate for that person, we might refine our definition of a person using a constraint Q that specifies a living person's age to be <= 140. To validate our data, we can then ask the question P ^ Q?

An aside: Predicate logic is relevant to the choice of data models. Simple propositions (e.g. those without lists) require only first order predicate logic, which makes for a simpler theory than if using higher order logic. However, it is simpler for me to work with property lists, for example, than to normalize as described in Is Codd Dead?. We will have to bring Occam in for that debate at some point so he can tell us whether he is relevant to this discussion or not.

A sculptor wields the chisel, and the stricken marble grows to beauty. WILLIAM CULLEN BRYANT

No matter how a software product implements one constraint or another, there are two parts to any software constraint: a specification and a service. A constraint specification is developed using a computer language, whether a general purpose language such as Java, a database sublanguage such as SQL, a declarative rules language such as OCL, or a homegrown or vendor-supplied proprietary language, perhaps specified using XML documents.

A full range of acronyms could also be relevant to the development of a constraint service that applies a constraint and returns a result. The constraint service reads in the constraint as part of the input (Q above) along with whatever needs to be verified against the constraint (P) and performs the test (P ^ Q?), taking appropriate action based on the test result.

Note re terminology, feel free to skip: Constraint services are often referred to as validation services, which is somewhat narrower. To clarify, I'm trying to use the term rules for the conceptual realm, the analysis aspects of the project, with constraints as the implementation of those rules in the software. Some rules' implementations, aka constraints by this terminology, do not constrain but assist, perhaps providing a suggestion, for example. The rest of the software to implement a constraint, other than the specification is what I am referring to as a constraint service, even in these cases when it does not constrain the user or the data. So, it might be better to skip the term constraint altogether and speak only of rules. Clear as mud? For the examples here, consider all mention of constraint services to be validation services even though I am using a broader term.

Sometimes the specification and service are interwoven, tightly coupled. If a user enters a birthdate, there might be code similar to age = (today - birthdate)/365; if age > 140 show errorMessage. Alternatively, there could be a constraint similar to (today - birthdate)/365 <= 140 specified as input to a constraint service, or rules engine. The latter is typically the case when constraints are applied by a DBMS while the former is often the case when a constraint is applied to user input. Implementation of the same constraint is often partitioned differently in different components of a single software product.
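
The two styles can be sketched side by side. This is an illustration under assumptions, not how any given DBMS or rules engine works: `check_birthdate_coupled`, `AGE_RULE`, and `constraint_service` are made-up names, and age is approximated as days/365 exactly as in the prose.

```python
from datetime import date


# Style 1: specification and service interwoven in one piece of code.
def check_birthdate_coupled(birthdate, today):
    age = (today - birthdate).days / 365
    if age > 140:
        return "errorMessage"
    return None


# Style 2: the specification (Q) is a separate value handed to a generic service.
AGE_RULE = {
    "name": "age <= 140",
    "test": lambda record, today: (today - record["birthdate"]).days / 365 <= 140,
}


def constraint_service(record, rules, today):
    """Test the record (P) against each rule (Q); return the names of the rules that fail."""
    return [rule["name"] for rule in rules if not rule["test"](record, today)]


today = date(2006, 5, 26)
failed = constraint_service({"birthdate": date(1850, 1, 1)}, [AGE_RULE], today)
```

In the second style the service never changes when a rule is added or dropped; only the list of specifications does.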

End-users, rather than developers, might maintain a constraint. We could give our users a means of changing the male-only elders constraint, for example. This could be implemented by something as simple as a system-wide WOMEN_ELDERS Yes/No flag or as complex as a general engine that interprets specifications such as if Person's gender = 'F' then mayBeElder = false. When an end-user might need to change a constraint, the specification is data, even if it is also code, often stored in a database.
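
Here is a rough sketch of a specification kept as data that an end-user could edit. The row layout (`if_attr`, `if_value`, `set_attr`, `set_value`) and the `apply_rules` function are hypothetical; a real engine would be richer, but the point is that changing the rule means changing data, not code.

```python
# Hypothetical rule rows, as they might be stored in a database table.
rules = [
    {"if_attr": "gender", "if_value": "F", "set_attr": "mayBeElder", "set_value": False},
]


def apply_rules(person, rules):
    """Return a copy of the person with every matching rule applied."""
    result = dict(person)
    for rule in rules:
        if result.get(rule["if_attr"]) == rule["if_value"]:
            result[rule["set_attr"]] = rule["set_value"]
    return result


person = {"name": "Pat", "gender": "F", "mayBeElder": True}
before = apply_rules(person, rules)

# The end-user removes the rule; no code is touched.
rules.clear()
after = apply_rules(person, rules)
```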

Moving along, have you seen the following? Web-based software that includes constraints 1) coded in JavaScript to get that quick response time when verifying entered data in the browser environment; these constraints are also 2) specified and applied in a language associated with the web or app server; additionally, such constraints are 3) specified to and applied by the DBMS on a database server. There are three separate specifications and three separate services for the same constraint. For example, our age <= 140 constraint might be specified using the JavaScript language, using OCL in XML, and using SQL. The associated constraint services might be coded in JavaScript and Java with the third being a component of a proprietary DBMS. Wow! All that and we are able to ensure that our software only models people who are not older than 140 years old.

I hesitate to mention that the birthdate might have been entered by someone working for another company whose software already validated this data up and down before passing it to your web service so you can do likewise, perhaps before passing it on. We verify, verify, then verify again to ensure quality data, of course, provided the date was entered correctly and no one (in Hollywood) lies about their age. Perhaps this is another example of measuring with a micrometer and cutting with an axe.

This approach to constraints results in bloated software with high maintenance costs. It is a sorry state, indeed, but I'll admit that I take a small amount of delight in the irony that the database world claims concern about redundant data while being a primary player in the repeated specification of the same constraint.

Why don't we validate our data once, so we avoid the triple-specification and triple-code for services situation? The simple answer is trust. Given JavaScript and a web browser UI as implemented today (where a user could use Greasemonkey to change data after it has been validated, for example), a middle tier cannot trust that data verified in the browser is the same data it sees. While there could be more trust between application code in a middle tier and the DBMS, there is no automated mechanism for certifying an application so that a DBMS could have a security feature that is able to accept data from a trusted source.

Without thinking about the possibility of certificates and signatures for each piece of data that has been verified, we are not going to stop the redundancy of repeatedly performing the verification. There is potential to eliminate the DBMS verification if the application software is owned by the same organization and adequate quality assurance is done for the DBMS to accept the validations of the middle tier (and why, pray tell, wouldn't that be the case?) But in order to keep some folks happy, let's just say that all of these validations must transpire. It surely is not also required that each constraint specification and the code for each service be written in different languages, right?

In designing software, we partition our solution, modeling using metaphors and related implementations for such things as objects, services, functions, or sets. We factor or refactor our software solutions so that we pull out frequently-used components for reuse. If we have encoded a constraint such as age <= 140, we could, in theory, reuse both the specification and the validation service wherever it is needed. Deciding what to factor out in the overall scope of a software project, how to partition our software, is part of the software design process.

What keeps us from factoring out constraint specifications and constraint services? If a SQL-DBMS tool is part of the software solution, constraints might be encoded in the database schema. With most DBMS tools it is not feasible, whether for performance or other reasons, to reuse the specification and constraint services of the DBMS throughout related applications. Even if it were, it is unlikely that all relevant constraints would or should be implemented there.

Some organizations choose to put a minimum number of constraints in the DBMS. I hate to say it out loud as I know it is not a popular opinion, but I favor restricting the DBMS schema to a bare minimum of constraints any time the same organization that owns the database also controls the applications that update it. I have yet to see a case where one organization permits another to write directly to its databases (although there might be such), so this is pretty much always my recommendation.

Back-end constraint services can be packaged with the organization's CRUD services used by all applications. The database validations and related manipulations can then use the same constraints as the applications. While working with database management systems lacking even foreign key constraints, it was once troubling how much cost savings there seemed to be using that development environment. I would never have guessed then that I would end up recommending such a strategy. Now I see that these constraints were still in the overall solution, even if lacking in the DBMS schema.

On the UI front, if JavaScript is required, it could potentially be generated from app server languages as is done with the new Google Web Toolkit for AJAX development and Ruby on Rails AJAX efforts, although debugging such generated code could be very unpleasant. I would prefer that validations for the UI be performed in the middle tier, with JavaScript using AJAX for asynchronous validations. When the constraint directly affects the UI widget, such as permitting a selection from a drop-down list, JavaScript will still need to get such constraint data to populate the UI, but it need not duplicate it.

We could certainly get closer to having a single specification for a constraint and a single service that validates based on the constraint than what is often the case in large software development efforts. Is it time to refactor our solutions so as not to lock constraint specifications and related services within the DBMS schema? Factoring constraints into our software design might just mean factoring them out.

Surf and Turf Reporting

2006-04-26T17:54:00.000-05:00

This is a continuation of the previous mewsing, Consuming Frozen Data. There is a new wave of surf and turf reporting in the business intelligence community.

3. Surf and Turf Reporting

Our final option is a blend of live, fluid data (surf) and extracted, fixed data (turf). Some of the advantages of reporting in the same environment as your production OLTP system and some of the advantages of reporting against data marts or extracted data are present with this option. Additionally, any new data (or data that were not originally extracted) is automatically accessible for reporting in combination with data that has been extracted.

The surf and turf strategy lays an agile foundation for analytical reporting. I am a fan of this approach; however, the pros will also be able to determine the cons of various options once we dive into the descriptions.

It should come as no shock to those of you who have been reading these mewsings that SQL and MultiValue database environments have different approaches to surf and turf. Fortunately, each permits the user to perform steps that somewhat simulate the approach of the other. Users often stick to the first approach they find that works, however. So the first of the below descriptions is what SQL developers are more inclined to do and the second is a technique Pickies often employ.

Materialized Views
An SQL View can be defined as a stored SQL query. A materialized View is the persisted result set of the query. For those cases where an SQL View can be used to specify the data to be extracted from a live system, running the appropriate commands to create a table based on this view results in a materialized view. Most Pick environments, such as U2, provide a means to accomplish a similar feat without use of SQL. The materialized view may then be used as any other base table in a schema.

Using this approach, a college or university can extract registered student data without including any accounts receivable information. Then if someone decides to identify clusters of students who failed to pay in full at the time of registration over the past ten years, the extracted data could be joined to the live accounts receivable data, including historical transactions, to perform this analysis.
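
The college example can be sketched with SQLite through Python's standard `sqlite3` module. SQLite has no materialized view statement, so the sketch creates a table from a view, as described above. The table, view, and column names (`student`, `ar_txn`, `registered_v`, `registered_2005fa`) and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, term TEXT);
CREATE TABLE ar_txn (student_id INTEGER, amount_due REAL);
INSERT INTO student VALUES (1, 'Ann', '2005FA'), (2, 'Bob', '2005FA');
INSERT INTO ar_txn VALUES (1, 0.0), (2, 250.0);

-- The view is the stored query; the table built from it is the frozen result (turf).
CREATE VIEW registered_v AS SELECT id, name FROM student WHERE term = '2005FA';
CREATE TABLE registered_2005fa AS SELECT * FROM registered_v;
""")

# Live data keeps changing after the extract (surf).
conn.execute("DELETE FROM student WHERE id = 2")

# Join the frozen extract to live accounts receivable data.
unpaid = conn.execute("""
    SELECT r.name, a.amount_due
      FROM registered_2005fa AS r
      JOIN ar_txn AS a ON a.student_id = r.id
     WHERE a.amount_due > 0
""").fetchall()

frozen_count = conn.execute("SELECT COUNT(*) FROM registered_2005fa").fetchone()[0]
live_count = conn.execute("SELECT COUNT(*) FROM registered_v").fetchone()[0]
```

The extract carries no accounts receivable columns, yet the later question about unpaid balances can still be answered by joining it to the live table.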

Reporting is done in support of decision-making. If a decision requires that only data frozen at a particular point in time be used as input, then that decision and related reporting must happen only after there has been a project to identify and extract the required data. There is a cost to this which is sometimes justified. We, as a profession, want to provide the very best data possible for decision-making, but we must also be business-savvy in our cost-benefit analyses.

Very often a small subset of the data required for a particular decision must come from data frozen at a specific point in time, while additional live data contributes a sufficient approximation to make an informed decision. There is a lot of data that rarely or never changes. If it has changed since the time of the extract, it might even be a correction. There is a time to cringe at the words good enough and a time to recognize that quality includes fiscal stewardship. We don't need to measure with a micrometer before we cut with an axe.
Savedlists
MV developers and end-users alike often decouple the restriction process, narrowing down the input for a report to a fixed set of primary keys, for example, from the projection process which yields a set of attributes related to the identified keys. In Pick, the restriction process is called the selection. While these two processes can be handled together in the query language, Pick doesn't work with result sets, in general, but with select lists a.k.a. savedlists, the result of restricting the data to a set of records. These savedlists typically store the keys to the entity (file) of interest. For example, a school might keep a savedlist with the student IDs for all students registered for the fall of 2005 and another for the spring of 2006.

These savedlists are, effectively, materialized views of primary key data. Each is similar to a snapshot in time of an index on a subset of rows. After executing a SELECT statement, a SAVE-LIST or SAVE.LIST command will save the resulting keys under the name provided. Then after retrieving a savedlist (GET.LIST or GET-LIST), the next command or query is executed only against that subset of keys. The result of a selection is not always saved, since it can be acted upon immediately, but when it is stored in a savedlist we have a unique, minimalist Pick twist on surf and turf reporting.
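
For readers without a Pick system at hand, the select, save, and get cycle can be imitated in a few lines. This is only a simulation: `select`, `save_list`, `get_list`, and `list_report` are stand-in functions I made up, not the MV commands themselves, and the student records are sample data.

```python
saved_lists = {}  # stand-in for wherever the MV system keeps its savedlists

students = {
    "1001": {"name": "Ann", "term": "2005FA"},
    "1002": {"name": "Bob", "term": "2005FA"},
    "1003": {"name": "Cy", "term": "2006SP"},
}


def select(file, predicate):
    """Restriction: return the keys of the records that satisfy the predicate."""
    return sorted(key for key, record in file.items() if predicate(record))


def save_list(name, keys):
    saved_lists[name] = list(keys)


def get_list(name):
    return list(saved_lists[name])


def list_report(file, keys, attrs):
    """Projection: read live attribute values for a fixed set of keys."""
    return [[file[key][attr] for attr in attrs] for key in keys if key in file]


save_list("FALL05", select(students, lambda r: r["term"] == "2005FA"))

# Live data changes after the list was saved.
students["1002"]["term"] = "2006SP"

report = list_report(students, get_list("FALL05"), ["name", "term"])
```

The keys are frozen while the attribute values are read live, which is the surf and turf mix in its smallest form.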

While materialized views are not always heavily used in SQL shops, SAVE-LIST and GET-LIST are bread and butter for MV shops. It greatly improves performance if you already have a savedlist with which to start, much as an index does when reporting on live data. Unlike an index, the savedlists are not updated once they have been built. They are minimalist data marts used to run a set of reports against a consistent set of IDs.

I have seen colleges use only savedlists for their census data extracts, saving these while running their census reports using option 1 from the last mewsing. If any other questions come up related to this census, they consider the live data, including historical records, adequate for the related decision-making. I'll admit that I would advise against this approach for census data, as I think the benefits outweigh the costs for saving some of the important attribute values at the time of the census. In this case a materialized view would likely be a better strategy.

Aside from census data, a registrar's office could create a fresh savedlist of registered students on a nightly basis, and other departments on campus could use that savedlist for their work too. No one other than the registrar's office would then need to understand the criteria for determining who, precisely, is registered for this term. One advantage to using savedlists rather than full data marts is that these savedlists can be fed to update processes as well as reporting processes. Depending on the application software, a savedlist could be entered into a process that would set the graduation date for everyone stored in a savedlist, for example.

Another advantage of savedlists is that often end-users can decide what they want to put in a savedlist. Of course there are different approaches to security in different shops, but in places where end-users have such access, these power users are empowered with respect to their data. Without actually creating new tables or files (savedlists are handled differently) and without writing data extraction procedures, MV end-users are often using a surf and turf strategy with their data from day to day.
Materialized Attributes
With a current industry focus on operational data stores which permit queries against current or near current data values, there is a return to thinking about how to do reporting against your production data banks without extracting data. Although such redundancy was once considered poor form, storing data derived from other data could be just the ticket to improved reporting performance. Aggregates and other derivations are costly when they must be performed on a large data set all for one report. If they can be computed in the background whenever there is a change that would prompt such, the added cost of integrity for the redundant data might be more than balanced by the savings in reporting. Performance against the live data can then be as good as from a data mart.

For example, if we materialize the attribute CURRENT_GPA for each student, we can run reports that require this data without having to read through all courses for all terms for each student in order to compute the GPA on the fly. A policy of no redundant data, held tightly in many shops, is one factor prompting huge redundancies in the form of data marts and warehouses. Yes, redundancy does have a cost, and a few well-chosen materialized attributes could save you from the more excessive redundancies found with extracted data.
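
One way to keep such an attribute current is a trigger, sketched here with SQLite via Python's `sqlite3` module. The schema (`student`, `course_grade`, `current_gpa`) and the credit-weighted GPA formula are assumptions for the example; only inserts are handled, so updates and deletes of grades would need similar triggers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (id INTEGER PRIMARY KEY, current_gpa REAL);
CREATE TABLE course_grade (student_id INTEGER, credits REAL, grade_points REAL);

-- Recompute the stored GPA whenever a grade is added for that student.
CREATE TRIGGER grade_added AFTER INSERT ON course_grade
BEGIN
    UPDATE student
       SET current_gpa = (SELECT SUM(credits * grade_points) / SUM(credits)
                            FROM course_grade
                           WHERE student_id = NEW.student_id)
     WHERE id = NEW.student_id;
END;

INSERT INTO student (id) VALUES (1);
INSERT INTO course_grade VALUES (1, 3, 4.0);
INSERT INTO course_grade VALUES (1, 3, 2.0);
""")

# Reports read the stored value instead of scanning every course for every term.
gpa = conn.execute("SELECT current_gpa FROM student WHERE id = 1").fetchone()[0]
```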

Surf and turf is likely easier to pull off successfully if not constrained by SQL

It surely isn't a new technique, but there is a revived interest in strategies used to materialize (compute and store) derived attributes. A colleague pointed me to Sybase as an example of a product that has tools for materializing derived data. I browsed the information on it without trying it out (lazy or busy, you can choose), and I'm impressed! With other products developers can get the value of a user-defined function, virtual field, or stored procedure and store it using application code. They might do this with triggers or CRUD services that adjust the materialized values when the underlying values change.

In case you are wondering where this is headed in relation to the overall topic of these mewsings, I'll give a hint, without yet providing any arguments. In short, a surf and turf strategy can result in significant savings, but is likely easier to pull off successfully, with smiling end-users, if you are not constrained by a SQL-based environment.

Getting back to today's topic, I have often been a bit queasy about the added costs vs. benefits for data marts and warehouses. Each situation is different, but in many cases, this added cost can be reduced significantly by doing surf and turf reporting.

Consuming Frozen Data

2006-04-20T08:49:00.000-05:00

Organizations often freeze data for reporting purposes. By the way, before it was called business intelligence (BI), we called all of it reporting, which I am still inclined to do.

Higher education often freezes data on designated census dates to get an account of the students who are registered for a particular term. Government, registrar, and board reports are run against this data, with longitudinal analysis showing changes from term to term and year to year. Cross-tabs of faculty broken down by sex (a favorite example until gender became the preferred term) and reports of students sliced and diced every which way are run against the census data.

I have observed many approaches to such reporting and will present three distinct methods, two in this entry and one in the next, that are currently employed by organizations. Many colleges and universities employ more than one of these approaches for their census reporting.

1. Keep Out Reporting

This is the basic old fashioned approach of ceasing all maintenance of registration data, rippling through the campus so a considerable amount of data entry is halted for the time it takes to run reports. Reports are often run at night to minimize disruption.

Sites recognize that they sometimes need to re-run the reports, perhaps after repairing some data that has not been properly cleansed for these reports. As a slight digression, I'll mention that the preparation of reports often includes specifying or coding the report in connection with cleansing the data. Derived data prepared specifically for a particular report might be used to virtually cleanse or tailor the data, for example. Data and code dance together, as is the case with software in general.

One advantage to freezing the data in this way for the purpose of running reports is that these exact same reports can be run at any time against the live data. This is helpful when preparing for the census. Prior to freezing the data, the reports can be run to ensure that data values, derived data, and reports are exactly what is needed for this census.

While these reports will end up on web pages, as PDF files, or even greenbar, sometimes output from the reports is also captured as data. This data might be downloaded to Excel, for example, for future use. That brings us to the next topic.

2. Data marts

Before we had data marts we had extracts. I'll make some not-altogether-standard distinctions between these two terms. A data mart might be hosted by a DBMS other than the one used by the OLTP system, where extracts were more often hosted by the same DBMS, by the file system of the host OS, or downloaded to client software, such as Excel. Some systems employing MS SQL Server have data marts hosted by Oracle and vice versa. These would not typically be called extracts.

Another possible distinction is the transformation of the data, sometimes denormalizing or shaping into fact-dimension cubes for the data mart. With an extract, there might be a simple transformation to drop attributes that are unnecessary for reporting purposes. When the data is not reshaped at all, the extract is referred to as a snapshot. Often for data marts, the logical data model is changed considerably.

Additionally, an extract comes from a single data source, where a data mart could be populated from multiple sources. There is sometimes a very complex ETL process for extracting, transforming, and loading data for a data mart, where you are more likely to talk about running the extract.

The biggest distinction just might be whether or not an organization sinks non-personnel dollars into data mart products. If you buy tools, you have invested in data marts. If not, you must have extracts.

Now, I could mention data warehouses as a separate bullet point because they are decidedly different. But I have rarely recommended anything I call a warehouse due to the required expansive scope of such a project. I would rather plan for a series of data mart projects that could, over time, be perceived as a data warehouse. But, yes, there are some warehouses out there in higher ed too. In any of these cases, the verb is extract whether the target is an extract, data mart, or data warehouse.

Benefits of extracting data include the ability to run reports against the same set of census data from now on, typically without an adverse effect on the transaction system. Performance is one of the big reasons people extract data. The use of different reporting tools is another. Users of reporting tools such as SAS often like to have the data in a SAS data set, for example.

Data marts are good for longitudinal analysis. If you do not freeze your census data, you are unlikely to report the exact same figures in the future, making it hard to compare this year to last. Additionally, you can model the data to optimize for reporting and aggregating data--cubes, for example.

Data mart projects can range from reasonably priced to grossly expensive. Techniques to help minimize the cost include hosting the data mart using the same DBMS as the transaction system and even retaining the structure of the data for the data mart. This can permit the same reporting tools and even the same reports to be run against either live or frozen data. As a rule of thumb, the closer a data mart resembles an extract, the less expensive it is.

Here's the rub. This is problematic with both SQL and not-highly-relational, such as MultiValue, databases. It is rarely advisable for a SQL-DBMS system to retain the data model of the transaction system for the purpose of analytical reporting because the relational model of data is not conducive to high performance, easy-to-specify analytical reporting. With MV databases, the data are typically structured to be able to use the same shape for both OLTP and OLAP, but you cannot specify such databases as a data source from the full range of reporting tools. In this case, data marts are often reshaped and rehosted in order to make the data more easily accessible from tools that work exclusively with SQL data sources.

Can you sense my frustration? As an industry, we have adopted standards that trap us into all of the high-wire acts and costs we are sinking into reporting. Getting reports out of our systems was once the easy part! From a database perspective, it is read-only, for goodness sake. One thing that changed between then and now is the proliferation of SQL and the relational model. If systems using a non-relational data model were a standard data source, we could have bigger bang for the buck reporting solutions. We could run reports against extracts hosted in the same shape and DBMS as the source data. I have seen the benefits for organizations using MV databases combined with one of the handful of MV-specific reporting solutions.

Enter: XML. The MV data model is, effectively, a subset of the XML data model. Any query language that reports against multiple XML documents could, in theory, be pointed at any MV database, provided the database vendor accommodated such. But I gotta say, XQuery is dog-ugly compared to the language formerly known as GIRLS. It does work with multivalued data and employs a two-valued logic, however, so I'm encouraged by that. If an MV database provider with adequate resources were on top of this situation, they could help standardize a simplified, but proven, subset of XML for persistence, with both OLTP and OLAP capabilities.

We have now addressed the approaches of reporting against live data and against data that has been extracted for reporting. I'm hoping that you are thinking that these two approaches cover the mix, because they don't. We will look at a third approach to consuming frozen data in the next mewsing.

Pascal Loses Wager

2006-04-11T21:45:00.000-05:00

I'm on the road right now and stopped in at Sylvia's for the night. Sylvia decided to google me this afternoon instead of cleaning the guest bathroom. She was laughing when I arrived after having read how I was infamous. She wondered how I had offended this Pascal character after taking a look at this and this. In case these links change on the dbdebunk site, I'll just note that they basically claim that I am stupid, ignorant, and vociferous.

The following is a rebus response to Mr. Pascal's ad hominem attacks. Hover over pictures to get the word if you don't want to work the puzzle out for yourself. I hope you enjoy, but if you find it sophomoric, well, you should meet Sylvia.

[Rebus puzzle images not preserved.]

Note: I wrote this a few weeks ago. I requested input from a few folks regarding this blog. Roughly half thought it was great and the other half thought it was a bad idea. I agree with others who have been targeted by Pascal that he and his approach to others in the industry should not be left unaddressed. I've decided to take the risk and post this because it fits my style to respond, but not respond in kind. After all, I might be ignorant of many things, but I'm not stupid.

See comments.

Better to Have No Values

2006-03-29T09:19:00.000-06:00

The topic at hand is NULLS. Some professionals think the best option for recording missing data is to use a NULL to mean no value, an approach implemented by many database management systems. SQL then recognizes missing data when it reads a marker referred to as a NULL. Working with missing data requires both of these components: how one specifies the missing information to the database and how the database languages work with the missing information.

With the SQL approach where NULL does not refer to a value, but to the lack of a value, one attribute tagged as NULL is not equal to another attribute tagged as NULL. When you compare two values, x and y, if either of these is NULL, then your comparison is neither true nor false, but a third logical value of I dunno or UNKNOWN.

In other words, SQL employs a three-valued logic (3VL). There is some good reasoning behind this. Before we turn again to bowling scores, I'll use a very simple example. If my natural hair color is recorded as missing information and Sylvia's natural hair color is also missing, then surely we do not have enough information to say either that my natural hair color matches Sylvia's or that it doesn't. It's a mystery. So, if we evaluate the expression Sylvia has the same natural hair color as Dawn we don't know if it is true or false. We need a third logical value, sometimes represented as unk or UNK for unknown.
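Here is a minimal sketch of that third logical value in action. It assumes SQLite (through Python's built-in sqlite3 module) as a stand-in SQL-DBMS, and the helper name sql_compare is mine, purely for illustration. SQLite hands true back as 1, false as 0, and UNKNOWN as NULL, which Python shows as None; other products may display the result differently.

```python
import sqlite3

def sql_compare(left, right):
    """Evaluate `left = right` inside SQLite and return the raw result.

    SQLite reports true as 1, false as 0, and UNKNOWN as NULL (None in Python).
    """
    conn = sqlite3.connect(":memory:")
    try:
        return conn.execute("SELECT ? = ?", (left, right)).fetchone()[0]
    finally:
        conn.close()

# Two known hair colors can be compared; two missing ones cannot.
known_match = sql_compare("brown", "brown")
known_mismatch = sql_compare("brown", "red")
both_missing = sql_compare(None, None)
```

The comparison of two missing hair colors comes back as neither 1 nor 0, which is the "I dunno" described above.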

A lot of software is built with this strategy for missing values, but there are other options. The Pick, aka MV, query language employs a two-valued logic (2VL), as do most programming languages. Much has been written about NULLs and n-valued logics, particularly within the context of the relational data model. The RM does not require the 3VL of SQL, as author Chris Date has made clear. I will not revisit those discussions here, contributing only this glimpse into Pick's approach.

Step one in understanding this 2VL is to suspend the idea that NULL is the lack of a value or an indicator for no value. In MV, a NULL is a value. There are exceptions as various MV products address their interface to SQL differently, but I will simplify and focus on what is common to all such products. Think of a NULL as indicating that there are no values for a particular variable. One way to provide a mathematical model for this approach is to perceive the NULL as an empty set, aka null set.

This empty set is implemented in MV as an empty list. MultiValue solutions permit the value of a variable, or the intersection of a row and column (loosely speaking), to be a list of values as indicated in my last mewsing. The list could have 0, 1, or Many elements. If the cardinality is 0, the value is NULL. If the cardinality is 1, the attribute has a single value. If the cardinality is greater than 1, it is more obviously a list. But no matter what the cardinality, the value of an attribute can be modeled as a list, whether empty, single-valued, or multi-valued.

If there are no values in the list, set, or bag, the value of the MV attribute for a particular record would be NULL. If a NULL is encountered by a user when looking at the data, the meaning of that value is no values. So, if Dani has a record of data and the value of her variable scores is NULL, then the corresponding proposition would be: Dani has no values for her score or Dani has no bowling scores. It would not be: Dani's score has no value. Got it?

Because each attribute in Pick does have a value, even if that value is no values, logical expressions are easily evaluated. Just true or false, that's it, two values and the obvious ones at that. If there is ever a test for equality, a list of no values is equal to another list of no values. For less than or greater than comparisons, a NULL is less than everything other than another NULL no matter what type of data is compared.
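A rough sketch of these rules, assuming we model each attribute value as a Python list (this is my own illustration, not how any Pick product is implemented). Python's built-in list comparison happens to agree with the rules above: two empty lists are equal, and an empty list compares as less than any non-empty list. The variable names are hypothetical; Chris's scores come from the bowling example.

```python
# Attribute values, each modeled as a list of zero or more values.
dani_scores = []                  # "no values", the MV-style NULL
shirl_scores = []                 # also "no values"
chris_scores = [120, 183, 144]

def is_null(value):
    """In this sketch an attribute is NULL when its list has no elements."""
    return len(value) == 0

# Equality: a list of no values equals another list of no values.
nulls_are_equal = dani_scores == shirl_scores

# Ordering: the empty list is less than any non-empty list,
# whatever type of data the other list holds.
null_below_numbers = dani_scores < chris_scores
null_below_text = [] < ["Dani"]
```

Every one of these comparisons evaluates to a plain True or False; there is no third outcome to handle.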

Some might be resistant to this broad use of a NULL value as it gives us no insight into why there are no values for an attribute. If there is a reason to collect information on why an attribute has no values, additional attributes could be included in a logical data model.

In our scenario, we don't know if Dani bowled the games and the scores have not yet been recorded, if Dani bowled and the score was recorded incorrectly, if Dani has not yet bowled, if Dani was sick with bird flu and missed league bowling this week, or if Dani doesn't ever bowl. In the first two cases, our data as recorded is incorrect. So we could interpret the NULL value for scores as Either Dani has no bowling scores or our data is incorrect but that is similar to the interpretation of any value. We could say that either this bowler's name is Dani or our data is incorrect. So that is not particularly helpful. The interpretation of this NULL value, then, is that Dani has no bowling scores. That's it. If there is a NULL value for an attribute, it means there are no values for that attribute.

Our resulting two-valued logic comparisons are quite simple, useful, and meaningful. If we want to list all people with their series total, ordered lowest to highest, we would get those people who have NULL values listed first.

LIST BOWLERS SCORE BY SCORE

If we want only those with values, we could request only those WITH SCORE:

LIST BOWLERS WITH SCORE SCORE BY SCORE

The first SCORE is testing for any values for the variable SCORE, the second shows the value (the ID for the bowler is listed automatically in the output), and the last SCORE is for the ordering. Using this approach, Dani and Shirl have the same value for their second game score. I'm certain this rubs some SQL users wrong. Instead of Dani and Shirl having no score, which would mean the question is meaningless, Dani and Shirl have no scores or an empty list of scores for the second and third games in the series.
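The two queries above can be approximated in Python as a sketch. The BOWLERS data here is hypothetical (Chris's and Pat's scores are borrowed from the bowling example, in an order I picked), and ordering by the whole list is my simplification; a real MV product has its own rules for sorting on a multivalued attribute.

```python
# Hypothetical BOWLERS file: bowler ID -> record with a multivalued SCORE.
bowlers = {
    "Chris": {"SCORE": [120, 183, 144]},
    "Pat": {"SCORE": [110, 85, 130]},
    "Dani": {"SCORE": []},           # no values: the MV-style NULL
}

def list_by_score(file, with_score=False):
    """Return (id, SCORE) pairs ordered by SCORE.

    with_score=True keeps only records whose SCORE list is non-empty,
    loosely mirroring the WITH SCORE clause.
    """
    rows = [(key, record["SCORE"]) for key, record in file.items()]
    if with_score:
        rows = [row for row in rows if row[1]]
    return sorted(rows, key=lambda row: row[1])

everyone = list_by_score(bowlers)                      # NULL values sort first
scored_only = list_by_score(bowlers, with_score=True)  # Dani is left out
```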

This logic can be applied to attributes that have list values as well as those considered to be single-valued since these can also be modeled as lists, even if short ones. You can likely see how there could still be some misunderstandings in interpreting NULLs of this ilk, but it really is an amazingly simple and useful approach to what is modeled as the lack of values in SQL. Instead of having no value, we have a value that is a list with no values.

With Data Modeling, What's Your Bag?

2006-03-25T20:38:00.000-06:00

I've been avoiding some definitions for a while now, but when writing about NULLs and two-valued logic, I digressed into explanations of a few terms. Since I'm writing on the road for a few weeks, I thought I'd split out some terminology into this blog and follow with the NULLs blog next. I'll toss in an ending on this one to tie in the title, but otherwise this mewsing is a glossary.

What I know about bowling could fit in a blog entry, but I just adore the scoring system, so I'll use that for the examples, both for these descriptions and for the NULL handling to follow.

list

The scores in the above bowling series that correspond to the three games bowled by one person; such as Chris's 120, 183, 144; compose a list. Each such list of game scores could be the value of a variable named scores. This conceptual list could be implemented with a computer language using an array like score[3], for example.

The length of a list is the number of elements in it. Computer languages vary as to what list implementations permit variable lengths. Other variations in the implementation of lists include whether list members can be accessed directly with an index or require a sequential read. Providing different features for working with a list could prompt one to work with it as a queue or a stack, for example. Lists are a wildly popular structure in any general purpose programming language. They are missing from any implementation of the Relational Model (RM), however.

set

If we remove the ordering from our lists, we get a set of bowling scores for Pat and Chris that includes all three scores. However, the set of scores for Beth is {150, 160} because each element of a set must be distinct from the others. The set of Pat's scores could be written as {110, 85, 130} or as {110, 130, 85} because the set has no implied ordering. In the case of a bowling series, it would be a bad idea to treat these scores as a set in this way, given that we would then lose the information about how many games had a particular score, as illustrated in the case of Beth. A list can be transformed into a set, however, by adding an ordinal attribute to identify the first and nth value. We could, therefore, talk about Chris's set of bowling scores in this series as {(1, 120), (2, 183), (3, 144)}.

A more obvious set might be our bowlers = {Chris, Pat, Dani, Shirl, Beth}. Of course, when specifying our set, we would include a unique identifier for the elements, often referred to as a key, so that our set might look like bowlers = {(11235, Chris), (628628, Pat), (11111, Dani), (223344, Shirl), (98765, Beth)}. We saw above that we can turn a list into a set. We can turn our set into a list by placing it in some order, such as an alphabetic order of Beth, Chris, Dani, Pat, Shirl.

bag

Also known as a multiset, a bag is like a set in that it has no ordering and like a list in that it includes duplicate values. The bowling bag for Beth is [160, 150, 150] which is equal to [150, 160, 150] and [150, 150, 160]. It is as if you tossed the values into a bag, pulling them out in any order. We could turn this bag into a set by adding a quantity to each value such as Beth having a set {(150, 2), (160, 1)}. We could turn it into a list by adding order to it, such as numerical order with Beth's list being 150, 150, 160.
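The three structures and the conversions between them can be sketched with Python built-ins. Using collections.Counter as the bag is my choice for illustration, and the game order in beth_list is hypothetical since only her bag of scores is given.

```python
from collections import Counter

beth_list = [150, 160, 150]     # list: order and duplicates both matter
beth_set = set(beth_list)       # set: duplicates collapse, no order
beth_bag = Counter(beth_list)   # bag: duplicates kept, no order

# Bag equality ignores the order in which values went in.
same_bag = Counter([160, 150, 150]) == Counter([150, 150, 160])

# List -> set, by adding an ordinal to each value (Chris's series).
chris_as_set = set(enumerate([120, 183, 144], start=1))

# Bag -> set, by adding a quantity to each distinct value.
beth_bag_as_set = set(beth_bag.items())

# Bag -> list, by imposing numerical order.
beth_sorted = sorted(beth_bag.elements())
```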

We talk about SQL result sets. However, because a result set may have duplicate rows, these would more accurately be termed result bags (or multisets, but that term is not as fun so I use it only like I use table tennis instead of ping pong).

relation

Given my approach to the relational theory of data, I prefer to provide a clean, clear, crisp definition of a relation from mathematics, leaving it to others to embellish it as befits their application of this mathematical concept.

A relation is a subset of the set of ordered tuples (A1, A2, ... Am) formed by the Cartesian cross-product of sets S1 x ... x Sm where each An is an element of Sn.

Note that a relation is a set. Bags are not sets. Therefore, SQL result sets are not relations.

The prior example of a set with couples consisting of bowler id and bowler first name is a relation.

domain

Given a relation R, a domain is a set Sn such that for each tuple (A1, A2, ...An, ...Am) in R, An is an element of Sn.

The example relation has a domain of bowler ids and another domain of bowler first names.
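As a small sketch of these two definitions, the bowlers relation can be checked against the cross-product of its two domains. The function name is_relation is hypothetical, and treating the domains as exactly the five ids and five names is a simplifying assumption.

```python
from itertools import product

# The two domains of the example relation.
bowler_ids = {11235, 628628, 11111, 223344, 98765}
first_names = {"Chris", "Pat", "Dani", "Shirl", "Beth"}

# Every ordered pair (id, name) that the two domains can form.
cross_product = set(product(bowler_ids, first_names))

# The bowlers relation from the text: a subset of that cross-product.
bowlers = {
    (11235, "Chris"), (628628, "Pat"), (11111, "Dani"),
    (223344, "Shirl"), (98765, "Beth"),
}

def is_relation(candidate, *domains):
    """True when every tuple in `candidate` is drawn from the product of `domains`."""
    return set(candidate) <= set(product(*domains))
```

Because the relation is stored as a Python set, adding the same tuple twice changes nothing, which is the property a bag of rows lacks.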

function

A binary mathematical relation with at most one b for each a in (a,b). Note that either or both a or b could be a relation, for example.
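A minimal sketch of that test, with a hypothetical helper name: a set of (a, b) pairs is a function when no a shows up with two different b values.

```python
def is_function(pairs):
    """True when no first component `a` is paired with two different `b` values."""
    seen = {}
    for a, b in pairs:
        if a in seen and seen[a] != b:
            return False
        seen[a] = b
    return True

# Each bowler id maps to exactly one first name, so this relation is a function.
id_to_name = {(11235, "Chris"), (628628, "Pat"), (98765, "Beth")}

# One bowler paired with several scores is a relation but not a function.
bowler_to_score = {("Chris", 120), ("Chris", 183), ("Chris", 144)}

# Here b is itself a composite value (a tuple of scores), and it is still a function.
bowler_to_series = {("Chris", (120, 183, 144)), ("Pat", (110, 85, 130))}
```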

type

A type is a domain, so it is a set, plus related functions. Some folks define domain and type to be identical, typically tossing operations into the definition of each. You might note that this is either similar to or the same as the term class, depending on your definition of that term.

scalar

As with many of the terms above, this term may be applied to either a variable or a value. A scalar variable can hold only one value at a time. A scalar value is a single value. It is the opposite of composite. List, set, bag, relation, and function are examples of composite types. Common scalar types defined in computer languages are char, int, float, double, and boolean.

Take a look at this scorecard for a single game of bowling. There are many ways to pour the scalar values into composite types when modeling these data for use in a software application. If you were not tied to everything being a relation, what would be your first take on how to model frames in a game of bowling? That strikes me as a list. Like one of those how many triangles do you see puzzles, how many bags can you find? Sets? Lists?

To set us up for the NULLs discussion, I'll provide a little information on the MultiValue (MV) model. Conceptual sets are implemented in MV as either sets or lists. If modeling a strong entity, they are typically modeled as sets. When modeling a property of a strong entity, they are often modeled as lists. Bags and lists are always modeled as lists unless they are first changed into sets by adding attributes. Sets are top level data structures, while lists are always child structures.

The translation from the conceptual data model to the MV logical data model (see this mewsing if confused by these terms) includes modeling a conceptual set as a list. If an implemented list is a set conceptually, the developer must provide the logic to test if a value is already present before adding it. If it is a semantic bag or set, the developer needs to ignore the ordering. This is a bit more difficult because the query language only checks equality of lists, not bags or sets. A test for equality of [150, 160, 150] to [150, 150, 160] would evaluate to false. A developer might choose some order for this bag, turning it into a list, to avoid this complexity.
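The developer-supplied logic described above might look something like this sketch. The helper names are hypothetical and the code is Python rather than Pick/BASIC, so take it as an illustration of the idea only.

```python
from collections import Counter

def add_unique(values, new_value):
    """Append to a list that stands in for a conceptual set.

    The membership test before appending is what keeps it a set.
    """
    if new_value not in values:
        values.append(new_value)
    return values

def bag_equal(left, right):
    """Compare two lists as bags: same values with the same counts, order ignored."""
    return Counter(left) == Counter(right)

def set_equal(left, right):
    """Compare two lists as sets: same distinct values, order and counts ignored."""
    return set(left) == set(right)

# Plain list equality is order-sensitive, so this is False.
as_lists = [150, 160, 150] == [150, 150, 160]

# Choosing an order for the bag turns it into a list that compares as expected.
as_sorted_lists = sorted([150, 160, 150]) == sorted([150, 150, 160])
```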

While the full collection of information about whether a value is a conceptual list, set, or bag is not available for implemented lists in an MV system, you can use the MV array (list) data type to implement all three concepts. If you try your hand at modeling data this way, you might find that it is often conceptually simpler than modeling exclusively with sets. I, for one, like modeling with data sets that have nested lists. At the risk that you might be too young to be hip to this slang, I'll ask: when it comes to modeling data, what's your bag?

In Every Job That Must be Done There is a Data Element of Fun

2006-03-12T22:10:00.000-06:00

In every job that must be done there is an element of fun. You find the fun and snap, the job's a game. -Mary Poppins

While SQL has many advantages over the languages stemming from GIRLS, the reverse is also true as the latter is quite charming. I will continue my exploration of some of the advantages of the GIRLS family of query languages. I am not proposing that the entire industry adopt GIRLS, but there are things we can learn from this language and related data model.

There are fewer files in an MV-based application than tables in a similar SQL-DBMS solution. This makes sense given that MV solutions are not in 1NF, so properties with multiple values need not be split out into separate tables. With more nouns modeled as properties of entities, more files than tables are related to primary entities of people, places, things, or events. There are fewer nouns modeled as weak entities in an MV solution.

Users of the GIRLS-related query language ask questions about the data by way of exactly one file at a time, somewhat analogous to using a single SQL view in a query. That doesn't mean they can only report against base (stored) data identified in that file dictionary, but that they are viewing the data in the context of a single entity at a time. Just as a secretary in the 1950's picked a file cabinet in which to find information, the user of the MV query language chooses a single file for each question they ask.

Each MV file, or entity, provides one lens through which a user looks at the world of data. This might seem very restrictive, but just like Mary Poppins hauling a large coat rack out of her small carpet bag, data of all sorts can come from each such file view of the data. Through the eyes of a file dictionary, a user can query data stored under that name, or data derived from any of the data in the system, including aggregated data.

Each dictionary is like a Lands' End catalog section related to one entity. The user can shop from any one of the available catalog sections in a system. Did you want to be in the STUDENTS section, FACULTY, CLASS_ROSTERS, or ACCOUNTS_RECEIVABLE section...? Once you are there, you can pick the data you want from the list of data elements on-hand. Some data are cross-listed in multiple catalog sections.

A file dictionary is a logical view of data. We could see not only a student ID, name, and class level, but a computed GPA, all majors, and the name of each of the students' advisors through a single dictionary. Anything you can think to ask about a student, you could ask the STUDENTS entity.

LIST STUDENTS NAME CLASS GPA MAJORS ADVISOR_NAME

As mentioned in the previous mewsing, this listing will show one result (not necessarily one line) per student. The majors and advisor names will show on multiple lines when there are multiple entries for these.
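To make "one result, not necessarily one line" concrete, here is a rough Python sketch of that output shape. The STUDENTS records, the column width, and the function names are all hypothetical; a real MV product formats its listings in its own way.

```python
# Hypothetical STUDENTS records; MAJORS and ADVISOR_NAME are multivalued.
students = {
    "1001": {"NAME": "Pat", "CLASS": "JR", "GPA": 3.4,
             "MAJORS": ["MATH", "PHYSICS"], "ADVISOR_NAME": ["Lee", "Ortiz"]},
    "1002": {"NAME": "Chris", "CLASS": "SO", "GPA": 3.1,
             "MAJORS": ["HISTORY"], "ADVISOR_NAME": []},
}

def as_list(value):
    """Treat every attribute as a list: empty, single-valued, or multivalued."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]

def list_file(file, columns):
    """Return one block of text lines per record; multivalues spill onto extra lines."""
    blocks = []
    for key, record in file.items():
        cells = [[key]] + [as_list(record.get(col)) for col in columns]
        height = max(len(cell) for cell in cells)
        lines = []
        for i in range(height):
            row = [str(cell[i]) if i < len(cell) else "" for cell in cells]
            lines.append("  ".join(text.ljust(10) for text in row).rstrip())
        blocks.append(lines)
    return blocks

report = list_file(students, ["NAME", "CLASS", "GPA", "MAJORS", "ADVISOR_NAME"])
```

The report holds two blocks, one per student. Pat's block runs to two lines because of the second major and advisor, while Chris's stays on one.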

It is likely worth a full exploration in a future mewsing, but I'll just make a quick note that because of this approach to the data, metadata, and the related query language, Pickies have been doing OLAP directly against their OLTP data since the '70's, without a need to reshape it first. This is very cool. Of course they also create data marts for some analysis when appropriate, such as when data need to be frozen. A file with a compound key (called a multi-part key in MV) is automatically a virtual fact table from which you can retrieve or derive any data you would want from the dimensions, forming a virtual fact-dimension or cube perspective of the data.

One downside for this approach is that the carpet bags are not pre-packed with everything. You must add to the vocabulary for the queries (the schema) by defining virtual fields for anything that is not defined as a stored data element under this file name. So the file dictionaries become very long catalogs of base or stored data and definitions for derived data. To help develop complex virtual elements, which I'll refer to as taking your medicine, the spoonful of sugar is that you may refer to subroutines written in Pick/BASIC, a very data-savvy general purpose programming language. So, the sky's the limit, but typically each virtual element dictionary item must be constructed either in the dictionary or inline in a query. Unlike SQL, GIRLS is not powerful enough to write every query strictly from the base schema for stored data.
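The idea of a dictionary that mixes stored and virtual elements can be sketched in Python, with ordinary functions standing in for the Pick/BASIC subroutines. Everything named here (the records, GRADE_POINTS, the gpa function, the query helper) is hypothetical and only meant to show the shape of the thing.

```python
# Hypothetical stored data for a STUDENTS file.
students = {
    "1001": {"NAME": "Pat", "GRADE_POINTS": [4.0, 3.0, 3.5]},
    "1002": {"NAME": "Chris", "GRADE_POINTS": []},
}

def gpa(record):
    """Virtual element: derived from stored grade points, not stored itself."""
    points = record["GRADE_POINTS"]
    return round(sum(points) / len(points), 2) if points else None

# The file dictionary: the vocabulary a query may use for this entity.
dictionary = {
    "NAME": lambda record: record["NAME"],   # stored element
    "GPA": gpa,                              # virtual element
}

def query(file, dictionary, columns):
    """Answer a one-file question using only names defined in the dictionary."""
    return {key: [dictionary[col](record) for col in columns]
            for key, record in file.items()}

# Enlarging the catalog: one new dictionary item, no change to stored data.
dictionary["COURSE_COUNT"] = lambda record: len(record["GRADE_POINTS"])

answers = query(students, dictionary, ["NAME", "GPA", "COURSE_COUNT"])
```

A name that is missing from the dictionary raises a KeyError here, which is the carpet bag not being pre-packed: the element has to be defined before anyone can ask for it.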

Developing virtual data elements in Pick has the feeling of enlarging the catalog of offerings in the logical view for the user querying the data. Developing views in a SQL-DBMS has the feeling of shrinking the view of the data for users. With the former, the dictionaries, or logical views, can get junked up with lots of unnecessary vocabulary. Developers talk about the need to clean the dictionaries as one might suggest cleaning an office. In spite of that, there is a sense among both end-users and developers that this is our junk and, therefore, within our control. The environment is very empowering for users. With SQL, your views can get brittle with difficulty adding attributes and retaining performance and existing reports on the view (for example, when you add a property with multiple values resulting in more rows in the changed view). There are variations on virtual field definition in several other DBMS's, with Microsoft SQL Server likely having the best such implementation among the SQL-DBMS's.

I can't quite put my finger on it, but compared to coding SQL views, adding derived data elements is quite gratifying in a big-bang-for-the-buck way. It is often a big point of collaboration between developers and end-users. It's very plain to see how simple it is to think in terms of querying your entities individually. No matter what new requirements end-users have for querying the data, you can add virtual data elements into the carpet bag for an entity without performance penalties for adding elements nor the same performance penalties often incurred with an SQL JOIN. This makes the possibilities for queries about that entity even more magic. "In every job that must be done there is an element of fun."

Data for Every 1

2006-03-08T09:01:00.000-06:00

Queries and reports. Let's talk about 'em. My plan was to do an overview comparing the query language formerly known as GIRLS with SQL in this one blog entry and then move on to features of reporting tools in general. But I'm scratching that and blending the two. Instead of listing a whole bunch of differences, I'm going to take it nice and easy, 1 thought at a time. Please recognize that this is not about trying to convince anyone to use my favorite query language; it is much more general than that, with concepts that might also apply to XQuery or any other query language.

Terms that crop up with SQL include the RM, 1NF, and 3VL. Terms that relate to GIRLS are MV, Pick, NF2, and 2VL. Through the eyes of these two query languages, I hope to illustrate some significant differences in data models.

Let's dive into a contrived example of a school system booster club sale of frozen pizzas: multiple pizzas sold through multiple school booster clubs in a single school system. If we had an information system, we might want to query it asking about the people who bought the pizzas, perhaps for a mailing list. We might also want information about the pizzas, maybe for forecasting of supplies for future events. We might ask which booster clubs sold which pizzas so we deliver the pizzas to the right places. Those are all 1 entity instance questions. We are asking a question about each 1 instance of people, pizzas, or clubs.

It is typical to want information about some entity or other, whether people, places, things, or events. Additionally, it is typical to want 1 chunk of information for every 1 instance of such an entity. Optionally, you might want to aggregate information into groupings bigger than 1 asking, for example, how many pizzas each booster club ordered. We might otherwise split out the instances by requesting information about a multivalued property. For example, we might ask how many pizzas have pepperoni as a topping.

In any case, starting with 1 and then listing, grouping, or splitting from there is a very common way to think. 1 might even suggest it is natural. If you walked into a business in the 1970's, or even today, you would find filing cabinets dedicated to 1 single entity, such as Customers or Orders, often with a folder for each instance of that 1 entity. Why? It has to do with how we think, I would think. 1 might be the loneliest number, but it is conceptually very powerful.

A pizza chef team has been organized to assemble all of the uncooked pizzas and put them in boxes, with cooking instructions attached. They would like each box to have a Pizza Description Label, including a unique Pizza ID, the type of crust, the list of cheeses used, and the list of toppings. Sure, you might have mixed cheeses and toppings in a single attribute or split out meats and veggies into separate ones, but just stick with me and don't let your mind wander in that way. (I'll also hold off any discussion of list compared to set or bag for a rainy day. Let's just say that this list of toppings is rather like a shopping list.)

Using GIRLS (please excuse all upper case in these examples, it's my age showing):

LIST PIZZAS CRUST CHEESES TOPPINGS WITH PIZZA_ID = "12345"

Pizza Description

Seemingly comparable SQL:

SELECT
  PIZZA_ID, CRUST, CHEESE, TOPPING
FROM
  PIZZA_VIEW
WHERE
  PIZZA_ID = '12345'
ORDER BY
  CHEESE

But this isn't going to be quite the same, is it? There are insignificant details such as whether column headings and unique id's come along for the ride, but then there is the matter of how to create the PIZZA_VIEW and what the output from the above SELECT would be like. The view might be produced with something like this:

CREATE VIEW PIZZA_VIEW AS (SELECT P1.PIZZA_ID, CRUST, CHEESE, TOPPING FROM (PIZZA AS P1 JOIN PIZZA_CHEESE AS P2 ON P1.PIZZA_ID = P2.PIZZA_ID) JOIN PIZZA_TOPPING AS P3 ON P2.PIZZA_ID = P3.PIZZA_ID)

If we take the above view and the previous SELECT on it, we might get a label something like the one below.

Pizza Description

I'm a bit rusty on SQL and I'm not doing anything clever here to handle the multiple 1-M (one-to-many) relationships, but feel free to add comments on how the view or query on this data could be set up better. I think the party line is that the data are all there and how it is displayed is a task for a reporting tool. But look at how the query language shows up this key difference in the data models. With GIRLS we ask questions while thinking about 1 thing, listing, aggregating, or splitting them all the while with a sense of 1-ness.
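To see what the view hands back, here is a sketch that runs the same shape of view in SQLite through Python's sqlite3 module. The table contents (crust, cheeses, toppings) are made up for illustration, the join is written without the nested parentheses, and SQLite is just a convenient stand-in for whatever SQL-DBMS you use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PIZZA (PIZZA_ID TEXT PRIMARY KEY, CRUST TEXT);
    CREATE TABLE PIZZA_CHEESE (PIZZA_ID TEXT, CHEESE TEXT);
    CREATE TABLE PIZZA_TOPPING (PIZZA_ID TEXT, TOPPING TEXT);

    -- Hypothetical pizza: one crust, two cheeses, three toppings.
    INSERT INTO PIZZA VALUES ('12345', 'THIN');
    INSERT INTO PIZZA_CHEESE VALUES ('12345', 'MOZZARELLA');
    INSERT INTO PIZZA_CHEESE VALUES ('12345', 'PROVOLONE');
    INSERT INTO PIZZA_TOPPING VALUES ('12345', 'PEPPERONI');
    INSERT INTO PIZZA_TOPPING VALUES ('12345', 'MUSHROOMS');
    INSERT INTO PIZZA_TOPPING VALUES ('12345', 'PEPPERS');

    CREATE VIEW PIZZA_VIEW AS
        SELECT P1.PIZZA_ID AS PIZZA_ID, CRUST, CHEESE, TOPPING
        FROM PIZZA AS P1
        JOIN PIZZA_CHEESE AS P2 ON P1.PIZZA_ID = P2.PIZZA_ID
        JOIN PIZZA_TOPPING AS P3 ON P2.PIZZA_ID = P3.PIZZA_ID;
""")

rows = conn.execute(
    "SELECT PIZZA_ID, CRUST, CHEESE, TOPPING FROM PIZZA_VIEW "
    "WHERE PIZZA_ID = '12345' ORDER BY CHEESE"
).fetchall()
conn.close()

# One pizza, but every cheese is paired with every topping.
row_count = len(rows)
```

With two cheeses and three toppings, the 1 pizza arrives as six rows, each repeating the id and crust, and it is left to the reporting layer to fold them back into a single label.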

A significant issue for professionals who are learning SQL after only knowing languages that derived from GIRLS is the change to thinking from the many to the 1 instead of from the 1 to the many. It didn't seem like they ever had to learn to start with 1, but they definitely do have to learn to start with many instead.

Moving the story forward, we have just found out that the green peppers are delayed and we have decided not to wait any longer. We are going to have the delivery team get all the pizzas without peppers to the right booster clubs now. The delivery team has requested a list of the Pizza IDs that need to be loaded in the delivery van. So, we need a listing of all Pizza IDs for pizzas without peppers as a topping.

GIRLS:

LIST PIZZAS WITH EVERY TOPPING <> "PEPPERS"

SQL:

SELECT P1.PIZZA_ID FROM PIZZA AS P1 WHERE P1.PIZZA_ID NOT IN (SELECT P2.PIZZA_ID FROM PIZZA_TOPPING AS P2 WHERE TOPPING = 'PEPPERS')

It just doesn't matter to GIRLS if there is an attribute, such as topping, that has multiple values for a single pizza. We can still look at data for every 1.
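The "every 1" reading of that query can be sketched in Python, where each pizza carries its own list of toppings and the test is applied once per pizza. The data and the function name are hypothetical. Note that Python's all() is true for an empty list, so a pizza with no toppings is included here, as it is by the SQL NOT IN version; I am not asserting how any particular MV product treats EVERY on an empty list.

```python
# Hypothetical PIZZAS file with a multivalued TOPPING attribute.
pizzas = {
    "12345": {"TOPPING": ["PEPPERONI", "PEPPERS"]},
    "12346": {"TOPPING": ["MUSHROOMS", "ONIONS"]},
    "12347": {"TOPPING": []},
}

def without_topping(file, unwanted):
    """Return IDs of pizzas for which every topping differs from `unwanted`."""
    return [
        pizza_id
        for pizza_id, record in file.items()
        if all(topping != unwanted for topping in record["TOPPING"])
    ]

ready_to_load = without_topping(pizzas, "PEPPERS")
```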

The LIST of GIRLS

2006-02-28T12:48:00.000-06:00

The Relational Model (RM) is neither necessary (see The Naked Model) nor sufficient (see Don't Suffer Impedance). That said, it is useful. There are pros and cons to employing it. In order to be able to compare its usefulness to that of tools based on approaches other than the RM, we need to know what else is out there. This mewsing is not an opinion piece, but includes some things old, some things new, some things borrowed, and some things blue (Big Blue, that is).

GIRLS stands for Generalized Information Retrieval Language & System

While I plan to talk about a variety of possibilities in the future and am particularly curious about the future of XML-DBMS tools, the RM alternative with which I have the most direct experience is the MultiValue or PICK data model. I'll start there with a couple of blog entries.

Codd's papers, including the pdf version of his 1970 paper, are readily available on the web. That is not the case with early papers related to the Nelson-Pick data model. Although my source materials are not-always-easy-to-read copies and my scan, resize, and Adobe skills also leave room for improvement, I spent my time allocated for this mewsing to turn two historical papers into pdfs. I think these papers are available only from this site, but if anyone knows of other sources, please inform me as I would be happy to point to better versions of them. I also provide a link below to Don Nelson's resume, which indicates that he worked under F. George Steele, who invented and developed the Digital Differential Analyzer. After studying under Steele, Nelson developed the GIRLS and GIM-1 specifications at TRW.

You might have noticed that GIRLS stands for Generalized Information Retrieval Language & System. Many flavors of Pick have been developed over the years, as indicated in the MultiValue Family Tree poster. Unlike SQL, which has a single name covering many different implementations, GIRLS has had almost as many names as implementations. GIRLS has been named UniQuery, ENGLISH, FRENCH, AQL, ACCESS, Info/Access, jQL, RetrieVe, Vision, RECALL, QMQuery, CMQL, queryON, R/LIST, and INFORM. Current implementations of GIRLS are available from many different vendors, most (all?) of which are listed here, ordered by a complex algorithm.

What is your preference, GIRLS or SQL? There are many differences between them, but I'll save that discussion for next time and just mention a few right now. GIRLS can perform queries with data that is in NF2 (non-1NF), it employs a two-valued logic (no SQL NULLS), and in place of the SELECT of SQL is the LIST of GIRLS.

Don't Suffer Impedance

2006-02-21T09:42:00.000-06:00

A considerable amount of software development consists of mapping and converting data from one format to another, one schema to another. When software is used to bridge the gap between the person and the machine, there is an obvious need for translation. Impedance mismatch refers to the difference between the output of one process and input of another, requiring a transformation to connect the processes. There is a huge impedance mismatch between the thoughts of a person and the 0's and 1's of a computer, for example. (Although there is perhaps less of a mismatch in the case of Nick here than for others.)

Let's narrow the scope. As fascinating as it might be to discuss in a future mewsing, I don't want to start with a person's thoughts here, so let's move to the point where software components collect data from, or present data to, a person. So we will start with the user interface at one end. Without loss of generality for these purposes, we can narrow further to text-based UI data. We could return to this UML class as a model for an example XHTML (and therefore XML) web page. [Tip: mouse-over acronyms to get the expanded form.]

At the other end, the operating system works with the hardware to handle the translation of data to 0's and 1's. Additionally, let's assume a database product that has at least CRUD services and communicates with the OS. For example, this could be an SQL-DBMS. In summary, we will look at text data at the point of the user interface to and from the interface with a database product. As an example, we will start with an XML page at one end and an SQL-DBMS at the other.

While impedance can be measured in electrical engineering, in software development it is a much more loosely-defined term often used to sell products or claim superiority. Most definitions of impedance mismatch within software development, as used in the phrase OO-RM impedance mismatch, provide information specific to OO and RM, so I'll try my hand at a more generic description. An impedance mismatch occurs when there is enough difference in the data model used for the output of one process and the data model employed for the input of another to require a transformer. This transformer would be analogous to an electrical transformer, with the definition left to the reader.

The number of transformations of any kind relates at least to the size and scope of any given project, but the number of places where there is an impedance mismatch relates to the architecture and product choices for the solution. If there might be such a mismatch wherever we switch data models, and data models are abstractions for programming languages or sublanguages (see The Naked Model for a description of a data model), we can search for them by looking at places where we switch programming languages.

In our example, we could use JavaScript to read and write UI values via the DOM of the XML page. We could pass these data using XML to Java, PHP, Ruby, C, C++, Perl, Python, or even your favorite derivative of Dartmouth Basic, going from data entry on our XHTML page into some middle tier. We could otherwise GET or POST into this middle tier with name=value pairs, but I only mention that so you don't point it out.

If we take our data into an OO structure in the middle tier, there is a change between the UI and the middle tier or within the middle tier that requires a transformation. This XML-OO or Strings-OO transformation is worth a closer look in a later discussion, but permits similar or identical data structures to be used. Each language has the ability to work with XML, for example.

Then we have a transition between our middle tier and the database by way of SQL. This is well-documented as a place where there is an impedance mismatch. Of course there are many proprietary extensions to SQL, but for most implementations (e.g. SQL-92) three of the differences that will need to be addressed somewhere between the front-end and SQL are 1) NF2 vs. 1NF, 2) lists vs. unordered data, and 3) two-valued vs. three-valued logic (or nulls as empty sets/strings vs. SQL-style NULLs).
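To make the three differences concrete, here is a minimal sketch in Python using the standard-library sqlite3 module as a stand-in SQL-DBMS. The record, the table names, the `pos` ordering column, and the convention of storing an empty string as NULL are all assumptions made for illustration; they are not taken from any particular product or from the example application.

```python
import sqlite3

# Hypothetical middle-tier record: nested (NF2), ordered, and using "" for "no value".
person = {
    "id": 123456,
    "first": "Jayne",
    "nickname": "",
    "emails": ["jvdoe@abc123.com", "jov@xyz123.com"],
}


def to_email_rows(record):
    """1) NF2 -> 1NF: split the list into rows. 2) Carry list order as an explicit column."""
    return [(record["id"], pos, email)
            for pos, email in enumerate(record["emails"], start=1)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, first TEXT, nickname TEXT)")
conn.execute("CREATE TABLE person_email (id INTEGER, pos INTEGER, email TEXT)")

# Assumed convention: the empty string on the front-end is stored as NULL.
conn.execute("INSERT INTO person VALUES (?, ?, ?)",
             (person["id"], person["first"], person["nickname"] or None))
conn.executemany("INSERT INTO person_email VALUES (?, ?, ?)", to_email_rows(person))

# 2) The list comes back in order only because we ask for it with ORDER BY.
emails_back = [row[0] for row in conn.execute(
    "SELECT email FROM person_email WHERE id = ? ORDER BY pos", (person["id"],))]

# 3) Two-valued vs. three-valued logic: "" == "" is True in the middle tier,
#    while NULL = NULL evaluates to NULL (unknown) in SQL.
python_equal = (person["nickname"] == "")
null_equals_null = conn.execute("SELECT NULL = NULL").fetchone()[0]
self_match_count = conn.execute(
    "SELECT COUNT(*) FROM person WHERE nickname = nickname").fetchone()[0]
```

In this sketch every one of the three conversions lives on the middle-tier side: the code that splits the list, numbers the rows, asks for the order back, and decides what an empty value means is the transformer.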

It might be worth noting that the SQL side does not feel the pain. SQL is not a general purpose programming language, and the expectation is typically that the transformer required to address this impedance mismatch will be handled entirely by "the other guy." Whether this has been a cause of resentment in companies that organize with a separate group responsible for development and maintenance of the database aspects of software development is anyone's guess.

There might be good reasons to put up with these mismatches, but what would it take to minimize the number of impedance mismatches in a particular application? As indicated in the ripple delete example, we could use a data model similar to the UI on the back-end. Could we similarly choose to implement the front-end using the RM? What would an RM UI be? I don't mean that the data are stored using the RM, but that the data model for the actual UI form would conform to the RM. If we were to apply the Information Principle to the UI, we would need the entire information content to be represented only as attribute values within tuples within relations. While that is feasible with a data store, that would require no lists or ordered multivalued attributes, for example, which is not a sacrifice that can be made in a user interface.

An arbitrary UI, therefore, cannot use the RM for its data model. Given that the RM was developed for the purpose of working with large shared data banks, it is understandable that it might not also be useful as a UI data model. But if we were to decide that life is too short for impedance, we would have to eliminate the RM from the solution. Unlike other data models, the RM is not sufficient for writing software.

The Model Behind the interFace

2006-02-07T16:07:00.000-06:00

An interface is the face that computer software shows to a person, other software, or possibly hardware devices. While data models are often discussed related to databases and storing data, this mewsing is about data models behind software interfaces in general and user interfaces in particular.

Let's take an example of a browser-based UI page with three text fields, one of which requires an integer value; one single selection drop-down; two multi-selection drop-downs; one text area; one radio button; and one date entry via a free-form text field. Using all the creativity I can muster right now, I'll name them as indicated in the UML class shown here.

Developing software is a process of modeling data and behavior. One set of data we can model is that which will be entered by the user. This single page of data could be backed by a view/schema modeled with this single UML box. We could use XML or JSON, for example, within the software to define and work with this view of data.

Similarly if not working with a UI but a data exchange interface, such as one using web services, we could use this same data model. This could be the model for a single record of data. For this example I'll include some sample values. I'll use an xml-ish format (because I wish XML had arrays like this) to model this view. [Note: I'll start the array index at 1, but I'm noting that I'm doing that just to retain my credentials in the real-programmers-start-counting-at-zero world.]

<MyExchange>
  <text1>elephant</text1>
  <text2>ears</text2>
  <text3>2</text3>
  <singleSelect>mouse</singleSelect>
  <multiSelect1>
    <multiSelect1[1]>grey</multiSelect1[1]>
    <multiSelect1[2]>pink</multiSelect1[2]>
    <multiSelect1[3]>ivory</multiSelect1[3]>
  </multiSelect1>
  <multiSelect2 />
  <textArea>These are the times that try men's souls </textArea>
  <radioButton>Africa</radioButton>
  <dateText>01JAN06</dateText>
</MyExchange>

An arbitrary web page cannot have an SQL view as a data model. While views need not be in 3rd or 5th normal form or BCNF, you cannot define an SQL view that is not in 1NF. Using my favorite definition of an SQL view being a stored query, we see that while we can get a lot of different result sets in an SQL query, we cannot get a single web page of data if said data includes lists. Lists or arrays are very common in user interfaces as well as throughout the rest of software development. SQL-DBMS advocates have been known to say things like "You can use reporting tools to represent the view in whatever form you like--that is a representation issue". You might recall from a previous blog that the RM is all about representation, however.
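As a rough illustration of why a page with lists does not fit a single 1NF result set, the sketch below holds the sample record as a Python dictionary and flattens it the way a join would. The `flatten` helper and the `position` key are hypothetical names introduced only for this example.

```python
# The sample record from above, with the lists kept as lists.
my_exchange = {
    "text1": "elephant",
    "text2": "ears",
    "text3": 2,
    "singleSelect": "mouse",
    "multiSelect1": ["grey", "pink", "ivory"],
    "multiSelect2": [],
    "textArea": "These are the times that try men's souls",
    "radioButton": "Africa",
    "dateText": "01JAN06",
}


def flatten(record, list_field):
    """Return 1NF-style rows: one row per list element, scalars repeated on each row."""
    scalars = {k: v for k, v in record.items() if not isinstance(v, list)}
    return [{**scalars, "position": i, list_field: value}
            for i, value in enumerate(record[list_field], start=1)]


rows_for_colors = flatten(my_exchange, "multiSelect1")  # three rows, scalars repeated
rows_for_empty = flatten(my_exchange, "multiSelect2")   # no rows at all
```

Flattening on one list repeats every scalar three times, and flattening on the empty list returns nothing, so the one page of data is no longer one row of anything. The caller has to reassemble it.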

The inability to get a view that is not normalized is a failure of SQL-DBMS tools, while the current state of the RM has made accommodations by redefining 1NF. I suspect I'll bring that up in a future blog, but for now I'll just make the point that even with some new variations on the RM that permit relation-valued attributes, ordered lists are still not included in the model.

Now that we have our UI or web services interface modeled, what might we want to do with data that are hosted by this model? We might want to select, project, join… basically we might want to do anything we otherwise do with data. These data need not come from a disk, they could come from a web page or pages, a web service or other interface, or a process that generates data and stores it in memory, for example.

Are there any of these statements with which you disagree?

Data modeling is required for all interfaces and, therefore, throughout the process of software development.
When data values are provided in data models related to a UI or any other interface, there might be a requirement to do any type of manipulation of or queries against this data.
When working with a UI data model, it is not possible to work exclusively with normalized data.

Therefore, it is not just important, but necessary, to have models of data other than the RM. Whatever the other data model, it has the same requirements for manipulating and querying the data as data models that are specific to DBMS tools. Data in these models must be projected, inspected, dejected, neglected, and selected (apologies to Arlo Guthrie and Alice's Restaurant).
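A small sketch of what selecting and projecting can look like against data that never touched a DBMS. The records and the `select` and `project` helpers are made up for illustration, assuming the data arrived from a form or a web service as nested Python structures.

```python
# Hypothetical in-memory records, each with an ordered multi-valued attribute.
records = [
    {"id": 1, "text1": "elephant", "multiSelect1": ["grey", "pink", "ivory"]},
    {"id": 2, "text1": "mouse", "multiSelect1": ["grey"]},
    {"id": 3, "text1": "flamingo", "multiSelect1": []},
]


def select(rows, predicate):
    """Keep only the records for which the predicate holds."""
    return [r for r in rows if predicate(r)]


def project(rows, fields):
    """Keep only the named fields of each record."""
    return [{f: r[f] for f in fields} for r in rows]


# Select on a value inside the list, then project a single field.
grey_things = project(select(records, lambda r: "grey" in r["multiSelect1"]), ["text1"])

# A query that depends on order: the first choice of each record that has one.
first_choices = [r["multiSelect1"][0] for r in records if r["multiSelect1"]]
```

The second query has no meaning in a model without ordered lists, which is the point being made here.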

Even if we decide to make changes to whatever data model we use for the UI when we work with large shared data banks, we cannot make the RM the data model across the board in software development. We must have ordered lists, for example. Before we turn our attention to the face of the database, I want to be sure you are with me on this point. The User simply requires a more full-featured model behind their Interface.

The Data Movement

2006-01-31T14:40:00.000-06:00

Data. Movement. Toss in some musings about differences between the sexes and that will do for today.

<CYA>
Gender can be a divisive subject, and I want to be clear that I have no expertise in the field of gender studies, nature vs. nurture, or how brains differ. One of my readers from down under tells me that feminism is a sensitive subject there due to a lack of full-time jobs for men over forty right now. Apparently there has been some reverse discrimination. I do not think I have been discriminated against in my career at all. Unlike some of you, I have never even been in an all-male meeting. But I have taken note of statistics indicating a downward trend of women in computing and have seen a lack of female computer science majors in at least one college in the USA. I decided to weave this topic into my mewsings today, giving possibly a new angle to the otherwise tired data vs. process topic.
</CYA>

It's time to put a definition of data on the table, data as in data modeling and database, and even the good old term data processing.

Data: encoded propositions, a combination of form and meaning; accurate data are facts.

Ho hum. Here's a fact: I'm in my forties. If you capture that fact as data today and then present it thirty years from now, it will be as accurate as if I were to attach a high school picture to this blog today. Alas, data changes. It can also be disseminated. Data with movement—now that's much more interesting to me.

I started in data processing with a summer job during college in 1977. I had pounded the pavement for a job. If I had not gotten this job at the last minute, I would have started as a waitress the following Monday. I had memorized the menu already, but was petrified, knowing I was not really waitress material. But I also knew I didn't want to return to being a nurse's aide in a nursing home ('73-'75) or a maid in a Holiday Inn (Maid of the Month, July 1976).

My qualifications for this job programming COBOL on a Pr1me 300 were that I had taken the equivalent of one semester course covering COBOL, BASIC, and Fortran (two half courses, to be precise). The person hiring me said the other qualification I had was that I was majoring in mathematics and to him that meant that I was smart.

While it is impossible to know what would have happened under different circumstances, I am as sure as I can be that I would not major in computer science were I to enter college today. Often a computer science major is required now for those entering software development professions. I can think of no reason why I would have chosen to major in a machine. I have even avoided joining ACM until last year when I wanted to download more papers than would make sense on a pay-per basis. I didn't like Machinery in the name. I have no interest in machinery, much less an association for such machinery.

What is the percentage of women in the car industry compared to the travel industry? What is the percentage of women in computer science compared to those who were once in that now-called-something-else profession of data processing? Some think the decline is a failure of the women's movement or, perhaps, of women. But I see it as a success with the women's movement in that girls know they have options. It is a failure of our discipline to appeal to these girls as it once appealed to the girl I was when it captured me.

The percentage of women receiving bachelor-level degrees in computer or information sciences has declined from a peak of 35.8 percent in 1984 to 26 percent today.

Among the science and engineering workforce, computer science is the only area where women's participation has declined since 1993. (umbc.edu/cwit/computer_mania.html)

I like movement and connections. To me, data processing is like travel, like movement. Databases are more like computers and cars. I'm not interested in bases of data. I'm interested in uses of data, in data movement. I went on for a master's degree in pure mathematics, not applied mathematics or science. So there is little appealing to me about a discipline called Computer Science. I'm drawn to it about as much as to the term Database Management System.

That is the great distinction between the sexes. Men see objects, women see the relationships between objects. (John Fowles)

Of course this is a generalization. I do like language and data. But I would say that I like the relationship of language and data within systems that include people. I like connections, change, impact, and movement. Am I really about to blame some part of the decrease in women in computing on the change from a focus on data processing and terms implying movement to a focus on computer science and nouns of data, DBMS, tables, domains, constraints, and objects? Yup. (Do you like how I also just tossed the OO folks into the same bucket as the RM folks?)

Thinking in terms of static data, without concurrently addressing changes to the shape, content, and distribution of that data, the use of and interactions with the data, the movement of data, is simply not compelling. These encoded propositions exist in fluid, changing systems, not as data in a vacuum.

If you take a girl today similar to the girl pictured here, with similar interests and aptitudes, my hypothesis is that she will not end up in computer science. That's what happens when the women's movement meets a discipline defined in terms that suggest no movement. Girls rarely choose to focus on objects or data. Let's tap into the data movement and put the processing back with the data, please.

The Naked Model

2006-01-24T21:25:00.000-06:00

Strip the term relational from relational model and you have an unadorned model. So as not to confuse this with other possible meanings, we should be more precise. This model is typically termed a data model. A data model is employed in the design, construction, and maintenance of computer software systems.

The goal of this article is to get us to a common understanding of the term data model while also giving more indication of where these mewsings are headed. Before zeroing in on the meaning of data model, let's look at some similar terms used in software development that are NOT the same. For example, is this data model minus the relational adjective a...

...Conceptual Data Model (CDM)? Nope.
The CDM results from analyzing an area to be automated, capturing requirements, and communicating these between those who know the subject areas and those who will develop a software system. While the CDM can be back-of-a-napkin informal, there are many techniques for adding rigor, including the use of Entity-Relationship or UML Class Diagrams.

...Logical Data Model (LDM)? Nope.
This is the one that concerns me. Please don't confuse the naked data model with the logical data model, OK? When talking about a particular system, an LDM might be called the data model by some. However, the LDM is different from the term data model being discussed in this blog, so when I write data model sans adjectives, I am not referring to an LDM. The LDM results from structuring a specific CDM and communicating that structure to the computer.

...Physical Data Model (PDM)? Nope.
Only those writing the low-level database software need to know anything about the physical model, in theory (knowing grin goes here). Pretty much the only time you will hear me talk about the physical data model is if I am saying that I am not talking about the physical data model.

Each of these three possible glossary entries is related to a particular problem space being modeled for incorporation in a computer system. The data model we are talking about is more abstract. Data models such as the RM have implications for all LDMs.

Now that we know what our data model is not, let's turn our attention to what it is. The Relational Model (RM), introduced in an earlier blog, is a sweet, tight, mathematical model based on set theory and predicate logic. While you might have a hint that I'm putting the RM on trial over the course of these mewsings, I really do appreciate predicate logic and adore set theory. I applaud the cleverness in modeling data with both set theory and predicate logic. It can be quite helpful. For example, if we organize data and prepare query languages aligned with first order predicate logic, we can prove that our queries will return accurate results with respect to the data, in a finite amount of time. Also, if we choose a mathematically simplified data model, we can implement a mathematically simplified query language.

In addition to appreciating mathematics, I also like religion. But I hope to debunk some of the RM religion that has come along with the application of these mathematics to data. The current use of the RM has been pervasive enough in the industry that it will take me some time to lay out a case. If all goes well, I plan to have closing arguments sometime before the end of 2006. I will also admit that while I think I have a good case, I don't have it all formed into words in my head just waiting to hit paper. Writing in blog-sized units should help me refine and crystallize my thinking. I hope that you, the jury, enjoy taking the journey through the evidence with me.

I would like to enter into evidence the Information Principle as Exhibit A. I will use a quotation from C. J. Date who is quoting E. F. (Ted) Codd. Both of these men have been at the center of relational data modeling.

Exhibit A: The Information Principle

"The Information Principle (which I heard Ted refer to on occasion as the fundamental principle underlying the relational model) [is]...

The entire information content of a relational database is represented in one and only one way: namely, as attribute values within tuples within relations." (Date, Edgar F. Codd, A Tribute, www.sigmod.org/codd-tribute.html)

Tuck this point away: a data model is related to the representation of data. Now let's move on to a definition of a generic data model, using Date to rephrase Codd.

Codd defines a data model in a 1980 paper Data models in database management. By his definition a data model consists of a collection of data structure types, operators that can be applied to instances of these types and consistency rules that define valid states for the data.

Objects, operators, and, effectively, rules for assignment…Hmmm… If we were to implement a data model what would we have? Let's take a look at a recent definition of data model from Date.

A data model is an abstract, self-contained, logical definition of the objects, operators, and so forth, that together constitute the abstract machine with which users interact. The objects allow us to model the structure of data. The operators allow us to model its behavior. (C. J. Date, An Introduction to Database Systems, Addison Wesley, 8e, 2003, p 15-16)

I conclude from this that the implementation of a data model is a programming language, whether a general purpose programming language or not. Also, each programming language provides an implementation of a data model or perhaps more than one. Put another way, a data model is an abstraction of a programming language or programming sublanguage.
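To see how the three parts of the definition (structure types, operators, consistency rules) turn into something that looks like a small programming sublanguage, here is a toy sketch. The `Relation` class, its method names, and the sample tuples are all hypothetical; it is a deliberately tiny illustration, not a faithful implementation of the RM.

```python
class Relation:
    """Toy structure type: a heading plus an unordered, duplicate-free set of tuples."""

    def __init__(self, heading, rows):
        self.heading = tuple(heading)
        body = set()
        for row in rows:
            # Consistency rule: every tuple must have exactly the heading's attributes.
            if set(row) != set(self.heading):
                raise ValueError("tuple does not match heading")
            body.add(tuple(row[name] for name in self.heading))
        self.body = frozenset(body)

    def _as_dicts(self):
        return [dict(zip(self.heading, values)) for values in self.body]

    # Operators: each takes a relation and returns a relation.
    def restrict(self, predicate):
        return Relation(self.heading, [r for r in self._as_dicts() if predicate(r)])

    def project(self, names):
        return Relation(names, [{n: r[n] for n in names} for r in self._as_dicts()])


emails = Relation(["id", "email"], [
    {"id": 123456, "email": "jvdoe@abc123.com"},
    {"id": 123456, "email": "jov@xyz123.com"},
])
ids = emails.project(["id"])  # duplicates collapse because the body is a set
xyz = emails.restrict(lambda r: r["email"].endswith("xyz123.com"))
```

Once the structure, the operators, and the rule are written down, what we have is an interface a developer programs against, which is the sense in which an implemented data model behaves like a language.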

Now that we have some clarification of the term data model, I will make a claim that is likely agreeable to readers as I have never heard anyone argue otherwise. The RM is not necessary. It is not necessary for developing software solutions, maintaining large shared databases, or any other purpose in the world of software development. Any software solutions that can be developed while employing the RM could be written without it, using other data models. I will follow this up in a future blog by showing that the RM is not sufficient for developing and maintaining data-based software. Once we are all on the same page that the RM is neither necessary nor sufficient, we can look at what the purpose of the RM is and discuss its comparative usefulness.

My beef with the RM is related both to normalization theory as taught in colleges and universities, discussed in the Is Codd Dead? blog, and to the way the RM, or parts thereof, is used in the practice of software development and maintenance today. It shapes the thinking of software developers in ways that are often not the most effective.

And, by the way, if you are thinking that the RM need not be obvious in a developer's programming language but could be hidden behind the scenes, then my work is done. That would mean that no computer language would need to use the Information Principle, and neither you nor I would need to use the RM as a data model. We can use any programming language that does not represent itself as an implementation of the RM to employ an alternative data model. Did I mention that the RM is not necessary?

Who Ordered the Ripple Delete?

2006-01-16T23:50:00.000-06:00

I have dabbled a bit in digital video editing, inserting and deleting frames, for example. If I select frames and hit the keyboard Delete, the frames are removed, but a gap remains where they once were. That is often not what I want. Enter: the ripple delete. I'll admit to having a slight shiver of delight when I perform a ripple delete. Behind the scenes it not only deletes the frames, but moves frames up to cover the gap. Editors who have worked with physical film and a razor blade must be ecstatic.
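The difference between the two kinds of delete can be sketched on a plain Python list. The frame labels and function names are invented for the example, and `None` stands in for the gap an editing tool would leave behind.

```python
def plain_delete(frames, start, stop):
    """Remove frames in [start, stop) but leave a gap where they were."""
    return [None if start <= i < stop else frame for i, frame in enumerate(frames)]


def ripple_delete(frames, start, stop):
    """Remove frames in [start, stop) and move the later frames up to close the gap."""
    return frames[:start] + frames[stop:]


frames = ["f0", "f1", "f2", "f3", "f4"]
with_gap = plain_delete(frames, 1, 3)   # ['f0', None, None, 'f3', 'f4']
rippled = ripple_delete(frames, 1, 3)   # ['f0', 'f3', 'f4']
```

With an ordered list as the underlying structure, the ripple is nothing more than the list closing up; nobody has to renumber anything.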

The frames of a video are similar to any other ordered list of data. This ripple delete feature can be added to any software application that shows users an ordered list. Product features are not determined by a particular underlying database data model. However, I am using the feature of an ordered list to set the stage for investigating the meaning and implications of choosing one or another data model, with a definition for "data model" coming in the next blog. If the same features can be implemented in software whether using the Relational Model (RM) or not, why might a team choose not to employ the RM?

Let's turn to an example of a simple ordered list. If I were not such a novice with AJAX, I might have provided an example of a ripple delete on a list, but my working example should help with the illustration nonetheless. Also, please forgive my burst of saccharine marketing spin, but because of using

tag-delimited strings
NF2 and
two-valued logic (2VL)

throughout the entire development process, I'm naming this style of development End-to-End AJAX or maybe N2N AJAX. You can gag now, but it isn't like naming JavaScript after Java, given that AJAX really is used at the front-end.

Using the example from the last blog, I'll add a requirement for the e-mail addresses to be ordered. Someone using the database would send either bulk or individual e-mails first to the first address and, if that bounces, then to the second. I can place an ordered list in my logical model and then in my implementation. That way I can enjoy use of the ordered list without managing a separate ordering attribute myself, without having to remember to sort the output, and without writing my own ripple delete process.

See this example as a hint at developing using End-to-End AJAX. While a ripple delete is not a standard feature of an RDBMS, it is part of the charm of MV databases. So, the e-mail list is defined as an ordered list to our database. AJAX is used on the front-end, including XHTML, CSS, and JavaScript. The output comes from a query of the database without procedures to reshape it, so you can see that the database includes NF2 data, as described in the first blog.

In practice, data modelers are influenced in their choices of a logical data model by their target DBMS. If the target database is based on the RM, the data modeler is less likely to select a property list for an entity (i.e. a multi-valued attribute requiring a new table in a SQL-DBMS). I have heard analysts convince users that a single-valued attribute would be best or at least appropriate "for this phase." It makes sense that if you have to split an attribute into a separate table, add in an ordering attribute and roll your own insert and ripple delete functions, you are simply less likely to even consider it. A technique sometimes used when lists are implemented in an RDBMS is to number using intervals that permit easy insertion in the midst of the list as long as you do not run out of numbers in the interval. There is then no hint in any given entry in the list what its ordinal position might be. If the first e-mail address were identified as address 10 and the second were numbered 20, another e-mail address could be inserted as 15. But, rather than avoiding ordered lists, you start seeing how common they are when you free your mind of the RM.
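The interval-numbering technique described above can be sketched as follows. The gap of 10 and the 10/20/15 keys come from the description; the function names, the renumbering step, and the pairing of keys with these particular addresses are assumptions for illustration.

```python
GAP = 10  # assumed interval between ordering keys


def key_between(before, after):
    """Pick an ordering key strictly between two existing keys, if one is left."""
    if after - before < 2:
        raise ValueError("no free key in this interval; renumber first")
    return (before + after) // 2


def ordinal_position(rows, key):
    """The key alone does not say 'second'; we have to sort and count to find out."""
    return sorted(k for k, _ in rows).index(key) + 1


def renumber(rows):
    """Roll-your-own maintenance: reassign keys 10, 20, 30, ... in the current order."""
    return [(GAP * i, email) for i, (_, email) in enumerate(sorted(rows), start=1)]


rows = [(10, "jvdoe@abc123.com"), (20, "jov@xyz123.com")]
new_key = key_between(10, 20)                        # 15
rows.append((new_key, "jo3@aol.com"))
in_order = [email for _, email in sorted(rows)]      # must remember to sort
```

Three helper functions and a sort on every read, all to recover what an ordered list gives away for free. That overhead is what makes a modeler less likely to consider the list in the first place.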

Even once you go through the work of implementing an ordered property list for an entity, the end-user might still be affected if you take your RM thinking to the UI. Think what the digital video editing tool user interface might be if it were to think like the RM. It is unlikely that this editing software holds these frames in a relational database but it shows the interface your users might want even if you did have a relational database backing an ordered property list. Now don't forget, software developers are users too. The DBMS APIs they use can make a significant difference.

Some readers around the world (I was thrilled to have readers from every continent except Antarctica this past week with the first blog) might think I'm about to confuse the data model and the representation. I'm not. I am laying the groundwork for examining the definition and use of a data model. What is the relationship between a data model and the API that developers use in working with a DBMS? If we can have ordered lists and perform ripple deletes no matter what data model we are using, then what is my point? It has to do with the title of this blog: between the developer and the database, who did what work; who ordered a list of properties? For flexibility or productivity, does it make a difference who ordered the ripple delete?

Is Codd Dead?

2006-01-09T22:14:00.000-06:00

Time Magazine issued a cover in 1966 asking "Is God Dead?" This is not the first time the name "Codd" has replaced "God" in a phrase, but in this case it is not for the purpose of comparison. In this blog, I will dare to question some of Codd's legacy including some of the dogma passed along in college database textbooks today.

E.F. (Ted) Codd died in 2003 leaving a significant contribution. Codd is often called the father of relational theory. His 1970 ACM paper A Relational Model of Data for Large Shared Data Banks (E. F. Codd, Communications of the ACM, v.13 n.6, p.377-387, June 1970) is a significant industry milestone.

In this paper, Codd discusses what he sees as the advantages in modeling data by use of mathematical relations compared to mathematical graphs of trees or networks.

Relations are often represented as tables of rows and columns. Trees are often visualized as nested folders and documents. The network graph, seen by Codd as overly complex and a cause of some of the problems he was addressing, can be visualized as a web. While a web, or directed-graph, might be a more complex mathematical structure than a relation, I predict that this data model might just catch on anyway (wink).

There are many viable models for data. Each has its advantages and disadvantages. This blog will not be about right and wrong as much as better and worse approaches. I'm a practitioner dabbling in theory in order to help improve the practice and not the other way around.

Codd also introduces the term "normalize" to refer to removing nonsimple domains, such as lists or tables of data often referred to as "repeating groups." He is very clear in this paper that a relation could include repeating groups, but that normalizing it would make the data model simpler for some purposes.

The simplicity of the array representation which becomes feasible when all relations are cast in normal form is not only an advantage for storage purposes but also for communication of bulk data between systems which use widely different representations of data. (Codd, p. 381)

Anyone communicating bulk data by way of XML or JSON will recognize that we have different issues to solve today than we had in 1970. The rise of XML with its associated unnormalized data model is part of the impetus for what will likely be significant changes on the database landscape.

My advice is this: Stop normalizing your data. Stop removing all repeating groups. Note that I am using the original description of normalization from this paper. This meaning of normalization was later termed, or at least rolled into the term, "First Normal Form" or 1NF. The higher normal forms, such as BCNF, include laudable work with functional dependencies, but all are defined to first require normalization. There is definitely some good that can be salvaged from this normalizing debacle of the past few decades, but we must first ditch the requirement for data to be normalized, placed in 1NF, stripped of repeating groups. I will refer to relations that are not normalized, as others have, as NF2 for Non-First Normal Form.

Do this —>

   Id: 123456
First: Jayne
 Last: VanDoe
Email: jvdoe@abc123.com
       jov@xyz123.com
       jo3@aol.com

Not this —>

   Id: 123456
First: Jayne
 Last: VanDoe

   Id: 123456
Email: jvdoe@abc123.com

   Id: 123456
Email: jov@xyz123.com

   Id: 123456
Email: jo3@aol.com

In this way you will model entities, such as the person above, with their dependent properties, such as the list of e-mail addresses. You only need to remove lists from your model, thereby going from the first example to the second above, if you are using tools that require it. Given that SQL-92 requires it, that is a big if. There are other viable, time-tested NF2 options, however.
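The step from the first form to the second, and back again, can be sketched with plain Python structures. The `unnest` and `nest` helper names are hypothetical; the data is the person shown above.

```python
# "Do this": one entity with its dependent list of e-mail addresses (NF2).
nested = {
    "Id": 123456,
    "First": "Jayne",
    "Last": "VanDoe",
    "Email": ["jvdoe@abc123.com", "jov@xyz123.com", "jo3@aol.com"],
}


def unnest(person):
    """'Not this': split into one entity row plus one row per e-mail address (1NF)."""
    entity = {k: v for k, v in person.items() if k != "Email"}
    email_rows = [{"Id": person["Id"], "Email": e} for e in person["Email"]]
    return entity, email_rows


def nest(entity, email_rows):
    """Reassemble the entity; the order here is only the order the rows happen to arrive in."""
    emails = [r["Email"] for r in email_rows if r["Id"] == entity["Id"]]
    return {**entity, "Email": emails}


entity, email_rows = unnest(nested)
round_trip = nest(entity, email_rows)
```

The round trip works in this sketch because a Python list keeps the rows in order. A tool that treats the rows as an unordered set offers no such promise unless an ordering attribute is added.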

But the Relational Model (RM) is based on mathematics, right? Mathematics is precise. What part of the argument for the RM is amiss? Don't be fooled—there is no mathematical requirement to normalize data. Mathematics provides a means for modeling propositions to be handled in software, presented to end-users, passed as messages, or stored on secondary storage devices. The RM is a mathematical model. It is a model. Models are not the real thing. Models are often anorexic versions of the real thing. The mathematics of the relational model is sound, but the process of determining what this model should be used for is flawed.

The RM has been useful, but not as useful as some pre-relational models, in my opinion. Post-relational models of data for messages, such as those mentioned above, look very much like pre-relational models. I am hoping for a return to best practices for data models, whether or not the theory keeps up. I would, of course, prefer that theory be better aligned with excellent practices. Many pre- and post-relational tools use an NF2 model.

You are likely familiar with RDBMS products, often referred to as relational databases. Purists might prefer these be called SQL-DBMS products since SQL does not promote a pure relational model. I will use this column to dispel what I think to be myths that have helped SQL and the relational model rise to become king of the hill for a couple of decades. While this introductory column is admittedly not meaty, I will delve into this further and provide working examples in the coming weeks.

While I have not yet experimented with any XML DBMS tools, I have been working with one NF2 model, often referred to as the MultiValue (MV) or Pick® data model, for over a decade. This is not the only such model, but one with which I am comfortable, so I will introduce it here and use it in future illustrations and implementations.

Putting the RM and MV side by side while wearing both a technical and a business woman's hat is what prompted me to explore why the MV data model seems to yield higher productivity for developers, greater flexibility for changes over time, and lower risk of project failure. This was particularly perplexing when I started researching the topic because the RM was developed to help improve database maintenance. While the RM addresses some maintainability issues better than MV, MV seems more flexible in many respects. There are different risks and benefits associated with each approach.

Products employing an MV or NF2 data model include the IBM U2 products, Temenos jBASE, Revelation OpenInsight, Raining Data D3, Northgate Reality, EDP Plc UniVision, Ladybridge Systems OpenQM, and InterSystems Caché. There are other viable functional data model implementations with which I am less familiar, such as Berkeley DB from Sleepycat Software and other products marketed as embedded databases. This is definitely not a small niche market.

OpenQM is an open source implementation, so I will use that for my examples in future blogs. I will be the first to admit that MV isn't new, and although various flavors have tools to make it prettier, it typically doesn't look new. It is unlikely to wow you at first glance, but it often grows on developers quickly with its big bang for the buck results and maintainability. The same principles can be applied to many environments, however, and will typically not be specific to MV tools.

I would like to see the industry start with an NF2 model and move it forward rather than squeeze more out of SQL, as has been attempted with the more recent SQL standards. SQL will be with us for many years, but it is time to make an abrupt cut away from it wherever feasible.

Codd will long be remembered for some very innovative work in the area of database theory. But, yes folks, Codd is dead.


A Modeling Profession

2005-11-27T16:17:00.000-06:00

Look for this column (blog?) to start up in early 2006. This first installment is for setup and testing purposes, although it is hopefully still worth reading.

I'm a bit old to begin a career in modeling, although I still have nice legs. Enough of that. I am interested in all aspects of software development, but have recently been studying modeling, particularly data modeling. While research is never-ending, I'm ready to start the writing side of this effort, beginning with this "tincat musings" column.

There. I called it a column. I can't bring myself to call it a "blog" just yet as that would make me a blogger. I don't know that it is fair to call it a column, because no one has hired me as a columnist, I simply took it upon myself. If it looks like a blog and it reads like a blog...

I will be writing text that will flow into rows in this column. The column will hold data of a type we could call "Document." Of course, we could declare that any character-based column holds data of type Document, where we define this type to include strings of Unicode characters (or ASCII, for those over 30). Other columns could then be of type "Mime" perhaps. That's it - that's all I need for my "database" - those two types.

So I've moved from the vocabulary of working with character and binary files in the 70's to databases with Document and Mime types today. Some of you might be suggesting that a Document could include Dime/Mime types, but that's an implementation technique. Documents, as far as I'm concerned, are entities that could be implemented using paper and pen (as the implementation for those blessed with eyesight - it could be implemented otherwise for those not). With my definition, if it looks like a document and reads like a document...

Now that I know that everything I want to model will be either of type Document (or a subtype) or of type Mime (ditto), I'm ready for the runway. After I get a few more of these columns under my belt (a fine silk sash from Paris), I'll be ready for takeoff. Speaking of mixed metaphors, what is a model anyway?
