tag:blogger.com,1999:blog-35109872393489860692008-08-20T14:44:14.488-07:00Queries From HellJoehttp://www.blogger.com/profile/09610805269377009736noreply@blogger.comBlogger19125tag:blogger.com,1999:blog-3510987239348986069.post-67322590808163105402008-07-23T07:46:00.000-07:002008-07-23T08:14:52.099-07:00Misoneism - A hatred or fear of change or innovationI subscribe to <a href="http://www.wordsmith.org/">a.word.a.day</a> and a few days ago the word was <a href="http://wordsmith.org/words/misoneism.html">misoneism</a>.<span style=""> </span>The definition (hatred of change or innovation) made me think of the time when I was an IT manager trying to get our business users to accept a new technology that would make their job easier. Today though, browsers, standardized application interfaces and the general ubiquity of computers have reduced this problem almost to insignificance.<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://ianalyze.net/uploaded_images/misoneism-fear-of-technology-701657.jpg"><img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer;" src="http://ianalyze.net/uploaded_images/misoneism-fear-of-technology-701576.jpg" alt="" border="0" /></a>The interesting point is that it seems like some of the IT folks have now become the misoneists (my apologies to <a href="http://en.wikipedia.org/wiki/Anu_Garg">Anu Garg</a> if that is not a valid form of the word). Oracle people (users, not employees of Oracle) can’t imagine using SQL Server and the SQL Server folks can’t imagine using Oracle—and DB2 is an anathema for both groups. None of them would dream of looking at a database that was not relational regardless of the business needs and appropriateness of the solution. Of course, there are exceptions to every blanket statement and there certainly are a few visionaries who will look at the possibilities of a new technology.<span style=""> </span>But it seems there are only visionaries and hard core conservatives in the industry. The middle ground seems to be very empty.<br /><br /><a href="http://www.i-lluminate.com/data-warehousing-platform-comparison.php">Alternative database structures</a> are available today that provide much better information access for analysis and BI applications than any RDBMS yet they have virtually no customers. Compared to the number of licenses of Oracle and/or SQL Server, the presence of all non-relational database installations combined is nearly invisible unless one counts every XML file as a database. Non-relational database companies with more than a few dozen customers consider themselves to be extremely successful.<br /><br />What happened to the pioneers in information technology? The industry is not so old that they are all retired or dead. Why do the younger members of the industry have their feet so firmly planted in the clay? If innovation in software is to keep pace with <a href="http://www.answers.com/topic/moore-s-law">Moore’s law</a>, we need many more innovators and fewer traditionalists.Joehttp://www.blogger.com/profile/09610805269377009736noreply@blogger.comtag:blogger.com,1999:blog-3510987239348986069.post-50105952676677317422008-06-25T09:10:00.000-07:002008-06-27T07:12:48.346-07:00How expensive is free software?I was just looking at a white paper from Greenplum describing how fast, easy and inexpensive a <a href="http://www.i-lluminate.com/advanced_data_warehouse_design.html">data warehouse</a> is because it runs on commodity hardware. I have seen uncountable comments about how inexpensive the Microsoft data warehouse infrastructure is and even more comments about the low cost of using open source software. It would be easy to believe that for the price of a few computers and a little software you could have a data warehouse running next week.<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://ianalyze.net/uploaded_images/Larissa-Moss-735979.jpg"><img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer;" src="http://ianalyze.net/uploaded_images/Larissa-Moss-735974.jpg" alt="" border="0" /></a>At a recent <a href="http://www.sigs-datacom.de/sd/tdwi/conferences/2008/index.htm">TDWI conference</a> in Germany, <a href="http://www.eiminstitute.org/meet-the-experts/larissa-moss/">Larissa Moss</a> delivered a presentation defining the development steps and the required team members in a <a href="http://www.i-lluminate.com/comprehensive_data_warehouse_services.html">data warehouse implementation</a> project. The presentation did a beautiful job of defining what has to be done and who has to do it. It involves a minimum team of five or six full time people and a number of part time specialists. The entire project is divided into sixteen steps with each step having multiple activities.<br /><br />While many organizations will not follow every step and activity of her plan, it is an excellent definition of what really should be done to ensure quality and minimize total cost of ownership. So, using her project definition as a guideline, it is easy to see that the cost of the database software and the server hardware represents only a small portion of the total cost of a data warehouse.<br /><br />The use of commodity hardware and open source software can reduce the total cost of the project by a couple of percentage points at best. Reducing the amount of human resource required can reduce the cost of the project by half. Some of the tasks identified in her presentation won’t change with any technology but there are some large improvements possible that can result in cost reductions that substantially exceed the total cost of the hardware and software.<br /><br />For example, the Data Analysis Step includes six activities. Each of these activities involves a number of people and takes a significant amount of time. Using free (or nearly free) relational database software and/or inexpensive commodity hardware, all six steps are still required to have a reasonable prospect of a successful project. If another type of database can reduce the time and expense of these steps by just ten percent, the entire cost of the software and hardware can be offset.<br /><br />Using the <a href="http://smartenoughsystems.com/wp/tag/cdbms/">CDBMS</a> structure, two of the activities, “Refine logical data model” and “Expand enterprise logical data model” are no longer needed. The other activities, like “Analyze source data quality”, will remain essentially the same regardless of database structure. Removing two of the six activities will reduce the overall time and cost of just one step by about one third.<br /><br />As another example, take a look at one of the activities in the first step, “Cost Justification”. If the cost of a project is a few hundred dollars there is no justification needed other than someone saying it will help them with their job. If the cost of a project is a few tens of millions, it will require extremely detailed and robust justification and approval at top executive level. In between, the justification should have a relatively linear relationship to the expected cost. If the Data Analysis step is reduced by one third, and other steps are reduced by a substantial amount, the cost of the entire project is reduced and the cost justification time and effort can also be reduced.<br /><br />Going through all of the steps, it is easy to see how the total effort required for a successful data warehouse project could be reduced by half with no compromise on the quality of the result.Joehttp://www.blogger.com/profile/09610805269377009736noreply@blogger.comtag:blogger.com,1999:blog-3510987239348986069.post-40592230697117558952008-06-03T09:52:00.000-07:002008-06-04T07:31:33.863-07:00When is an appliance not an appliance?<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://ianalyze.net/uploaded_images/DW_Appliance-763254.jpg"><img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer;" src="http://ianalyze.net/uploaded_images/DW_Appliance-763252.jpg" alt="" border="0" /></a><br />I recently had an interesting discussion with <a href="http://www.linkedin.com/in/hackathorn">Richard Hackathorn</a> about <a href="http://www.i-lluminate.com/value_based_storage_VBS_model.html">data warehouse appliances</a> and he convinced me that the illuminate database is an appliance in spite of the complete absence of hardware. In <a href="http://www.beyeresearch.com/study/4639">Data Warehouse Appliances: Evolution or Revolution?</a>, he and <a href="http://www.bi-research.com/aboutus.html">Colin White</a> describe the attributes of a DW appliance and the illuminate database aligns well with all points.<br /><br />So is a software-only database really an appliance? Like so many things in the IT world, the answer is “it depends.” His article states “From the classical definition, an appliance is designed for a specific purpose.” And he defines that purpose, in the case of the DW appliance, as <a href="http://www.i-lluminate.com/correlation_database_engine.html">data management</a>. His diagram specifically excludes Data integration services, BI applications and tools and Information delivery from the DW appliance functionality.<br /><br />So when the illuminate profiling loader provides a report on incoming data quality, is that just data management or is it data integration services? Or perhaps one could call it information delivery?<br /><br />A display of the extended metadata from the illuminate dictionary shows the range of values in a field, the total count of values and the count of unique values. If the field is numeric, it also shows min, max, average and standard deviation. An ordered display of all values and the associated count of occurrences is also available. Is that information delivery or possibly a BI tool?<br /><br />The line between the appliance and the services is blurry. The features described above, and others, were included in the illuminate database system to facilitate data management. However, they are frequently used to assist in discovery and analytical processes. BI and data mining users can use these features to enhance their processes.<br /><br />I would still call the illuminate database an appliance; it just has a few extra features. As the database becomes more aware of the context and meaning of the data it manages, these extra features will broaden until it will be nearly impossible to identify the end of the appliance and the start of related services. Another discussion on this topic a year from now will be very interesting.Joehttp://www.blogger.com/profile/09610805269377009736noreply@blogger.comtag:blogger.com,1999:blog-3510987239348986069.post-32190840671211180522008-05-29T04:11:00.000-07:002008-06-02T12:05:04.780-07:00Denormalizing your worldHow many tables are in your data warehouse? 500? 1000? 5000? More? One data warehouse that I know of has approximately 500,000 tables!<br /><br />Is it any wonder that business people can not use a <a href="http://www.i-lluminate.com/advanced_data_warehouse_design.html">data warehouse</a>? Even most data marts are more complex than the business activity they represent. The paradox here is that if you try to list all of the real world entities that constitute a business information environment it is very hard to find more than fifty or sixty and a good case could be made for as few as twenty.<br /><br />Where then do all of these tables come from and what do they mean? Most of them are the result of optimizing data structures to improve the performance of a computer. Good computer performance is gained at the expense of good business performance—that seems like a poor tradeoff.<br /><br />Normalizing data structures and metadata rather than normalizing data is the first step in making information accessible to business users. A data structure that never repeated a metadata value, including repeats in the form of contrived names to represent the same thing in different contexts and levels of aggregation, would be the most understandable structure for business information users. This means one table for all customer data regardless of source or usage, providing a <a href="http://www.i-lluminate.com/service_company_customer_profiling.html">single customer view</a>. The same goes for product, service, order, invoice, etc.<br /><br />In an environment where one table represents one entity the average business user could easily navigate through the tables and BI for the masses would become a more realistic concept.Joehttp://www.blogger.com/profile/09610805269377009736noreply@blogger.comtag:blogger.com,1999:blog-3510987239348986069.post-85293810976785140612008-05-20T05:36:00.000-07:002008-05-23T03:53:35.250-07:00B-Eye Podcast from TDWIDuring the whirlwind that was last week's <a href="http://www.tdwi.org/">Data Warehouse Institute (TDWI)</a> conference in Chicago, I had the opportunity to sit down with David Loshin of the <a href="http://www.b-eye-network.com/listen/7440">Business Intelligence Network</a> for a rapid-fire discussion on data warehouse design, implementation and use.<br /><br />In this podcast, David and I talk about how organizations can quickly deploy enterprise data warehouses, eliminate the business requirements definition phase and use all of their data for functions such as <a href="http://www.i-lluminate.com/financial_data_analysis_dashboards.html">finance and performance management</a>. We also discuss how a correlation data warehouse lowers the barrier to entry by reducing both up-front and ongoing maintenance costs, and enables business users to perform ad hoc analysis, discovery and exploration.<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.b-eye-network.com/listen/7440"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://ianalyze.net/uploaded_images/tdwi_podcast-752536.jpg" alt="" border="0" /></a>Joehttp://www.blogger.com/profile/09610805269377009736noreply@blogger.comtag:blogger.com,1999:blog-3510987239348986069.post-82911938665922909752008-05-15T06:04:00.000-07:002008-05-21T15:06:52.956-07:00What's in a Name?Naming of data objects has always been an interesting issue. Long ago, programmers named files and fields arbitrarily, at their own whim. Later, most IT departments standardized naming rules. Now there are industry standard naming conventions like <a href="http://metadata-standards.org/11179/">International Standard ISO/IEC 11179-5</a> and complete naming systems like the <a href="http://dublincore.org/">Dublin Core Metadata Initiative</a>. Yet most business users today could not even begin to interpret the meaning of the names used for the hundreds or even thousands of tables and the multiple thousands of columns in an enterprise<a href="http://www.i-lluminate.com/data_warehouse_analytics_tools.html"> information management infrastructure</a>. These names identify the information they use every day but the names are so unintelligible that no one could interpret them without a written standard.<br /><br />Why should objects that are used every day by the business user have names that are incomprehensible?<br /><br />The primary reason is that data objects are structured to optimize computer performance rather than to reflect the business structures. A common thing like a customer record quickly becomes split into five or ten or more tables in <a href="http://www.i-lluminate.com/advanced_data_warehouse_design.html">third normal form</a> and business information users can no longer find anything.<br /><br />The best solution to date has been to build supplementary data structures especially for user access and name all objects in those structures according to a centrally managed naming control system. The business users then only see a data mart with a small subset of all data objects and are usually able to learn the meaning of that subset of names.<br /><br />A better solution for naming is to keep just one data structure for each real world object and give the attributes of that structure names that align with the business names. In this environment, every business entity is represented in a single table and each attribute occurs once and has a meaningful name. Each customer has one customer record and each product has one product record. All information access systems use these tables and access the same data using the same names. The <a href="http://www.i-lluminate.com/correlation_database_engine.html">data marts, star schemas and OLAP cubes</a> of conventional systems are just ordinary queries in this environment.<br /><br />This concept, as simple as it may seem, has a couple of issues that keep it from being implemented. The first issue is computer performance. The massive redundancy created by this structure would be crippling in an RDBMS environment. The solution for this issue is simply to use the <a href="http://www.i-lluminate.com/value_based_storage_VBS_model.html">Value Based Structure</a> in the CDBMS. In this structure redundant data storage does not cause any problem and normalizing provides no gain.<br /><br />The second problem is application design. Applications are still designed to interact with data stores the same way they did thirty years ago. New data storage capabilities provide functions that didn’t exist thirty years ago and applications should be upgraded to use the best that is available. However, that is a discussion for another day.Joehttp://www.blogger.com/profile/09610805269377009736noreply@blogger.comtag:blogger.com,1999:blog-3510987239348986069.post-51469101529916981972008-04-19T07:51:00.000-07:002008-04-19T05:48:03.055-07:00When is data mining not data mining?Product affinity is one of those processes that used to be relegated to the statisticians in the back room. The process was so time consuming and expensive that it wasn’t done except for very high-profile situations. Also, the results frequently proved to be of little value. On one discovery project I found an amazing affinity between bananas and dog food. When I told the client, he laughed and said, “Check for bananas and toilet paper.” Sure enough there was a strong affinity there as well. It seems that bananas have an affinity with almost everything in the store.<br /><br />With high costs, and results like "bananas and everything," affinity was one of the more obscure data mining processes. Now though, we can do product affinity as a simple query. To find the products that sell with a selected product, we select the product, ask for the baskets that contained the product and then select all of the products sold in those baskets. Ordering the results by units sold gives us a list of the top items sold with our selected product.<br /><br />Using the <a href="http://www.i-lluminate.com/incremental_database_queries.html">incremental query</a> feature of the <a href="http://www.i-lluminate.com/data_warehouse_exploration_tool.html">iCorrelate</a> query screen, the real value of affinity can be extracted from tons of raw data. For example, when we get the baskets that contain our selected product, we can select only those baskets from a specific time period or a store or region or only the baskets from weekends or mornings. Whatever behavior characteristic we are interested can be used to get the affinity of a selected subset of all baskets.<br /><br />We can also extend the affinity beyond baskets if customer information is available. When we have the desired baskets, we ask for the customers who purchased those baskets. Then asking for the baskets related to those customers, then the items in those baskets, we get product affinity at the customer level. Rather than market basket analysis, we are doing customer purchase analysis. Selecting only the baskets from a promotion provides another analysis of promotion effectiveness.<br /><br />The old data mining process for product affinity had limited value and high cost. The incremental query method however, has low cost and high value making it an excellent tool for product managers, promotion planners and other business people who need to analyze shopping behavior.Joehttp://www.blogger.com/profile/09610805269377009736noreply@blogger.comtag:blogger.com,1999:blog-3510987239348986069.post-16656993223648112272008-04-09T08:24:00.000-07:002008-04-09T09:44:16.356-07:00An inexpensive data warehouse?What makes a data warehouse system expensive or affordable? Let’s take a quick look at two examples; one based on an RDBMS that is free and another based on a CDBMS that sells for $100K. To keep it simple, I’ll use 100 gigabytes of raw data as the input to these systems and assume a small number of users directly on the warehouse and a larger number of users with OLAP screens, dashboards using star schemas, etc. The costs of the front end systems will be essentially the same for both systems so they are not included.<br /><br /><span style="font-weight: bold;">Design – RDBMS:</span> Six calendar months, about $150,000 in labor cost.<br />Get user requirements defined, design level one schema, design level two schema, design all required OLAP cubes, star schemas and other information access structures, prepare load, indexing and aggregation processes, prepare physical design.<br /><br /><span style="font-weight: bold;">Design – CDBMS:</span> Three calendar days, about $750 in labor cost<br />Define aggregations, prepare load and aggregation processes.<br /><br /><br /><span style="font-weight: bold;">Startup – RDBMS:</span> 20 calendar days, about $15,000 in labor cost<br />Load data, index level one. Build and index level two. Extract data for access structures, build and index structures. Validate all levels for accuracy and completeness.<br /><br /><span style="font-weight: bold;">Startup – CDBMS:</span> Four calendar days, about $1000 in labor cost<br />Load data and validate for accuracy and completeness<br /><br /><br /><span style="font-weight: bold;">Operation – RDBMS:</span> One full time person, about $75,000 per year<br />Weekly load and update and rebuild of all access structures. Design and build of new access structures as needed. Periodic reorganization of level one.<br /><br /><span style="font-weight: bold;">Operation – CDBMS:</span> Less than 1/10 of one full time person<br />Weekly loads.<br /><br />It looks like even "free" gets expensive if you can’t reduce the time required to design, build and maintain the data warehouse. Of course, this comparison ignores the opportunity cost of not having the information available for six months. And this sample is a small data warehouse. The cost of implementing a data warehouse is roughly proportional to the amount of data and the number of users. A system with a terabyte of data and a thousand users would be much higher. The cost of creating the warehouse with both the RDBMS and the CDBMS would increase by at least a factor of ten.<br /><br />Hmmm, ten times $165,000 or ten times $1,750; that seems like an easy decision.Joehttp://www.blogger.com/profile/09610805269377009736noreply@blogger.comtag:blogger.com,1999:blog-3510987239348986069.post-19562923879307087652008-03-26T00:19:00.000-07:002008-03-30T06:24:52.857-07:00Airplanes and Data WarehousesRecently I have been flying more than a pilot, and while idly passing time waiting until I could get my computer back out I thought about how similar the progression of flying is to the progression of data warehouses. A hundred years ago, a pilot could use a plane to go places where there were no direct roads. He could fly faster than a car could drive or a horse could run. Fifty years ago a technician could run a program in a computer that would store and retrieve information faster and more accurately than any person.<br /><br />Airplanes got faster, better and cheaper. Soon ordinary people like me routinely flew around the world. Computers got faster, better and cheaper. Soon every person in their business and personal life was using them every day.<br /><br />However, I had to travel from Barcelona, Spain to San Diego, California. To accomplish this simple task, I flew from Barcelona to Amsterdam. There I waited in an airport for a couple of hours before flying to Minneapolis. After a few hours there, I flew on to San Diego. The return to Barcelona was a similar trip.<br /><br />With the computer, if I want to see information about sales for a day, I have to start by accumulating all of the daily transactions. Then I group them and sum the important statistics like volume and net prices. Then I need to build my star schema. Now I can use by BI tool to finally see the sales. If I want to see all of the sales of non-food products to female customers that were made on weekday mornings, it would be more difficult than a trip from an Inuit village in Alaska to a jungle village deep in the Congo basin.<br /><br />The planes meet 80% of the travel needs well and the computers meet 80% of the information needs well. And better, faster, cheaper technology will not improve that ratio much for either technology. In fact, as people become more used to the technologies, the 80% portion is taken for granted and there is increased demand for the 20% part. Soon the ratio changes to 70/30 or 60/40.<br /><br />New technologies like a personal transporter or a correlation database are needed to change the old cast-in-concrete methods. I don’t know where to find the personal transporter, but with a <a href="http://www.i-lluminate.com">correlation database</a> you can ask any question directly from the raw data as soon as it is loaded. You can change the direction of the questioning at any time. You can move from one question to the next without starting over and tracing out a new path. It gives you the ability to navigate through your information as you need to rather than through the “airports” that were built to designate paths for you. You can arrive quickly at your desired destination even if you weren't sure of it when you started your journey.Joehttp://www.blogger.com/profile/09610805269377009736noreply@blogger.comtag:blogger.com,1999:blog-3510987239348986069.post-18816178207011733802008-03-18T12:46:00.000-07:002008-03-18T12:51:40.557-07:00But how fast can you load? Part 2The key consideration in load time for a data warehouse is <span style="font-style: italic;">time to value</span>: the elapsed time between having a raw data file ready to load into the data warehouse and being able to use that data for analysis.<br /><br />A correlation database, using the value-based storage model, provides faster time to value than record-based relational or column-oriented columnar database approaches because there is no predefinition of business requirements needed, no need to predefine database schema, and no post-design of data cubes. The correlation database creates its own optimized index structure during data loading, which includes all relationships within the data. And it <span style="font-style: italic;">never</span> needs restructuring.<br /><br />So how easy is to load data and have it ready for analysis? This easy:<br /><br /><object width="425" height="355"><param name="movie" value="http://www.youtube.com/v/Kv08rgaJc78&hl=en"></param><param name="wmode" value="transparent"></param><embed src="http://www.youtube.com/v/Kv08rgaJc78&hl=en" type="application/x-shockwave-flash" wmode="transparent" width="425" height="355"></embed></object>Joehttp://www.blogger.com/profile/09610805269377009736noreply@blogger.comtag:blogger.com,1999:blog-3510987239348986069.post-64897013911116564042008-03-18T09:47:00.000-07:002008-03-18T12:56:10.246-07:00An incremental approachHere is a real query from hell in the making. One of our customers wanted to prepare a list of customers for a promotion. They wanted the usual selection criteria like geographic and demographic selections and they wanted at least 5,000 good names and not more than 6,000.<br /><br />With the incremental approach, you ask for one qualification at a time and you can review the results between each step. The marketer who was preparing the promotion was able to select the customers from the area where the promo was going to take place. A quick look showed that there were far more customers there than they wanted to contact so another selection on a demographic code narrowed the result but there were still too many names.<br /><br />Scanning through some of the results, it was quickly clear that there were many old customers who had not shopped recently. Now the marketer had the option to go after the current customers and focus the promo on retention or go for the old ones to try to get them back. Selecting the old ones resulted in a few more names than were really desired. Another quick scan through the data showed a wide range in average spending. An addition to the query selected the big spenders resulted in the right sized group and the promotion was launched. Total time for the discovery process was less than one hour.<br /><br />Imagine attempting that with a typical DW! What was the original requirement? It wasn’t really known. If this had to be done as an IT project, it would never start since the requirements definition would take longer then the cycle between promotions.<br /><br />If the marketer had access and skill in SQL, then the same process could be attempted. However, the probability of having all of the required fields indexed would be near zero. If the exploration and business decisions had taken another direction and they went after active customers, a different set of fields would be required. What are the chances that they would also be indexed?<br /><br />Without a doubt, there is a query from hell somewhere in this business process.Joehttp://www.blogger.com/profile/09610805269377009736noreply@blogger.comtag:blogger.com,1999:blog-3510987239348986069.post-17945432177333574912008-03-12T00:31:00.000-07:002008-03-12T10:56:49.853-07:00Value Based StorageJust what is value based storage? This is a common question I often need to answer. It sounds like some kind of marketing jingle but, in fact, it is the underlying iLuminate storage structure technology that we are talking about.<br /><br />The iLuminate internal storage looks dramatically different from either row based storage (Oracle, SQL Server, etc) or column based storage (Sybase IQ, Vertica, etc). Value Based Storage (VBS) stores the data values in sets (the "value pool") based on data type and each distinct value is stored only once. The original record definition is maintained by an indexing system that links to values the value pool. These two objects are enhanced by the extended metadata giving the iLuminate database its complete functionality.<br /><br />This storage structure provides several benefits. The database is very small, in some cases even smaller than the raw data that went into it. In addition to simply being small, it concentrates the most active elements of the data into very compact sets. The value pool is often small enough to fit entirely in memory. As a result, the slowest part of database processing, transfer of data from disk to memory, is dramatically reduced.<br /><br />The VBS also provides some data access features that can not be duplicated in any other structure. A search can be executed starting with a data value in the pool. By going from the value pool back to the index, it is possible to quickly locate every use of the value wherever is may be used in the logical record structures.<br /><br />This structure also enables our incremental query capability. As the result of a query, the database returns a set of instance identifiers rather than a set of records. This is because there are no records, only pointers and values. With the response being a set of pointers, it is a simple matter to perform the next query step and then get the union or difference between the two sets of pointers for the result of the second query step. This process can be continued indefinitely with the result set shrinking or growing as the new results are merged with the old.<br /><br />The extended metadata and the indexing structure and logic provide more features but that can be the subject of another discussion.Joehttp://www.blogger.com/profile/09610805269377009736noreply@blogger.comtag:blogger.com,1999:blog-3510987239348986069.post-9101068133561914002008-03-05T07:06:00.000-08:002008-03-05T09:02:36.430-08:00But how fast can you load?When I talk to people about the Illuminate CDBMS, the first reaction I usually get from them is, “That sounds great, but how fast can you load data?”<br /><br />My usual answer is “How fast is fast enough?”<br /><br />This is not a flippant answer, but an honest question that is intended to start a discussion on the meaning of load speed. The real measurement of load speed should be the time needed from starting the transfer of source data from the operational data stores, to the availability of the business information for the end user. In a multi-layer information infrastructure, that will include time for:<br /><br /><ul><li>Importing raw data</li><br /><li>Indexing the imported data</li><br /><li>Running all aggregation processes and building intermediate layer access structures as needed</li><br /><li>Further aggragations, rebuilding and re-indexing star schemas, OLAP cubes and any other customized, end use information delivery structures.</li></ul><br />In an RDBMS environment, steps 1 and 2 are very fast since there is very little indexing on the foundation data layer. Depending on the hardware, communications methods and schema, this could be at a rate of perhaps 2 to 10 gigabytes per hour or even more in extreme cases with extreme hardware.<br /><br />Steps 3 and 4 are the most unpredictable since every organization will have differing amounts of work in each layer. However, it is safe to say that this will be the longest part of the load process. A factor of ten times the duration of steps 1 and 2 would not be unusual. An overnight load process that loads for thirty minutes would then need five more hours to complete all of the remaining preparation steps.<br /><br />In the illuminate environment, the load is a single step with some aggregations needed in limited cases. When data is loaded into <a href="http://www.i-lluminate.com/illuminate_server.html">iLuminate</a>, the load process indexes everything. This means that step 2 in this case is zero.<br /><br />Because everything is indexed, there is no star schema or OLAP cube building required. The foundation layer with 100% indexing already provides the full functionality of those structures. Also, due to the indexing, aggregations will not be needed as soon as they are with an RDBMS. The iLuminate database will minimize the I/O needed for the aggregation so the computational speed of the computer is the determining factor for aggregations. Small aggregations can be done as a part of the query process without impacting the user response time.<br /><br />While an exact comparison can not be made because of variability of hardware, schema design and the number of information layers, an example comparison can demonstrate the conceptual difference between loading with a conventional system and loading with iLuminate.<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.ianalyze.net/uploaded_images/blog_table-797662.jpg"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://www.ianalyze.net/uploaded_images/blog_table-797646.jpg" alt="" border="0" /></a><br />So the answer to the original question is that iLuminate loads very fast. More fast enough for almost any enterprise.Joehttp://www.blogger.com/profile/09610805269377009736noreply@blogger.comtag:blogger.com,1999:blog-3510987239348986069.post-24466659856175976652008-03-03T06:50:00.000-08:002008-03-03T08:36:51.539-08:00Analytical Queries???Analysis – A breaking up of a whole into its parts to find out their nature (<a href="http://www.merriam-webster.com/dictionary/analysis">Websters New World Dictionary</a>).<br /><br />It seems strange to talk about analytical queries when the entire analysis must first be defined in order to create the SQL query. The analysis is really in the preparation, and the actual query becomes just another report generation process. If the analysis is not exactly what is needed, there is no way to know that until the report is done. Then the analysis/report process starts again or, all too frequently, the process is abandoned without a result.<br /><br />The <a href="http://www.i-lluminate.com/illuminate_explorer.html">incremental query</a> moves the analysis into the query process where it really belongs. The person who needs the result of the analysis can ask questions until the analysis is complete and the answer has been found.<br /><br />For example, a retailer wants to prepare a list of customers for a promotion. They want the usual criteria like geographic and demographic selections and they want at least 5,000 good names and not more than 6,000.<br /><br />The initial geographic and demographic questions are asked and there are 25,000 customers qualified. Now, the retailer can just take the first 5,000 names, or the quality of the list can be improved through some analysis, narrowing the list to the best 5,000—not simply the first 5,000.<br /><br />With the incremental query process of Illuminate, they continue by asking for only customers who had previously purchased a specific type of product. Now the 25,000 names have been reduced to 10,000 better-qualified customers. Add another query step to select only those who shop on weekends. Now they have the 5,000 customers who precisely fit the profile that they want.<br /><br />With the incremental approach, you ask for one qualification at a time and you can review the results between each step. The process just described would take a minute or two. In an RDBMS environment, reaching the same result would require either an extensive analytical process to define the final query or multiple sessions of: query design, examination of results and query redesign until the final result is reached.<br /><br />Due to the cost of the old approach, the typical process will be to stop with the first query step and business user will just take 5000 off the list. The quality of the promotion will be compromised—but if the promotion planning is done on a weekly basis, there is no time for the alternative.<br /><br />Result? - The business suffers due to the limits of technology.<a href="http://technorati.com/faves?sub=addfavbtn&add=http://ianalyze.net"><br /></a>Joehttp://www.blogger.com/profile/09610805269377009736noreply@blogger.comtag:blogger.com,1999:blog-3510987239348986069.post-80154639646375707292008-02-28T11:58:00.000-08:002008-02-29T07:13:35.901-08:00Which tool do I use?I monitor a few BI web sites and I talk with many IT folks working in DW and BI. In spite of how many times I see and hear it, I am still astonished when people discuss which is the best BI tool. Many years ago I heard a smart person say the whole world looks like a nail when the only tool you have is a hammer. It seems that rule is as constant as the sunrise.<br /><br />Comparing <a href="http://creese.typepad.com/pattern_finder/2007/11/microsofts-free.html">SQL Server</a>, <a href="http://www.biinaction.com/blog/2007/07/bi_bellwether_cracks.html">Cognos</a>, <a href="http://blogs.forrester.com/information_management/2007/11/at-the-risk-of-.html">SAS</a> and <a href="http://www.qlikview.com/home.aspx"></a><a href="http://discuss.joelonsoftware.com/default.asp?design.4.303910.12">QlikView</a> is like comparing hammers, saws, drills and wrenches. These are all very different products with different strengths and weaknesses. Each of them may be the best tool for something but none is best for everything in BI any more than one of the builders’ tools is the best for building a house.<br /><br />My interest is in databases and I am frequently told something like “I can do that with (<a href="http://www.talkaboutdatabases.com/group/comp.databases.olap/messages/26923.html">SSAS</a>, <a href="http://jkobielus.blogspot.com/2007_12_01_archive.html">Business Objects</a>, <a href="http://www.b-eye-network.com/blogs/linstedt/archives/2006/09/hot_new_open_so.php">Pentaho</a>,…) when I describe a feature of our database.<br /><br />Yes, perhaps, but I guess you could use a hammer to smash the end off a board if you didn’t have a saw. Does that make sense? Of course not, but many people do not see the obvious parallel.<br /><br />For example, the <a href="http://www.i-lluminate.com/illuminate_server.html">iLuminate database</a> indexes everything, always. This makes it an excellent foundation for analytics and information discovery. Because of the indexing, any column of any table can be either a dimension or a fact depending on the nature of the business question. Any column or columns of any table or tables can be used as selection criteria and there will never be a full table scan.<br /><br />Can you duplicate these features with SQL Server? Theoretically yes, but in the real world, it can never be done. This would require building as many <a href="http://www.ciobriefings.com/whitepapers/StarSchema.asp">star schemas</a> and indices as there are columns in the database. You would need more disk than can be attached to a computer and most of the stars would never be used. Unfortunately, you can’t predict which of them would never be used. With SQL Server you can resolve some of the business issues but never all of them.<br /><br />Can you duplicate these features with an <a href="http://www.olapreport.com/fasmi.htm">OLAP</a> product? Even theoretically this is not possible. How about with SSAS or Business Objects or??? No. But can it be done with the right suite of products? Again, no.<br /><br />The only product in a complete suite that could attempt to provide the indexing power is the database and the relational databases can not do it. Front end systems and mathematical algorithms can not even address the issue. There is no product or combination of products that can duplicate the benefits of the indexing function.<br /><br />The real statement that people are making is that they know how to work well with a given tool and they would rather focus on the things they can do with that tool than solve business problems. As a result, the query from hell seems to have a long and healthy future.Joehttp://www.blogger.com/profile/09610805269377009736noreply@blogger.comtag:blogger.com,1999:blog-3510987239348986069.post-67531792165540487022008-02-27T14:01:00.000-08:002008-02-28T06:23:42.268-08:00How many DBAs does it take to...How many DBA’s does it take to change a light bulb? Come on! They have more important things to do.<br /><br />How many DBA’s does it take to optimize performance of a data warehouse? The answer should be the same.<br /><br />A DBA should be focused on making the information owned by the enterprise as accessible and useful to the enterprise as possible. Unfortunately, when the data warehouse has been built on a relational database, performance tuning issues will consume most of their time. Precious little time is left for tasks like identifying user needs, coaching designers and developers, developing and managing database access policy and the myriad of other tasks that directly benefit the business.<br /><br />The value based structure inherently provides all of the performance tuning that can be done. In this system, each value is stored once regardless of where or how frequently it may be used. The core of the database is an index and linking system that links all of those values into their proper place in records.<br /><br />Because of this structure, every column of every table is always indexed. There is no way to access anything in this database except through the index. Inserts, updates and deletes never result in overflow areas or empty spaces in the database – no reorganizing ever. The size of the database is dramatically less than it would be with the same data in an RDBMS based data warehouse, nearly eliminating physical design issues.<br /><br />When the database system does all of that for the DBAs, they can spend a lot more time with users and developers to maximize the benefit that the organization gets from its information. The organizations information is an asset and it is the job of the DBA to ensure that the asset is used as effectively as possible. Spending time on ensuring that the database is designed to take advantage of every possible processor cycle does very little to help the business.Joehttp://www.blogger.com/profile/09610805269377009736noreply@blogger.comtag:blogger.com,1999:blog-3510987239348986069.post-13570033928217615592008-02-26T07:30:00.000-08:002008-02-26T10:36:03.936-08:00Flexibility is CashContinuing the last discussion, how can the information in an RDBMS be converted from a drill press into a pile of cash? Add flexibility. A drill press does one thing very well while cash can be used to accomplish many objectives. The structure of data in an RDBMS defines the boundaries of its use just as the motors, gears and levers of a drill press define its use.<br /><br />A data structure that allows complete flexibility of use frees the corporate information asset to be used more like cash. As long as data is stored in records and indexed for access, it will do one thing very well and it will require armies of smart people to make it deliver minimal results for all other information functions.<br /><br />There are some relatively new options for supporting information discovery and analytics in the market today that were not available when most data warehouses were being planned and implemented. Column structure was introduced by <a href="http://www.sybase.com/products/datawarehousing/sybaseiq">Sybase</a> and other companies have joined in with their new products. Value structure is available only from <a href="http://www.i-lluminate.com/illuminate_server.html">Illuminate</a> today.<br /><br />Column structured database systems store data in columns rather than records. Adding a new record in this structure requires splitting the new record into individual values and then updating the column sets individually. Retrieving information involves finding the desired value in a column and the retrieving all of the values from the same relative position in all other columns. While this may sound complicated and difficult, it actually takes advantage of the fact that processor cycles are cheap and plentiful while disk bandwidth is limited and expensive. This structure reduces the I/O load and disk usage when compared to record structures.<br /><br />The value structure stores unique data values once in a data pool and maintains an indexing system that reconstructs the records as needed. Like the column structure, this structure uses processor cycles to reduce I/O. However, the value based structure reduces I/O even more than the column structure and makes possible new analytical and discovery functions that are unique to this structure. This is as close as information gets to cash today.Joehttp://www.blogger.com/profile/09610805269377009736noreply@blogger.comtag:blogger.com,1999:blog-3510987239348986069.post-43048714499621933002008-02-25T10:18:00.000-08:002008-02-25T13:09:09.371-08:00There Are No Queries From Hell"Queries from hell" really represent "preparation from hell." If you have a really nasty query that is consuming the system, it is already too late. But don't give up and get a job packing groceries yet.<br /><br />In most cases, the biggest flaw in the preparation is the selection of the wrong infrastructure components. A database that is ideal for supporting an OLTP system is dead wrong for supporting information discovery and analytical processes. And, of course, those are the processes that are usually responsible for those famous queries. Trying to support a data warehouse with a record-based storage system is like trying to fly a plane with a diesel engine; while it may be possible to do it, it makes no sense.<br /><br />Looking at some alternative storage technologies opens new business opportunities for the data warehouse, develops the real asset value of corporate information and banishes those famous queries.<br /><br />“Information is a corporate asset” is a mantra that many IT people have been chanting for years. Recognition of information as a real corporate asset would give the IT industry a new status they have longingly sought for decades. However, what kind of asset should it be?<br /><br />A drill press is an asset and so is a pile of cash. A drill press can be used to efficiently perform one step of a manufacturing process and a pile of cash can be used to accomplish almost any business objective in any line of business. If information is rigidly structured, it is like the drill press. If it can be accessed in any way at any time for any purpose, then it is more like cash. The good old RDBMS makes a wonderful drill press.Joehttp://www.blogger.com/profile/09610805269377009736noreply@blogger.comtag:blogger.com,1999:blog-3510987239348986069.post-35137962776233742952008-02-20T16:24:00.000-08:002008-02-21T04:32:48.997-08:00What are "Queries from Hell"?In the world of data warehousing, administrators need to be wary of "queries from hell." These are variously defined, by <a href="http://www.itjungle.com/mso/mso061102-story04.html">IT Jungle</a> as "gigantic queries that may have been accidentally set off by users and that can consume enormous amounts of systems resources," and by the <a href="http://books.google.com/books?id=6MIzW0IejjkC&pg=PA328&lpg=PA328&dq=%22queries+from+hell%22&source=web&ots=PVRDylsW4s&sig=_gHf6ENu2VE8DkUY__P9msJ-WDM">IS Management Handbook</a> as database queries that "can destroy perceived levels of performance if they are not identified and managed carefully."<br /><br />They are defined here simply as queries that are <span style="font-style: italic;">really really hard</span> to execute using standard data warehouse technology because they require database restructuring, days of SQL coding, inordinate processing resources, or just don't seem to adhere to the rules that 90+% of "standard" queries follow.<br /><br />Telling business users that they can't have an answer to a question because even though the answer lies within a database, it is simply too difficult to extract, is not a feasible option. Neither, often, is explaining that the answer can be obtained, but it will take weeks or months of programming.<br /><br />The answer is to segregate these queries from the 90+% of standard BI and business analysis queries, and treat them differently. Use the right tools for the job. That's the kind of thing you'll be seeing in this blog over the coming months and years.Joehttp://www.blogger.com/profile/09610805269377009736noreply@blogger.com