tag:blogger.com,1999:blog-243594212008-07-15T13:03:15.335+02:00PrimeBase XTPaul McCullaghhttp://www.blogger.com/profile/10476275147692948727noreply@blogger.comBlogger44125tag:blogger.com,1999:blog-24359421.post-61639604583405645542008-07-14T15:34:00.001+02:002008-07-14T15:32:29.903+02:00Mutex contention and other bottlenecks in MySQLOver the last few weeks I have been doing some work on improving the concurrency performance of <a href="http://www.primebase.org">PBXT</a>. The last Alpha version (1.0.03) has quite a few problems in this area.<br /><br />Most of the problems have been with r/w lock and mutex contention but, I soon discovered that MySQL has some serious problems of it's own. In fact, I had to remove some of the bottlenecks in MySQL in order to continue the optimization of PBXT.<br /><br />The result for simple SELECT performance is shown in the graph below.<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.primebase.org/images/std_vs_opt.jpg"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 360px;" src="http://www.primebase.org/images/std_vs_opt.jpg" alt="" border="0" /></a>Here you can see that the gain is over 60% for 32 or more concurrent threads. Both results show the performance with the newly optimized version of PBXT. The test is running on a 2.16 MHz dual core processor, so I expect an even greater improvement on 4 or 8 cores. The query I ran for this test is of the form <tt>SELECT * FROM table WHERE ID = ?</tt>.<br /><br />So what did it do to achieve this? Well first of all, as you will see below, I cheated in some cases. I commented out or avoided some locks that were a bit too complicated to solve properly right now. But in other cases, I used solutions that can actually be taken over, as-is, by MySQL. In particular, the use of spinlocks.<br /><br />All-in-all though, my intension here is just <b>to demonstration the potential for concurrency optimization</b> in MySQL.<br /><br /><b>Optimization 1: LOCK_plugin in plugin_foreach_with_mask()</b><br /><br />The <tt>LOCK_plugin</tt> mutex in <tt>plugin_foreach_with_mask()</tt> is the first bottleneck you hit in just about any query. In my tests with 32 threads it takes over 60% of the overall execution time.<br /><br />In order to get further with my own optimizations, I <b>commented out</b> the <tt>pthread_mutex_lock()</tt> and <tt>pthread_mutex_lock()</tt> calls in this function, knowing that the lock is only really needed if plug-ins are installed or uninstalled. However, later I needed to find a better solution (see below).<br /><br /><b>Optimization 2: LOCK_grant in check_grant()</b><br /><br />After removing the above bottleneck I hit a wall in <tt>check_grant()</tt>. <tt>pthread_rwlock_rdlock()</tt> was taking 50%, and <tt>pthread_rwlock_unlock()</tt> was taking 45.6% CPU time! Once again I commented out the calls <tt>rw_rdlock(&amp;LOCK_grant)</tt> and <tt>rw_unlock(&amp;LOCK_grant)</tt> in <tt>check_grant()</tt> to get around the problem.<br /><br />In order to really eliminate this lock, MySQL needs to switch to a different type of read/write lock. 99.9% of the time only a read lock is required because a write lock is only required when loading and changing privileges.<br /><br />For similar purposes, in PBXT, I have invented a special type of read/write lock that requires almost zero time to gain a read lock ... hmmmm ;)<br /><br /><b>Optimization 3: Mutex in LOCK and UNLOCK tables</b><br /><br />I then discovered that 51.7% of the time was taken in <tt>pthread_mutex_lock()</tt> called from <tt>thr_lock()</tt> called from <tt>open_and_lock_tables()</tt>.<br />And, 44.5% of the time was taken in <tt>thread_mutex_lock()</tt> called from <tt>thr_unlock()</tt> called from <tt>mysql_unlock_tables()</tt>.<br /><br />Now this is a tough nut. The locks used here are used all over the place, but I think they can be replaced with a spinlock to good effect (see below). I did not try this though. Instead I used LOCK TABLES in my test code, to avoid the calls to LOCK and UNLOCK tables for every query.<br /><br /><b>Optimization 4: LOCK_plugin in plugin_unlock_list()</b><br /><br />Once again the <tt>LOCK_plugin</tt> is the bottleneck, this time taking 94.7% of the CPU time in <tt>plugin_unlock_list()</tt>. This time I did a bit of work. Instead of commenting it out, I replaced <tt>LOCK_plugin</tt> with a spinlock (I copied and adapted the PBXT engine implementation for the server).<br /><br />This worked to remove the bottleneck because <tt>LOCK_plugin</tt> is normally only held for a very short time. However, when a plugin is installed or unstalled this lock will be a killer and some more work probably needs to be done here.<br /><br /><b>Optimization 5: pthread_setschedparam()</b><br /><br />I was a bit shocked to find <tt>pthread_setschedparam()</tt> was now taking 17% of the CPU time required to execute the <tt>SELECT</tt>. This call can be easily avoided by first checking to see if the schedule parameter needs to be changed at all. For the moment, I commented the call out.<br /><br />Of course, the more optimized the code is, the worse such a call becomes. After all other optimizations <tt>pthread_setschedparam()</tt> CPU time increases to 52.6%!<br /><br /><b>Optimization 6: LOCK_thread_count in dispatch_command()</b><br /><br />The <tt>LOCK_thread_count</tt> mutex in dispatch_command() is next in line with 96.1% of the execution time.<br /><br />Changing this to a spinlock completely removes the bottleneck.<br /><br /><b>Optimization 7: LOCK_alarm in thr_end_alarm() and thr_alarm()</b><br /><br /><tt>my_net_read()</tt> calls <tt>my_real_read()</tt> which calls the functions <tt>thr_end_alarm()</tt> and <tt>thr_alarm()</tt>. At this point in the optimization these 2 calls required 99.5% of the CPU time between them. Replacing <tt>LOCK_alarm</tt> with a spinlock fixed this problem.<br /><br /><b>Conclusion:</b><br /><br />Without too much effort it is possible to make a huge improvement to the threading performance of MySQL. The fact that such bottlenecks have not yet been investigated may be due the fact that MySQL currently has no performance analysis team.<br /><br />Following the last optimization, execution time was divided as follows:<br /><br /><b>25.8%</b> of the time in <tt>net_end_statement()</tt>, which hangs in <tt>net_flush()</tt><br /><b>32.8%</b> of the time in <tt>my_net_read()</tt><br /><b>7.6%</b> in <tt>ha_pbxt::index_read()</tt>, this is the time spent in the engine<br /><b>32.2%</b> in <tt>init_sql_alloc()</tt> which waits on the spinlock in <tt>malloc()</tt><br /><br />From this you can see that the optimization is almost optimal because the program is spending almost 60% of its time waiting on the network.<br /><br />However, it is also clear where the next optimization would come from. Remove the call to <tt>malloc()</tt> in <tt>init_sql_alloc()</tt> which is called by <tt>open_tables()</tt>. This could be done by reusing the block of memory required by the thread, from call to call.<br /><br />Ultimately, the goal of optimizing for scale like this is to bring the code to the point that it is either network, CPU, or disk bound. Only then will the end-user really see an improvement in performance as the hardware is upgraded.<br /><br />I think I have shown that it is worth putting some effort into such optimizations. Even more so as multi-core systems become more and more commonplace.Paul McCullaghhttp://www.blogger.com/profile/10476275147692948727noreply@blogger.comtag:blogger.com,1999:blog-24359421.post-13676185074039718782008-06-13T15:10:00.000+02:002008-06-13T15:10:05.805+02:00PBXT compiles without change under MySQL 5.1.25!OK, now I know that the <i>GA version of 5.1</i> is rapidly approaching. PBXT compiles with the latest release of MySQL <b>without any changes</b>!<br /><br />This has <i>never been the case before</i>. Just search the PBXT code for <tt>MYSQL_VERSION_ID</tt>, and you will find things like:<br /><pre>#if MYSQL_VERSION_ID < 50114<br /> XT_RETURN_VOID;<br />#else<br /> XT_RETURN(0);<br />#endif<br /></pre>and, even worse:<br /><pre>#if MYSQL_VERSION_ID < 60000<br />#if MYSQL_VERSION_ID >= 50124<br />#define USE_CONST_SAVE<br />#endif<br />#else<br />#if MYSQL_VERSION_ID >= 60005<br />#define USE_CONST_SAVE<br />#endif<br />#endif<br /></pre>The lack of changes that affect pluggable storage engines can only mean that the bug fixes required are diminishing in scope.<br /><br />And I believe this is a far better gauge of whether GA is close than any other <a href="http://www.sun.com/aboutsun/pr/2008-04/sunflash.20080415.1.xml">marketing orientated statements</a>! :)Paul McCullaghhttp://www.blogger.com/profile/10476275147692948727noreply@blogger.comtag:blogger.com,1999:blog-24359421.post-37636575054338234152008-06-04T09:57:00.000+02:002008-06-04T09:57:18.297+02:00PBXT 1.0.03 Alpha has been released!I have released PBXT 1.0.03 Alpha and it is available for download from <a href="http://www.primebase.org/download">http://www.primebase.org/download</a>. I have also posted binary plugins for a few platforms.<br /><br />If you are building from source I have added a <a href="http://www.primebase.org/download/index.php#qg_source">Quick Guide: Building and Installing PBXT from Source</a>, which I hope makes the task really simple. If not, I would appreciate any feedback!<br /><br />With this version I have completed the implementation of full-durability, and other features that are scheduled for RC and ultimately for the first GA release.<br /><br />Still to be done is the Windows port which I plan to do before the first Beta release.<br /><br />Please send any comments, questions, bug reports, etc. directly to me: paul dot mccullagh at primebase dot org.Paul McCullaghhttp://www.blogger.com/profile/10476275147692948727noreply@blogger.comtag:blogger.com,1999:blog-24359421.post-655487906910531012008-05-01T12:55:00.006+02:002008-05-01T13:08:53.027+02:00PBXT & BLOB Streaming Conference Presentations & VideosThe slides of my presentations at the MySQL Conference &amp; Expo 2008 are now available for download. Videos of the presentations have been uploaded to <a href="http://www.youtube.com/">YouTube</a>:<br /><center><br /><b><i>Inside the PrimeBase XT Storage Engine</i></b><br /><br />Presentation: <a href="http://www.primebase.org/download/pbxt-uc-2008.pdf">pbxt-uc-2008.pdf</a><br />Videos: <a href="http://www.youtube.com/watch?v=TyXk7rQ5hFo">Part 1/7</a>, <a href="http://www.youtube.com/watch?v=A1XjcA4MpHY">Part 2/7</a>, <a href="http://www.youtube.com/watch?v=CAiFE28Z22E">Part 3/7</a>, <a href="http://www.youtube.com/watch?v=EVQKsGCfJTU">Part 4/7</a>, <a href="http://www.youtube.com/watch?v=HHTPvd1U1R4">Part 5/7</a>, <a href="http://www.youtube.com/watch?v=gsLEdZuR3nk">Part 6/7</a>, <a href="http://www.youtube.com/watch?v=3XlDozTxA9E">Part 7/7</a><br /><br /><b><i>Introduction to the BLOB Streaming Project</i></b><br /><br />Presentation: <a href="http://www.blobstreaming.com/download/mybs-uc-2008.pdf">mybs-uc-2008.pdf</a><br />Videos: <a href="http://www.youtube.com/watch?v=pgx3XJlmAeo">Part 1/5</a>, <a href="http://www.youtube.com/watch?v=KFRs_8liVBg">Part 2/5</a>, <a href="http://www.youtube.com/watch?v=BBiLMaY-GEk">Part 3/5</a>, <a href="http://www.youtube.com/watch?v=gT8Jrycrj80">Part 4/5</a>, <a href="http://www.youtube.com/watch?v=nXnBGxf8vuk">Part 5/5</a><br /></center><br />With <a href="http://www.youtube.com/results?search_query=BLOB+PBXT&amp;search_type=">this link</a> you will find all the videos at once. If you watch the movies, then it may help to look at the PDF presentation slides at the same time, because the video quality is "not ideal" :)Paul McCullaghhttp://www.blogger.com/profile/10476275147692948727noreply@blogger.comtag:blogger.com,1999:blog-24359421.post-16547841639468133272008-04-22T11:06:00.000+02:002008-04-22T11:06:11.409+02:00Sun is serious about Open Source and the MySQL CommunityIn probably the best move by Sun during the whole MySQL Conference and Expo, <a href="http://www.forbes.com/finance/mktguideapps/personinfo/FromPersonIdPersonTearsheet.jhtml?passedPersonId=1099090">Rich Green</a> and <a href="http://blogs.sun.com/jonathan">Jonathan Schwartz</a> turned up at the Community Dinner on the Sunday night before the conference.<br /><br />As we walked into the restaurant I saw a face that I thought was familiar. Jonathan and Rich were standing outside the restaurant talking. However, only when we got inside did I hear <a href="http://jpipes.com/index.php">Jay</a> saying that that was Jonathan Schwartz.<br /><br />So just before we all took our places, and while we were trying to work out how we were going to organize payment for the dinner, Rich and Jonathan turned up and quickly ended the discussion. Rich said his credit card would be good for the tab. So thanks to Sun for that!<br /><br />But besides good food and plenty to drink, it was a great opportunity to talk and ask some questions that have been on my mind since the acquisition of MySQL by Sun. I have expressed these concerns on this blog, and they can be summarized as follows:<br /><br /><b>How important is open source, and in particular the MySQL community to Sun?</b><br /><br />Both Rich and Jonathan were able to give me an adequate answer to this question. I will summarize this in my own words.<br /><br />Sun bought MySQL to expand its business and influence in the open source world. So the MySQL community is the key to this. <br /><br />I believe this means that Sun is not interested in commercializing any parts of the MySQL server, and here I am referring to the massive discussion that has resulted from the announcement <a href="http://jcole.us/blog/archives/2008/04/14/just-announced-mysql-to-launch-new-features-only-in-mysql-enterprise/"><i>MySQL to launch new features only in MySQL Enterprise</i></a> on <a href="http://jcole.us/blog/">Jeremy Cole's blog</a>. After all, it is clear that MySQL's bottom line (although profitable) makes no difference to Sun. They are interested in access to the over 10 million users of MySQL to sell services and hardware, those things that Sun already does well.<br /><br />It is the MySQL's task to expand the user base, <b>not endanger it</b>. So I think we will see a change of strategy in the coming weeks and months.<br /><br />And I can add the following: from what I have seen of it, MySQL's enterprise offering is really a great package without having to add a proprietary version of the server. It has everything a serious user of MySQL wants: 24 hour support, monitoring tools, design tools, service packs and priority bug fixing. And with Sun's backing, nobody doubts anymore that they can deliver this service.<br /><br />Jonathan and Rich clearly demonstrated their support for the MySQL community by coming to the dinner. Besides clearing up some important questions, it was a great photo op.:<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.prmebase.org/download/paul_and_jonathan.jpg"><img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;width: 320px;" src="http://www.primebase.org/download/paul_and_jonathan.jpg" border="0" alt="" /></a><br /><br />You may have seen this photo already on <a href="http://blog.arabx.com.au/">Ronald's blog</a>. The picture is of Jonathan and I with the <a href="http://www.primebase.org">PrimBase Technologies</a> conference T-shirt. If you look closely you will see another little detail. I have a dolphin in my pocket! I wonder if that has any symbolic meaning...<br /><br />Jonathan tells a great story on his <a href="http://blogs.sun.com/jonathan/entry/freedom_s_choice">blog</a>. But what is significant is the picture of <a href="http://monty-says.blogspot.com">Monty</a> he posted, who is wearing a shirt that says "my free software runs your company". We have every reason to believe Jonathan fully supports this sentiment. So note that the T-shirt does not say "my partially free software ..."!<br /><br />Oh, and in the <a href="http://farm3.static.flickr.com/2416/2414423398_2711e9df3f.jpg?v=0">picture of Monty</a>, do you recognize the shirt of the person standing next to him? Since I generally only wear a shirt once, we know that this picture was also taking at the Community Dinner.Paul McCullaghhttp://www.blogger.com/profile/10476275147692948727noreply@blogger.comtag:blogger.com,1999:blog-24359421.post-80948465122675857062008-04-11T11:59:00.000+02:002008-04-11T11:58:21.103+02:00BLOB Streaming presentation at the MySQL ConferenceMy <a href="http://en.oreilly.com/mysql2008/public/schedule/detail/362">presentation</a> on BLOB Streaming at the MySQL Conference next week will be very practical.<br /><br />I have made quite a few graphics to show how it works, and plan to demonstrate the current version of the <a href="http://www.blobstreaming.org/">BLOB Streaming engine</a>.<br /><br />"To BLOB or not to BLOB?" is a common question in the database world. There are advantages and disadvantages to both sides. I'll be explaining why I believe that the "BLOB Repository" (a central component of the BLOB Streaming Architecture) combines the advantages of both approaches.<br /><br />Check it out:<br /><br /><div style="text-align: center;"><span style="font-style: italic; font-weight: bold;">An Introduction to BLOB Streaming for MySQL Project</span><br />3:05pm - 3:50pm Wednesday, 04/16/2008<br />Ballroom A<br /></div>Paul McCullaghhttp://www.blogger.com/profile/10476275147692948727noreply@blogger.comtag:blogger.com,1999:blog-24359421.post-44194671416773943462008-04-08T13:56:00.001+02:002008-04-11T11:54:11.874+02:00Replication is dead, long live Replication!Brian Aker has found general agreement with his post: "<a href="http://krow.livejournal.com/590912.html">The Death of Read Replication</a>".<br /><br />Arjen Lentz says "<a href="http://arjen-lentz.livejournal.com/105951.html">I think Brian is right...</a>", and Frank Mash confirmed: "<a href="http://mysqldatabaseadministration.blogspot.com/2008/04/is-read-replication-really-dying-in.html">what Brian says about replication, caching and memcached is very true</a>".<br /><br />Just like <b><i>Video killed the Radio Star</i></b> it looks like maybe <b><i>Memcached killed the Replication Hierarchy</i></b>!<br /><br />But of course, Brian and others are talking about <b>replication for scaling reads</b>.<br /><br />In my <a href="http://en.oreilly.com/mysql2008/public/schedule/detail/361">session on PBXT</a> next week at the conference I will be talking about how we plan to use <b>synchronous replication</b> to produce an <a href="http://en.wikipedia.org/wiki/High_availability">HA</a> solution for MySQL at the engine level.<br /><br />I will also discuss how some flexibility in the PBXT architecture makes it possible to actually scale writes efficiently as mentioned by Arjen <a href="http://arjen-lentz.livejournal.com/105951.html">in his blog</a>.<br /><br />So don't miss it:<br /><br /><div style="text-align: center;"><span style="font-style: italic; font-weight: bold;">Inside the PBXT Storage Engine</span><br />10:50am - 11:50am Thursday, 04/17/2008<br />Ballroom G<br /></div>Paul McCullaghhttp://www.blogger.com/profile/10476275147692948727noreply@blogger.comtag:blogger.com,1999:blog-24359421.post-71926553657370838992008-04-02T11:41:00.008+02:002008-04-02T16:38:45.481+02:00Welcome Ronald! Great to have you on board!If you've been following his <a href="http://blog.arabx.com.au/?p=975">blog</a>, then you will already know that Ronald Bradford has joined <a href="http://www.primebase.org">PrimeBase Technologies</a>. We are very pleased to have him on board! As many know, Ronald has always been very active in the MySQL community as far as his job has made this possible.<br /><br />Ironically during his time at MySQL he was less present in the community than before. When we discussed our plans for PrimeBase with him, Ronald was interested because it was an opportunity to return to a more active role in the community. I am very glad that this motivation was understood by almost everyone at MySQL and we are all looking forward to seeing and hearing more from Ronald.<br /><br />But, of course, Ronald is not "just a pretty face" ;) He will be helping us to design and specify our open source products (including <a href="http://www.blobstreaming.org">Blob Streaming</a>). Ronald's extensive experience with both MySQL and end-users will contribute significantly to what we produce.<br /><br />Ronald will also be helping us to refine our business model. We want all PrimeBase software to be open and free, so we've been thinking hard about how we can make this possible. All this makes it a very exciting time for us, and we will be talking more about of our plans in the days and weeks to come.<br /><br />Of course, <a href="http://en.oreilly.com/mysql2008/public/schedule/speaker/148">Ronald</a> and <a href="http://en.oreilly.com/mysql2008/public/schedule/speaker/517">I will</a> be at the MySQL conference, so be sure to look us up!Paul McCullaghhttp://www.blogger.com/profile/10476275147692948727noreply@blogger.comtag:blogger.com,1999:blog-24359421.post-45955407567609314542008-03-14T00:40:00.000+01:002008-03-14T12:39:23.271+01:00New version and a new home for PBXT!I have just released the first fully durable version of PBXT. Because of the amount of new code I have reverted PBXT to Alpha status. This version, 1.0-alpha, can be downloaded from: <a href="http://www.primebase.org/download">http://www.primebase.org/download</a>.<br /><br />Oh, which reminds me: PBXT now has a <span style="font-weight: bold;">new home</span> at <a style="font-weight: bold;" href="http://www.primebase.org/">http://www.primebase.org</a>, so take a look around! I have actually found a bit of time to write some <a href="http://www.primebase.org/documentation">documentation</a>. Right now the documentation describes building, installation, and the PBXT system parameters. Future additions will include information on performance tuning and a road map for PBXT development.<br /><br />But there is more to the new home than just a new web-site. The PBXT project is now owned and funded by PrimeBase Technologies, an open source software development company. So altogether this is a important step forward on the road to my goal which is to make PBXT a significant contribution to the MySQL community and business/eco-system.<br /><br />Besides full durability, the latest release includes the following improvements:<br /><ul><li>Calculation of <span style="font-weight: bold;">index statistics</span> as required by the optimizer (execute <span style="font-family:courier new;">FLUSH TABLES</span> to refresh the statistics).</li><li><span style="font-weight: bold;">New system variables</span>: <span style="font-family:courier new;">pbxt_log_cache_size</span>, <span style="font-family:courier new;">pbxt_log_file_threshold</span>, <span style="font-family:courier new;">pbxt_transaction_buffer_size</span> and <span style="font-family:courier new;">pbxt_checkpoint_frequency</span> (details <a href="http://www.primebase.org/documentation#sysvar">here</a>).<br /></li><li>Implementation of <span style="font-weight: bold;font-family:courier new;" >SELECT FOR UPDATE</span>, which performs row-level locking to prevent concurrent updates.</li><li><span style="font-weight: bold;">Group commit</span>: increases update throughput by committing multiple transactions concurrently.</li><li>Support for <span style="font-weight: bold;font-family:courier new;" >SHOW ENGINE PBXT STATUS</span>, which displays information about memory usage.</li></ul>What this release does not have is an option to relax durability. The transaction log is always flushed on commit. I plan to add a system parameter shortly that will allow you, in the spirit of the original version, to trade performance for durability if this suites your application.<br /><br />Even better would be to be able to specify this per table. Now if only MySQL would allow engines to specify custom table attributes...Paul McCullaghhttp://www.blogger.com/profile/10476275147692948727noreply@blogger.comtag:blogger.com,1999:blog-24359421.post-22289669914756413562008-02-08T15:24:00.001+01:002008-06-13T14:41:48.222+02:00PBXT & DBT2: Dubugging C/C++ 101Yesterday I starting testing PBXT using the DBT2 benchmark. Following the implementation of durability and SELECT FOR UPDATE for the engine I was more interested in the benchmark as a test for stability and concurrency than performance. I was not disappointed...<br /><br /><span style="font-weight: bold;font-size:100%;" >Which bug first?</span><br /><br />Well I immediately ran into 3 bugs. Isn't it funny how bugs often come in batches, which leaves you thinking: "Oh sh.. where do I start?". Here's my advice: start with the bug that is most likely to disappear if you fix the others!<br /><br />A simple example, you have 2 bugs: an unexpected exception is occurring, and you're loosing memory. First look for the memory loss, because it may disappear when you fix the exception (because you may be loosing memory in the error handler).<br /><br /><span style="font-weight: bold;">Take things one problem at time:</span><br /><br />Another thing: once you have decided for one of the bugs, stick with it (no matter how hard it gets) to the bitter end! Thrashing around will build frustration!<br /><br />So what happened with the DBT2 test? I started the test and immediately noticed that the engine was throwing "duplicate key" errors (it was too much to hope that this behavior was intended). Next I hit an assertion that claimed that a semaphore was not initialized (but I knew the semaphore <span style="font-weight: bold;">was</span> initialized). Finally, on restart after the assertion failed the engine crashed on recovery, in the clib memory manager (not a good sign!).<br /><br />So were to start? Taking my own advice I quickly secured the state of the database before the restart, and confirmed that I could repeat the restart crash. So that one could wait for later.<br /><br />The duplicate key error seemed be a fairly stable repeat, so I took a closer look at the semaphore problem. Here I noticed that the assertion was failing because the check bytes that indicate that the semaphore was initialized had been overwritten, not a happy situation!<br /><br /><span style="font-weight: bold;">Make the bug quick and easy to repeat:</span><br /><br />This bug was also difficult to repeat, I had to restore a fresh environment to get it to repeat consistently. So this is where I started.<br /><br />But before we go on: make sure, in such a situation, that you can repeat the bug as quickly and easily as possible. Eliminate as many manual steps as you can, it will save time in the long run. For example, in this case I wrote a line of shell commands to delete and copy in the database to provide the correct starting point for repeating the bug.<br /><br /><span style="font-weight: bold;">Check your last bug fix first!</span><br /><br />Unfortunately this bug turned out to be the result of a short laps of concentration during my last bug fix. But I did not notice the error during my testing of the big fix and so I moved on to DBT2. When the error occurred during the DBT2 test, I did not relate the problem to my last bug fix.<br /><br />If I had, I would have found the problem quick enough by a simple code read of the bug fix again. This has happened to me before, so my advice is: check your last bug fix, even when the new error does not seem to be related!<br /><br /><span style="font-weight: bold;font-size:130%;" >Debugging C/C++ 101, 3 lessons:</span><br /><br />Conveniently for my little refresher course, each of the 3 bugs proved to touch on a different aspect of C/C++ debugging:<br /><br /><span style="font-weight: bold;">Bug 1. Using an uninitialized pointer.</span><br /><br />For goodness sake, if you suspect this, then compile you program with optimization on, and the warning for "uninitialized variables" enabled. I didn't do this, and I may have saved myself a lot of time. Anyway, this does not always work (for example if you used '&amp;'). Unfortunately, if the compiler does not help, there is no easy way to find these bugs.<br /><br /><span style="font-weight: bold; font-style: italic;">Debugging method: Probing</span><br /><br />I call the method I used to find this error "probing". The idea is to write a special piece of check code which tests for the memory overwrite. The semaphore that was being overwritten was not global, but I added some code when it was initialized to set a global pointer to the semaphore. Then I wrote a little check (or probe) function which tested to see if the check bytes were still OK.<br /><br />Next I spread calls to the check function around my program, trying to close in on the point were things go wrong. When doing this you have 2 difficulties to deal with:<br /><br />Finding the right thread - If you are probing the wrong thread, then you get very miss-leading results. For example, I started by adding the probes to the engine API functions. When the probe was failing on the entry point it took me a while to realize that the problem must be in the PBXT background threads. So, using the probe try to first isolate the thread(s) that are causing the corruption.<br /><br />This meant removing all probes from the engine API functions and placing them in the background threads, starting with the main loops. Then by elimination I managed to narrow the problem down to one particular thread.<br /><br />Dealing with disappearing repeatability - The problem with this kind of bug is that it is really shy! As soon as you start to probe it, it disappears. I mean the changes made to the program change the executions so that the bug does not repeat.<br /><br />At this stage it is very tempting to leave the debug statements in the code and declare the bug as fixed! But, alas, the bug is still there, it has just moved on to overwrite some other part of memory in some quite little corner.<br /><br />Here I can give the following advice: When the bug disappears always return to the last repeat point and try taking smaller steps this time.<br /><br />Another thing: approach the corruption point from the bottom. By this I mean, close the probe in from a point after the corruption has occurred. This is because if the corruption is due an uninitialized stack value, then as you move the probe towards the corruption point from the top, the probe disturbs the state of the stack.<br /><br />As I mentioned above, when I found this bug it turned out to be a result of the last bug fix, bummer :(<br /><br /><span style="font-weight: bold;">Bug 2. Overwriting memory.</span><br /><br />Next I decided to look at the crash on startup. This bug caused a crash in the memory manager. This is most often due to a memory overwrite which has wiped out some of the management data stored per block by the memory manager.<br /><br />Fortunately I have to right tools to deal with this problem.<br /><br /><span style="font-weight: bold; font-style: italic;">Debugging method: Scanning Memory</span><br /><br />Its like "don't leave home without it": don't start C/C++ program without a debug memory manager that does at least the following:<br /><ul><li>Adds and checks headers and footers on every block of memory allocated.</li><li>Wipes out blocks that are freed (for example set all bytes to 0xFE).</li><li>Always moves a block of memory that is reallocated.</li><li>Records every block allocated, and notes the line number and file where allocated.</li><li>Checks on shutdown that all memory has been freed, and reports blocks not freed.</li></ul>This can also be done for objects allocated in C++ by overriding the delete and new operators.<br /><br />Now using the fact that I have a list of all allocated pointers I have implemented a function which scans all allocated pointers and checks the headers and footers to tell me if anything is corrupted.<br /><br />So to find the recovery crash I added the scan call to some of the loops that do recovery and soon managed to narrow things down and find the point were the corruption was occurring.<br /><br />Note that with this method it may not even be necessary to do such a search. One call to the scan routine tells me which block has been corrupted. If it was a simple overwrite of the end of the block, then my debug memory manager will tells me which block it was, and were it was allocated. This may be enough to find the problem.<br /><br />In my case it turned out that I had taken a pointer to a block and then some sub-function reallocated the block. But this is why it is so important that the debug memory manager always moves blocks on realloc(). If it had not done this, I probably never would have noticed the bug until it happened in some production situation (ugh!).<br /><br /><span style="font-weight: bold;">Bug 3. Concurrency problems.</span><br /><br />I was worried about the 3rd bug which was causing an unexpected "duplicate key" error, because I was afraid it might be a conceptual problems. This fear stems from the fact that there are indeed serious conceptual problems involving MVCC and SELECT FOR UPDATE (which requires locking), but fortunately, my fear was unfounded, and it turned out to be just a normal bug, whew!<br /><br />I new this bug must be related to concurrency because I had tested all aspects of the row level locking in a simple controlled environment.<br /><br /><span style="font-weight: bold; font-style: italic;">Debugging method: Trace it</span><br /><br />The way to find concurrency problems is to trace them. The better your trace, the easier it is to find the bug. I think it is almost impossible to find a concurrency bug just using the debugger (unless you have a deadlock, for example). This is because in the debugger you only have a snapshot of the situation, and you don't see the interaction between the threads.<br /><br />In my case the duplicate key error turned out to be the result of a SELECT FOR UPDATE that failed earlier.<br /><br />There is, of course a big problem with tracing and concurrency. Sometimes the error is robust, and sticks around while you bombard it with printf() statements. Mine was more of the shy type where the timing was really critical, and it disappeared when I added print statements.<br /><br />In this case, I also have the right tool for the job. It is a trace function which records the information only in RAM, in one huge block of memory which rolls over if necessary. It is worth also taking the bit of extra time to make the trace function handle printf() type syntax, so that it is as easy to using as printf() itself.<br /><br />What I did was I set a breakpoint at the spot in my code where the duplicate key error is generated. At this point in the code, I built in a call to my "trace dump" function. This function dumps the trace information which I have collected so far to a file.<br /><br />Then I can examine the trace to find out how I got to this point, and I can also use the debugger to examine the threads and find other information I need.<br /><br />Now some advice on trace code:<br /><ul><li>Firstly, never build in trace code that you think you might need! This is a waste of time if it is never used, and it clutters the code unnecessarily.</li><li>Secondly, when faced with a problem that needs to be traced, do not waste to much time or thought trying to guess what information you will need. Initially, just get something out there. Examining the trace is the best way to decide what information is missing.</li><li>Finally, when you are done, and you have found your bug, you will probably feel quite attached to you new trace code and not want to part with it. Don't worry, I understand, and I am not going now tell you that you have to remove it :) Well, not all of it, anyway.</li></ul>Really, you have to be very critical and decide what parts of the trace are good in general, and what parts helped you just find this bug. That stuff must go. And the other stuff: take a bit of time to clean it up, and make sure it can only be enabled in debug mode! Ever have to recall a version because you forgot a trace in it? Hmmm...<br /><br />Well in the end I was very happy with my trace code. It allowed me to pinpoint the bug in SELECT FOR UPDATE, and I added a GOTCHA to my code which you can search for as soon as I get this version released (soon I hope).<br /><br />OK, so this has been quite a long post, thanks for sticking with me :)Paul McCullaghhttp://www.blogger.com/profile/10476275147692948727noreply@blogger.comtag:blogger.com,1999:blog-24359421.post-62662214316093493582008-01-17T11:25:00.000+01:002008-01-17T11:30:51.530+01:00Good move, congratulations MySQL and Sun!Its already a day old, but the news is as hot as ever. Sun will acquire MySQL before the end of the year.<br /><br /><span style="font-weight: bold;">Congratulation to MySQL and Sun!</span><br /><br />And well done to all who were involved in making this deal, in particular, those I know personally: Marten, Monty, David, Zack and Kaj!<br /><br />As I mentioned to Kaj, I am sure that MySQL has a very bright future under the wings of Sun. A deal for $1 billion made in 5 weeks can only mean both sides are <span style="font-weight: bold;">extremely</span> motivated to make it work.<br /><br />I have just 3 concerns:<br /><ul><li>I hope that the MySQL web-site will not disappear into the Sun web-site like the proverbial needle in a haystack! Sun's download page alone is as big as the MySQL web-site ;) I would like to see a mysql.sun.com, where we can find our way around easily.<br /><br /></li><li>And the second is similar to the first but relates to the people. There is a massive difference between dealing with a company that has 400 employees, and one with 34000! I hope that this deal will not affect the access we have to the decision makers, the community support team and the developers.<br /><br /></li><li>Thirdly, Sun wants to sell MySQL to enterprise customers, but I want to be involved in a database that is there for everyone. I am concerned that the non-paying customers, which is a large part of the community, may become neglected.</li></ul>I know that Kaj and many others at MySQL will be working hard to alleviate these concerns, so thanks in advance. Time will tell how successful they are. But they can be sure, <span style="font-weight: bold;">in this quest they have my full support</span>!Paul McCullaghhttp://www.blogger.com/profile/10476275147692948727noreply@blogger.comtag:blogger.com,1999:blog-24359421.post-85662951756116332262007-12-20T09:54:00.000+01:002007-12-20T09:53:51.779+01:00Making PBXT Fully DurableUntil now PBXT has been ACId (with a lower-case d). This is soon to change as I have had some weeks to work on a fully durable version of the transactional engine (<a href="http://www.primebase.com/xt">http://www.primebase.com/xt</a>).<br /><br />My first concern in making PBXT fully durable was to what extent I would have to abandon the original "write-once" design. While there are a number of ways to implement durability, the only method used by databases (as far as I know) is the write-ahead log.<br /><br />The obvious advantage of this method is that all changes can be flushed at once. However, this requires that all data be written twice: once to the log and after that, to the database itself.<br /><br />My solution to this problem is a compromise, but I think it is a good one. In a nutshell: short records are written twice, and long records are written once. When it comes to durability, this compromise, I believe, is a good one.<br /><br />If a transaction writes only short records, then one flush will suffice to commit it. Because the records are short, contention on the write-ahead log is at a minimum. If a transaction writes any long records, most of the data will be written once to a data log (as opposed to a transaction log). Contention for writing on the data log is zero (because each writer has its own data log), but two flushes are required to commit the transaction.<br /><br />By doing this I have saved other transactions having to wait while a certain transaction copies a large amount of data to the transaction log. Although the transaction log uses a double buffering system, this will still cause a hold up.<br /><br />In summary: if you have a transaction which writes large records, then it will basically just hold up itself, and not everybody else.<br /><br />Another innovation I have introduced to reduce contention on the transaction log is an "operation sequence number".<br /><br />Normally operations must be synchronized on the transaction log to ensure consistency. For example, the allocation of a block must be written to the log before usage. But this means all threads need to lock the transaction log when performing an operation.<br /><br />Instead of doing this, I issue a unique sequence number for each operation done on a table. The operations are then written to the log in batches without concern about the order.<br /><br />The process that then applies the changes in the log to the database sorts the operations by sequence number before they are applied. This is also done on restart, during recovery.Paul McCullaghhttp://www.blogger.com/profile/10476275147692948727noreply@blogger.comtag:blogger.com,1999:blog-24359421.post-33140117390445937452007-11-08T11:04:00.000+01:002007-11-08T11:04:43.647+01:00BLOB Streaming presentation at the Hamburg MySQL User GroupI have just posted the presentation that I gave at the Hamburg MySQL User Group last Tuesday. You can download the presentation <a href="http://www.blobstreaming.org/download/mysql_ug_nov_2007.pdf">here</a>.<br /><br />I have added a few slides on advanced topics: backup, replication and the distributed repository, which did not actually make it into my talk. However, these topics came up in the discussion over a few drinks afterwards.<br /><br />Thanks to <a href="http://lenz.homelinux.org/">Lenz</a> for the opportunity to present the <a href="http://www.blobstreaming.org/">BLOB Streaming Project</a> and to those that were there for the good feedback.<br /><br />As Lenz said, it was a "pretty technical crowd". For example, it did not go unnoticed that a denial of service attack could be launched by a malicious client, that establishes many upload connections that fill up the server's file system. Although unreferenced BLOBs of this type are deleted from the repository after 5 minutes, this is still a serious threat for anyone that exposes the MyBS HTTP port to the internet.<br /><br />To prevent this it may be necessary to limit upload to clients with specific IP addresses (which could be specified by a system variable). Lenz suggested using HTTP-based authentication such as <a href="http://en.wikipedia.org/wiki/Digest_access_authentication">digest access authentification</a>. Any other ideas would be welcome.<br /><br />Another question was whether a BLOB could be deleted while it is being downloaded from the repository. Although BLOBs are not locked while they are downloaded, I have just realized that this is not a problem. The BLOB remains in the repository after deletion until the compactor thread removes it by deleting a repository file that contains the BLOB. And this is only done once all readers have release the repository file.<br /><br />I have submitted this talk under the heading <span style="font-style: italic;">An Introduction to BLOB Streaming for MySQL Project</span> as a proposal for the <a href="http://en.oreilly.com/mysql2008/public/content/home">MySQL Conference &amp; Expo 2008</a>. And, if it is approved I will also be presenting <span style="font-style: italic;">Inside the PBXT Storage Engine</span> at the conference. <a href="http://blog.arabx.com.au/?p=870">Ronald</a> mentioned that there have been nearly 300 submissions so I will be quite lucky to get both talks approved! :)Paul McCullaghhttp://www.blogger.com/profile/10476275147692948727noreply@blogger.comtag:blogger.com,1999:blog-24359421.post-79446995690322418052007-10-19T10:20:00.000+02:002007-10-19T10:21:27.202+02:00New PBXT/MyBS release enables JDBC-based BLOB streaming!This is quite a milestone for me! At last it possible to actually do some practical work with the BLOB streaming engine (MyBS)!<br /><br />For this release I have completed changes to the MySQL Connector/J 5.0.7, to allow BLOB data to be transparently stored and retrieved from the MyBS BLOB repository. The new version of the driver is called MySQL Connector/J SE (streaming enabled).<br /><br />Uploading a BLOB is as simple as using <span style="font-family:courier new;">setBinaryStream()</span> or <span style="font-family:courier new;">setBlob()</span> on <span style="font-family:courier new;">INSERT</span> or <span style="font-family:courier new;">UPDATE</span>. By using <span style="font-family:courier new;">getBinaryStream()</span> or <span style="font-family:courier new;">getBlob()</span> after a <span style="font-family:courier new;">SELECT</span> you get direct access to the data stream coming from the repository. More information and some examples are provided in the documentation at: <a href="http://www.blobstreaming.org/documentation">http://www.blobstreaming.org/documentation</a>.<br /><br />To try this out you need to install the latest versions of PBXT and MyBS. Both are available from: <a href="http://www.blobstreaming.org/download">http://www.blobstreaming.org/download</a>.<br /><br />Binary versions of the storage engines are also available for MySQL 5.1.22 running on 32-bit Linux and x86 Mac OS X. The modified version of the JDBC source code is included in the MyBS source code distribution, but the driver can also be downloaded <a href="http://www.blobstreaming.org/download/mysql-connector-java-5.0.7se-bin.jar">here</a>.<br /><br />I have included a small test program, TestJDBC.java, as part of the JDBC driver. So once you have installed the engines, you can test BLOB streaming as follows:<br /><br /><span style="font-family:courier new;">java -cp mysql-connector-java-5.0.7se-bin.jar TestJDBC</span><br /><br />TestJDBC connects to a local MySQL server, creates a PBXT table and tests <span style="font-family:courier new;">INSERT</span> and <span style="font-family:courier new;">SELECT</span> of rows containing BLOBs. The program also serves as an example of how to do BLOB streaming with JDBC, but this is all pretty much standard stuff.<br /><br />To get started quickly, the most important things to note are:<br /><ul><li>Set <span style="font-family:courier new;">EnableBlobStreaming=true</span> in your JDBC connection URL.</li><li>Streamable BLOBs can only be stored in <span style="font-family:courier new;">LONGBLOB</span> columns in PBXT tables.</li><li>Use <span style="font-family:courier new;">setBinaryStream()</span>, <span style="font-family:courier new;">setAsciiStream()</span> or <span style="font-family:courier new;">setBlob()</span> and specify the length to upload a BLOB.</li></ul>A streamable BLOB is a BLOB where the data is stored in the MyBS BLOB repository and a reference is inserted into the row. If you use <span style="font-family:courier new;">setBinaryStream()</span> on <span style="font-family:courier new;">INSERT</span>, for example, but specify a length of <span style="font-family:courier new;">-1</span>, then the JDBC driver reverts to the default (non-streaming enabled) behavior which is to store the data directly in the table. The data will be returned correctly on <span style="font-family:courier new;">SELECT</span>, but is not streamable.<br /><br />As usual, any comments, questions or bug reports can be sent directly to me: paul-dot-mccullagh-at-primebase-dot-com. Make sure you put the word PBXT or MyBS in the e-mail title to make it through my spam filter! :)Paul McCullaghhttp://www.blogger.com/profile/10476275147692948727noreply@blogger.comtag:blogger.com,1999:blog-24359421.post-3026660246624550102007-09-25T11:16:00.000+02:002007-09-25T11:26:07.035+02:00PBXT & MyBS at the MySQL Developer Meeting in HeidelbergI was glad to have the opportunity to join the MySQL developers in <a href="http://www.planetmysql.org/kaj/?p=87">Heidelberg</a> for a few days, so thanks to MySQL for the invitation. In between great food, quite a few beers and a number of boat trips we managed to get a significant amount of work done!<br /><br />In what could be considered a follow-up to the engine summit at Google following the MySQL User's conference, I joined Calvin Sun, <a href="http://krow.livejournal.com">Brian Aker</a>, Jeffrey Pugh, <a href="http://www.mysql.com/company/management.html">Monty</a> and others from MySQL and the engine developing community to discuss things concerning storage engines.<br /><br />One of the main topics of the meeting was features and other changes to the MySQL front-end as required by the engines. Some of the requirements (such as an interface to the MySQL optimizer) would really require huge changes, but most agree that freely defined attributes on tables, columns and indexes would be very useful (and relatively easy to implement). Monty would like error handling on the commit call to be added ASAP, but Jeffrey said that's a feature, not a bug fix, and so for MySQL 5.1 it's a no-go. It will be interesting to see who wins that one! I would also like to have engine defined, custom data types. My most pressing problem: how can I indicate that a BLOB column is streamable?<br /><br />I presented the ideas behind the BLOB Streaming engine to the connector developers and we discussed how the PHP and JDBC connectors could be extended to support BLOB Streaming. Mark Matthews, responsible for JDBC, showed me the spot in the code where the ResultSet would handle a MyBS data stream. Mark also pointed out that JDBC will need to upload a BLOB without specifying which table it would be going into or the JDBC driver will have to parse the SQL statement. Hmm, ... I should have realized this before!?<br /><br />I am also looking forward to discussing things further with <a href="http://www.hristov.com/andrey/">Andrey Hristov</a>, developer of the mysqlnd PHP Connector, after he has tried out the new engine. Making the BLOB streaming functionality easily available to PHP developers will be a great step forward.<br /><br />I was also glad to be able to meet with <a href="http://mysqlmusings.blogspot.com/">Mats Kindahl</a> whose experience on the MySQL replication team is very useful to the BLOB streaming project. His main concern is to maintain the flexibility of the system as he points out in his <a href="http://mysqlmusings.blogspot.com/2007/06/blob-locators-blob-streaming.html">blog</a>. He suggested a more loosely coupled system, for example to use database triggers instead of the MyBS server-side API calls. While flexibility is important, I want to avoid too many moving parts, and make sure that the basic setup is simple. We both agreed that an embedded scripting language (ala MySQL proxy) may be a good compromise.<br /><br />In a bit of time between sessions <a href="http://www.flamingspork.com/blog/">Stewart Smith</a> and I took a look at adding the BLOB streaming functionality into the NDB cluster engine. We didn't get all that far with our quick hack, but we both saw that it could be done relatively easily. The potential for the combination of MySQL cluster and BLOB streaming is huge.<br /><br />Altogether it is very helpful to any developer in the community to have such concentrated access to the MySQL developers as is possible at the internal developer's conference. This is a great offer on the part of MySQL, and I can only imagine that they will have to continue to limit the number of external developers that can be accommodated at these meetings.<br /><br />So my recommendation: try to book a ticket as early as possible for next year!Paul McCullaghhttp://www.blogger.com/profile/10476275147692948727noreply@blogger.comtag:blogger.com,1999:blog-24359421.post-33558593790304683052007-08-28T12:06:00.000+02:002007-08-28T12:06:36.410+02:00MySQL Camp: a Secret Tip?Where can you get access to some of the most informed people from MySQL and the community, for free?<br /><br />The answer: at <a href="http://mysqlcamp.org/">MySQL Camp</a>. And then throw in lunch and breakfast for free, being able to influence the session topics and you have quite a package deal.<br /><br />So it is strange why so few people took up the offer in New York this year!?<br /><br />My talk was about the BLOB Streaming engine, MyBS, and I have posted the slides: <a href="http://www.blobstreaming.org/download/mybs-camp2007.pdf">Presentation - MySQL Camp 2007: The BLOB Streaming Project</a>.<br /><br />OK, so I got pretty much ragged about the name, MyBS. Why, I was asked, did I name it that? Jay, even suggested a session to find a new name for the engine! Thanks, Jay, very considerate of you... :)<br /><br />But it was quite unnecessary, because I really can't see what the problem is. I think the name is cool. Uhm, totally ... cool.Paul McCullaghhttp://www.blogger.com/profile/10476275147692948727noreply@blogger.comtag:blogger.com,1999:blog-24359421.post-8025136093204767692007-07-27T11:40:00.000+02:002007-07-27T11:40:08.186+02:00BLOB streaming engine (MyBS), version 0.5 Alpha released!With some effort just before my holiday, I have managed to complete the release of the next version of MyBS, the BLOB streaming engine for MySQL.<br /><br />This version includes all the basic functionality required to stream BLOB data in and out of MySQL tables.<br /><br />The main features are:<br /><ul><li><i>Uploading</i> of BLOB data directly into the database using HTTP <span style="font-weight: bold;">PUT</span> or <span style="font-weight: bold;">GET</span> methods.</li><br /><li>Downloaded of BLOB data directly from the database using HTTP <span style="font-weight: bold;">GET</span>.</li><br /><li>BLOB size may exceed 4GB - theoretical BLOB size limit of <span style="font-style: italic;">256 Terabytes</span>.</li><br /><li>BLOBs are stored in a <span style="font-style: italic;">repository</span> which manages references from other storage engine tables.</li><br /><li>BLOBs are referenced by a <span style="font-weight: bold;">URL</span>.</li><br /><li>URLs referencing BLOBs in the repository have a unique <span style="font-style: italic;">access code</span>, for security.</li><br /><li>The theoretical maximum repository size is <span style="font-style: italic;">4 Zettabytes</span> (2^72 bytes) per database.</li><br /><li>The <span style="font-style: italic;">server-side streaming API</span> allows any storage engine to store BLOB data in the repository.</li><br /><li>MyBS <span style="font-style: italic;">system tables</span> provide a view of the BLOBs and associated references in the repository.</li></ul>MyBS works together with the PBXT transactional storage engine, version 0.9.88, which supports the MyBS streaming API. Both engines can be downloaded from: <a href="http://www.blobstreaming.org/download">http://www.blobstreaming.org/download</a>.<br /><br />Documentation for MyBS is also available. It includes details about all features so far, and some examples of use: <a href="http://www.blobstreaming.org/documentation">http://www.blobstreaming.org/documentation</a>.<br /><br />If you try out the new engine, I'd like to hear from you. Any comments, questions and bug reports can be sent directly to me.Paul McCullaghhttp://www.blogger.com/profile/10476275147692948727noreply@blogger.comtag:blogger.com,1999:blog-24359421.post-54327463730323822912007-07-17T10:46:00.000+02:002007-07-17T10:46:14.285+02:00The MyBS Engine and the BLOB RepositoryAfter some consideration I have decided to move the BLOB repository from PBXT to MyBS <span style="font-style: italic;">(§)</span>. This has the advantage that any engine that does not have its own BLOB repository (or is otherwise not suitable for storing large amounts of BLOB data) can reference BLOBs in the MyBS BLOB repository.<br /><br /><span style="font-style: italic;">(§) MyBS stands for "BLOB Streaming for MySQL". The BLOB Streaming engine is a new storage engine for MySQL which allows you to stream media data directly in and out of the database. More info at </span><a style="font-style: italic;" href="http://www.blobstreaming.org/">www.blobstreaming.org</a><span style="font-style: italic;">.</span><br /><br />Lets look at an example of this. Assume my standard example table:<br /><pre>CREATE TABLE notes_tab (<br /> n_id int PRIMARY KEY,<br /> n_text longblob<br />) ENGINE=PBXT;</pre>And assume we have a file called <tt>blob_eg.txt</tt> with the contents "This is a BLOB Streaming upload test".<br /><br />Firstly, I can upload a BLOB to the MyBS BLOB Repository using the HTTP PUT method:<br /><br /><tt>% curl -T blob_eg.txt http://localhost:8080/test/notes_tab<br />test/1-326-4891cdae-0</tt><br /><br />Here I uploaded a BLOB to the repository and specified the database, <tt>test</tt>, and the table, <tt>notes_tab</tt>. The URL returned, <tt>test/1-326-4891cdae-0</tt>, is the reference to the BLOB in the BLOB repository, returned by MyBS. Note that the BLOB is not yet in the table (to store the BLOB directly in the table, I would have to specify a column and a condition which identifies a particular row in the table).<br /><br />However, the BLOB is already stored in the database, and I can download as follows:<br /><br /><tt>% curl http://localhost:8080/test/1-326-4891cdae-0<br />This is a BLOB Streaming upload test</tt><br /><br />Since the BLOB is not yet referenced by a table, the MyBS BLOB repository sets a timer. If the BLOB is not <span style="font-style: italic;">retained</span> (reference count incremented) within a certain amount of time it is removed from the BLOB repository.<br /><br />To actually insert the BLOB into the table you just insert the BLOB reference, for example:<br /><br /><tt>mysql> insert notes_tab values (1, "test/1-326-4891cdae-0");</tt><br /><br />On the MySQL server the <tt>notes_tab</tt> table engine will call the MyBS engine (using the server-side BLOB Streaming API) and retain the <tt>test/1-326-4891cdae-0</tt> BLOB reference. So I can now download the BLOB by referencing the table, column and row as follows:<br /><br /><tt>% curl http://localhost:8080/test/notes_tab/n_text/n_id=1<br />This is a BLOB Streaming upload test</tt><br /><br />Note: this example will only work with MyBS 0.5 (<a href="http://www.blobstreaming.org/download">www.blobstreaming.org/download</a>) or later. Coming soon!Paul McCullaghhttp://www.blogger.com/profile/10476275147692948727noreply@blogger.comtag:blogger.com,1999:blog-24359421.post-35960744834309968272007-06-28T10:03:00.000+02:002007-06-28T10:28:22.029+02:00PBXT: Top 5 wishes of a Storage EngineIn response to Ronald's challenge in <a href="http://blog.arabx.com.au/?p=739">Top 5 wishes for MySQL</a>, here is my top 5 wish list. However, it make sense for me to put a slightly different spin on the top 5 series, and write from a storage engine developers perspective.<br /><br /><b>1. A generic engine test suite</b><br /><br />A set of <tt>mysql-test-run</tt> test scripts and results that are intended to be run by all engines. The tests will verify basic functionality and compatibility, and form the basis for an engine certification process.<br /><br /><b>2. Internal APIs</b><br /><br />PBXT already has to call into MySQL to open <tt>.frm</tt> files, and transform path and file names. The <a href="http://www.blobstreaming.org">BLOB Streaming</a> engine will need to access user privilege information. Other engines use the cross-platform functionality provided by <tt>mysys</tt>. What we need is a number of official, well-defined APIs to various MySQL internal functionality.<br /><br /><b>3. Customizable table and column attributes</b><br /><br />Specialized engines require specialized information. Right now, this information is being packed into table and column comments (hack, hack, ...).<br /><br /><b>4. Push-down restrict and join conditions</b><br /><br />This is a big one for engines in general. Many engines are being created that can do certain searches better than the MySQL query processor. However, for the optimizer to know whether to push down a condition or not will probably require a better performance metric.<br /><br /><b>5. Custom data types</b><br /><br />SQL-92 has the concept of a domain, which is basically a named data type. This could be used as the basis for custom data types provided by a storage engine, made available in the form of a new domain.<br /><br />And without numbering them, let me slip in a few more wishes. How about MySQL community project development hosted on <a href="http://forge.mysql.com/">MySQLForge</a>, complete with integration into the MySQL bug tracking system?! And I have heard that this may also be possible: PBXT and other GPL community engines on the MySQL Community distribution :)Paul McCullaghhttp://www.blogger.com/profile/10476275147692948727noreply@blogger.comtag:blogger.com,1999:blog-24359421.post-70588633268277841082007-06-26T13:14:00.000+02:002007-06-26T13:15:28.159+02:00First release of the BLOB Streaming engine for MySQLI have just released the first version of the BLOB Streaming engine for MySQL (MyBS). You can download the source code of the engine from <a href="http://www.blobstreaming.org/download">http://www.blobstreaming.org/download</a>. Pluggable binaries for MySQL 5.1.19 (32-bit Linux and Mac OS X) are also available.<br /><br />To install the plug-in copy <tt>libmybs.so</tt> to the /usr/local/mysql/lib/mysql directory, connect to your server using mysql, and enter:<br /><br /><tt>mysql> install plugin MyBS soname "libmybs.so";</tt><br /><br />This version allows you to download BLOBs that are already stored in the database using HTTP. The URL is specified as follows:<br /><br /><tt>http://mysql-host-name:8080/database/table/blob-column/condition</tt><br /><br />Where <tt>condition</tt> has the form: <tt>column1=value1&column2=value2&...</tt><br /><br />I gave an example of this in my previous blog: <a href="http://pbxt.blogspot.com/2007/06/geting-blob-out-of-database-with-blob.html">"GET"ing a BLOB from the database with the BLOB Streaming Engine</a><br /><br /><tt>8080</tt> is the default port, which can be set using the <tt>mybs_port</tt> system variable on the mysqld command line. For example: <tt>mysqld --mybs_port=8880</tt><br /><br />In order for BLOB streaming to work you also need PBXT version 0.9.87 which is <i>streaming enabled</i>. Streaming enabled simply means the engine supports the MyBS server-side streaming API.<br /><br />This version of PBXT is also available from <a href="http://www.blobstreaming.org/download">www.blobstreaming.org</a>, or from <a href="http://sourceforge.net/projects/pbxt">Sourceforge.net</a>.<br /><br />Note that this version is currently only for use behind the firewall because the HTTP access is unrestricted.<br /><br />The next step will be to enable the uploading of BLOBs using the HTTP PUT method, and the implementation of basic security.Paul McCullaghhttp://www.blogger.com/profile/10476275147692948727noreply@blogger.comtag:blogger.com,1999:blog-24359421.post-9346598542077053242007-06-05T13:04:00.000+02:002007-06-05T13:07:07.520+02:00"GET"ing a BLOB from the database with the BLOB Streaming EngineCurrent plans call for the use of the HTTP protocol to upload and retrieve BLOBs to and from the database. The BLOB Streaming Engine makes this possible by integrating a lightweight HTTP server directly into the MySQL server.<br /><br />I am currently working on an alpha implementation of this, and would like to give a short demonstration of what is possible using this system.<br /><br />We start by creating a table using any streaming enabled storage engine (a streaming enable storage engine, is an engine that supports the server-side streaming API):<br /><pre>use test;<br />CREATE TABLE notes_tab (<br /> n_id INTEGER PRIMARY KEY,<br /> n_text BLOB<br />) ENGINE=pbxt;<br />INSERT notes_tab VALUES (1, "This is a BLOB streaming test!");</pre><br />Now assuming the MySQL server is on the localhost, and the BLOB streaming engine has been set to port 8080, you can open your browser, and enter this URL:<br /><br /><tt>http://localhost:8080/test/notes_tab/n_text/n_id=1</tt><br /><br />With the following result:<br /><br /><img height=103 width=410 src="http://www.primebase.com/xt/download/streamed-blob.jpg"><br /><br />So without even doing a <code>SELECT</code>, you can <b>GET</b> a BLOB directly out of the database!<br /><br />Note that there is no need for the BLOB in the database to be explicitly "streamable" for this to work.Paul McCullaghhttp://www.blogger.com/profile/10476275147692948727noreply@blogger.comtag:blogger.com,1999:blog-24359421.post-36917732997724508492007-05-18T14:18:00.000+02:002007-05-18T14:18:02.390+02:00The Scalable BLOB Streaming discussion so farHaving discussed BLOB streaming for MySQL with a number of people I have gathered quite a bit of input on the subject.<br /><br />So here are some details of the issues raised so far:<br /><br /><b>Server-side API</b><br /><br />One of the first things that Brian Aker pointed out to me was that the server-side API must make it possible to use the sendfile() system call. The call does direct disk to network transfer and can speed up delivery of a BLOB stream significantly.<br /><br />The server-side API must also include a mechanism to inform the stream enabled engine that an upload is complete. Assuming the streaming protocol used is HTTP, there are 2 ways of determining the end of a download: either the connection is closed when download is complete, or the length of the BLOB data is specified in a HTTP header. <br /><br /><b>ODBC Issues</b><br /><br />Monty was concerned about compatibility with the MySQL ODBC driver. The ODBC function SQLPutData() is used to send BLOB data, and SQLGetData() is used to retrieve BLOB data. Both functions transfer data in chunks. The implementation of these functions would have to be made aware of streamable BLOB values.<br /><br />This can probably be done in a way that is transparent to the user. It may be necessary to designate certain database types as streaming BLOBs. One suggestion is to add 2 new types to MySQL. LONGBIN and LONGCHAR. These keywords are not yet used by MySQL.<br /><br />However, from the OBDC side the data types LONGVARBINARY and LONGVARCHAR are already mapped to the MySQL types BLOB and TEXT respectively. But this simply means that it would be transparent to an ODBC application whether a BLOB is being streamed, or retrieved using the standard client/server connection.<br /><br /><b>Upload Before or After INSERT</b><br /><br />There is some debate about how exactly a BLOB should be inserted. The 2 main possibilities are:<br /><br />1) Upload the BLOB. The upload returns a handle (or key). Perform the INSERT setting the BLOB column to the key value.<br /><br />2) Perform an INSERT, but specify in a CREATE_BLOB(<size>) function the size of the BLOB that is to be inserted into the column. The caller then SELECTs back the inserted data. In the BLOB column the caller finds a handle. The handle can then be used to upload the data.<br /><br />For method (1) it was also suggested that the client application specify a name, instead of using a handle returned by the server. The name could then be used as input to a "create_blob" function which is specified in the INSERT statement. The advantage of this is that the text of the insert statement is readable.<br /><br />One advantage of method (1), as pointed out by Monty, is that the client does not have to wait for the upload to complete before doing the INSERT. The server could wait for the upload to complete. This would require a client side generated identifier for each BLOB uploaded.<br /><br />Note that method (2) requires use of a transaction to prevent the row with the BLOB from being selected before the BLOB data has been uploaded.<br /><br /><b>Storage of BLOBs</b><br /><br />Stewart Smith suggested that the BLOBs be stored by the streaming engine itself. This is something to be considered, and probably will be necessary in most scale-out scenarios. Otherwise the current plan is that each engine that is streaming BLOB enabled (i.e. that supports the server-side streaming API) will store the BLOBs themselves.<br /><br /><b>Scaling Writes</b><br /><br />Also mentioned in regard to scaling is the fact that read scale-out is usually more important than write. However, there are web-sites that have a heavy write load. To be investigated would be how NBD could used in combination with the streaming engine to scale-out uploading of BLOBs.<br /><br /><b>mysqldump & mysql</b><br /><br />As pointed out by Monty, the output of mysqldump is a readable stream. So any changes to mysqldump in order to support streamable BLOBs should ensure that this continues to be the case. This is related to the question of how to upload BLOBs using the mysql client. One idea would be to provide a command in mysql to upload a file or a block of binary data or text.<br /><br /><b>Security</b><br /><br />In some cases it may be necessary to check whether a client is authorized to download or upload a BLOB. The most straight forward method would be for the server to issue authorization tokens which are submitted with an upload or download. These tokens expire after a certain amount of time, or when the associated server connection is closed. However, if the streaming protocol is to be HTTP then we need to investigate the possibility of using standard HTTP authentication when retrieving and sending BLOBs.<br /><br /><b>In-place Updates</b><br /><br />In-place updates of BLOBs is allowed by some SQL servers (for example by using functions UPDATETEXT and WRITETEXT supported by MS SQL Server). In this case the streaming enabled engine would be responsible for locking or using some other method to ensure that concurrent reads of the BLOB remain consistent.<br /><br /><b>PHP Upload</b><br /><br />Georg Richter, who is responsible for the PHP Native Driver at MySQL, noted that using BLOB streaming, data could be uploaded from the Web directly into the database. The PHP extension to the standard MySQL client software should include a function to make this possible.<br /><br /><b>ALTER TABLE</b><br /><br />Georg also pointed out that it should be possible to convert BLOBs currently stored in MySQL tables into streamable BLOBs using ALTER TABLE. This would enable developers to quickly try out streaming BLOBs and test the performance characteristics, in combination with their applications.<br /><br />These are the main issues raised so far. Any further ideas, comments and questions are welcome! :)Paul McCullaghhttp://www.blogger.com/profile/10476275147692948727noreply@blogger.comtag:blogger.com,1999:blog-24359421.post-58716979336992969732007-05-01T13:30:00.000+02:002007-05-02T12:52:24.683+02:00PBXT and the MySQL Conference & Expo 2007The conference is over, and it was a great week but pretty exhausting! From what I saw, and from what I have heard the sessions were of a very high standard this year. I have come back with a number of new ideas and quite a few things I would like to implement in PBXT. One of these is new <a href="http://forge.mysql.com/wiki/OnlineBackup">online backup API</a> for MySQL which will enable you to make a snapshot backup of data stored by all engines, and do <i>point in time</i> recovery.<br /><br />For me the week started off with the BoF on <a href="http://www.blobstreaming.org">scalable BLOB streaming</a> on Tuesday evening. The BoF was well attended and there was significant interest in the topic. I will be reporting on some of the issues discussed soon.<br /><br />On Wednesday morning I presented: <i>PrimeBase XT - Design and Implementation of a Transactional Storage Engine</i>. I was pleased with the number of questions, interest and feedback I received.<br /><br />When I mentioned to Mårten Mickos that someone had said it was the best session they had heard so far at the conference, Mårten told me that they pay guys to say this to the speakers so that they feel good! So I must thank MySQL for this very thoughtful gesture - joke, of course ;)<br /><br />Thanks also to <a href="http://www.sheeri.com">Sheeri Kritzer</a> for video recording the talk so I hope to get a link to that soon. In the meantime I have posted my slides to the PBXT home page:<br /><br /><a href="http://www.primebase.com/xt/download/pbxt_mysql_uc_2007.pdf">http://www.primebase.com/xt/download/pbxt_mysql_uc_2007.pdf</a><br /><br />After that I also took part in 2 lightning rounds sessions: <i>Top MySQL Community Contributors</i> and <i>State of the Partner Engines</i>. Both sessions were interesting in their diversity. One of the top contributors, Ask Bjørn Hansen, described how meeting his goal of filing a bug a week was a lot easier in the early days of MySQL. However, more recently he has been helped by the current state of the GUI tools!<br /><br />During the <i>Partner Engines</i> session, it came as a surprise to some people in the audience that not all engines are for free. Actually there is a wide spectrum from GPL over partially free to highly priced. In fact, one of the developers of a data-warehousing engine found the question as to whether it may be free quite amusing. Solid has not decided to what extent it high-availability offering will be commercial, and the Amazon S3 engine is free, but the service behind it not. So that's something to watch out for in general.<br /><br />I took the opportunity during the <i>Partner Engines</i> session to mention our plans for the future of PBXT which involve building a <i>scalable BLOB streaming infrastructure</i> for MySQL. This is relevant to a number engine developers as we will be providing an open system which will enable all engines to stream BLOBs directly in and out of the server. So please check out <a href="http://www.blobstreaming.org">blobstreaming.org</a> for more details.Paul McCullaghhttp://www.blogger.com/profile/10476275147692948727noreply@blogger.comtag:blogger.com,1999:blog-24359421.post-72713163574391292082007-04-19T14:01:00.000+02:002007-04-19T14:03:30.625+02:00What makes the design of PBXT similar to MyISAM?PBXT is a transactional storage engine, but what does the design have in common with MyISAM?<br /><br />I'll be answering this and other questions during my session at the MySQL Users Conference next week:<br /><br /><center><br /><a href="http://www.mysqlconf.com/cs/mysqluc2007/view/e_sess/10410">PrimeBase XT: Design and Implementation of a Transactional Storage Engine</a><br /><b>Date:</b> Wednesday, April 25<br /><b>Time:</b> 10:45am - 11:45am <br /><b>Location:</b> Ballroom F<br /></center ><br /><br />So be sure to check it out! :)Paul McCullaghhttp://www.blogger.com/profile/10476275147692948727noreply@blogger.comtag:blogger.com,1999:blog-24359421.post-91000600334878346162007-04-09T16:03:00.001+02:002007-04-09T16:04:42.647+02:00PBXT 0.9.86 the MySQL Conference 2007 release!I have just released PrimeBase XT 0.9.86 which will be my last release before the MySQL User Conference this year. The most significant change in this version is the reduction of the number of data logs used per table. This, and a number of other modifications, makes PBXT fit to handle databases with 1000's of tables.<br /><br />If you would like to learn more about the development and design of PBXT then join me for my session at the MySQL User Conference in Santa Clara, on Wed, April 25, 10:45am - 11:45am, Ballroom F:<br /><br /><a href="http://www.mysqlconf.com/cs/mysqluc2007/view/e_sess/10410">PrimeBase XT: Design and Implementation of a Transactional Storage Engine</a><br /><br />Looking ahead, I would also like to invite all who are interested to the <a href="http://www.mysqlconf.com/cs/mysqluc2007/view/e_sess/14125">"Scalable BLOB Streaming Infrastructure" BoF</a>, which is scheduled for Tuesday, April 24, 8:30pm - 9:30pm.<br /><br />This will be an informal discussion of our plans for the implementation of scalable BLOB streaming in MySQL, and how this all fits together with PBXT and other engines.<br /><br />We would like as much input as possible at this early stage, so I hope you can make it!<br /><br />I will also be taking part in a number of lightning rounds so check out <a href="http://www.mysqlconf.com/cs/mysqluc2007/view/e_spkr/3104">this page</a> for a complete list of my sessions:<br /><br />I'm looking forward to seeing you all there!Paul McCullaghhttp://www.blogger.com/profile/10476275147692948727noreply@blogger.com