tag:blogger.com,1999:blog-78445263962103784822008-07-26T00:22:08.139-07:00Noel O'Blogbaoilleachhttp://www.blogger.com/profile/03288289351940689018noreply@blogger.comBlogger86125tag:blogger.com,1999:blog-7844526396210378482.post-62841671816575576312008-07-23T08:47:00.000-07:002008-07-23T09:52:54.121-07:00Chemistry in R<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp3.blogger.com/_x5Hz3F0jd4Q/RzmP6AkcQlI/AAAAAAAAAWk/CL9zrFD-u24/s1600-h/Rlogo.jpg"><img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;" src="http://bp3.blogger.com/_x5Hz3F0jd4Q/RzmP6AkcQlI/AAAAAAAAAWk/CL9zrFD-u24/s400/Rlogo.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5132291477113422418" /></a>For many cheminformaticians, <a href="http://www.r-project.org/">R</a> is the preferred way of analysing multivariate data and developing predictive models. However, it is not so widely known that there are R packages available that are directly aimed at handling chemical data.<br /><br />Over the last few years, <a href="http://cheminfo.informatics.indiana.edu/~rguha/">Rajarshi Guha</a> (Indiana University) has been doing some nice work integrating the <a href="http://cdk.sf.net">CDK</a> and R. His publication in J. Stat. Soft., <a href="http://www.jstatsoft.org/v18/a05/paper">Chemical Informatics Functionality in R</a>, describes the <b>rcdk</b> and <b>rpubchem</b> packages. The rcdk package allows the user to read SDF files directly into R, calculate fingerprints and descriptors, calculate Tanimoto values, view molecules in 2D (JChemPaint) and 3D (Jmol), calculate SMILES strings, and access the property fields of file formats. The rpubchem package is focused on downloading compounds, property values and assay data from PubChem. See also articles in R news and CDK news [<a href="http://cran.r-project.org/doc/Rnews/Rnews_2006-3.pdf">1</a>], [<a href="http://almost.cubic.uni-koeln.de/cdk/cdk_top/cdk_news/archive/cdknews2.1.article12.pdf">2</a>], [<a href="http://almost.cubic.uni-koeln.de/cdk/cdk_top/cdk_news/archive/cdknews2.1.article11.pdf">3</a>].<br /><br />A more recent development is <a href="http://bioweb.ucr.edu/ChemMineV2/chemminer/">ChemmineR</a>, described in the latest issue of Bioinformatics: "<a href="http://dx.doi.org/10.1093/bioinformatics/btn307">ChemmineR: a compound mining framework for R</a>". The authors appear to be unaware of the earlier work by Rajarshi, and so there is no comparison of available features. However, based on the <a href="http://bioweb.ucr.edu/ChemMineV2/chemminer/tutorial">documentation</a> on their website, it seems that much of the functionality revolves around a type of fingerprint called atom-pair descriptors (APD). SDF files, when read in, are converted to a database of APDs and these can be used for similarity searching, clustering, removal of duplicates and so on. Sets of molecules can be visualised using a web connection to the <a href="http://bioweb.ucr.edu/ChemMineV2/">ChemMine portal</a> (I'm not sure what software is used). According to the documentation, future work will include descriptor calculation with <a href="http://joelib.sf.net">JOELib</a>.<br /><br />So, there you have it. An exhaustive survey of the two available methods for bringing chemistry into R. Is the time ripe for a cheminformatics equivalent to <a href="http://www.bioconductor.org/">Bioconductor</a>?baoilleachhttp://www.blogger.com/profile/03288289351940689018noreply@blogger.comtag:blogger.com,1999:blog-7844526396210378482.post-40017208934447047432008-07-15T14:00:00.000-07:002008-07-15T14:36:38.456-07:00Review of "IronPython in Action"<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.manning.com/foord/"><img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 150px;" src="http://www.manning.com/foord/foord_cover150.jpg" border="0" alt="" /></a>The three most well-known implementations of Python are the original C implementation (CPython), one in Java (Jython), and one in .NET (IronPython). <a href="http://jython.org">Jython</a> makes it easy to write Python programs that use Java classes and to test a Java library at an interactive prompt. Similarly, <a href="http://www.codeplex.com/Wiki/View.aspx?ProjectName=IronPython">IronPython</a> allows Python programs to be written that use the .NET libraries (these are the same libraries used by C# and Visual Basic).<br /><br />IronPython (sometimes abbreviated as FePy) is relatively new and had its first release in 2006. It is Open Source and developed by Microsoft (by the same developer who started Jython). Although .NET is strictly for Windows, IronPython can also run on top of <a href="http://www.mono-project.com/Main_Page">Mono</a>, the open source implementation of .NET which is available cross-platform.<br /><br />As a cheminformatician should you be interested in IronPython or .NET? Well, Dimitris Agrafiotis <a href="http://www.ccl.net/cgi-bin/ccl/message-new?2008+01+15+003+raw">thinks so</a>. And Eli Lilly recently <a href="http://www.bio-itworld.com/BioIT_Article.aspx?id=75790">open sourced</a> a life science platform implemented in .NET (here's the <a href="http://www.sf.net/projects/lsg">SF website</a>, but where's the mailing list?). I'm not yet convinced, but I'm keeping an open mind. I'm helping Matt Sprague to prepare C# bindings for OpenBabel. I think this will make OpenBabel the first cheminformatics library accessible from C# (although probably someone will contradict me below).<br /><br />At this point, if you're still interested, you should check out the first chapter of "IronPython in Action", which you can <a href="http://www.manning.com/foord/meap_foordch1.pdf">read on the web</a>. This explains in more detail the background to IronPython and why you might want to use it. <a href="http://www.manning.com/foord/">IronPython in Action</a> will be published later this year by Manning Press, and is written by Michael Foord and Christian Muirhead. Foord (aka Fuzzyman) is a <a href="http://www.voidspace.org.uk/python/weblog/index.shtml">very active blogger</a> on Python and FePy in particular, and works at a company whose main product is implemented in FePy.<br /><br />Just to make it clear, this is not a book for someone who wants to pick up programming from scratch. There's little hand holding here. The obligatory introduction to Python is quickly and efficiently dealt with before moving on to the main subject - accessing .NET classes and creating a GUI application. For those coming from C# to IronPython, there is a whole chapter on unit testing with Python, as well as another that covers everything from Python magic methods to metaprogramming. A particularly nice feature of the authors' style is to logically link each section to the following, so that you always understand the point of what you've just read and where it fits into the bigger picture.<br /><br />From my point of view, it's nice to see that explanations are provided for those using the free version of Visual Studio, Visual Studio Express, as this means that I can test out all of the examples myself. Other chapters yet to be finalised cover using IronPython in a webbrowser (apparently IronPython can run in <a href="http://silverlight.net/">Silverlight</a>, the new browser plugin from Microsoft), and extending IronPython with C# or VisualBasic.<br /><br />In short, for any serious users of IronPython, this book is a must have. It may also convince those using C# to make the switch to a better and more productive life with Python...baoilleachhttp://www.blogger.com/profile/03288289351940689018noreply@blogger.comtag:blogger.com,1999:blog-7844526396210378482.post-39826256992506580192008-07-15T13:04:00.000-07:002008-07-15T13:16:47.344-07:00Reacting to questionsDon't blink or you'll miss it. My 15 minutes of fame starts now.* <a href="http://www.sciencebase.com/">David Bradley</a> tracked me down in cyberspace, and turned me into a <a href="http://www.reactivereports.com/74/74_0.html">Reactive Profile</a>.<br /><br />*<b>Update</b>: 15 minutes now over.baoilleachhttp://www.blogger.com/profile/03288289351940689018noreply@blogger.comtag:blogger.com,1999:blog-7844526396210378482.post-1727999092697079812008-07-10T08:49:00.000-07:002008-07-10T14:10:04.871-07:00Pipeline Python - Generate a workflow<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://farm1.static.flickr.com/80/225007582_58aa401151_m.jpg"><img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer; width: 240px;" src="http://farm1.static.flickr.com/80/225007582_58aa401151_m.jpg" alt="" border="0" /></a>Workflow packages such as <a href="http://accelrys.com/">Pipeline Pilot</a>, <a href="http://taverna.sf.net/">Taverna</a> and <a href="http://www.knime.org/">KNIME</a> allow the user to graphically create a pipeline to process molecular data. A downside of these packages is that the units of the workflow, the nodes, process data sequentially. That is, no data gets to Node 2 until Node 1 has finished processing all of it. <b>Correction (thanks Egon):</b> The previous line is plain incorrect. Both KNIME and Taverna2, at least, pass on partially processed data as soon as it's available.<br /><br />Wouldn't it be nicer if they worked more like Unix pipes, that is, as soon as some data comes out of Node 1 it gets passed onto the next Node and so on. This would have three advantages: (1) you get the first result quicker, (2) you don't use up loads of memory storing all of the intermediate results, (3) you can run things in parallel, e.g. Node 2 could start processing the data from Node 1 immediately, perhaps even on a different computer.<br /><br />Luckily, there is a neat feature in Python called a generator that allows you to create a pipeline that processes data in parallel. Generators are functions that return a sequence of values. However, unlike just returning a list of values, they only calculate and return the next item in the sequence when requested. One reason this is useful is because the sequence of items could be very large, or even infinite in length. (For a more serious introduction, see <a href="http://www.dabeaz.com/generators/">David Beazley's talk</a> at PyCon'08, which is the inspiration for this blog post.)<br /><br />Let's create a pipeline for processing an SDF file that has three nodes: (1) a filter node that looks for the word "ZINC00" in the title of the molecule, (2) a filter node for Tanimoto similarity to a target molecule, (3) an output node that returns the molecule title. (The full program is presented at the end of this post.)<pre><span style="color: rgb(0, 0, 255);"># Pipeline Python!</span><br />pipeline = createpipeline((titlematches, "<span style="color: rgb(255, 0, 255);">ZINC00</span>"),<br /> (similarto, targetmol, 0.50),<br /> (moltotitle,))<br /><br /><span style="color: rgb(0, 0, 255);"># Create an input source</span><br />dataset = pybel.readfile("<span style="color: rgb(255, 0, 255);">sdf</span>", inputfile)<br /><br /><span style="color: rgb(0, 0, 255);"># Feed the pipeline </span><br />results = pipeline(dataset)</pre>The variable 'results' is a generator, so nothing actually happens until we request the values returned by the generator...<pre><span style="color: rgb(0, 0, 255);"># Print out each answer as it comes</span><br /><span style="color: rgb(128, 64, 64);"><b>for</b></span> title <span style="color: rgb(128, 64, 64);"><b>in</b></span> results:<br /> <span style="color: rgb(128, 64, 64);"><b>print</b></span> title<br /></pre>The titles of the molecules found will appear on the screen one by one as they are found, just like in a Unix pipe. Note how easy it is to combine nodes into a pipeline.<br /><br />Here's the full program:<pre style="overflow: auto; height: 300px; width: 400px;"><br /><span style="color: rgb(160, 32, 240);">import</span> re<br /><span style="color: rgb(160, 32, 240);">import</span> os<br /><span style="color: rgb(160, 32, 240);">import</span> itertools<br /><br /><span style="color: rgb(0, 0, 255);"># from cinfony import pybel</span><br /><span style="color: rgb(160, 32, 240);">import</span> pybel<br /><br /><span style="color: rgb(128, 64, 64);"><b>def</b></span> <span style="color: rgb(0, 128, 128);">createpipeline</span>(*filters):<br /> <span style="color: rgb(128, 64, 64);"><b>def</b></span> <span style="color: rgb(0, 128, 128);">pipeline</span>(dataset):<br /> piped_data = dataset<br /> <span style="color: rgb(128, 64, 64);"><b>for</b></span> filter <span style="color: rgb(128, 64, 64);"><b>in</b></span> filters:<br /> piped_data = filter[0](piped_data, *filter[1:])<br /> <span style="color: rgb(128, 64, 64);"><b>return</b></span> piped_data<br /> <span style="color: rgb(128, 64, 64);"><b>return</b></span> pipeline<br /><br /><span style="color: rgb(128, 64, 64);"><b>def</b></span> <span style="color: rgb(0, 128, 128);">titlematches</span>(mols, patt):<br /> p = re.compile(patt)<br /> <span style="color: rgb(128, 64, 64);"><b>return</b></span> (mol <span style="color: rgb(128, 64, 64);"><b>for</b></span> mol <span style="color: rgb(128, 64, 64);"><b>in</b></span> mols <span style="color: rgb(128, 64, 64);"><b>if</b></span> p.search(mol.title))<br /><br /><span style="color: rgb(128, 64, 64);"><b>def</b></span> <span style="color: rgb(0, 128, 128);">similarto</span>(mols, target, cutoff=0.7):<br /> target_fp = target.calcfp()<br /> <span style="color: rgb(128, 64, 64);"><b>return</b></span> (mol <span style="color: rgb(128, 64, 64);"><b>for</b></span> mol <span style="color: rgb(128, 64, 64);"><b>in</b></span> mols <span style="color: rgb(128, 64, 64);"><b>if</b></span> (mol.calcfp() | target_fp) &gt;= cutoff)<br /><br /><span style="color: rgb(128, 64, 64);"><b>def</b></span> <span style="color: rgb(0, 128, 128);">moltotitle</span>(mols):<br /> <span style="color: rgb(128, 64, 64);"><b>return</b></span> (mol.title <span style="color: rgb(128, 64, 64);"><b>for</b></span> mol <span style="color: rgb(128, 64, 64);"><b>in</b></span> mols)<br /><br /><span style="color: rgb(128, 64, 64);"><b>if</b></span> __name__ == "<span style="color: rgb(255, 0, 255);">__main__</span>":<br /> inputfile = os.path.join("<span style="color: rgb(255, 0, 255);">..</span>", "<span style="color: rgb(255, 0, 255);">face-off</span>", "<span style="color: rgb(255, 0, 255);">timing</span>", "<span style="color: rgb(255, 0, 255);">3_p0.0.sdf</span>")<br /> dataset = pybel.readfile("<span style="color: rgb(255, 0, 255);">sdf</span>", inputfile)<br /> findtargetmol = createpipeline((titlematches, "<span style="color: rgb(255, 0, 255);">ZINC00002647</span>"),)<br /> targetmol = findtargetmol(dataset).next()<br /><br /> <span style="color: rgb(0, 0, 255);"># Pipeline Python!</span><br /> pipeline = createpipeline((titlematches, "<span style="color: rgb(255, 0, 255);">ZINC00</span>"),<br /> (similarto, targetmol, 0.50),<br /> (moltotitle,))<br /><br /> <span style="color: rgb(0, 0, 255);"># Create an input source</span><br /> dataset = pybel.readfile("<span style="color: rgb(255, 0, 255);">sdf</span>", inputfile)<br /><br /> <span style="color: rgb(0, 0, 255);"># Feed the pipeline </span><br /> results = pipeline(dataset)<br /><br /> <span style="color: rgb(0, 0, 255);"># Print out each answer as it comes through the pipeline </span><br /> <span style="color: rgb(128, 64, 64);"><b>for</b></span> title <span style="color: rgb(128, 64, 64);"><b>in</b></span> results:<br /> <span style="color: rgb(128, 64, 64);"><b>print</b></span> title<br /></pre><br /><br />So, if in future someone tells you that Python generators can be used to make a workflow, don't say "I never node that".<br /><br />Image: <a href="http://www.flickr.com/photos/baggis/">Travis S.</a>baoilleachhttp://www.blogger.com/profile/03288289351940689018noreply@blogger.comtag:blogger.com,1999:blog-7844526396210378482.post-10421935773212958852008-07-08T01:11:00.000-07:002008-07-08T01:12:05.071-07:00Chemoinformatics p0wned by cheminformaticsThe headline says it all. After two weeks, and thousands, er...40 votes, the results are in: 26 for chem<b>i</b><span class="blsp-spelling-error" id="SPELLING_ERROR_0">nformatics</span>, 14 for that other one.<br /><br />Next week, <span class="blsp-spelling-error" id="SPELLING_ERROR_1">bioinformatics</span> versus <span class="blsp-spelling-error" id="SPELLING_ERROR_2">binformatics</span>...baoilleachhttp://www.blogger.com/profile/03288289351940689018noreply@blogger.comtag:blogger.com,1999:blog-7844526396210378482.post-45471273144379020132008-07-06T12:20:00.000-07:002008-07-07T00:09:08.987-07:00Cheminformatics toolkit face-off: Speed (Python vs Java vs C++)<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://farm1.static.flickr.com/31/46137561_a243d05871_m.jpg"><img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 240px;" src="http://farm1.static.flickr.com/31/46137561_a243d05871_m.jpg" border="0" alt="" /></a>It's a bit meaningless to compares speeds of different toolkits performing the same operations. Functionality is really the reason you are going to favour one toolkit over another. Here I'm focusing on comparing the speed of accessing the same toolkit from <a href="http://python.org/">CPython</a> or <a href="http://jython.org/">Jython</a> versus accessing it directly in its original language. In other words, what price do you pay to be able to work in Python rather than C++ or Java? (I'll discuss the advantages of working in Python in a later post.)<br /><br />To begin with, let's consider accessing the <a href="http://cdk.sf.net/">CDK</a> from Python versus from Java. The test input file for all of these tests is the first subset of the drug-like ligands in <a href="http://zinc.docking.org/">ZINC</a>, 3_p0.0.sdf, which contains 24098 molecules. The three test cases are (1) Iterate over all of the molecules, (2) iterate over all of molecules and write out the molecular weight, (3) calculate 25 descriptor values for the first 228 molecules. I implemented these in Java, and using <a href="http://baoilleach.blogspot.com/2008/06/ann-cinfony-02-easy-to-install-version.html">cinfony</a>. For example, here's the cinfony version for test case 2:<pre><span style="color:#a020f0;">import</span> time<br /><span style="color:#a020f0;">from</span> cinfony <span style="color:#a020f0;">import</span> cdk<br /><br />t = time.time()<br /><span style="color:#804040;"><b>for</b></span> mol <span style="color:#804040;"><b>in</b></span> cdk.readfile("<span style="color:#ff00ff;">sdf</span>", "<span style="color:#ff00ff;">3_p0.0.sdf</span>"):<br /> <span style="color:#804040;"><b>print</b></span> mol.molwt<br /><span style="color:#804040;"><b>print</b></span> time.time() - t</pre><br />Here are the results (times are in seconds):<table><tbody><tr><td><b>Method</b></td><td><b>Test 1</b></td><td><b>Test 2</b></td><td><b>Test 3</b></td></tr><tr><td>Java</td><td>22.2</td><td>38.9</td><td>31.7</td></tr><tr><td>CPython (cinfony, cdkjpype)</td><td>34.0</td><td>72.6</td><td>38.2</td></tr><tr><td>Jython (cinfony, cdkjython)</td><td>23.7</td><td>44.4</td><td>34.0</td></tr></tbody></table><br />It's clear that accessing the CDK from Jython is almost as fast as using it in a Java program. However, there is an overhead associated with using it from CPython except where, as in Test 3, most of the time is spent in computation.<br /><br />Next, let's look at accessing OpenBabel from Python versus from C++. Here I will compare the following tests cases: (1) iterate over all of the molecules, (2) iterate over all of molecules and write out the molecular weight, (3) apply 30 steps of a forcefield optimisation to the first 200 molecules. Here's an example of the cinfony script for (2). Notice any similarities to the one above? :-)<br /><pre><span style="color:#a020f0;">import</span> time<br /><span style="color:#a020f0;">from</span> cinfony <span style="color:#a020f0;">import</span> pybel<br /><br />t = time.time()<br /><span style="color:#804040;"><b>for</b></span> mol <span style="color:#804040;"><b>in</b></span> pybel.readfile("<span style="color:#ff00ff;">sdf</span>", "<span style="color:#ff00ff;">3_p0.0.sdf</span>"):<br /> <span style="color:#804040;"><b>print</b></span> mol.molwt<br /><span style="color:#804040;"><b>print</b></span> time.time() - t</pre><br />Here are the results (note: measured on a different machine than the tests above):<table><tbody><tr><td><b>Method</b></td><td><b>Test 1</b></td><td><b>Test 2</b></td><td><b>Test 3</b></td></tr><tr><td>C++</td><td>77.7</td><td>126.0</td><td>56.8</td></tr><tr><td>CPython (cinfony, Pybel)</td><td>78.3</td><td>132.9</td><td>60.0</td></tr><tr><td>Jython (cinfony, Jybel)</td><td>80.4</td><td>135.7</td><td>59.4</td></tr></tbody></table>In short, the cost of using Pybel or Jybel is small.<br /><br />Technical notes: For the CDK, I calculated the values of all descriptors (except IP) that didn't return an array of values. This came to 25 descriptors. I also skipped over one molecule, #20, that took several seconds to process. Jython can natively access Java libraries such as the CDK, but to access the CDK from CPython, cinfony uses JPype. cinfony uses the Python SWIG wrappers to access OpenBabel from CPython; Jython is using the Java SWIG wrappers for OpenBabel. I need to repeat the runs a few times on a quiet machine to get better figures, but I note that the figures do not include the cost of loading the OpenBabel DLL or starting up the JVM.<br /><br />Image credit:<a href="http://www.flickr.com/photos/macronin47/">MacRonin47</a>baoilleachhttp://www.blogger.com/profile/03288289351940689018noreply@blogger.comtag:blogger.com,1999:blog-7844526396210378482.post-81681362094262533182008-07-05T01:39:00.000-07:002008-07-05T01:47:41.208-07:00ANN: OpenBabel 2.2.0 Released<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp0.blogger.com/_x5Hz3F0jd4Q/RpFLR296JRI/AAAAAAAAAVs/MSE1R8fUBKk/s1600-h/logo.bmp"><img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;" src="http://bp0.blogger.com/_x5Hz3F0jd4Q/RpFLR296JRI/AAAAAAAAAVs/MSE1R8fUBKk/s400/logo.bmp" border="0" alt=""id="BLOGGER_PHOTO_ID_5084928224463037714" /></a>Here is the official announcement from <a href="http://geoffhutchison.net">Geoff Hutchison</a>:<blockquote>I am very happy to finally announce the release of Open Babel 2.2.0, the latest stable version of the open source chemistry toolbox.<br /><br />This release represents a major update and should be a stable upgrade, strongly recommended for all users of Open Babel. Highlights include improved force fields and coordinate generation, conformer searching, enhanced plugins including molecular descriptors, filters, and command-line transformations. Many formats are improved or added, including CIF, mmCIF, Gaussian cube, PQR, OpenDX cubes, and more. Improved developer API and scripting support and many, many bug fixes are also included.<br /><br />What's new? See the <a href="http://openbabel.org/wiki/Open_Babel_2.2.0">full release notes</a>.<br /><br />To download, see our <a href="http://openbabel.org/wiki/Install">Install Page</a>.<br /><br />For more information, see the <a href="http://openbabel.org/">project website</a>.<br /><br />I would like to personally thank a few people for making this release a great one. In alphabetical order, Jean Bréfort, Andrew Dalke, Marcus Hanwell, Chris Morley, Noel O'Boyle, Kevin Shepherd, Tim Vandermeersch, and Ugo Varetto.<br /><br />This is a community project and we couldn't have made this release without you. Many thanks to all the contributors to Open Babel including those of you who submitted feedback, bug reports, and code.<br /><br />Cheers,<br />-Geoff</blockquote>baoilleachhttp://www.blogger.com/profile/03288289351940689018noreply@blogger.comtag:blogger.com,1999:blog-7844526396210378482.post-7545883692072861952008-06-25T00:18:00.000-07:002008-06-26T02:58:42.822-07:00ANN: cinfony 0.2 - the "easy to install" version<div style="height:300px; width: 200px; float:right; overflow:hidden"><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp2.blogger.com/_x5Hz3F0jd4Q/SCH46fz21jI/AAAAAAAAAf4/X8mLGfNZWIE/s1600-h/edited_crop.png"><img style="float:right; margin:0 -100px 10px 10px;cursor:pointer; cursor:hand;" src="http://bp2.blogger.com/_x5Hz3F0jd4Q/SCH46fz21jI/AAAAAAAAAf4/X8mLGfNZWIE/s320/edited_crop.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5197709128817366578" /></a></div>A new version of <a href="http://code.google.com/p/cinfony">cinfony</a> is now available. The good news is that it now much easier to install for Windows users. <br /><br />All you need now is to have Python and Java, download the CDK and the RDKit, edit a configuration file, and away you go using OpenBabel, the RDKit and the CDK from Python. The full instructions are <a href="http://code.google.com/p/cinfony/wiki/NewInstallationInstructions">here</a>.<br /><br />I've updated some of the docstrings, but documentation is still sparse and the Pybel documentation is still the best guide, as linked to on the <a href="http://code.google.com/p/cinfony">cinfony web site</a>.<br /><br />Here's an example of what's possible with cinfony. Suppose you want to convert a SMILES string to 3D coordinates with OpenBabel, create a 2D depiction of that molecule with the RDKit, calculate descriptors with the CDK, and write out an SDF file containing the descriptor values and the 3D coordinates.<pre><font color="#a020f0">from</font> cinfony <font color="#a020f0">import</font> rdkit, cdk, pybel<br />mol = pybel.readstring(&quot;<font color="#ff00ff">smi</font>&quot;, &quot;<font color="#ff00ff">CCC=O</font>&quot;)<br />mol.make3D()<br />rdkit.Molecule(mol).draw(show=False,<br /> filename=&quot;<font color="#ff00ff">aldehyde.png</font>&quot;)<br />descs = cdk.Molecule(mol).calcdesc()<br />mol.data.update(descs)<br />mol.write(&quot;<font color="#ff00ff">sdf</font>&quot;, filename=&quot;<font color="#ff00ff">aldehyde.sdf</font>&quot;)</pre>This new version of cinfony also includes Jybel, so that you can now access both the CDK and OpenBabel from Jython, as well as from CPython.baoilleachhttp://www.blogger.com/profile/03288289351940689018noreply@blogger.comtag:blogger.com,1999:blog-7844526396210378482.post-81746588659673448592008-06-24T06:22:00.000-07:002008-06-24T11:49:31.002-07:00The Forced Authorship Licence - Get your users to write papers for you<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://farm4.static.flickr.com/3009/2471444662_055d8da92d_m.jpg"><img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 240px;" src="http://farm4.static.flickr.com/3009/2471444662_055d8da92d_m.jpg" border="0" alt="" /></a>You are probably familiar with commercial software licences. You may even have heard of open source licences. But are you familiar with the Forced Authorship Licence (FAL) model? Let me give you an example from real life (the name has been removed to focus on the actual licence):<blockquote>"X is available free of charge for researchers belonging to Academic community. The download and use of X is subject to the X Academic Licence...Every users is associated with one of the X team. This will help the user in installing and using X and will be co-author of the first paper published by the user, containing X results. You must contact one of the three group leaders and discuss your proposed project before applying for the use of X."</blockquote> <br />I think this is a great idea. Think of all the publications. If I had thought of this a few years ago, I would now have 30 extra papers instead of the <a href="http://gausssum.sourceforge.net/DocBook/pr02.html">30 citations</a> of <a href="http://gausssum.sf.net/">GaussSum</a>.<br /><br />But it's never to late to start. And why stop at software? If I'm going to be competing with people that use the FAL, I need to think smarter. From now on, all of my papers, software, blog posts, personal communications, and any ideas that arose while reading my papers, attending my talks or reading my posters will carry the Noel O'Blog Licence (NOABL - 'A' for apostrophe and don't you forget it). Instead of citing me, any resulting publication must carry my name as sole author, in bold - no, make that in fire in letters thirty feet high - and when applying for grants, I'm allowed to list these publications together with my own work. <br /><br />Sounds fair, doesn't it? Oh - I almost forgot. It's spelt "Noel M. O'Boyle". And don't forget that apostrophe.<br /><br />Image: <a href="http://www.flickr.com/photos/teflon/">Martin Deutsch</a>baoilleachhttp://www.blogger.com/profile/03288289351940689018noreply@blogger.comtag:blogger.com,1999:blog-7844526396210378482.post-78218990060842064022008-06-23T14:56:00.000-07:002008-06-23T14:58:42.244-07:00O No - It's cheminformatics!It is time to cast your vote in the greatest polarising debate of our times. Yes indeed: should it be cheminformatics or chem<b>o</b>informatics? Vote now (see poll on right).<br /><br />Not to sway any undecided voters, but I'm definitely in favour of "cheminformatics". My main reason is that I'm worried that if the other camp win out, they'll probably decide to change more words: we'll end up doing chem<b>o</b>istry, like our Australian cousins.<br /><br />You have 13 days to cast your vote...baoilleachhttp://www.blogger.com/profile/03288289351940689018noreply@blogger.comtag:blogger.com,1999:blog-7844526396210378482.post-49234795781812472772008-06-23T13:44:00.000-07:002008-06-23T14:08:00.479-07:00An RSS feed for the CCL list<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp0.blogger.com/_x5Hz3F0jd4Q/SCfvP8pYi0I/AAAAAAAAAgA/g6m3D8ObaRQ/s1600-h/rect20.png"><img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;" src="http://bp0.blogger.com/_x5Hz3F0jd4Q/SCfvP8pYi0I/AAAAAAAAAgA/g6m3D8ObaRQ/s400/rect20.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5199387352079305538" /></a>Apparently some people still read the <a href="http://ccl.net">CCL</a> (Computational Chemistry List) using email. I gave up on that some time ago. Here's an RSS feed I threw together some time ago, and which you might find useful: CCL feed <a href="http://publishious.org/site_media/RSS/ccl.rss"><img src="http://www.feedburner.com/fb/images/pub/feed-icon16x16.png" border="0" /></a><br /><br />If you don't know what an RSS feed is, and how it might be useful, it's quicker just to test it out than to explain. First of all, you need a feed reader: for example get an account on <a href="http://reader.google.com">Google Reader</a>, a free online RSS reader. Next, subscribe to whatever RSS feeds you want, by clicking on "Add subscription" and copying and pasting the URL of the RSS feed into the box that appears.<br /><br />Here's how the feed is created:<br /><pre style="overflow:auto; height:300px; width:400px;"><font color="#a020f0">import</font> sys<br /><font color="#a020f0">import</font> email<br /><font color="#a020f0">import</font> datetime<br /><font color="#a020f0">import</font> pdb<br /><br /><font color="#a020f0">from</font> ftplib <font color="#a020f0">import</font> FTP<br /><font color="#a020f0">from</font> StringIO <font color="#a020f0">import</font> StringIO<br /><br /><font color="#a020f0">import</font> PyRSS2Gen<br /><br /><font color="#804040"><b>def</b></font> <font color="#008080">breakdaily</font>(messages):<br /> &quot;&quot;&quot;<br /><font color="#ff00ff"> &gt;&gt;&gt; import pickle</font><br /><font color="#ff00ff"> &gt;&gt;&gt; a = pickle.load(open(&quot;tmp20070105&quot;))</font><br /><font color="#ff00ff"> &gt;&gt;&gt; len(breakdaily(a))</font><br /><font color="#ff00ff"> 1</font><br /><font color="#ff00ff"> </font>&quot;&quot;&quot;<br /> broken = []<br /> message = []<br /> <font color="#804040"><b>for</b></font> line <font color="#804040"><b>in</b></font> messages.split(&quot;<font color="#6a5acd">\n</font>&quot;):<br /> <font color="#804040"><b>if</b></font> line.startswith(&quot;<font color="#ff00ff">From owner-chemistry@ccl.net</font>&quot;):<br /> broken.append(&quot;<font color="#6a5acd">\n</font>&quot;.join(message))<br /> message = []<br /> <font color="#804040"><b>else</b></font>:<br /> message.append(line)<br /> broken.append(&quot;<font color="#6a5acd">\n</font>&quot;.join(message))<br /> <font color="#804040"><b>return</b></font> broken[1:]<br /><br /><font color="#804040"><b>def</b></font> <font color="#008080">getlatest</font>(N):<br /> ftp = FTP('<font color="#ff00ff">ftp.ccl.net</font>')<br /> ftp.login('<font color="#ff00ff">anonymous</font>')<br /> ftp.cwd('<font color="#ff00ff">/pub/chemistry/archived-messages</font>')<br /> <font color="#0000ff"># ftp.dir()</font><br /><br /> listoffiles = ftp.nlst()<br /><br /> <font color="#0000ff"># Get current year</font><br /> year = datetime.datetime.now().year<br /><br /> <font color="#0000ff"># Get N most recent messages</font><br /> messagetot = 0<br /> months = [str(x).zfill(2) <font color="#804040"><b>for</b></font> x <font color="#804040"><b>in</b></font> range(1, 13)]<br /> months.reverse()<br /> days = [str(x).zfill(2) <font color="#804040"><b>for</b></font> x <font color="#804040"><b>in</b></font> range(1, 32)]<br /> days.reverse()<br /> msgs = []<br /><br /><br /> <font color="#804040"><b>while</b></font> messagetot&lt;N:<br /> ftp.cwd(str(year))<br /><br /> <font color="#804040"><b>for</b></font> month <font color="#804040"><b>in</b></font> months:<br /> ftp.cwd(month)<br /> availabledays = ftp.nlst()<br /> <font color="#804040"><b>for</b></font> day <font color="#804040"><b>in</b></font> [x <font color="#804040"><b>for</b></font> x <font color="#804040"><b>in</b></font> days <font color="#804040"><b>if</b></font> x <font color="#804040"><b>in</b></font> availabledays]:<br /> <font color="#0000ff"># Go thru in reverse order but exclude days that are</font><br /> <font color="#0000ff"># non-existent</font><br /> messages = StringIO()<br /> ftp.retrbinary(&quot;<font color="#ff00ff">RETR %s</font>&quot; % day, messages.write)<br /><font color="#0000ff">## pickle.dump(messages.getvalue(), open(&quot;tmp%i%s%s&quot; % (year,month,day), &quot;w&quot;))</font><br /> listmsgs = breakdaily(messages.getvalue())<br /> <font color="#0000ff"># print &quot;=&quot;*24 + &quot;\n&quot;, messages.getvalue()</font><br /> <font color="#804040"><b>for</b></font> i,msg <font color="#804040"><b>in</b></font> enumerate(listmsgs):<br /> msg_content = email.message_from_string(msg)<br /> text = &quot;&quot;<br /> <font color="#804040"><b>for</b></font> part <font color="#804040"><b>in</b></font> msg_content.walk():<br /> <font color="#804040"><b>if</b></font> (part.get_content_maintype()==&quot;<font color="#ff00ff">text</font>&quot; <font color="#804040"><b>and</b></font><br /> part.get_content_type()==&quot;<font color="#ff00ff">text/plain</font>&quot;):<br /> text = part.get_payload()<br /> text = text.replace(&quot;<font color="#6a5acd">\n</font>&quot;,&quot;<font color="#ff00ff">&lt;br/&gt;</font>&quot;).decode(&quot;<font color="#ff00ff">iso-8859-1</font>&quot;, &quot;<font color="#ff00ff">strict</font>&quot;)<br /> msgs.append( (year, month, day, msg_content, i+1, text) )<br /> messagetot += len(listmsgs)<br /> <font color="#804040"><b>if</b></font> messagetot&gt;=N:<br /> <font color="#804040"><b>break</b></font><br /> <font color="#804040"><b>if</b></font> messagetot&gt;=N:<br /> <font color="#804040"><b>break</b></font><br /> ftp.cwd(&quot;<font color="#ff00ff">..</font>&quot;)<br /> ftp.cwd(&quot;<font color="#ff00ff">..</font>&quot;)<br /> year -= 1 <font color="#0000ff"># Continue into the previous year</font><br /><br /> ftp.quit()<br /> msgs.reverse()<br /><br /> <font color="#804040"><b>return</b></font> msgs<br /><br /><font color="#804040"><b>def</b></font> <font color="#008080">main</font>():<br /><br /> <font color="#804040"><b>print</b></font> &quot;<font color="#6a5acd">\n</font><font color="#ff00ff">Starting...</font>&quot;<br /><br /> messages = getlatest(100)<br /><font color="#0000ff">## outputfile = open(&quot;messages.pickle&quot;, &quot;w&quot;)</font><br /><font color="#0000ff">## pickle.dump(messages, outputfile)</font><br /><font color="#0000ff">## outputfile.close()</font><br /><font color="#0000ff">## import sys</font><br /><font color="#0000ff">## sys.exit(1)</font><br /><font color="#0000ff">## messages = pickle.load(open(&quot;messages.pickle&quot;, &quot;r&quot;))</font><br /><br /> rssitems = []<br /> <font color="#804040"><b>for</b></font> year, month, day, msg, id, messagetext <font color="#804040"><b>in</b></font> messages:<br /><br /> <font color="#0000ff"># Add the new item</font><br /> newitem = PyRSS2Gen.RSSItem(<br /> title = msg['<font color="#ff00ff">Subject</font>'],<br /> link = &quot;<font color="#ff00ff"><a href="http://ccl.net/cgi-bin/ccl/message-new?%d+%s+%s+%s">http://ccl.net/cgi-bin/ccl/message-new?%d+%s+%s+%s</a></font>&quot; % (<br /> year, month, day, str(id).zfill(3)),<br /> description = messagetext,<br /> <font color="#0000ff">## What's a guid? A globally unique id...used by RSS readers</font><br /> <font color="#0000ff">## to determine whether they've seen a particular news item</font><br /> <font color="#0000ff">## already</font><br /> guid = PyRSS2Gen.Guid(&quot;<font color="#ff00ff"><a href="http://ccl.net/cgi-bin/ccl/message-new?%d+%s+%s+%s">http://ccl.net/cgi-bin/ccl/message-new?%d+%s+%s+%s</a></font>&quot; % (<br /> year, month, day, str(id).zfill(3))),<br /> pubDate = msg['<font color="#ff00ff">Date</font>']<br /> )<br /> rssitems.append(newitem)<br /><br /> rss = PyRSS2Gen.RSS2(<br /> title = &quot;<font color="#ff00ff">CCL</font>&quot;,<br /> link = &quot;<font color="#ff00ff"><a href="http://www.ccl.net">http://www.ccl.net</a></font>&quot;,<br /> description = &quot;<font color="#ff00ff">RSS Feed of the world's greatest computational chemistry </font>&quot;<br /> &quot;<font color="#ff00ff">mailing list, chemistry@ccl.net (the CCL list)</font>&quot;,<br /> lastBuildDate = datetime.datetime.now(),<br /> items = rssitems)<br /> rss.write_xml(open(&quot;<font color="#ff00ff">ccl.rss</font>&quot;, &quot;<font color="#ff00ff">w</font>&quot;))<br /><br /> <font color="#804040"><b>print</b></font> &quot;<font color="#ff00ff">Finishing...</font><font color="#6a5acd">\n</font>&quot;<br /><br /><font color="#804040"><b>def</b></font> <font color="#008080">test</font>():<br /> <font color="#a020f0">import</font> doctest<br /> doctest.testmod()<br /><br /><font color="#804040"><b>def</b></font> <font color="#008080">do_debugger</font>(type, value, tb):<br /> pdb.pm()<br /><br /><font color="#804040"><b>if</b></font> __name__==&quot;<font color="#ff00ff">__main__</font>&quot;:<br /><br /> <font color="#0000ff"># sys.excepthook = do_debugger</font><br /> main()<br /><br /> <font color="#0000ff">## test() </font></pre>baoilleachhttp://www.blogger.com/profile/03288289351940689018noreply@blogger.comtag:blogger.com,1999:blog-7844526396210378482.post-66181354040198001872008-06-16T23:19:00.000-07:002008-06-16T23:24:19.302-07:00Jmol gets competition - OpenAstexViewer now available<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp3.blogger.com/_x5Hz3F0jd4Q/SFdYIV85K7I/AAAAAAAAAgg/A5jDw9_pxFk/s1600-h/openastexviewer.jpg"><img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;" src="http://bp3.blogger.com/_x5Hz3F0jd4Q/SFdYIV85K7I/AAAAAAAAAgg/A5jDw9_pxFk/s400/openastexviewer.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5212731994059385778" /></a>AstexViewer has just gone open source (LGPL), and is available from <a href="http://openastexviewer.net/">http://openastexviewer.net/</a>.<br /><br />Both Jmol and AstexViewer are 3D chemical structure applets that run in the browser. AstexViewer was originally developed by Mike Hartshorn at <a href="http://www.astex-therapeutics.com">Astex Therapeutics</a> for in-house visualisation of protein crystal structures. Although <a href="http://www.astex-therapeutics.com/AstexViewer/">available at no cost</a> for some time, it has not until now been open source.<br /><br />Jmol is in many ways an open source success story. It has several enthusiastic developers (who make new releases with new features every few days!), a very busy mailing list, a large number of users worldwide, and even a <a href="http://biomodel.uah.es/anun/manual-jmol/inicio-en.htm">recent book</a>. It will be interesting to see whether OpenAstexViewer is sufficiently different to attract users away from this well-established project.<br /><br />In any case, here's to diversity. Hopefully both projects will get interesting ideas from each other.baoilleachhttp://www.blogger.com/profile/03288289351940689018noreply@blogger.comtag:blogger.com,1999:blog-7844526396210378482.post-39614803255866732042008-05-21T13:23:00.000-07:002008-05-21T13:41:37.181-07:00Cheminformatics toolkit face-off - SMARTS matching<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://farm1.static.flickr.com/209/466098410_b3b18cb6ab_b.jpg"><img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer; width: 180px;" src="http://farm1.static.flickr.com/209/466098410_b3b18cb6ab_m.jpg" alt="" border="0" /></a><a href="http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html">SMARTS strings</a> are really useful for substructure searching in molecules. They are sort of like regular expressions for molecules. The idea and syntax of SMARTS comes from the people at <a href="http://www.daylight.com">Daylight</a>.<br /><br /><a href="http://cheminfo.informatics.indiana.edu/%7Erguha">Rajarshi Guha</a>, together with Dazhi Jiao (Uni. Indiana), have put together a test suite for SMARTS matchers, and run the test suite against the CDK, OpenBabel and OEChem. Greg contributed results for RDKit. The overall performance is described on <a href="http://cheminfo.informatics.indiana.edu/%7Erguha/code/smartscomp/">Rajarshi's website</a>.<br /><br />Here's the current summary of the results for the 158 test cases, but check <a href="http://cheminfo.informatics.indiana.edu/%7Erguha/code/smartscomp/">Rajarshi's page</a> for a more up-to-date picture:<table><tr><td><b>Toolkit</b></td><td><b>True</b></td><td><b>False</b></td></tr><tr><td>CDK</td><td>149</td><td>9</td></tr><tr><td>OpenBabel</td><td>149</td><td>9</td></tr><tr><td>RDKit</td><td>151</td><td>7</td></tr><tr><td>OpenEye</td><td>150</td><td>8</td></tr></table><br /><br />Image: <a href="http://flickr.com/photos/chris_radcliff/">Chris Radcliff</a>baoilleachhttp://www.blogger.com/profile/03288289351940689018noreply@blogger.comtag:blogger.com,1999:blog-7844526396210378482.post-75574379117002072322008-05-20T12:31:00.000-07:002008-05-20T13:46:53.093-07:00Cheminformatics toolkit face-off - Depiction Part 2<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://publishious.org/site_media/blog_b/sdg/82_pubchem_oasa.png"><img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 194px;" src="http://publishious.org/site_media/blog_b/sdg/82_pubchem_oasa.png" border="0" alt="" /></a>One of the aims of the <a href="http://baoilleach.blogspot.com/2008/05/cheminformatics-toolkit-face-off.html">previous toolkit face-off</a> was to get some feedback on the best options for drawing images with different toolkits. I also hoped that a comparison of the images would allow bugs to be easily identified. And finally, I thought that a bit of competition might help improve depiction and structure diagram generators across the board.<br /><br />Since the last post:<br /><ul><li><a href="http://www.molinspiration.com/">Molinspiration</a> have approached me to be included in the face-off</li><li>I've learnt that I should remove hydrogens from <a href="http://cdk.sf.net/">CDK</a> and OASA depictions and control the size of the generated image<br /></li><li>Beda has done amazing work enhancing depiction in OASA, and has in fact now <a href="http://bkchem.zirael.org/oasa_en.html">released OASA</a> separately from BKChem as an independent cheminformatics toolkit<br /></li><li>Greg is just about to release a new version of the <a href="http://rdkit.org/">RDKit</a> which has improved depiction, and he has fixed the bug <a href="http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=1086">identified by PMR</a></li><li>And I've fixed some bugs myself in <a href="http://baoilleach.blogspot.com/2008/05/ann-cinfony-01-toolkit-of.html">cinfony</a><br /></li></ul>Now the images themselves (<a href="http://publishious.org/site_media/blog/dataset.sdf.gz">same dataset</a> as before): [<a href="http://publishious.org/site_media/blog_b/depict.html">depiction</a>] and [<a href="http://publishious.org/site_media/blog_b/sdg.html">structure diagram generation</a>].<br /><br />Notes:<br /><ol><li>The face-off involves the open source toolkits the <a href="http://cdk.sf.net">CDK</a>, <a href="http://bkchem.zirael.org/oasa_en.html">OASA</a> and the <a href="http://rdkit.org">RDKit</a>, and the proprietary toolkits <a href="http://www.xemistry.com/">Cactvs</a> and <a href="http://www.molinspiration.com/">molinspiration</a>.<br /></li><li>If you want to help improve these depictions or coordinate generators, why not leave a comment below suggesting specific ways to improve them, or highlighting specific things they could do better.</li><li>Double bond stereochemistry doesn't seem to be preserved by the CDK, but this is possibly my fault (I'm awaiting a reply to an email to the cdk-users mailing list)</li><li>Several people complained to me last time that I didn't give OASA a fair chance. In order to make it up, this time round I've made OASA the star attraction. The coordinates generated by the three open source toolkits are all depicted by OASA</li></ol>baoilleachhttp://www.blogger.com/profile/03288289351940689018noreply@blogger.comtag:blogger.com,1999:blog-7844526396210378482.post-86191253627344130972008-05-12T00:00:00.000-07:002008-05-12T01:49:39.671-07:00RSS feeds for chemistry projects on SourceForge<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp0.blogger.com/_x5Hz3F0jd4Q/SCfvP8pYi0I/AAAAAAAAAgA/g6m3D8ObaRQ/s1600-h/rect20.png"><img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;" src="http://bp0.blogger.com/_x5Hz3F0jd4Q/SCfvP8pYi0I/AAAAAAAAAgA/g6m3D8ObaRQ/s400/rect20.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5199387352079305538" /></a>Wouldn't it be nice to know whether any of your favourite chemistry projects has released a new version? Or to keep abreast of the latest registrations of chemistry projects on SourceForge? No? I guess it's just me then.<br /><br />Anyway, here are some RSS feeds I've thrown together to allow me to do just that:<br /><ul><li>latest chemistry releases <a href="http://publishious.org/site_media/RSS/latestreleases.rss"><img src="http://www.feedburner.com/fb/images/pub/feed-icon16x16.png" border="0" /></a></li><li>registrations of chemistry projects: latest <a href="http://publishious.org/site_media/RSS/newregistrations.rss"><img src="http://www.feedburner.com/fb/images/pub/feed-icon16x16.png" /></a> and two months old <a href="http://publishious.org/site_media/RSS/oldregistrations.rss"><img src="http://www.feedburner.com/fb/images/pub/feed-icon16x16.png" /></a></li></ul><br />Note: the two-months-old RSS feed is better if you want to find some actual code or a working website when you click through.<br /><br />Want to do the same for another software category? Here's the code (requires <a href="http://crummy.com/software/BeautifulSoup">BeautifulSoup</a> and <a href="http://www.dalkescientific.com/Python/PyRSS2Gen.html">PyRSS2Gen</a>):<pre><br /><font color="#a020f0">import</font> datetime<br /><font color="#a020f0">import</font> urllib<br /><br /><font color="#a020f0">from</font> BeautifulSoup <font color="#a020f0">import</font> BeautifulSoup<br /><font color="#a020f0">import</font> PyRSS2Gen<br /><br /><font color="#804040"><b>def</b></font> <font color="#008080">download</font>():<br /> urls = [&quot;<font color="#ff00ff"><a href="http://sourceforge.net/search/index.php?words=trove%3A%28384%29">http://sourceforge.net/search/index.php?words=trove%3A%28384%29</a></font>&quot;<br /> &quot;<font color="#ff00ff">&amp;sort=latest_file_date&amp;sortdir=desc&amp;offset=0&amp;limit=100&amp;</font>&quot;<br /> &quot;<font color="#ff00ff">type_of_search=soft&amp;pmode=0</font>&quot;,<br /> &quot;<font color="#ff00ff"><a href="http://sourceforge.net/search/index.php?words=trove%3A%28384%29">http://sourceforge.net/search/index.php?words=trove%3A%28384%29</a></font>&quot;<br /> &quot;<font color="#ff00ff">&amp;sort=registration_date&amp;sortdir=desc&amp;offset=0&amp;limit=100&amp;</font>&quot;<br /> &quot;<font color="#ff00ff">type_of_search=soft&amp;pmode=0</font>&quot;]<br /> urllib.urlretrieve(urls[0], &quot;<font color="#ff00ff">releases.html</font>&quot;)<br /> urllib.urlretrieve(urls[1], &quot;<font color="#ff00ff">registrations.html</font>&quot;)<br /><br /><font color="#804040"><b>def</b></font> <font color="#008080">converttodate</font>(text):<br /> <font color="#804040"><b>if</b></font> text==&quot;<font color="#ff00ff">(none)</font>&quot;:<br /> <font color="#804040"><b>return</b></font> None<br /> t = map(int, text.split(&quot;<font color="#ff00ff">-</font>&quot;))<br /><br /> <font color="#804040"><b>return</b></font> datetime.datetime(*t)<br /><br /><font color="#804040"><b>def</b></font> <font color="#008080">makerss</font>(filename, items, sortby, title):<br /> rss = PyRSS2Gen.RSS2(<br /> title = title,<br /> link = &quot;<font color="#ff00ff"><a href="http://baoilleach.blogspot.com/2008/05/rss-feeds-for-chemistry-projects-on.html">http://baoilleach.blogspot.com/2008/05/rss-feeds-for-chemistry-projects-on.html</a></font>&quot;,<br /> description = &quot;<font color="#ff00ff">baoilleach's RSS feed of </font>&quot;<br /> &quot;<font color="#ff00ff">Chemistry projects on SourceForge</font>&quot;,<br /><br /> lastBuildDate = datetime.datetime.now(),<br /><br /> items = [<br /> PyRSS2Gen.RSSItem(<br /> title = item[&quot;<font color="#ff00ff">title</font>&quot;],<br /> link = item[&quot;<font color="#ff00ff">link</font>&quot;],<br /> description = item[&quot;<font color="#ff00ff">description</font>&quot;],<br /> guid = PyRSS2Gen.Guid(&quot;<font color="#ff00ff">%s %s</font>&quot; % (item[&quot;<font color="#ff00ff">title</font>&quot;], item['<font color="#ff00ff">lastrelease</font>'])),<br /> pubDate = item[sortby])<br /> <font color="#804040"><b>for</b></font> item <font color="#804040"><b>in</b></font> items]<br /> )<br /><br /> rss.write_xml(open(filename, &quot;<font color="#ff00ff">w</font>&quot;))<br /><br /><font color="#804040"><b>def</b></font> <font color="#008080">analyse</font>(project):<br /> ans = {}<br /> ans['<font color="#ff00ff">title</font>'] = project.a.string<br /> ans['<font color="#ff00ff">link</font>'] = &quot;<font color="#ff00ff"><a href="http://sf.net">http://sf.net</a></font>&quot; + project.a['<font color="#ff00ff">href</font>']<br /><br /> data = project.parent.parent<br /> ans['<font color="#ff00ff">lastrelease</font>'] = converttodate(data('<font color="#ff00ff">td</font>')[5].string.strip())<br /><br /> ans['<font color="#ff00ff">registered</font>'] = converttodate(data('<font color="#ff00ff">td</font>')[4].string.strip())<br /><br /> data = project.parent.parent.findNextSibling()<br /> ans['<font color="#ff00ff">description</font>'] = data.td.contents[0].strip()<br /> <font color="#804040"><b>if</b></font> <font color="#804040"><b>not</b></font> ans['<font color="#ff00ff">description</font>']:<br /> ans['<font color="#ff00ff">description</font>'] = data.td.contents[2].strip()<br /><br /> <font color="#804040"><b>return</b></font> ans<br /><br /><font color="#804040"><b>def</b></font> <font color="#008080">processfile</font>(filename):<br /> html = open(filename, &quot;<font color="#ff00ff">r</font>&quot;).read()<br /> soup = BeautifulSoup(html)<br /> projects = soup.findAll(<font color="#804040"><b>lambda</b></font> tag: tag.name==&quot;<font color="#ff00ff">h3</font>&quot; <font color="#804040"><b>and</b></font> tag.a<br /> <font color="#804040"><b>and</b></font> tag.a['<font color="#ff00ff">href</font>'].startswith(&quot;<font color="#ff00ff">/projects/</font>&quot;))<br /><br /> data = [analyse(project) <font color="#804040"><b>for</b></font> project <font color="#804040"><b>in</b></font> projects]<br /> <font color="#804040"><b>return</b></font> data<br /><br /><font color="#804040"><b>if</b></font> __name__==&quot;<font color="#ff00ff">__main__</font>&quot;:<br /> download()<br /><br /> data = processfile(&quot;<font color="#ff00ff">registrations.html</font>&quot;)<br /> sometimeago = datetime.datetime.now() - datetime.timedelta(days=60)<br /> olddata = [d <font color="#804040"><b>for</b></font> d <font color="#804040"><b>in</b></font> data <font color="#804040"><b>if</b></font> d['<font color="#ff00ff">registered</font>'] &lt;= sometimeago]<br /> makerss(&quot;<font color="#ff00ff">oldregistrations.rss</font>&quot;, olddata, &quot;<font color="#ff00ff">registered</font>&quot;, &quot;<font color="#ff00ff">Registrations 60 days ago on SF</font>&quot;)<br /> makerss(&quot;<font color="#ff00ff">newregistrations.rss</font>&quot;, data, &quot;<font color="#ff00ff">registered</font>&quot;, &quot;<font color="#ff00ff">Latest registrations on SF</font>&quot;)<br /><br /> data = processfile(&quot;<font color="#ff00ff">releases.html</font>&quot;)<br /> makerss(&quot;<font color="#ff00ff">latestreleases.rss</font>&quot;, data, &quot;<font color="#ff00ff">lastrelease</font>&quot;, &quot;<font color="#ff00ff">Latest releases on SF</font>&quot;)<br /><br /></pre>baoilleachhttp://www.blogger.com/profile/03288289351940689018noreply@blogger.comtag:blogger.com,1999:blog-7844526396210378482.post-27082572171642069432008-05-08T07:04:00.001-07:002008-05-10T05:25:00.349-07:00Review of "Gnuplot in Action"<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.manning.com/janert"><img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 150px;" src="http://www.manning.com/janert/janert_cover150.jpg" border="0" alt="" /></a>"Famous scientific plotting package". This succinct yet accurate description of <a href="http://gnuplot.info">gnuplot</a> appears on the gnuplot <a href="http://www.sf.net/projects/gnuplot">SF project page</a>. Despite being one of the most well-known open source scientific projects around, it has taken 15 years for the first book on gnuplot to be published. <a href="http://www.manning.com/janert/">"Gnuplot in Action"</a> by <a href="http://principal-value.com/who-i-am.php">Phillip K. Janert</a> will be published by Manning Publications later this year (<a href="http://www.manning-source.com/books/janert/janert_meapch2.pdf">one chapter</a> available free) and aims to give an overview of how to use gnuplot, describing everything from how to read in data, style graphs, save images to more advanced topics such as macros, using gnuplot for CGI or calling it from scripting languages. <br /><br />In the past I have used gnuplot to get a quick look at data in text files, and also to <a href="http://baoilleach.blogspot.com/2007/06/plot-data-in-3d.html">look at surfaces</a>. I am used to looking up the gnuplot help system to figure things out, but although it's exhaustive, it's also a bit exhausting especially when you are simply trying to figure out whether something is possible or not. The fact that the gnuplot site has a long list of <a href="http://gnuplot.info/faq/faq.html">frequently-asked questions</a> is also a sign of poor documentation. So there's definitely room for a book of this type.<br /><br />Although I was afraid this book would simply be a long list of commands and switches (of which gnuplot has many, to be sure), "Gnuplot in Action" takes a much more interesting approach, illustrating everything with examples and focusing on the most useful options while simply referring the reader to the gnuplot help for more obscure options. The author isn't afraid of sympathising with the user, e.g. "Nevertheless, it is very different from the behavior we have come to expect from user interfaces in most programs" (in reference to how images are saved in gnuplot). Although the author has been using gnuplot for 15 years, he obviously is still aware of what the most annoying features are, and in this case, describes a handy macro to look after the horrible mess of writing an image to disk and resetting the terminal afterwards.<br /><br />Somewhat bizarrely, the blurb for the book is pitched at business users (analysts and "Six Sigma Black Belts") and doesn't refer to scientists at all. But don't worry, the examples in the book contain everything from share price analysis to looking at phase transitions (reflecting the author's background in theoretical physics as well as in business), and in any case, the nature of the examples doesn't affect the tone of the book in any way.<br /><br />So, overall I liked this book. I particularly liked the sidebars and comments on particular techniques. For example, the section on 3d plots includes a short discussion of the pitfalls of using 3d plots and alternative plotting methods that might be more suitable. The section on color, includes a similar warning as well a discussion on palette design (which references <a href="http://www.research.ibm.com/people/l/lloydt/color/color.HTM">Why Should Engineers and Scientists Be Worried About Color?</a>). This turns the book from being a straightforward technical manual to something more of a discussion of techniques. There are actually a couple of chapters still to come (13, 14, 15) which appear to be focused on where/how to use particular graphical analysis techniques.<br /><br />Any negatives? Well, I wasn't very interested in the section on reading data from disk and transforming it - I think that anyone who knows Perl/Python is going to reshape/filter their data before getting it into gnuplot, but admittedly the book claims to be suitable for non-programmers. Also, the author's favorite interactive terminal, wxt, is not available for Windows and so I was a bit disappointed when I fired up gnuplot and couldn't find it. Regarding the connection between gnuplot and Python, the author doesn't mention the widely-used <a href="http://gnuplot-py.sourceforge.net/">Gnuplot.py</a> (although I'm not a big fan of this).<br /><br />But overall, these are minor quibbles. I recommend this book to anyone who wants to make the most of gnuplot, that "famous scientific plotting package".baoilleachhttp://www.blogger.com/profile/03288289351940689018noreply@blogger.comtag:blogger.com,1999:blog-7844526396210378482.post-74762084459692701572008-05-07T23:50:00.000-07:002008-05-09T00:10:48.926-07:00Cheminformatics toolkit face-off - Depiction<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://farm1.static.flickr.com/28/62554955_e310463a86_m.jpg"><img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 225px;" src="http://farm1.static.flickr.com/28/62554955_e310463a86_m.jpg" border="0" alt="" /></a>Now for some pretty pictures as well as some not so pretty. Yes, it's the turn of the structure diagram generators (SDGs) to strut their stuff and throw some shapes. How do they perform for 100 random compounds from PubChem?<br /><br />Here are my results for <a href="http://publishious.org/site_media/blog/depict.html">depiction</a> and <a href="http://publishious.org/site_media/blog/sdg.html">structure diagram generation</a> (Note: I will move these links to my regular hosting in the near future). The images were generated using <a href="http://code.google.com/p/cinfony">cinfony</a>, the code is <a href="http://publishious.org/site_media/blog/sdg.txt">here</a> and the dataset is <a href="http://publishious.org/site_media/blog/dataset.sdf.gz">here</a>.<br /><br />Some comments:<br />(0) Rich Apodaca has written <a href="http://depth-first.com/articles/2008/03/26/five-open-tools-for-2d-structure-layout-aka-structure-diagram-generation">an overview of Open Source SDGs</a>.<br />(1) 2D coordinate generation is independent of depiction. A SDG typically has both parts but coordinates could be generated with one toolkit and depicted with another.<br />(2) Looking good is not the same as chemical accuracy. But looking good is important too! :-)<br />(3) OASA (Obviously Another Stupid Acronym) is a Python SDG which is part of <a href="http://bkchem.zirael.org">BKchem</a> by Beda Kosata. To depict images, it uses <a href="http://www.cairographics.org/pycairo/">Pycairo</a>. OASA currently does not support stereochemistry (e.g. across double bonds) and so the generated coordinates will not be chemically-accurate in some circumstances. This of course does not affect its ability to depict coordinates generated by other toolkits. OASA can depict wedges and bonds but I haven't yet implemented this in cinfony.<br />(4) It is not possible to use the CDK's depiction mechanism natively from CPython using JPype (it's fine from Jython though). That is why I use OASA to depict the CDK's generated coordinates. Technically, this is because JPype doesn't allow subclassing of Java classes (JPanel in this case). It does however, allow Python classes to implement a Java interface. It would be great if the CDK could add a convenience interface to allow this (I can supply more details off-blog).<br />(5) The PubChem images and coordinates are generated by the Cactvs toolkit.<br />(6) There are two sets of RDKit images. Since the current release has quite poor depictions, I decided to also include the results from a development branch, which uses <a href="http://effbot.org/zone/aggdraw-index.htm">Aggdraw</a>.<br />(7) It is important to consider how to handle hydrogens. With OASA, I just drew all the hydrogens. This is probably not a good idea.<br />(8) It is likely that I have not used the best parameters, etc. for generating some of these images. I welcome any suggestions to improve image quality.<br />(9) OASA wasn't able to layout CID 1373132, and so that molecule was not included (I guess I should have just caught the exception and continued). The error message was:<pre>Traceback (most recent call last):<br /> File "cdkjpype.py", line 464, in draw<br /> oasa.cairo_out.cairo_out().mol_to_cairo(mol, filename)<br /> File "bkchem\oasa\oasa\cairo_out.py", line 85, i<br />n mol_to_cairo<br /> self._draw_edge( e)<br /> File "bkchem\oasa\oasa\cairo_out.py", line 121,<br />in _draw_edge<br /> side += reduce( operator.add, [geometry.on_which_<br />side_is_point( <br />start+end, (<br />self.transformer.transform_xy( a.x,a.y))) for a<br /> in ring if a!=v1 and a!=v2])<br /> File "bkchem\oasa\oasa\geometry.py", line 92, in<br /> on_which_side_is_point<br /> b = atan2( y2-y1, x2-x1)<br />ValueError: math domain error</pre><br />(10) PubChem entries with more than 1 connected component were not included in this test. (As a result, the number of molecules shown is actually less than 100.)<br /><br /><i>Image: <a href="http://www.flickr.com/photos/arts/">ARTS</a></i>baoilleachhttp://www.blogger.com/profile/03288289351940689018noreply@blogger.comtag:blogger.com,1999:blog-7844526396210378482.post-53989573849741520132008-05-07T11:12:00.001-07:002008-05-07T11:47:51.567-07:00ANN: cinfony 0.1 - the toolkit of cheminformatics toolkits<div style="height:300px; width: 200px; float:right; overflow:hidden"><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp2.blogger.com/_x5Hz3F0jd4Q/SCH46fz21jI/AAAAAAAAAf4/X8mLGfNZWIE/s1600-h/edited_crop.png"><img style="float:right; margin:0 -100px 10px 10px;cursor:pointer; cursor:hand;" src="http://bp2.blogger.com/_x5Hz3F0jd4Q/SCH46fz21jI/AAAAAAAAAf4/X8mLGfNZWIE/s320/edited_crop.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5197709128817366578" /></a></div><span style="font-weight:bold;">What?</span><br /><span style="font-weight:bold;"><a href="http://code.google.com/p/cinfony/">cinfony</a></span> presents a common Python API to three open source cheminformatics toolkits: <a href="http://openbabel.org">OpenBabel</a>, <a href="http://rdkit.org">RDKit</a> and the <a href="http://cdk.sf.net">CDK</a>. <span style="font-weight:bold;">cinfony</span> makes it easy to carry common cheminformatics tasks, but allows you to access the underlying toolkit objects at any point.<br /><br /><span style="font-weight:bold;">Why?</span><br />(1) In the past, cheminformaticians have chosen a particular toolkit and have only used that toolkit. This is because of the impedance mismatch between APIs of different toolkits, and indeed, the language differences between different toolkits (CDK is Java, OpenBabel and RDKit are C++, for example). <span style="font-weight:bold;">cinfony</span> makes it easy to start using a new toolkit because the API for reading/writing files, calculating fingerprints, creating 2D depictions, etc. is the same no matter whether you are using OpenBabel, RDKit or the CDK.<br /><br />(2) Different toolkits have different capabilities. <span style="font-weight:bold;">cinfony</span> allows you to access all of these capabilities in the same Python script. For example, you can read a molecule using OpenBabel, draw it using RDKit and calculate descriptors using the CDK.<br /><br /><span style="font-weight:bold;">Where?</span><br /><span style="font-weight:bold;">cinfony 0.1</span> is now available for download from <a href="http://code.google.com/p/cinfony/">googlecode</a>.<br /><br /><span style="font-weight:bold;">Who?</span><br />Early adopters. This is pretty much a test release.<br /><br /><span style="font-weight:bold;">Why are there so many dependencies?</span><br />Oops, look at the time. Gotta go. Have fun installing.<br /><br />Image: <a href="http://www.flickr.com/photos/dryicons/">DryIcons</a> (modified)baoilleachhttp://www.blogger.com/profile/03288289351940689018noreply@blogger.comtag:blogger.com,1999:blog-7844526396210378482.post-45955993858114526812008-04-30T18:59:00.000-07:002008-04-30T11:06:01.384-07:00Shortened SMILES for URLS - Why a big SMILES need not mean a long face<a href="http://cheminfo.informatics.indiana.edu/~rguha/index.html">Rajarshi</a> has put together a <a href="http://www.chembiogrid.org/projects/proj_rest.html">RESTful web service</a> that returns a 2D depiction of a molecule (using the <a href="http://cdk.sf.net">CDK</a>) given a SMILES string. However, it turns out that the SMILES has a maximum size of 255 characters (appears to be an OS or mod_python limit). That should be enough, sez you, but for a random sample of 100 PubChem molecules, 9 are greater than the limit (these include explicit hydrogens, I should point out).<br /><br />Between Rajarshi and myself, we worked out a solution: zip the SMILES and then base64 it (so that it can be used in a URL). Here are the lengths of the original SMILES strings, the bzipped2 strings, and finally the base64ed strings:<pre>280 101 136<br />316 113 152<br />432 110 148<br />282 106 144<br />320 92 124<br />372 115 156<br />452 143 192<br />326 96 128<br />268 94 128</pre>In each case, the final string is less than half the size of the original string. The actual strings themselves all begin with the same few letters (don't ask me why) which means that the same RESTful URL can be used to handle both the SMILES strings and their compressed cousins. For example, sildenafil can be viewed at both <a href="http://www.chembiogrid.org/cheminfo/rest/depict/CCOc1ccc(cc1c1nc(=O)c2n(C)nc(CCC)c2[nH]1)S(=O)(=O)N1CCN(C)CC1">this url</a> (containing the SMILES) and <a href="http://www.chembiogrid.org/cheminfo/rest/depict/QlpoOTFBWSZTWft9pMoAABCfAABgMAIIQYgKCAEgAFRAAAEqeVGT9JolPldZe1G13TzgoGkR_HhF7eG75khc0JSRxt4u5IpwoSH2-0mU">this url</a> (containing the encoded form).<br /><br />Notes: (1) Python methods <a href="http://docs.python.org/lib/module-bz2.html">bz2.compress()</a> and <a href="http://docs.python.org/lib/module-base64.html">base64.urlsafe_b64encode()</a> are used. (2) Bzip2 was used instead of gzip or zip simply because it has an easier to use API. (3) A SMILES string needs to have a certain length before the procedure will actually result in any compression. In the sildenafil example, the encoded string is in fact longer than the original (108 vs. 61)baoilleachhttp://www.blogger.com/profile/03288289351940689018noreply@blogger.comtag:blogger.com,1999:blog-7844526396210378482.post-73089432000236312652008-04-30T05:12:00.000-07:002008-04-30T05:14:42.080-07:00Dear LazyWeb - How to write a unittest for when a Python module is unavailable?I have a Python module (<a href="http://code.google.com/p/cinfony/source/browse/trunk/cinfony/pybel.py">cinfony.pybel</a>) which I am <a href="http://code.google.com/p/cinfony/source/browse/trunk/test/testall.py">testing</a> with the standard <a href="http://docs.python.org/lib/module-unittest.html">unittest</a> module. One of the classes in the module has an optional dependency; if the <a href="http://www.pythonware.com/products/pil/">Python Imaging Library</a> (PIL) is available, a particular draw() method will work, otherwise it should raise an ImportError with a helpful message (I am simplifying the situation somewhat but this is the general problem).<br /><br />I have two related questions:<br /><br />(1) Should I place the "import Image" statement for PIL at the start of the draw() method, or at the start of the module (and catch the exception but set a flag)? Personally, I feel that its more Pythonic to keep all of the imports at the start of the module; that is, to be up-front about dependencies, optional or not.<br /><br />(2) How can I test this code in a unittest? That is, how can I call this draw() method with PIL available, and a second time, without PIL available?baoilleachhttp://www.blogger.com/profile/03288289351940689018noreply@blogger.comtag:blogger.com,1999:blog-7844526396210378482.post-56990883781977874422008-04-23T19:46:00.000-07:002008-04-23T14:38:19.988-07:00Cheminformatics toolkit face-off - LogP and TPSA<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://farm3.static.flickr.com/2062/1537122018_40806efa5d_m.jpg"><img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 240px;" src="http://farm3.static.flickr.com/2062/1537122018_40806efa5d_m.jpg" border="0" alt="" /></a><b>Updated 23/04/08 (see below)</b><br /><br />In a <a href="http://baoilleach.blogspot.com/2008/03/pybel-as-generic-api-for.html">recent post</a>, I discussed some thoughts I had about using the Pybel API as a generic API for cheminformatics toolkits. I've started to <a href="http://code.google.com/p/cinfony/">implement this</a> (more on this at a later date) and have begun writing some tests to compare results from different toolkits. To begin with, I was hoping that RDKit (Jan2008), CDK (1.0.2) and OpenBabel (dev, soon-to-be 2.2.0) could agree on LogP and TPSA values calculated with the same group contribution approach.<br /><br />The code is below, but first here are the results:<pre>ALogP<br />=====<br />****Example1*****<br />The calculated value from JCICS, 1999, 39, 868 is 1.400000<br />cinfony.rdk {'MolLogP': 1.4007999999999998,<br /> 'pyMolLogP': 1.4007999999999998}<br />cinfony.pybel {'LogP': 0.80749999999999988}<br /><br /><br /><br />****Example2*****<br />The calculated value from JCICS, 1999, 39, 868 is 2.750000<br />cinfony.rdk {'MolLogP': 2.7486000000000006,<br /> 'pyMolLogP': 2.7486000000000006}<br />cinfony.pybel {'LogP': 1.6415999999999995}<br /><br /><br /><br />****methane*****<br />The calculated value from JCICS, 1999, 39, 868 is 0.144100<br />cinfony.rdk {'MolLogP': 0.6331, 'pyMolLogP': 0.6331}<br />cinfony.pybel {'LogP': 0.14410000000000001}<br /><br /><br /><br />TPSA<br />====<br />****metoprolol*****<br />The calculated value from JMC, 2000, 43, 3714 is 50.700000<br />cinfony.cdk {'tpsa': 50.719999999999999}<br />cinfony.rdk {'TPSA': 50.719999999999999}<br />cinfony.pybel {'TPSA': 50.719999999999999}<br /><br /><br /><br />****nordiazempam*****<br />The calculated value from JMC, 2000, 43, 3714 is 41.500000<br />cinfony.cdk {'tpsa': 41.460000000000001}<br />cinfony.rdk {'TPSA': 41.460000000000001}<br />cinfony.pybel {'TPSA': 41.460000000000001}</pre>The good news is that they all agree on the TPSA of the two examples. The bad news is that CDK 1.0.2 does not calculate ALogP, and OpenBabel doesn't seem to be doing well for the more complicated examples (which are taken from the JCICS paper). It's possible that OpenBabel is using a different parameterisation than the paper I reference. I'm not sure either if there's any difference between the two ALogP values that RDKit calculates. Hopefully, the people that know will comment below.<br /><br /><b>Update (23/04/08):</b> (1) I calculated the LogP of methane incorrectly - it should be 0.63610. (2) RDKit has a transposition error in the value for C in methane (only a minor error). (3) OpenBabel calculates the correct answer if hydrogens are added. I'm going to recommend that the descriptor adds the hydrogens itself, to avoid user errors in future. (<b>Done!</b> Thanks to Chris Morley.) (4) The two ALogP descriptors in RDKit are implementations in two different languages, one in C++, the other in Python. They should be identical. (Thanks to Greg Landrum of RDKit for identifying several of these issues)<br /><br />Here's the code...<pre><br /><font color="#a020f0">from</font> cinfony <font color="#a020f0">import</font> cdk, rdk, pybel<br /><br /><br />descs = [(&quot;<font color="#ff00ff">ALogP</font>&quot;, &quot;<font color="#ff00ff">JCICS, 1999, 39, 868</font>&quot;),<br /> (&quot;<font color="#ff00ff">TPSA</font>&quot;, &quot;<font color="#ff00ff">JMC, 2000, 43, 3714</font>&quot;)]<br />toolkits = [(cdk, {'<font color="#ff00ff">ALogP</font>': None, '<font color="#ff00ff">TPSA</font>': ['<font color="#ff00ff">tpsa</font>']}),<br /> (rdk, {'<font color="#ff00ff">ALogP</font>': ['<font color="#ff00ff">MolLogP</font>', '<font color="#ff00ff">pyMolLogP</font>'],<br /> '<font color="#ff00ff">TPSA</font>': ['<font color="#ff00ff">TPSA</font>']}),<br /> (pybel, {'<font color="#ff00ff">ALogP</font>': ['<font color="#ff00ff">LogP</font>'], '<font color="#ff00ff">TPSA</font>': ['<font color="#ff00ff">TPSA</font>']})]<br />testcases = {'<font color="#ff00ff">ALogP</font>':<br /> {&quot;<font color="#ff00ff">methane</font>&quot;: (0.1441, '<font color="#ff00ff">C</font>'),<br /> &quot;<font color="#ff00ff">Example1</font>&quot;: (1.40, '<font color="#ff00ff">c1(O)ccccc1(OC)</font>'),<br /> &quot;<font color="#ff00ff">Example2</font>&quot;: (2.75, '<font color="#ff00ff">c1ccccc1c1ccccn1</font>')},<br /> '<font color="#ff00ff">TPSA</font>':<br /> {&quot;<font color="#ff00ff">metoprolol</font>&quot;: (50.7,<br /> &quot;<font color="#ff00ff">c1(OCC(O)CNC(C)C)ccc(CCOC)cc1</font>&quot;),<br /> &quot;<font color="#ff00ff">nordiazempam</font>&quot;: (41.5,<br /> &quot;<font color="#ff00ff">c1c(Cl)ccc2NC(=O)CN=C(c3ccccc3)c12</font>&quot;)}}<br /><br /><font color="#804040"><b>for</b></font> desc, ref <font color="#804040"><b>in</b></font> descs:<br /> <font color="#804040"><b>print</b></font> desc + &quot;<font color="#6a5acd">\n</font>&quot; + &quot;<font color="#ff00ff">=</font>&quot;*len(desc)<br /> <font color="#804040"><b>for</b></font> k, v <font color="#804040"><b>in</b></font> testcases[desc].iteritems():<br /> <font color="#804040"><b>print</b></font> &quot;<font color="#ff00ff">****%s*****</font>&quot; % k<br /> <font color="#804040"><b>print</b></font> &quot;<font color="#ff00ff">The calculated value from %s is %f</font>&quot; % (ref, v[0])<br /> <font color="#804040"><b>for</b></font> toolkit <font color="#804040"><b>in</b></font> toolkits:<br /> <font color="#804040"><b>if</b></font> toolkit[1][desc]:<br /> mol = toolkit[0].readstring(&quot;<font color="#ff00ff">smi</font>&quot;, v[1])<br /> <font color="#804040"><b>print</b></font> toolkit[0].__name__,<br /> <font color="#804040"><b>print</b></font> mol.calcdesc(toolkit[1][desc])<br /> <font color="#804040"><b>print</b></font> &quot;<font color="#6a5acd">\n\n</font>&quot;<br /></pre>Image credit: <a href="http://www.flickr.com/photos/pfly/">pfly</a>baoilleachhttp://www.blogger.com/profile/03288289351940689018noreply@blogger.comtag:blogger.com,1999:blog-7844526396210378482.post-79571881872731490932008-04-22T08:19:00.000-07:002008-04-22T08:39:42.719-07:00Cheminformatics toolkit face-off - Molecular weight<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://farm3.static.flickr.com/2387/1579864633_61a9ae2925_m.jpg"><img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 240px;" src="http://farm3.static.flickr.com/2387/1579864633_61a9ae2925_m.jpg" border="0" alt="" /></a>Next up in this series of toolkit comparisons is the molecular weight. Here, <a href="http://cinfony.googlecode.com/">cinfony</a> is going to support two ideas:<br /><ul><li>molwt: the standard molar mass given by IUPAC atomic masses (amu). This is calculated using OpenBabel's <a href="http://openbabel.org/api/2.1.0/classOpenBabel_1_1OBMol.shtml#065d867e0f04048bc005cd306f5e1759">OBMol::GetMolWt()</a>, CDK's <a href="http://cheminfo.informatics.indiana.edu/~rguha/code/java/nightly-1.0.x/api/org/openscience/cdk/tools/MFAnalyser.html#getCanonicalMass()">MFAnalyser.getCanonicalMass</a>, and the MolWt descriptor in RDKit.</li><li>exactmass: the mass given by the most abundant isotope. This is available through <a href="http://openbabel.org/api/2.1.0/classOpenBabel_1_1OBMol.shtml#b2945776a4f01d50e2934b4742f484eb">OBMol::GetExactMass</a>(), CDK's <a href="http://cheminfo.informatics.indiana.edu/~rguha/code/java/nightly-1.0.x/api/org/openscience/cdk/tools/MFAnalyser.html#getMass()">MFAnalyser.getMass</a>, and is not available in RDKit.</li></ul>The gold standard is the <a href="http://bodr.svn.sourceforge.net/viewvc/*checkout*/bodr/trunk/bodr/elements/elements.xml?revision=34&amp;content-type=text%2Fplain">element data</a> in the <a href="http://bodr.sf.net/">Blue Obelisk Data Repository</a> (BODR), which was created for just this purpose.<br /><br />The following are the results for a hydrogen atom and a carbon atom according to the BODR and each of the toolkits. The first figure in parentheses is the molwt, then the exactmass.<pre>BODR: (1.00794, 1.007823032)<br /> (12.0107, 12)<br /><br />Pybel: (1.00794, 1.007825032)<br /> (12.0107, 12.0)<br /><br />RDKit: (1.008)<br /> (12.011)<br /><br />CDK: (1.0079400539398193, 2.0156500339508057)<br /> (12.010700225830078, 16.031299591064453)</pre><br />OpenBabel is the only toolkit that gets everything right. RDKit is doing okay, but should consider using BODR data in future. The CDK presents the most intriguing results. I'm not sure whether the Python/Java interface has introduced noise into the molwt values, but they are exactly in agreement with the BODR up to the 7th decimal place or so, after which something goes weird. On the other hand, it's quite clear that the CDK's exactmass is simply not behaving as advertised. It is using Deuterium for the hydrogen and C-16 for the carbon. (I note that MFAnalyser has already been replaced in the CDK development code.)<br /><br />Image credit: <a href="http://www.flickr.com/photos/bugmonkey/">bugmonkey</a>baoilleachhttp://www.blogger.com/profile/03288289351940689018noreply@blogger.comtag:blogger.com,1999:blog-7844526396210378482.post-13675499053191128332008-04-03T01:32:00.001-07:002008-04-03T01:48:46.618-07:00Noel O'Blog celebrates one year at ACS New OrleansTo celebrate the first anniversary of Noel O'Blog, I'm going to be attending the ACS National Meeting in New Orleans, so say hi if you see me. I'll mostly be hanging around the CINF and COMP sessions, or at the CCDC (<a href="http://www.ccdc.cam.ac.uk">Cambridge Crystallographic Data Centre</a>) stand in the Exposition Hall, and for sure, at the CINF reception on Tuesday evening. <br /><br />If you're tired of hearing talks about how docking doesn't work, etc., etc., come along early on Monday and I'll explain one way to improve a scoring function (in the docking software <a href="http://www.ccdc.cam.ac.uk/products/life_sciences/gold/">GOLD)</a> to pick out those pesky actives. Any other ideas people might have would be very welcome.baoilleachhttp://www.blogger.com/profile/03288289351940689018noreply@blogger.comtag:blogger.com,1999:blog-7844526396210378482.post-24773821463511304092008-03-28T10:40:00.000-07:002008-03-28T17:58:46.229-07:00Pybel as a generic API for cheminformatics libraries - proof of concept using CDKI'm very interested in interoperability of open source chemistry codes. Following a <a href="http://baoilleach.blogspot.com/2008/03/python-scripting-language-of-chemistry.html#comment-8138422585875625699">comment by Egon</a> on a recent <a href="http://baoilleach.blogspot.com/2008/03/python-scripting-language-of-chemistry.html">post</a> of mine, I started wondering whether the <a href="http://openbabel.org/pybel.html">Pybel API</a> could used with other cheminformatics libraries as a backend.<br /><br />The advantage of this for the user would be (a) to reduce the learning curve - if you know how to use Pybel, you can access any of several different cheminformatics libraries with the same syntax, (b) the same scripts could be used to carry out a particular analysis using different cheminformatics libraries - different libraries may have different fingerprints, descriptors or implementations of particular algorithms (this is of course also useful for cross-checking the results of different programs) and (c) help reduce the divide between different cheminformatics toolkits (interoperability!!).<br /><br />The rationale behind Pybel (described in the <a href="http://dx.doi.org/10.1186/1752-153X-2-5">paper</a>) lends itself to this use. Pybel doesn't attempt to wrap all the functionality of OpenBabel, but only the most common tasks in cheminformatics. For advanced options, or additional functionality, you can go behind the scenes and access OpenBabel directly. As a result, I propose that the Pybel API represents a generic API (one of many possible, of course) for accessing any cheminformatics library.<br /><br />To test this, I have created <b>CDKabel</b>, a proof of concept which shows that the <a href="http://cdk.sf.net">Chemistry Development Kit (CDK)</a> can be accessed using Pybel syntax through <a href="http://jython.org">Jython</a>. CDKabel does not yet pass all of the Pybel tests, but there's enough to show that the approach has some merit. Compare the following: here's some Python code using Pybel and OpenBabel:<pre>C:\Documents and Settings\oboyle>python25<br />Python 2.5 (r25:51908, Sep 19 2006, 09:52:17) [MSC v.1310 32<br />bit (Intel)] on win32<br />Type "help", "copyright", "credits" or "license" for more inf<br />ormation.<br />>>> from pybel import *<br />>>> for mol in readfile("sdf", "head.sdf"):<br />... print "Molecule has molwt of %.2f and %d atoms" %<br /> (mol.molwt, len(mol.atoms))<br />...<br />Molecule has molwt of 122.12 and 15 atoms<br />Molecule has molwt of 332.49 and 28 atoms<br />>>></pre>Now here's some Jython code with CDKabel and CDK:<pre>D:\Tools\CDK>set CLASSPATH=cdk-1.0.2.jar<br /><br />D:\Tools\CDK>..\jython2.2.1\jython<br />Jython 2.2.1 on java1.6.0_05<br />Type "copyright", "credits" or "license" for more informa<br />tion.<br />>>> from cdkabel import *<br />>>> for mol in readfile("sdf", "head.sdf"):<br />... print "Molecule has molwt of %.2f and %d atoms" %<br /> (mol.molwt, len(mol.atoms))<br />...<br />Molecule has molwt of 122.04 and 15 atoms<br />Molecule has molwt of 331.96 and 28 atoms<br />>>></pre>Well, at least they agree on the number of atoms :-) (It's my fault - CDK has like, ten different ways of calculating the molecular mass, and I just chose randomly :-) )<br /><br />I've only spent a few minutes throwing CDKabel together, so it doesn't do much beyond the example shown. However, if interested, you can <a href="http://www.redbrick.dcu.ie/~noel/blog/cdkabel/cdkabel.txt">download it</a> and try it for yourself.<br /><br />I'd appreciate comments on the idea that there is a core Python API that could be usefully applied to several cheminformatics libraries. Would anyone use CDKabel if it were available?baoilleachhttp://www.blogger.com/profile/03288289351940689018noreply@blogger.comtag:blogger.com,1999:blog-7844526396210378482.post-35551743626352682602008-03-25T20:00:00.000-07:002008-03-25T15:27:38.524-07:00MadMol - Chemistry-aware codeThere are definitely better things you can do with the <a href="http://openbabel.org/wiki/Tutorial:Fingerprints">functional group fingerprint (FP4)</a> of <a href="http://openbabel.org">OpenBabel</a>, but instead, I've created MadMol. Basically, give MadMol a <a href="http://chemical-quantum-images.blogspot.com/2008/03/simplified-molecular-input-line-entry.html">SMILES</a> string such as CC(=O)CCl and it will tell you all you ever wanted to know about that molecule:<blockquote>That there's a Primary_carbon. Isn't that a Alkylchloride? Wake up and smell the Ketone, cause that's what it is. Most people wouldn't realise this is a C_ONS_bond. It's a 1,3-Tautomerizable (or your money back). I don't believe it, it's a Rotatable_bond! Wake up and smell the CH-acidic, cause that's what it is.</blockquote><br />And here's the code (requires <a href="http://openbabel.org/wiki/Python">Pybel</a>, and the file <a href="http://openbabel.svn.sourceforge.net/svnroot/openbabel/openbabel/trunk/data/SMARTS_InteLigand.txt">SMARTS_InteLigand.txt</a>):<pre><br /><span style="color:#a020f0;">import</span> sys<br /><span style="color:#a020f0;">import</span> random<br /><br /><span style="color:#a020f0;">import</span> pybel<br /><br /><span style="color:#804040;"><b>def</b></span> <span style="color:#008080;">readsmartsfile</span>(filename="<span style="color:#ff00ff;">SMARTS_InteLigand.txt</span>"):<br /> patterns = []<br /> inputfile = open(filename, "<span style="color:#ff00ff;">r</span>")<br /> <span style="color:#804040;"><b>for</b></span> line <span style="color:#804040;"><b>in</b></span> inputfile:<br /> line = line.strip()<br /> <span style="color:#804040;"><b>if</b></span> line <span style="color:#804040;"><b>and</b></span> line[0]!="<span style="color:#ff00ff;">#</span>":<br /> colon = line.find("<span style="color:#ff00ff;">:</span>")<br /> name = line[:colon]<br /> smarts = line[colon+1:].strip()<br /> patterns.append([pybel.Smarts(smarts), name])<br /> <span style="color:#804040;"><b>return</b></span> patterns<br /><br />phrases = ["<span style="color:#ff00ff;">I don't believe it, it's a %s!</span>",<br /> "<span style="color:#ff00ff;">Isn't that a %s?</span>",<br /> "<span style="color:#ff00ff;">It's a whadyamacallit, a %s.</span>",<br /> "<span style="color:#ff00ff;">Looks like a %s to me.</span>",<br /> "<span style="color:#ff00ff;">That there's a %s.</span>",<br /> "<span style="color:#ff00ff;">Most people wouldn't realise this is a %s.</span>",<br /> "<span style="color:#ff00ff;">It's a %s (or your money back).</span>",<br /> "<span style="color:#ff00ff;">Wow, a %s. Last time I saw one of these, I hit the</span>"<br /> "<span style="color:#ff00ff;">fire alarm and ran.</span>",<br /> "<span style="color:#ff00ff;">Could be a %s...yes, I'm sure of it.</span>",<br /> "<span style="color:#ff00ff;">It's a %s if I've ever seen one.</span>",<br /> "<span style="color:#ff00ff;">Wake up and smell the %s, cause that's what it is.</span>",<br /> "<span style="color:#ff00ff;">It's a %s. I wish I had one.</span>",<br /> "<span style="color:#ff00ff;">You've hit the jackpot, you and your %s!</span>",<br /> "<span style="color:#ff00ff;">A %s. You know, back in the day, we used to have</span>"<br /> "<span style="color:#ff00ff;">fun with these.</span>",<br /> "<span style="color:#ff00ff;">It takes me back years, this %s does.</span>"]<br /><br /><span style="color:#804040;"><b>if</b></span> __name__=="<span style="color:#ff00ff;">__main__</span>":<br /> <span style="color:#804040;"><b>if</b></span> <span style="color:#804040;"><b>not</b></span> len(sys.argv)==2:<br /> sys.exit("<span style="color:#ff00ff;">You need a SMILES string like CC(=O)CCl</span>")<br /> molecule = py