2D Searching Tutorial

By G. Schaftenaar



2D searching is used to find compounds in a database that share molecular features with one or more known active molecules. Each compound is assigned a 2D fingerprint: a set of bits in which each bit indicates the presence or absence of a particular molecular feature. The similarity of two compounds is often measured by the Tanimoto coefficient of their fingerprints. Below you will find an example of two compounds and their fingerprints and the calculation of the Tanimoto coefficient:
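For reference, the Tanimoto coefficient of two fingerprints A and B is commonly defined as T = c / (a + b - c), where a is the number of bits set in A, b the number of bits set in B, and c the number of bits set in both. The following is a minimal Python sketch of that calculation. The 8-bit fingerprints are made up for illustration and do not correspond to the compounds in this tutorial; returning 0.0 for two all-zero fingerprints is a convention assumed here.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two equal-length bit strings such as '1011'."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    a = fp_a.count("1")  # bits set in A
    b = fp_b.count("1")  # bits set in B
    # bits set in both A and B
    c = sum(1 for x, y in zip(fp_a, fp_b) if x == "1" and y == "1")
    if a + b - c == 0:
        return 0.0  # assumed convention for two all-zero fingerprints
    return c / (a + b - c)


# Made-up fingerprints, for illustration only
fp_1 = "10110100"  # a = 4
fp_2 = "10010110"  # b = 4, with c = 3 bits in common
print(tanimoto(fp_1, fp_2))  # 3 / (4 + 4 - 3) = 0.6
```

A value of 1 means the two fingerprints are identical, and a value of 0 means they have no set bits in common.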



Below you will find two new compounds. Calculate the Tanimoto coefficient for this pair of molecules.



Below you will find two known active compounds for the estrogen receptor: Raloxifene (1) and Tamoxifen (2). Calculate the Tanimoto coefficient for this pair of molecules. Are these two compounds very similar?




Set up the working environment

From the Unix shell (command prompt):




How many compounds in the database of 500 compounds (485 randomly selected, 15 SERMs (Selective Estrogen Receptor Modulators)) are similar to tamoxifen, a known active for the ERα receptor?

Read in the database of 500 compounds:

Next we have to calculate fingerprints for the molecules in the database:

Now close this window
and let's do the 2D similarity search:

Open the Query file: tamoxifen.mol2
and set the overlap (Similarity) to 0.75:
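Conceptually, a similarity search compares the query fingerprint with every fingerprint in the database and keeps the compounds whose Tanimoto coefficient reaches the cutoff. The Python sketch below illustrates that idea only; it is not the program's actual implementation. The compound names, the bit positions, and the use of "greater than or equal to" for the cutoff are all assumptions made for the example.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two sets of 'on' bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_search(query, database, cutoff):
    """Return (name, score) pairs with score >= cutoff, best match first."""
    scored = [(name, tanimoto(query, fp)) for name, fp in database.items()]
    hits = [(name, score) for name, score in scored if score >= cutoff]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical toy data: names and bit positions are invented
query = {1, 4, 7, 9}
database = {
    "cmpd_A": {1, 4, 7, 9},   # identical to the query, score 1.0
    "cmpd_B": {1, 4, 7, 12},  # 3 shared bits out of 5 in total, score 0.6
    "cmpd_C": {2, 3, 5},      # no shared bits, score 0.0
}

for cutoff in (0.75, 0.55):
    names = [name for name, _ in similarity_search(query, database, cutoff)]
    print(cutoff, names)
```

Lowering the cutoff from 0.75 to 0.55 admits cmpd_B in addition to cmpd_A, which mirrors how a lower similarity setting returns more hits.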

Now close this search results window.
The results of the search have been written to the file res.sdf. Now let us open this file and look at the results.


To navigate through the database, use the Up and Down arrow keys.
To view a hit, left-click the compound icon field and click OK. You will find 4 SERMs as hits.




Now repeat the same procedure but change the Overlap to 55. You will now find 7 compounds that match these similarity criteria.
Have a look at these compounds too (see above). How many of these hits are known drugs, and how many are likely to be false positives? (The answers are 7 and 0, respectively.)

If you repeat the same procedure with Overlap set to 48, you will find almost all of the known drugs in the database (14 of the 15), along with twelve false positives. In this case the optimal similarity cutoff is apparently 48.
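One way to compare the three cutoffs is to compute precision, recall, and the F1 score from the hit counts reported above. The sketch below does this in Python. Treating the 4 hits at the 0.75 setting as having no false positives is an assumption, and F1 is only one possible criterion for "optimal"; the tutorial does not say which criterion it has in mind.

```python
TOTAL_ACTIVES = 15  # SERMs in the 500-compound database

# cutoff setting -> (true positives, false positives), as reported in the text;
# zero false positives at the 0.75 setting is an assumption
results = {"0.75": (4, 0), "55": (7, 0), "48": (14, 12)}


def metrics(tp, fp, total_actives=TOTAL_ACTIVES):
    """Return (precision, recall, F1) for a given set of search hits."""
    precision = tp / (tp + fp)      # fraction of hits that are actives
    recall = tp / total_actives     # fraction of actives that were found
    # F1 = 2*TP / (2*TP + FP + FN), with FN = total_actives - TP
    f1 = 2 * tp / (tp + fp + total_actives)
    return precision, recall, f1


for cutoff, (tp, fp) in results.items():
    precision, recall, f1 = metrics(tp, fp)
    print(f"cutoff {cutoff}: precision {precision:.2f}, "
          f"recall {recall:.2f}, F1 {f1:.2f}")
```

Under these assumptions the setting of 48 gives the highest F1 score, because the gain in recall outweighs the loss in precision. A screen in which false positives are costly could still favor the setting of 55.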