Tuesday, 7 May 2013

NCATS vice-versa compounds, more x-mapping fun

Update 8th May:  See this connecting post from SE's blog

As an unexpected variation on the theme, NCATS, having previously brokered industry compounds to academia for repurposing (see PMID 23159359), are now putting their home-grown drug research candidates out to tender (NCATS/research/tools/preclinical/patents/tech-licensing). My guess is this will generate a bit of social media commentary, but I will restrict this post to the informatics fun of, you guessed it, digging out the structures. As we know, this is scientifically enabling and could join up disparate pieces of data. However, there is a déjà vu element to this, since there was much comment that over 30 of the NCATS industry compounds remain completely blinded (PMID 23159359). While an initial inspection suggests the NCATS folk are pulling the same trick in reverse (i.e. partially blinding structures from US Gov research that could be licensed out to industry), to be fair most of these are open, but they still need a lot of non-trivial chemical extraction and x-mapping work. Why we should need to do this is another matter. The list is below.


With ~30 structures or classes represented (see below), the fun would rapidly wear off if I were to attempt to discern the key structures behind all these links. What I'll do instead is take a didactic approach by working through just a few examples to show any interested parties how they might carry on with DIY structure digging for whatever candidates might interest them most. The bad news is that NCATS have not chosen to expedite follow-up in the simplest way, which would have been PubChem links. The good news is that most of the publications are open-access (by NIH mandate) and they have declared inventor names, plus some patent IDs, to enable tracking back to patents (but I suppose offering direct URL links to the patents would have made it too easy.....).

The first one, E-308-2009 Caspase 1 inhibitor, turned out to be straightforward since chemicalize.org (see PMID 23618056) converted 92 structures from the paper, including all the novel synthetic IUPAC names. The circles thus joined up as per below via the InChIKey Google search.

 
Thus, one of the key cpds in the paper (4) = NCGC00183434 = CID 44620939 = KENKPOUHXLJLEY-QANKJYHBSA-N. There are also a few quirks associated with this. Firstly, while there are two SureChemOpen patent links via SID 155243680, these are citing the compound as prior art. What looks like the primary composition-of-matter filing, US20120294843, has converted many of the image structures from the series but the exact one above may have failed. Note also there is a full set of Caspase 1 PubChem BioAssays (linked to AID 2389). This is not explicitly cited in the paper but the PubMed/Entrez system adds the link. The InChIKey search picks up a "cryptic" CHEMBL1552969 entry in the sense that, because this was a subsumed confirmatory PubChem BioAssay, there is a ChEMBL-to-PubChem pointer via the BioAssay SID but this is not reciprocal. The MeSH annotator picked up one specified substance,
1-(2-((1-(4-amino-3-chlorophenyl)methanoyl)amino)-3,3-dimethylbutanoyl)pyrrolidine-2-carboxylic acid, but as CID 10156704 this turns out to be a reference compound used in the paper that is only code-name mapped as VX-765 via an InChIKey hit to a chemical supplier, not via MeSH.
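For anyone repeating these InChIKey-based joins, it helps to check that a pasted key is well formed and to see which block carries what. The following is a minimal sketch, assuming the standard InChIKey layout (a 14-letter skeleton hash, a second block ending in a standard/non-standard flag and a version letter, then a protonation letter). The function names are illustrative and do not come from any library.

```python
import re

# Assumed layout: SKELETON(14) - LAYERS(8) FLAG VERSION - PROTONATION
INCHIKEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")

# Second-block hash seen on standard keys that carry no stereo or isotope
# information (e.g. the many ...-UHFFFAOYSA-N keys).
EMPTY_LAYER_BLOCK = "UHFFFAOY"


def parse_inchikey(key):
    """Split an InChIKey into its parts, or raise ValueError if malformed."""
    match = INCHIKEY_RE.match(key.strip())
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    skeleton, layers, flag, version, protonation = match.groups()
    return {
        "skeleton": skeleton,            # connectivity-level hash
        "layers": layers,                # stereo, isotopes and other finer detail
        "standard": flag == "S",
        "version": version,
        "protonation": protonation,
        "has_extra_layers": layers != EMPTY_LAYER_BLOCK,
    }


def same_skeleton(key_a, key_b):
    """True when two keys share the first block (e.g. stereo or isotope variants)."""
    return parse_inchikey(key_a)["skeleton"] == parse_inchikey(key_b)["skeleton"]
```

Searching on the first block alone is a quick way to pull stereo or isotopic variants of the same skeleton into one hit list.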

Moving on down, I drew a blank with E-276-2011, not being able to relate the inventor patent name hits to the abstract details. Next up, E-120-2010 was OK because the NCGC codes used in the two papers (PMID 20451379 and PMID 20017496, also picked up by ChEMBL) are Google- and PubChem-positive. As an example, one of the pyruvate kinase M2 activators, NCGC00185916, is mapped to CID 44543605. In turn this maps to useful stuff including US20110195958 via SureChemOpen with over 130 examples. Next up, E-240-2011 Small-Molecule Inhibitors of Human Galactokinase for the Treatment of Galactosemia and Cancers has no publication links but does include a US patent number. We can then go into WO201304319 "galactokinase inhibitors for the treatment and prevention of associated diseases and disorders" with an SAR table of 112 structures. While the reported potencies are low, we can pick up the most active structure below.


And to not only square the circle but also add a quirk here, CID 44640157 matches the assay result in the patent (i.e. was the same data) but potency fell 4-fold on the re-test. This would take some sorting out but at least the data is on deck for inspection by all.



The entry for E-094-2011 Inhibitors of Human Apurinic/apyrimidinic Endonuclease 1 (APE1) as an Anticancer Drug Target is at least open via two publications and a patent number. Nonetheless, there is no direct link into PubChem. We have to paste the IUPAC name for compound "3" out of the supplementary data PDF from PMID 22455312, use OPSIN (in this case) for conversion, and then search with the SMILES to hit CID 3581333. This squares the circle as below.
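The last hop (identifier to CID) can also be scripted rather than pasted into the web form. The sketch below builds PubChem PUG REST lookup URLs of the form `compound/<namespace>/<identifier>/cids/JSON`. The helper names `cid_lookup_url` and `fetch_cids` are hypothetical, the name-to-SMILES conversion is assumed to have been done separately (e.g. in OPSIN), and SMILES that contain slashes may be better submitted by POST than in the URL path.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
NAMESPACES = {"name", "smiles", "inchikey"}


def cid_lookup_url(namespace, identifier):
    """Build a PUG REST URL asking for the CIDs that match one identifier."""
    if namespace not in NAMESPACES:
        raise ValueError(f"namespace must be one of {sorted(NAMESPACES)}")
    # Percent-encode everything so spaces, brackets etc. survive in the path.
    return f"{PUG_REST}/compound/{namespace}/{quote(identifier, safe='')}/cids/JSON"


def fetch_cids(namespace, identifier):
    """Network call: return the CIDs listed in the JSON reply.

    urlopen raises an HTTP error if PubChem answers with an error status.
    """
    with urlopen(cid_lookup_url(namespace, identifier)) as response:
        payload = json.load(response)
    return payload.get("IdentifierList", {}).get("CID", [])


# Example with a code name quoted earlier in this post (no network needed
# just to build the URL):
url = cid_lookup_url("name", "NCGC00185916")
```

Keeping the URL builder separate from the fetch makes it easy to log exactly what was asked, which matters when a mapping later needs to be re-checked.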




We can also dive into the relevant assay results as per below, but note this was also scored as an active antimalarial in the parasite assay!




I may dig out a few more of these but that's enough for this afternoon, other stuff to do. To  conclude:

1)  Much of this is openly accessible data (including the published patents) but many potential interested parties will find linking everything up via chemical structures hard work. Most of the PIs for these projects could do this like rolling off a log by simply listing out InChIs and CIDs in SAR data tables (but please - don't Markush the series).
2) The science coming out of these enterprises is top notch but the PIs should actively engage in connecting the system. At the most basic this should be explicit citations and/or cross-pointers between the papers, patents and BioAssay IDs.
3)  I can't estimate how long it would take me to get lead structures from the whole list (and some are blinded) because this would depend on the desired end-point (e.g. one key example or 100s of patent structures) but it would be in the order of days, and the process also has plenty of room for error.
4)  There are probably a lot of others embarked on the same exercise right now.....
5) Interestingly, I think it was tacit knowledge that MLSCN investigators were filing patents on leads that would later appear in PubChem BioAssay, but some of these entries now make this explicit (this is not a criticism of the practice, merely an observation).

Sunday, 14 April 2013

Joining in with post-ACS new drug curation

Update 17th April

Efforts made to initiate Wikipedia entries have been somewhat zealously flagged for a variety of perceived editing misdemeanours or even blocked. As can be seen from the tweets, we will try to resolve this cordially. I'm all for collaborative expert open editing but not with a stopwatch ticking. I have also been asked by EW to add the primary target gene IDs to facilitate pathway mapping (now done).

********************************************************************************
While I could not make ACS because of the clash with BioIT, I was pleased to be spiritually represented as co-author on two presentations (see pointers over at Collabchem ACS 2013). While I will expand on BioIT participation in the next post, precedence will be given today to the interesting and immediate efforts surrounding the remarkable Liveblogging-first-time-disclosures-of-drug-structures-from-acsnola, particularly since I already engaged in the equivalent exercise last year (see live-chemical-structure-blogging-but-scooped). Not having been at the meeting or initiated the follow-up tweets, this is "after the fact", but I hope my orthogonal observations will add to the open mix. I have focused on the targets, PubChem entries, Google hits and patent links but you can find the clinical trial entries for these code names at clinicaltrials.gov.

The first drug candidate up was collaboratively edited into the ChemSpider (CS 28536138) and Wikipedia entries as BMS-906024 within days. This targets NOTCH1 and NOTCH3 as a treatment for various cancers. Thomson had deposited the structure into PubChem as CID 66550890 last October (InChIKey: AYOUDDAETNMCBW-GSHUGGBRSA-N) but their feed does not add synonyms and the patent would not have had the code name anyway. Intriguingly, they dropped in the three forms shown below.

 
One loses the third stereo resolution but the other is tritiated. It turns out you can whack the patents in SureChemOpen (see below but note the PDF links are not in the Open version) but this structure has not made the PubChem feed yet.


The US20120245151 document (or the WO) is unusual for the inclusion of X-ray crystal structure coordinates (re-constructable to the 3D even?). While it mentions tritium enumeration (not unusual), I can't immediately spot the example chosen by the WPI annotator, but that does not mean it's not specifically exemplified in there somewhere. There are also some IC50s from Notch assays and some cancer cell line data.

The next compound,  LGX818 from Novartis, has been designed to treat melanoma with a V600E  mutation (P15056[600])  in B-RAF kinase (BRAF).  This time Google open has not ranked the Wikipedia page (that has an issue warning just now) anywhere near the top but Google images ranks it 1st as shown below.


Note the original blog image (link at the top of this post)  is ranked 7th.  By searching the InChIKey from ChemSpider  CS 28536139  (CMJCXYNUCSMDBY-ZDUSSCGKSA-N)  we can whack  CID 50922675 that has the three patent source submissions below.


Here again there is a lot of data in WO2011025927 for LGX818, a.k.a. "compound 9".  You can also generate a nice analogue cluster (below) in PubChem from the patent extractions (note some vertical lines are merged stereo versions).


An unusual precedent for this compound, picked up on the first page in Google, is a thread in the Melanoma International Forum from an Australian patient in one of the trials. While the post is at least tokenly anonymised, the comment "So far the side effects I've experienced are weird blurry vision that last for a few hours starting about an hour and a half after taking the LGX818 / MEK162, and the skin on my face is sensitive kind of like feeling wind burnt in the winter. In the 3 days I have already noticed a reduction in the size of my tumors" is not only direct and personal but the (identified) foundation founder has commented in the thread. This may be a prescient sign for the future.

Next up, AZD5423 also has a Wikipedia entry and a ChemSpider CS 28536140 (you can tell these are post-ACS "farm fresh" because of the CS ID series 138-40). It is reported as a glucocorticoid receptor modulator (NR3C1) for COPD and originated in part from a collaboration with Bayer. Via the InChIKey FCNQMDSJHADDFT-WNSKOXEYSA-N we can find CID 24825740 and the three SIDs below.


In this case SCRIPDB links the patent extraction to US20110071194 (as marked up in the CID record) but the patent mapping is complex in this case because of the large patent family and what looks like multiple applications including the structure.

The next compound has a fresh  USAN document for birinapant that you can see below in all its detailed glory (I can't find a date though). 


It had also made it into ChemSpider  but as an older entry CS 27444380.  The InChIKey  (PKWRMUKBEYJEIX-DXXQBUJASA-N) whacks CID 49836020  from whence you can get to US20130012564 where "compound 1"  has some detailed data.  The code name also has a recent publication "The Novel SMAC Mimetic Birinapant Exhibits Potent Activity against Human Melanoma Cells"(PMID: 23403634).  This declares birinapant to be an antagonist of cIAP1  and cIAP2 ( a.k.a.  BIRC2 and BIRC3)

The last one in this set, MGL-3196, is a hormone mimic targeted against the beta thyroid hormone receptor in the liver (THRB). As a name match, this has not made it into Wikipedia, ChemSpider, PubChem, PubMed or Google Scholar. However, it does have an intriguing false positive ranked 2nd in Google images against the other true positives, but only one at 5th with a structure.


In the unlikely event that any readers might have some interest in - er.... dressing up, they can follow up on the link below (it's by no means the only dodgy false positive you'll find with code number searches).


Moving swiftly on,  you can see that OSRA made a good effort on the hand sketch but needed some editing.


After some iterative searching I converged on CID 15981237 that not only has the largest number of SIDs but also the oldest date in this set as 2007-03-19 from Thomson.


As an early entry it was picked up by ChemSpider from Thomson via PubChem but, oddly, the InChIKey (FDBYIYFVSAHJLY-UHFFFAOYSA-N)  only has the PubChem match in Google, not the ChemSpider one.  In this case I'll show the patent matches via the PubChem record (below).

Note the linked numbers are parsed into the records by PubChem via SCRIPDB and you can see the analogue series as the pre-computed 90% Tanimoto similarity set of 24. Last but not least, I have suggested the code name as a ChemSpider update (below) so we'll see when this goes in.
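For readers unfamiliar with the similarity cut-off, the idea can be sketched in a few lines. This is only an illustration of the Tanimoto coefficient on sets of "on" fingerprint bits; the bit sets and compound labels below are made up, and real PubChem neighbouring uses its own substructure fingerprints, which are not reproduced here.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient: shared on-bits divided by all on-bits in either set."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # two empty fingerprints: treat as no similarity
    return len(a & b) / len(union)


def neighbours(query_bits, library, threshold=0.9):
    """Names of library entries whose similarity to the query meets the threshold."""
    return sorted(
        name for name, bits in library.items()
        if tanimoto(query_bits, bits) >= threshold
    )


# Hypothetical fingerprints, purely for illustration.
query = set(range(20))
library = {
    "close_analogue": set(range(19)) | {25},  # 19 shared bits out of 21 in total
    "distant_compound": set(range(10)),       # 10 shared bits out of 20 in total
}
similar = neighbours(query, library)
```

With a 0.9 threshold only very close analogues survive, which is why such a set tends to look like a single patent series.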


What's odd? The absence of ChEMBL hits, implying these four are particularly short on medicinal chemistry SAR publications. What's next? We'll see what the other interested parties suggest. Maybe I'll try my hand at a Wikipedia drugbox and/or try out the new PubChem submission system for synonym edits.

Thursday, 28 March 2013

A minuscule addendum to HeLa data

Given the flurry of interest in the now-you-see-it-now-you-don't HeLa cell genome, it seems a good time to post something that seems like a footnote but, believe it or not, was on my to-blog list as an interesting protein annotation and database versioning issue long before this fuss started after the sequencing paper. This revolves around the topic of chimeras, those things almost designed to confound bioinformaticians, especially when they try to write transcript contiging and mapping rules.

Once upon a time, back on the Ribena ranch, someone went Y2H fishing with a piece of PSEN1 and pulled out a serine protease. This caused a frisson of excitement as being a possible candidate "APP secretase" (and implicit Alzheimer's disease drug target) so the team duly filed a patent, EP0828003 Human serine protease. As specified in the patent, the PSEN1-interacting "prey" section was walked out via a human brain cDNA library to a 323 aa ORF. This was AR095633, which we later corrected to 458 aa and duly deposited in GenBank as AF141305.1. We also knocked off the mouse as AF164513.1 but this never got linked to the PMID, sigh... (see putting-uniprot-records-straight). While it turned out not to be an APP secretase, as a mitochondrial protein it had some interesting biochemistry, so the team eventually wrote a paper, "Characterization of human HtrA2, a novel serine protease involved in the mammalian cellular stress response" (PMID:10971580). Gratifyingly, this has garnered 213 citations since 2000. To keep the story short, it was noticed sometime after we had the partial sequence (I can't remember exactly when but our patent priority calls back to Sept 1996) that someone else had deposited a largely identical clone on the 2nd of Jan 1998, but with differences in the 5' end. This was AF020760.1 (GI:2738914).

The key fact (for this whole post) indicated in the sequence submission record was that this had been cloned from a HeLa cell, designated as a cervical carcinoma cell type. But, by the time both papers had published, the "other" sequence had been updated as AF020760.2 (GI:5870864) and became identical to ours. Their publication from the same year, x-refed in the sequence record, is merged with ours and others in Swiss-Prot as O43464 Serine protease HTRA2, mitochondrial.

Yours truly, having picked up the early clone differences, did a post-genomic BLAST some years later with AF020760.1. This clearly (classically even) shows a chimeric transcript, nicely represented in the graphics below.



The details of this can be followed from running these searches yourself but this is mostly chromosome 2 at the back with a chunk of 17 at the front but also with pieces of short, but 100% matches, from 5 and 8. Repeating the equivalent searches and graphical displays for AF020760.2 provides a useful control (see below)



The implication from the HeLa genome paper "The Genomic and Transcriptomic Landscape of a HeLa Cell Line"  (but PMID on hold ?) is that  this could be a consequence of chromothripsis which is accompanied by extensive chromosomal rearrangement.  This is supported by a BLAST search of the first 800 bases against ESTs (below)


Interesting corollaries to these observations include the following. The first is that none of the 70869 HeLa ESTs are junction-crossing. The second is that all author-revised sequences are updated to the latest versions in the BLAST search indexes, because of the implicit default assumption that older versions are "wrong". This means you cannot "find" AF020760.1 by sequence search after 1999 (or any superseded GI number for that matter). You would thus either have to know it was there or happen to notice the versioning had ticked up to 2. In the light of the new paper it now seems likely that AF020760.1 could be a real HeLa cell gene product and ipso facto even a translated protein. A third corollary is that, given the cell line is heteroploid as well, the authors may have re-cloned a normal version (i.e. identical to our brain clone) from a HeLa cell library. The fourth is the question of the instantiation of this "lost" protein in any databases. Both TrEMBL and UniParc come to mind here. The sequence is UniProt -ve (i.e. the ORF from the deprecated mRNA-division version is absent), possibly because it never got into TrEMBL in the first place. However, there is a 100% full-length protein hit in UniParc (below).
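Since a sequence search will not surface a superseded record, the practical route is to ask for it explicitly by accession.version. The sketch below just splits a versioned accession and builds an NCBI E-utilities `efetch` URL for it. The helper names are hypothetical, and whether a given old version is still served is an assumption to check for each record.

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def split_accession(acc_version):
    """'AF020760.2' -> ('AF020760', 2); a bare accession gives version None."""
    accession, _, version = acc_version.partition(".")
    return accession, (int(version) if version else None)


def efetch_fasta_url(acc_version):
    """URL requesting one nucleotide record, by accession.version, as FASTA text."""
    params = {
        "db": "nuccore",
        "id": acc_version,
        "rettype": "fasta",
        "retmode": "text",
    }
    return f"{EFETCH}?{urlencode(params)}"


def earlier_versions(acc_version):
    """List the accession.version strings that precede the given one."""
    accession, version = split_accession(acc_version)
    if version is None:
        return []
    return [f"{accession}.{n}" for n in range(1, version)]


# Noticing the '.2' is the cue that a '.1' once existed.
older = earlier_versions("AF020760.2")
```

Comparing the FASTA for each version side by side is then enough to spot a 5' end difference of the kind described here.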


Keeping the story short this was extracted from a 1998 academic patent, US6489136 Cell proliferation related genes,  filed with the first author on the AF020760 cloning paper as inventor.

So what happens now? We'll see. I'd certainly like to be able to search a full genomic sequence somewhere out there so I can check if this possibly gives a single locus match on the rearranged chromosome pieces. It would also be biochemically significant to verify protein translation in cellulo, especially since the mitochondrial targeting section is lost. MS tags would be one route to this, but the only candidate for a junction-crossing tryptic peptide is on the small side (upper sequence below). It should still be possible to select this in the first quad for MS-MS confirmation against the wild type (lower sequence).

CKVYITGGRGAGWSLRAWR
 MAAPRAGRGAGWSLRAWR
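To see which tryptic peptides would distinguish the two, the excerpts above can be digested in silico. This is a minimal sketch assuming the simple trypsin rule (cleave after K or R unless the next residue is P) with no missed cleavages. The function name is illustrative, and because these are only excerpts the first piece of each is truncated at its N-terminal end.

```python
def tryptic_digest(sequence):
    """Cleave after K or R, except when the following residue is P."""
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        is_last = i == len(sequence) - 1
        if residue in "KR" and (is_last or sequence[i + 1] != "P"):
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # trailing piece with no K/R at its end
    return peptides


chimera = "CKVYITGGRGAGWSLRAWR"
wild_type = "MAAPRAGRGAGWSLRAWR"

chimera_peptides = tryptic_digest(chimera)
wild_type_peptides = tryptic_digest(wild_type)

# Peptides seen only in the chimeric excerpt are the candidate MS tags.
unique_to_chimera = [p for p in chimera_peptides if p not in wild_type_peptides]
```

Under these assumptions both excerpts share GAGWSLR and AWR, so only the peptides upstream of the shared region can discriminate the chimera from the wild type.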

Chimeras, or if translated more commonly referred to as fusion proteins, are a big topic in cancer cells and tumors. They raise interesting challenges, both regarding their potential biochemical effects and database annotation. For example, there are ~150 BCR/ABL fusion proteins in TrEMBL. We'll see what happens to the 1000s that remain to be characterised (and quantified) in HeLa and other cell lines, possibly including the sequence above.

Sunday, 24 March 2013

BioIT Apr 9th drug repurposing workshop preview

Update, 27 Mar,  all four slide sets now up for preview.
**********************************************
I am pleased to see our workshop taking shape as a web page with the abstracts and draft slide sets already on slideshare.
Drug Repurposing - Fishing for Pearls with a Very Wide Net
Digging out Structures for Repurposing: Non-competitive Intelligence
Exploring Chemical and Biological knowledge spaces with PubChem
Compound Repurposing from Knowledge Integration and Opportunistic Screening
If you are attending BioIT, even if you register on the day, you are welcome to sign up to join us, and note there are both academic and student discounts. What I will do in this post is pick up in more detail on a few things that I will probably not have time to squeeze in while talking through the slides.

One of these was an interesting surprise on the specificity of PubChem queries for INNs and USANs. Like anyone else who reads their informative web pages, I took the statement "The cumulative list of INN now stands at some 7000 names designated since that time (1953), growing every year by some 120-150" at face value but, as ever, it looks like this is a little out of date. I was also interested to find some corroborative statistics in the form of "USP Dictionary of USAN and International Drug Names database contains more than 8,400 non proprietary drug name entries including: U.S. Adopted Names (USANs), USP-NF drug substances and excipients, International Non proprietary Names (INNs), British Approved Names (BANs) and Japanese Approved Names (JANs), more than 6,690 graphic formulas, 4,100 brand names, 9,400 CAS® Registry Numbers, and 3,898 code designations." There's a lot to mull over in these numbers but you can see my PubChem cross-checks below, which proved surprisingly effective.


As we can see from the minor differences when restricting to the synonym field, these queries not only tally reasonably closely with the official numbers but also seem to be "clean" (i.e. low false positives). Some simple source/property cutting produces the graph below.


A few minor surprises here. In fact 26% salts/mixtures (not unexpectedly 675:2158  INN:USAN) brings this in line to   ~ 7000 parent INNs (note approved mixtures don't get them).  DrugBank is very low because of their focus on approved drugs and PDB ligands.  One of the consequences of this is low protein target mapping into the BioSystems pathways.  ROF is also low but remember the mixtures will confound this cut.  I would have expected active BioAssays (largely ChEMBL)  and MeSH pharmacology to be higher.   The vendor count is higher than I expected and bodes well for those who want to try some experimental repurposing.

Another aspect of wildcard stemming search that proved to be useful for the repurposing slides was code names. The bad news is there are an awful lot of them. The good news is that a triple prefix can be used for recall as the query below shows.


Note that the query box (and the history box) has index-matched all the GSK code names it could find. While many of you, I'm sure, can do text parsing like rolling off a log, I find http://www.textfixer.com/ offers me some quick fixes as shown below.
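For those who would rather script the clean-up, a regular expression does the same job as the manual fixes. The sketch below is only an illustration: the input string imitates the general look of an expanded query, not the exact text PubChem shows, and the pattern assumes GSK codes are the prefix, a run of digits and at most one trailing letter.

```python
import re

# Assumed code-name shape: GSK, optional space or hyphen, digits, optional letter.
CODE_RE = re.compile(r"\bGSK[- ]?(\d+)([A-Z]?)\b", re.IGNORECASE)


def extract_codes(text):
    """Return the unique code names found in text, normalised and sorted."""
    found = set()
    for digits, suffix in CODE_RE.findall(text):
        found.add(f"GSK{digits}{suffix.upper()}")
    return sorted(found)


# Made-up example of pasted query detail, for illustration only.
pasted = (
    'gsk1120212[Synonym] OR "GSK 2118436"[Synonym] OR '
    "gsk1016790a[Synonym] OR gsk1120212[Synonym]"
)
codes = extract_codes(pasted)
```

Normalising case and spacing at this stage avoids spurious mismatches when the list is later compared against another source.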


From a repurposing point of view we can attempt to intersect these structures with clinical trial results.  Usefully, the stemming search for GSK code names works in PubMed and we can filter by clinical trial in the last five years,  as shown below.


Here again the search detail box lists the matches and textfixer cleans them up. It's then a case of intersecting the two code number lists in Venny, but this is slightly confounded by the use of the "a" and "b" suffixes.


Nevertheless, we now have 40 in common (i.e. GSK code, structure, clinical trial report, last five years). Doing a little more text manipulation and popping these into the query box returns 44 CIDs as shown below, because some names will map to more than one CID, probably stereo forms or salts.
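The suffix problem with the list intersection can be handled by normalising both lists before comparing them. This is a sketch under the assumption that a single trailing letter (such as "A" or "B") only marks a salt or form of the same parent code. The two input lists are illustrative and are not the actual query results.

```python
import re

# A code followed by exactly one trailing letter, e.g. GSK1120212B.
SUFFIX_RE = re.compile(r"^(GSK\d+)[A-Z]$")


def strip_suffix(code):
    """Upper-case, drop spaces/hyphens and remove one trailing form letter."""
    cleaned = code.upper().replace(" ", "").replace("-", "")
    match = SUFFIX_RE.match(cleaned)
    return match.group(1) if match else cleaned


def intersect_codes(structure_codes, trial_codes):
    """Codes present in both lists once suffixes are ignored."""
    with_structure = {strip_suffix(c) for c in structure_codes}
    with_trial = {strip_suffix(c) for c in trial_codes}
    return sorted(with_structure & with_trial)


# Illustrative inputs only.
from_pubchem = ["GSK1120212", "GSK2118436A", "GSK1016790A"]
from_pubmed = ["gsk2118436", "GSK1120212B", "GSK573719"]
in_common = intersect_codes(from_pubchem, from_pubmed)
```

The trade-off is that collapsing suffixes can merge distinct salt forms, which is one reason a name list can come back with more CIDs than names.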


So we can save these as a public collection as shown below.


OK,  be my guest and have a look through. For the record, 12 of the 44 have INN/USAN but still might have stalled.  If you find any clues to repurposing let me know.

Monday, 11 March 2013

The new drugs of 2012 in PubChem


It was gratifying to see the previous Drug-class-of-2011-in-pubchem post garnered 412 hits.  For this annual update I have made a few changes to make it a bit quicker to compile the post.  As ever, the pundits have made their pronouncements, varying between glass half-empty and half-full interpretations of two big years in a row.  For a detailed technical breakdown the informative ChEMBL 2012 drug listings are recommended.  In this post I will just pick up on a few observations (mostly quirks of one sort or another)  arising from the CIDs per se, the SureChemOpen patent links and the InChIKey Google hits.  The Excel table, put together from the FDA list in reverse chronological order, was problematic to format in Blogger so I needed to split the columns into the  sets below.  The sheet is also posted on figshare but unfortunately the links are not live (but this is on their feature wish list).

*****************************************************************************
 
No. Active USAN  Use
 39. crofelemer HIV/AIDS patients whose diarrhea is not caused by infection.
 38. bedaquiline As part of combination therapy for multi-drug resistant pulmonary tuberculosis (TB)
 37. apixaban Reduce systemic embolism in patients with atrial fibrillation not caused by a heart valve problem.
 36. lomitapide Patients with homozygous familial hypercholesterolemia (HoFH).
 35. teduglutide Adults with short bowel syndrome (SBS) who need parenteral nutrition.
 34. pasireotide Cushing’s disease patients who cannot be helped through surgery
 33. raxibacumab Inhalational anthrax
 32. ponatinib CML and Ph+ ALL
 31. cabozantinib Metastatic medullary thyroid cancer
 30. tofacitinib Active rheumatoid arthritis (RA) in patients who have had an inadequate response to methotrexate.
 29. omacetaxine mepesuccinate Chronic myelogenous leukemia (CML), a blood and bone marrow disease.
 28. perampanel Partial onset seizures in patients with epilepsy ages 12 years and older.
 27. ocriplasmin Eye condition called symptomatic vitreomacular adhesion (VMA).
 26. regorafenib Metastatic colorectal cancer
 25. Choline C 11 Injection Positron Emission Tomography (PET) imaging agent used to help detect recurrent prostate cancer.
 24. teriflunomide Relapsing MS.
 23. bosutinib CML.
 22. enzalutamide Late-stage (metastatic) castration-resistant prostate cancer
 21. linaclotide Chronic idiopathic constipation and irritable bowel syndrome with constipation (IBS-C) in adults.
 20. tbo-filgrastim Reduce neutropenia.
 19. elvitegravir, cobicistat, emtricitabine, tenofovir disoproxil fumarate Combination pill for first HIV-1 infection
 18. ziv-aflibercept Combination with FOLFIRI (folinic acid, fluorouracil and irinotecan) for colorectal cancer.
 17. aclidinium bromide Bronchospasm associated with chronic obstructive pulmonary disease (COPD)
 16. carfilzomib Multiple myeloma in patients who have received at least two prior therapies
 15. sodium picosulfate, magnesium oxide, citric acid Help cleanse the colon in adults preparing for colonoscopy.
 14. mirabegron Overactive bladder.
 13. lorcaserin hydrochloride Chronic weight management.
 12. pertuzumab HER2-positive late-stage (metastatic) breast cancer.
 11. taliglucerase alfa Long-term enzyme replacement therapy for Gaucher disease
 10. avanafil Erectile dysfunction.
 9. Florbetapir F 18 PET imaging of the brain to estimate β-amyloid neuritic plaque in patients with cognitive impairment
 8. peginesatide Anemia, for dialysis patients who have chronic kidney disease (CKD)
 7. lucinactant Prevention of respiratory distress syndrome (RDS), a breathing disorder that affects premature infants.
 6. tafluprost Reducing elevated intraocular pressure in patients with open-angle glaucoma or ocular hypertension.
 5. ivacaftor Treatment of CF for the G551D mutation in CFTR.
 4. vismodegib  Basal cell carcinoma
 3. axitinib Advanced renal cell carcinoma in patients who have not responded to another drug.
 2. ingenol mebutate Topical treatment of actinic keratosis.
 1. glucarpidase Patients with raised levels of methotrexate in their blood due to kidney failure.

*****************************************************************************

No. Trade Name  Active USAN  CID  SID (if no CID) Notes
 39. Fulyzaq crofelemer SID 135030308 botanical mixture
 38. Sirturo bedaquiline CID 5388906 7 CID  > 52 SIDs
 37. Eliquis apixaban CID 10182969 40 SIDs, ~85 deuteros
 36. Juxtapid  lomitapide CID 9853053
 35. Gattex  teduglutide CID 16139605 3752  Mw, strange renderings
 34. Signifor pasireotide CID 56928195 1047 Mw, 13 CIDs > 32 SIDs
 33. raxibacumab raxibacumab SID 160687615 antibody
 32. Iclusig ponatinib CID 24826799
 31. Cometriq cabozantinib CID 25102847 Similar Compounds (1182)
 30. Xeljanz tofacitinib CID 9926791 42 SIDs  ~ 60 deuteros
 29. Synribo omacetaxine mepesuccinate CID 285033 Natural product
 28. Fycompa perampanel CID 9924495
 27. Jetrea ocriplasmin SID 135332408
 26. Stivarga regorafenib CID 11167602
 25. Choline C 11 Injection Choline C 11 Injection
 24. Aubagio teriflunomide CID 54684141
 23. Bosulif bosutinib CID 5328940 536 targets, story on wrong structures
 22. Xtandi enzalutamide CID 15951529
 21. Linzess linaclotide CID 16158208 1526 Mw, strange renderings
 20. Neutroval tbo-filgrastim SID 135320862 form of filgrastim, no exact structure
 19. Stribild elvitegravir, cobicistat, emtricitabine, tenofovir disoproxil fumarate CID 66545969
 18. Zaltrap  ziv-aflibercept SID 135347926 VEGF biological
 17. Tudorza Pressair aclidinium bromide CID 11519741 parent for patent match CID 11434515
 16. Kyprolis  carfilzomib CID 11556711
 15. Prepopik  sodium picosulfate, magnesium oxide, citric acid
 14. Myrbetriq  mirabegron CID 9865528
 13. Belviq lorcaserin hydrochloride CID 11673085
 12. Perjeta pertuzumab SID 135348342
 11. Elelyso taliglucerase alfa SID 124490415
 10. Stendra avanafil CID 9869929
 9. Amyvid Florbetapir F 18 CID 24822371
 8. Omontys peginesatide SID 160645460 SID structures but no CID
 7. Surfaxin lucinactant SID 47206602
 6. Zioptan tafluprost CID 9868491 6 ZINC entries
 5. Kalydeco ivacaftor CID 16220172
 4. Erivedge vismodegib CID 24776445 328 genes
 3. Inlyta axitinib CID 23724859
 2. Picato ingenol mebutate CID 6918670
 1. Voraxaze glucarpidase SID 135297601

********************************************************************************


No. Trade Name  Possible  1st patent  SureChemOpen count
 39. Fulyzaq
 38. Sirturo http://open.surechem.com/en/document/US-20050148581-A1/ 260
 37. Eliquis http://open.surechem.com/en/document/WO-2003049681-A2/ 3,361
 36. Juxtapid  https://open.surechem.com/en/document/WO-1996026205-A1/ 274
 35. Gattex 
 34. Signifor
 33. raxibacumab
 32. Iclusig http://open.surechem.com/en/document/WO-2007075869-A2/ 42
 31. Cometriq https://open.surechem.com/en/document/US-20070054928-A1/ 32
 30. Xeljanz http://open.surechem.com/en/document/WO-2002096909-A1/ 635
 29. Synribo http://open.surechem.com/en/document/US-20110071097-A1/ 1
 28. Fycompa http://open.surechem.com/en/document/WO-2006107859-A2/ 1424
 27. Jetrea
 26. Stivarga http://open.surechem.com/en/document/WO-2004078746-A2/ 432
 25. Choline C 11 Injection
 24. Aubagio http://open.surechem.com/en/document/EP-0538783-A1/ 3,919
 23. Bosulif http://open.surechem.com/en/document/WO-2003093241-A1/ 426
 22. Xtandi
 21. Linzess
 20. Neutroval
 19. Stribild
 18. Zaltrap 
 17. Tudorza Pressair http://open.surechem.com/en/document/US-20030055080-A1/ 850
 16. Kyprolis 
 15. Prepopik 
 14. Myrbetriq  https://open.surechem.com/en/document/EP-1932838-A3/ 25
 13. Belviq http://open.surechem.com/en/document/WO-2005003096-A1/ 330
 12. Perjeta
 11. Elelyso
 10. Stendra http://open.surechem.com/en/document/US-20020037828-A1 1305
 9. Amyvid
 8. Omontys
 7. Surfaxin
 6. Zioptan http://open.surechem.com/en/document/US-20110275715-A1/ 5
 5. Kalydeco https://open.surechem.com/en/document/WO-2007075901-A2/ 132
 4. Erivedge http://open.surechem.com/en/document/WO-2006028958-A2/ 59
 3. Inlyta http://open.surechem.com/en/document/WO-2006048744-A1/ 17
 2. Picato http://open.surechem.com/en/document/US-20090292017-A1/ 18
 1. Voraxaze

**********************************************************************************
 
No. Trade Name  InChIKey (IK) IK Google count
 39. Fulyzaq
 38. Sirturo QUIJNHUBAXPXFS-XLJNKUFUSA-N 7
 37. Eliquis QNZCBYKSOIHPEH-UHFFFAOYSA-N 207
 36. Juxtapid  MBBCVAKAJPKAKM-UHFFFAOYSA-N 54
 35. Gattex  CILIXQOJUNDIDU-ASQIGDHWSA-N 6
 34. Signi
 33. raxibacumab
 32. Iclusig PHXJVRSECIGDHY-UHFFFAOYSA-N 24
 31. Cometriq ONIQOQHATWINJY-UHFFFAOYSA-N 8
 30. Xeljanz UJLAWZDWDVHWOW-YPMHNXCESA-N 697
 29. Synribo HYFHYPWGAURHIV-JFIAXGOJSA-N 120
 28. Fycompa PRMWGUBFXWROHD-UHFFFAOYSA-N 82
 27. Jetrea
 26. Stivarga FNHKPVJBJVTLMP-UHFFFAOYSA-N 24
 25. Choline C 11 Injection
 24. Aubagio UTNUDOFZCWSZMS-YFHOEESVSA-N 141
 23. Bosulif UBPYILGKFZZVDX-UHFFFAOYSA-N 315
 22. Xtandi WXCXUHSOUPDCQV-UHFFFAOYSA-N 116
 21. Linzess KXGCNMMJRFDFNR-WDRJZQOASA-N 7
 20. Neutroval
 19. Stribild JQSAENLSMVBKRQ-WTRZBLBQSA-N none
 18. Zaltrap 
 17. Tudorza Pressair XLAKJQPTOJHYDR-QTQXQZBYSA-M 144
 16. Kyprolis  BLMPQMFVWMYDKT-NZTKNTHTSA-N 54
 15. Prepopik 
 14. Myrbetriq  PBAPPPCECJKMCM-IBGZPJMESA-N 69
 13. Belviq ITIHHRMYZPNGRC-QRPNPIFTSA-N 10
 12. Perjeta
 11. Elelyso
 10. Stendra WEAJZXNPAWBCOA-INIZCTEOSA-N 105
 9. Amyvid YNDIAUKFXKEXSV-CRYLGTRXSA-N 7
 8. Omontys
 7. Surfaxin
 6. Zioptan WSNODXPBBALQOF-VEJSHDCNSA-N 62
 5. Kalydeco PURKAOJPTOLRMP-UHFFFAOYSA-N 219
 4. Erivedge BPQMGSKTAYIVFO-UHFFFAOYSA-N 111
 3. Inlyta RITAVMQDGBJQJZ-XFXZXTDPSA-N 5
 2. Pica VDJHFHXMUKFKET-WDUFCVPESA-N 44
 1. Voraxaze

************************************************************************

I need to explain the non-obvious column headings. The CIDs are my own de novo assignments from INN/USAN name matches in PubChem (i.e. not anyone else's recycled links, even though these could be better).  As a simplifying option I strip back to the parent unless the USAN is for the salt.  I take the simple empirical choice of "most SIDs = probably correct".  For the combinations I just have to open mixtures up from any component and inspect them, because they have no names that identify the CID as a drug combination. Just for the record, I am not in a position to "officially provenance" any drug name-to-CID links, but if neither the originating companies nor the FDA nor WHO nor USAN are prepared to do this, then who can (Wikipedia maybe?).  The SID cases are all biologicals where there is no small-molecule CID structure.  The "Notes" column is just what piqued my interest and some of these will be expanded below.  "Probable first patent" is taken from SureChemOpen.  I basically follow the PubChem SureCN SID > SureChemOpen link, open up the document list (even up to 100s), and go straight to the last (i.e. first date) entry, usually a WO/PCT.  Not all of these may be first filings (e.g. if the earliest filings were Markushed or otherwise obfuscated the lead, with only the later process patents including the explicit structure that the SureChem pipe extracted).  The SureChemOpen count is a new feature you can see below from the ponatinib SID 152688356.
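For anyone wanting to script the "most SIDs = probably correct" choice, a minimal Python sketch is given here. It only illustrates the counting step: the SID and CID numbers are made-up placeholders (not real PubChem records) and the function name is hypothetical. In practice the SID-to-CID pairs would come from whatever name search you ran.

```python
from collections import Counter

def pick_cid_by_sid_count(sid_to_cid):
    """Return (cid, sid_count) for the CID that the most SIDs point to.

    sid_to_cid maps each submitter record (SID) that matched the drug
    name to the compound (CID) it was merged into.
    """
    counts = Counter(sid_to_cid.values())
    cid, sid_count = counts.most_common(1)[0]
    return cid, sid_count

# Placeholder data only: five name-matched SIDs spread over three CIDs.
name_matches = {9001: 111, 9002: 111, 9003: 222, 9004: 111, 9005: 333}

best_cid, support = pick_cid_by_sid_count(name_matches)
print(f"Most-supported CID: {best_cid} ({support} of {len(name_matches)} SIDs)")
```

A majority vote like this is only a heuristic: a single widely copied wrong structure would win it, so eyeballing the losing CIDs is still worthwhile.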


In this case the 15 patents included 42 extractions of the same structure because an example is typically "exemplified" multiple times as an IUPAC name and/or image in one patent (e.g. both in description and claims). This is multiplexed by the size of the patent family.  Note here that large numbers usually arise from public declarations of the clinical candidate that, after a lag of a few years, lead to a swathe of claims around processes and combinations, some by the primary assignees but more often by generics companies.  Context for the InChIKey Google searches is given in this post. We can now move on to the interesting quirky bits, approximately in order.

Stereo multiplexing:  The new TB drug CID 5388906  has a good example with 7 "same connectivity" CIDs from 52 SIDs with permutated wedge bonds around the two stereo centres.  Since 24 submitters have plumped for (1R,2S) I'll take that as correct.

Blunderbuss virtual deuteration:  The example of apixaban tops the list here because it is an old INN that MeSH picked up in 2007.  As a potential drug it therefore got jumped on in the 2008/9 virtual deuteration gold rush (but no gold as yet......) so we get 91 "same connectivity" CIDs, of which 82 are deuterated, including this rather splendid "deutero-max" version below.
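As a rough illustration of what "same connectivity" means at the InChIKey level, the Python sketch below splits keys from the table above into their blocks. It assumes the standard InChIKey layout: a 14-letter first block hashing the skeleton, a 10-letter second block covering the remaining layers (stereo, isotopes and so on) plus flag characters, and a final protonation letter. It also relies on the convention that a second block starting "UHFFFAOY" means those extra layers are empty. The function names are mine, not from any library.

```python
def split_inchikey(key):
    """Split an InChIKey into (first block, second block, protonation flag)."""
    parts = key.strip().split("-")
    if len(parts) != 3 or [len(p) for p in parts] != [14, 10, 1]:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    return tuple(parts)

def same_connectivity(key_a, key_b):
    """True if both keys share the first (skeleton) block."""
    return split_inchikey(key_a)[0] == split_inchikey(key_b)[0]

def has_extra_layers(key):
    """True if the second block is not the 'nothing beyond skeleton' hash."""
    return not split_inchikey(key)[1].startswith("UHFFFAOY")

# Keys copied from the InChIKey table above.
keys = {
    "Eliquis": "QNZCBYKSOIHPEH-UHFFFAOYSA-N",
    "Xeljanz": "UJLAWZDWDVHWOW-YPMHNXCESA-N",
    "Stivarga": "FNHKPVJBJVTLMP-UHFFFAOYSA-N",
}

for name, key in keys.items():
    print(name, split_inchikey(key)[0], "extra layers:", has_extra_layers(key))
```

On this reading, a deuterated or stereo-permuted variant would be expected to keep the first block of its parent and differ only in the second, which is why searching Google with just the first 14 letters sweeps up the whole envelope.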


Big stuff:  I'm impressed that not only PubChem but also submitters can manfully cope with peptides like teduglutide, CID 16139605, coming in at not much under 4000 Mw. While it holds the size record for this set I was more interested in the second largest, linaclotide, CID 16158208, weighing in at 1526, because a friend of mine at Ironwood was in the team that prepared the regulatory application dossier. You can see it also produces a rather pretty set of renderings via different submitters (below).


Target over-mapping:   As a link to gene targets in the Entrez system two drugs come out way too high: bosutinib, CID 5328940, and vismodegib, CID 4776445, with 526 and 328 genes, respectively (note the former has already featured in this blog via will-real-bosinhib-please-stand-up). The problem here is related to kinase panel screening results. As an example, AID 624722 is part of a result set of 72 kinase inhibitors against 442 kinases.  The PubChem BioAssay target mapping challenge is associated with the threshold settings for "active" in the result matrix.  Thus, any inhibitor run against the panel gets extensive target numbers in the system, as in these two cases (the bosutinib results are shown below).

 
The Entrez system has compounded the problem by transitive orthologous cross-mapping in the gene mappings.  The 526 result for bosutinib is shown below.


So the gene-to-assay (as opposed to the protein-to-assay) mapping now multiplexes from 21 human genes out to all 505 Entrez genes with an orthologous relationship (note also the classic primate 100% identity problem).  Other kinase inhibitors, such as regorafenib, CID 11167602, map to their single primary target.


Google InChIKey counts:  I can't feasibly review these but by all means just pop a few yourself to get an idea.  I had expected the new SureChem corpus counts to correlate with these on the basis of INN "age".  This actually seems to be very loose coupling, but certainly for tofacitinib the issuing of the INN back in 2003 seems the likely cause of the 695 IK Google hits. One of the surprises was:

Cabozantinib - Shopping-enabled Wikipedia Page on Amazon www.amazon.com/wiki/Cabozantinib N-(4-((6,7-Dimethoxyquinolin-4-yl)oxy)phenyl)-N-(4-fluorophenyl) ... 1 Approvals and indications; 2 Clinical trials; 3 See also; 4 References; 5 External links ...


Whatever next ... Another unexpected link  was: 

Ontology Browser - Rat Genome Database rgd.mcw.edu/rgdweb/.../view.html?acc... -... InChIKey=FNHKPVJBJVTLMP-UHFFFAOYSA-N; Regorafenibum ... (1E)-2-(5-chlorothiophen-2-yl)-N-\{(3S)-1-[(1S)-1-methyl-2-morpholin-4-yl-2-oxoethyl]-2- ...

Looking at the link really had me puzzled but I think I've worked it out from the two  pictures below. 



Now, the Rat Genome Database is an impressive resource (I even have a gene in it) and most of the external ontologies (list above) they have chosen to include look-ups for make good bioinformatic sense. However, since the ChEBI ontology is a "rat free zone" I'm not convinced of the utility of linking 28,000 ChEBI records and InChIKeys back to any rodent genome (sure, some will have been assayed against rat proteins but this data will not be ChEBI-linked).   If this really was a thought-through choice by the RGD team I'd be pleased to add their explanation here.

2012 vs 2011:

I can finish off with some comparisons between collated sets held as public MyNCBI collections: the CIDs from above (27 drugs approved 2012) and the CIDs from the Drug-class-of-2011-in-pubchem post (25 drugs approved 2011).  The results are shown below.


Note this cannot be regarded as a strict comparison because of the sprinkling of mixtures, alternative structures, radiolabels and tricky peptides. However, it gives an indication at least of relative coverage.  Nothing particularly surprising here except perhaps a) we might have expected the PubMed and MeSH pharmacology capture to be higher and b) Drugs of the Future showing neck-and-neck capture alongside ChEMBL (but why not get the journal indexed in PubMed? ...sigh). Note the apparent anomaly that some new drugs on the ChEMBL blog are not in ChEMBLdb, but this is mainly the fault of some originators for not doing the whole biomedical community the good service of writing up decent medicinal chemistry papers with SAR for their candidates.

For some arcane, but nonetheless significant metrics, we can also establish that from 555 SIDs the 2012 drugs have an average of 20 SIDs/CID, rising to 24 for 2011.  The same-connectivity approximate tautomer/isomer envelope is 548 for the 25 2011 CIDs.  This falls to 362 for 2012 but remember these numbers can get skewed by just one or two blunderbuss deuteration series.
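The per-CID arithmetic behind those figures can be checked with a few lines of Python. This is just a sketch that assumes the 555 SIDs and the 362 envelope are spread over all 27 CIDs in the 2012 collection. The 2011 SID total is not quoted above, so the figure computed for it is only a back-calculation from the stated average of 24.

```python
def per_cid(total, n_cids):
    """Average count per CID for a collection."""
    return total / n_cids

N_2012, N_2011 = 27, 25  # collection sizes quoted above

sids_per_cid_2012 = per_cid(555, N_2012)      # SIDs per CID, 2012 set
envelope_per_cid_2012 = per_cid(362, N_2012)  # same-connectivity CIDs per drug
envelope_per_cid_2011 = per_cid(548, N_2011)

# Back-calculated, not a quoted number: average of 24 SIDs over 25 CIDs.
implied_sids_2011 = 24 * N_2011

print(f"2012 SIDs/CID:      {sids_per_cid_2012:.1f}")
print(f"2012 envelope/CID:  {envelope_per_cid_2012:.1f}")
print(f"2011 envelope/CID:  {envelope_per_cid_2011:.1f}")
print(f"2011 implied SIDs:  {implied_sids_2011}")
```

Since a mean over so few drugs is easily dragged up by a single deuteration series, a median per CID would probably be the fairer comparison if the individual counts were to hand.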