As an unexpected variation on the theme, NCATS, having previously brokered industry compounds to academia for repurposing (see PMID 23159359), are now putting their home-grown drug research candidates out to tender (NCATS/research/tools/preclinical/patents/tech-licensing). My guess is this will generate a bit of social media commentary, but I will restrict this post to the informatics fun of, you guessed it, digging out the structures. As we know, this is scientifically enabling and could join up disparate pieces of data. However, there is a déjà vu element to this, since there was much comment that over 30 of the NCATS industry compounds remain completely blinded (PMID 23159359). While an initial inspection suggests the NCATS folk are pulling the same trick in reverse (i.e. partially blinding structures from US Gov research that could be licenced out to industry), to be fair most of these are open, but they still need a lot of non-trivial chemical extraction and x-mapping work. Why we should need to do this is another matter. The list is below.
With ~30 structures or classes represented (see below), the fun would rapidly wear off if I were to attempt to discern the key structures behind all these links. What I'll do instead is take a didactic approach by working through just a few examples to show any interested parties how they might carry on with DIY structure digging for whatever candidates might interest them most. The bad news is that NCATS have not chosen to expedite follow-up in the simplest way, which would have been PubChem links. The good news is that most of the publications are open-access (by NIH mandate) and they have declared inventor names, and some patent IDs to enable tracking back to patents (but I suppose offering direct URL links to the patents would have made it too easy.....).
The first one, E-308-2009 Caspase 1 inhibitor, turned out to be straightforward since chemicalize.org (see PMID 23618056) converted 92 structures from the paper, including all the novel synthetic IUPAC names. The circles thus joined up as per below via the InChIKey Google search.
Thus, one of the key cpds in the paper (4) = NCGC00183434 = CID 44620939 = KENKPOUHXLJLEY-QANKJYHBSA-N. There are also a few quirks associated with this. Firstly, while there are two SureChemOpen patent links via SID 155243680, these are citing the compound as prior art. What looks like the primary composition-of-matter filing, US20120294843, has converted many of the image structures from the series but the exact one above may have failed. Note also there is a full set of Caspase 1 PubChem BioAssays (linked to AID 2389). This is not explicitly cited in the paper but the PubMed/Entrez system adds the link. The InChIKey search picks up a "cryptic" CHEMBL1552969 entry in the sense that, because this was a subsumed confirmatory PubChem BioAssay, there is a ChEMBL-to-PubChem pointer via the BioAssay SID but this is not reciprocal. The MeSH annotator picked up one specified substance
1-(2-((1-(4-amino-3-chlorophenyl)methanoyl)amino)-3,3-dimethylbutanoyl)pyrrolidine-2-carboxylic acid but as CID 10156704 this turns out to be a reference compound used in the paper but only code-name mapped as VX-765 via an InChIKey hit to a chemical supplier, not via MeSH.
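For anyone who would rather script the InChIKey step than Google it, here is a minimal sketch using PubChem's PUG REST service. The function names are mine (hypothetical helpers, not anything from the paper or from NCATS), and the 404-means-no-record handling is my assumption about how the service reports a miss, so check it against the PubChem documentation.

```python
import json
import re
import urllib.error
import urllib.parse
import urllib.request

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
# Standard InChIKey layout: 14 letters, 10 letters, 1 letter, hyphen-separated.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def is_inchikey(text):
    """Cheap format check to catch copy-paste damage before querying."""
    return bool(INCHIKEY_RE.match(text.strip()))


def cid_url(inchikey):
    """Build the PUG REST URL that lists the CIDs for one InChIKey."""
    quoted = urllib.parse.quote(inchikey.strip(), safe="")
    return f"{PUG_REST}/compound/inchikey/{quoted}/cids/JSON"


def parse_cids(payload):
    """Pull the CID list out of an already-decoded PUG REST response."""
    return payload.get("IdentifierList", {}).get("CID", [])


def inchikey_to_cids(inchikey, timeout=30):
    """Network call: return matching CIDs, or [] when nothing is found."""
    if not is_inchikey(inchikey):
        raise ValueError(f"not an InChIKey: {inchikey!r}")
    try:
        with urllib.request.urlopen(cid_url(inchikey), timeout=timeout) as resp:
            return parse_cids(json.load(resp))
    except urllib.error.HTTPError as err:
        if err.code == 404:  # assumption: a miss comes back as HTTP 404
            return []
        raise
```

Calling `inchikey_to_cids("KENKPOUHXLJLEY-QANKJYHBSA-N")` should, if the mapping above still holds, come back with CID 44620939. Note this only finds the PubChem end of the circle; the ChEMBL and patent hits still need the Google search.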
Moving on down, I drew a blank with E-276-2011 in not being able to relate the inventor patent name hits to the abstract details. Next up, E-120-2010 was OK because the NCGC codes used in the two papers (PMID 20451379 and PMID 20017496, also picked up by ChEMBL) are Google and PubChem-positive. As an example, one of the pyruvate kinase M2 activators, NCGC00185916, is mapped to CID 44543605. In turn this maps to useful stuff including US20110195958 via SureChemOpen with over 130 examples. Next up, E-240-2011 Small-Molecule Inhibitors of Human Galactokinase for the Treatment of Galactosemia and Cancers has no publication links but does include a US patent number. We can then go into WO201304319 "galactokinase inhibitors for the treatment and prevention of associated diseases and disorders" with an SAR table of 112 structures. While the reported potencies are low, we can pick up the most active structure below.
And to not only square the circle but also add a quirk here, CID 44640157 matches the assay result in the patent (i.e. it was the same data) but potency fell 4-fold on the re-test. This would take some sorting out but at least the data is on deck for inspection by all.
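Since the NCGC codes in the E-120-2010 papers are PubChem-positive, harvesting them from pasted text is easy to script. The sketch below assumes the codes always look like "NCGC" plus eight digits (true of the ones quoted in this post, but I have not checked the full range) and that PubChem indexes them as depositor synonyms, so that a PUG REST name lookup finds them. The helper names are hypothetical.

```python
import re
import urllib.parse

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
# Assumed pattern: "NCGC" followed by eight digits, as in the codes quoted here.
NCGC_RE = re.compile(r"\bNCGC\d{8}\b")


def find_ncgc_codes(text):
    """Return the distinct NCGC codes in text, in order of first appearance."""
    seen = []
    for code in NCGC_RE.findall(text):
        if code not in seen:
            seen.append(code)
    return seen


def name_lookup_url(code):
    """PUG REST URL that searches PubChem names/synonyms and lists CIDs."""
    quoted = urllib.parse.quote(code, safe="")
    return f"{PUG_REST}/compound/name/{quoted}/cids/JSON"


def lookup_urls(text):
    """Map each code found in text to the URL to fetch for its CIDs."""
    return {code: name_lookup_url(code) for code in find_ncgc_codes(text)}
```

Paste in an abstract or a supplementary table, fetch each URL, and you have a code-to-CID list to carry into SureChemOpen or the BioAssay pages. I kept the fetch out of the sketch so the parsing can be checked offline.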
The entry for E-094-2011 Inhibitors of Human Apurinic/apyrimidinic Endonuclease 1 (APE1) as an Anticancer Drug Target is at least open via two publications and a patent number. Nonetheless, there is no direct link into PubChem. We have to paste the IUPAC name for compound "3" out of the supplementary data PDF from PMID 22455312, use OPSIN (in this case) for the conversion, and then search with the resulting SMILES to hit CID 3581333. This squares the circle as below
We can also dive into the relevant assay results as per below, but note this was also scored as an active antimalarial in the parasite assay!
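The name-to-structure hop used for E-094-2011 (IUPAC name, then OPSIN, then a SMILES search in PubChem) can also be chained in a script. This is a sketch under several assumptions of mine: the OPSIN web service address and its `smiles` JSON field are as I remember them and may have moved, PubChem accepts the SMILES as a POSTed form field, and a CID of 0 means "no record". The helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoints; check the current OPSIN and PubChem documentation before use.
OPSIN_BASE = "https://opsin.ch.cam.ac.uk/opsin"
PUBCHEM_SMILES = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/smiles/cids/JSON"


def opsin_url(iupac_name, base=OPSIN_BASE):
    """URL asking OPSIN to convert one chemical name, with a JSON reply."""
    return f"{base}/{urllib.parse.quote(iupac_name.strip(), safe='')}.json"


def smiles_from_opsin(payload):
    """Return the SMILES from a decoded OPSIN reply, or None if absent."""
    return payload.get("smiles") or None


def smiles_post_body(smiles):
    """Form-encode the SMILES so '/', '#' and '+' cannot break a URL path."""
    return urllib.parse.urlencode({"smiles": smiles}).encode("ascii")


def name_to_cids(iupac_name, timeout=30):
    """Network calls: name -> SMILES via OPSIN -> CIDs via PubChem.

    A name OPSIN cannot parse may surface as urllib.error.HTTPError.
    """
    with urllib.request.urlopen(opsin_url(iupac_name), timeout=timeout) as resp:
        smiles = smiles_from_opsin(json.load(resp))
    if smiles is None:
        return []
    request = urllib.request.Request(PUBCHEM_SMILES, data=smiles_post_body(smiles))
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        cids = json.load(resp).get("IdentifierList", {}).get("CID", [])
    return [cid for cid in cids if cid]  # assumption: 0 means "no record"
```

Names pasted out of a PDF often carry stray line breaks or hyphens, so expect to clean the string by hand before OPSIN will take it; that part I have not tried to automate.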
I may dig out a few more of these but that's enough for this afternoon, other stuff to do. To conclude:
1) Much of this is openly accessible data (including the published patents) but many potential interested parties will find linking everything up via chemical structures hard work. Most of the PIs for these projects could do this like rolling off a log by simply listing out InChIs and CIDs in SAR data tables (but please - don't Markush the series).
2) The science coming out of these enterprises is top notch but the PIs should actively engage in connecting the system. At the most basic this should be explicit citations and/or cross-pointers between the papers, patents and BioAssay IDs.
3) I can't estimate how long it would take me to get lead structures from the whole list (and some are blinded) because this would depend on the desired end-point (e.g. one key example or 100s of patent structures) but it would be in the order of days and the process also has plenty of room for error.
4) There are probably a lot of others embarked on the same exercise right now.....
5) Interestingly, I think it was tacit knowledge that MLSCN investigators were filing patents on leads that would later appear in PubChem BioAssay, but some of these entries now make this explicit (this is not a criticism of the practice, merely an observation).
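To make points 1) and 2) concrete, here is a sketch of the sort of cross-pointer table that would save everyone the digging. The column layout and function name are just my suggestion. The identifiers are only the ones quoted in this post, and cells I could not fill from the above are left blank rather than guessed.

```python
import csv
import io

FIELDS = ["listing", "compound", "ncgc", "cid", "inchikey", "pmids", "patent", "aid"]

# Identifiers as quoted in this post; missing cells are deliberately empty.
ROWS = [
    {"listing": "E-308-2009", "compound": "4", "ncgc": "NCGC00183434",
     "cid": "44620939", "inchikey": "KENKPOUHXLJLEY-QANKJYHBSA-N",
     "patent": "US20120294843", "aid": "2389"},
    {"listing": "E-120-2010", "ncgc": "NCGC00185916", "cid": "44543605",
     "pmids": "20451379;20017496", "patent": "US20110195958"},
    {"listing": "E-240-2011", "cid": "44640157", "patent": "WO201304319"},
    {"listing": "E-094-2011", "compound": "3", "cid": "3581333",
     "pmids": "22455312"},
]


def to_tsv(rows, fields=FIELDS):
    """Serialise the rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, delimiter="\t",
                            restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

`print(to_tsv(ROWS))` gives something that pastes straight into a spreadsheet or a supplementary data file. One row per exemplified compound with its InChIKey and CID would be enough to join the papers, patents and BioAssays without any of the detective work above.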