[Blog] Fighting COVID-19 With Deep Graph

Since December 2019, the rapid worldwide spread of the coronavirus that causes COVID-19 has led to more than 7 million infections and more than 400,000 deaths. This demonstrates the dire need for quick and effective drug discovery. Drug repurposing is a drug discovery paradigm that uses existing drugs for new therapeutic indications. It significantly reduces time and cost relative to de novo drug discovery. Drug repurposing with knowledge graphs presents a promising strategy for COVID-19 treatment.

This is a companion discussion topic for the original entry at https://www.dgl.ai/news/2020/06/09/covid.html

Hi Minjie, thanks for sharing this great project.
I have a question about entities from different data sources. Is the same gene from different data sources represented as a single node?
For example, suppose Hetionet has the ADT gene in its dataset and Drugbank also has the ADT gene in its dataset. Does the ADT gene show up as a single node in the DRKG knowledge graph, or as multiple nodes (i.e. the ADT gene from Hetionet as one node and the ADT gene from Drugbank as another)?

We do handle the data fusion issue and treat the same entity from different data sources as a single node.
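To illustrate the idea (this is only a minimal sketch, not the actual DRKG construction code): once every source's entities have been mapped to a shared ID space, fusing them comes down to keying nodes by that shared ID. The function name `fuse` and the toy triples below are made up for illustration.

```python
from collections import defaultdict


def fuse(triples_by_source):
    """Merge (head, relation, tail) triples coming from several sources.

    Assumes entity names were already mapped to a shared ID space, so the
    same entity has the same name in every source and becomes one node.
    """
    node_sources = defaultdict(set)  # node name -> sources it appears in
    edges = set()                    # de-duplicated triples
    for source, triples in triples_by_source.items():
        for head, relation, tail in triples:
            node_sources[head].add(source)
            node_sources[tail].add(source)
            edges.add((head, relation, tail))
    return dict(node_sources), edges


# Hypothetical toy data: the compound names and relations are placeholders.
toy = {
    "Hetionet": [("Compound::A", "binds", "Gene::10")],
    "Drugbank": [("Compound::B", "targets", "Gene::10")],
}
nodes, edges = fuse(toy)
print(sorted(nodes["Gene::10"]))  # the one gene node lists both sources
```

The per-node set of sources in this sketch plays the same role as a provenance file: one node per entity, with a record of where it came from.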

That’s great! Is there a dictionary file we can use to match the gene IDs to the gene names? For example, for “Gene::10”, I want to know which gene it is.
Thank you.

You can find those in ./entity2src.tsv by following the corresponding instructions in the README.

I took a look at that file, here is what I see.

From the content, I can’t figure out what Gene::10 and Gene::100 are. They all point to the same set of links without specific gene information. Am I missing something?
Thanks for answering.

Hi, those numbers are the Entrez IDs (the identifier for a gene in the NCBI Entrez database) of the genes. The file entity2src.tsv maps the gene IDs to the list of data sources they appear in (we use seven different data sources to construct the DRKG). Specifically, we use the following rules to assign IDs:
(i) Compound entities are mapped to the DrugBank ID and, if that is not possible, to the ChEMBL ID. If a compound cannot be found in either of the two, we use the native ID space and include the name of the source as part of the entity’s name.
(ii) Gene entities are mapped to the Entrez ID.
(iii) Disease entities are mapped to the MeSH ID space.
(iv) The remaining biological entities appear only in a single data source and hence we use the data source’s ID.
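
As a rough sketch of how one might use these rules (not part of the DRKG tooling), the snippet below splits an entity name such as “Gene::10” into its type and ID, and builds a link to the NCBI Gene page for that Entrez ID, where the gene name can be read off. The helper names `split_entity` and `ncbi_gene_url` are hypothetical, and the use of the public NCBI Gene URL pattern is an assumption about how you want to look names up.

```python
def split_entity(name):
    """Split a '<type>::<id>' entity name into (type, id)."""
    entity_type, sep, entity_id = name.partition("::")
    if not sep or not entity_type or not entity_id:
        raise ValueError(f"not a '<type>::<id>' entity name: {name!r}")
    return entity_type, entity_id


def ncbi_gene_url(name):
    """Return the NCBI Gene page URL for a 'Gene::<Entrez ID>' entity."""
    entity_type, entity_id = split_entity(name)
    if entity_type != "Gene":
        raise ValueError(f"expected a Gene entity, got {entity_type!r}")
    # Assumption: the ID part is an Entrez ID, as described in rule (ii).
    return f"https://www.ncbi.nlm.nih.gov/gene/{entity_id}"


for entity in ["Gene::10", "Gene::100"]:
    print(entity, "->", ncbi_gene_url(entity))
```

Splitting only on the first “::” keeps any source prefix inside the ID part intact, which matters for entities that fall under rules (i) and (iv).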

Thanks for the detailed info.