Published online by Cambridge University Press: 11 April 2025
Objectives/Goals: Understanding the interconnections among over 20,000 human diseases spanning organ systems could inform more precise diagnosis and treatment of diseases. Here, we examine whether the ability of large language models (LLMs) to learn universal representations of concepts can be leveraged to discover complex relationships across human diseases. Methods/Study Population: To address the challenge of computationally representing thousands of diseases spanning multiple organ systems, we used LLMs' internal representations of concepts to encode diseases based on their descriptions from standard disease ontologies (ICD10 and Phecodes). To do this, we leveraged application programming interfaces (APIs) of three LLMs (GPT-3.5, Mistral, and Voyage) to encode disease relationships. We then performed unsupervised clustering of the diseases using their encodings (embeddings) from each LLM to determine whether the resulting clusters reflect disease relationships. To enable deeper exploration of disease relationships, we developed interactive plots that provide a system-level view of the relationships between thousands of diseases and their association with specific organ systems. Results/Anticipated Results: We found that unsupervised analysis of disease relationships using the LLM encodings reveals high similarities among diseases based on the organ systems they affect. All the LLMs clustered diseases into groups largely defined by the organ systems they affect, without being trained specifically to classify diseases by organ system. Tumors were an exception: most tumors clustered together as a group irrespective of the organs they affect. Interestingly, we found that tumors affecting anatomically related organs show higher similarity to each other than to those affecting distantly related organs.
In addition to anatomical relationships, we found that the LLM embeddings capture genetic relationships between diseases. Discussion/Significance of Impact: Overall, we found that the LLM-derived encodings preserve biologically and clinically significant relationships across organ systems and disease types. These results suggest that LLM encodings could provide a universal framework for representing diseases as computable phenotypes and enable the discovery of complex disease relationships.
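The similarity comparison described in the Results can be illustrated with a minimal sketch. It is not the authors' pipeline: the vectors below are hypothetical three-dimensional stand-ins for real LLM embeddings, the disease names and organ-system labels are chosen only for illustration, and the function names are assumptions. Instead of full unsupervised clustering, the sketch performs a simpler check, comparing the mean cosine similarity of disease pairs within the same organ system against pairs from different systems.

```python
import math
from itertools import combinations

# Hypothetical toy vectors standing in for LLM embeddings of disease
# descriptions. Real embeddings have hundreds or thousands of dimensions
# and would come from an embedding API; these values are made up.
toy_embeddings = {
    "asthma": ([0.9, 0.1, 0.0], "respiratory"),
    "chronic bronchitis": ([0.8, 0.2, 0.1], "respiratory"),
    "myocardial infarction": ([0.1, 0.9, 0.1], "circulatory"),
    "atrial fibrillation": ([0.2, 0.8, 0.0], "circulatory"),
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_similarity_by_group(embeddings):
    """Return (mean within-system similarity, mean between-system similarity)."""
    within, between = [], []
    for (_, (vec_a, sys_a)), (_, (vec_b, sys_b)) in combinations(
        embeddings.items(), 2
    ):
        sim = cosine_similarity(vec_a, vec_b)
        # Pairs sharing an organ-system label go in one bucket, the rest in the other.
        (within if sys_a == sys_b else between).append(sim)
    return sum(within) / len(within), sum(between) / len(between)


within_mean, between_mean = mean_similarity_by_group(toy_embeddings)
print(f"within-system mean similarity:  {within_mean:.3f}")
print(f"between-system mean similarity: {between_mean:.3f}")
```

If embeddings group diseases by organ system, the within-system mean should exceed the between-system mean. With real embeddings, the same comparison could be run per LLM and per ontology before moving on to clustering.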