
366 Universal representation of human diseases using large language models

Published online by Cambridge University Press:  11 April 2025

Geoffrey Siwo, University of Michigan Medical School
Ellen R. Bowen, University of Michigan Medical School
Akbar K. Waljee, University of Michigan Medical School

Abstract

Objectives/Goals: Understanding the interconnections among over 20,000 human diseases spanning organ systems could inform more precise diagnosis and treatment of diseases. Here, we examine whether the ability of large language models (LLMs) to learn universal representations of concepts can be leveraged to discover complex relationships across human diseases.

Methods/Study Population: To address the challenge of computationally representing thousands of diseases spanning multiple organ systems, we used the internal representations of concepts in LLMs to encode diseases based on their descriptions in standard disease ontologies (ICD-10 and Phecodes). To do this, we used the application programming interfaces (APIs) of three LLMs (GPT-3.5, Mistral, and Voyage) to encode the diseases. We then performed unsupervised clustering of the diseases using their encodings (embeddings) from each LLM to determine whether the resulting clusters reflect disease relationships. To enable deeper exploration of these relationships, we developed interactive plots that provide a system-level view of the relationships among thousands of diseases and their associations with specific organ systems.

Results/Anticipated Results: Unsupervised analysis of the LLM encodings reveals high similarity among diseases that affect the same organ system. All three LLMs clustered diseases into groups largely defined by the organ systems they affect, without being trained specifically to classify diseases by organ system. Tumors were an exception: most tumors clustered together as a group irrespective of the organs they affect. Interestingly, tumors affecting anatomically related organs were more similar to each other than to tumors affecting distantly related organs. In addition to anatomical relationships, we found that the LLM embeddings capture genetic relationships between diseases.

Discussion/Significance of Impact: Overall, we found that the LLM-derived encodings preserve biologically and clinically significant relationships across organ systems and disease types. These results suggest that LLM encodings could provide a universal framework for representing diseases as computable phenotypes and enable the discovery of complex disease relationships.
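
The encode-then-cluster workflow described in the Methods can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration. The six disease descriptions and their organ-system labels are invented rather than drawn from ICD-10 or Phecodes. The `embed` function is a hypothetical bag-of-words stand-in for a call to an LLM embedding API. Average-linkage agglomerative clustering and cluster purity are simple choices; the abstract does not state which clustering algorithm or evaluation measure the authors used.

```python
import math
import re
from collections import Counter

# Illustrative toy data (NOT from the study): description, organ-system label.
DISEASES = [
    ("Acute myocardial infarction of the heart", "circulatory"),
    ("Chronic ischemic heart disease", "circulatory"),
    ("Congestive heart failure", "circulatory"),
    ("Asthma with airway obstruction of the lung", "respiratory"),
    ("Emphysema, chronic damage to lung airway", "respiratory"),
    ("Pneumonia, infection of the lung", "respiratory"),
]
STOPWORDS = {"of", "the", "with", "to"}


def embed(text):
    """Hypothetical stand-in for an LLM embedding call.

    Returns a unit-length sparse bag-of-words vector as a dict. In a real
    pipeline this would be replaced by a dense vector from an embedding API.
    """
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    counts = Counter(words)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {w: c / norm for w, c in counts.items()}


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (a dot product)."""
    return sum(x * v.get(w, 0.0) for w, x in u.items())


def cluster(vectors, k):
    """Average-linkage agglomerative clustering down to k clusters.

    Labels are never consulted, so the grouping is unsupervised.
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sims = [cosine(vectors[i], vectors[j])
                        for i in clusters[a] for j in clusters[b]]
                score = sum(sims) / len(sims)
                if best is None or score > best[0]:
                    best = (score, a, b)
        _, a, b = best
        merged = clusters.pop(b)  # b > a, so index a is unaffected
        clusters[a].extend(merged)
    return clusters


def purity(clusters, labels):
    """Fraction of items that carry the majority label of their cluster."""
    majority = sum(Counter(labels[i] for i in c).most_common(1)[0][1]
                   for c in clusters)
    return majority / len(labels)


vectors = [embed(text) for text, _ in DISEASES]
labels = [label for _, label in DISEASES]
clusters = cluster(vectors, k=2)
print("clusters:", clusters)
print("purity vs. organ system:", purity(clusters, labels))
```

Purity is computed only after clustering, so the organ-system labels serve as an external check rather than as a training signal. This mirrors the abstract's point that the models were not trained to classify diseases by organ system.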

Type
Informatics, AI and Data Science
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives licence (https://creativecommons.org/licenses/by-nc-nd/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is unaltered and is properly cited. The written permission of Cambridge University Press must be obtained for commercial re-use or in order to create a derivative work.
Copyright
© The Author(s), 2025. The Association for Clinical and Translational Science