We use cookies to distinguish you from other users and to provide you with a better experience on our websites. Close this message to accept cookies or find out how to manage your cookie settings.
Online ordering will be unavailable from 17:00 GMT on Friday, April 25 until 17:00 GMT on Sunday, April 27 due to maintenance. We apologise for the inconvenience.
This journal utilises an Online Peer Review Service (OPRS) for submissions. By clicking "Continue" you will be taken to our partner site
https://www.editorialmanager.com/eurpsy/default.aspx.
Please be aware that your Cambridge account is not valid for this OPRS and registration is required. We strongly advise you to read all "Author instructions" in the "Journal information" area prior to submitting.
To save this undefined to your undefined account, please select one or more formats and confirm that you agree to abide by our usage policies. If this is the first time you used this feature, you will be asked to authorise Cambridge Core to connect with your undefined account.
Find out more about saving content to .
To send this article to your Kindle, first ensure [email protected] is added to your Approved Personal Document E-mail List under your Personal Document Settings on the Manage Your Content and Devices page of your Amazon account. Then enter the ‘name’ part of your Kindle email address below. Find out more about sending to your Kindle.
Find out more about saving to your Kindle.
Note you can select to save to either the @free.kindle.com or @kindle.com variations. ‘@free.kindle.com’ emails are free but can only be saved to your device when it is connected to wi-fi. ‘@kindle.com’ emails can be delivered even when you are not connected to wi-fi, but note that service fees apply.
Recent advances in natural language processing (NLP), particularly in language processing methods, have opened new avenues in semantic data analysis. A promising application of NLP is data harmonization in questionnaire-based cohort studies, where it can be used as an additional method, specifically when only different instruments are available for one construct as well as for the evaluation of potentially new construct-constellations. The present article therefore explores embedding models’ potential to detect opportunities for semantic harmonization.
Methods
Using models like SBERT and OpenAI’s ADA, we developed a prototype application (“Semantic Search Helper”) to facilitate the harmonization process of detecting semantically similar items within extensive health-related datasets. The approach’s feasibility and applicability were evaluated through a use case analysis involving data from four large cohort studies with heterogeneous data obtained with a different set of instruments for common constructs.
Results
With the prototype, we effectively identified potential harmonization pairs, which significantly reduced manual evaluation efforts. Expert ratings of semantic similarity candidates showed high agreement with model-generated pairs, confirming the validity of our approach.
Conclusions
This study demonstrates the potential of embeddings in matching semantic similarity as a promising add-on tool to assist harmonization processes of multiplex data sets and instruments but with similar content, within and across studies.
Cardiovascular disease (CVD) is twice as prevalent among individuals with mental illness compared to the general population. Prevention strategies exist but require accurate risk prediction. This study aimed to develop and validate a machine learning model for predicting incident CVD among patients with mental illness using routine clinical data from electronic health records.
Methods
A cohort study was conducted using data from 74,880 patients with 1.6 million psychiatric service contacts in the Central Denmark Region from 2013 to 2021. Two machine learning models (XGBoost and regularised logistic regression) were trained on 85% of the data from six hospitals using 234 potential predictors. The best-performing model was externally validated on the remaining 15% of patients from another three hospitals. CVD was defined as myocardial infarction, stroke, or peripheral arterial disease.
Results
The best-performing model (hyperparameter-tuned XGBoost) demonstrated acceptable discrimination, with an area under the receiver operating characteristic curve of 0.84 on the training set and 0.74 on the validation set. It identified high-risk individuals 2.5 years before CVD events. For the psychiatric service contacts in the top 5% of predicted risk, the positive predictive value was 5%, and the negative predictive value was 99%. The model issued at least one positive prediction for 39% of patients who developed CVD.
Conclusions
A machine learning model can accurately predict CVD risk among patients with mental illness using routinely collected electronic health record data. A decision support system building on this approach may aid primary CVD prevention in this high-risk population.