Background:
Electronic health record (EHR) analysis is pivotal in advancing medical research. Numerous real-world EHR data providers offer data access through exported datasets. While such exports enable profound research possibilities, they require quality control and restructuring before meaningful analysis is possible. Challenges arise in the analysis of medical event sequences (e.g., diagnoses or procedures), which provide critical insights into the progression of conditions, treatments, and outcomes. Identifying causal relationships, patterns, and trends requires a more complex approach to data mining and preparation.
Methods:
This paper introduces EHRchitect, an application written in Python that addresses these quality control challenges by automating dataset transformation: it creates a clean, formatted, and optimized MySQL database (DB) and extracts sequential data according to the user's configuration.
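The abstract does not describe EHRchitect's configuration format, so the sketch below is purely illustrative of the kind of study configuration mentioned here (ordered events with time constraints); every field name is hypothetical and not taken from the tool's documentation.

    # Hypothetical study configuration; the field names are illustrative
    # only and do not reflect EHRchitect's actual schema.
    study_config = {
        "events": [
            {"name": "t2d_diagnosis", "codes": ["E11"], "vocabulary": "ICD-10"},
            {"name": "metformin_start", "codes": ["860975"], "vocabulary": "RxNorm"},
        ],
        # Require the diagnosis to precede the prescription by 0-365 days.
        "transitions": [
            {"from": "t2d_diagnosis", "to": "metformin_start",
             "min_days": 0, "max_days": 365},
        ],
        "output": {"format": "parquet", "partitions": 8},
    }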
Results:
The tool creates a clean, formatted, and optimized DB, enabling medical event sequence data extraction according to the user's study configuration. Event sequences encompass a patient's medical events in specified orders and time intervals. The extracted data are presented as distributed Parquet files incorporating events, event transitions, patient metadata, and event metadata. The concurrent approach allows effortless scaling on multi-processor systems.
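Because the output is described as distributed Parquet files, downstream analysis can rely on standard Parquet readers. The snippet below is a minimal sketch using pandas with the pyarrow engine; the directory layout and column names are assumptions, not EHRchitect's documented schema.

    import pandas as pd

    # pandas (with the pyarrow engine) reads every part file in a
    # Parquet dataset directory. Paths and columns are hypothetical.
    events = pd.read_parquet("output/events/")            # one row per medical event
    transitions = pd.read_parquet("output/transitions/")  # one row per event-to-event transition

    # Example: distribution of time gaps between consecutive events.
    print(transitions["days_between"].describe())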
Conclusion:
EHRchitect streamlines the processing of large EHR datasets for research purposes. It facilitates the extraction of sequential, event-based data and offers a highly flexible framework for configuring event and timeline parameters. The tool delivers temporal characteristics, patient demographics, and event metadata to support comprehensive analysis. By automating data quality control and simplifying event extraction, it significantly reduces the time required for dataset acquisition and preparation.
Raw data require a great deal of cleaning, coding, and categorizing of observations. Vague standards for this data work can make it troublingly ad hoc, with much opportunity and temptation to influence the final results. Preprocessing rules and assumptions are not often seen as part of the model, but they can influence the result just as much as control variables or functional form assumptions. In this chapter, we discuss the main data processing decisions that analysts often face and how they can affect the results: coding and classifying of variables, processing anomalous and outlier observations, and the use of sample weights.
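As a minimal, made-up illustration of how a preprocessing rule can move a result as much as a modelling choice, the sketch below compares a mean computed before and after a common outlier-trimming rule; the data and the 1.5 x IQR threshold are purely illustrative.

    import numpy as np

    # Made-up values; the last observation is an extreme outlier.
    income = np.array([32, 41, 38, 45, 29, 36, 40, 420], dtype=float)
    mean_raw = income.mean()

    # One common rule: drop points beyond 1.5 * IQR from the quartiles.
    # The choice of rule and threshold is itself an analyst decision.
    q1, q3 = np.percentile(income, [25, 75])
    iqr = q3 - q1
    keep = (income >= q1 - 1.5 * iqr) & (income <= q3 + 1.5 * iqr)
    mean_trimmed = income[keep].mean()

    print(f"mean with outlier:   {mean_raw:.1f}")      # ~85.1
    print(f"mean after trimming: {mean_trimmed:.1f}")  # ~37.3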
Once the information about nodes, links, and their substantive attributes has been collected, a bit more work is needed to prepare to use the data. This chapter covers this intermediate step, with tips for organizing and cleaning the data. Reading this chapter before collecting the data in the first place will help avoid some serious pitfalls. It covers ethical issues pertaining to collecting names (a necessary step in most methods of network elicitation), a method for automating the cleaning of name data, and robustness checks that can be done to assess the cleaning.
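The chapter's own procedure for automating name cleaning is not reproduced here; as a generic sketch of the idea, the snippet below groups near-duplicate name strings using Python's standard-library difflib. The names and the similarity threshold are illustrative, and the threshold is exactly the kind of choice the robustness checks mentioned above would probe.

    from difflib import SequenceMatcher

    # Illustrative roster entries containing spelling variants of the same person.
    names = ["Jon Smith", "John Smith", "J. Smith", "Maria Garcia", "Maria Garica"]

    def similar(a: str, b: str, threshold: float = 0.85) -> bool:
        """Treat two name strings as the same person if they are highly similar."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

    # Greedy clustering: assign each name to the first cluster it matches.
    clusters = []
    for name in names:
        for cluster in clusters:
            if similar(name, cluster[0]):
                cluster.append(name)
                break
        else:
            clusters.append([name])

    # "J. Smith" falls below the 0.85 threshold and stays separate,
    # which is why the chosen cutoff should be checked for robustness.
    print(clusters)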
High-quality data are necessary for drawing valid research conclusions, yet errors can occur during data collection and processing. These errors can compromise the validity and generalizability of findings. To achieve high data quality, one must approach data collection and management anticipating the errors that can occur and establishing procedures to address errors. This chapter presents best practices for data cleaning to minimize errors during data collection and to identify and address errors in the resulting data sets. Data cleaning begins during the early stages of study design, when data quality procedures are set in place. During data collection, the focus is on preventing errors. When entering, managing, and analyzing data, it is important to be vigilant in identifying and reconciling errors. During manuscript development, reporting, and presentation of results, all data cleaning steps taken should be documented and reported. With these steps, we can ensure the validity, reliability, and representative nature of the results of our research.
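As one concrete illustration (a generic sketch, not the chapter's own protocol), the snippet below flags out-of-range and logically inconsistent records in a small, made-up data set so they can be reconciled against source documents; the variables and rules are assumptions.

    import pandas as pd

    # Made-up study records; the variables and validation rules are illustrative.
    df = pd.DataFrame({
        "id":        [1, 2, 3, 4],
        "age":       [34, 212, 51, 27],  # 212 is an implausible age
        "enrolled":  ["2021-03-01", "2021-04-15", "2021-05-02", "2021-06-20"],
        "follow_up": ["2021-09-01", "2021-02-01", "2021-11-10", "2021-12-05"],
    })
    df["enrolled"] = pd.to_datetime(df["enrolled"])
    df["follow_up"] = pd.to_datetime(df["follow_up"])

    # Range check: plausible ages; consistency check: follow-up after enrolment.
    bad_age = ~df["age"].between(0, 120)
    bad_dates = df["follow_up"] < df["enrolled"]

    # Records flagged for review and reconciliation.
    print(df[bad_age | bad_dates])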
Edited by
Ruth Kircher, Mercator European Research Centre on Multilingualism and Language Learning, and Fryske Akademy, Netherlands, and Lena Zipp, Universität Zürich
The questionnaire, one of the most frequently used methods in the study of language attitudes, can be used to elicit both qualitative and quantitative data. This chapter focuses on the questionnaire as a means of eliciting quantitative data by means of closed questions. It begins by examining the strengths of doing this (e.g. the fact that the resulting data can easily be compared and analysed across participants) as well as the limitations (e.g. the fact that issues unforeseen by the researcher usually do not come to the fore). The chapter then discusses key issues in research planning and design: for example, question types, question wording, question order, reliability and validity, and more general issues regarding questionnaire design. The chapter also considers questionnaire distribution. The exploration of data analysis and interpretation focuses on data cleaning and coding, statistical analyses, and some points of caution regarding the interpretation of findings from questionnaire-based studies. A case study of language attitudes in Quebec serves to illustrate the main points made in the chapter. The chapter concludes with further important considerations regarding the context-specificity of findings and the benefits of combining questionnaires with other methods of attitude elicitation.
This chapter presents a detailed example that applies the compensation analytics concepts developed in Chapter 6. The reader is assumed to be a compensation consultant charged with evaluating whether gender-based discrimination in pay is present in a public university system in the sciences. Section 7.1 walks through the analysis step by step, from formulating the business question, to acquiring and cleaning data, to analyzing the data and interpreting the results from voluminous statistical output in light of the business question. Section 7.2 covers exploratory data mining, causality, and experiments. Exploratory data mining covers situations in which the manager does not know in advance which relationships in the data will be of interest, in contrast to the example in Section 7.1, in which a statistical model and specific measures could be constructed that were directly tailored to address the business question at hand. Section 7.2 also covers the challenges associated with establishing causality in compensation research and how experiments can sometimes be designed to address those challenges. Randomization and some pitfalls associated with compensation experiments are also covered.
This chapter responds to the growing importance of business analytics on "big data" in managerial decision-making, by providing a comprehensive primer on analyzing compensation data. All aspects of compensation analytics are covered, starting with data acquisition, types of data, and formulation of a business question that can be informed by data analysis. A detailed, hands-on treatment of data cleaning is provided, equipping readers to prepare data for analysis by detecting and fixing data problems. Descriptive statistics are reviewed, and their utility in data cleaning explicated. Graphical methods are used in examples to detect and trim outliers. The basics of linear regression analysis are covered, with an emphasis on application and interpreting results in the context of the business question(s) posed. One section covers the question of whether or not the pay measure (as a dependent variable) should be transformed via a logarithm, and the implications of that choice for interpreting the results are explained. Precision of regression estimates is covered via an intuitive, non-technical treatment of standard errors. An appendix covers nonlinear relationships among variables.
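To make the log-transformation point concrete (a generic sketch with simulated data, not the chapter's worked example), the snippet below regresses log pay on years of experience; with log pay as the dependent variable, a slope b is read as roughly a 100*b percent pay difference per additional year.

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulated data in which pay grows by roughly 3% per year of experience.
    experience = rng.uniform(0, 30, size=500)
    pay = 40_000 * np.exp(0.03 * experience + rng.normal(0, 0.1, size=500))

    # Least-squares fit of log(pay) on experience.
    slope, intercept = np.polyfit(experience, np.log(pay), deg=1)

    # With a log dependent variable, the slope is approximately the
    # proportional change in pay per additional year of experience.
    print(f"estimated effect: {slope:.4f}  (~{100 * slope:.1f}% per year)")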