How to create predictive health models with PyHealth?

Machine learning has been applied to many health-related tasks, such as the development of new medical treatments, the management of patient data and records, and the treatment of chronic diseases. Achieving state-of-the-art results in these applications depends on the time-consuming process of building and evaluating models. To ease this burden, Yue Zhao et al. developed PyHealth, a Python-based toolkit. As the name suggests, the toolkit contains a variety of ML models and architectures for working with medical data. In this article, we will walk through the toolkit to understand how it works and how it can be applied. Below are the main points that we will cover.


  1. Machine learning in healthcare
  2. How can PyHealth help in healthcare?
  3. How PyHealth works
  4. PyHealth for building models

Let’s first talk about the use case of machine learning in healthcare.

Machine learning in healthcare

Machine learning is used in a variety of healthcare settings, from managing cases of common chronic diseases to mining patient health data in conjunction with environmental factors such as pollution exposure and weather conditions.

Machine learning technology can help healthcare professionals develop precise drug treatments tailored to individual characteristics by analyzing huge amounts of data. Here are some examples of applications that can be covered in this segment:

Disease detection

The ability to quickly and correctly diagnose disease is one of the most critical aspects of a successful healthcare organization. In high-needs areas like cancer diagnosis and treatment, where hundreds of drugs are currently in clinical trials, scientists and computer scientists come into play. One method combines cognitive computing with genetic tumor sequencing, while another uses machine learning to provide diagnosis and treatment in a range of fields, including oncology.

Image-based diagnosis

Medical imaging, and its ability to provide a complete picture of disease, is another important aspect of diagnosis. As data sources become more diverse, deep learning becomes more accessible and can be used in the diagnostic process, which makes it increasingly important. While these machine learning applications are often correct, they are limited in that they cannot explain how they arrived at their conclusions.

Drug discovery

ML has the potential to identify new drugs with significant economic benefits for pharmaceutical companies, hospitals and patients. Some of the world’s biggest tech companies, like IBM and Google, have developed ML systems to help patients find new treatment options. Precision medicine is an important term in this field since it involves understanding the mechanisms underlying complex disorders and developing alternative therapeutic pathways.

Surgical tools

Due to the high-risk nature of surgeries, we will always need human assistance, but machine learning has proven extremely useful in the robotic surgery industry. The da Vinci robot, which allows surgeons to operate robotic arms to perform highly detailed surgery in confined areas, is one of the most popular breakthroughs in the profession.

These robotic arms are generally more precise and stable than human hands. Other instruments use computer vision and machine learning to determine distances between different parts of the body so that surgery can be performed correctly.

How can PyHealth help in healthcare?

Healthcare data is typically noisy, complicated, and heterogeneous, resulting in a diverse set of healthcare modeling problems. For example, health risk prediction is based on sequential patient data, disease diagnosis is based on medical images, and risk detection is based on continuous physiological signals, such as electroencephalogram (EEG) or electrocardiogram (ECG) recordings, and on multimodal clinical notes (e.g., text and images). Despite the importance of ML in healthcare research and clinical decision-making, the complexity and variability of healthcare data and tasks call for the long-overdue development of a specialized ML system for benchmarking predictive health models.

PyHealth is composed of three modules: data preprocessing, predictive modeling, and evaluation. Its target users are computer scientists and health data scientists, who can run complex machine learning pipelines on health datasets in fewer than 10 lines of code.

The data preprocessing module converts complex health datasets, such as longitudinal electronic health records, medical images, continuous signals (e.g., electrocardiograms), and clinical notes, into formats suitable for machine learning.
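As a rough illustration of what such a conversion involves, the sketch below turns a toy longitudinal record (a list of visits, each with a list of diagnosis codes) into fixed-length multi-hot sequences. The helper names `build_vocab` and `encode_patient`, the example codes, and the zero-padding scheme are assumptions made for this example; this is not PyHealth's actual preprocessing code.

```python
from typing import Dict, List

Visit = List[str]          # codes recorded during one visit
Patient = List[Visit]      # visits in chronological order


def build_vocab(patients: List[Patient]) -> Dict[str, int]:
    """Map every distinct code to a column index (sorted for determinism)."""
    codes = sorted({code for visits in patients for visit in visits for code in visit})
    return {code: idx for idx, code in enumerate(codes)}


def encode_patient(visits: Patient, vocab: Dict[str, int], max_visits: int) -> List[List[int]]:
    """Return a max_visits x len(vocab) multi-hot matrix, zero-padded at the end."""
    rows = []
    for visit in visits[:max_visits]:      # truncate overly long histories
        row = [0] * len(vocab)
        for code in visit:
            row[vocab[code]] = 1
        rows.append(row)
    while len(rows) < max_visits:          # pad short histories with empty visits
        rows.append([0] * len(vocab))
    return rows


# Toy data: two patients with hypothetical diagnosis codes.
patients = [
    [["I10", "E11"], ["I10"]],
    [["J45"]],
]
vocab = build_vocab(patients)
encoded = [encode_patient(p, vocab, max_visits=3) for p in patients]
```

Every patient ends up with the same matrix shape, which is what sequence models typically expect as input.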

The Predictive Modeling Module offers over 30 machine learning models, including known ensemble trees and deep neural network-based approaches, using a uniform yet flexible API suitable for researchers and practitioners.

The evaluation module includes a number of evaluation strategies (e.g., cross-validation and train-validation-test split) as well as metrics for prediction models.
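To make the two evaluation strategies concrete, here is a minimal sketch in plain Python of a train-validation-test split, k-fold cross-validation indices, and an accuracy metric. The function names, the default fractions, and the contiguous fold layout are assumptions chosen for illustration and do not reflect how PyHealth implements its evaluation module.

```python
import random
from typing import List, Sequence, Tuple


def train_valid_test_split(n_samples: int, valid_frac: float = 0.2,
                           test_frac: float = 0.2, seed: int = 0
                           ) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle sample indices and cut them into three disjoint parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(n_samples * test_frac)
    n_valid = int(n_samples * valid_frac)
    test = indices[:n_test]
    valid = indices[n_test:n_test + n_valid]
    train = indices[n_test + n_valid:]
    return train, valid, test


def k_fold_indices(n_samples: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Return (train, held_out) index lists for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        held_out = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        folds.append((train, held_out))
        start += size
    return folds


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


train_idx, valid_idx, test_idx = train_valid_test_split(10)
folds = k_fold_indices(10, k=3)
```

In cross-validation, each sample is held out exactly once, so the metric can be averaged over the k folds.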

There are five distinct advantages to using PyHealth. First, it contains over 30 state-of-the-art predictive health algorithms, including both traditional techniques such as XGBoost and newer deep learning architectures such as autoencoders, convolution-based models, and adversarial models.

Second, PyHealth is broad in scope and includes models for a variety of data types, including sequences, images, physiological signals, and unstructured text data. Third, for clarity and ease of use, PyHealth includes a unified API, detailed documentation, and interactive examples for all algorithms. Complex deep learning models can be implemented in less than ten lines of code.

Fourth, most models in PyHealth are covered by unit tests with cross-platform continuous integration, code coverage, and code maintainability checks. Finally, for efficiency and scalability, parallelization is enabled in some modules (data preprocessing), and deep learning models benefit from fast GPU computation through PyTorch.

How PyHealth works

PyHealth is a Python 3 library built on NumPy, SciPy, scikit-learn and PyTorch. As shown in the diagram below, PyHealth consists of three main modules. First, the data preprocessing module validates and converts user input into a format that learning models can understand.

Second, the predictive modeling module is a collection of models organized by input data type: sequences, images, EEG and text. For each data type, a set of dedicated learning models has been implemented. Third, the evaluation module automatically infers the task type, such as multi-class classification, and performs a comprehensive evaluation for that task type.

Most of the learning models share the same interface, which is inspired by the scikit-learn API design and by general deep learning practice: (i) fit learns the weights and saves the necessary statistics from the training and validation data; (ii) load_model chooses the model with the best validation accuracy; and (iii) inference predicts on the incoming test data.
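The fit, load-model and inference pattern described above can be sketched with a toy one-dimensional perceptron. The class `ToyPerceptron`, its constructor argument, and its checkpoint format are hypothetical and exist only to show the workflow; PyHealth's real models and signatures may differ.

```python
from typing import List, Sequence, Tuple

Split = Tuple[Sequence[float], Sequence[int]]   # (features, binary labels)


class ToyPerceptron:
    """Hypothetical 1-D model exposing fit / load_model / inference."""

    def __init__(self, n_epoch: int = 5) -> None:
        self.n_epoch = n_epoch
        self.checkpoints: List[Tuple[float, float, float]] = []  # (valid_acc, w, b)
        self.w = 0.0
        self.b = 0.0

    def _predict_one(self, x: float) -> int:
        return int(self.w * x + self.b > 0)

    def _accuracy(self, xs: Sequence[float], ys: Sequence[int]) -> float:
        return sum(self._predict_one(x) == y for x, y in zip(xs, ys)) / len(ys)

    def fit(self, train: Split, valid: Split) -> "ToyPerceptron":
        """Update the weights on the training data; score each epoch on validation."""
        train_x, train_y = train
        valid_x, valid_y = valid
        for _ in range(self.n_epoch):
            for x, y in zip(train_x, train_y):
                error = y - self._predict_one(x)   # perceptron update rule
                self.w += error * x
                self.b += error
            self.checkpoints.append((self._accuracy(valid_x, valid_y), self.w, self.b))
        return self

    def load_model(self) -> "ToyPerceptron":
        """Restore the checkpoint with the best validation accuracy."""
        _, self.w, self.b = max(self.checkpoints, key=lambda c: c[0])
        return self

    def inference(self, xs: Sequence[float]) -> List[int]:
        """Predict labels for unseen data with the currently loaded weights."""
        return [self._predict_one(x) for x in xs]


train = ([1.0, 2.0, 8.0, 9.0], [0, 0, 1, 1])
valid = ([1.5, 8.5], [0, 1])
clf = ToyPerceptron(n_epoch=5)
clf.fit(train, valid)
clf.load_model()
predictions = clf.inference([0.5, 9.5])
```

Separating `load_model` from `fit` matters because the final epoch is not necessarily the best one: in this toy run an intermediate epoch has the highest validation accuracy, and that checkpoint is the one restored before inference.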

For fast exploration of data and models, the framework includes a library of helper functions and utilities (parameter checking, label checking and partition estimators). For example, a label check can inspect the data labels and automatically infer the task type, such as binary classification or multi-class classification.
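A label check of this kind could look roughly like the following sketch. The function name `infer_task_type`, the returned strings, and the handling of indicator rows (including the multi-label case) are assumptions for illustration, not PyHealth's actual utility.

```python
from typing import List, Sequence, Union

Labels = Sequence[Union[int, Sequence[int]]]


def infer_task_type(labels: Labels) -> str:
    """Guess the task type from the label structure (hypothetical helper)."""
    if len(labels) == 0:
        raise ValueError("labels must not be empty")
    if all(isinstance(row, (list, tuple)) for row in labels):
        # Indicator rows: more than one active class in a row means multi-label.
        if any(sum(row) > 1 for row in labels):
            return "multilabel"
        n_classes = len(labels[0])
    else:
        # Flat labels: count the distinct class values.
        n_classes = len(set(labels))
    return "binary" if n_classes <= 2 else "multiclass"


examples: List[Labels] = [
    [0, 1, 1, 0],                 # two distinct values
    [0, 2, 1, 1],                 # three distinct values
    [[1, 0, 0], [0, 1, 0]],       # one-hot rows over three classes
    [[1, 0, 1], [0, 1, 0]],       # a row with two active classes
]
task_types = [infer_task_type(labels) for labels in examples]
```

Once the task type is known, an evaluation routine can pick suitable metrics for it without the user having to specify them.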

PyHealth for building models

We will now look at how to use the framework's API. First, we install the package using pip.

! pip install pyhealth

Next, we load the data from the repository itself, which means cloning it first. The repository's datasets folder contains a variety of datasets, such as sequence-based and image-based data. We use the MIMIC dataset, which is shipped as a zip archive and must be unzipped. The snippet below clones the repository and unpacks the data.

! git clone
! unzip /content/PyHealth/datasets/

The unzipped files are saved in the current working directory in a folder named mimic. To use this dataset, we load the sequence data generator, which prepares the dataset for experimentation.

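# import path as used in the PyHealth examples (assumed; adjust to your installed version)
from pyhealth.data.expdata_generator import sequencedata as expdata_generator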
# initialize the dataset
# unique id for dataset
expdata_id = ''
cur_dataset = expdata_generator(expdata_id=expdata_id)
cur_dataset.get_exp_data(sel_task='phenotyping', data_root="/content/mimic")

We have now loaded the dataset. Now we can do further modeling as below.

# load and fit the model
from pyhealth.models.sequence.embedgru import EmbedGRU
# unique id for model
expmodel_id = '2020.0811.model.phenotyping.test.v2'
clf = EmbedGRU(expmodel_id=expmodel_id, n_batchsize=5, use_gpu=False)
# fit the model on the training split, validating on the validation split
clf.fit(cur_dataset.train, cur_dataset.valid)

Here is the result of fitting the model.

Last words

Through this article, we have discussed how machine learning can be used in the healthcare industry by looking at its different applications. As this area is large and has many applications, we discussed a Python-based toolkit designed for building predictive models using various deep learning techniques, such as LSTM and GRU for sequence data and CNN for image-based data.
