With advances in deep learning, neural networks can now learn high-quality general representations directly from raw speech and apply them to a variety of semantic and non-semantic speech tasks. For example, by analyzing non-semantic features of speech (such as articulation and resonance), certain cerebrovascular and neurodegenerative diseases (such as stroke, Parkinson's disease, and Alzheimer's disease) can be detected and monitored. Sounds produced by airflow in the respiratory system, such as coughs and breathing patterns, can also be used for health monitoring: clinicians can recognize the characteristic "whoop" of whooping cough (pertussis) or the wheezing that can accompany acute cardiovascular events.
Google's research team has developed a bioacoustic model called Health Acoustic Representations (HeAR), designed to detect disease by analyzing the body's acoustic signals, such as coughing, speech, and breathing. The model was trained on 313 million audio clips, including about 100 million cough sounds, to identify acoustic patterns associated with health.
The HeAR system was evaluated on 13 health acoustic event detection tasks, 14 cough inference tasks, and 6 lung function inference tasks, exceeding existing baseline models on many of them.
For example, among the cough inference tasks, HeAR performed best on 10 of the 14, including detection of COVID-19 and tuberculosis. HeAR also performed very well on lung function inference, especially for key measures such as forced expiratory volume in one second (FEV1) and forced vital capacity (FVC).
HeAR's goal is to help researchers develop custom bioacoustic models when data is limited, thereby accelerating research on specific diseases and populations.
Salcit Technologies in India has applied the HeAR model in a product called Swaasa® that analyzes cough sounds and assesses lung health, particularly for early detection of tuberculosis (TB), a treatable disease that nonetheless goes undiagnosed in millions of cases each year due to poor access to healthcare.
The company is exploring how HeAR can help expand the capabilities of its bioacoustic AI models. First, Swaasa® is using HeAR to study and enhance early detection of tuberculosis based on cough sounds.
Innovations and Features of HeAR
HeAR was trained using 313 million audio clips extracted from YouTube and evaluated on 33 tasks across 6 different datasets. These tasks include health acoustic event detection, cough inference, and lung function assessment.
Innovations
Self-supervised learning framework
The HeAR system uses a self-supervised learning (SSL) framework, which does not depend on large amounts of manually labeled data. By training a masked autoencoder (MAE), the system learns a general, low-dimensional audio representation from large-scale unlabeled audio. This substantially improves the model's generalization across tasks, especially on out-of-distribution (OOD) data.
Large-scale dataset training
The HeAR system was trained on a large-scale dataset of 313 million two-second audio clips extracted from three billion YouTube videos, covering a variety of non-semantic health acoustic events (such as coughing, breathing, etc.). The use of a large-scale dataset improves the robustness and wide applicability of the system.
Health Acoustic Event Detector
The system introduces a multi-label classification convolutional neural network (CNN) as a health acoustic event detector, which can identify non-speech health acoustic events in audio clips. These events include coughing, baby coughing, breathing, clearing the throat, laughing, and speaking. This detector not only enhances the functionality of the system, but also enables the system to handle a variety of different health acoustic tasks.
Multi-task performance evaluation
The HeAR system was benchmarked on 33 different health acoustic tasks, demonstrating its superior performance in a variety of tasks. In particular, on cough inference and lung function inference tasks, the HeAR system surpassed many existing technical benchmarks, demonstrating its potential as a general health acoustic model.
Key Features
Disease inference and screening
- The HeAR system encodes two-second audio clips into audio embeddings that can be used for downstream tasks. These embeddings can be applied directly in various health acoustic tasks, such as health event detection, cough inference, and lung function inference.
- HeAR can infer the likelihood of specific diseases by analyzing health acoustic signals such as cough sounds. For example, it can be used to screen for tuberculosis (TB), COVID-19, and chronic obstructive pulmonary disease (COPD). This is particularly suitable for resource-limited environments, where screening can be performed with simple audio-collection devices such as smartphones (a sketch of this embedding-plus-classifier workflow follows this list).
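As an illustration of this embedding-plus-classifier workflow (a sketch, not Google's published API), the snippet below trains a logistic-regression probe on frozen per-clip embeddings. The embed() function and its 512-dimensional output are hypothetical stand-ins for the HeAR encoder.

```python
# Sketch of disease screening with frozen embeddings: a linear probe
# (logistic regression) on top of per-clip embeddings. `embed` is a
# hypothetical stand-in for the HeAR encoder; the embedding size is assumed.
import numpy as np
from sklearn.linear_model import LogisticRegression

def embed(clips: np.ndarray) -> np.ndarray:
    """Placeholder for the HeAR encoder: 2 s clips -> fixed-size embeddings."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(clips), 512))  # assumed embedding size

clips = np.zeros((100, 32000))   # one hundred 2-second clips at 16 kHz
labels = np.repeat([0, 1], 50)   # e.g. TB-negative vs TB-positive coughs

probe = LogisticRegression(max_iter=1000).fit(embed(clips), labels)
risk = probe.predict_proba(embed(clips[:5]))[:, 1]  # screening scores
```

Because only the small probe is trained while the encoder stays frozen, this pattern needs far less labeled data than training a model from scratch.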
Health Acoustic Event Detection
- HeAR can detect and identify various health-related acoustic events from audio data, such as coughing, breathing, throat clearing, laughing, and talking. The detection of these events can be used to monitor health status and provide early warning of diseases.
- The HeAR system can infer relevant health information based on the cough sounds in the audio, such as detecting specific diseases (such as COVID-19 or tuberculosis), determining an individual's gender, age, BMI, and lifestyle habits (such as smoking status).
Lung function inference
- HeAR can estimate lung function parameters such as forced expiratory volume in one second (FEV1), forced vital capacity (FVC), and peak expiratory flow (PEF) by analyzing respiratory audio data.
- These estimates can help doctors monitor changes in a patient's lung function and support disease management. They are important for screening chronic obstructive pulmonary disease (COPD) and for ongoing monitoring of lung function (a regression sketch follows this list).
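The same embedding-plus-shallow-model pattern extends to regression. Below is a minimal sketch, with toy data and an assumed embedding size (this is not the paper's exact method), of mapping clip embeddings to an FEV1 estimate.

```python
# Sketch: regressing a lung-function value (e.g. FEV1, in liters) from clip
# embeddings with a shallow model. The embeddings and targets here are toy
# random values standing in for encoder outputs and spirometry readings.
import numpy as np
from sklearn.linear_model import Ridge

embeddings = np.random.default_rng(1).normal(size=(200, 512))
fev1_liters = np.random.default_rng(2).uniform(1.5, 4.5, size=200)

model = Ridge(alpha=1.0).fit(embeddings, fev1_liters)
estimate = model.predict(embeddings[:3])  # predicted FEV1 for new clips
```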
Equipment compatibility and environmental adaptability
- HeAR has been trained and tested on a variety of devices (such as different smartphone models) and can adapt to audio from different recording hardware. This makes it robust across devices in real-world applications and suitable for recordings of varying quality, including those made in resource-limited settings.
Self-supervised learning and data efficiency
- HeAR uses self-supervised learning to achieve strong task generalization by training on large amounts of unlabeled audio. Compared with traditional methods, it maintains high performance even when labeled data are scarce, which makes it effective in settings where health data are limited.
- Efficient data usage and generalization: HeAR demonstrates high efficiency and excellent task generalization in data-scarce situations. Thanks to its self-supervised learning framework, HeAR performs well on unseen data and devices, and maintains a high level of performance even when the amount of training data is reduced to 6.25% of the original.
Medical research and development support
- HeAR is a foundation model made available to researchers to accelerate the development of customized bioacoustic models for specific diseases and populations. This allows medical researchers to build health monitoring tools for specific application scenarios in less time.
Technical Methods of HeAR
The technical approach of the HeAR system consists of three main parts: data processing, model training, and task evaluation. The following is a detailed introduction to each part:
1. Data processing
Health Acoustic Event Detector
- The HeAR system first uses a multi-label classification convolutional neural network (CNN) as a health acoustic event detector to detect non-semantic health acoustic events in audio clips, including coughing, baby coughing, breathing, throat clearing, laughter, and speaking.
- Audio is converted to mono at a 16 kHz sampling rate and transformed into a log-mel spectrogram with 48 frequency bands covering 125 Hz to 7.5 kHz, with per-channel energy normalization (PCEN) applied.
- These spectrograms are fed into a small convolutional neural network trained with a balanced binary cross-entropy loss, which outputs a logit for each predicted class. A sketch of the preprocessing follows below.
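A minimal sketch of this preprocessing using librosa is shown below. The frame and hop lengths (25 ms / 10 ms) are assumptions, since the post does not specify them.

```python
# Sketch of the preprocessing described above: mono 16 kHz audio -> 48-band
# mel spectrogram (125 Hz - 7.5 kHz) with per-channel energy normalization.
# Frame and hop lengths are assumptions, not values from the paper.
import librosa
import numpy as np

def preprocess(path: str) -> np.ndarray:
    # Load as mono at 16 kHz and keep only the first two seconds.
    y, sr = librosa.load(path, sr=16000, mono=True)
    y = y[: 2 * sr]
    # 48-band mel spectrogram restricted to 125 Hz - 7.5 kHz.
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=400, hop_length=160,
        n_mels=48, fmin=125.0, fmax=7500.0, power=1.0,
    )
    # Per-channel energy normalization (PCEN) of the mel spectrogram.
    return librosa.pcen(mel * (2**31), sr=sr, hop_length=160)
```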
Dataset
- The HeAR system was trained using a dataset called YT-NS (YouTube Non-Semantic), which contains two-second audio clips extracted from three billion non-copyrighted YouTube videos, for a total of 313 million audio clips (about 174,000 hours of audio).
- Since most events of interest are short, a two-second time window was chosen; HeAR's audio encoder is trained entirely on this dataset. An illustrative windowing step is sketched below.
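For illustration (this helper is not from the paper), slicing a longer recording into such two-second clips at 16 kHz might look like this:

```python
# Illustrative: chop a long waveform into non-overlapping two-second,
# 16 kHz clips of the kind YT-NS is built from. Any trailing remainder
# shorter than two seconds is discarded.
import numpy as np

def two_second_clips(y: np.ndarray, sr: int = 16000) -> np.ndarray:
    clip_len = 2 * sr                      # 32,000 samples per clip
    n = len(y) // clip_len
    return y[: n * clip_len].reshape(n, clip_len)
```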
2. Model Training
Self-supervised learning framework
- The HeAR system adopts a generative framework based on self-supervised learning (SSL): specifically, a masked autoencoder (MAE) learns audio representations by reconstructing masked 16×16 spectrogram patches.
- During training, 75% of the input spectrogram patches are masked, and the visible patches are encoded by a ViT-L (Vision Transformer) encoder. Learnable mask tokens are then appended to the encoded token sequence, and an 8-layer transformer decoder reconstructs the missing patches, optimized by minimizing the L2 distance between the normalized patches and the predictions (see the sketch after this list).
- The HeAR system was trained using the AdamW optimizer for a total of 950k steps (approximately 4 epochs) with a global batch size of 4096. The learning rate was scheduled using cosine annealing with an initial learning rate of 4.8e-4, following the commonly used linear batch scaling rule.
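The sketch below illustrates this MAE objective in PyTorch under simplifying assumptions: small linear stand-ins replace the ViT-L encoder and 8-layer transformer decoder, positional embeddings are omitted, and the widths are toy values.

```python
# Minimal sketch of the MAE objective described above: mask 75% of
# spectrogram patches, encode the rest, and reconstruct the masked ones.
import torch
import torch.nn as nn

MASK_RATIO = 0.75    # fraction of patches hidden from the encoder
PATCH_DIM = 16 * 16  # flattened 16x16 spectrogram patch
EMB_DIM = 64         # toy width; the real encoder is a ViT-L

# Stand-ins for the ViT-L encoder and 8-layer transformer decoder.
encoder = nn.Sequential(nn.Linear(PATCH_DIM, EMB_DIM), nn.GELU(),
                        nn.Linear(EMB_DIM, EMB_DIM))
decoder = nn.Sequential(nn.Linear(EMB_DIM, EMB_DIM), nn.GELU(),
                        nn.Linear(EMB_DIM, PATCH_DIM))
mask_token = nn.Parameter(torch.zeros(1, 1, EMB_DIM))

def mae_loss(patches: torch.Tensor) -> torch.Tensor:
    # patches: (batch, num_patches, PATCH_DIM) flattened spectrogram tiles
    B, N, D = patches.shape
    num_keep = int(N * (1 - MASK_RATIO))
    # Randomly choose which patches the encoder sees.
    perm = torch.rand(B, N).argsort(dim=1)
    keep, hidden = perm[:, :num_keep], perm[:, num_keep:]
    visible = torch.gather(patches, 1, keep.unsqueeze(-1).expand(-1, -1, D))
    # Encode visible patches, then append learnable mask tokens for the rest.
    tokens = torch.cat([encoder(visible),
                        mask_token.expand(B, N - num_keep, -1)], dim=1)
    pred = decoder(tokens)[:, num_keep:]  # predictions at masked positions
    # Target: per-patch normalized pixels; loss is the L2 distance.
    target = torch.gather(patches, 1, hidden.unsqueeze(-1).expand(-1, -1, D))
    target = (target - target.mean(-1, keepdim=True)) / (
        target.std(-1, keepdim=True) + 1e-6)
    return ((pred - target) ** 2).mean()

# AdamW with cosine annealing, as in the training recipe described above
# (T_max here is a toy value, not the 950k-step schedule).
params = list(encoder.parameters()) + list(decoder.parameters()) + [mask_token]
opt = torch.optim.AdamW(params, lr=4.8e-4)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=1000)
```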
3. Benchmark Results
The HeAR system was extensively benchmarked on 33 tasks, covering three major categories of tasks: health acoustic event detection, cough inference, and lung function inference.
- Health Acoustic Event Detection: HeAR performs well on health acoustic event detection tasks, accurately identifying events such as coughing, breathing, throat clearing, and laughter. These detection tasks come from the FSD50K and FluSense datasets.
- Cough Inference Task: HeAR achieved top results on 10 out of 14 cough inference tasks, including diagnosing specific diseases (such as COVID-19 and tuberculosis) and inferring demographic information (such as gender, smoking status, age, etc.).
- Lung Function Inference: Among the six lung function tasks (forced expiratory volume, forced vital capacity, peak expiratory flow, etc.), HeAR outperformed the other baseline models on four.
1. Overall performance
- Across all tasks, the HeAR system achieved a Mean Reciprocal Rank (MRR) of 0.708 and was the best performer on 17 of the 33 tasks, demonstrating its strength as a general-purpose health acoustic model.
- Broken down by category, HeAR achieved the highest score on 3 health acoustic event detection tasks, 10 cough inference tasks, and 4 lung function inference tasks (3 + 10 + 4 = 17). A minimal illustration of the MRR computation follows below.
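For reference, MRR here averages the reciprocal of a model's rank across tasks: a model scores 1.0 on a task it wins, 0.5 where it is second-best, and so on.

```python
# Illustrative computation of Mean Reciprocal Rank (MRR) across tasks.
def mean_reciprocal_rank(ranks: list[int]) -> float:
    return sum(1.0 / r for r in ranks) / len(ranks)

# e.g. best on two tasks, second-best on one -> (1 + 1 + 0.5) / 3 ≈ 0.833
print(mean_reciprocal_rank([1, 1, 2]))
```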
2. Health Acoustic Event Detection
- Datasets: FSD50K and FluSense.
- Main results: In the health acoustic event detection task, while the CLAP model performed best overall (mean average precision of 0.691 and MRR of 0.846), HeAR performed best among models not trained with FSD50K (mean average precision of 0.658 and MRR of 0.538).
- HeAR performs well on breathing, coughing, laughing, and sneezing detection tasks. For example, on the FSD50K breathing-sound detection task, HeAR achieves an average precision of 0.434, significantly higher than other models.
3. Cough inference
- Datasets: CoughVID, Coswara, and the CIDRZ tuberculosis dataset.
- Main results: HeAR performs best in 10 out of 14 cough inference tasks, especially in detecting COVID-19, tuberculosis, chest X-ray (CXR) abnormalities, and inferring gender, age, and BMI.
- In the tuberculosis detection task on the CIDRZ dataset, HeAR achieved an AUROC of 0.710, significantly higher than the other baseline models. On the gender inference task, HeAR achieved an AUROC of 0.897 on the CoughVID dataset.
4. Inference of lung function
- Dataset: SpiroSmart.
- Main results: In the lung function inference task, HeAR performed best in 4 out of 6 tasks, including forced expiratory volume (FEV1), forced vital capacity (FVC), and gender classification.
- For example, in the FEV1 inference task, HeAR achieves a mean absolute error of 0.418, outperforming other baseline models.
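For context, these two task families are scored differently: binary screening tasks by AUROC and lung-function regression by mean absolute error. The toy snippet below (values are not from the paper) shows both.

```python
# Illustrative scoring for the two kinds of tasks above. The arrays here
# are toy values, not data from the HeAR benchmarks.
from sklearn.metrics import roc_auc_score, mean_absolute_error

y_true, y_score = [0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]
print(roc_auc_score(y_true, y_score))             # classification, e.g. TB status

fev1_true, fev1_pred = [2.9, 3.4, 1.8], [3.1, 3.0, 2.1]
print(mean_absolute_error(fev1_true, fev1_pred))  # regression, e.g. FEV1
```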
5. Equipment robustness test
- In the device robustness test of the CIDRZ dataset, HeAR performed consistently on different recording devices, demonstrating its adaptability to different devices. This is particularly important in practical applications, especially in resource-limited environments.
6. Other findings
- Data efficiency: The HeAR system can achieve comparable performance to other models even with a smaller amount of training data, which means that HeAR has potential advantages in the field of health research where data is scarce.
- Model size and deployment: Although HeAR has excellent performance, its model is large and may be difficult to run directly on devices such as smartphones. Future research may need to explore model distillation or quantization techniques to improve its usability on mobile devices.
Application prospects
The success of the HeAR system demonstrates the great potential in health acoustics research, especially in the following key areas:
1. Early disease detection and monitoring
- Respiratory disease monitoring: HeAR can detect and draw inferences from health acoustic signals such as breathing and coughing, giving it important applications in the early detection and monitoring of respiratory diseases such as COVID-19, tuberculosis, asthma, and chronic obstructive pulmonary disease. By deploying HeAR on smartphones or other portable devices, patients can monitor their health in daily life, enabling early intervention and disease management.
- Chronic disease management: For patients with chronic diseases that require long-term monitoring, such as chronic obstructive pulmonary disease (COPD), HeAR can provide real-time health monitoring services by analyzing daily coughing and breathing patterns, helping doctors better understand the progression of the disease and adjust treatment plans in a timely manner.
- Potential impact: This type of continuous health monitoring could help improve the management of chronic diseases, enhance patients' quality of life, and reduce the risk of acute exacerbations.
2. Medical support in low-resource settings
- Mobile health (mHealth) applications: In resource-limited areas, the problem of insufficient medical resources remains serious. Due to its strong device robustness and low data requirements, the HeAR system can be deployed in these environments through devices such as smartphones to help local medical staff improve disease screening and diagnosis capabilities, especially in the absence of complex medical equipment.
- Telemedicine: HeAR can collect and analyze health acoustic data, such as coughing and breathing sounds, through smartphones or other portable devices, for remote health monitoring. This is especially important in areas with limited resources and can significantly improve the accessibility of medical services.
- Potential impact: Integrated into telemedicine platforms, HeAR could help medical professionals remotely analyze patients' health acoustic signals and make preliminary assessments, reducing the need for in-person visits, especially during epidemics and other emergencies. Residents could complete basic health screening and monitoring without visiting a medical institution, which is significant for public health outreach and early disease detection.
3. Public health surveillance
- Epidemic monitoring: By deploying the HeAR system on a large scale, public health departments can monitor health acoustic signals in a specific area in real time, such as changes in coughing and breathing patterns, to detect possible epidemic outbreaks early. For example, in crowded public places such as airports and stations, the HeAR system can be used as a contactless, real-time health monitoring tool.
- Health data analysis and research: The health acoustic data generated by the HeAR system can provide important data support for public health research, helping researchers to better understand the acoustic characteristics of different diseases and their transmission patterns in different populations and environments.
4. Personalized health assistant
- Health management and fitness applications: The HeAR system can be integrated with personal health management applications to help users track their respiratory health and coughing patterns and provide personalized health advice. For example, it can remind users to take appropriate preventive measures when early symptoms of a cold or respiratory infection appear.
- Smart device integration: The HeAR system can be integrated into smart home devices, such as smart speakers or smartphones, to provide 24-hour uninterrupted health monitoring services by continuously monitoring the user's health acoustic signals.
5. Clinical research and development
- New drug development and clinical trials: The HeAR system can be used in clinical trials to help researchers evaluate the effectiveness and side effects of drugs by analyzing participants' health acoustic signals, especially in studies involving respiratory diseases.
- Expanded applications of medical artificial intelligence: As a universal health acoustic model, HeAR can be integrated with other medical artificial intelligence systems to form a multimodal health assessment platform, thereby providing more comprehensive health monitoring and disease detection services.
Official blog introduction: https://blog.google/technology/health/ai-model-cough-disease-detection/
API: https://github.com/Google-Health/google-health/blob/master/health_acoustic_representations/README.md