Hybrid LLM and Machine Learning Framework Enhances Early Fire Detection in Subway Tunnels

Key Takeaways

The sensitivity analysis reveals a trade-off between detection sensitivity and premature-alarm suppression.
Lower decision thresholds improve F1 scores but increase the Pre-Alarm Rate, while higher thresholds can reduce false alarms at the cost of lower F1 scores.
The architectural design using LLM as a feature extractor outperforms direct LLM prompting methods in fire detection accuracy.

Sensitivity Analysis Results

Analysis of the LLM-augmented classifiers highlights their performance across various decision thresholds (\(\tau\)) and temporal persistence windows (k) on the Total dataset. The evaluation focuses on key metrics: F1 Score, Detection Delay, and Pre-Alarm Rate (PAR). The default settings are \(\tau = 0.50\) and \(k = 1.0 \text{s}\).

The findings indicate that decreasing thresholds yield better F1 scores, yet this leads to significantly higher PAR, while higher thresholds mitigate false alarms but can negatively impact detection accuracy. For instance, at \(\tau = 0.25\), classifiers like SVM + LLM and GBM + LLM achieve F1 scores of 87.45% and 90.85%, with PAR values of 99.25% and 51.32%, respectively. In contrast, setting \(\tau = 0.75\) drops PAR to 8.68% but lowers F1 scores substantially.

Additionally, extending the temporal persistence window from 1.0 to 2.0 or 3.0 seconds typically decreases PAR but can also diminish F1 scores, particularly for SVM + LLM and RF + LLM. GBM + LLM demonstrates better resilience to moderate smoothing, achieving an F1 score of 88.33% with a reduced PAR of 28.68% when \(\tau = 0.50\) and \(k = 2.0\). However, increasing the window further to \(k = 3.0\) results in a decline of F1 to 83.46%.

Comparison with Direct LLM Prompting

To assess the rationale for utilizing LLMs as a semantic feature extractor instead of direct classifiers, a comparative analysis presents the performance across Zero-shot, One-shot, and Few-shot Chain-of-Thought (CoT) prompting approaches.

Results show that direct prompting proves highly sensitive to prompting settings and model architecture. Under Zero-shot CoT, various architectures yield suboptimal recall and F1 scores, with Llama-3.2-3B being the only exception, achieving an F1 of 78.47% before its performance collapses under the One-shot setting. In the Few-shot context, Llama-3-8B attains an F1 of 70.32% but remains inferior to the LLM-augmented classifier’s F1 score of 90.77%.

Ultimately, the analysis indicates that LLMs perform better when acting as semantic feature generators in conjunction with supervised classifiers rather than as independent numerical classifiers.

Implementation Details

Supplementary information outlines the framework’s prompt design, hyperparameter settings for ML classifiers and LLMs, and scenario-specific thresholds used for the Rule + ML ablation baseline. This configuration establishes robust symbolic features that enhance detection capabilities compared to traditional threshold rules.

The content above is a summary. For more details, see the source article.