Advanced Sensor Data Analysis using Big Data-Enhanced Algorithms

Wael Hadeed

Abstract


Traditional IoT anomaly detection systems struggle with growing dimensionality, the constraints of big data processing, and non-interpretable feature extraction. This article describes a complete pipeline that integrates Apache Spark data preparation, PCA for dimensionality reduction (from 744 to 12 components that retain 92.7% of the variance), and CatBoost gradient boosting for classification. A thorough benchmark of six algorithms on the Intel Berkeley Research Lab dataset (n = 30,221 instances) shows CatBoost to be the best method, with F1-score = 0.97, precision = 0.97, and accuracy = 98.7%, a 3-8% margin of improvement over the XGBoost, LightGBM, Random Forest, and SVM methods. Temperature changes (PC1: 0.37 factor) and humidity variations (PC2: 0.29) emerged as the major indicators of anomalies. Training finished in 45.2 seconds and prediction took under 35 seconds per batch on consumer Intel i7/16 GB hardware, which demonstrates computational feasibility at a production level for environmental monitoring and industrial IoT applications.
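
The two quantities the abstract leans on, the number of principal components needed to retain a given share of variance and the precision/F1/accuracy scores, can be illustrated with a minimal standard-library sketch. The function names `components_for_variance` and `binary_metrics` are hypothetical, the inputs are made-up toy values rather than the article's data, and the label convention (1 = anomaly, 0 = normal) is an assumption.

```python
from itertools import accumulate


def components_for_variance(eigenvalues, target):
    """Return the smallest number of leading principal components whose
    cumulative explained-variance ratio reaches `target` (e.g. 0.927)."""
    ordered = sorted(eigenvalues, reverse=True)  # largest variance first
    total = sum(ordered)
    for k, running in enumerate(accumulate(ordered), start=1):
        if running / total >= target:
            return k
    return len(ordered)


def binary_metrics(y_true, y_pred):
    """Precision, recall, F1 and accuracy for binary labels.
    Assumption: label 1 marks an anomaly, label 0 a normal reading."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


if __name__ == "__main__":
    # Toy covariance eigenvalues (illustrative only).
    print(components_for_variance([5, 3, 1, 1], 0.75))
    # Toy labels (illustrative only).
    print(binary_metrics([1, 1, 1, 0, 0, 0, 0, 0],
                         [1, 1, 0, 1, 0, 0, 0, 0]))
```

In a real pipeline the eigenvalues would come from the PCA step and the labels from the classifier's held-out predictions; the sketch only shows how the reported figures are defined.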

Keywords


Apache Spark processing; CatBoost gradient boosting; IoT anomaly detection; Intel Berkeley benchmark; PCA dimensionality reduction


DOI: https://doi.org/10.32520/stmsi.v15i4.6254


Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.