A Machine Learning Pipeline and Application for Automatic Classification of Clinical Documents

Main Article Content

G Uday Kiran, Sneha Raga Soujanya, M Mounika, Narasimha Chary CH


The healthcare industry encompasses many associated services, including research on trends and patterns in diseases and patients' lifestyles. With the emergence of Artificial Intelligence (AI), problems in the healthcare domain can now be addressed using Machine Learning (ML) techniques. One such problem, considered in this paper, is clinical document classification. Existing methods in this area lack a systematic approach to filtering out false positives. In this paper, we propose an ML framework that pipelines ML models at multiple levels. At the first level, clinical documents with no smoking-related content are discarded. At the second level, documents that discuss known smoking cases are retained. At the third level, the retained clinical documents are classified into two categories: current smokers and past smokers. We propose an algorithm called Learning based Clinical Document Classification (LbCDC), which uses three models in a pipeline to classify clinical documents at multiple levels of granularity. Our experimental results show that the proposed system is efficient in clinical document classification.
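The three-level cascade described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' LbCDC implementation: the abstract does not specify the models, features, or label names, so the keyword-based functions below (`mentions_smoking`, `is_known_smoking_case`, `is_current_smoker`) are hypothetical stand-ins for trained classifiers, and the label strings are assumptions chosen for illustration. The sketch only shows how a document exits the pipeline at the first level that rejects it.

```python
from dataclasses import dataclass
from typing import Callable

# Each level is any callable mapping document text to True/False.
# In a real system these would be trained ML models; here they are
# hypothetical keyword rules used only to make the sketch runnable.
StageModel = Callable[[str], bool]


@dataclass(frozen=True)
class CascadeResult:
    label: str  # assumed label names, not taken from the paper
    level: int  # pipeline level at which the decision was made


def mentions_smoking(text: str) -> bool:
    """Level 1 stand-in: does the document contain smoking-related content?"""
    t = text.lower()
    return any(k in t for k in ("smok", "tobacco", "cigarette"))


def is_known_smoking_case(text: str) -> bool:
    """Level 2 stand-in: does the document describe a known smoking case?"""
    t = text.lower()
    negations = ("denies smoking", "never smoked", "non-smoker", "nonsmoker")
    return not any(n in t for n in negations)


def is_current_smoker(text: str) -> bool:
    """Level 3 stand-in: current smoker (True) versus past smoker (False)."""
    t = text.lower()
    past_cues = ("quit", "former smoker", "ex-smoker", "stopped smoking")
    return not any(c in t for c in past_cues)


def classify_document(
    text: str,
    level1: StageModel = mentions_smoking,
    level2: StageModel = is_known_smoking_case,
    level3: StageModel = is_current_smoker,
) -> CascadeResult:
    """Run the three levels in order, stopping at the first rejection."""
    if not level1(text):
        return CascadeResult("no_smoking_content", 1)  # discarded at level 1
    if not level2(text):
        return CascadeResult("not_known_smoking_case", 2)  # filtered at level 2
    if level3(text):
        return CascadeResult("current_smoker", 3)
    return CascadeResult("past_smoker", 3)


if __name__ == "__main__":
    notes = [
        "Patient presents with an ankle sprain.",
        "Patient denies smoking or alcohol use.",
        "Patient smokes one pack of cigarettes per day.",
        "Former smoker, quit 5 years ago.",
    ]
    for note in notes:
        print(classify_document(note), "<-", note)
```

Because each level is passed in as a plain callable, the keyword rules can be swapped for trained models without changing the cascade logic. The early exits at levels 1 and 2 are where a pipeline of this kind would filter out documents before the final classification.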

Article Details

How to Cite
G Uday Kiran, et al. (2023). A Machine Learning Pipeline and Application for Automatic Classification of Clinical Documents. International Journal on Recent and Innovation Trends in Computing and Communication, 11(10), 481–490. https://doi.org/10.17762/ijritcc.v11i10.8512
Author Biography


1Dr. G Uday Kiran, 2Sneha Raga Soujanya, 3Mrs. M Mounika, 4Dr. Narasimha Chary CH

1Associate Professor, Department of CSE(AI & ML), B V Raju Institute of Technology


2Assistant Professor, Department of CSE, AVN College


3Assistant Professor, B V Raju Institute of Technology


4Associate Professor, Dept of CSE, Sri Indu College of Engineering and Technology (Autonomous), Sheriguda, Hyderabad, TS, India - 501510


