Harnessing Deep Learning Techniques for Text Clustering and Document Categorization

Rama Krishna Paladugu
Gangadhara Rao Kancherla

Abstract

This paper examines deep text clustering algorithms with the aim of improving the accuracy of document classification. In recent years, combining deep learning with text clustering has shown promise for extracting meaningful patterns and representations from textual data. The paper surveys a range of deep text clustering methodologies and assesses how effectively each improves document classification accuracy. It also investigates feature representation techniques, ranging from conventional word embeddings to the contextual embeddings produced by BERT and GPT models. By critically reviewing and comparing these algorithms, we identify their strengths, limitations, and potential applications. Through this synthesis of the existing literature, the study offers researchers and practitioners insight into the evolving landscape of document analysis and classification, and guidance on applying deep learning to improve document classification accuracy.

Article Details

How to Cite
Paladugu, R. K., & Kancherla, G. R. (2023). Harnessing Deep Learning Techniques for Text Clustering and Document Categorization. International Journal on Recent and Innovation Trends in Computing and Communication, 11(8), 125–139. https://doi.org/10.17762/ijritcc.v11i8.7930
