A Word Embeddings based Approach for Author Profiling: Gender and Age Prediction


Karunakar Kavuri
M Kavitha


Author Profiling (AP) is the task of identifying demographic attributes of an author, such as age, gender, location, native language and personality traits, by processing their written texts. AP techniques are used in applications such as literary research, marketing, forensics and security. By analysing various datasets, researchers have identified differences in authors' writing styles, and these differences are represented as stylistic features. Researchers have extracted several kinds of style-based features, including structural, content, word, character, syntactic, readability and semantic features, to recognize author profiles. Traditionally, various combinations of these features were used to differentiate the profiles of authors. Several existing works used Machine Learning (ML) methods to predict the characteristics of a new author, and achieved good accuracies by combining stylistic features with ML algorithms. Recently, with the advent of Deep Learning (DL) techniques, researchers have proposed approaches to author profiling based on these techniques. A few researchers found that deep learning techniques perform better for author profile prediction than style-based features. In this work, a word embeddings based approach is proposed for gender and age prediction. In this approach, experiments were conducted with different word embedding models, namely Word2Vec, GloVe, FastText and BERT, to generate word vectors. Documents are converted into vectors by a document representation technique that uses the embeddings of their words. The document vectors are passed to three ML algorithms, Extreme Gradient Boosting (XGBoost), Random Forest (RF) and Logistic Regression (LR), to generate trained models. These models are used to predict age and gender. The XGBoost classifier with BERT word embeddings achieved better accuracies for age and gender prediction than the other word embeddings and ML algorithms. The experiments were carried out on the Reviews dataset of the PAN 2014 competition. The proposed approach attained better accuracies for predicting age and gender than various existing approaches proposed for AP.
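The step of turning a document into a vector from its word embeddings can be illustrated with a minimal sketch. The abstract does not specify the document representation technique, so the averaging rule used here is an assumption, and the `TOY_EMBEDDINGS` table and the `document_vector` helper are hypothetical names introduced only for illustration. A real experiment would load Word2Vec, GloVe, FastText or BERT vectors and pass the resulting document vectors to a classifier such as XGBoost, RF or LR.

```python
from typing import Dict, List

# Hypothetical toy embedding table (3 dimensions). A real run would load
# pretrained Word2Vec, GloVe, FastText, or BERT vectors instead.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "great": [1.0, 0.0, 2.0],
    "hotel": [0.0, 2.0, 0.0],
    "room": [2.0, 1.0, 1.0],
}


def document_vector(
    tokens: List[str], embeddings: Dict[str, List[float]], dim: int
) -> List[float]:
    """Average the vectors of in-vocabulary tokens.

    Mean pooling is an assumed representation rule, not necessarily the
    one used in the paper. Out-of-vocabulary tokens are skipped; a
    document with no known tokens maps to the zero vector.
    """
    total = [0.0] * dim
    known = 0
    for token in tokens:
        vec = embeddings.get(token.lower())
        if vec is None:
            continue  # skip words without an embedding
        known += 1
        for i, value in enumerate(vec):
            total[i] += value
    if known == 0:
        return total
    return [component / known for component in total]


# "service" has no embedding in the toy table, so only three words count.
doc_tokens = "Great hotel room service".split()
doc_vec = document_vector(doc_tokens, TOY_EMBEDDINGS, dim=3)
print(doc_vec)
```

Each review would be mapped to one such fixed-length vector, and the matrix of document vectors with the age or gender labels would then form the training input for the classifier.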


How to Cite
Kavuri, K., & Kavitha, M. (2023). A Word Embeddings based Approach for Author Profiling: Gender and Age Prediction. International Journal on Recent and Innovation Trends in Computing and Communication, 11(7s), 239–250. https://doi.org/10.17762/ijritcc.v11i7s.6996

