A New Term Representation Method for Gender and Age Prediction
Abstract
Author profiling is a text classification task that infers characteristics of authors, such as age, gender, educational background, place of origin, personality traits, and native language, by processing their written texts. Applications such as forensic analysis, security, and marketing use author profiling techniques to find basic details about authors. A main problem in this domain is preparing a suitable dataset for predicting author characteristics. PAN is an organization that conducts competitions on various shared tasks. In 2013, the PAN organizers introduced author profiling as a task in their series of competitions and continued it in subsequent years, providing different kinds of datasets in a variety of languages. Since 2013, several researchers have proposed author profiling solutions that predict different characteristics of authors using the datasets provided in the PAN competitions. Researchers have used different kinds of features, including character-based, lexical (word-based), structural, syntactic, content-based, and style-based features, to distinguish authors' writing styles. Most researchers observed that content-based features, such as the words or phrases used in the text, are the most useful for detecting author characteristics. In this work, experiments were conducted with content-based features, namely the most important words or terms, to predict age group and gender on the PAN competition datasets. Two datasets, the PAN 2014 and PAN 2016 author profiling datasets, are used in these experiments. The documents of each dataset are converted into a vector representation, a format suitable for training machine learning algorithms.
The representation of terms in a document vector plays a crucial role in improving the performance of gender and age group prediction. Term Weight Measures (TWMs) are used for this purpose: they assign each term a value that reflects its significance in the document vector. In this work, we developed a new TWM for representing term values in the document vector. The efficiency of the proposed TWM is compared with that of existing TWMs. Two Machine Learning (ML) algorithms, Support Vector Machine (SVM) and Random Forest (RF), are used in this experiment to estimate the accuracy of the proposed approach. We found that the proposed TWM achieved the best accuracies for gender and age prediction on both PAN datasets.
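To make the idea of a term weight measure concrete, the sketch below builds document vectors with classic TF-IDF weighting, one of the existing TWMs a new measure would typically be compared against. It is not the TWM proposed in this work, which the abstract does not define. The function name `tfidf_vectors`, the whitespace tokenizer, and the toy documents are assumptions made purely for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors): one TF-IDF weighted vector per document.

    Illustrative baseline only; this is NOT the proposed TWM.
    """
    # Naive whitespace tokenization (an assumption; real systems preprocess more).
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        # Weight = relative term frequency * log inverse document frequency.
        vectors.append([
            (counts[term] / total) * math.log(n_docs / df[term])
            for term in vocab
        ])
    return vocab, vectors


# Hypothetical toy documents standing in for author texts.
docs = [
    "the match was great",
    "the exam was hard",
    "the new phone is great",
]
vocab, vectors = tfidf_vectors(docs)
```

Each row of `vectors` could then be passed, together with an age or gender label, to a classifier such as an SVM or a random forest. Terms that occur in every document (here "the") receive weight zero, which is the behaviour that supervised TWMs usually refine by also looking at how a term is distributed across classes.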