Parnell, Andrew, González-Castro, Víctor, Alaiz-Rodríguez, Rocío and Barrientos, Gonzalo Molpeceres (2020) Machine Learning Techniques for the Detection of Inappropriate Erotic Content in Text. International Journal of Computational Intelligence Systems, 13 (1). p. 591. ISSN 1875-6883
Abstract
Nowadays, children have access to the Internet on a regular basis. Just like the real world, the Internet has many unsafe locations where kids may be exposed to inappropriate content in the form of obscene, aggressive, erotic or rude comments. In this work, we address the problem of detecting erotic/sexual content in text documents using Natural Language Processing (NLP) techniques. Following an approach based on Machine Learning techniques, we assessed twelve models resulting from the combination of three text encoders (Bag of Words, Term Frequency-Inverse Document Frequency (TF-IDF) and Word2vec) with four classifiers (Support Vector Machines (SVMs), Logistic Regression, k-Nearest Neighbours and Random Forests). We evaluated these alternatives on a newly created dataset extracted from public data on the Reddit website. The best performance was achieved by the combination of the TF-IDF text encoder and the SVM classifier with a linear kernel, with an accuracy of 0.97 and an F-score of 0.96 (precision 0.96, recall 0.95). This study demonstrates that it is possible to detect erotic content in text documents and, therefore, to develop filters for minors or according to users' preferences.
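The experimental design described in the abstract can be illustrated with a minimal, standard-library-only sketch. It enumerates the twelve encoder/classifier pairings, shows one common TF-IDF weighting (raw term frequency times log(N/df), which is an assumption here and not necessarily the variant used in the paper), and computes the F-score from precision and recall. The function names `tfidf` and `f_score` and the toy corpus are hypothetical and serve illustration only; no classifier is trained.

```python
import math
from collections import Counter
from itertools import product

# The three encoders and four classifiers named in the abstract.
ENCODERS = ["Bag of Words", "TF-IDF", "Word2vec"]
CLASSIFIERS = ["SVM", "Logistic Regression", "k-Nearest Neighbours", "Random Forest"]

# Every encoder paired with every classifier: 3 x 4 = 12 candidate models.
MODELS = list(product(ENCODERS, CLASSIFIERS))


def tfidf(docs):
    """Encode tokenised documents as {term: weight} dictionaries.

    Weight = raw term count * log(N / document frequency).
    This is one common TF-IDF variant (an assumption for this sketch).
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors


def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy, neutral corpus (hypothetical; not the Reddit dataset from the paper).
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "stocks fell on monday".split(),
]
vectors = tfidf(docs)
```

Applied to the rounded figures quoted in the abstract, `f_score(0.96, 0.95)` evaluates to about 0.955, which is in line with the reported F-score of 0.96 once rounding of the published precision and recall is allowed for.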
Item Type: | Article |
---|---|
Additional Information: | Cite as: Hernández, A., Martin-Puertas, C., Moffa-Sánchez, P., Moreno-Chamarro, E., Ortega, P., Blockley, S., Cobb, K.M., Comas-Bru, L., Giralt, S., Goosse, H., Luterbacher, J., Martrat, B., Muscheler, R., Parnell, A., Pla-Rabes, S., Sjolte, J., Scaife, A.A., Swingedouw, D., Wise, E. & Xu, G. 2020, "Modes of climate variability: Synthesis and review of proxy-based reconstructions through the Holocene", Earth-science reviews, vol. 209, pp. 103286. Copyright: This is an open access article distributed under the CC BY-NC 4.0 license (http://creativecommons.org/licenses/by-nc/4.0/). |
Keywords: | Inappropriate content; Machine learning; Text classification; Natural language processing; Text encoders |
Academic Unit: | Faculty of Science and Engineering > Mathematics and Statistics; Faculty of Science and Engineering > Research Institutes > Hamilton Institute; Faculty of Social Sciences > Research Institutes > Irish Climate Analysis and Research Units, ICARUS |
Item ID: | 16230 |
Identification Number: | 10.2991/ijcis.d.200519.003 |
Depositing User: | Andrew Parnell |
Date Deposited: | 05 Jul 2022 14:20 |
Journal or Publication Title: | International Journal of Computational Intelligence Systems |
Publisher: | Atlantis Press |
Refereed: | Yes |
URI: | https://mu.eprints-hosting.org/id/eprint/16230 |
Use Licence: | This item is available under a Creative Commons Attribution Non Commercial Share Alike Licence (CC BY-NC-SA). |