Classification of Arabic real and fake news based on Arabic textual analysis

Rights statement
Awarding institution
  • University of Strathclyde
Date of award
  • 2022
Thesis identifier
  • T16299
Person Identifier (Local)
  • 201888010
Qualification Level
Qualification Name
Department, School or Faculty
Abstract
  • The ease of communication made possible by chat messaging platforms, together with their growing use and ubiquity in society, has motivated purveyors of fake news to create news and present it as legitimate. Although many countries have introduced severe penalties for distributing fake news, monitoring the myriad articles involved has proved burdensome. Organisations continue to work on this problem, but many of their solutions rely on verifying the associated metadata. For text sent through social messaging, these metadata are not always present. Several studies have attempted to identify fake news by analysing the textual content of these pieces; however, there is a dearth of such studies on Arabic language sources. This study fills that gap. This research built a machine learning (ML) model that classifies real and fake articles in Arabic based on textual analysis. Its importance lies not only in the classification model itself but also in the model's ability to classify other types of fake news, such as satire, and to identify an article's country of origin. This work employed qualitative approaches to create five Arabic datasets that may be used in other Arabic research projects. Then, following comprehensive textual analysis with Natural Language Processing (NLP) tools, quantitative approaches were applied using several supervised ML classifiers. This research thus puts forward a comprehensive supervised ML classification model that identifies fake news articles written in a formal journalistic genre that imitates real news articles. The novelty of this model lies in the fact that it classifies real and fake news articles in Arabic where the fake articles are written in a journalistic style, so that they differ only slightly from real articles.
To examine these differences, four kinds of textual features were analysed: part of speech (POS), emotion, polarity, and linguistic features. These have been successful in identifying fake news in other languages but have not been fully tested on Arabic fake news. Probing these textual features showed how influential each of them was in identifying fake news in Arabic. Using NLP to extract the textual features and combining them with ML classifiers, this research built a model that reached an accuracy score of 77.2%. Moreover, the model correctly predicted 6 out of 10 articles within the same topic domain, the Hajj, and 17 out of 26 fake articles within another topic domain, COVID-19. The proposed model achieved promising results, and it also successfully classified satire articles as well as the articles' country of origin within the same topic domain. The research concludes by recommending that future work build on the contributions made here and explore this topic further using new methods.
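The per-domain results above are given as raw counts, while the overall score is given as a percentage. As a minimal sketch, the snippet below converts the two reported counts to percentages so that they can be read on the same scale. The function name `hit_rate` and the dictionary keys are illustrative assumptions, not taken from the thesis; only the counts (6 of 10 and 17 of 26) come from the abstract.

```python
def hit_rate(correct: int, total: int) -> float:
    """Return the share of correctly classified articles as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


# Counts as reported in the abstract; the keys are illustrative labels only.
reported = {
    "Hajj": (6, 10),        # articles within the same topic domain
    "COVID-19": (17, 26),   # fake articles within another topic domain
}

rates = {name: hit_rate(correct, total) for name, (correct, total) in reported.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.1f}%")
```

These work out to 60.0% for the Hajj articles and about 65.4% for the COVID-19 articles. Both samples are small, so the percentages should be read as rough indications and not as figures directly comparable to the 77.2% accuracy score.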
Advisor / supervisor
  • Weir, George
Resource Type