Thesis

Classification of Arabic extremist web content through Arabic textual analysis

Creator
Rights statement
Awarding institution
  • University of Strathclyde
Date of award
  • 2020
Thesis identifier
  • T15578
Person Identifier (Local)
  • 201569559
Qualification Level
Qualification Name
Department, School or Faculty
Abstract
  • Many scholars have studied the written and spoken words of terrorist groups and individuals to understand the underlying motivation for terrorist acts. However, to date those scholars have not made use of automated linguistic analysis programs, especially ones focusing on Arabic corpora, to understand the mentality behind terrorist acts. This thesis makes a contribution in this regard.
    The division and classification of texts is an important area of linguistics, whether for Arabic or any other language, because it reduces the human time and effort needed to classify such texts. The importance of this research stems not only from the classification itself, but also from basing that classification on the extremist orientation of the texts.
    In this research, the researcher sets out to validate a new methodology for dividing and classifying Arabic texts according to the identity of the speakers: extremists, opponents of extremism, or neutral people who hold no ideas belonging to either the terrorist or the counter-terrorist category. The methodology is numerical: it divides the text using two different tools and then analyzes the results using more than six algorithms in the WEKA program. The researcher obtained very good results, which indicate that the methodology is a success.
    The thesis aims to put forward a comprehensive and detailed classification system for categorizing Arabic-language web pages with unscrupulous intentions and questionable language. It uses three Arabic corpora (pro-terrorism, anti-terrorism, and neutral), built from more than 7,000 Arabic texts (approximately 1,000,000 words) drawn from different sites and sources.
    The thesis employs a quantitative approach, using supervised algorithms to build a data classification model from manually categorized information. The classification algorithm used to construct the model relies on quantitative information extracted by the Posit or SAFAR textual analysis framework. The model works with 58 features combined from Posit n-grams and the morphological SAFAR V2 POS tools, and it achieved a precision of more than 94%.
    The model draws on two tools. The first is Posit, whose code was modified so that it can deal with Arabic content. The second is SAFAR V2, which is better suited to Arabic because it analyzes the morphology of each word and can therefore highlight the essential features that Posit overlooks. The model includes manual classification and pre-processing steps, and can run eight different experiments using the WEKA APIs and a GUI (graphical user interface) application.
    The research concludes that the best results, reaching 94% precision, were achieved by combining Posit + SAFAR + (18 attributes Posit + SAFAR N-Gram). Moreover, the most reliable results were achieved by applying a Random Forest classification algorithm using regression. The research recommends further work on this topic using new algorithms and techniques.
Advisor / supervisor
  • Weir, George
Resource Type
DOI
Date Created
  • 2020
Former identifier
  • 9912908693202996
