Classification of Arabic extremist web content through Arabic textual analysis

Alshahrani, Haya Mesfer

Thesis

Creator

Alshahrani, Haya Mesfer

Rights statement

Strathclyde Thesis Copyright

Awarding institution

University of Strathclyde

Date of award

2020

Thesis identifier

T15578

Person Identifier (Local)

201569559

Qualification Level

Doctoral (Postgraduate)

Qualification Name

Doctor of Philosophy (PhD)

Department, School or Faculty

Department of Computer and Information Sciences

Abstract

Many scholars have studied the written and spoken words of terrorist groups and individuals in order to understand the underlying motivation for terrorist acts. To date, however, those scholars have not made use of automated linguistic analysis programs, especially ones focusing on Arabic corpora, to understand the mentality behind such acts. This thesis makes a contribution in this regard.

The division and classification of texts is an important area of linguistics, whether for Arabic or for other languages, because it reduces the time and effort humans spend classifying texts. The importance of this research stems not only from the importance of classification itself, but also from basing the classification on the extremist orientation of the texts.

In this research, the researcher seeks to validate a new methodology for dividing and classifying Arabic texts, one that distinguishes them according to the identity of the speakers: extremists, opponents of extremism, or neutral people who hold no ideas belonging to either the terrorist or the counter-terrorist category. The methodology is numerical: it divides the text using two different tools and then analyses the results using more than six algorithms in the WEKA program. The researcher obtained very good results, which indicate that the methodology is successful.

This thesis aims to put forward a comprehensive and detailed classification system for categorizing Arabic-language web pages with unscrupulous intentions and questionable language. It uses three Arabic corpora (pro-terrorism, anti-terrorism, and neutral), built from more than 7,000 Arabic texts (approximately 1,000,000 words) drawn from different sites and sources.

The thesis employs a quantitative approach, using different supervised algorithms to build a data classification model from manually categorized information. The classification algorithm used to construct the model relies on quantitative information extracted by the Posit or SAFAR textual analysis framework. The model works with 58 features combining Posit, n-gram, and morphological SAFAR V2 POS outputs, and it achieved more than 94% precision.

The model draws on two tools. The first is Posit, whose code was modified so that it can handle Arabic content. The second is SAFAR V2, which is better suited to Arabic because it analyses the morphology of each word and can therefore capture essential features that Posit overlooks. The system comprises manual classification and pre-processing steps, and it can run eight different experiments using the WEKA APIs through a GUI (graphical user interface) application.

The research concludes that the best results, reaching 94% precision, were achieved by combining Posit + SAFAR + (18 attributes Posit + SAFAR N-Gram). Moreover, the most reliable results were achieved by applying a Random Forest classification algorithm using regression. The research recommends further work on this topic using new algorithms and techniques.
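As a rough illustration of two ideas the abstract mentions, n-gram features and precision over the three classes, the following minimal Python sketch may help. It is not the thesis code and does not use Posit, SAFAR, or WEKA: the function names, the label names, and the whitespace tokenisation are all assumptions made for illustration, and the inputs are toy data rather than thesis results.

```python
from collections import Counter

# Hypothetical label names for the three corpora described in the abstract.
LABELS = ("pro", "anti", "neutral")


def word_ngrams(text, n):
    """Count word n-grams in a text.

    Assumption: tokens are split on whitespace only. Real Arabic text would
    need proper tokenisation and morphological analysis.
    """
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def per_class_precision(actual, predicted, labels=LABELS):
    """Precision per class: correct predictions / all predictions of that class."""
    result = {}
    for label in labels:
        # Actual labels of every item the classifier assigned to this class.
        assigned = [a for a, p in zip(actual, predicted) if p == label]
        if assigned:
            result[label] = sum(a == label for a in assigned) / len(assigned)
        else:
            result[label] = None  # undefined when the class is never predicted
    return result


if __name__ == "__main__":
    # Toy labels only, to show the calculation.
    actual = ["pro", "pro", "anti", "neutral", "neutral", "anti"]
    predicted = ["pro", "anti", "anti", "neutral", "pro", "anti"]
    print(word_ngrams("a b c b c", 2))
    print(per_class_precision(actual, predicted))
```

In a pipeline of the kind the abstract describes, counts like these would form part of a feature vector per document, and precision would be computed from a classifier's predictions on manually labelled texts.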

Advisor / supervisor

Weir, George

Resource Type

Doctoral thesis

DOI

10.48730/c6bk-nn58

Date Created

2020

Former identifier

9912908693202996

Relations

Items

Title	Date Uploaded	Visibility
PDF of thesis T15578	2021-07-02	Public
