Abduelbaset Mustafa Alia Goweder


Permanent faculty member

Academic qualification: Master's degree

Academic rank: Professor

Specialization: Computer Science

Department of Computer Science - School of Basic Sciences

Scientific Publications
Measurement System and its Suitability for Examining Indoor Millimeter Wave Propagation at (28–33GHz)
Conference paper

The purpose of this study is to determine the suitability of a measurement system for examining indoor radio wave propagation at 28–33 GHz, a band which might be used by 5G communication.

Ahmed Ben Alabish, Abduelbaset Mustafa Alia Goweder, (05-2021), IEEE Access: 2021 IEEE 1st International Maghreb Meeting of the Conference on Sciences and Techniques of Automatic Control and Computer Engineering MI-STA, 1-4

Migration of RDBs into ORDBs and XML Data
Journal Article

Abstract: XML and relational databases are two of the most important mechanisms for storing and transferring data. A reliable and flexible way of moving data between them is a very desirable goal. The way data is stored in each model is very different, which makes the translation process difficult. To abstract some of the differences away, a low-level common data model can be used to successfully move data from one model to another. A way of describing the schema is needed; to the best of our knowledge, there is no widely accepted way of doing this for XML.

Recently, XML Schema has taken on this role. On one hand, this paper takes XML conforming to XML Schema definitions and transforms it into a relational database via the low-level modeling language HDM. On the other hand, a relational database is transformed into an XML Schema document and an XML instance document containing the data from the database. The transformations are done within the AutoMed framework, providing a sound theoretical basis for the work. A visual tool that represents the XML Schema as a tree structure and allows some manipulation of the schema is also described.

Ali Sayeh Ahmed Elbekai, Abduelbaset Mustafa Alia Goweder, (04-2018), Faculty of Science, University of Tripoli: THE LIBYAN JOURNAL OF SCIENCE (An International Journal), 4 (21), 57-63

Irregular Arabic Plural without Stemming.
Conference paper

Abstract: With the growth of digital Arabic documents, especially in information retrieval (IR) and natural language processing (NLP) applications, the identification of irregular plurals, commonly called broken plurals (BP), in modern standard Arabic becomes a very urgent issue. Broken plurals are formed by imposing interdigitating patterns on stems, and singular words cannot be recovered by standard affix-stripping stemming techniques. Identifying broken plurals is an important and difficult problem which needs to be addressed. In information retrieval, deriving singulars from plurals is referred to as stemming. The process of stemming can be achieved by removing the attached affixes from a given word. To the best of our knowledge, all existing Arabic stemmers are unreliable and still under research. Consequently, this paper proposes an approach which identifies broken plurals without the need to perform the stemming process on any given word. The well-known decision tree system (WEKA J48) is applied to build a classifier (model) on a very large Arabic corpus as training data, which is pre-processed and prepared as part of this work. The built classifier is evaluated using an unseen test set. The obtained results reveal that a very promising broken plural recognizer could be designed and implemented for NLP applications.

Abduelbaset Mustafa Alia Goweder, (11-2016), Hammamet, Tunisia: Proceedings of CEIT 2016, 1-6

The Similarity Thesaurus for Expanding Arabic Queries
Conference paper

Abstract: Query expansion is the process of supplementing additional terms to the original query to improve information retrieval (IR) performance. For heavily inflectional languages such as Arabic, query expansion is considered a difficult task. In this paper, the well-known similarity-thesaurus approach is adopted and applied to Arabic. Prior to applying this approach, datasets (three collections of Arabic documents) are first pre-processed to create document inverted-index vocabularies; then the normal indexing process is carried out. The thesaurus method is applied to create a modified (expanded) version of the original query, and the target collection is indexed once more. To gauge the enhancement of the retrieval process, the results of normal indexing and those of applying the thesaurus approach are evaluated against each other using precision and recall measures. The results have shown that the thesaurus method has considerably enhanced the performance of the Arabic Information Retrieval (AIR) system. As the number of expansion terms increases up to a certain extent (35 terms), the performance improves. Beyond this limit, the performance is unaffected or grows insignificantly.

Abduelbaset Mustafa Alia Goweder, (08-2014), University of Selcuk, Antalya, Turkey: Proceedings of ICAT 2014, 876-882

XMLSchema-Driven Mapping of Architecture Components for Generating New Data.
Conference paper

Abstract: In this paper, the XMLSchema-driven mapping of architecture components for generating new data formats is introduced, and an investigation of how an XMLSchema can be stored in different ways is carried out. In general, any application that has the capability to work with XML documents will need to display the structure of its related data in a different format specified for a particular occasion, due to its nature of working in heterogeneous environments. Accordingly, mapping a document from one data structure to another is needed. Such a mapping process is essential, especially when dealing with XMLSchema. When data are to be translated between XML and a database, some means of mapping must be formulated for the data before they can be transferred either to the database or to the document. Most techniques use object-relational mapping for transforming data between XML and the database. In this paper, we present different types of XMLSchema mapping, such as tree-to-tree (i.e., XMLSchema to another XMLSchema, and XMLSchema to XHTML). Other mappings are XMLSchema to relational, XMLSchema to object-relational, and XMLSchema to relational algebra. We also introduce general algorithms for many of the mapping types. The algorithms and techniques show how XMLSchema drives the mapping of architecture components to generate a new data structure.


Ali Sayeh Ahmed Elbekai, Abduelbaset Mustafa Alia Goweder, (08-2014), University of Selcuk, Antalya, Turkey: Proceedings of ICAT 2014, 889-895

The Pseudo Relevance Feedback for Expanding Arabic Queries
Conference paper

Abstract: With the explosive growth of the World Wide Web, Information Retrieval Systems (IRS) have recently become a focus of research. Query expansion is defined as the process of supplementing additional terms or phrases to the original query to improve information retrieval performance. Arabic is a highly inflectional and derivational language, which makes the query expansion process a hard task. In this paper, the well-known approach Pseudo Relevance Feedback (PRF) is adopted and applied to Arabic. Prior to applying PRF, datasets (three collections of Arabic documents) are first pre-processed to create document inverted-index vocabularies; then the normal indexing process is carried out. PRF is applied to create a modified (expanded) version of the original query, and the target collection is indexed once more. To judge the enhancement of the retrieval process, the results of normal indexing and those of applying PRF are evaluated against each other using precision and recall measures. The results have shown that the PRF method has significantly enhanced the performance of the Arabic Information Retrieval (AIR) system. As the number of expansion terms increases up to a certain extent (35 terms), the performance improves. Beyond this limit, the performance is unaffected or grows insignificantly.
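The core PRF loop described in the abstract can be illustrated with a minimal sketch: rank documents, assume the top-ranked ones are relevant, and add their most frequent new terms to the query. The scoring function, document set, and parameter values here are illustrative toy choices, not the paper's actual implementation.

```python
from collections import Counter

def rank(docs, query):
    """Score documents by simple term overlap with the query (toy ranking)."""
    q = set(query.split())
    return sorted(range(len(docs)), key=lambda i: -len(q & set(docs[i].split())))

def prf_expand(docs, query, top_docs=2, top_terms=3):
    """Pseudo relevance feedback: treat the top-ranked documents as
    relevant and append their most frequent new terms to the query."""
    best = rank(docs, query)[:top_docs]
    counts = Counter(w for i in best for w in docs[i].split())
    for w in query.split():
        counts.pop(w, None)  # do not re-add original query terms
    expansion = [w for w, _ in counts.most_common(top_terms)]
    return query.split() + expansion

docs = [
    "solar energy panels convert sunlight into electricity",
    "wind energy turbines generate electricity from wind",
    "bananas are rich in potassium",
]
print(prf_expand(docs, "energy electricity"))
```

The expanded query would then be run against the collection a second time, and precision/recall compared with the unexpanded run, as the abstract describes.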

Abduelbaset Mustafa Alia Goweder, (12-2013), Poznan, Poland: Proceedings of 6th Language and Technology Conference, (LTC), 359-365

CENTROID-BASED ARABIC CLASSIFIER
Conference paper

Abstract: Nowadays, the amount of accessible textual information available on the Internet is phenomenal. Automatic text classification is considered an important application in natural language processing. It is the process of assigning a document to predefined categories based on its content. In this paper, the well-known Centroid-based technique developed for text classification is applied to Arabic text. The Arabic language is highly inflectional and derivational, which makes text processing a complex and challenging task. In the proposed work, the Centroid-based algorithm is adopted and adapted to classify Arabic documents. The implemented algorithm is evaluated using a corpus containing a set of Arabic documents. The experimental results, against a dataset of 1400 Arabic text documents covering seven distinct categories, reveal that the adapted Centroid-based algorithm is applicable to classifying Arabic documents. The implemented Arabic classifier achieved figures of roughly 90.7%, 87.1%, 88.9%, 94.8%, and 5.2% for micro-averaged recall, precision, F-measure, accuracy, and error rates respectively.

Abduelbaset Mustafa Alia Goweder, (12-2013), Sudan University of Science and Technology, Khartoum, Sudan: Proceedings of ACIT 2013, 13-21

Arabic Text Categorization using Rocchio Model
Conference paper

Abstract: Automatic text categorization is considered an important application in natural language processing. It is the process of assigning a document to predefined categories based on its content. In this research, some well-known techniques developed for classifying English text are considered for application to Arabic. This work focuses on applying the well-known Rocchio (Centroid-based) technique to Arabic documents. This technique uses centroids to define good class boundaries; the centroid of a class c is computed as the center of mass of its members. The Arabic language is highly inflectional and derivational, which makes text processing a complex task. In the proposed work, Arabic text is first preprocessed using tokenization and stemming techniques. Then, the Rocchio algorithm is adopted and adapted to classify Arabic documents. The implemented algorithm is evaluated using a corpus containing a set of actual documents. The results show that the adapted Rocchio algorithm is applicable to categorizing Arabic text. Ratios of 92.2%, 92.7%, and 92.1% for micro-averaged recall, precision, and F-measure respectively are achieved against a dataset of 500 Arabic text documents covering five distinct categories.
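The centroid idea in the abstract (the centroid of a class is the center of mass of its member vectors; a document is assigned to the class whose centroid it is closest to) can be sketched minimally as follows. The bag-of-words weighting, cosine similarity, and English toy data are illustrative assumptions; the paper works on stemmed Arabic text.

```python
import math
from collections import Counter, defaultdict

def vectorize(text):
    """Toy bag-of-words term-frequency vector."""
    return Counter(text.split())

def centroid(vectors):
    """Centroid of a class: component-wise mean of its member vectors."""
    total = defaultdict(float)
    for v in vectors:
        for term, w in v.items():
            total[term] += w
    return {t: w / len(vectors) for t, w in total.items()}

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def train(labelled):
    by_class = defaultdict(list)
    for text, label in labelled:
        by_class[label].append(vectorize(text))
    return {label: centroid(vs) for label, vs in by_class.items()}

def classify(model, text):
    """Assign the class whose centroid is most similar to the document."""
    v = vectorize(text)
    return max(model, key=lambda label: cosine(model[label], v))

model = train([
    ("goal match team player", "sports"),
    ("league score stadium team", "sports"),
    ("election vote parliament law", "politics"),
    ("minister government vote policy", "politics"),
])
print(classify(model, "the team scored a late goal"))  # → sports
```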

Abduelbaset Mustafa Alia Goweder, (10-2013), Zurich, Switzerland: Proceedings of International Conference on Advances in Computing, Electronics and Communication (ACEC), 71-78

A Universal Lexical Steganography Technique
Journal Article

In this paper, the English language is used as an instance of natural languages, as we are concerned with the set of all natural language texts. This research tries to employ sets of synonyms as a way to hide a secret message inside a natural language text.

Ahmed Hassen ELjeealy Ben Alabish, Abduelbaset Mustafa Alia Goweder, (01-2013), International Journal of Computer and Communication Engineering: IJCCE, 2 (159), 153-157

Design and Simulation of an Adaptive Intelligent Control System for Direct Current Drive
Conference paper

Abstract: This paper presents an adaptive intelligent control method to overcome the effects of some indeterminate and unaccounted-for factors from which a DC drive suffers. In the speed loop, a three-layer neural network with off-line learning via the backpropagation (BP) algorithm is used to realize the fuzzy-control tactics, and a single neuron with on-line dynamic learning via the Hebb algorithm is used to realize the adaptive mechanism. The simulation is based on the MATLAB 6.0 Neural Networks Toolbox with Simulink. The results of the simulation show that the adaptive intelligent control method enables the system to have good dynamic and stability performance. The proposed method extends the use of Simulink in the field of electrical drives under adaptive intelligent control.

Abduelbaset Mustafa Alia Goweder, (10-2010), University of Tripoli, Tripoli, Libya.: Proceedings of The Libyan Arab International Conference on Electrical and Electronic Engineering (LAICEEE2010), 72-80

Network Security for QoS Routing Metrics
Conference paper

Abstract: Data security is an essential requirement, especially when sending information over a network. Network security has three goals: confidentiality, integrity, and availability (or access). Encryption is the most common technique used to achieve these goals. However, the computing community has not yet agreed on a standard method to measure data security. The ultimate goal of this study is to define security metrics based on different aspects of network security, and then demonstrate how these metrics could be used in Quality of Service (QoS) routing to find the most secure path connecting two distant nodes (source and destination) across an internetwork. Three security metrics are proposed in this document, derived from three important issues of network security, namely authentication, encryption, and traffic filtration techniques (firewalls and intrusion detection systems). The metrics follow different composition rules: the first is binary, the second is either concave or additive, and the last is multiplicative. Routing algorithms that make use of such metrics have been implemented in the C# programming language to test the viability of the proposed solution. Computational effort and blocking probability, the most commonly used performance measures, were used to assess the behavior and performance of these routing algorithms. Results obtained show that the algorithms were able to find feasible paths between communicating parties and helped in making reasonable savings in the computational effort needed to find an acceptable path. Consequently, higher blocking probabilities were encountered, which is the price to be paid for the savings.
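The abstract distinguishes binary, additive, concave, and multiplicative composition rules for combining per-link metrics along a path. A minimal sketch of how each rule composes is shown below; the metric names and example values are hypothetical illustrations, not the paper's actual metrics.

```python
from functools import reduce
import operator

# Composition rules for per-link security metrics along a path:
# binary (every link must satisfy the property), additive (costs sum),
# concave (the weakest link dominates), multiplicative (probabilities multiply).

def compose_binary(links):
    """e.g. 'is every link on the path authenticated?'"""
    return all(links)

def compose_additive(links):
    """e.g. total filtering overhead along the path"""
    return sum(links)

def compose_concave(links):
    """e.g. weakest encryption key length on the path"""
    return min(links)

def compose_multiplicative(links):
    """e.g. probability that no link on the path is compromised"""
    return reduce(operator.mul, links, 1.0)

# Hypothetical per-link values for a three-hop path:
path_auth     = [True, True, False]
path_cost     = [2, 5, 1]
path_strength = [256, 128, 256]
path_safe     = [0.99, 0.95, 0.90]

print(compose_binary(path_auth))       # False
print(compose_additive(path_cost))     # 8
print(compose_concave(path_strength))  # 128
print(compose_multiplicative(path_safe))
```

A QoS routing algorithm would evaluate each candidate path under the relevant composition rule and keep only paths whose composed value meets the constraint.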

Ibrahem Ali Mohammed Almerhag, Abduelbaset Mustafa Alia Goweder, (05-2010), The International Islamic University, Kuala Lumpur, Malaysia: Proceedings of ICCCE 2010, 151-157

Unsupervised Sentence Boundary Detection Approach for Arabic.
Conference paper

ABSTRACT

Punkt (German for "period") is a sentence boundary detection system that divides an English text into a list of sentences using an unsupervised algorithm developed by Kiss and Strunk (2006) [6]. This algorithm is based on the assumption that a large number of ambiguities in the determination of sentence boundaries can be eliminated once abbreviations have been identified.

The Punkt system was adapted to support the Arabic language. The modified Punkt is trained on an Arabic corpus to build a model of abbreviations, collocations, and words that start sentences. An evaluation of the performance of the modified Punkt system has revealed that an accuracy rate close to 99% has been achieved for detecting Arabic sentence boundaries.
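The key idea above (a period ends a sentence unless it follows a known abbreviation) can be sketched with a deliberately naive splitter. In the real Punkt system the abbreviation list is learned from a corpus without supervision; here it is hard-coded for illustration, and the tokenization is a toy whitespace split.

```python
def split_sentences(text, abbreviations):
    """Naive Punkt-style splitter: a period ends a sentence unless the
    token carrying it is a known abbreviation (learned from a corpus in
    the real, unsupervised system)."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token[:-1].lower() not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

abbrevs = {"dr", "prof", "etc"}   # illustrative; Punkt learns these itself
text = "Dr. Smith met Prof. Jones. They discussed Arabic NLP."
print(split_sentences(text, abbrevs))
```

Without the abbreviation check, the splitter would wrongly break after "Dr." and "Prof.", which is exactly the class of ambiguity the abstract says abbreviation identification eliminates.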

Abduelbaset Mustafa Alia Goweder, (12-2009), University of Science and Technology, Yemen: Proceedings of ACIT 2009, 289-297

A General Technique for Graduating SQL Schema from XML Schema.
Conference paper

It is possible to generate an SQL schema from an XML Schema manually; however, automatically generating an SQL schema from an XML Schema would generally be very beneficial. This paper presents XML Schema-driven generation architecture components with an XSL stylesheet. An algorithm for this type of generation is presented: the inputs of the algorithm are an XML Schema and an XSL stylesheet, and the output is an SQL schema. The proposed algorithm shows how this component can be generated automatically. An evaluation of the proposed algorithm is also presented by testing it with different examples.
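The flavor of such a generation step can be sketched with a tiny XSD-to-DDL mapping. This sketch uses Python's standard `xml.etree.ElementTree` instead of the paper's XSL stylesheet approach, and the schema, type map, and table layout are illustrative assumptions covering only a flat element sequence.

```python
import xml.etree.ElementTree as ET

# A minimal illustrative XML Schema: one top-level element with a flat
# sequence of typed child elements.
XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="book">
    <xs:complexType><xs:sequence>
      <xs:element name="title" type="xs:string"/>
      <xs:element name="year"  type="xs:integer"/>
    </xs:sequence></xs:complexType>
  </xs:element>
</xs:schema>"""

TYPE_MAP = {"xs:string": "VARCHAR(255)", "xs:integer": "INTEGER"}
NS = {"xs": "http://www.w3.org/2001/XMLSchema"}

def xsd_to_sql(xsd_text):
    """Map the top-level element to a table and its typed children to columns."""
    root = ET.fromstring(xsd_text)
    table = root.find("xs:element", NS)
    cols = [f"{e.get('name')} {TYPE_MAP[e.get('type')]}"
            for e in table.iter("{http://www.w3.org/2001/XMLSchema}element")
            if e.get("type")]
    return f"CREATE TABLE {table.get('name')} ({', '.join(cols)});"

print(xsd_to_sql(XSD))  # CREATE TABLE book (title VARCHAR(255), year INTEGER);
```

A full generator must also handle nested complex types, occurrence constraints, and key/foreign-key derivation, which is where the paper's algorithm does the real work.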

Ali Sayeh Ahmed Elbekai, Abduelbaset Mustafa Alia Goweder, (12-2009), University of Science and Technology, Yemen: Proceedings of ACIT 2009, 177-185

An Anti-Spam System using Artificial Neural Networks and Genetic Algorithms
Conference paper

Nowadays, e-mail is widely becoming one of the fastest and most economical forms of communication; thus, e-mail is prone to be misused. One such misuse is the posting of unsolicited, unwanted e-mails known as spam or junk e-mails. This paper presents and discusses an implementation of an anti-spam filtering system which uses a Multi-Layer Perceptron (MLP) as a classifier and a Genetic Algorithm (GA) as the training algorithm. Standard genetic operators and advanced GA techniques are used to train the MLP. The implemented filtering system has achieved an accuracy of about 94% in detecting spam e-mails, and 89% in detecting legitimate e-mails.

Abduelbaset Mustafa Alia Goweder, (12-2008), University of Sfax, Sfax, Tunisia: Proceedings of ACIT2008, 177-185

Arabic Broken Plural using a Machine Translation Technique
Conference paper

Abstract: The Arabic language presents significant challenges to many natural language processing applications. The broken plurals (BP) problem is one of these challenges, especially for information retrieval applications. It is difficult to deal with Arabic broken plurals and reduce them to their associated singulars, because no obvious rules exist, and there are no standard stemming algorithms that can process them. This paper attempts to handle the problem of broken plurals by developing a method to identify broken plurals in unvowelised Arabic text and reduce them to their correct singular forms, incorporating the simple broken plural matching approach with a machine translation system and an English stemmer as a new approach. A set of experiments has been conducted to evaluate the performance of the proposed method using a number of text samples extracted from a large Arabic corpus (Al-Hayat newspaper). The obtained results are analyzed and discussed.

Abduelbaset Mustafa Alia Goweder, (12-2008), University of Sfax, Sfax, Tunisia: Proceedings of ACIT2008, 64-71

A Hybrid Method for Stemming Arabic Text
Conference paper

Abstract: Several stemming approaches have been applied to the Arabic language, yet no complete stemmer for this language is available. The existing stem-based stemmers for Arabic text have poor performance in terms of accuracy and error rates. In order to improve stemming accuracy, a hybrid method is proposed for stemming Arabic text to produce stems (not roots). Improving the accuracy of stemming will necessarily lead to great improvements in many applications, including information retrieval, document classification, machine translation, text analysis, and text compression. The proposed method integrates three different stemming techniques: morphological analysis, affix removal, and dictionaries.
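One of the three techniques the hybrid method combines, affix removal, can be sketched minimally as follows. The prefix and suffix lists are a small illustrative subset, not the paper's actual inventories, and a real stemmer would consult the morphological-analysis and dictionary components before accepting a candidate stem.

```python
# Illustrative (not exhaustive) Arabic affix lists, longest first so the
# longest matching affix is stripped.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و"]
SUFFIXES = ["ات", "ون", "ين", "ها", "ية", "ة"]

def light_stem(word, min_stem=2):
    """Strip at most one prefix and one suffix, keeping the stem
    at least min_stem characters long."""
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= min_stem:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= min_stem:
            word = word[:-len(s)]
            break
    return word

print(light_stem("المعلمون"))  # "the teachers" → stem "معلم"
```

The length guard is what keeps affix removal from over-stripping short words, one of the error sources the abstract attributes to existing stemmers.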

Abduelbaset Mustafa Alia Goweder, (12-2008), University of Sfax, Sfax, Tunisia: Proceedings of ACIT2008, 125-132

Identifying Broken Plurals in Unvowelised Arabic Text
Conference paper

Irregular (so-called broken) plural identification in modern standard Arabic is a problematic issue for information retrieval (IR) and language engineering applications, but its effect on the performance of IR has never been examined. Broken plurals (BPs) are formed by altering the singular (as in English: tooth → teeth) through the application of interdigitating patterns on stems, and singular words cannot be recovered by standard affix-stripping stemming techniques. We developed several methods for BP detection and evaluated them using an unseen test set. We incorporated the BP detection component into a new light-stemming algorithm that conflates both regular and broken plurals with their singular forms. We also evaluated the new light-stemming algorithm within the context of information retrieval, comparing its performance with other stemming algorithms.
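The "interdigitating patterns" mentioned above can be illustrated with a toy pattern matcher: a candidate word is compared against templatic broken-plural patterns in which the letters ف, ع, and ل stand for root consonant slots. The pattern list here is a tiny illustrative subset, not the inventory used in the paper's detection methods.

```python
# Broken plurals follow templatic patterns; in a pattern, the letters
# ف, ع, ل act as wildcards for root consonants, while other letters
# (e.g. أ, ا) must match literally. Illustrative pattern subset only.
BP_PATTERNS = ["أفعال", "فعول", "أفعلة", "فعائل"]
ROOT_SLOTS = set("فعل")

def matches(word, pattern):
    """True if word fits the pattern letter-by-letter."""
    if len(word) != len(pattern):
        return False
    return all(p in ROOT_SLOTS or w == p for w, p in zip(word, pattern))

def is_broken_plural(word):
    return any(matches(word, p) for p in BP_PATTERNS)

print(is_broken_plural("أقلام"))  # plural of "قلم" (pen) → True
print(is_broken_plural("قلم"))    # singular → False
```

Pattern matching alone over-generates, since some singular forms share templates with plural patterns, which is why the paper combines it with further evidence before conflating a word with a singular stem.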

Abduelbaset Mustafa Alia Goweder, (07-2004), Barcelona, Spain: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, 246-253

Broken Plural Detection for Arabic Information Retrieval
Conference paper

Abstract

Due to the high number of inflectional variations of Arabic words, empirical results suggest that stemming is essential for Arabic information retrieval. However, current light stemming algorithms do not extract the correct stem of irregular (so-called broken) plurals, which constitute ~10% of Arabic texts and ~41% of plurals. Although light stemming in particular has led to improvements in information retrieval [5, 6], the effect of broken plurals on the performance of information retrieval systems has not been examined. We propose a light stemmer that incorporates a broken plural recognition component, and evaluate it within the context of information retrieval. Our results show that identifying broken plurals and reducing them to their correct stems does result in a significant improvement in the performance of information retrieval systems.

Abduelbaset Mustafa Alia Goweder, (07-2004), The University of Sheffield, UK: The 27th Annual International ACM SIGIR Conference, 566-567

Assessment of a Significant Arabic Corpus
Conference paper

The development of Language Engineering and Information Retrieval applications for Arabic requires the availability of sizeable, reliable corpora of modern Arabic text. These are not routinely available. This paper describes how we constructed an 18.5-million-word corpus from Al-Hayat newspaper text, with articles tagged as belonging to one of 7 domains. We outline the profile of the data and how we assessed its representativeness. The literature suggests that the statistical profile of Arabic text is significantly different from that of English in ways that might affect the applicability of standard techniques. The corpus allowed us to verify a collection of experiments which had, so far, only been conducted on small, manually collected datasets. We draw some comparisons with English and conclude that there is evidence that Arabic data is much sparser than English for the same data size.

Abduelbaset Mustafa Alia Goweder, (08-2001), Toulouse, France: Proceedings of ACL 2001, 71-78