Segmental analysis based authorship discrimination between the holy Quran and prophet's statements


Halim Sayoud, University of Sciences and Technology Houari Boumediene (USTHB): halim.sayoud@uni.de

Peer-reviewed by: David Hoover, New York University; Atif Khalil, University of Lethbridge.

Lead Editor: Daniel Paul O'Donnell, University of Lethbridge.


Abstract / Résumé

Stylometry has attracted considerable interest in recent years because it has helped resolve many authorship problems and disputes that were otherwise difficult to handle. Author discrimination consists in checking whether two texts were written by the same author or not. In this investigation, the author performs an author discrimination between the Quran (the holy words and statements of God in the Islamic religion) and the Hadith (statements said by the prophet Muhammad) in a segmental form. Specifically, 14 text segments are extracted from the Quran and 11 text segments are extracted from the Bukhari Hadith. These segments have more or less the same size in terms of words, and the average size is about 2080 words per text segment. The Quran is taken in its entirety, whereas for the Prophet's statements, only the certified texts of the Bukhari book were chosen. Four series of experiments are conducted and discussed. The first series concerns several experiments of authorship attribution (AA) using different state-of-the-art features and classifiers; the second analyses the different texts by using a new parameter called COST; the third consists in an authorship discrimination using the frequency of a particular word ("الذين", meaning those/who in English); and the fourth performs a hierarchical clustering on the 25 text segments, in order to assess the real number of clusters (author styles) and to see if the hypothesis of a unique author is possible. This investigation, which continues a previous research work on the same topic (Sayoud 2012a), sheds further light on an old enigma that has remained unsolved for fourteen centuries: all the results of this investigation consistently indicate that the two books should have two different authors.

La stylométrie a montré beaucoup d'intérêts durant ces dernières années car elle a résolu beaucoup de problèmes et disputes qui étaient difficiles à manipuler. La discrimination d'auteur consiste à vérifier si deux textes sont écrits par le même auteur ou non. Dans cette étude, nous essayons d'exécuter une discrimination d'auteur entre le Coran (Les mots saints et déclarations de Dieu dans la religion Islamique) et le Hadith (déclarations prononcées par le Prophète Muhammad) sous une forme segmentale. En fait, 14 segments de texte sont extraits du Coran et 11 segments sont extraits du Hadith de Bukhari. Ces segments ont plus ou moins la même taille en terme de mots et la taille moyenne est d'environ 2080 mots par segment. Le Coran est pris entièrement, tandis que pour les déclarations du Prophète, nous avons choisi seulement les textes certifiés du livre de Bukhari. Ainsi, quatre séries d'expériences sont faites et commentées. La première série d'expériences concerne plusieurs expériences d'identification d'auteur utilisant différents classifieurs et caractéristiques de l'état de l'art ; la seconde série d'expériences analyse les différents textes en utilisant un nouveau paramètre appelé COST; la troisième série d'expériences consiste en une discrimination d'auteurs utilisant la fréquence d'un mot particulier ("الذين " signifiant Ceux/ Qui en Français) et la quatrième série d'expériences exécute un regroupement hiérarchique sur les 25 segments de texte, dans le but d'estimer le nombre réel de clusters (Styles d'auteurs) et de voir si l'hypothèse d'un auteur unique est possible. Cette étude, qui représente la suite d'un travail de recherche précédent sur le même sujet (Sayoud 2012a), a clarifié d'avantages une ancienne énigme, qui était impossible de résoudre durant quatorze siècles : en fait, tous les résultats de cette étude montrent que les deux livres devraient avoir deux auteurs différents.

KEYWORDS / MOTS-CLÉS

Stylometry, segmental text analysis, authorship attribution, authorship discrimination, religious books, Quran, prophet's statements, comparison of text segments



Introduction

A long history of linguistic and stylistic investigation into authorship attribution exists (Holmes 1998), owing to several unresolved authorship disputes and to the fact that authors have different ways of speaking and writing (Corney 2003). Authorship discrimination is a research field of stylometry that consists in checking whether two different texts were written by the same author or not, by using text-mining techniques. In general, the longer the text is, the better the identification accuracy becomes.

Stylometry is part of a broader growth within computer science of identification technologies, including biometrics (retinal scanning, speaker recognition, etc.), cryptographic signatures, intrusion detection systems, and others (Madigan 2005). Stylometry (i.e. authorship recognition) can be divided into several related fields that are:

  • Authorship attribution or authorship identification: consists in identifying the author(s) of a set of different texts;
  • Authorship verification: consists in checking whether a piece of text is written or not by an author who claimed to be the writer;
  • Authorship discrimination: consists in checking if two different texts are written by the same author or not;
  • Plagiarism detection: in this research field, we look for sentences or paragraphs that are taken from another author;
  • Text indexing and text segmentation: one particular interest in stylometry is to segment multi-author texts (e.g. discussion forum) into mono-author segments (segmentation) by giving the name of the appropriate author in each text segment (indexing).

Many works have been reported for the English, Greek and Hebrew languages; however, the author of this paper has not found substantial research for the Arabic language, in which there exist several old books that are attributed to known authors and whose authorship is sometimes put in doubt. In this research work, we deal with a religious enigma that has not been solved for fourteen hundred years (Sayoud 2010; 2012a). In fact, many attempts have been made to find a human source for the Quran, assuming for instance that the Quran could have been written by the prophet Muhammad (Al-Shreef 2009). Such disputes are very difficult to settle because of the delicacy of the problem, the religious sensitivity, and the fact that the texts were written a long time ago. Furthermore, disputes of this type can be found for several religious texts; for instance, in the Christian religion, several disputes have been reported about the origin of some Biblical texts (Kenny 1986).

One of the purposes of stylometry is authorship attribution, which is the determination of the author of a particular piece of text. This task is particularly required when religious authorship disputes appear (Mills 2003). Hence, it can be seen why Holmes (Mills 2003) pointed out that the area of stylistic analysis is the main contribution of statistics to religious studies. For example, early in the nineteenth century, Schleiermacher disputed the authorship of the Pauline Pastoral Epistle-1 Timothy (Mills 2003). As a result, other German-speaking theologians, namely F.C. Baur and H.J. Holtzmann, initiated similar studies of New Testament books (Mills 2003). Since then, several investigations have been carried out on different pieces of religious texts and with different analysis techniques. However, in such problems, it is crucial to use rigorous scientific tools, and it is even more important to interpret their results very carefully. Hence, in this investigation, we conduct some experiments of author discrimination (Li 2006) between the Quran and the Prophet's statements in order to check whether the Quran was indeed not written by the Prophet Muhammad (i.e. it was only sent to him by God) (Sayoud 2012a). For this purpose, four series of experiments are made: the first series concerns several experiments of authorship attribution using different state-of-the-art features and classifiers; the second analyses the different texts by using a new parameter called COST; the third consists in an authorship discrimination using the frequency of a particular word, "الذين" (meaning those/who in English); and the fourth performs a hierarchical clustering on the different texts.

The manuscript is organized as follows: Brief description of the old religious Books gives a description of the two books to be compared; Description of the experimental corpus describes the text dataset that is used in this experiment; Discrimination experiments using different authorship attribution techniques describes the different experiments of authorship discrimination and attribution; and, finally, an overall discussion is put at the end of the manuscript.

Brief description of the old religious Books

Herein, we give a brief description of the two religious books that are investigated in our experiment, namely the Quran and the Hadith.

Description of the Quran

The Quran (in Arabic: القرآن) is the central religious text of Islam (Nasr 2013; Wiki 2011b). Muslims believe the Quran to be the book of divine guidance and direction for mankind (Ibrahim 1996; Izutsu 2002; Robinson 2004), written by God, and consider this Arabic book to be the final revelation of God. Islam holds that the Quran was written by Allah (i.e. God) and transmitted to Muhammad by the angel Gibraele (Gabriel) over a period of 23 years. The revelation of the Quran began in the year 610 CE.

Description of the Hadith

The Hadith (in Arabic: الحديث) is the oral statements and words said by the Islamic prophet Muhammad (Pbuh) (Islahi 1989; Wiki 2011a). Hadith collections are regarded as important tools for determining the Sunnah, or Muslim way of life, by all traditional schools of jurisprudence. In Islamic terminology, the term Hadith refers to reports about the statements or actions of the Islamic prophet Muhammad, or about his tacit approval of something said or done in his presence (Islahi 1989; Wiki 2011a). The text of the Hadith (matn) would most often come in the form of a speech, injunction, proverb, aphorism or brief dialogue of the Prophet whose sense might apply to a range of new contexts. The Hadith was recorded from the Prophet for a period of 23 years between 610 and 633 (after the Birth of Christ).

Was the prophet the author of the Quran?

Muslims believe that Muhammad was only the narrator who recited the sentences of the Quran as written by Allah (God), but not the author. See what Allah (God) says in the Quran book: "O Messenger (Muhammad)! Transmit (the Message) which has been sent down to you from your Lord. And if you do not, then you have not conveyed his Message…." However, some doubts on the origins of the Quran suppose that the Quran could be written by the prophet Muhammad as reported by Al-Shreef (2009).

That is, the main purpose of this investigation is to conduct a fair text-mining based investigation (i.e. authorship discrimination) in order to see if the two concerned books have the same or different authors (Mills 2003; Tambouratzis 2000, 2003) with a maximum of objectivity.

Description of the experimental corpus

Dimension of the two religious books

In a previous work, the author used the entire text of the Quran (about 315 A4 pages) but only a small collection of the Hadith (not exceeding 3 pages), because of the difficulty of finding a book containing only the Prophet's sentences (without the comments of the narrators). In this context, the author was strongly advised by some experienced stylometric researchers, who were working on Greek discourses, to increase the size of the Hadith text in order to get a consistent comparison between the two investigated books. So, after a thorough investigation of the Hadith texts, the author managed to collect a reliable and consistent dataset, organized in a more convenient form (a book gathering pure Prophet statements, called the Bukhari Hadith).

That is, the present section summarizes the size of the two new investigated books in terms of words, tokens, pages, etc. The statistical characteristics of these two books are summarized as follows:

  • Quran size in terms of tokens: approximately 87341
  • Hadith size in terms of tokens (Bukhari Hadith): 23068
  • Quran size in terms of different words: approximately 13473
  • Hadith size in terms of different words (Bukhari Hadith): 6225
  • Number of A4 pages in the Quran: 315 pages (subjective size)
  • Number of A4 pages in the Hadith (Bukhari Hadith): 87 pages (subjective size)
  • Ratio of the Number of Quran Tokens / Number of Hadith Tokens: 3.79
  • Ratio of the Number of Quran Lines / Number of Hadith Lines: 3.61
  • Ratio of the Number of different Quran words / Number of different Hadith words: 2.16
  • Ratio of the Number of Quran A4 Pages / Number of Hadith A4 Pages: 3.62

According to these size details, the two religious books seem reasonably comparable, since the Quran occupies 315 pages and the Hadith book 87 pages. However, since the two books do not have the same size, it is prudent to divide them into segments of more or less the same size, in order to avoid unbalanced results.

Segmentation

As mentioned in Dimension of the two religious books, the author has already conducted an authorship investigation (previous work) on the two religious books by considering each book as a whole (Sayoud 2012a). In that approach, when comparing two books, it is difficult to know whether any part of one book is similar to, or different from, the other. That is why a segmentation has been proposed and applied to the two books, which consists in dividing them into several text segments.

The sizes of the segments are more or less in the same range: there are 14 different text segments for the Quran and 11 different text segments for the Hadith, with approximately the same size. In the case of machine-learning-based classification, these segments are organized as follows: three text segments are selected from every book to represent the training data, and the remaining text segments are used during the testing step. In the other cases, all the text segments are used for classification/attribution. The segments have more or less the same size in terms of words, as shown in Table 1. The average size is about 2076 words per text. A limitation of such a size is that AA systems tend to be less accurate on short texts: it has been reported that a reliable AA process requires at least 2500 words per text (Eder 2010; Signoriello 2005).

Table 1: Sizes of the different text segments.

Hadith text segment   Size (tokens)   Quran text segment   Size (tokens)
H1                    2035            Q1                   2064
H2                    2096            Q2                   2071
H3                    2053            Q3                   2086
H4                    2059            Q4                   2085
H5                    2081            Q5                   2081
H6                    2073            Q6                   2080
H7                    2031            Q7                   2087
H8                    2082            Q8                   2074
H9                    2088            Q9                   2081
H10                   2097            Q10                  2079
H11                   2083            Q11                  2078
/                     /               Q12                  2092
/                     /               Q13                  2093
/                     /               Q14                  2081
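
As a quick arithmetic check on Table 1, the average segment size can be recomputed directly from the listed token counts. This is only a minimal sketch; the variable names are illustrative and not taken from the original work.

```python
# Token counts copied from Table 1.
hadith_sizes = [2035, 2096, 2053, 2059, 2081, 2073, 2031, 2082, 2088, 2097, 2083]
quran_sizes = [2064, 2071, 2086, 2085, 2081, 2080, 2087, 2074, 2081, 2079,
               2078, 2092, 2093, 2081]

all_sizes = hadith_sizes + quran_sizes

# Arithmetic mean over the 25 segments.
mean_size = sum(all_sizes) / len(all_sizes)

print(len(all_sizes), round(mean_size, 1))  # 25 2076.4
```

The result agrees with the average of about 2076 words per text quoted above.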

Word structure of the different segments

A graphical representation of the word length frequency has been made for every text segment, in order to see the overall structure of the words used in terms of size. Figure 1 represents the smoothed word length frequency curves versus the number of characters per word. It shows that word lengths have more or less the same frequency distribution in both books, except for words of 1, 3, 4 and 8 characters, where a certain difference in frequency can often be seen; this observation, however, cannot be used for objective discrimination purposes.

Figure 1: Word length frequency versus the word length (for all text segments)- curves are obtained by interpolation.


Discrimination experiments using different authorship attribution techniques

Herein, we will describe the four series of experiments that have been conducted on the two religious books for a purpose of authorship discrimination.

Experiments of authorship attribution using a hierarchical clustering

In order to represent the stylistic similarity between the different texts in a graphical way, a hierarchical clustering (Sayoud 2012b), using the cityblock distance, has been performed on all text segments by using the following features: the COST parameter (see Experiments of authorship attribution using the COST parameter) and the frequency of the word "الذين" meaning THOSE (or WHO in a plural form) in English. The resulting dendrogram is displayed in Figure 2, where we can see the different possible clusters and their costs (distances). The smaller the cost is, the more similar the segments within the same cluster are.

Figure 2: Dendrogram of a hierarchical clustering corresponding to the 25 text segments.


As we can see in Figure 2, the segments have been automatically divided into 2 main clusters: "cluster Q" (in dark red), grouping all the text segments of the Quran, and "cluster H" (in light blue), gathering all the text segments of the Hadith. We can notice that the last merge into a single cluster (big line at the top) is inconsistent for two reasons: first, because the corresponding distance of this last cluster is more than 4.5, which is relatively large; and second, because we do not find any link at all between heterogeneous segments (clusters grouping different label types such as Qj-Hk). This result suggests that the different text segments should belong to 2 different authors, or at least 2 different author styles. It also shows that the Quran texts are relatively similar to one another (low intra-variability, with distances less than 2) and that the Hadith texts are relatively similar to one another too (low intra-variability, with distances less than 1).
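
The clustering procedure described above can be sketched as follows. This is a hedged illustration only: the source specifies the cityblock distance and the two features, but not the linkage rule, so average linkage is an assumption here, and the six feature pairs (average COST, frequency in % of the word) are hypothetical toy values chosen for the example, not measurements from the study.

```python
def cityblock(u, v):
    """Cityblock (Manhattan) distance between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def agglomerate(points):
    """Average-linkage agglomerative clustering (linkage rule assumed).

    Returns the merge history as (sorted member names, merge distance).
    """
    clusters = {name: [name] for name in points}
    merges = []
    while len(clusters) > 1:
        best = None
        keys = sorted(clusters)
        for i, a in enumerate(keys):
            for b in keys[i + 1:]:
                pairs = [(p, q) for p in clusters[a] for q in clusters[b]]
                dist = sum(cityblock(points[p], points[q])
                           for p, q in pairs) / len(pairs)
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        merged = clusters.pop(a) + clusters.pop(b)
        clusters[a + "+" + b] = merged
        merges.append((sorted(merged), round(dist, 3)))
    return merges


# Hypothetical toy features: (average COST, frequency in % of the word).
toy_segments = {
    "Q1": (2.6, 1.4), "Q2": (2.4, 1.2), "Q3": (2.5, 1.0),
    "H1": (0.5, 0.1), "H2": (0.4, 0.0), "H3": (0.5, 0.2),
}

history = agglomerate(toy_segments)
for members, distance in history:
    print(members, distance)
```

With such toy values, every merge except the last one groups segments of a single label, and the final merge happens at a much larger distance, which is the pattern the dendrogram is read for.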

Experiments of authorship attribution using different state of the art features and classifiers

This series of experiments, which consists in an authorship attribution (Sanderson 2006), analyses the two books in a segmental form by using several features (word n-grams, character n-grams and rare words) (Clement 2003) and several classifiers: Stamatatos distance, Canberra distance, Cosine distance, RN cross entropy distance, Intersection distance, Manhattan distance, SMO-SVM (Sequential Minimal Optimization based Support Vector Machine) classifier, Linear regression classifier, and MLP (Multi-Layer Perceptron) classifier.
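
The three feature families listed above can be extracted along the following lines. This is a minimal sketch under stated assumptions: tokens are obtained by whitespace splitting, character n-grams include spaces, and the function names are illustrative; the tokenization actually used in the experiments is not described in the text.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count overlapping character n-grams (spaces included)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text, n):
    """Count overlapping word n-grams over whitespace-separated tokens."""
    tokens = text.split()
    return Counter(" ".join(tokens[i:i + n])
                   for i in range(len(tokens) - n + 1))


def rare_words(text, max_freq=3):
    """Words occurring at most max_freq times (freq = 1..3 by default)."""
    counts = Counter(text.split())
    return {word: c for word, c in counts.items() if c <= max_freq}


def top_k(counts, k):
    """Keep only the k most frequent features, e.g. the 50 most frequent."""
    return dict(counts.most_common(k))


sample = "the cat and the dog and the bird"
print(word_ngrams(sample, 2)["and the"])
print(rare_words(sample, max_freq=1))
```

Restricting a feature set with `top_k` mirrors the use of the "50 most frequent" features mentioned in Table 2.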

Brief description of the different classifiers

Short definitions of the different classifiers are given below:

Manhattan distance

This distance (Sayoud 2012a) is very reliable in text classification. The Manhattan distance between two vectors f and g is given by the following formula:

$$d(f, g) = \sum_{i=1}^{n} |f_i - g_i| \tag{1}$$

where n is the length of the vector.

Cosine distance

Cosine similarity is a measure of similarity between two vectors that measures the cosine of the angle between them (Wiki 2013a). The technique is also used to compare documents in text mining. The cosine of two vectors can be derived by using the Euclidean dot product formula:

$$f \cdot g = \|f\| \, \|g\| \cos(\theta) \tag{2}$$

Given two vectors of attributes, f and g, the cosine similarity, cos (θ), is represented using a dot product and magnitude as:

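$$\cos(\theta) = \frac{f \cdot g}{\|f\| \, \|g\|} = \frac{\sum_{i=1}^{n} f_i \, g_i}{\sqrt{\sum_{i=1}^{n} f_i^2} \; \sqrt{\sum_{i=1}^{n} g_i^2}} \tag{3}$$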

where the double vertical bar denotes the magnitude of the vector and n is its length (Wiki 2013b).

Stamatatos distance

This distance was proposed by Stamatatos (Stamatatos 2007). The Stamatatos distance between two vectors f and g is given by the following formula:

$$d(f, g) = \sum_{i=1}^{n} \left( \frac{2 \, (f_i - g_i)}{f_i + g_i} \right)^{2} \tag{4}$$

where n is the length of the vector.

Canberra distance 

Canberra distance is a numerical measure of the distance between pairs of points in a vector space. It is more or less similar to Manhattan distance. It is mostly used for data scattered around the origin. The Canberra distance between two vectors f and g is given by the following formula:

$$d(f, g) = \sum_{i=1}^{n} \frac{|f_i - g_i|}{|f_i| + |g_i|} \tag{5}$$

where n is the length of the vector.
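
The Manhattan, cosine and Canberra measures described above can be sketched in a few lines. This is an illustrative sketch, not the implementation used in the experiments: turning the cosine similarity into a distance as 1 minus the similarity, and skipping Canberra terms where both components are zero, are assumptions made here.

```python
import math


def manhattan(f, g):
    """Sum of absolute component-wise differences."""
    return sum(abs(a - b) for a, b in zip(f, g))


def cosine_similarity(f, g):
    """Cosine of the angle between f and g (both must be non-zero vectors)."""
    dot = sum(a * b for a, b in zip(f, g))
    norm_f = math.sqrt(sum(a * a for a in f))
    norm_g = math.sqrt(sum(b * b for b in g))
    return dot / (norm_f * norm_g)


def cosine_distance(f, g):
    """Assumed conversion of the similarity into a distance."""
    return 1.0 - cosine_similarity(f, g)


def canberra(f, g):
    """Canberra distance; terms with both components zero are skipped."""
    return sum(abs(a - b) / (abs(a) + abs(b))
               for a, b in zip(f, g) if a or b)


f = [1, 2, 3]
g = [2, 2, 5]
print(manhattan(f, g))  # 3
print(round(canberra(f, g), 4))
print(round(cosine_distance(f, g), 4))
```

In an attribution setting, a test segment would be assigned to the reference author whose feature vector gives the smallest distance.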

Cross entropy distance

The Cross entropy distance, where f and g are supposed independent (Juola 2006), is given by:

(6)

It has been widely used (improved version) by Juola (2006) in his released software.

Intersection distance

The Intersection distance, which measures the dissimilarity between two sample sets, is complementary to the Jaccard coefficient and is obtained by subtracting the intersection-to-union ratio from 1:

$$d(F, G) = 1 - \frac{|F \cap G|}{|F \cup G|} \tag{7}$$

where F and G are the two sample sets.
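
The intersection distance described above, that is, 1 minus the intersection-to-union ratio of two sample sets, can be sketched as follows. The example sets of character bi-grams are illustrative, and returning 0 for two empty sets is a convention assumed here.

```python
def intersection_distance(a, b):
    """1 minus the intersection-to-union ratio of two sample sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Two empty sets: treated as identical (assumed convention).
        return 0.0
    return 1.0 - len(a & b) / len(union)


first = {"ab", "bc", "cd"}
second = {"bc", "cd", "de"}
print(intersection_distance(first, second))  # 0.5
```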

Multi-Layer Perceptron (MLP)

The MLP is a classic neural network classifier that uses the errors at the output to train the neural network (Sayoud 2003). The MLP can use different back-propagation schemes for the training of the classifier. Here, the MLP is trained on three texts for every author, whereas the remaining texts are used for the testing task. The MLP is usually efficient in supervised classification; however, if the training converges to a local minimum, some classification errors can occur.

Sequential Minimal Optimization based Support Vector Machine (SMO-SVM)

In machine learning, Support Vector Machines (SVMs) are supervised learning models with associated learning algorithms that analyze data and recognize patterns, and are employed for classification and regression analysis. The basic SVM takes a set of input data and predicts, for each given input, which of two possible classes forms the output, making it a non-probabilistic binary linear classifier. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on. In addition to performing linear classification, SVMs can efficiently perform non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces. The SVM is generally an accurate classifier that uses the borderline examples closest to the decision boundary (the support vectors) to form the boundaries of the different classes (Witten 1999). The SMO algorithm is used to speed up the training of the SVM (Keerthi 2001).

Linear regression

Linear regression is one of the oldest and most widely used predictive models. The method of minimizing the sum of the squared errors to fit a straight line to a set of data points was published by Legendre in 1805 and by Gauss in 1809 (Deng 2013). Linear regression models are often fitted using the least squares approach, but they may also be fitted in other ways, such as by minimizing the "lack of fit" in some other norm (as with least absolute deviations regression), or by minimizing a penalized version of the least squares loss function, as in ridge regression (Huang 2003; Sayoud 2003; Wiki 2013b).

Authorship attribution results

As quoted in Description of the experimental corpus, there are 25 different text segments of about 2080 words each, consisting of 11 Hadith segments and 14 Quran segments. In these experiments, 3 segments of the Hadith and 3 other segments of the Quran are used for the training and the remaining segments (8 Hadith segments and 11 Quran segments) are used for the testing. Therefore, there are 19 different segments to identify according to 2 referential Authors (Quran Author or Hadith Author).

In the following paragraphs, an attribution error of 0% means that all the Quran segments are classified as "Quran class" and all the Hadith segments are classified as "Hadith class," without any error of attribution. In fact the attribution error is defined as the ratio of the number of false attributions over the total number of testing segments (see equation 8).

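$$\text{Attribution error} = \frac{\text{number of false attributions}}{\text{total number of testing segments}} \times 100\% \tag{8}$$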
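
The attribution error defined above can be computed as in the following sketch. The label lists are made up for illustration: they simply show how one or two misattributed segments out of the 19 test segments translate into the percentages that appear in Table 2.

```python
def attribution_error(true_labels, predicted_labels):
    """Percentage of test segments attributed to the wrong author."""
    wrong = sum(t != p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * wrong / len(true_labels)


# 8 Hadith and 11 Quran test segments, as in the experimental setup.
true_labels = ["H"] * 8 + ["Q"] * 11

# Hypothetical predictions with one and two false attributions.
one_miss = ["H"] * 8 + ["Q"] * 10 + ["H"]
two_misses = ["H"] * 8 + ["Q"] * 9 + ["H", "H"]

print(round(attribution_error(true_labels, true_labels), 1))  # 0.0
print(round(attribution_error(true_labels, one_miss), 1))     # 5.3
print(round(attribution_error(true_labels, two_misses), 1))   # 10.5
```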

Table 2: Attribution error in % for the different text segments. There are 11 segments for the Hadith (8 testing + 3 reference) and 14 for the Quran (11 testing + 3 reference).

Classifier            Char.     Char.     Char.       Word           Word           Word           Word        Rare words
                      bi-gram   tri-gram  tetra-gram                 bi-gram        tri-gram       tetra-gram  (freq = 1..3)
Number of features    All       All       All         50 most freq.  50 most freq.  50 most freq.  All         All
SMO-SVM               0%        0%        0%          0%             0%             0%             0%          0%
Linear regression     0%        0%        0%          0%             0%             0%             0%          0%
MLP                   0%*       0%*       0%*         0%             0%             0%             0%*         0%*
Stamatatos distance   0%        0%        0%          0%             0%             5.3%           0%          0%
Canberra distance     0%        0%        0%          0%             0%             10.5%          0%          0%
Cosine distance       0%        0%        0%          0%             0%             0%             0%          0%
RN cross entropy      0%        0%        0%          0%             5.3%           0%             0%          0%
Intersection distance -         0%        0%          0%             0%             0%             5.3%        0%
Manhattan distance    0%        0%        0%          0%             0%             0%             0%          0%

* means that only the 500 most frequent features are employed

- means a classification failure

By observing Table 2, we can notice that, in 67 of the 72 feature/classifier combinations, all Quran segments are attributed to the referential "Quran Author" and all Hadith segments are attributed to the referential "Hadith Author." That is, the 19 different text segments are classified into 2 main classes, "Quran class" and "Hadith class," with 0% classification error. The only exceptions are one classification failure and four combinations in which one or two of the 19 segments are misattributed (errors of 5.3% and 10.5%). From this result, we can deduce that the 2 religious books should have 2 different authors (or at least 2 different writing styles) and that every book should be written by one author (or at least one writing style).

Experiments of authorship attribution using the COST parameter

What is the COST parameter?

Usually, when poets write a poem, they create a similarity between the endings of neighboring sentences, such as a shared final syllable or letter. To evaluate that similarity of endings, a new parameter estimating the degree of chaining between consecutive sentences of a text has been proposed by the author: it is called the COST parameter (Sayoud 2012a). The COST parameter of the jth sentence is computed by incrementing a counter of similarities between sentence j and its two neighbors, sentence (j-1) and sentence (j+1): all the occurrence marks between sentence j and its neighboring sentences are added, giving a value between 0 and 4. In our case, the occurrence marks concern only the last two letters of the sentence. For instance, let us observe the following poem:

  • Never say it is the end when we do believe ➞ cost = 2
  • And never accept that you do not retrieve ➞ cost = 2
  • Life is so short to let things kill our mind ➞ cost = 2
  • What to do in such situations dear friend ➞ cost = 4
  • It is true that it is hard but victory will be in hand ➞ cost = 2
  • Do not hesitate to try if you can make any change ➞ cost = 1
  • Yes it is worth trying even if it is the last chance ➞ cost = 1

If we consider the fourth sentence (ending with "nd"), we notice that the previous and next sentences (sentences 3 and 5) end with the same last 2 characters (i.e. "nd").

So by counting the number of similar characters (i.e. (1+1) + (1+1) = 4), we get a COST value of 4. The same procedure is repeated for each sentence until the last one. For concreteness, here are the COST values for some Hadith sentences (see Table 3) and the COST values of some sentences located at the middle part of the Quran (see Table 4).

Table 3: COST values for some Hadith sentences.

Sentence No   COST   Last 2 characters   Last word
1             0      دا                  أبدا
2             0      ون                  تظلمون
3             0      ية                  الجاهلية
4             0      له                  الله
5             0      ان                  شعبان
6             0      يه                  نبيه
7             0      غت                  بلغت
8             0      هد                  اشهد
9             0      كم                  لأنفسكم
10            1      يك                  عليك
11            1      سك                  لنفسك
12            0      نم                  جهنم
13            1      ته                  بركاته
14            2      نه                  أستعينه
15            1      نا                  أعمالنا
16            1      له                  له
17            1      غه                  أبلغه
18            0      كم                  عليكم

Table 4: COST values of some sentences located at the middle of the Quran.

Sentence No   COST   Last 2 characters   Last word
3116          4      ين                  المسحرين
3117          4      ين                  الكاذبين
3118          3      ين                  الصادقين
3119          1      ون                  تعملون
3120          1      يم                  عظيم
3121          2      ين                  مؤمنين
3122          2      يم                  الرحيم
3123          3      ين                  العالمين
3124          4      ين                  الأمين
3125          4      ين                  المنذرين
3126          4      ين                  مبين
3127          3      ين                  الأولين
3128          2      يل                  إسرائيل
3129          3      ين                  الأعجمين
3130          4      ين                  مؤمنين
3131          3      ين                  المجرمين
3132          1      يم                  الأليم
3133          2      ون                  يشعرون
3134          4      ون                  منظرون
3135          3      ون                  يستعجلون
3136          2      ين                  سنين

According to the previous tables, we remark that for the Hadith text, many COST values are equal to 0, and when the COST is non-zero, it takes very small values: the average COST is only 0.46. For the Quran, we notice that the COST is almost never zero and that the corresponding values are relatively high: the average COST of the Quran is approximately 2.52. This interesting fact suggests applying this type of experiment to the different text segments in order to see whether there exists a stylistic difference between these segments. The different average COST values are represented in Figure 3.
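
The COST computation described in this section can be sketched as follows. The position-wise comparison (one mark when the last letters coincide, one mark when the second-to-last letters coincide) is an assumption inferred from the worked poem example, with which it agrees; the function names are illustrative.

```python
def ending(sentence, k=2):
    """Last k letters of a sentence, ignoring spaces and punctuation."""
    letters = [ch for ch in sentence if ch.isalpha()]
    return "".join(letters[-k:])


def match_marks(a, b):
    """One mark per position (last, second-to-last letter) that coincides."""
    return sum(1 for x, y in zip(reversed(a), reversed(b)) if x == y)


def cost_values(sentences):
    """COST of each sentence: marks with the previous and next sentences."""
    ends = [ending(s) for s in sentences]
    costs = []
    for j, end in enumerate(ends):
        total = 0
        if j > 0:
            total += match_marks(end, ends[j - 1])
        if j + 1 < len(ends):
            total += match_marks(end, ends[j + 1])
        costs.append(total)
    return costs


poem = [
    "Never say it is the end when we do believe",
    "And never accept that you do not retrieve",
    "Life is so short to let things kill our mind",
    "What to do in such situations dear friend",
    "It is true that it is hard but victory will be in hand",
    "Do not hesitate to try if you can make any change",
    "Yes it is worth trying even if it is the last chance",
]

costs = cost_values(poem)
print(costs)                    # [2, 2, 2, 4, 2, 1, 1]
print(sum(costs) / len(costs))  # 2.0
```

Averaging the per-sentence values over a text segment gives the kind of average COST that is compared across segments.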

Figure 3: Average COST values for all the text segments.


Figure 3 shows a sharp difference between the Quran segments, which present relatively high COST values, and the Hadith segments, for which the COST values are very small. This fact implies that the structures of Quran and Hadith are different. Consequently, and since we deal with the same topic (i.e. the two samples are both religious texts), the two books should have two different author styles.

Furthermore, in order to assess the significance of the previous results, a statistical investigation of the consistency of the discrimination between the two types of segments has been made by using Fisher's exact test (Lowry 2012). The results show a two-tailed P value that is less than 0.0001, which means that the association between the style and the COST parameter is statistically significant.

Experiments of authorship attribution using the frequency of the word "الذين"

This experiment investigates the use of words that are very commonly used in only one of the books (Sayoud 2012a). In practice, we observed that the word الذين (in English: THOSE or WHO in a plural form) is very commonly used in the Quran, whereas in the Hadith it is rarely used, as shown in Figure 4. Its occurrence frequency is between 0.63% and 2.02% for Quran segments, but between 0% and 0.29% for Hadith segments. Its average occurrence frequency is 1.3% for Quran segments and only 0.09% for Hadith segments (that is, about 1/14 of the average Quran frequency).

These results show that the author of the Quran uses much more frequently this particular word than the Hadith author does.
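
The occurrence frequency used in this experiment can be computed as in the sketch below. Whitespace tokenization and exact string matching are assumptions (the handling of attached particles or diacritics is not described in the text), and the short token list is a made-up example.

```python
def word_frequency_percent(tokens, target):
    """Occurrences of target as a percentage of all tokens in a segment."""
    return 100.0 * tokens.count(target) / len(tokens)


# Made-up token list, for illustration only.
toy_tokens = ["الذين", "w1", "w2", "w3"]
print(word_frequency_percent(toy_tokens, "الذين"))  # 25.0

# Ratio of the two reported averages (1.3% for Quran, 0.09% for Hadith).
print(round(1.3 / 0.09, 1))  # 14.4
```

The ratio of roughly 14 between the two reported averages is what the text expresses as "1/14".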

Figure 4: Number of citations of the word "الذين".


As in the previous experiment, in order to evaluate the statistical significance of these results, Fisher's exact test (Lowry 2012) has been applied to assess the consistency of the discrimination. We get a two-tailed P value that is less than 0.0001. This result means that the association between the style and the number of occurrences of "الذين" is statistically significant.
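
A two-tailed Fisher exact test on a 2x2 table can be sketched with the standard library alone. The contingency table used here is an assumption, since the text does not state how the test was set up: because the reported frequency ranges do not overlap (0.63% to 2.02% for Quran segments, 0% to 0.29% for Hadith segments), any threshold between 0.29% and 0.63% places all 14 Quran segments in a "high" group and all 11 Hadith segments in a "low" group.

```python
from math import comb


def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of a table whose top-left cell is x.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the tables that are no more probable than the observed one.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= observed * (1 + 1e-9))


# Assumed table: rows = Quran / Hadith segments, columns = high / low.
p_value = fisher_exact_two_tailed(14, 0, 0, 11)
print(p_value < 0.0001)  # True
```

Under this assumed table, the p-value is far below 0.0001, which is consistent with the significance level reported above.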

Discussion

As a continuation of a previous research work on the same topic (Sayoud 2012a), the present investigation performs a segmental analysis for the task of authorship discrimination (Tambouratzis 2000, 2003) between two old Arabic religious books: the Quran and Bukhari Hadith.

That is, four series of experiments have been made:

  • The first series of experiments consists of an authorship attribution task, which analyses the different text segments using several state-of-the-art features and classifiers.
  • The second series of experiments concerns the new COST parameter, which is non-null only in the Quran. This parameter estimates the degree of similarity between the endings of sentences.
  • The third series of experiments investigates words that are very commonly employed in only one of the books. In this work, the word الذين (in English: THOSE or WHO, in the plural form) has been chosen and analysed in the different text segments.
  • The fourth series of experiments performs a hierarchical clustering on the 25 text segments, in order to see how many clusters really exist and whether the hypothesis of a unique author is possible.

Given all the experimental results, and since the two books share the same theme (both are religious texts), it is reasonable to draw the following conclusions:

  • The two segmented books should have different authors (or at least two different author styles);
  • All the segments that have been extracted from a single book (from the Quran only, or from the Hadith only) should probably belong to the same author.

Given these two results, we should be able to extend our conclusions to the entire books from which the segments were extracted, since the styles of these text segments are statistically representative of the style of their original books. Consequently, it appears that the two investigated books should have different authors. Without entering into theological debates, the present investigation offers a new scientific way to analyze and check the authorship authenticity of old or disputed documents.

Acknowledgements and apologies

The author warmly thanks all those who helped him during this research work and all those who contributed their advice and generosity. The author is very grateful for the support he has received from them. He also welcomes all suggestions and comments from readers. He particularly wishes to thank the journal editors and the reviewers for their pertinent comments. Finally, he apologizes for any unintentional mistake that may appear in the present paper.


Works Cited

Al-Shreef, Abd Al-Raheem. 2009. "Is the Holy Quran Muhammad's invention?" Encyclopedia of miracles in Quran and Sunnah. Accessed February 11, 2012. http://www.quran-m.com/firas/en1/index.php?option=com_content&view=article&id=294:is-the-holy-quran-muhammads-invention-&catid=51:prophetical&Itemid=105.

Clement, Ross, and David Sharp. 2003. "Ngram and Bayesian classification of documents for topic and authorship." Literary and Linguistic Computing 18.4: 423–447.

Corney, Malcolm. 2003. "Analysing e-mail text authorship for forensic purposes." Master's thesis, Queensland University of Technology.

Deng, Hongyao, and Xiuli Song. 2013. "The theory and practice of linear regression." World Transactions on Engineering and Technology Education 11.4: 382-387.

Eder, Maciej. 2010. "Does size matter? Authorship attribution, short samples, big problem." Paper presented at Digital humanities 2010 conference, London, July 7-10, 132-135. http://dh2010.cch.kcl.ac.uk/academic-programme/abstracts/papers/html/ab-744.html.

Holmes, David. 1998. "The evolution of stylometry in humanities scholarship." Literary and Linguistic Computing 13.3: 111-117.

Huang, Xiaohong, and Wei Pan. 2003. "Linear regression and two-class classification with gene expression data." Bioinformatics 19.6: 2072-2078.

Ibrahim, I. A. 1996. A brief illustrated guide to understanding Islam. Texas: Darussalam. Accessed February 11, 2012. http://www.islam-guide.com/contents-wide.htm.

Islahi, Amin. 1989. Fundamentals of Hadith interpretation. Translated by T. M. Hashmi. Lahore: Al-Mawrid. Accessed February 15, 2012. www.monthly-renaissance.com/Download-Container.aspx?id=71.

Izutsu, Toshihiko, and Charles J. Adams. 2002. Ethico-religious concepts in the Qur'an. Montreal: McGill-Queen's University Press.

Juola, Patrick. 2006. "Authorship attribution." Foundations and Trends in Information Retrieval Journal 1.3: 233-334.

Juola, Patrick. 2009. "JGAAP: A system for comparative evaluation of authorship attribution." Paper presented at Chicago colloquium on digital humanities and computer science 1.1, Chicago, November 14-16. https://lucian.uchicago.edu/blogs/dhcs/past-colloquia/.

Keerthi, Sathiya, Shirish Shevade, Chiranjib Bhattacharyya, and Krk Murthy. 2001. "Improvements to Platt's SMO algorithm for SVM classifier design." Neural Computation 13: 637–649.

Kenny, Anthony. 1986. A stylometric study of the New Testament, 1st ed. Oxford: Clarendon Press.

Li, Jiexun, Rong Zheng, and Hsinchun Chen. 2006. "From fingerprint to writeprint." Communications of the ACM 49.4: 76-82.

Lowry, Richard. 2012. "T-tests and procedures." VassarStats: Website for statistical computation. Accessed February 20, 2012. http://faculty.vassar.edu/lowry/VassarStats.html.

Madigan, David, Alexander Genkin, David D. Lewis, Shlomo Argamon, Dimitriy Fradkin, and Li Ye. 2005. "Author identification on the large scale." Paper presented at Joint annual meeting of the interface and the classification society of North America (CSNA), Missouri, June 8-12. doi: 10.1.1.60.53.24.

Mills, Donna Eudora. 2003. "Authorship attribution applied to the Bible." Master's thesis, Texas Tech University.

Nasr, Seyyed Hossein. 2013. Encyclopedia britannica: Quran. Accessed May 12, 2013. http://www.britannica.com/eb/article-68890/Quran.

Robinson, Neal. 2004. Discovering the Qur'an: A contemporary approach to a veiled text. 2nd ed. Washington: Georgetown University Press. Accessed February 12, 2004. http://press.georgetown.edu/book/georgetown/discovering-quran#sthash.HcrTM8n4.dpuf.

Sanderson, Conrad, and Simon Guenter. 2006. "Short text authorship attribution via sequence kernels, Markov chains and author unmasking: An investigation." Paper presented at International conference on empirical methods in natural language processing (EMNLP), Sydney, Australia, July 22-23, 482–491. http://nlp.stanford.edu/emnlp06/.

Sayoud, Halim. 2003. "Automatic speaker recognition – Connexionnist approach." PhD diss., University of Sciences and Technology Houari Boumediene.

Sayoud, Halim. 2010. "Investigation of author discrimination between two Holy Islamic books." Teknologia Journal 1.1: X-XII.

Sayoud, Halim. 2012a. "Author discrimination between the Holy Quran and prophet's statements." Literary and Linguistic Computing Journal 27.4: 427-444.

Sayoud, Halim. 2012b. "Authorship classification of two old Arabic religious books based on a hierarchical clustering." Paper presented at LREC 2012 pre-conference workshop on LRE-Rel: Language resource and evaluation for religious texts, Istanbul, May 22. http://www.lrec-conf.org/proceedings/lrec2012/workshops/09.Lre-Rel%20Proceedings.pdf.

Signoriello, Domenic, Samant Jain, Matthew Berryman, and Derek Abbott. 2005. "Advanced text authorship detection methods and their application to biblical texts." Paper presented at Conference of complex systems (SPIE), Brisbane, Australia, December 11, 163–175. doi: 10.1117/12.639281.

Stamatatos, Efstathios. 2007. "Author identification using imbalanced and limited training texts." Paper presented at 4th International workshop on text-based information retrieval, Regensburg, Germany, September 3, 237-241. doi: 10.1109/DEXA.2007.5.

Tambouratzis, George, Stella Markantonatou, Nikolaos Hairetakis, Marina Vassiliou, Dimitrios Tambouratzis, and George Carayannis. 2000. "Discriminating the registers and styles in the modern Greek language." Paper presented at Workshop on comparing corpora at the 38th ACL meeting, Hong Kong, China, October 7, 35-42. http://www.itri.brighton.ac.uk/events/compcorp/programme.html.

Tambouratzis, George, Stella Markantonatou, Marina Vassiliou, and Dimitrios Tambouratzis. 2003. "Employing statistical methods for obtaining discriminant style markers within a specific register." Paper presented at Workshop on text processing for modern greek: From symbolic to statistical approaches, Rethymno, Greece, September 20, 1-10. http://www.philology.uoc.gr/conferences/6thICGL/ebook/ws/workshop@tambouratzis.pdf.

Wikipedia contributors. 2011a. "Hadith." The free encyclopedia. Wikipedia. Accessed July 22. http://en.wikipedia.org/wiki/Hadith.

Wikipedia contributors. 2011b. "Quran." The free encyclopedia. Wikipedia. Accessed July 22. http://en.wikipedia.org/wiki/Quran.

Wikipedia contributors. 2013a. "Cosine similarity." The free encyclopedia. Wikipedia. Accessed May 12. http://en.wikipedia.org/wiki/Cosine_similarity.

Wikipedia contributors. 2013b. "Linear regression." The free encyclopedia. Wikipedia. Accessed May 12. http://en.wikipedia.org/wiki/Linear_regression.

Witten, Ian, Eibe Frank, Len Trigg, Mark Hall, Geoffrey Holmes, and Sally Jo Cunningham. 1999. "Weka: Practical machine learning tools and techniques with Java implementations." In ICONIP/ANZIIS/ANNES'99 Workshop on emerging knowledge engineering and connectionist-based information systems, edited by Nikola Kasabov, and Kitty Ko, 192-196. New Zealand: Dunedin. doi: 10.1.1.44.4026.

This work is licensed under a Creative Commons Attribution 3.0 License.